Reading and writing data
Bio.jl has a unified interface for reading and writing files in a variety of formats. Reader and writer type names have a prefix of the file format. For example, files of a format X
can be read using XReader
and can be written using XWriter
. To initialize a reader/writer of X
, you can use one of the following syntaxes:
# reader open(::Type{XReader}, filepath::AbstractString, args...) XReader(stream::IO, args...) # writer open(::Type{XWriter}, filepath::AbstractString, args...) XWriter(stream::IO, args...)
For example, when reading a FASTA file, a reader for the FASTA file format can be initialized as:
using Bio.Seq # import FASTA reader = open(FASTA.Reader, "hg38.fa") # do something close(reader)
Reading by iteration
Readers in Bio.jl all read and return entries one at a time. The most convenient way to do this by iteration:
reader = open(BED.Reader, "input.bed") for record in reader # perform some operation on entry end close(reader)
In-place reading
Iterating through entries in a file is convenient, but for each entry in the file, the reader must allocate, and ultimately the garbage collector must spend time to deallocate it. For performance critical applications, a separate lower level parsing interface can be used that avoid unnecessary allocation by overwriting one entry. For files with a large number of small entries, this can greatly speed up reading.
Instead of looping over a reader stream read!
is called with a preallocated entry. Some care is necessary when using this interface because record
is completely overwritten on each iteration:
reader = open(BED.Reader, "input.bed") record = BED.Record() while !eof(reader) read!(reader, record) # perform some operation on `record` end close(reader)
Empty record types that correspond to the file format be found using eltype
, making it easy to allocate an empty record for any reader stream:
record = eltype(stream)()
Writing data
A FASTA file will be created as follows:
writer = open(FASTA.Writer, "out.fa") write(writer, FASTA.Record("seq1", dna"ACGTN")) write(writer, FASTA.Record("seq2", "AT rich", dna"TTATA")) close(writer)
Another way is using Julia's do-block syntax, which closes the data file after finished writing:
open(FASTA.Writer, "out.fa") do writer write(writer, FASTA.Record("seq1", dna"ACGTN")) write(writer, FASTA.Record("seq2", "AT rich", dna"TTATA")) end
Supported file formats
The following table summarizes supported file formats.
File format | Prefix | Module | Specification |
---|---|---|---|
FASTA | FASTA |
Bio.Seq |
https://en.wikipedia.org/wiki/FASTA_format |
FASTQ | FASTQ |
Bio.Seq |
https://en.wikipedia.org/wiki/FASTQ_format |
.2bit | TwoBit |
Bio.Seq |
http://genome.ucsc.edu/FAQ/FAQformat.html#format7 |
BED | BED |
Bio.Intervals |
https://genome.ucsc.edu/FAQ/FAQformat.html#format1 |
GFF3 | GFF3 |
Bio.Intervals |
https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md |
bigWig | BigWig |
Bio.Intervals |
https://doi.org/10.1093/bioinformatics/btq351 |
bigBed | BigBed |
Bio.Intervals |
https://doi.org/10.1093/bioinformatics/btq351 |
PDB | PDB |
Bio.Structure |
http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html |
SAM | SAM |
Bio.Align |
https://samtools.github.io/hts-specs/SAMv1.pdf |
BAM | BAM |
Bio.Align |
https://samtools.github.io/hts-specs/SAMv1.pdf |
VCF | VCF |
Bio.Var |
https://samtools.github.io/hts-specs/VCFv4.3.pdf |
BCF | BCF |
Bio.Var |
https://samtools.github.io/hts-specs/VCFv4.3.pdf |
FASTA
- Reader type:
FASTA.Reader
- Writer type:
FASTA.Writer
- Element type:
FASTA.Record
FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of records with identifier, description, and sequence. The template of a sequence record is:
>{identifier} {description}? {sequence}
Here is an example of a chromosomal sequence:
>chrI chromosome 1 CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
Usually sequence records will be read sequentially from a file by iteration. But if the FASTA file has an auxiliary index file formatted in fai, the reader supports random access to FASTA records, which would be useful when accessing specific parts of a huge genome sequence:
reader = open(FASTAReader, "sacCer.fa", index="sacCer.fa.fai") chrIV = reader["chrIV"] # directly read chromosome 4
#
Bio.Seq.FASTA.Reader
— Type.
FASTA.Reader(input::IO; index=nothing)
Create a data reader of the FASTA file format.
Arguments
input
: data sourceindex=nothing
: filepath to a random access index (currently fai is supported)
#
Bio.Seq.FASTA.Writer
— Type.
FASTA.Writer(output::IO; width=70)
Create a data writer of the FASTA file format.
Arguments
output
: data sinkwidth=70
: wrapping width of sequence characters
#
Bio.Seq.FASTA.Record
— Type.
FASTA.Record()
Create an unfilled FASTA record.
FASTA.Record(data::Vector{UInt8})
Create a FASTA record object from data
.
This function verifies and indexes fields for accessors. Note that the ownership of data
is transferred to a new record object.
FASTA.Record(str::AbstractString)
Create a FASTA record object from str
.
This function verifies and indexes fields for accessors.
FASTA.Record(identifier, sequence)
Create a FASTA record object from identifier
and sequence
.
FASTA.Record(identifier, description, sequence)
Create a FASTA record object from identifier
, description
and sequence
.
#
Bio.Seq.FASTA.identifier
— Function.
identifier(record::Record)::String
Get the sequence identifier of record
.
#
Bio.Seq.FASTA.description
— Function.
description(record::Record)::String
Get the description of record
.
#
Bio.Seq.FASTA.sequence
— Function.
sequence(::Type{S}, record::Record, [part::UnitRange{Int}])::S
Get the sequence of record
.
S
can be either a subtype of Bio.Seq.Sequence
or String
. If part
argument is given, it returns the specified part of the sequence.
sequence(record::Record, [part::UnitRange{Int}])
Get the sequence of record
.
This function infers the sequence type from the data. When it is wrong or unreliable, use sequence(::Type{S}, record::Record)
. If part
argument is given, it returns the specified part of the sequence.
FASTQ
- Reader type:
FASTQ.Reader
- Writer type:
FASTQ.Writer
- Element type:
FASTQ.Record
FASTQ is a text-based file format for representing DNA sequences along with qualities for each base. A FASTQ file stores a list of sequence records in the following format:
@{identifier} {description}? {sequence} + {quality}
Here is an example of one record from a FASTQ file:
@FSRRS4401BE7HA tcagTTAAGATGGGAT + ###EEEEEEEEE##E#
To read a file containing such records, one could use:
# The default base quality encoding is Sanger. reader = open(FASTQ.Reader, "reads.fastq") for record in reader # do something end close(reader) # If performance is important, in-place reading will be much faster. reader = open(FASTQ.Reader, "reads.fastq") record = FASTQ.Record() while !eof(reader) read!(reader, record) # do something end close(reader)
#
Bio.Seq.FASTQ.Reader
— Type.
FASTQ.Reader(input::IO; fill_ambiguous=nothing)
Create a data reader of the FASTQ file format.
Arguments
input
: data sourcefill_ambiguous=nothing
: fill ambiguous symbols with the given symbol
#
Bio.Seq.FASTQ.Writer
— Type.
FASTQ.Writer(output::IO; quality_header=false)
Create a data writer of the FASTQ file format.
Arguments
output
: data sinkquality_header=false
: output the title line at the third line just after '+'
#
Bio.Seq.FASTQ.Record
— Type.
FASTQ.Record()
Create an unfilled FASTQ record.
FASTQ.Record(data::Vector{UInt8})
Create a FASTQ record object from data
.
This function verifies and indexes fields for accessors. Note that the ownership of data
is transferred to a new record object.
FASTQ.Record(str::AbstractString)
Create a FASTQ record object from str
.
This function verifies and indexes fields for accessors.
FASTQ.Record(identifier, sequence, quality; offset=33)
Create a FASTQ record from identifier
, sequence
and quality
.
FASTQ.Record(identifier, description, sequence, quality; offset=33)
Create a FASTQ record from identifier
, description
, sequence
and quality
.
#
Bio.Seq.FASTQ.identifier
— Function.
identifier(record::Record)::String
Get the sequence identifier of record
.
#
Bio.Seq.FASTQ.description
— Function.
description(record::Record)::String
Get the description of record
.
#
Bio.Seq.FASTQ.sequence
— Function.
sequence(::Type{S}, record::Record, [part::UnitRange{Int}])::S
Get the sequence of record
.
S
can be either a subtype of Bio.Seq.Sequence
or String
. If part
argument is given, it returns the specified part of the sequence.
sequence(record::Record, [part::UnitRange{Int}])::Bio.Seq.DNASequence
Get the sequence of record
.
#
Bio.Seq.FASTQ.quality
— Function.
quality(record::Record, [offset::Integer=33, [part::UnitRange]])::Vector{UInt8}
Get the base quality of record
.
quality(record::Record, encoding_name::Symbol, [part::UnitRange])::Vector{UInt8}
Get the base quality of record
by decoding with encoding_name
.
The encoding_name
can be either :sanger
, :solexa
, :illumina13
, :illumina15
, or :illumina18
.
.2bit
- Reader type:
TwoBit.Reader{T<:IO}
- Writer type:
TwoBit.Writer{T<:IO}
- Element type:
TwoBit.Record
.2bit is a binary file format designed for storing a genome consists of multiple chromosomal sequences. The reading speed is often an order of magnitude faster than that of FASTA and the file size is smaller. However, since the .2bit file format is specialized for genomic sequences, it cannot store either RNA or amino acid sequences.
Like FASTA, the .2bit reader supports random access using an index included in the header section of a .2bit file:
reader = open(TwoBit.Reader, "sacCer.2bit") # load a random access index in the header chrIV = reader["chrIV"] # directly read chromosome 4
#
Bio.Seq.TwoBit.Reader
— Type.
TwoBit.Reader(input::IO)
Create a data reader of the 2bit file format.
Arguments
input
: data source
#
Bio.Seq.TwoBit.Writer
— Type.
TwoBitWriter(output::IO, names::AbstractVector)
Create a data writer of the 2bit file format.
Arguments
output
: data sinknames
: a vector of sequence names written tooutput
#
Bio.Seq.TwoBit.Record
— Type.
TwoBit.Record()
Create an unfilled 2bit record.
#
Bio.Seq.TwoBit.seqnames
— Function.
seqnames(reader::Reader)::Vector{String}
Get the sequence names.
Sequences are stored in this order.
#
Bio.Seq.TwoBit.sequence
— Function.
sequence([::Type{S},] record::Record)::S
Get the sequence of record
as S
.
S
can be either Bio.Seq.ReferenceSequence
or Bio.Seq.DNASequence
. If S
is omitted, the default type is Bio.Seq.ReferenceSequence
.
#
Bio.Seq.TwoBit.maskedblocks
— Function.
maskedblocks(record::Record)::Vector{UnitRange{Int}}
Get the masked blocks.
ABIF
- Reader type:
AbifReader{T<:IO}
ABIF is a binary file format for storing data produced by sequencers, those developed by Applied Biosystems, Inc. When the file is opened, we save all the existing tags, so we can read only the tags that are needed.
reader = open(AbifReader, "3100.ab1") # load a random access data = reader["DATA"] # directly read all existing `DATA` Tags data = reader[1] # directly read Tag at index data = tags(reader) # return all existing tags # iterator by all tags for (key, value) in getindex(reader, data) end
Bio.Seq.AbifReader Bio.Seq.tags Bio.Seq.elements
BED
- Reader type:
BED.Reader
- Writer type:
BED.Writer
- Element type:
BED.Record
BED is a text-based file format for representing genomic annotations like genes, transcripts, and so on. A BED file has tab-delimited and variable-length fields; the first three fields denoting a genomic interval are mandatory.
This is an example of RNA transcripts:
chr9 68331023 68424451 NM_015110 0 + chr9 68456943 68486659 NM_001206 0 -
#
Bio.Intervals.BED.Reader
— Type.
BED.Reader(input::IO)
Create a data reader of the BED file format.
Arguments:
input
: data source
#
Bio.Intervals.BED.Writer
— Type.
BED.Writer(output::IO)
Create a data writer of the BED file format.
Arguments:
output
: data sink
#
Bio.Intervals.BED.Record
— Type.
BED.Record()
Create an unfilled BED record.
BED.Record(data::Vector{UInt8})
Create a BED record object from data
.
This function verifies and indexes fields for accessors. Note that the ownership of data
is transferred to a new record object.
BED.Record(str::AbstractString)
Create a BED record object from str
.
This function verifies and indexes fields for accessors.
#
Bio.Intervals.BED.chrom
— Function.
chrom(record::Record)::String
Get the chromosome name of record
.
#
Bio.Intervals.BED.chromstart
— Function.
chromstart(record::Record)::Int
Get the starting position of record
.
Note that the first base is numbered 1.
#
Bio.Intervals.BED.chromend
— Function.
chromend(record::Record)::Int
Get the end position of record
.
#
Bio.Intervals.BED.name
— Function.
name(record::Record)::String
Get the name of record
.
#
Bio.Intervals.BED.score
— Function.
score(record::Record)::Int
Get the score between 0 and 1000.
#
Bio.Intervals.BED.strand
— Function.
strand(record::Record)::Bio.Intervals.Strand
Get the strand of record
.
#
Bio.Intervals.BED.thickstart
— Function.
thickstart(record::Record)::Int
Get the starting position at which record
is drawn thickly.
Note that the first base is numbered 1.
#
Bio.Intervals.BED.thickend
— Function.
thickend(record::Record)::Int
Get the end position at which record
is drawn thickly.
#
Bio.Intervals.BED.itemrgb
— Function.
itemrgb(record::Record)::ColorTypes.RGB
Get the RGB value of record
.
The return type is defined in ColorTypes.jl.
#
Bio.Intervals.BED.blockcount
— Function.
blockcount(record::Record)::Int
Get the number of blocks (exons) in record
.
#
Bio.Intervals.BED.blocksizes
— Function.
blocksizes(record::Record)::Vector{Int}
Get the block (exon) sizes of record
.
#
Bio.Intervals.BED.blockstarts
— Function.
blockstarts(record::Record)::Vector{Int}
Get the block (exon) starts of record
.
Note that the first base is numbered 1.
GFF3
- Reader type:
Bio.Intervals.GFF3.Reader
- Writer type:
Bio.Intervals.GFF3.Writer
- Element type:
Bio.Intervals.GFF3.Record
GFF3 is a text-based file format for representing genomic annotations. The major difference from BED is that is GFF3 is more structured and can include sequences in the FASTA file format.
#
Bio.Intervals.GFF3.Reader
— Type.
GFF3Reader(input::IO; save_directives::Bool=false)
Create a reader for data in GFF3 format.
Arguments:
input
: data sourcesave_directives=false
: if true, store directive lines, which can be accessed with thedirectives
function
#
Bio.Intervals.GFF3.Writer
— Type.
GFF3.Writer(output::IO)
Create a data writer of the GFF3 file format.
Arguments:
output
: data sink
#
Bio.Intervals.GFF3.Record
— Type.
GFF3.Record()
Create an unfilled GFF3 record.
GFF3.Record(data::Vector{UInt8})
Create a GFF3 record object from data
. This function verifies and indexes fields for accessors. Note that the ownership of data
is transferred to a new record object.
GFF3.Record(str::AbstractString)
Create a GFF3 record object from str
. This function verifies and indexes fields for accessors.
#
Bio.Intervals.GFF3.isfeature
— Function.
isfeature(record::Record)::Bool
Test if record
is a feature record.
#
Bio.Intervals.GFF3.isdirective
— Function.
isdirective(record::Record)::Bool
Test if record
is a directive record.
#
Bio.Intervals.GFF3.iscomment
— Function.
iscomment(record::Record)::Bool
Test if record
is a comment record.
#
Bio.Intervals.GFF3.seqid
— Function.
seqid(record::Record)::String
Get the sequence ID of record
.
#
Bio.Intervals.GFF3.source
— Function.
source(record::Record)::String
Get the source of record
.
#
Bio.Intervals.GFF3.featuretype
— Function.
featuretype(record::Record)::String
Get the type of record
.
#
Bio.Intervals.GFF3.seqstart
— Function.
seqstart(record::Record)::Int
Get the start coordinate of record
.
#
Bio.Intervals.GFF3.seqend
— Function.
seqend(record::Record)::Int
Get the end coordinate of record
.
#
Bio.Intervals.GFF3.score
— Function.
score(record::Record)::Float64
Get the score of record
#
Bio.Intervals.GFF3.strand
— Function.
strand(record::Record)::Bio.Intervals.Strand
Get the strand of record
.
#
Bio.Intervals.GFF3.phase
— Function.
phase(record::Record)::Int
Get the phase of record
.
#
Bio.Intervals.GFF3.attributes
— Function.
attributes(record::Record)::Vector{Pair{String,Vector{String}}}
Get the attributes of record
.
attributes(record::Record, key::String)::Vector{String}
Get the attributes of record
with key
.
#
Bio.Intervals.GFF3.content
— Function.
content(record::Record)::String
Get the content of record
. Leading '#' letters are removed.
bigWig
- Reader type:
BigWig.Reader
- Writer type:
BigWig.Writer
- Element type:
BigWig.Record
bigWig is a binary file format for associating a floating point number with each base in the genome. bigWig files are indexed to quickly fetch specific regions.
#
Bio.Intervals.BigWig.Reader
— Type.
BigWig.Reader(input::IO)
Create a reader for bigWig file format.
Note that input
must be seekable.
#
Bio.Intervals.BigWig.chromlist
— Function.
chromlist(reader::BigWig.Reader)::Vector{Tuple{String,Int}}
Get the (name, length)
pairs of chromosomes/contigs.
#
Bio.Intervals.BigWig.values
— Function.
values(reader::BigWig.Reader, interval::Interval)::Vector{Float32}
Get a vector of values within interval
from reader
.
This function fills missing values with NaN32
.
values(reader::BigWig.Reader, chrom::AbstractString, range::UnitRange)::Vector{Float32}
Get a vector of values within range
of chrom
from reader
.
This function fills missing values with NaN32
.
#
Bio.Intervals.BigWig.Writer
— Type.
BigWig.Writer(output::IO, chromlist; binsize=64, datatype=:bedgraph)
Create a data writer of the bigWig file format.
Arguments
output
: data sinkchromlist
: chromosome list with lengthbinsize=64
: size of a zoom with the highest resolutiondatatype=:bedgraph
: encoding of values (:bedgraph
,:varstep
or:fixedstep
)
Examples
output = open("data.bw", "w") writer = BigWig.Writer(output, [("chr1", 12345), ("chr2", 9100)]) write(writer, ("chr1", 501, 600, 1.0)) write(writer, ("chr2", 301, 450, 3.0)) close(writer)
#
Bio.Intervals.BigWig.Record
— Type.
BigWig.Record()
Create an unfilled bigWig record.
#
Bio.Intervals.BigWig.chrom
— Function.
chrom(record::Record)::String
Get the chromosome name of record
.
#
Bio.Intervals.BigWig.chromid
— Function.
chromid(record::Record)::UInt32
Get the chromosome ID of record
.
#
Bio.Intervals.BigWig.chromstart
— Function.
chromstart(record::Record)::Int
Get the start position of record
.
#
Bio.Intervals.BigWig.chromend
— Function.
chromend(record::Record)::Int
Get the end position of record
.
#
Bio.Intervals.BigWig.value
— Function.
value(record::Record)::Float32
Get the value of record
.
bigBed
- Reader type:
BigBed.Reader
- Writre type:
BigBed.Writer
- Element type:
BigBed.Record
bigBed is a binary file format for representing genomic annotations and often created from BED files. bigBed files are indexed to quickly fetch specific regions.
#
Bio.Intervals.BigBed.Reader
— Type.
BigBed.Reader(input::IO)
Create a reader for bigBed file format.
Note that input
must be seekable.
#
Bio.Intervals.BigBed.chromlist
— Function.
chromlist(reader::BigBed.Reader)::Vector{Tuple{String,Int}}
Get the (name, length)
pairs of chromosomes/contigs.
#
Bio.Intervals.BigBed.Writer
— Type.
BigBed.Writer(output::IO, chromlist; binsize=64)
Create a data writer of the bigBed file format.
Arguments
output
: data sinkchromlist
: chromosome list with lengthbinsize=64
: size of a zoom with the highest resolution
Examples
output = open("data.bb", "w") writer = BigBed.Writer(output, [("chr1", 12345), ("chr2", 9100)]) write(writer, ("chr1", 101, 150, "gene 1")) write(writer, ("chr2", 211, 250, "gene 2")) close(writer)
#
Bio.Intervals.BigBed.Record
— Type.
BigBed.Record()
Create an unfilled bigBed record.
#
Bio.Intervals.BigBed.chrom
— Function.
chrom(record::Record)::String
Get the chromosome name of record
.
#
Bio.Intervals.BigBed.chromid
— Function.
chromid(record::Record)::UInt32
Get the chromosome ID of record
.
#
Bio.Intervals.BigBed.chromstart
— Function.
chromstart(record::Record)::Int
Get the start position of record
.
#
Bio.Intervals.BigBed.chromend
— Function.
chromend(record::Record)::Int
Get the end position of record
.
#
Bio.Intervals.BigBed.name
— Function.
name(record::Record)::String
Get the name of record
.
#
Bio.Intervals.BigBed.score
— Function.
score(record::Record)::Int
Get the score between 0 and 1000.
#
Bio.Intervals.BigBed.strand
— Function.
strand(record::Record)::Bio.Intervals.Strand
Get the strand of record
.
#
Bio.Intervals.BigBed.thickstart
— Function.
thickstart(record::Record)::Int
Get the starting position at which record
is drawn thickly.
Note that the first base is numbered 1.
#
Bio.Intervals.BigBed.thickend
— Function.
thickend(record::Record)::Int
Get the end position at which record
is drawn thickly.
#
Bio.Intervals.BigBed.itemrgb
— Function.
itemrgb(record::Record)::ColorTypes.RGB
Get the RGB value of record
.
The return type is defined in ColorTypes.jl.
#
Bio.Intervals.BigBed.blockcount
— Function.
blockcount(record::Record)::Int
Get the number of blocks (exons) in record
.
#
Bio.Intervals.BigBed.blocksizes
— Function.
blocksizes(record::Record)::Vector{Int}
Get the block (exon) sizes of record
.
#
Bio.Intervals.BigBed.blockstarts
— Function.
blockstarts(record::Record)::Vector{Int}
Get the block (exon) starts of record
.
Note that the first base is numbered 1.
#
Bio.Intervals.BigBed.optionals
— Function.
optionals(record::Record)::Vector{String}
Get optional fields as strings.
PDB
PDB is a text-based file format for representing 3D macromolecular structures. This has different reader interfaces from other file formats. Please consult the Bio.Structure chapter for details.
SAM
- Reader type:
SAM.Reader
- Writer type:
SAM.Writer
- Element type:
SAM.Record
SAM is a text-based file format for representing sequence alignments.
#
Bio.Align.SAM.Reader
— Type.
SAM.Reader(input::IO)
Create a data reader of the SAM file format.
Arguments
input
: data source
#
Bio.Align.SAM.header
— Function.
header(reader::Reader)::Header
Get the header of reader
.
#
Bio.Align.SAM.Header
— Type.
SAM.Header()
Create an empty header.
#
Base.find
— Method.
find(header::Header, key::AbstractString)::Vector{MetaInfo}
Find metainfo objects satisfying SAM.tag(metainfo) == key
.
#
Bio.Align.SAM.Writer
— Type.
Writer(output::IO, header::Header=Header())
Create a data writer of the SAM file format.
Arguments
output
: data sinkheader=Header()
: SAM header object
#
Bio.Align.SAM.MetaInfo
— Type.
MetaInfo(str::AbstractString)
Create a SAM metainfo from str
.
Examples
julia> SAM.MetaInfo("@CO some comment") Bio.Align.SAM.MetaInfo: tag: CO value: some comment julia> SAM.MetaInfo("@SQ SN:chr1 LN:12345") Bio.Align.SAM.MetaInfo: tag: SQ value: SN=chr1 LN=12345
MetaInfo(tag::AbstractString, value)
Create a SAM metainfo with tag
and value
.
tag
is a two-byte ASCII string. If tag
is "CO"
, value
must be a string; otherwise, value
is an iterable object with key and value pairs.
Examples
julia> SAM.MetaInfo("CO", "some comment") Bio.Align.SAM.MetaInfo: tag: CO value: some comment julia> string(ans) "@CO some comment" julia> SAM.MetaInfo("SQ", ["SN" => "chr1", "LN" => 12345]) Bio.Align.SAM.MetaInfo: tag: SQ value: SN=chr1 LN=12345 julia> string(ans) "@SQ SN:chr1 LN:12345"
#
Bio.Align.SAM.iscomment
— Function.
iscomment(metainfo::MetaInfo)::Bool
Test if metainfo
is a comment (i.e. its tag is "CO").
#
Bio.Align.SAM.tag
— Function.
tag(metainfo::MetaInfo)::String
Get the tag of metainfo
.
#
Bio.Align.SAM.value
— Function.
value(metainfo::MetaInfo)::String
Get the value of metainfo
as a string.
#
Bio.Align.SAM.keyvalues
— Function.
keyvalues(metainfo::MetaInfo)::Vector{Pair{String,String}}
Get the values of metainfo
as string pairs.
#
Bio.Align.SAM.Record
— Type.
SAM.Record()
Create an unfilled SAM record.
SAM.Record(data::Vector{UInt8})
Create a SAM record from data
. This function verifies the format and indexes fields for accessors. Note that the ownership of data
is transferred to a new record object.
SAM.Record(str::AbstractString)
Create a SAM record from str
. This function verifies the format and indexes fields for accessors.
#
Bio.Align.SAM.flag
— Function.
flag(record::Record)::UInt16
Get the bitwise flag of record
.
#
Bio.Align.SAM.ismapped
— Function.
ismapped(record::Record)::Bool
Test if record
is mapped.
#
Bio.Align.SAM.isprimary
— Function.
isprimary(record::Record)::Bool
Test if record
is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0
.
#
Bio.Align.SAM.refname
— Function.
refname(record::Record)::String
Get the reference sequence name of record
.
#
Bio.Align.SAM.position
— Function.
position(record::Record)::Int
Get the 1-based leftmost mapping position of record
.
#
Bio.Align.SAM.rightposition
— Function.
rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record
.
#
Bio.Align.SAM.isnextmapped
— Function.
isnextmapped(record::Record)::Bool
Test if the mate/next read of record
is mapped.
#
Bio.Align.SAM.nextrefname
— Function.
nextrefname(record::Record)::String
Get the reference name of the mate/next read of record
.
#
Bio.Align.SAM.nextposition
— Function.
nextposition(record::Record)::Int
Get the position of the mate/next read of record
.
#
Bio.Align.SAM.mappingquality
— Function.
mappingquality(record::Record)::UInt8
Get the mapping quality of record
.
#
Bio.Align.SAM.cigar
— Function.
cigar(record::Record)::String
Get the CIGAR string of record
.
#
Bio.Align.SAM.alignment
— Function.
alignment(record::Record)::Bio.Align.Alignment
Get the alignment of record
.
#
Bio.Align.SAM.alignlength
— Function.
alignlength(record::Record)::Int
Get the alignment length of record
.
#
Bio.Align.SAM.tempname
— Function.
tempname(record::Record)::String
Get the query template name of record
.
#
Bio.Align.SAM.templength
— Function.
templength(record::Record)::Int
Get the template length of record
.
#
Bio.Align.SAM.sequence
— Function.
sequence(record::Record)::Bio.Seq.DNASequence
Get the segment sequence of record
.
sequence(::Type{String}, record::Record)::String
Get the segment sequence of record
as String
.
#
Bio.Align.SAM.seqlength
— Function.
seqlength(record::Record)::Int
Get the sequence length of record
.
#
Bio.Align.SAM.quality
— Function.
quality(record::Record)::Vector{UInt8}
Get the Phred-scaled base quality of record
.
quality(::Type{String}, record::Record)::String
Get the ASCII-encoded base quality of record
.
#
Bio.Align.SAM.auxdata
— Function.
auxdata(record::Record)::Dict{String,Any}
Get the auxiliary data (optional fields) of record
.
This module provides 16-bit flags defined in the SAM specs:
Flag | Bit | Description |
---|---|---|
SAM.FLAG_PAIRED |
0x0001 |
template having multiple segments in sequencing |
SAM.FLAG_PROPER_PAIR |
0x0002 |
each segment properly aligned according to the aligner |
SAM.FLAG_UNMAP |
0x0004 |
segment unmapped |
SAM.FLAG_MUNMAP |
0x0008 |
next segment in the template unmapped |
SAM.FLAG_REVERSE |
0x0010 |
SEQ being reverse complemented |
SAM.FLAG_MREVERSE |
0x0020 |
SEQ of the next segment in the template being reverse complemented |
SAM.FLAG_READ1 |
0x0040 |
the first segment in the template |
SAM.FLAG_READ2 |
0x0080 |
the last segment in the template |
SAM.FLAG_SECONDARY |
0x0100 |
secondary alignment |
SAM.FLAG_QCFAIL |
0x0200 |
not passing filters, such as platform/vendor quality controls |
SAM.FLAG_DUP |
0x0400 |
PCR or optical duplicate |
SAM.FLAG_SUPPLEMENTARY |
0x0800 |
supplementary alignment |
BAM
- Reader type:
BAM.Reader
- Writer type:
BAM.Writer
- Element type:
BAM.Record
BAM is a binary counterpart of the SAM file format.
When writing data in the BAM file format, the underlying output stream needs to be wrapped with a BGZFStream
object provided from BGZFStreams.jl.
Flags and the header type are defined in the SAM module.
#
Bio.Align.BAM.Reader
— Type.
BAM.Reader(input::IO; index=nothing)
Create a data reader of the BAM file format.
Arguments
input
: data sourceindex=nothing
: filepath to a random access index (currently bai is Supported)
#
Bio.Align.BAM.header
— Function.
header(reader::Reader; fillSQ::Bool=false)::SAM.Header
Get the header of reader
.
If fillSQ
is true
, this function fills missing "SQ" metainfo in the header.
#
Bio.Align.BAM.Writer
— Type.
BAM.Writer(output::BGZFStream, header::SAM.Header)
Create a data writer of the BAM file format.
Arguments
output
: data sinkheader
: SAM header object
#
Bio.Align.BAM.Record
— Type.
BAM.Record()
Create an unfilled BAM record.
#
Bio.Align.BAM.flag
— Function.
flag(record::Record)::UInt16
Get the bitwise flag of record
.
#
Bio.Align.BAM.ismapped
— Function.
ismapped(record::Record)::Bool
Test if record
is mapped.
#
Bio.Align.BAM.isprimary
— Function.
isprimary(record::Record)::Bool
Test if record
is a primary line of the read.
This is equivalent to flag(record) & 0x900 == 0
.
#
Bio.Align.BAM.refid
— Function.
refid(record::Record)::Int
Get the reference sequence ID of record
.
The ID is 1-based (i.e. the first sequence is 1) and is 0 for a record without a mapping position.
See also: BAM.rname
#
Bio.Align.BAM.refname
— Function.
refname(record::Record)::String
Get the reference sequence name of record
.
See also: BAM.refid
#
Bio.Align.BAM.position
— Function.
position(record::Record)::Int
Get the 1-based leftmost mapping position of record
.
#
Bio.Align.BAM.rightposition
— Function.
rightposition(record::Record)::Int
Get the 1-based rightmost mapping position of record
.
#
Bio.Align.BAM.isnextmapped
— Function.
isnextmapped(record::Record)::Bool
Test if the mate/next read of record
is mapped.
#
Bio.Align.BAM.nextrefid
— Function.
nextrefid(record::Record)::Int
Get the next/mate reference sequence ID of record
.
#
Bio.Align.BAM.nextrefname
— Function.
nextrefname(record::Record)::String
Get the reference name of the mate/next read of record
.
#
Bio.Align.BAM.nextposition
— Function.
nextposition(record::Record)::Int
Get the 1-based leftmost mapping position of the next/mate read of record
.
#
Bio.Align.BAM.mappingquality
— Function.
mappingquality(record::Record)::UInt8
Get the mapping quality of record
.
#
Bio.Align.BAM.cigar
— Function.
cigar(record::Record)::String
Get the CIGAR string of record
.
See also BAM.cigar_rle
.
#
Bio.Align.BAM.cigar_rle
— Function.
cigar_rle(record::Record)::Tuple{Vector{Bio.Align.Operation},Vector{Int}}
Get a run-length encoded tuple (ops, lens)
of the CIGAR string in record
.
See also BAM.cigar
.
#
Bio.Align.BAM.alignment
— Function.
alignment(record::Record)::Bio.Align.Alignment
Get the alignment of record
.
#
Bio.Align.BAM.alignlength
— Function.
alignlength(record::Record)::Int
Get the alignment length of record
.
#
Bio.Align.BAM.tempname
— Function.
tempname(record::Record)::String
Get the query template name of record
.
#
Bio.Align.BAM.templength
— Function.
templength(record::Record)::Int
Get the template length of record
.
#
Bio.Align.BAM.sequence
— Function.
sequence(record::Record)::Bio.Seq.DNASequence
Get the segment sequence of record
.
#
Bio.Align.BAM.seqlength
— Function.
seqlength(record::Record)::Int
Get the sequence length of record
.
#
Bio.Align.BAM.quality
— Function.
quality(record::Record)::Vector{UInt8}
Get the base quality of record
.
#
Bio.Align.BAM.auxdata
— Function.
auxdata(record::Record)::BAM.AuxData
Get the auxiliary data of record
.
VCF
- Reader type:
VCF.Reader
- Writer type:
VCF.Writer{T<:IO}
- Element type:
VCF.Record
VCF is a text-based file format for representing genetic variations.
#
Bio.Var.VCF.Reader
— Type.
VCF.Reader(input::IO)
Create a data reader of the VCF file format.
Arguments
input
: data source
#
Bio.Var.VCF.Writer
— Type.
VCF.Writer(output::IO, header::VCF.Header)
Create a data writer of the VCF file format.
Arguments
output
: data sinkheader
: VCF header object
#
Bio.Var.VCF.Record
— Type.
VCF.Record()
Create an unfilled VCF record.
VCF.Record(data::Vector{UInt8})
Create a VCF object from data
containing a VCF record. This function verifies the format and indexes fields for accessors. Note that the ownership of data
is transferred to a new record.
VCF.Record(str::AbstractString)
Create a VCF object from str
containing a VCF record. This function verifies the format and indexes fields for accessors.
#
Bio.Var.VCF.chrom
— Function.
chrom(record::Record)::String
Get the chromosome name of record
.
#
Bio.Var.VCF.pos
— Function.
pos(record::Record)::Int
Get the reference position of record
.
#
Bio.Var.VCF.id
— Function.
id(record::Record)::Vector{String}
Get the identifiers of record
.
#
Bio.Var.VCF.ref
— Function.
ref(record::Record)::String
Get the reference bases of record
.
#
Bio.Var.VCF.alt
— Function.
alt(record::Record)::Vector{String}
Get the alternate bases of record
.
#
Bio.Var.VCF.qual
— Function.
qual(record::Record)::Float64
Get the quality score of record
.
#
Bio.Var.VCF.filter
— Function.
filter(record::Record)::Vector{String}
Get the filter status of record
.
#
Bio.Var.VCF.info
— Function.
info(record::Record)::Vector{Pair{String,String}}
Get the additional information of record
.
info(record::Record, key::String)::String
Get the additional information of record
with key
. Keys without corresponding values return an empty string.
#
Bio.Var.VCF.format
— Function.
format(record::Record)::Vector{String}
Get the genotype format of record
.
BCF
- Reader type:
BCF.Reader{T<:IO}
- Writer type:
BCF.Writer{T<:IO}
- Element type:
BCF.Record
BCF is a binary counterpart of the VCF file format.
#
Bio.Var.BCF.Reader
— Type.
BCF.Reader(input::IO)
Create a data reader of the BCF file format.
Arguments
input
: data source
#
Bio.Var.BCF.Writer
— Type.
BCF.Writer(output::IO, header::VCF.Header)
Create a data writer of the BCF file format.
Arguments
output
: data sinkheader
: VCF header object
#
Bio.Var.BCF.Record
— Type.
BCF.Record()
Create an unfilled BCF record.
#
Bio.Var.BCF.chrom
— Function.
chrom(record::Record)::Int
Get the chromosome index of record
.
#
Bio.Var.BCF.pos
— Function.
pos(record::Record)::Int
Get the reference position of record
.
Note that the position of the first base is 1 (i.e. 1-based coordinate).
#
Bio.Var.BCF.rlen
— Function.
rlen(record::Record)::Int
Get the length of record
projected onto the reference sequence.
#
Bio.Var.BCF.qual
— Function.
qual(record::Record)::Float32
Get the quality score of record
.
Note that 0x7F800001
(signaling NaN) is interpreted as a missing value.
#
Bio.Var.BCF.ref
— Function.
ref(record::Record)::String
Get the reference bases of record
.
#
Bio.Var.BCF.alt
— Function.
alt(record::Record)::Vector{String}
Get the alternate bases of record
.
#
Bio.Var.BCF.filter
— Function.
filter(record::Record)::Vector{Int}
Get the filter indexes of record
.
#
Bio.Var.BCF.info
— Function.
info(record::Record, [simplify::Bool=true])::Vector{Tuple{Int,Any}}
Get the additional information of record
.
When simplify
is true
, a vector with a single element is converted to the element itself and an empty vector of the void type is converted to nothing
.
info(record::Record, key::Integer, [simplify::Bool=true])
Get the additional information of record
with key
.
When simplify
is true
, a vector with a single element is converted to the element itself and an empty vector of the void type is converted to nothing
.
#
Bio.Var.BCF.genotype
— Function.
genotype(record::Record)::Vector{Tuple{Int,Vector{Any}}}
Get the genotypes of record
.
BCF genotypes are encoded by field, not by sample like VCF.