Reading and writing data

Bio.jl has a unified interface for reading and writing files in a variety of formats. Reader and writer type names have a prefix of the file format. For example, files of a format X can be read using XReader and can be written using XWriter. To initialize a reader/writer of X, you can use one of the following syntaxes:

# reader
open(::Type{XReader}, filepath::AbstractString, args...)
XReader(stream::IO, args...)

# writer
open(::Type{XWriter}, filepath::AbstractString, args...)
XWriter(stream::IO, args...)

For example, when reading a FASTA file, a reader for the FASTA file format can be initialized as:

using Bio.Seq  # import FASTA
reader = open(FASTA.Reader, "hg38.fa")
# do something
close(reader)

Reading by iteration

Readers in Bio.jl all read and return entries one at a time. The most convenient way to do this by iteration:

reader = open(BED.Reader, "input.bed")
for record in reader
    # perform some operation on entry
end
close(reader)

In-place reading

Iterating through entries in a file is convenient, but for each entry in the file, the reader must allocate, and ultimately the garbage collector must spend time to deallocate it. For performance critical applications, a separate lower level parsing interface can be used that avoid unnecessary allocation by overwriting one entry. For files with a large number of small entries, this can greatly speed up reading.

Instead of looping over a reader stream read! is called with a preallocated entry. Some care is necessary when using this interface because record is completely overwritten on each iteration:

reader = open(BED.Reader, "input.bed")
record = BED.Record()
while !eof(reader)
    read!(reader, record)
    # perform some operation on `record`
end
close(reader)

Empty record types that correspond to the file format be found using eltype, making it easy to allocate an empty record for any reader stream:

record = eltype(stream)()

Writing data

A FASTA file will be created as follows:

writer = open(FASTA.Writer, "out.fa")
write(writer, FASTA.Record("seq1", dna"ACGTN"))
write(writer, FASTA.Record("seq2", "AT rich", dna"TTATA"))
close(writer)

Another way is using Julia's do-block syntax, which closes the data file after finished writing:

open(FASTA.Writer, "out.fa") do writer
    write(writer, FASTA.Record("seq1", dna"ACGTN"))
    write(writer, FASTA.Record("seq2", "AT rich", dna"TTATA"))
end

Supported file formats

The following table summarizes supported file formats.

File format Prefix Module Specification
FASTA FASTA Bio.Seq https://en.wikipedia.org/wiki/FASTA_format
FASTQ FASTQ Bio.Seq https://en.wikipedia.org/wiki/FASTQ_format
.2bit TwoBit Bio.Seq http://genome.ucsc.edu/FAQ/FAQformat.html#format7
BED BED Bio.Intervals https://genome.ucsc.edu/FAQ/FAQformat.html#format1
GFF3 GFF3 Bio.Intervals https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
bigWig BigWig Bio.Intervals https://doi.org/10.1093/bioinformatics/btq351
bigBed BigBed Bio.Intervals https://doi.org/10.1093/bioinformatics/btq351
PDB PDB Bio.Structure http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
SAM SAM Bio.Align https://samtools.github.io/hts-specs/SAMv1.pdf
BAM BAM Bio.Align https://samtools.github.io/hts-specs/SAMv1.pdf
VCF VCF Bio.Var https://samtools.github.io/hts-specs/VCFv4.3.pdf
BCF BCF Bio.Var https://samtools.github.io/hts-specs/VCFv4.3.pdf

FASTA

FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of records with identifier, description, and sequence. The template of a sequence record is:

>{identifier} {description}?
{sequence}

Here is an example of a chromosomal sequence:

>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG

Usually sequence records will be read sequentially from a file by iteration. But if the FASTA file has an auxiliary index file formatted in fai, the reader supports random access to FASTA records, which would be useful when accessing specific parts of a huge genome sequence:

reader = open(FASTAReader, "sacCer.fa", index="sacCer.fa.fai")
chrIV = reader["chrIV"]  # directly read chromosome 4

# Bio.Seq.FASTA.ReaderType.

FASTA.Reader(input::IO; index=nothing)

Create a data reader of the FASTA file format.

Arguments

source

# Bio.Seq.FASTA.WriterType.

FASTA.Writer(output::IO; width=70)

Create a data writer of the FASTA file format.

Arguments

source

# Bio.Seq.FASTA.RecordType.

FASTA.Record()

Create an unfilled FASTA record.

source

FASTA.Record(data::Vector{UInt8})

Create a FASTA record object from data.

This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.

source

FASTA.Record(str::AbstractString)

Create a FASTA record object from str.

This function verifies and indexes fields for accessors.

source

FASTA.Record(identifier, sequence)

Create a FASTA record object from identifier and sequence.

source

FASTA.Record(identifier, description, sequence)

Create a FASTA record object from identifier, description and sequence.

source

# Bio.Seq.FASTA.identifierFunction.

identifier(record::Record)::String

Get the sequence identifier of record.

source

# Bio.Seq.FASTA.descriptionFunction.

description(record::Record)::String

Get the description of record.

source

# Bio.Seq.FASTA.sequenceFunction.

sequence(::Type{S}, record::Record, [part::UnitRange{Int}])::S

Get the sequence of record.

S can be either a subtype of Bio.Seq.Sequence or String. If part argument is given, it returns the specified part of the sequence.

source

sequence(record::Record, [part::UnitRange{Int}])

Get the sequence of record.

This function infers the sequence type from the data. When it is wrong or unreliable, use sequence(::Type{S}, record::Record). If part argument is given, it returns the specified part of the sequence.

source

FASTQ

FASTQ is a text-based file format for representing DNA sequences along with qualities for each base. A FASTQ file stores a list of sequence records in the following format:

@{identifier} {description}?
{sequence}
+
{quality}

Here is an example of one record from a FASTQ file:

@FSRRS4401BE7HA
tcagTTAAGATGGGAT
+
###EEEEEEEEE##E#

To read a file containing such records, one could use:

# The default base quality encoding is Sanger.
reader = open(FASTQ.Reader, "reads.fastq")
for record in reader
    # do something
end
close(reader)

# If performance is important, in-place reading will be much faster.
reader = open(FASTQ.Reader, "reads.fastq")
record = FASTQ.Record()
while !eof(reader)
    read!(reader, record)
    # do something
end
close(reader)

# Bio.Seq.FASTQ.ReaderType.

FASTQ.Reader(input::IO; fill_ambiguous=nothing)

Create a data reader of the FASTQ file format.

Arguments

source

# Bio.Seq.FASTQ.WriterType.

FASTQ.Writer(output::IO; quality_header=false)

Create a data writer of the FASTQ file format.

Arguments

source

# Bio.Seq.FASTQ.RecordType.

FASTQ.Record()

Create an unfilled FASTQ record.

source

FASTQ.Record(data::Vector{UInt8})

Create a FASTQ record object from data.

This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.

source

FASTQ.Record(str::AbstractString)

Create a FASTQ record object from str.

This function verifies and indexes fields for accessors.

source

FASTQ.Record(identifier, sequence, quality; offset=33)

Create a FASTQ record from identifier, sequence and quality.

source

FASTQ.Record(identifier, description, sequence, quality; offset=33)

Create a FASTQ record from identifier, description, sequence and quality.

source

# Bio.Seq.FASTQ.identifierFunction.

identifier(record::Record)::String

Get the sequence identifier of record.

source

# Bio.Seq.FASTQ.descriptionFunction.

description(record::Record)::String

Get the description of record.

source

# Bio.Seq.FASTQ.sequenceFunction.

sequence(::Type{S}, record::Record, [part::UnitRange{Int}])::S

Get the sequence of record.

S can be either a subtype of Bio.Seq.Sequence or String. If part argument is given, it returns the specified part of the sequence.

source

sequence(record::Record, [part::UnitRange{Int}])::Bio.Seq.DNASequence

Get the sequence of record.

source

# Bio.Seq.FASTQ.qualityFunction.

quality(record::Record, [offset::Integer=33, [part::UnitRange]])::Vector{UInt8}

Get the base quality of record.

source

quality(record::Record, encoding_name::Symbol, [part::UnitRange])::Vector{UInt8}

Get the base quality of record by decoding with encoding_name.

The encoding_name can be either :sanger, :solexa, :illumina13, :illumina15, or :illumina18.

source

.2bit

.2bit is a binary file format designed for storing a genome consists of multiple chromosomal sequences. The reading speed is often an order of magnitude faster than that of FASTA and the file size is smaller. However, since the .2bit file format is specialized for genomic sequences, it cannot store either RNA or amino acid sequences.

Like FASTA, the .2bit reader supports random access using an index included in the header section of a .2bit file:

reader = open(TwoBit.Reader, "sacCer.2bit")  # load a random access index in the header
chrIV = reader["chrIV"]                      # directly read chromosome 4

# Bio.Seq.TwoBit.ReaderType.

TwoBit.Reader(input::IO)

Create a data reader of the 2bit file format.

Arguments

source

# Bio.Seq.TwoBit.WriterType.

TwoBitWriter(output::IO, names::AbstractVector)

Create a data writer of the 2bit file format.

Arguments

source

# Bio.Seq.TwoBit.RecordType.

TwoBit.Record()

Create an unfilled 2bit record.

source

# Bio.Seq.TwoBit.seqnamesFunction.

seqnames(reader::Reader)::Vector{String}

Get the sequence names.

Sequences are stored in this order.

source

# Bio.Seq.TwoBit.sequenceFunction.

sequence([::Type{S},] record::Record)::S

Get the sequence of record as S.

S can be either Bio.Seq.ReferenceSequence or Bio.Seq.DNASequence. If S is omitted, the default type is Bio.Seq.ReferenceSequence.

source

# Bio.Seq.TwoBit.maskedblocksFunction.

maskedblocks(record::Record)::Vector{UnitRange{Int}}

Get the masked blocks.

source

ABIF

ABIF is a binary file format for storing data produced by sequencers, those developed by Applied Biosystems, Inc. When the file is opened, we save all the existing tags, so we can read only the tags that are needed.

reader = open(AbifReader, "3100.ab1")       # load a random access
data   = reader["DATA"]                     # directly read all existing `DATA` Tags
data   = reader[1]                          # directly read Tag at index
data  = tags(reader)                       # return all existing tags

# iterator by all tags
for (key, value) in getindex(reader, data)
end
Bio.Seq.AbifReader
Bio.Seq.tags
Bio.Seq.elements

BED

BED is a text-based file format for representing genomic annotations like genes, transcripts, and so on. A BED file has tab-delimited and variable-length fields; the first three fields denoting a genomic interval are mandatory.

This is an example of RNA transcripts:

chr9    68331023    68424451    NM_015110   0   +
chr9    68456943    68486659    NM_001206   0   -

# Bio.Intervals.BED.ReaderType.

BED.Reader(input::IO)

Create a data reader of the BED file format.

Arguments:

source

# Bio.Intervals.BED.WriterType.

BED.Writer(output::IO)

Create a data writer of the BED file format.

Arguments:

source

# Bio.Intervals.BED.RecordType.

BED.Record()

Create an unfilled BED record.

source

BED.Record(data::Vector{UInt8})

Create a BED record object from data.

This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.

source

BED.Record(str::AbstractString)

Create a BED record object from str.

This function verifies and indexes fields for accessors.

source

# Bio.Intervals.BED.chromFunction.

chrom(record::Record)::String

Get the chromosome name of record.

source

# Bio.Intervals.BED.chromstartFunction.

chromstart(record::Record)::Int

Get the starting position of record.

Note that the first base is numbered 1.

source

# Bio.Intervals.BED.chromendFunction.

chromend(record::Record)::Int

Get the end position of record.

source

# Bio.Intervals.BED.nameFunction.

name(record::Record)::String

Get the name of record.

source

# Bio.Intervals.BED.scoreFunction.

score(record::Record)::Int

Get the score between 0 and 1000.

source

# Bio.Intervals.BED.strandFunction.

strand(record::Record)::Bio.Intervals.Strand

Get the strand of record.

source

# Bio.Intervals.BED.thickstartFunction.

thickstart(record::Record)::Int

Get the starting position at which record is drawn thickly.

Note that the first base is numbered 1.

source

# Bio.Intervals.BED.thickendFunction.

thickend(record::Record)::Int

Get the end position at which record is drawn thickly.

source

# Bio.Intervals.BED.itemrgbFunction.

itemrgb(record::Record)::ColorTypes.RGB

Get the RGB value of record.

The return type is defined in ColorTypes.jl.

source

# Bio.Intervals.BED.blockcountFunction.

blockcount(record::Record)::Int

Get the number of blocks (exons) in record.

source

# Bio.Intervals.BED.blocksizesFunction.

blocksizes(record::Record)::Vector{Int}

Get the block (exon) sizes of record.

source

# Bio.Intervals.BED.blockstartsFunction.

blockstarts(record::Record)::Vector{Int}

Get the block (exon) starts of record.

Note that the first base is numbered 1.

source

GFF3

GFF3 is a text-based file format for representing genomic annotations. The major difference from BED is that is GFF3 is more structured and can include sequences in the FASTA file format.

# Bio.Intervals.GFF3.ReaderType.

GFF3Reader(input::IO; save_directives::Bool=false)

Create a reader for data in GFF3 format.

Arguments:

source

# Bio.Intervals.GFF3.WriterType.

GFF3.Writer(output::IO)

Create a data writer of the GFF3 file format.

Arguments:

source

# Bio.Intervals.GFF3.RecordType.

GFF3.Record()

Create an unfilled GFF3 record.

source

GFF3.Record(data::Vector{UInt8})

Create a GFF3 record object from data. This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.

source

GFF3.Record(str::AbstractString)

Create a GFF3 record object from str. This function verifies and indexes fields for accessors.

source

# Bio.Intervals.GFF3.isfeatureFunction.

isfeature(record::Record)::Bool

Test if record is a feature record.

source

# Bio.Intervals.GFF3.isdirectiveFunction.

isdirective(record::Record)::Bool

Test if record is a directive record.

source

# Bio.Intervals.GFF3.iscommentFunction.

iscomment(record::Record)::Bool

Test if record is a comment record.

source

# Bio.Intervals.GFF3.seqidFunction.

seqid(record::Record)::String

Get the sequence ID of record.

source

# Bio.Intervals.GFF3.sourceFunction.

source(record::Record)::String

Get the source of record.

source

# Bio.Intervals.GFF3.featuretypeFunction.

featuretype(record::Record)::String

Get the type of record.

source

# Bio.Intervals.GFF3.seqstartFunction.

seqstart(record::Record)::Int

Get the start coordinate of record.

source

# Bio.Intervals.GFF3.seqendFunction.

seqend(record::Record)::Int

Get the end coordinate of record.

source

# Bio.Intervals.GFF3.scoreFunction.

score(record::Record)::Float64

Get the score of record

source

# Bio.Intervals.GFF3.strandFunction.

strand(record::Record)::Bio.Intervals.Strand

Get the strand of record.

source

# Bio.Intervals.GFF3.phaseFunction.

phase(record::Record)::Int

Get the phase of record.

source

# Bio.Intervals.GFF3.attributesFunction.

attributes(record::Record)::Vector{Pair{String,Vector{String}}}

Get the attributes of record.

source

attributes(record::Record, key::String)::Vector{String}

Get the attributes of record with key.

source

# Bio.Intervals.GFF3.contentFunction.

content(record::Record)::String

Get the content of record. Leading '#' letters are removed.

source

bigWig

bigWig is a binary file format for associating a floating point number with each base in the genome. bigWig files are indexed to quickly fetch specific regions.

# Bio.Intervals.BigWig.ReaderType.

BigWig.Reader(input::IO)

Create a reader for bigWig file format.

Note that input must be seekable.

source

# Bio.Intervals.BigWig.chromlistFunction.

chromlist(reader::BigWig.Reader)::Vector{Tuple{String,Int}}

Get the (name, length) pairs of chromosomes/contigs.

source

# Bio.Intervals.BigWig.valuesFunction.

values(reader::BigWig.Reader, interval::Interval)::Vector{Float32}

Get a vector of values within interval from reader.

This function fills missing values with NaN32.

source

values(reader::BigWig.Reader, chrom::AbstractString, range::UnitRange)::Vector{Float32}

Get a vector of values within range of chrom from reader.

This function fills missing values with NaN32.

source

# Bio.Intervals.BigWig.WriterType.

BigWig.Writer(output::IO, chromlist; binsize=64, datatype=:bedgraph)

Create a data writer of the bigWig file format.

Arguments

Examples

output = open("data.bw", "w")
writer = BigWig.Writer(output, [("chr1", 12345), ("chr2", 9100)])
write(writer, ("chr1", 501, 600, 1.0))
write(writer, ("chr2", 301, 450, 3.0))
close(writer)

source

# Bio.Intervals.BigWig.RecordType.

BigWig.Record()

Create an unfilled bigWig record.

source

# Bio.Intervals.BigWig.chromFunction.

chrom(record::Record)::String

Get the chromosome name of record.

source

# Bio.Intervals.BigWig.chromidFunction.

chromid(record::Record)::UInt32

Get the chromosome ID of record.

source

# Bio.Intervals.BigWig.chromstartFunction.

chromstart(record::Record)::Int

Get the start position of record.

source

# Bio.Intervals.BigWig.chromendFunction.

chromend(record::Record)::Int

Get the end position of record.

source

# Bio.Intervals.BigWig.valueFunction.

value(record::Record)::Float32

Get the value of record.

source

bigBed

bigBed is a binary file format for representing genomic annotations and often created from BED files. bigBed files are indexed to quickly fetch specific regions.

# Bio.Intervals.BigBed.ReaderType.

BigBed.Reader(input::IO)

Create a reader for bigBed file format.

Note that input must be seekable.

source

# Bio.Intervals.BigBed.chromlistFunction.

chromlist(reader::BigBed.Reader)::Vector{Tuple{String,Int}}

Get the (name, length) pairs of chromosomes/contigs.

source

# Bio.Intervals.BigBed.WriterType.

BigBed.Writer(output::IO, chromlist; binsize=64)

Create a data writer of the bigBed file format.

Arguments

Examples

output = open("data.bb", "w")
writer = BigBed.Writer(output, [("chr1", 12345), ("chr2", 9100)])
write(writer, ("chr1", 101, 150, "gene 1"))
write(writer, ("chr2", 211, 250, "gene 2"))
close(writer)

source

# Bio.Intervals.BigBed.RecordType.

BigBed.Record()

Create an unfilled bigBed record.

source

# Bio.Intervals.BigBed.chromFunction.

chrom(record::Record)::String

Get the chromosome name of record.

source

# Bio.Intervals.BigBed.chromidFunction.

chromid(record::Record)::UInt32

Get the chromosome ID of record.

source

# Bio.Intervals.BigBed.chromstartFunction.

chromstart(record::Record)::Int

Get the start position of record.

source

# Bio.Intervals.BigBed.chromendFunction.

chromend(record::Record)::Int

Get the end position of record.

source

# Bio.Intervals.BigBed.nameFunction.

name(record::Record)::String

Get the name of record.

source

# Bio.Intervals.BigBed.scoreFunction.

score(record::Record)::Int

Get the score between 0 and 1000.

source

# Bio.Intervals.BigBed.strandFunction.

strand(record::Record)::Bio.Intervals.Strand

Get the strand of record.

source

# Bio.Intervals.BigBed.thickstartFunction.

thickstart(record::Record)::Int

Get the starting position at which record is drawn thickly.

Note that the first base is numbered 1.

source

# Bio.Intervals.BigBed.thickendFunction.

thickend(record::Record)::Int

Get the end position at which record is drawn thickly.

source

# Bio.Intervals.BigBed.itemrgbFunction.

itemrgb(record::Record)::ColorTypes.RGB

Get the RGB value of record.

The return type is defined in ColorTypes.jl.

source

# Bio.Intervals.BigBed.blockcountFunction.

blockcount(record::Record)::Int

Get the number of blocks (exons) in record.

source

# Bio.Intervals.BigBed.blocksizesFunction.

blocksizes(record::Record)::Vector{Int}

Get the block (exon) sizes of record.

source

# Bio.Intervals.BigBed.blockstartsFunction.

blockstarts(record::Record)::Vector{Int}

Get the block (exon) starts of record.

Note that the first base is numbered 1.

source

# Bio.Intervals.BigBed.optionalsFunction.

optionals(record::Record)::Vector{String}

Get optional fields as strings.

source

PDB

PDB is a text-based file format for representing 3D macromolecular structures. This has different reader interfaces from other file formats. Please consult the Bio.Structure chapter for details.

SAM

SAM is a text-based file format for representing sequence alignments.

# Bio.Align.SAM.ReaderType.

SAM.Reader(input::IO)

Create a data reader of the SAM file format.

Arguments

source

# Bio.Align.SAM.headerFunction.

header(reader::Reader)::Header

Get the header of reader.

source

# Bio.Align.SAM.HeaderType.

SAM.Header()

Create an empty header.

source

# Base.findMethod.

find(header::Header, key::AbstractString)::Vector{MetaInfo}

Find metainfo objects satisfying SAM.tag(metainfo) == key.

source

# Bio.Align.SAM.WriterType.

Writer(output::IO, header::Header=Header())

Create a data writer of the SAM file format.

Arguments

source

# Bio.Align.SAM.MetaInfoType.

MetaInfo(str::AbstractString)

Create a SAM metainfo from str.

Examples

julia> SAM.MetaInfo("@CO    some comment")
Bio.Align.SAM.MetaInfo:
    tag: CO
  value: some comment

julia> SAM.MetaInfo("@SQ    SN:chr1 LN:12345")
Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=chr1 LN=12345

source

MetaInfo(tag::AbstractString, value)

Create a SAM metainfo with tag and value.

tag is a two-byte ASCII string. If tag is "CO", value must be a string; otherwise, value is an iterable object with key and value pairs.

Examples

julia> SAM.MetaInfo("CO", "some comment")
Bio.Align.SAM.MetaInfo:
    tag: CO
  value: some comment

julia> string(ans)
"@CO    some comment"

julia> SAM.MetaInfo("SQ", ["SN" => "chr1", "LN" => 12345])
Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=chr1 LN=12345

julia> string(ans)
"@SQ    SN:chr1 LN:12345"

source

# Bio.Align.SAM.iscommentFunction.

iscomment(metainfo::MetaInfo)::Bool

Test if metainfo is a comment (i.e. its tag is "CO").

source

# Bio.Align.SAM.tagFunction.

tag(metainfo::MetaInfo)::String

Get the tag of metainfo.

source

# Bio.Align.SAM.valueFunction.

value(metainfo::MetaInfo)::String

Get the value of metainfo as a string.

source

# Bio.Align.SAM.keyvaluesFunction.

keyvalues(metainfo::MetaInfo)::Vector{Pair{String,String}}

Get the values of metainfo as string pairs.

source

# Bio.Align.SAM.RecordType.

SAM.Record()

Create an unfilled SAM record.

source

SAM.Record(data::Vector{UInt8})

Create a SAM record from data. This function verifies the format and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.

source

SAM.Record(str::AbstractString)

Create a SAM record from str. This function verifies the format and indexes fields for accessors.

source

# Bio.Align.SAM.flagFunction.

flag(record::Record)::UInt16

Get the bitwise flag of record.

source

# Bio.Align.SAM.ismappedFunction.

ismapped(record::Record)::Bool

Test if record is mapped.

source

# Bio.Align.SAM.isprimaryFunction.

isprimary(record::Record)::Bool

Test if record is a primary line of the read.

This is equivalent to flag(record) & 0x900 == 0.

source

# Bio.Align.SAM.refnameFunction.

refname(record::Record)::String

Get the reference sequence name of record.

source

# Bio.Align.SAM.positionFunction.

position(record::Record)::Int

Get the 1-based leftmost mapping position of record.

source

# Bio.Align.SAM.rightpositionFunction.

rightposition(record::Record)::Int

Get the 1-based rightmost mapping position of record.

source

# Bio.Align.SAM.isnextmappedFunction.

isnextmapped(record::Record)::Bool

Test if the mate/next read of record is mapped.

source

# Bio.Align.SAM.nextrefnameFunction.

nextrefname(record::Record)::String

Get the reference name of the mate/next read of record.

source

# Bio.Align.SAM.nextpositionFunction.

nextposition(record::Record)::Int

Get the position of the mate/next read of record.

source

# Bio.Align.SAM.mappingqualityFunction.

mappingquality(record::Record)::UInt8

Get the mapping quality of record.

source

# Bio.Align.SAM.cigarFunction.

cigar(record::Record)::String

Get the CIGAR string of record.

source

# Bio.Align.SAM.alignmentFunction.

alignment(record::Record)::Bio.Align.Alignment

Get the alignment of record.

source

# Bio.Align.SAM.alignlengthFunction.

alignlength(record::Record)::Int

Get the alignment length of record.

source

# Bio.Align.SAM.tempnameFunction.

tempname(record::Record)::String

Get the query template name of record.

source

# Bio.Align.SAM.templengthFunction.

templength(record::Record)::Int

Get the template length of record.

source

# Bio.Align.SAM.sequenceFunction.

sequence(record::Record)::Bio.Seq.DNASequence

Get the segment sequence of record.

source

sequence(::Type{String}, record::Record)::String

Get the segment sequence of record as String.

source

# Bio.Align.SAM.seqlengthFunction.

seqlength(record::Record)::Int

Get the sequence length of record.

source

# Bio.Align.SAM.qualityFunction.

quality(record::Record)::Vector{UInt8}

Get the Phred-scaled base quality of record.

source

quality(::Type{String}, record::Record)::String

Get the ASCII-encoded base quality of record.

source

# Bio.Align.SAM.auxdataFunction.

auxdata(record::Record)::Dict{String,Any}

Get the auxiliary data (optional fields) of record.

source

This module provides 16-bit flags defined in the SAM specs:

Flag Bit Description
SAM.FLAG_PAIRED 0x0001 template having multiple segments in sequencing
SAM.FLAG_PROPER_PAIR 0x0002 each segment properly aligned according to the aligner
SAM.FLAG_UNMAP 0x0004 segment unmapped
SAM.FLAG_MUNMAP 0x0008 next segment in the template unmapped
SAM.FLAG_REVERSE 0x0010 SEQ being reverse complemented
SAM.FLAG_MREVERSE 0x0020 SEQ of the next segment in the template being reverse complemented
SAM.FLAG_READ1 0x0040 the first segment in the template
SAM.FLAG_READ2 0x0080 the last segment in the template
SAM.FLAG_SECONDARY 0x0100 secondary alignment
SAM.FLAG_QCFAIL 0x0200 not passing filters, such as platform/vendor quality controls
SAM.FLAG_DUP 0x0400 PCR or optical duplicate
SAM.FLAG_SUPPLEMENTARY 0x0800 supplementary alignment

BAM

BAM is a binary counterpart of the SAM file format.

When writing data in the BAM file format, the underlying output stream needs to be wrapped with a BGZFStream object provided from BGZFStreams.jl.

Flags and the header type are defined in the SAM module.

# Bio.Align.BAM.ReaderType.

BAM.Reader(input::IO; index=nothing)

Create a data reader of the BAM file format.

Arguments

source

# Bio.Align.BAM.headerFunction.

header(reader::Reader; fillSQ::Bool=false)::SAM.Header

Get the header of reader.

If fillSQ is true, this function fills missing "SQ" metainfo in the header.

source

# Bio.Align.BAM.WriterType.

BAM.Writer(output::BGZFStream, header::SAM.Header)

Create a data writer of the BAM file format.

Arguments

source

# Bio.Align.BAM.RecordType.

BAM.Record()

Create an unfilled BAM record.

source

# Bio.Align.BAM.flagFunction.

flag(record::Record)::UInt16

Get the bitwise flag of record.

source

# Bio.Align.BAM.ismappedFunction.

ismapped(record::Record)::Bool

Test if record is mapped.

source

# Bio.Align.BAM.isprimaryFunction.

isprimary(record::Record)::Bool

Test if record is a primary line of the read.

This is equivalent to flag(record) & 0x900 == 0.

source

# Bio.Align.BAM.refidFunction.

refid(record::Record)::Int

Get the reference sequence ID of record.

The ID is 1-based (i.e. the first sequence is 1) and is 0 for a record without a mapping position.

See also: BAM.rname

source

# Bio.Align.BAM.refnameFunction.

refname(record::Record)::String

Get the reference sequence name of record.

See also: BAM.refid

source

# Bio.Align.BAM.positionFunction.

position(record::Record)::Int

Get the 1-based leftmost mapping position of record.

source

# Bio.Align.BAM.rightpositionFunction.

rightposition(record::Record)::Int

Get the 1-based rightmost mapping position of record.

source

# Bio.Align.BAM.isnextmappedFunction.

isnextmapped(record::Record)::Bool

Test if the mate/next read of record is mapped.

source

# Bio.Align.BAM.nextrefidFunction.

nextrefid(record::Record)::Int

Get the next/mate reference sequence ID of record.

source

# Bio.Align.BAM.nextrefnameFunction.

nextrefname(record::Record)::String

Get the reference name of the mate/next read of record.

source

# Bio.Align.BAM.nextpositionFunction.

nextposition(record::Record)::Int

Get the 1-based leftmost mapping position of the next/mate read of record.

source

# Bio.Align.BAM.mappingqualityFunction.

mappingquality(record::Record)::UInt8

Get the mapping quality of record.

source

# Bio.Align.BAM.cigarFunction.

cigar(record::Record)::String

Get the CIGAR string of record.

See also BAM.cigar_rle.

source

# Bio.Align.BAM.cigar_rleFunction.

cigar_rle(record::Record)::Tuple{Vector{Bio.Align.Operation},Vector{Int}}

Get a run-length encoded tuple (ops, lens) of the CIGAR string in record.

See also BAM.cigar.

source

# Bio.Align.BAM.alignmentFunction.

alignment(record::Record)::Bio.Align.Alignment

Get the alignment of record.

source

# Bio.Align.BAM.alignlengthFunction.

alignlength(record::Record)::Int

Get the alignment length of record.

source

# Bio.Align.BAM.tempnameFunction.

tempname(record::Record)::String

Get the query template name of record.

source

# Bio.Align.BAM.templengthFunction.

templength(record::Record)::Int

Get the template length of record.

source

# Bio.Align.BAM.sequenceFunction.

sequence(record::Record)::Bio.Seq.DNASequence

Get the segment sequence of record.

source

# Bio.Align.BAM.seqlengthFunction.

seqlength(record::Record)::Int

Get the sequence length of record.

source

# Bio.Align.BAM.qualityFunction.

quality(record::Record)::Vector{UInt8}

Get the base quality of record.

source

# Bio.Align.BAM.auxdataFunction.

auxdata(record::Record)::BAM.AuxData

Get the auxiliary data of record.

source

VCF

VCF is a text-based file format for representing genetic variations.

# Bio.Var.VCF.ReaderType.

VCF.Reader(input::IO)

Create a data reader of the VCF file format.

Arguments

source

# Bio.Var.VCF.WriterType.

VCF.Writer(output::IO, header::VCF.Header)

Create a data writer of the VCF file format.

Arguments

source

# Bio.Var.VCF.RecordType.

VCF.Record()

Create an unfilled VCF record.

source

VCF.Record(data::Vector{UInt8})

Create a VCF object from data containing a VCF record. This function verifies the format and indexes fields for accessors. Note that the ownership of data is transferred to a new record.

source

VCF.Record(str::AbstractString)

Create a VCF object from str containing a VCF record. This function verifies the format and indexes fields for accessors.

source

# Bio.Var.VCF.chromFunction.

chrom(record::Record)::String

Get the chromosome name of record.

source

# Bio.Var.VCF.posFunction.

pos(record::Record)::Int

Get the reference position of record.

source

# Bio.Var.VCF.idFunction.

id(record::Record)::Vector{String}

Get the identifiers of record.

source

# Bio.Var.VCF.refFunction.

ref(record::Record)::String

Get the reference bases of record.

source

# Bio.Var.VCF.altFunction.

alt(record::Record)::Vector{String}

Get the alternate bases of record.

source

# Bio.Var.VCF.qualFunction.

qual(record::Record)::Float64

Get the quality score of record.

source

# Bio.Var.VCF.filterFunction.

filter(record::Record)::Vector{String}

Get the filter status of record.

source

# Bio.Var.VCF.infoFunction.

info(record::Record)::Vector{Pair{String,String}}

Get the additional information of record.

source

info(record::Record, key::String)::String

Get the additional information of record with key. Keys without corresponding values return an empty string.

source

# Bio.Var.VCF.formatFunction.

format(record::Record)::Vector{String}

Get the genotype format of record.

source

BCF

BCF is a binary counterpart of the VCF file format.

# Bio.Var.BCF.ReaderType.

BCF.Reader(input::IO)

Create a data reader of the BCF file format.

Arguments

source

# Bio.Var.BCF.WriterType.

BCF.Writer(output::IO, header::VCF.Header)

Create a data writer of the BCF file format.

Arguments

source

# Bio.Var.BCF.RecordType.

BCF.Record()

Create an unfilled BCF record.

source

# Bio.Var.BCF.chromFunction.

chrom(record::Record)::Int

Get the chromosome index of record.

source

# Bio.Var.BCF.posFunction.

pos(record::Record)::Int

Get the reference position of record.

Note that the position of the first base is 1 (i.e. 1-based coordinate).

source

# Bio.Var.BCF.rlenFunction.

rlen(record::Record)::Int

Get the length of record projected onto the reference sequence.

source

# Bio.Var.BCF.qualFunction.

qual(record::Record)::Float32

Get the quality score of record.

Note that 0x7F800001 (signaling NaN) is interpreted as a missing value.

source

# Bio.Var.BCF.refFunction.

ref(record::Record)::String

Get the reference bases of record.

source

# Bio.Var.BCF.altFunction.

alt(record::Record)::Vector{String}

Get the alternate bases of record.

source

# Bio.Var.BCF.filterFunction.

filter(record::Record)::Vector{Int}

Get the filter indexes of record.

source

# Bio.Var.BCF.infoFunction.

info(record::Record, [simplify::Bool=true])::Vector{Tuple{Int,Any}}

Get the additional information of record.

When simplify is true, a vector with a single element is converted to the element itself and an empty vector of the void type is converted to nothing.

source

info(record::Record, key::Integer, [simplify::Bool=true])

Get the additional information of record with key.

When simplify is true, a vector with a single element is converted to the element itself and an empty vector of the void type is converted to nothing.

source

# Bio.Var.BCF.genotypeFunction.

genotype(record::Record)::Vector{Tuple{Int,Vector{Any}}}

Get the genotypes of record.

BCF genotypes are encoded by field, not by sample like VCF.

source