Reading and writing data

Bio.jl has a unified interface for reading and writing files in a variety of formats. Reader and writer type names are prefixed with the file format name: files of a format X are read with X.Reader and written with X.Writer. To initialize a reader or writer for X, use one of the following forms:

# reader
open(::Type{X.Reader}, filepath::AbstractString, args...)
X.Reader(stream::IO, args...)

# writer
open(::Type{X.Writer}, filepath::AbstractString, args...)
X.Writer(stream::IO, args...)

For example, when reading a FASTA file, a reader for the FASTA file format can be initialized as:

using Bio.Seq  # import FASTA
reader = open(FASTA.Reader, "hg38.fa")
# do something
close(reader)
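
To make the relationship between the path-based and stream-based forms concrete, here is a minimal sketch in plain Julia. `ToyReader` is a hypothetical type invented for illustration, not part of Bio.jl, and the sketch only assumes the common Julia convention that `open(T, path)` opens the file and wraps the resulting stream. It treats each line as one record.

```julia
# ToyReader is a hypothetical stand-in for a format reader such as X.Reader;
# it yields one line per record and is not part of Bio.jl.
struct ToyReader
    io::IO
end

# Path-based form: open the file, then delegate to the stream-based constructor.
Base.open(::Type{ToyReader}, path::AbstractString) = ToyReader(open(path))
Base.close(reader::ToyReader) = close(reader.io)

# Iteration protocol: return one record at a time until the stream is exhausted.
Base.eltype(::Type{ToyReader}) = String
Base.IteratorSize(::Type{ToyReader}) = Base.SizeUnknown()
Base.iterate(reader::ToyReader, state=nothing) =
    eof(reader.io) ? nothing : (readline(reader.io), nothing)

# Stream-based form, using an in-memory buffer instead of a file.
reader = ToyReader(IOBuffer("rec1\nrec2\n"))
records = collect(reader)
close(reader)
```

Because the stream-based constructor accepts any `IO`, the same reader works on files, in-memory buffers, and other stream types.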

Reading by iteration

Readers in Bio.jl all read and return entries one at a time. The most convenient way to do this is by iteration:

reader = open(BED.Reader, "input.bed")
for record in reader
    # perform some operation on `record`
end
close(reader)

In-place reading

Iterating through entries in a file is convenient, but the reader must allocate a new object for each entry, and the garbage collector must eventually spend time deallocating it. For performance-critical applications, a separate lower-level parsing interface avoids this allocation by overwriting a single preallocated entry. For files with a large number of small entries, this can greatly speed up reading.

Instead of looping over a reader stream, read! is called with a preallocated entry. Some care is necessary when using this interface because record is completely overwritten on each iteration:

reader = open(BED.Reader, "input.bed")
record = BED.Record()
while !eof(reader)
    read!(reader, record)
    # perform some operation on `record`
end
close(reader)

The record type that corresponds to a file format can be found using eltype, making it easy to allocate an empty record for any reader stream:

record = eltype(stream)()
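
The overwrite caveat can be illustrated without Bio.jl. The sketch below uses only Base Julia: a byte buffer `buf` stands in for the preallocated record, and `read!` on an `IOBuffer` stands in for the reader. It shows the general aliasing behaviour of in-place reads, not Bio.jl's own implementation. Storing the reused object keeps only its latest contents, while storing a copy keeps each entry.

```julia
io = IOBuffer(UInt8[1, 2, 3, 4, 5, 6])
buf = zeros(UInt8, 2)            # preallocated once, reused for every read

aliased = Vector{UInt8}[]        # will hold references to the same object
copied = Vector{UInt8}[]         # will hold independent snapshots

while !eof(io)
    read!(io, buf)               # overwrites `buf` in place
    push!(aliased, buf)          # same object pushed each time
    push!(copied, copy(buf))     # snapshot of the current contents
end

# Every element of `aliased` now shows the last chunk read;
# `copied` holds the three distinct chunks in order.
```

If an entry has to outlive the loop iteration in which it was read, copy it before the next call to read!.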

Writing data

A FASTA file can be written as follows:

writer = open(FASTA.Writer, "out.fa")
write(writer, FASTA.Record("seq1", dna"ACGTN"))
write(writer, FASTA.Record("seq2", "AT rich", dna"TTATA"))
close(writer)

Another way is to use Julia's do-block syntax, which closes the data file once writing has finished:

open(FASTA.Writer, "out.fa") do writer
    write(writer, FASTA.Record("seq1", dna"ACGTN"))
    write(writer, FASTA.Record("seq2", "AT rich", dna"TTATA"))
end
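
For readers unfamiliar with do-blocks, the sketch below shows the same idea with a plain Base `IO` stream rather than a Bio.jl writer: the do-block form behaves like an explicit try/finally that always closes the stream. The temporary file and the hand-written FASTA lines are only there to keep the example self-contained.

```julia
path, tmp = mktemp()   # temporary file so the sketch is self-contained
close(tmp)

# Explicit form: close the stream even if writing throws.
out = open(path, "w")
try
    println(out, ">seq1")
    println(out, "ACGTN")
finally
    close(out)
end
explicit = read(path, String)

# do-block form: `open` closes the stream when the block exits.
open(path, "w") do out2
    println(out2, ">seq1")
    println(out2, "ACGTN")
end
doblock = read(path, String)
```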

Supported file formats

The following table summarizes supported file formats.

File format	Prefix	Module	Specification
FASTA	`FASTA`	`Bio.Seq`	https://en.wikipedia.org/wiki/FASTA_format
FASTQ	`FASTQ`	`Bio.Seq`	https://en.wikipedia.org/wiki/FASTQ_format
.2bit	`TwoBit`	`Bio.Seq`	http://genome.ucsc.edu/FAQ/FAQformat.html#format7
BED	`BED`	`Bio.Intervals`	https://genome.ucsc.edu/FAQ/FAQformat.html#format1
GFF3	`GFF3`	`Bio.Intervals`	https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md
bigWig	`BigWig`	`Bio.Intervals`	https://doi.org/10.1093/bioinformatics/btq351
bigBed	`BigBed`	`Bio.Intervals`	https://doi.org/10.1093/bioinformatics/btq351
PDB	`PDB`	`Bio.Structure`	http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
SAM	`SAM`	`Bio.Align`	https://samtools.github.io/hts-specs/SAMv1.pdf
BAM	`BAM`	`Bio.Align`	https://samtools.github.io/hts-specs/SAMv1.pdf
VCF	`VCF`	`Bio.Var`	https://samtools.github.io/hts-specs/VCFv4.3.pdf
BCF	`BCF`	`Bio.Var`	https://samtools.github.io/hts-specs/VCFv4.3.pdf

FASTA

Reader type: FASTA.Reader
Writer type: FASTA.Writer
Element type: FASTA.Record

FASTA is a text-based file format for representing biological sequences. A FASTA file stores a list of records, each with an identifier, an optional description, and a sequence. The template of a sequence record is:

>{identifier} {description}?
{sequence}

Here is an example of a chromosomal sequence:

>chrI chromosome 1
CCACACCACACCCACACACCCACACACCACACCACACACCACACCACACC
CACACACACACATCCTAACACTACCCTAACACAGCCCTAATCTAACCCTG
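
As a rough illustration of the header template, the following Base-only sketch splits a header line into its identifier and optional description. `parse_header` is a hypothetical helper written for this example and is not how Bio.jl parses FASTA. It assumes that the identifier ends at the first space.

```julia
# Hypothetical helper: split ">identifier description" into its two parts.
function parse_header(line::AbstractString)
    startswith(line, ">") || error("not a FASTA header line")
    # Drop the leading '>' and split once, at the first space.
    parts = split(line[2:end], ' '; limit=2)
    identifier = String(parts[1])
    description = length(parts) == 2 ? String(parts[2]) : nothing
    return identifier, description
end

with_description = parse_header(">chrI chromosome 1")
without_description = parse_header(">seq1")
```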

Usually sequence records are read sequentially from a file by iteration. If the FASTA file has an auxiliary index file in the fai format, the reader also supports random access to FASTA records, which is useful when accessing specific parts of a huge genome sequence:

reader = open(FASTA.Reader, "sacCer.fa", index="sacCer.fa.fai")
chrIV = reader["chrIV"]  # directly read chromosome 4
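
To give an idea of why an index makes random access cheap, here is a sketch of the offset arithmetic. It assumes the usual fai layout produced by samtools faidx, in which each entry records the byte offset of the first base, the number of bases per line (linebases), and the number of bytes per line including the newline (linewidth). `fai_byteoffset` and `base_at` are hypothetical helpers, and the tiny in-memory FASTA string and its index values are made up for this example. This is not Bio.jl's implementation.

```julia
# 0-based byte offset of the base at 1-based position `pos`, computed from
# the offset, linebases, and linewidth fields of an fai entry.
fai_byteoffset(offset, linebases, linewidth, pos) =
    offset + div(pos - 1, linebases) * linewidth + rem(pos - 1, linebases)

# Toy FASTA data: the header ">s1\n" occupies 4 bytes, each full sequence
# line holds 4 bases and takes 5 bytes including the newline.
fasta = ">s1\nACGT\nTTAA\nGG\n"
io = IOBuffer(fasta)

# Jump straight to the requested base without scanning the preceding lines.
function base_at(io, pos)
    seek(io, fai_byteoffset(4, 4, 5, pos))
    return read(io, Char)
end

first_base = base_at(io, 1)
fifth_base = base_at(io, 5)
ninth_base = base_at(io, 9)
```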

# Bio.Seq.FASTA.Reader — Type.

FASTA.Reader(input::IO; index=nothing)

Create a data reader of the FASTA file format.

Arguments

input: data source
index=nothing: filepath to a random access index (currently fai is supported)

Flag	Bit	Description
`SAM.FLAG_PAIRED`	`0x0001`	template having multiple segments in sequencing
`SAM.FLAG_PROPER_PAIR`	`0x0002`	each segment properly aligned according to the aligner
`SAM.FLAG_UNMAP`	`0x0004`	segment unmapped
`SAM.FLAG_MUNMAP`	`0x0008`	next segment in the template unmapped
`SAM.FLAG_REVERSE`	`0x0010`	SEQ being reverse complemented
`SAM.FLAG_MREVERSE`	`0x0020`	SEQ of the next segment in the template being reverse complemented
`SAM.FLAG_READ1`	`0x0040`	the first segment in the template
`SAM.FLAG_READ2`	`0x0080`	the last segment in the template
`SAM.FLAG_SECONDARY`	`0x0100`	secondary alignment
`SAM.FLAG_QCFAIL`	`0x0200`	not passing filters, such as platform/vendor quality controls
`SAM.FLAG_DUP`	`0x0400`	PCR or optical duplicate
`SAM.FLAG_SUPPLEMENTARY`	`0x0800`	supplementary alignment
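
Each flag in the table occupies a single bit, so a record's flag field can be tested with a bitwise AND. The sketch below uses local constants copied from the table so that it runs without Bio.jl. `isflagset` is a hypothetical helper, and the example flag value is made up.

```julia
# Local copies of a few bit values from the table above.
const FLAG_PAIRED      = 0x0001
const FLAG_PROPER_PAIR = 0x0002
const FLAG_UNMAP       = 0x0004
const FLAG_REVERSE     = 0x0010
const FLAG_READ1       = 0x0040

# Hypothetical helper: true if `bit` is set in `flag`.
isflagset(flag, bit) = (flag & bit) != 0

# Example flag value: paired, properly paired, reverse strand, first segment.
flag = FLAG_PAIRED | FLAG_PROPER_PAIR | FLAG_REVERSE | FLAG_READ1

is_reverse = isflagset(flag, FLAG_REVERSE)
is_unmapped = isflagset(flag, FLAG_UNMAP)
```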
