SAM and BAM

Introduction

The XAM package offers high-performance tools for SAM and BAM file formats, which are the most popular file formats.

If you have questions about the SAM and BAM formats or any of the terminology used when discussing these formats, see the published specification, which is maintained by the samtools group.

A very very simple SAM file looks like the following:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1

Where the first two lines are part of the "header", and the following lines are "records". Each record describes how a read aligns to some reference sequence. Sometimes one record describes one read, but there are other cases like chimeric reads and split alignments, where multiple records apply to one read. In the example above, r003 is a chimeric read, and r004 is a split alignment, and r001 are mate pair reads. Again, we refer you to the official specification for more details.

A BAM file stores this same information but in a binary and compressible format that does not make for pretty printing here!

Reading SAM and BAM files

A typical script iterating over all records in a file looks like below:

using XAM

# Open a BAM file.
reader = open(BAM.Reader, "data.bam")

# Iterate over BAM records.
for record in reader
    # `record` is a BAM.Record object.
    if BAM.ismapped(record)
        # Print the mapped position.
        println(BAM.refname(record), ':', BAM.position(record))
    end
end

# Close the BAM file.
close(reader)

The size of a BAM file is often extremely large. The iterator interface demonstrated above allocates an object for each record and that may be a bottleneck of reading data from a BAM file. In-place reading reuses a pre-allocated object for every record and less memory allocation happens in reading:

reader = open(BAM.Reader, "data.bam")
record = BAM.Record()
while !eof(reader)
    empty!(record)
    read!(reader, record)
    # do something
end

SAM and BAM Headers

Both SAM.Reader and BAM.Reader implement the header function, which returns a SAM.Header object. To extract certain information out of the headers, you can use the findall method on the header to extract information according to SAM/BAM tag. Again we refer you to the specification for full details of all the different tags that can occur in headers, and what they mean.

Below is an example of extracting all the info about the reference sequences from the BAM header. In SAM/BAM, any description of a reference sequence is stored in the header, under a tag denoted SQ (think reference SeQuence!).

julia> reader = open(SAM.Reader, "data.sam");

julia> findall(SAM.header(reader), "SQ")
7-element Array{Bio.Align.SAM.MetaInfo,1}:
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr1 LN=30427671
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr2 LN=19698289
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr3 LN=23459830
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr4 LN=18585056
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr5 LN=26975502
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=chloroplast LN=154478
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=mitochondria LN=366924

In the above we can see there were 7 sequences in the reference: 5 chromosomes, one chloroplast sequence, and one mitochondrial sequence.

SAM and BAM Records

SAM.Record

The XAM package supports the following accessors for SAM.Record types.

Missing docstring.

Missing docstring for XAM.SAM.flags. Check Documenter's build log for details.

XAM.ismappedFunction
ismapped(record::XAMRecord)::Bool

Query whether the segment is mapped.

source
XAM.isprimaryalignmentFunction
isprimaryalignment(record::XAMRecord)::Bool

Query whether the read alignment is considered to be the primary alignment. This is primary line of the read and is equivalent to flags(record) & 0x900 == 0.

source
XAM.SAM.refnameFunction
refname(record::Record)::String

Get the reference sequence name of record.

source
XAM.SAM.positionFunction
position(record::Record)::Int

Get the 1-based leftmost mapping position of record.

source
XAM.isnextmappedFunction
isnextmapped(record::XAMRecord)::Bool

Query whether the next segment in the template is mapped.

source
XAM.SAM.nextrefnameFunction
nextrefname(record::Record)::String

Get the reference name of the mate/next read of record.

source
XAM.SAM.sequenceFunction
sequence(record::Record)::BioSequences.LongDNA{4}

Get the segment sequence of record.

source
sequence(::Type{String}, record::Record)::String

Get the segment sequence of record as String.

source
XAM.SAM.qualityFunction
quality(record::Record)::Vector{UInt8}

Get the Phred-scaled base quality of record.

source
quality(::Type{String}, record::Record)::String

Get the ASCII-encoded base quality of record.

source
XAM.SAM.auxdataFunction
auxdata(record::Record)::Dict{String,Any}

Get the auxiliary data (optional fields) of record.

source

BAM.Record

The XAM package supports the following accessors for BAM.Record types.

Missing docstring.

Missing docstring for XAM.BAM.flags. Check Documenter's build log for details.

Missing docstring.

Missing docstring for XAM.BAM.ismapped. Check Documenter's build log for details.

Missing docstring.

Missing docstring for XAM.BAM.isprimaryalignment. Check Documenter's build log for details.

XAM.BAM.refidFunction
refid(record::Record)::Int

Get the reference sequence ID of record.

The ID is 1-based (i.e. the first sequence is 1) and is 0 for a record without a mapping position.

See also: BAM.rname

source
XAM.BAM.refnameFunction
refname(record::Record)::String

Get the reference sequence name of record.

See also: BAM.refid

source
XAM.BAM.reflenFunction
reflen(record::Record)::Int

Get the length of the reference sequence this record applies to.

source
XAM.BAM.positionFunction
position(record::Record)::Int

Get the 1-based leftmost mapping position of record.

source
Missing docstring.

Missing docstring for XAM.BAM.isnextmapped. Check Documenter's build log for details.

XAM.BAM.nextrefidFunction
nextrefid(record::Record)::Int

Get the next/mate reference sequence ID of record.

source
XAM.BAM.nextrefnameFunction
nextrefname(record::Record)::String

Get the reference name of the mate/next read of record.

source
XAM.BAM.nextpositionFunction
nextposition(record::Record)::Int

Get the 1-based leftmost mapping position of the next/mate read of record.

source
XAM.BAM.cigarFunction
cigar(record::Record)::String

Get the CIGAR string of record.

Note that in the BAM specification, the field called cigar typically stores the cigar string of the record. However, this is not always true, sometimes the true cigar is very long, and due to some constraints of the BAM format, the actual cigar string is stored in an extra tag: CG:B,I, and the cigar field stores a pseudo-cigar string.

Calling this method with checkCG set to true (default) this method will always yield the true cigar string, because this is probably what you want the vast majority of the time.

If you have a record that stores the true cigar in a CG:B,I tag, but you still want to access the pseudo-cigar that is stored in the cigar field of the BAM record, then you can set checkCG to false.

See also BAM.cigar_rle.

source
XAM.BAM.sequenceFunction
sequence(record::Record)::BioSequences.LongDNA{4}

Get the segment sequence of record.

source

Accessing auxiliary data

SAM and BAM records support the storing of optional data fields associated with tags.

Tagged auxiliary data follows a format of TAG:TYPE:VALUE. TAG is a two-letter string, and each tag can only appear once per record. TYPE is a single case-sensetive letter which defined the format of VALUE.

TypeDescription
'A'Printable character
'i'Signed integer
'f'Single-precision floating number
'Z'Printable string, including space
'H'Byte array in Hex format
'B'Integer of numeric array

For more information about these tags and their types we refer you to the SAM/BAM specification and the additional optional fields specification document.

There are some tags that are reserved, predefined standard tags, for specific uses.

To access optional fields stored in tags, you use getindex indexing syntax on the record object. Note that accessing optional tag fields will result in type instability in Julia. This is because the type of the optional data is not known until run-time, as the tag is being read. This can have a significant impact on performance. To limit this, if the user knows the type of a value in advance, specifying it as a type annotation will alleviate the problem:

Below is an example of looping over records in a bam file and using indexing syntax to get the data stored in the "NM" tag. Note the UInt8 type assertion to alleviate type instability.

for record in open(BAM.Reader, "data.bam")
    nm = record["NM"]::UInt8
    # do something
end

Getting records in a range

The XAM package supports the BAI index to fetch records in a specific range from a BAM file. Samtools provides index subcommand to create an index file (.bai) from a sorted BAM file.

$ samtools index -b SRR1238088.sort.bam
$ ls SRR1238088.sort.bam*
SRR1238088.sort.bam     SRR1238088.sort.bam.bai

The method eachoverlap(reader, chrom, range) returns an iterator of BAM records overlapping the query interval:

reader = open(BAM.Reader, "SRR1238088.sort.bam", index="SRR1238088.sort.bam.bai")
for record in eachoverlap(reader, "Chr2", 10000:11000)
    # `record` is a BAM.Record object
    # ...
end
close(reader)

Getting records overlapping genomic features

The eachoverlap method also accepts the Interval type defined in GenomicFeatures.jl.

This allows you to do things like first read in the genomic features from a GFF3 file, and then for each feature, iterate over all the BAM records that overlap with that feature.

using GenomicFeatures
using GFF3
using XAM

# Load genomic features from a GFF3 file.
features = open(collect, GFF3.Reader, "TAIR10_GFF3_genes.gff")

# Keep mRNA features.
filter!(x -> GFF3.featuretype(x) == "mRNA", features)

# Open a BAM file and iterate over records overlapping mRNA transcripts.
reader = open(BAM.Reader, "SRR1238088.sort.bam", index = "SRR1238088.sort.bam.bai")
for feature in features
    for record in eachoverlap(reader, feature)
        # `record` overlaps `feature`.
        # ...
    end
end
close(reader)

Writing files

In order to write a BAM or SAM file, you must first create a SAM.Header.

A SAM.Header is constructed from a vector of SAM.MetaInfo objects.

For example, to create the following simple header:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
julia> a = SAM.MetaInfo("HD", ["VN" => 1.6, "SO" => "coordinate"])
SAM.MetaInfo:
    tag: HD
  value: VN=1.6 SO=coordinate

julia> b = SAM.MetaInfo("SQ", ["SN" => "ref", "LN" => 45])
SAM.MetaInfo:
    tag: SQ
  value: SN=ref LN=45

julia> h = SAM.Header([a, b])
SAM.Header(SAM.MetaInfo[SAM.MetaInfo:
    tag: HD
  value: VN=1.6 SO=coordinate, SAM.MetaInfo:
    tag: SQ
  value: SN=ref LN=45])

Then to create the writer for a SAM file, construct a SAM.Writer using the header and an IO type:

julia> samw = SAM.Writer(open("my-data.sam", "w"), h)
SAM.Writer(IOStream(<file my-data.sam>))

To make a BAM Writer is slightly different, as you need to use a specific stream type from the https://github.com/BioJulia/BGZFStreams.jl package:

julia> using BGZFStreams

julia> bamw = BAM.Writer(BGZFStream(open("my-data.bam", "w"), "w"))
BAM.Writer(BGZFStreams.BGZFStream{IOStream}(<mode=write>))

Once you have a BAM or SAM writer, you can use the write method to write BAM.Records or SAM.Records to file:

julia> write(bamw, rec) # Here rec is a `BAM.Record`
330780