SAM and BAM

Introduction

The XAM package offers high-performance tools for the SAM and BAM file formats, two of the most widely used formats for storing sequence alignments.

If you have questions about the SAM and BAM formats or any of the terminology used when discussing these formats, see the published specification, which is maintained by the samtools group.

A very simple SAM file looks like the following:

@HD VN:1.6 SO:coordinate
@SQ SN:ref LN:45
r001   99 ref  7 30 8M2I4M1D3M = 37  39 TTAGATAAAGGATACTG *
r002    0 ref  9 30 3S6M1P1I4M *  0   0 AAAAGATAAGGATA    *
r003    0 ref  9 30 5S6M       *  0   0 GCCTAAGCTAA       * SA:Z:ref,29,-,6H5M,17,0;
r004    0 ref 16 30 6M14N5M    *  0   0 ATAGCTTCAGC       *
r003 2064 ref 29 17 6H5M       *  0   0 TAGGC             * SA:Z:ref,9,+,5S6M,30,1;
r001  147 ref 37 30 9M         =  7 -39 CAGCGGCAT         * NM:i:1

The first two lines are part of the "header", and the remaining lines are "records". Each record describes how a read aligns to a reference sequence. Often one record describes one read, but in other cases, such as chimeric reads and split alignments, multiple records can apply to one read. In the example above, r003 is a chimeric read (it appears in two records), r004 is a split alignment, and the two r001 records are a read pair. Again, we refer you to the official specification for more details.
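The second column of each record in the example above is the FLAG field, a bit field defined by the SAM specification. As a minimal sketch in plain Julia (the `FLAG_BITS` table and the `decode_flag` helper are illustrative names, not part of XAM), the values 99, 147, and 2064 can be decoded like this:

```julia
# Names for the FLAG bits defined in the SAM specification.
# FLAG_BITS and decode_flag are hypothetical helpers, not XAM API.
const FLAG_BITS = [
    0x001 => "paired",
    0x002 => "proper pair",
    0x004 => "unmapped",
    0x008 => "mate unmapped",
    0x010 => "reverse strand",
    0x020 => "mate reverse strand",
    0x040 => "first in pair",
    0x080 => "last in pair",
    0x100 => "secondary",
    0x200 => "failed QC",
    0x400 => "duplicate",
    0x800 => "supplementary",
]

# Return the names of all bits set in `flag`, lowest bit first.
decode_flag(flag::Integer) = [name for (bit, name) in FLAG_BITS if (flag & bit) != 0]

# Decode the non-zero flags that occur in the example file.
for flag in (99, 147, 2064)
    println(flag, " => ", join(decode_flag(flag), ", "))
end
```

Here 99 decodes to paired, proper pair, mate reverse strand, first in pair; 147 to paired, proper pair, reverse strand, last in pair; and 2064 to reverse strand, supplementary. This matches the two r001 records being a read pair and the second r003 record being a supplementary alignment.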

A BAM file stores the same information in a compressed binary format, which does not lend itself to being printed here.

Reading SAM and BAM files

A typical script that iterates over all records in a file looks like this:

using XAM

# Open a BAM file.
reader = open(BAM.Reader, "data.bam")

# Iterate over BAM records.
for record in reader
    # `record` is a BAM.Record object.
    if BAM.ismapped(record)
        # Print the mapped position.
        println(BAM.refname(record), ':', BAM.position(record))
    end
end

# Close the BAM file.
close(reader)

BAM files are often very large. The iterator interface demonstrated above allocates a new object for each record, which can become a bottleneck when reading data from a BAM file. In-place reading reuses a single pre-allocated record for every read, so far less memory is allocated:

using XAM

reader = open(BAM.Reader, "data.bam")
# Allocate one record up front and reuse it for every read.
record = BAM.Record()
while !eof(reader)
    empty!(record)
    read!(reader, record)
    # do something with `record`
end
close(reader)

SAM and BAM Headers

Both SAM.Reader and BAM.Reader implement the header function, which returns a SAM.Header object. To extract specific information from a header, use the findall method on the header with the SAM/BAM tag of interest. Again, we refer you to the specification for full details of the tags that can occur in headers and what they mean.

Below is an example of extracting all the information about the reference sequences from the header of a SAM file. In SAM/BAM, each reference sequence is described in the header under the tag SQ (think reference SeQuence!).

julia> reader = open(SAM.Reader, "data.sam");

julia> findall(SAM.header(reader), "SQ")
7-element Array{Bio.Align.SAM.MetaInfo,1}:
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr1 LN=30427671
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr2 LN=19698289
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr3 LN=23459830
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr4 LN=18585056
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=Chr5 LN=26975502
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=chloroplast LN=154478
 Bio.Align.SAM.MetaInfo:
    tag: SQ
  value: SN=mitochondria LN=366924

The output above shows 7 sequences in the reference: 5 chromosomes, one chloroplast sequence, and one mitochondrial sequence.
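To make concrete what an SQ entry holds, the following is a minimal sketch in plain Julia that parses raw `@SQ` header lines into a table of reference lengths. It works on header text directly rather than on XAM's types, and `reference_lengths` is a hypothetical helper name, not part of XAM:

```julia
# Build a name => length table from the `@SQ` lines of a SAM header.
# `reference_lengths` is a hypothetical helper, not XAM API.
# Header fields are tab-separated and written as KEY:VALUE.
function reference_lengths(header_text::AbstractString)
    lengths = Dict{String,Int}()
    for line in split(header_text, '\n')
        startswith(line, "@SQ") || continue
        name = nothing
        len = nothing
        for field in split(line, '\t')[2:end]
            occursin(':', field) || continue
            key, value = split(field, ':'; limit=2)
            key == "SN" && (name = String(value))     # reference sequence name
            key == "LN" && (len = parse(Int, value))  # reference sequence length
        end
        if name !== nothing && len !== nothing
            lengths[name] = len
        end
    end
    return lengths
end

# The header of the small example file shown in the introduction.
header_text = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:ref\tLN:45\n"
lengths = reference_lengths(header_text)
println(lengths["ref"])
```

A lookup table like this is convenient for checks such as whether a mapped position falls within the bounds of its reference.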

SAM and BAM Records

SAM.Record

The XAM package supports the following accessors for SAM.Record types.

XAM.ismapped — Function

ismapped(record::XAMRecord)::Bool

Query whether the segment is mapped.

Type	Description
'A'	Printable character
'i'	Signed integer
'f'	Single-precision floating number
'Z'	Printable string, including space
'H'	Byte array in Hex format
'B'	Integer or numeric array
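These type codes apply to the optional fields of a record, which the SAM specification writes as TAG:TYPE:VALUE (for example `NM:i:1` in the sample file above). As a rough sketch of how each code maps to a Julia value, the following plain-Julia parser handles one such field. `parse_aux` and `ARRAY_ELTYPES` are hypothetical names, not part of XAM, and the element-type letters for 'B' arrays follow the SAM specification:

```julia
# Element types for 'B' arrays, keyed by the subtype letter that
# precedes the comma-separated values (per the SAM specification).
const ARRAY_ELTYPES = Dict('c' => Int8, 'C' => UInt8, 's' => Int16, 'S' => UInt16,
                           'i' => Int32, 'I' => UInt32, 'f' => Float32)

# Parse one optional field written as TAG:TYPE:VALUE.
# Returns `tag => value`, with the value converted according to TYPE.
# `parse_aux` is a hypothetical helper, not XAM API.
function parse_aux(field::AbstractString)
    tag, typecode, value = split(field, ':'; limit=3)
    parsed = if typecode == "A"
        only(value)                    # single printable character
    elseif typecode == "i"
        parse(Int, value)              # signed integer
    elseif typecode == "f"
        parse(Float32, value)          # single-precision float
    elseif typecode == "Z"
        String(value)                  # printable string, may contain spaces
    elseif typecode == "H"
        hex2bytes(String(value))       # byte array written in hex
    elseif typecode == "B"
        parts = split(value, ',')
        T = ARRAY_ELTYPES[only(parts[1])]
        T[parse(T, p) for p in parts[2:end]]
    else
        throw(ArgumentError("unknown aux type: $typecode"))
    end
    return String(tag) => parsed
end

# Fields taken from the sample SAM file in the introduction.
println(parse_aux("NM:i:1"))
println(parse_aux("SA:Z:ref,29,-,6H5M,17,0;"))
```

Splitting with `limit=3` keeps any further colons inside the value, which matters for 'Z' strings.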