Alignment representation

Overview

Types related to alignment representation introduced in this chapter are indispensable concepts to use this package. Specifically, Alignment, AlignmentAnchor and Operation are the most fundamental types of this package to represent an alignment of two sequences.

Representing alignments

The Alignment type can represent a wide variety of global or local sequence alignments while facilitating efficient transformation of sequences coordinates. Alignments are always relative to a possibly unspecified reference sequence and represent a series of edit operations performed on that reference to transform it to the query sequence. An edit operation is, for example, matching, insertion, or deletion. All operations defined in BioAlignments.jl are described in the Alignment operations section.

An alignment is defined as a series of AlignmentAnchor objects. An "anchor" specifies an edit operation (op) together with its end-point coordinates in the query (seqpos), reference (refpos) and alignment (alnpos) sequences:

struct AlignmentAnchor
    seqpos::Int
    refpos::Int
    alnpos::Int
    op::Operation
end

The start of the sequence range to which the given operation is applied is determined by the sequence coordinates of the previous anchor in the series.

Every alignment starts with a special OP_START operation which is used to give the position in the reference and query prior to the start of the alignment, or 0, if the alignment starts at position 1.

For example, consider the following alignment and its representation as a series of anchors:

Alignment representation

An Alignment object can be created from a series of anchors:

julia> Alignment([
           AlignmentAnchor(0,  4, 0, OP_START),
           AlignmentAnchor(4,  8, 4, OP_MATCH),
           AlignmentAnchor(4, 12, 8, OP_DELETE)
       ])
Alignment:
  aligned range:
    seq: 0-4
    ref: 4-12
  CIGAR string: 4M4D

Alignment operations

Alignment operations follow closely from those used in the SAM/BAM format and are stored in the Operation bitstype.

OperationOperation TypeDescription
OP_MATCHmatchnon-specific match
OP_INSERTinsertinsertion into reference sequence
OP_DELETEdeletedeletion from reference sequence
OP_SKIPdelete(typically long) deletion from the reference, e.g. due to RNA splicing
OP_SOFT_CLIPinsertsequence removed from the beginning or end of the query sequence but stored
OP_HARD_CLIPinsertsequence removed from the beginning or end of the query sequence and not stored
OP_PADspecialsilent deletion from padded reference (not present in query or reference)
OP_SEQ_MATCHmatchmatch operation with matching sequence positions
OP_SEQ_MISMATCHmatchmatch operation with mismatching sequence positions
OP_BACKspecialnot currently supported, but present for SAM/BAM compatibility
OP_STARTspecialindicate the start of an alignment within the reference and query sequence

Each operation has its own one-letter representation, which is the same as those defined in the SAM file format.

julia> convert(Operation, 'M')  # Char => Operation
OP_MATCH

julia> convert(Char, OP_MATCH)  # Operation => Char
'M': ASCII/Unicode U+004D (category Lu: Letter, uppercase)

julia> ismatchop(OP_MATCH)
true

See the Operations section in the references for more details.

Aligned sequences

A sequence aligned to another sequence is represented by the AlignedSequence type, which is a pair of the aligned sequence and an Alignment object.

The following example creates an aligned sequence object from a sequence and an alignment:

julia> AlignedSequence(  # pass an Alignment object
           dna"ACGTAT",
           Alignment([
               AlignmentAnchor(0, 0, 0, OP_START),
               AlignmentAnchor(3, 3, 3, OP_MATCH),
               AlignmentAnchor(6, 3, 6, OP_INSERT)
           ])
       )
···---
ACGTAT

julia> AlignedSequence(  # or pass a vector of anchors
           dna"ACGTAT",
           [
               AlignmentAnchor(0, 0, 0, OP_START),
               AlignmentAnchor(3, 3, 3, OP_MATCH),
               AlignmentAnchor(6, 3, 6, OP_INSERT)
           ]
       )
···---
ACGTAT

If you already have an aligned sequence with gap symbols, it can be converted to an AlignedSequence object by passing a reference sequence with it:

julia> seq = dna"ACGT--AAT--"
11nt DNA Sequence:
ACGT--AAT--

julia> ref = dna"ACGTTTAT-GG"
11nt DNA Sequence:
ACGTTTAT-GG

julia> AlignedSequence(seq, ref)
········-··
ACGT--AAT--

You can get the underlying alignment and sequence back out of an AlignedSequence by using the alignment and sequence functions.

julia> aln = AlignedSequence(dna"ACGT--AAT--", dna"ACGTTTAT-GG")
········-··
ACGT--AAT--

julia> alignment(aln)
Alignment:
  aligned range:
    seq: 0-7
    ref: 0-10
  CIGAR string: 4=2D1=1X1I2D

julia> sequence(aln)
7nt DNA Sequence:
ACGTAAT