Bio.Var: Biological Variation.

Identifying and counting sequence sites

You can identify and count the number of various types of site in a nucleotide sequence, or pair/set of aligned nucleotide sequences.

Different types of site

The abstract Site types

Site
Mutation

The concrete types of site case

The types of mutations that can currently be counted are summarized below:

Gap
Ambiguous
Certain
Match
Mismatch
Conserved
Mutated
Transition
Transversion

The `count_sites` method

count_sites

count_sites(Match, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Mismatch, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Conserved, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Mutated, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Transition, [rna"AUCGAUCG", rna"ACCGAUCG"])

count_sites(Transversion, [rna"AUCGAUCG", rna"ACCGAUCG"])

File formats for representing genetic variation

This module supports some common file formats to read and write genetic variations.

VCF and BCF file formats

VCF and BCF file formats can be read using VCFReader and BCFReader, respectively:

reader = open(VCFReader, "example.vcf")
for record in reader
    # do something
end
close(reader)

A reader first reads the header section of a file and creates a VCFHeader object. The header function is provided to access the header object of a reader:

julia> header(reader)
Bio.Var.VCFHeader:
  metainfo tags: fileformat fileDate source reference contig phasing INFO FILTER FORMAT
     sample IDs: NA00001 NA00002 NA00003

julia> find(header(reader), "FORMAT")
4-element Array{Bio.Var.VCFMetaInfo,1}:
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="GT" Number="1" Type="String" Description="Genotype"
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="GQ" Number="1" Type="Integer" Description="Genotype Quality"
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="DP" Number="1" Type="Integer" Description="Read Depth"
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="HQ" Number="2" Type="Integer" Description="Haplotype Quality"

VCFMetaInfo objects in the header support the following accessor functions:

Accessor	Description
`metainfotag`	tag string
`metainfoval`	value string
`keys`	keys of fields between '<' and '>'
`values`	values of fields between '<' and '>'
`[<key>]`	value of a field with `key`

julia> metainfo = VCFMetaInfo("##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">")
Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="GT" Number="1" Type="String" Description="Genotype"

julia> metainfotag(metainfo)
"FORMAT"

julia> metainfoval(metainfo)
"<ID=GT,Number=1,Type=String,Description=\"Genotype\">"

julia> keys(metainfo)
4-element Array{String,1}:
 "ID"
 "Number"
 "Type"
 "Description"

julia> metainfo["ID"]
"GT"

VCFRecord and BCFRecord objects support the following accessor functions (see the docstring of each accessor for the details):

Accessor	Description
`chromosome`	chromosome name
`leftposition`	reference position
`identifier`	unique identifiers
`reference`	reference bases
`alternate`	alternate bases
`quality`	Phred-scaled quality score
`filter_`	filter status
`information`	additional information
`infokeys`	keys of additional information
`format`	genotype format
`genotype`	genotype information

julia> record = VCFRecord("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51")

Bio.Var.VCFRecord:
   chromosome: 20
     position: 14370
   identifier: rs6054257
    reference: G
    alternate: A
      quality: 29.0
       filter: PASS
  information: NS=3 DP=14 AF=0.5 DB H2
       format: GT GQ DP HQ
     genotype: [1] 0|0 48 1 51,51 [2] 1|0 48 8 51,51

julia> chromosome(record)
Nullable{String}("20")

julia> leftposition(record)
Nullable{Int64}(14370)

julia> identifier(record)
1-element Array{String,1}:
 "rs6054257"

julia> reference(record)
Nullable{String}("G")

julia> alternate(record)
1-element Array{String,1}:
 "A"

julia> quality(record)
Nullable{Float64}(29.0)

julia> filter_(record)
1-element Array{String,1}:
 "PASS"

julia> information(record)
5-element Array{Pair{String,String},1}:
 "NS"=>"3"
 "DP"=>"14"
 "AF"=>"0.5"
 "DB"=>""
 "H2"=>""

julia> information(record, "AF")
"0.5"

julia> format(record)
4-element Array{String,1}:
 "GT"
 "GQ"
 "DP"
 "HQ"

julia> genotype(record)
2-element Array{Array{String,1},1}:
 String["0|0","48","1","51,51"]
 String["1|0","48","8","51,51"]

julia> genotype(record, 1)
4-element Array{String,1}:
 "0|0"
 "48"
 "1"
 "51,51"

julia> genotype(record, 1:2, "GT")
2-element Array{String,1}:
 "0|0"
 "1|0"

MASH Distances

MASH distances, based on MinHash sketches of genome sequences can provide rapid genome-scale sequence comparisons when sequence distance (not specific mutations) are all that's required.

A MinHash sketch is made by taking the s smallest hash values for kmers of length k for a given sequence. The genome distance for two genomes is then essentially the Jaccard index of the minhashes, with some additional modification to account for the size of the kmers used.

You can generate a MinHash sketch using the minhash() function in Bio.seq.

using Bio.Seq

seq1 = dna"AAATAAGGCACAACTATGCAT"
sketch1 = minhash(seq, 5, 10)

Then, if you have MinHash sketches with the same parameters for two sequences, you can determine the MASH distance between them.

seq2 = dna"AATTAACGCACGGACTGCGGTAAT"
sketch2 = minhash(seq, 5, 10)

using Bio.Var

mashdistance(sketch1, sketch2)

For more information on what size kmers and what size sketches are appropriate for your use-case, see Odnov et. al. in Genome Biology.