Bio.Var: Biological Variation.

Identifying and counting sequence sites

You can identify and count the number of various types of site in a nucleotide sequence, or pair/set of aligned nucleotide sequences.

Different types of site

The abstract Site types

Site
Mutation

The concrete types of site case

The types of mutations that can currently be counted are summarized below:

Gap
Ambiguous
Certain
Match
Mismatch
Conserved
Mutated
Transition
Transversion

The count_sites method

count_sites
count_sites(Match, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Mismatch, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Conserved, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Mutated, [dna"ATCGATCG", dna"ACCGATCG"])

count_sites(Transition, [rna"AUCGAUCG", rna"ACCGAUCG"])

count_sites(Transversion, [rna"AUCGAUCG", rna"ACCGAUCG"])

File formats for representing genetic variation

This module supports some common file formats to read and write genetic variations.

VCF and BCF file formats

VCF and BCF file formats can be read using VCFReader and BCFReader, respectively:

reader = open(VCFReader, "example.vcf")
for record in reader
    # do something
end
close(reader)

A reader first reads the header section of a file and creates a VCFHeader object. The header function is provided to access the header object of a reader:

julia> header(reader)
Bio.Var.VCFHeader:
  metainfo tags: fileformat fileDate source reference contig phasing INFO FILTER FORMAT
     sample IDs: NA00001 NA00002 NA00003

julia> find(header(reader), "FORMAT")
4-element Array{Bio.Var.VCFMetaInfo,1}:
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="GT" Number="1" Type="String" Description="Genotype"
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="GQ" Number="1" Type="Integer" Description="Genotype Quality"
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="DP" Number="1" Type="Integer" Description="Read Depth"
 Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="HQ" Number="2" Type="Integer" Description="Haplotype Quality"

VCFMetaInfo objects in the header support the following accessor functions:

Accessor Description
metainfotag tag string
metainfoval value string
keys keys of fields between '<' and '>'
values values of fields between '<' and '>'
[<key>] value of a field with key
julia> metainfo = VCFMetaInfo("##FORMAT=<ID=GT,Number=1,Type=String,Description=\"Genotype\">")
Bio.Var.VCFMetaInfo:
    tag: FORMAT
  value: ID="GT" Number="1" Type="String" Description="Genotype"

julia> metainfotag(metainfo)
"FORMAT"

julia> metainfoval(metainfo)
"<ID=GT,Number=1,Type=String,Description=\"Genotype\">"

julia> keys(metainfo)
4-element Array{String,1}:
 "ID"
 "Number"
 "Type"
 "Description"

julia> metainfo["ID"]
"GT"

VCFRecord and BCFRecord objects support the following accessor functions (see the docstring of each accessor for the details):

Accessor Description
chromosome chromosome name
leftposition reference position
identifier unique identifiers
reference reference bases
alternate alternate bases
quality Phred-scaled quality score
filter_ filter status
information additional information
infokeys keys of additional information
format genotype format
genotype genotype information
julia> record = VCFRecord("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t0|0:48:1:51,51\t1|0:48:8:51,51")

Bio.Var.VCFRecord:
   chromosome: 20
     position: 14370
   identifier: rs6054257
    reference: G
    alternate: A
      quality: 29.0
       filter: PASS
  information: NS=3 DP=14 AF=0.5 DB H2
       format: GT GQ DP HQ
     genotype: [1] 0|0 48 1 51,51 [2] 1|0 48 8 51,51

julia> chromosome(record)
Nullable{String}("20")

julia> leftposition(record)
Nullable{Int64}(14370)

julia> identifier(record)
1-element Array{String,1}:
 "rs6054257"

julia> reference(record)
Nullable{String}("G")

julia> alternate(record)
1-element Array{String,1}:
 "A"

julia> quality(record)
Nullable{Float64}(29.0)

julia> filter_(record)
1-element Array{String,1}:
 "PASS"

julia> information(record)
5-element Array{Pair{String,String},1}:
 "NS"=>"3"
 "DP"=>"14"
 "AF"=>"0.5"
 "DB"=>""
 "H2"=>""

julia> information(record, "AF")
"0.5"

julia> format(record)
4-element Array{String,1}:
 "GT"
 "GQ"
 "DP"
 "HQ"

julia> genotype(record)
2-element Array{Array{String,1},1}:
 String["0|0","48","1","51,51"]
 String["1|0","48","8","51,51"]

julia> genotype(record, 1)
4-element Array{String,1}:
 "0|0"
 "48"
 "1"
 "51,51"

julia> genotype(record, 1:2, "GT")
2-element Array{String,1}:
 "0|0"
 "1|0"

MASH Distances

MASH distances, based on MinHash sketches of genome sequences can provide rapid genome-scale sequence comparisons when sequence distance (not specific mutations) are all that's required.

A MinHash sketch is made by taking the s smallest hash values for kmers of length k for a given sequence. The genome distance for two genomes is then essentially the Jaccard index of the minhashes, with some additional modification to account for the size of the kmers used.

You can generate a MinHash sketch using the minhash() function in Bio.seq.

using Bio.Seq

seq1 = dna"AAATAAGGCACAACTATGCAT"
sketch1 = minhash(seq, 5, 10)

Then, if you have MinHash sketches with the same parameters for two sequences, you can determine the MASH distance between them.

seq2 = dna"AATTAACGCACGGACTGCGGTAAT"
sketch2 = minhash(seq, 5, 10)

using Bio.Var

mashdistance(sketch1, sketch2)

For more information on what size kmers and what size sketches are appropriate for your use-case, see Odnov et. al. in Genome Biology.