GFF3

Description

GFF3 is a text-based file format for representing genomic annotations. The major difference from BED is that GFF3 is more structured and can include sequences in the FASTA file format.

I/O tools for GFF3 are provided by the GenomicFeatures.GFF3 module, which exports the following three types:

  • Reader type: GFF3.Reader
  • Writer type: GFF3.Writer
  • Element type: GFF3.Record

A GFF3 file may contain directives and/or comments in addition to genomic features. These lines are skipped by default, but you can change this behavior by passing keyword arguments (such as skip_directives or skip_comments) to GFF3.Reader. See the docstring for details.

Examples

Here is a common workflow to iterate over all records in a GFF3 file:

# Import the GFF3 module.
using GenomicFeatures

# Open a GFF3 file.
reader = open(GFF3.Reader, "data.gff3")

# Iterate over records.
for record in reader
    # Do something with record (see Accessors section).
    seqid = GFF3.seqid(record)
    # ...
end

# Finally, close the reader.
close(reader)

If you are interested in directives (lines that start with '##') in addition to genomic features, you need to pass skip_directives=false when constructing a GFF3.Reader:

# Set skip_directives to false (this is set to true by default).
reader = GFF3.Reader(open("data.gff3"), skip_directives=false)
for record in reader
    # Branch by record type.
    if GFF3.isfeature(record)
        # ...
    elseif GFF3.isdirective(record)
        # ...
    else
        # This never happens.
        @assert false
    end
end
close(reader)

GenomicFeatures.jl supports tabix for retrieving records that overlap a specific interval. First, create a block-compressed file from a GFF3 file using bgzip, and then index it using tabix.

cat data.gff3 | grep -v "^#" | sort -k1,1 -k4,4n | bgzip >data.gff3.bgz
tabix data.gff3.bgz  # this creates data.gff3.bgz.tbi

Then you can read the block-compressed file as follows:

# Read the block-compressed (BGZF) file; the tabix index is located automatically.
reader = GFF3.Reader("data.gff3.bgz")
for record in eachoverlap(reader, Interval("chr1", 250_000, 300_000))
    # Each record overlaps the query interval.
    # ...
end
close(reader)

API

# GenomicFeatures.GFF3.Reader - Type.

GFF3.Reader(input::IO;
            index=nothing,
            save_directives::Bool=false,
            skip_features::Bool=false,
            skip_directives::Bool=true,
            skip_comments::Bool=true)

GFF3.Reader(input::AbstractString;
            index=:auto,
            save_directives::Bool=false,
            skip_features::Bool=false,
            skip_directives::Bool=true,
            skip_comments::Bool=true)

Create a reader for data in GFF3 format.

The first argument specifies the data source. When it is a filepath that ends with .bgz, it is treated as a block-compressed (BGZF) file, and the function tries to find a tabix index file (.tbi) and reads it if one exists. See http://www.htslib.org/doc/tabix.html for the bgzip and tabix tools.

Arguments

  • input: data source (IO object or filepath)
  • index: path to a tabix file
  • save_directives: flag to save directive records (which can be accessed with GFF3.directives)
  • skip_features: flag to skip feature records
  • skip_directives: flag to skip directive records
  • skip_comments: flag to skip comment records

source
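
The keyword arguments above can be combined. The following is a minimal sketch, assuming the GenomicFeatures version documented here; the file content is made up for illustration. It skips feature records and keeps directive records, so the loop only sees directives:

```julia
using GenomicFeatures

# Write a small GFF3 file to a temporary directory (hypothetical content).
path = joinpath(mktempdir(), "example.gff3")
open(path, "w") do io
    println(io, "##gff-version 3")
    println(io, "##sequence-region chr1 1 5000")
    println(io, "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1")
end

# Skip features and keep directives, so only directive records are yielded.
reader = GFF3.Reader(open(path), skip_features=true, skip_directives=false)
directive_lines = String[]
for record in reader
    # GFF3.content strips the leading '#' characters.
    push!(directive_lines, GFF3.content(record))
end
close(reader)
```

After the loop, directive_lines holds one string per directive line in the file.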

# GenomicFeatures.GFF3.directives - Function.

Return, as an array of strings, all directives that preceded the last GFF entry parsed.

Directives at the end of the file can be accessed by calling close(reader) and then directives(reader).

source
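
As a rough sketch of the behavior described above: the example below assumes that save_directives=true is what makes GFF3.directives return saved directives, and it uses a made-up input file. It collects the directives seen before each feature, then the trailing ones after closing the reader:

```julia
using GenomicFeatures

# Hypothetical input: one directive before the feature and one after it.
path = joinpath(mktempdir(), "example.gff3")
open(path, "w") do io
    println(io, "##gff-version 3")
    println(io, "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1")
    println(io, "###")
end

reader = GFF3.Reader(open(path), save_directives=true)
preceding = []
for record in reader
    # Directives that preceded the feature record just parsed.
    push!(preceding, GFF3.directives(reader))
end

# Per the docstring, directives at the end of the file are available after close.
close(reader)
trailing = GFF3.directives(reader)
```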

# GenomicFeatures.GFF3.hasfasta - Function.

Return true if the GFF3 stream is at its end and there is trailing FASTA data.

source

# GenomicFeatures.GFF3.getfasta - Function.

Return a BioSequences.FASTA.Reader initialized to parse trailing FASTA data.

Throws an exception if there is no trailing FASTA, which can be checked using hasfasta.

source
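
hasfasta and getfasta are meant to be used together once all features have been read. The sketch below uses a made-up file with a trailing ##FASTA section. It assumes that BioSequences is installed and that FASTA.identifier is available in that version, and it closes only the GFF3 reader, on the assumption that the FASTA reader shares its underlying stream:

```julia
using GenomicFeatures
import BioSequences: FASTA

# Hypothetical GFF3 file with one feature and a trailing FASTA section.
path = joinpath(mktempdir(), "with_fasta.gff3")
open(path, "w") do io
    println(io, "##gff-version 3")
    println(io, "chr1\t.\tgene\t1\t8\t.\t+\t.\tID=gene1")
    println(io, "##FASTA")
    println(io, ">chr1")
    println(io, "ACGTACGT")
end

reader = GFF3.Reader(open(path))

# hasfasta is defined at the end of the stream, so consume all features first.
feature_ids = String[]
for record in reader
    push!(feature_ids, GFF3.seqid(record))
end

# Read the identifiers of the trailing FASTA records, if any.
fasta_ids = String[]
if GFF3.hasfasta(reader)
    for seqrecord in GFF3.getfasta(reader)
        push!(fasta_ids, FASTA.identifier(seqrecord))
    end
end
close(reader)
```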

# GenomicFeatures.GFF3.Writer - Type.

GFF3.Writer(output::IO)

Create a data writer of the GFF3 file format.

Arguments

  • output: data sink

source
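
A short sketch of writing a record follows. It assumes that Base.write is defined for a GFF3.Writer and a GFF3.Record, as is conventional for BioJulia writers but not stated on this page; the record content and output path are made up:

```julia
using GenomicFeatures

# Build a feature record from a tab-separated GFF3 line (hypothetical data).
record = GFF3.Record("chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1")

# Wrap an output stream in a writer and write the record to it.
path = joinpath(mktempdir(), "out.gff3")
writer = GFF3.Writer(open(path, "w"))
write(writer, record)  # assumed method: write(::GFF3.Writer, ::GFF3.Record)
close(writer)          # closing the writer also closes the underlying stream

# Read the file back to check what was written.
lines = readlines(path)
```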

# GenomicFeatures.GFF3.Record - Type.

GFF3.Record()

Create an unfilled GFF3 record.

source

GFF3.Record(data::Vector{UInt8})

Create a GFF3 record object from data. This function verifies and indexes fields for accessors. Note that the ownership of data is transferred to a new record object.

source

GFF3.Record(str::AbstractString)

Create a GFF3 record object from str. This function verifies and indexes fields for accessors.

source
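
For illustration, a record can be built directly from one GFF3 line and queried with the accessors documented below. The line itself is made up, and the sketch assumes the GenomicFeatures version documented here:

```julia
using GenomicFeatures

# The nine GFF3 columns are separated by tab characters.
record = GFF3.Record("chr1\texample\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=EDEN")

# Columns 1, 4 and 5: sequence ID and start/end coordinates.
location = (GFF3.seqid(record), GFF3.seqstart(record), GFF3.seqend(record))

# Columns 2 and 3: source and feature type.
origin = (GFF3.source(record), GFF3.featuretype(record))
```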

# GenomicFeatures.GFF3.isfeature - Function.

isfeature(record::Record)::Bool

Test if record is a feature record.

source

# GenomicFeatures.GFF3.isdirective - Function.

isdirective(record::Record)::Bool

Test if record is a directive record.

source

# GenomicFeatures.GFF3.iscomment - Function.

iscomment(record::Record)::Bool

Test if record is a comment record.

source
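
The three predicates can be used to classify any record. In this sketch, recordkind is a hypothetical helper and the input lines are made up; it also assumes that GFF3.Record(str) accepts directive and comment lines as well as feature lines:

```julia
using GenomicFeatures

# Hypothetical helper that names the kind of a record.
function recordkind(record)
    if GFF3.isfeature(record)
        return "feature"
    elseif GFF3.isdirective(record)
        return "directive"
    elseif GFF3.iscomment(record)
        return "comment"
    else
        return "unknown"
    end
end

lines = [
    "##gff-version 3",                               # directive: starts with "##"
    "#just a comment",                               # comment: starts with "#"
    "chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1",  # feature: nine columns
]
kinds = [recordkind(GFF3.Record(line)) for line in lines]
```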

# GenomicFeatures.GFF3.seqid - Function.

seqid(record::Record)::String

Get the sequence ID of record.

source

# GenomicFeatures.GFF3.source - Function.

source(record::Record)::String

Get the source of record.

source

# GenomicFeatures.GFF3.featuretype - Function.

featuretype(record::Record)::String

Get the type of record.

source

# GenomicFeatures.GFF3.seqstart - Function.

seqstart(record::Record)::Int

Get the start coordinate of record.

source

# GenomicFeatures.GFF3.seqend - Function.

seqend(record::Record)::Int

Get the end coordinate of record.

source

# GenomicFeatures.GFF3.score - Function.

score(record::Record)::Float64

Get the score of record.

source

# GenomicFeatures.GFF3.strand - Function.

strand(record::Record)::GenomicFeatures.Strand

Get the strand of record.

source

# GenomicFeatures.GFF3.phase - Function.

phase(record::Record)::Int

Get the phase of record.

source
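
The score, strand and phase accessors correspond to columns 6 to 8 of a feature line. A small sketch on a made-up CDS record follows; it assumes that STRAND_NEG is exported by GenomicFeatures as the negative-strand value of GenomicFeatures.Strand:

```julia
using GenomicFeatures

# A CDS feature with all of score, strand and phase filled in (hypothetical data).
record = GFF3.Record("chr1\texample\tCDS\t1201\t1500\t37.5\t-\t0\tID=cds1;Parent=mRNA1")

cds_score = GFF3.score(record)    # column 6, parsed as Float64
cds_strand = GFF3.strand(record)  # column 7, a GenomicFeatures.Strand value
cds_phase = GFF3.phase(record)    # column 8, parsed as Int

# Compare against the exported strand constant (assumed name: STRAND_NEG).
on_negative_strand = cds_strand == STRAND_NEG
```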

# GenomicFeatures.GFF3.attributes - Function.

attributes(record::Record)::Vector{Pair{String,Vector{String}}}

Get the attributes of record.

source

attributes(record::Record, key::String)::Vector{String}

Get the values of the attribute of record named key.

source
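
Both forms of attributes can be sketched on a made-up record. The Vector{String} value type suggests that a comma-separated attribute value is split into several strings; that reading is an assumption here, so the result for Alias is not spelled out:

```julia
using GenomicFeatures

record = GFF3.Record("chr1\t.\tgene\t1000\t2000\t.\t+\t.\tID=gene1;Name=EDEN;Alias=a1,a2")

# All attributes as key => values pairs, in file order.
pairs_in_order = GFF3.attributes(record)

# Convert to a Dict for lookup by key.
attrs = Dict(pairs_in_order)
name_values = attrs["Name"]

# Look up a single key directly.
ids = GFF3.attributes(record, "ID")
aliases = GFF3.attributes(record, "Alias")
```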

# GenomicFeatures.GFF3.content - Function.

content(record::Record)::String

Get the content of record. Leading '#' characters are removed.

source
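
As a brief illustration on made-up lines, content returns the text of a directive or comment record without its leading '#' characters. The sketch assumes that GFF3.Record(str) accepts such lines:

```julia
using GenomicFeatures

directive = GFF3.Record("##sequence-region chr1 1 5000")
comment = GFF3.Record("#curated by hand")

# The leading "##" and "#" are stripped from the returned strings.
directive_text = GFF3.content(directive)
comment_text = GFF3.content(comment)
```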