FASTQ formatted files

NB: First read the overview in the sidebar

FASTQ is a text-based file format for representing DNA sequences along with a quality for each base. A FASTQ file stores a list of sequence records, each of which follows this template:

@{description}
{sequence}
+{description}?
{qualities}

Here, the "identifier" is the first part of the description, up to the first whitespace (or the entire description if it contains no whitespace).

The description may optionally be present on the third line, and if so, must be identical to the description on the first line.

Here is an example of one record from a FASTQ file:

@FSRRS4401BE7HA
tcagTTAAGATGGGAT
+
###EEEEEEEEE##E#

Where:

  • identifier is "FSRRS4401BE7HA"
  • description is also "FSRRS4401BE7HA"
  • sequence is "tcagTTAAGATGGGAT"
  • quality is "###EEEEEEEEE##E#"
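The template and the identifier rule can be sketched in plain Julia without FASTX. This is a minimal illustration, not FASTX's parser: `parse_fastq_record` is a hypothetical helper, it assumes the record is exactly four lines, and it assumes the standard FASTQ requirement of one quality character per base.

```julia
# Hypothetical helper for illustration only; FASTX's own parser is more thorough.
function parse_fastq_record(text::AbstractString)
    lines = split(text, '\n')
    length(lines) == 4 || error("expected exactly 4 lines, got $(length(lines))")
    header, sequence, plus, quality = lines
    startswith(header, '@') || error("first line must start with '@'")
    startswith(plus, '+') || error("third line must start with '+'")

    description = header[2:end]
    # Identifier: the description up to the first whitespace, or all of it.
    i = findfirst(isspace, description)
    identifier = i === nothing ? description : description[1:prevind(description, i)]

    # The third line is either a bare '+' or '+' followed by the same description.
    second = plus[2:end]
    isempty(second) || second == description ||
        error("description on the third line differs from the first line")

    # Assumption: one quality character per base.
    length(sequence) == length(quality) ||
        error("sequence and quality lengths differ")

    return (; identifier, description, sequence, quality)
end

rec = parse_fastq_record("@FSRRS4401BE7HA\ntcagTTAAGATGGGAT\n+\n###EEEEEEEEE##E#")
rec2 = parse_fastq_record("@ill r1\nGGC\n+ill r1\njjk")
```

In the first record the identifier and description coincide because the header contains no whitespace. In the second they differ ("ill" versus "ill r1").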

The FASTQRecord

FASTQRecords optionally have the description repeated on the third line. This can be toggled with quality_header!(::Record, ::Bool):

julia> record = parse(FASTQRecord, "@ILL01\nCCCGC\n+\nKM[^d");

julia> print(record)
@ILL01
CCCGC
+
KM[^d

julia> quality_header!(record, true); print(record)
@ILL01
CCCGC
+ILL01
KM[^d

FASTX.FASTQ.Record - Type
FASTQ.Record

Mutable struct representing a FASTQ record as parsed from a FASTQ file. The content of the record can be queried with the following functions: identifier, description, sequence, quality. FASTQ records are un-typed, i.e. they are agnostic to what kind of data they contain.

See also: FASTQ.Reader, FASTQ.Writer

Examples

julia> rec = parse(FASTQRecord, "@ill r1\nGGC\n+\njjk");

julia> identifier(rec)
"ill"

julia> description(rec)
"ill r1"

julia> sequence(rec)
"GGC"

julia> show(collect(quality_scores(rec)))
Int8[73, 73, 74]

julia> typeof(description(rec)) == typeof(sequence(rec)) <: AbstractString
true

Qualities

Unlike FASTARecords, a FASTQRecord contains quality scores; see the example above.

The quality string can be obtained using the quality method:

julia> record = parse(FASTQRecord, "@ILL01\nCCCGC\n+\nKM[^d");

julia> quality(record)
"KM[^d"

Qualities are numerical values that are encoded by ASCII characters. Unfortunately, multiple encoding schemes exist, although PHRED+33 is the most common. The scores can be obtained using the quality_scores function, which returns an iterator of PHRED+33 scores:

julia> collect(quality_scores(record))
5-element Vector{Int8}:
 42
 44
 58
 61
 67

If you want to decode the qualities using another scheme, you can use one of the predefined QualityEncoding objects. For example, Illumina 1.3 used PHRED+64:

julia> collect(quality_scores(record, FASTQ.ILLUMINA13_QUAL_ENCODING))
5-element Vector{Int8}:
 11
 13
 27
 30
 36
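The two results above differ only in the offset that is subtracted from each character's ASCII code. A minimal sketch of that arithmetic in plain Julia, independent of FASTX (`decode_quality` is a hypothetical helper, not a FASTX function):

```julia
# Hypothetical helper: subtract the encoding offset from each ASCII code.
decode_quality(qual::AbstractString, offset::Integer) =
    [Int8(Int(c) - offset) for c in qual]

phred33 = decode_quality("KM[^d", 33)  # PHRED+33 (Sanger)
phred64 = decode_quality("KM[^d", 64)  # PHRED+64 (Illumina 1.3)
```

For example, 'K' has ASCII code 75, so it decodes to 42 under PHRED+33 and to 11 under PHRED+64.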

Alternatively, quality_scores accepts the name of one of the known quality encodings as a Symbol:

julia> (collect(quality_scores(record, FASTQ.ILLUMINA13_QUAL_ENCODING)) ==
        collect(quality_scores(record, :illumina13)))
true

Lastly, you can create your own:

FASTX.FASTQ.QualityEncoding - Type
QualityEncoding(range::StepRange{Char}, offset::Integer)

FASTQ quality encoding scheme. QualityEncoding objects are used to interpret the quality scores of FASTQ records. range is a range of allowed ASCII chars in the encoding, e.g. '!':'~' for the most common encoding scheme. The offset is the ASCII offset, i.e. a character with ASCII value x encodes the value x - offset.

See also: quality_scores

Examples

julia> read = parse(FASTQ.Record, "@hdr\nAGA\n+\nabc");

julia> qe = QualityEncoding('a':'z', 16); # hypothetical encoding

julia> collect(quality_scores(read, qe)) == [Int8(i) - 16 for i in "abc"]
true
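The roles of range and offset can be illustrated with a toy model in plain Julia. This is a sketch only: `ToyEncoding` and `decode` are hypothetical names, and raising an error for characters outside the range is an assumption of this sketch, not a statement about how FASTX reacts to them.

```julia
# Hypothetical stand-in for a quality encoding: allowed characters plus an offset.
struct ToyEncoding
    range::StepRange{Char, Int}
    offset::Int
end

# Decode each character as (ASCII value - offset), rejecting characters
# outside the allowed range (an assumption made for this sketch).
function decode(enc::ToyEncoding, qual::AbstractString)
    map(collect(qual)) do c
        c in enc.range || error("character $(repr(c)) is outside the allowed range")
        Int8(Int(c) - enc.offset)
    end
end

enc = ToyEncoding('a':'z', 16)
scores = decode(enc, "abc")
```

Here 'a' has ASCII value 97, so with an offset of 16 it decodes to 81.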

Reference:

FASTX.FASTQ.quality - Function
quality([T::Type{String, StringView}], record::FASTQ.Record, [part::UnitRange])

Get the ASCII quality of record at positions part as type T. When omitted, T defaults to StringView and part defaults to the entire quality string.

Examples

julia> rec = parse(FASTQ.Record, "@hdr\nUAGUCU\n+\nCCDFFG");

julia> qual = quality(rec)
"CCDFFG"

julia> qual isa AbstractString
true

FASTX.FASTQ.quality_scores - Function
quality_scores(record::FASTQ.Record, [encoding::QualityEncoding], [part::UnitRange])

Get an iterator of PHRED base quality scores of record at positions part. This iterator is corrupted if the record is mutated. By default, part is the whole sequence. By default, the encoding is PHRED33 Sanger encoding, but it may be specified with a QualityEncoding object.

quality_scores(record::Record, encoding_name::Symbol, [part::UnitRange])

Get an iterator of the base quality scores in the slice part of record's quality string, decoded with the encoding named by encoding_name.

The encoding_name can be either :sanger, :solexa, :illumina13, :illumina15, or :illumina18.

FASTX.FASTQ.quality_header! - Function
quality_header!(record::Record, x::Bool)

Set whether the record repeats its header on the quality comment line, i.e. the line with +.

Examples

julia> record = parse(FASTQ.Record, "@A B\nT\n+\nJ");

julia> string(record)
"@A B\nT\n+\nJ"

julia> quality_header!(record, true);

julia> string(record)
"@A B\nT\n+A B\nJ"

FASTQReader and FASTQWriter

FASTQWriter can optionally be passed the keyword quality_header to control whether or not to print the description on the third line (the one with +). By default this is nothing, meaning that it will print the second header, if present in the record itself.

If set to a Bool value, the Writer overrides the setting of each individual record when writing, without changing the records themselves.
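The precedence between the writer's keyword and the record's own setting can be modelled in a few lines of plain Julia. This is a sketch of the rule as described here, not FASTX's implementation; `resolve_quality_header` and `render` are hypothetical names.

```julia
# Hypothetical: `nothing` defers to the record; a Bool overrides it.
resolve_quality_header(writer_setting::Union{Nothing, Bool}, record_has_header::Bool) =
    writer_setting === nothing ? record_has_header : writer_setting

# Hypothetical: render one record as four lines of text.
function render(desc, seq, qual; record_has_header::Bool, writer_setting=nothing)
    repeat_header = resolve_quality_header(writer_setting, record_has_header)
    third_line = repeat_header ? "+" * desc : "+"
    return join(("@" * desc, seq, third_line, qual), '\n')
end

deferred = render("A B", "T", "J"; record_has_header=false)
forced_on = render("A B", "T", "J"; record_has_header=false, writer_setting=true)
forced_off = render("A B", "T", "J"; record_has_header=true, writer_setting=false)
```

In this model the record's own flag is only read, never modified, which mirrors the statement that the records themselves are left unchanged.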

Reference:

FASTX.FASTQ - Module
FASTQ

Module under FASTX with code related to FASTQ files.

FASTX.FASTQ.Reader - Type
FASTQ.Reader(input::IO; copy::Bool=true)

Create a buffered data reader of the FASTQ file format. The reader is a BioGenerics.IO.AbstractReader, a stateful iterator of FASTQ.Record. Readers take ownership of the underlying IO. Mutating or closing the underlying IO other than through the reader is undefined behaviour. Closing the Reader also closes the underlying IO.

See more examples in the FASTX documentation.

See also: FASTQ.Record, FASTQ.Writer

Arguments

  • input: data source
  • copy::Bool: iterating returns fresh copies instead of the same Record. Set to false for improved performance, but be wary that iterating mutates records.
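The hazard behind copy=false can be shown with a toy model in plain Julia. This is not how FASTX is implemented internally; `ToyRecord` and `read_all` are hypothetical and only mimic the idea of an iterator that reuses one mutable record.

```julia
# Hypothetical mutable record standing in for FASTQ.Record.
mutable struct ToyRecord
    data::String
end

# Hypothetical reader loop: one shared record is overwritten on every step.
function read_all(lines; copy::Bool=true)
    shared = ToyRecord("")
    out = ToyRecord[]
    for line in lines
        shared.data = line
        # copy=true stores a fresh record; copy=false stores the shared one.
        push!(out, copy ? ToyRecord(shared.data) : shared)
    end
    return out
end

copied = read_all(["GGCC", "TTAA"]; copy=true)
aliased = read_all(["GGCC", "TTAA"]; copy=false)
```

With copy=false every stored element is the same object, so all of them show the most recently read data. Under that model, records should be processed or copied before the iteration advances.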

Examples

julia> rdr = FASTQReader(IOBuffer("@readname\nGGCC\n+\njk;]"));

julia> record = first(rdr); close(rdr);

julia> identifier(record)
"readname"

julia> sequence(record)
"GGCC"

julia> show(collect(quality_scores(record))) # phred 33 encoding by default
Int8[73, 74, 26, 60]

FASTX.FASTQ.Writer - Type
FASTQ.Writer(output::IO; quality_header::Union{Nothing, Bool}=nothing)

Create a data writer of the FASTQ file format. The writer is a BioGenerics.IO.AbstractWriter. Writers take ownership of the underlying IO. Mutating or closing the underlying IO other than through the writer is undefined behaviour. Closing the writer also closes the underlying IO.

See more examples in the FASTX documentation.

See also: FASTQ.Record, FASTQ.Reader

Arguments

  • output: Data sink to write to
  • quality_header: Whether to print second header on the + line. If nothing (default), check the individual Record objects for whether they contain a second header.

Examples

julia> FASTQ.Writer(open("some_file.fq", "w")) do writer
    write(writer, record) # a FASTQ.Record
end

FASTX.FASTQ.validate_fastq - Function
validate_fastq(io::IO) >: Nothing

Check if io is a valid FASTQ file. Return nothing if it is, and an instance of another type if not.

Examples

julia> validate_fastq(IOBuffer("@i1 r1\nuuag\n+\nHJKI")) === nothing
true

julia> validate_fastq(IOBuffer("@i1 r1\nu;ag\n+\nHJKI")) === nothing
false