Building Read Datastores

To build a read datastore you first need to decide what sort of read data you have.

The process of building these datastores is consistent, but for each datastore there are datastore specific options the constructor will accept:

ReadDatastores.PairedReadsType
PairedReads{A}(rdrx::FASTQ.Reader, rdry::FASTQ.Reader, outfile::String, name::Union{String,Symbol}, minsize::Integer, maxsize::Integer, fragsize::Integer, orientation::PairedReadOrientation) where {A<:DNAAlphabet}

Construct a Paired Read Datastore from a pair of FASTQ file readers.

Paired-end sequencing reads typically come in the form of two FASTQ files, often named according to a convention *_R1.fastq and *_R2.fastq. One file contains all the "left" sequences of each pair, and the other contains all the "right" sequences of each pair. The first read pair is made of the first record in each file.

Arguments

  • rdrx::FASTQ.Reader: The reader of the *_R1.fastq file.
  • rdxy::FASTQ.Reader: The reader of the *_R2.fastq file.
  • outfile::String: A prefix for the datastore's filename, the full filename will include a ".prseq" extension, which will be added automatically.
  • name::String: A string denoting a default name for your datastore. Naming datastores is useful for downstream applications.
  • minsize::Integer: A minimum read length (in base pairs). When building the datastore, if any pair of reads has one or both reads shorter than this cutoff, then the pair will be discarded.
  • maxsize::Integer: A maximum read length (in base pairs). When building the datastore, if any read has a greater length, it will be resized to this maximum length and added to the datastore.
  • fragsize::Integer: The average fragment length of the paired end library that was sequenced. This value is entirely optional, but may be important for downstream applications.
  • orientation::PairedReadOrientation: The orientation of the reads. Set it to FwRv for building a datastore from a sequenced paired end library, and set it to RvFw if you are building the datastore from reads sequenced from a long mate pair library.

Examples

julia> using FASTX, ReadDatastores

julia> fwq = open(FASTQ.Reader, "test/ecoli_tester_R1.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)

julia> rvq = open(FASTQ.Reader, "test/ecoli_tester_R2.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)

julia> ds = PairedReads{DNAAlphabet{2}}(fwq, rvq, "ecoli-test-paired", "my-ecoli-test", 250, 300, 0, FwRv)
[ Info: Building paired read datastore from FASTQ files
[ Info: Writing paired reads to datastore
[ Info: Done writing paired read sequences to datastore
[ Info: 0 read pairs were discarded due to a too short sequence
[ Info: 14 reads were truncated to 300 base pairs
[ Info: Created paired sequence datastore with 10 sequence pairs
Paired Read Datastore 'my-ecoli-test': 20 reads (10 pairs)
source
ReadDatastores.LongReadsType
LongReads{A}(rdr::FASTQ.Reader, outfile::String, name::String, min_size::Integer) where {A<:DNAAlphabet}

Construct a Long Read Datastore from a FASTQ file reader.

Arguments

  • rdr::FASTQ.Reader: The reader of the fastq formatted file.
  • outfile::String: A prefix for the datastore's filename, the full filename will include a ".loseq" extension, which will be added automatically.
  • name::String: A string denoting a default name for your datastore. Naming datastores is useful for downstream applications.
  • min_size::Integer: A minimum read length (in base pairs). When building the datastore, if any read sequence is shorter than this cutoff, then the read will be discarded.

Examples

julia> using FASTX, ReadDatastores

julia> longrdr = open(FASTQ.Reader, "test/human_nanopore_tester_2D.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)

julia> ds = LongReads{DNAAlphabet{2}}(longrdr, "human-nanopore-tester", "nanopore-test", 0)
[ Info: Building long read datastore from FASTQ file
[ Info: Writing long reads to datastore
[ Info: Done writing paired read sequences to datastore
[ Info: 0 reads were discarded due to a too short sequence
[ Info: Writing index to datastore
[ Info: Built long read datastore with 10 reads
Long Read Datastore 'nanopore-test': 10 reads
source
ReadDatastores.LinkedReadsType
LinkedReads{A}(fwq::FASTQ.Reader, rvq::FASTQ.Reader, outfile::String, name::String, format::LinkedReadsFormat, max_read_len::Integer, chunksize::Int = 1000000) where {A<:DNAAlphabet}

Construct a Linked Read Datastore from a pair of FASTQ file readers.

Paired sequencing reads typically come in the form of two FASTQ files, often named according to a convention *_R1.fastq and *_R2.fastq. One file contains all the "left" sequences of each pair, and the other contains all the "right" sequences of each pair. The first read pair is made of the first record in each file.

Arguments

  • rdrx::FASTQ.Reader: The reader of the *_R1.fastq file.
  • rdxy::FASTQ.Reader: The reader of the *_R2.fastq file.
  • outfile::String: A prefix for the datastore's filename, the full filename will include a ".prseq" extension, which will be added automatically.
  • name::String: A string denoting a default name for your datastore. Naming datastores is useful for downstream applications.
  • format::LinkedReadsFormat: Specify the linked reads format of your fastq files. If you have plain 10x files, set format to Raw10x. If you have 10x reads output in the UCDavis format, set the format to UCDavis10x.
  • chunksize::Int = 1000000: How many read pairs to process per disk batch during the tag sorting step of construction.

Examples

julia> using FASTX, ReadDatastores

julia> fqa = open(FASTQ.Reader, "test/10x_tester_R1.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)

julia> fqb = open(FASTQ.Reader, "test/10x_tester_R2.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)

julia> ds = LinkedReads{DNAAlphabet{2s}}(fqa, fqb, "10xtest", "ucdavis-test", UCDavis10x, 250)
[ Info: Building tag sorted chunks of 1000000 pairs
[ Info: Dumping 83 tag-sorted read pairs to chunk 0
[ Info: Dumped
[ Info: Processed 83 read pairs so far
[ Info: Finished building tag sorted chunks
[ Info: Performing merge from disk
[ Info: Leaving space for 83 read_tag entries
[ Info: Chunk 0 is finished
[ Info: Finished merge from disk
[ Info: Writing 83 read tag entries
[ Info: Created linked sequence datastore with 83 sequence pairs
Linked Read Datastore 'ucdavis-test': 166 reads (83 pairs)
source

Loading Read Datastores

Once you have built a datastore, you can open it in other projects again using a Base.open method:

julia> ds = open(PairedReads{DNAAlphabet{2}}, "ecoli-test-paired.prseq", "my-ecoli-pe")
Paired Read Datastore 'my-ecoli-pe': 20 reads (10 pairs)

The open method takes a ReadDatastore type, the filename of the datastore, and, optionally a name for the datastore, if you omit the name, the datastore will use the default name that was specified on construction.

You can also use do block notation when opening a datastore and that will ensure that the underlying stream of the reader will close.

Using the macros and string literals

If you try to open a datastore with the open method as above, and you provide the wrong type you will get an error.

This will protect you from trying to open a long read datastore as a paired read datastore and such, but it's not always convenient.

For example, if you have a paired read datastore storing reads in 2-bit format and you tried to open it as a PairedReads{DNAAlphabet{4}} type you will still get an error.

This is obviously correct behaviour, you don't want to be loading sequences using a different encoding to the one they were saved with!

However, in a practical setting this will get annoying: maybe you want to use some long reads you put in a datastore a while ago but don't remember if your datastore file is a LongReads{DNAAlphabet{4}} or a LongReads{DNAAlphabet{2}}. Or maybe you get a somefile.prseq file from a colleague, and from the extension, you deduce it is paired reads but even then that's not guaranteed.

To solve this problem a few convenience macros are provided for you, so you can load datastores without specifying the datastore type, yet still avoid type uncertainty in your generated code.

The macro @openreads accepts a filepath, and optionally a datastore name. The macro is evaluated and expanded before you julia code is compiled. During that time, the header of the datastore file is peeked at, and the correct ReadDatastore subtype is deduced, and a correctly typed open statement is generated.

For example if the file myreads.prseq was a 2-bit encoded paired read datastore, and you had the following statement in your script: ds = @openreads "myreads.prseq"

The statement would be expanded to: ds = open(PairedReads{DNAAlphabet{2}}, "myreads.prseq")

You can also open a datastore using a string literal e.g. reads"myreads.prseq". When you do this, type type of the datastore is detected as with @openreads, however, rather than returning the expression ds = open(PairedReads{DNAAlphabet{2}}, "myreads.prseq"), as @openreads does, the open is executed and the macro returns the value of ds.

Note

In order for @openreads or the literals to work properly in any script, the datastore file must exist prior to running the script. This is unavoidable because macros are evaluated and expanded first before the resulting expanded code is compiled and run.

So creating a datastore file and loading it again using @openreads within the same script will not work, and @openreads will try to peek at the file and deduce its contents before the script can generate the datastore file. You will get an error telling you the file can't be found / does not exist.

In an interactive setting, in which statements are entered, compiled and run by the REPL one at a time, this should rarely be a problem.

In summary: If in doubt about the datastore type of a file, simply use @openreads