Building Read Datastores
To build a read datastore, you first need to decide what sort of read data you have.
The process of building these datastores is consistent, but each datastore's constructor accepts datastore-specific options:
ReadDatastores.PairedReads — Type
PairedReads{A}(rdrx::FASTQ.Reader, rdry::FASTQ.Reader, outfile::String, name::Union{String,Symbol}, minsize::Integer, maxsize::Integer, fragsize::Integer, orientation::PairedReadOrientation) where {A<:DNAAlphabet}
Construct a Paired Read Datastore from a pair of FASTQ file readers.
Paired-end sequencing reads typically come in the form of two FASTQ files, often named according to a convention *_R1.fastq and *_R2.fastq. One file contains all the "left" sequences of each pair, and the other contains all the "right" sequences of each pair. The first read pair is made of the first record in each file.
Arguments
rdrx::FASTQ.Reader: The reader of the *_R1.fastq file.
rdry::FASTQ.Reader: The reader of the *_R2.fastq file.
outfile::String: A prefix for the datastore's filename; the full filename will include a ".prseq" extension, which will be added automatically.
name::Union{String,Symbol}: A default name for your datastore. Naming datastores is useful for downstream applications.
minsize::Integer: A minimum read length (in base pairs). When building the datastore, if one or both reads of a pair are shorter than this cutoff, the pair is discarded.
maxsize::Integer: A maximum read length (in base pairs). When building the datastore, any read longer than this is resized to this maximum length before being added.
fragsize::Integer: The average fragment length of the paired-end library that was sequenced. This value is entirely optional, but may be important for downstream applications.
orientation::PairedReadOrientation: The orientation of the reads. Set it to FwRv when building a datastore from a sequenced paired-end library, and to RvFw when building it from reads sequenced from a long mate pair library.
Examples
julia> using FASTX, ReadDatastores
julia> fwq = open(FASTQ.Reader, "test/ecoli_tester_R1.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)
julia> rvq = open(FASTQ.Reader, "test/ecoli_tester_R2.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)
julia> ds = PairedReads{DNAAlphabet{2}}(fwq, rvq, "ecoli-test-paired", "my-ecoli-test", 250, 300, 0, FwRv)
[ Info: Building paired read datastore from FASTQ files
[ Info: Writing paired reads to datastore
[ Info: Done writing paired read sequences to datastore
[ Info: 0 read pairs were discarded due to a too short sequence
[ Info: 14 reads were truncated to 300 base pairs
[ Info: Created paired sequence datastore with 10 sequence pairs
Paired Read Datastore 'my-ecoli-test': 20 reads (10 pairs)
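Once built, the datastore can be queried directly. The following is a minimal sketch, assuming the standard ReadDatastore interface (`length`, indexing, and iteration, with reads returned as BioSequences sequences); `ds` is the paired read datastore built above:

```julia
using ReadDatastores

# A sketch, assuming the standard ReadDatastore interface.
n = length(ds)      # total number of reads (each pair counts as two)
first_read = ds[1]  # retrieve the first read as a sequence

# Iterate over every read in the datastore
for read in ds
    println(length(read))  # print each read's length in base pairs
end
```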
ReadDatastores.LongReads — Type
LongReads{A}(rdr::FASTQ.Reader, outfile::String, name::String, min_size::Integer) where {A<:DNAAlphabet}
Construct a Long Read Datastore from a FASTQ file reader.
Arguments
rdr::FASTQ.Reader: The reader of the FASTQ formatted file.
outfile::String: A prefix for the datastore's filename; the full filename will include a ".loseq" extension, which will be added automatically.
name::String: A default name for your datastore. Naming datastores is useful for downstream applications.
min_size::Integer: A minimum read length (in base pairs). When building the datastore, any read shorter than this cutoff is discarded.
Examples
julia> using FASTX, ReadDatastores
julia> longrdr = open(FASTQ.Reader, "test/human_nanopore_tester_2D.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)
julia> ds = LongReads{DNAAlphabet{2}}(longrdr, "human-nanopore-tester", "nanopore-test", 0)
[ Info: Building long read datastore from FASTQ file
[ Info: Writing long reads to datastore
[ Info: Done writing paired read sequences to datastore
[ Info: 0 reads were discarded due to a too short sequence
[ Info: Writing index to datastore
[ Info: Built long read datastore with 10 reads
Long Read Datastore 'nanopore-test': 10 reads
ReadDatastores.LinkedReads — Type
LinkedReads{A}(fwq::FASTQ.Reader, rvq::FASTQ.Reader, outfile::String, name::String, format::LinkedReadsFormat, max_read_len::Integer, chunksize::Int = 1000000) where {A<:DNAAlphabet}
Construct a Linked Read Datastore from a pair of FASTQ file readers.
Paired sequencing reads typically come in the form of two FASTQ files, often named according to a convention *_R1.fastq and *_R2.fastq. One file contains all the "left" sequences of each pair, and the other contains all the "right" sequences of each pair. The first read pair is made of the first record in each file.
Arguments
fwq::FASTQ.Reader: The reader of the *_R1.fastq file.
rvq::FASTQ.Reader: The reader of the *_R2.fastq file.
outfile::String: A prefix for the datastore's filename, which will include an automatically added extension.
name::String: A default name for your datastore. Naming datastores is useful for downstream applications.
format::LinkedReadsFormat: Specifies the linked reads format of your FASTQ files. If you have plain 10x files, set the format to Raw10x. If you have 10x reads output in the UCDavis format, set the format to UCDavis10x.
chunksize::Int = 1000000: How many read pairs to process per disk batch during the tag sorting step of construction.
Examples
julia> using FASTX, ReadDatastores
julia> fqa = open(FASTQ.Reader, "test/10x_tester_R1.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)
julia> fqb = open(FASTQ.Reader, "test/10x_tester_R2.fastq")
FASTX.FASTQ.Reader{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(BioGenerics.Automa.State{TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}}(TranscodingStreams.TranscodingStream{TranscodingStreams.Noop,IOStream}(<mode=idle>), 1, 1, false), nothing)
julia> ds = LinkedReads{DNAAlphabet{2}}(fqa, fqb, "10xtest", "ucdavis-test", UCDavis10x, 250)
[ Info: Building tag sorted chunks of 1000000 pairs
[ Info: Dumping 83 tag-sorted read pairs to chunk 0
[ Info: Dumped
[ Info: Processed 83 read pairs so far
[ Info: Finished building tag sorted chunks
[ Info: Performing merge from disk
[ Info: Leaving space for 83 read_tag entries
[ Info: Chunk 0 is finished
[ Info: Finished merge from disk
[ Info: Writing 83 read tag entries
[ Info: Created linked sequence datastore with 83 sequence pairs
Linked Read Datastore 'ucdavis-test': 166 reads (83 pairs)
Loading Read Datastores
Once you have built a datastore, you can open it again in other projects using a Base.open method:
julia> ds = open(PairedReads{DNAAlphabet{2}}, "ecoli-test-paired.prseq", "my-ecoli-pe")
Paired Read Datastore 'my-ecoli-pe': 20 reads (10 pairs)
The open method takes a ReadDatastore type, the filename of the datastore, and, optionally, a name for the datastore. If you omit the name, the datastore will use the default name that was specified on construction.
You can also use do block notation when opening a datastore; this ensures that the underlying stream of the reader is closed.
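As a sketch, a do block version of the open call above (using the datastore file built earlier) might look like this:

```julia
using ReadDatastores, BioSequences

# do-block form: the datastore's underlying stream is closed
# automatically when the block exits, even if an error is thrown.
open(PairedReads{DNAAlphabet{2}}, "ecoli-test-paired.prseq") do ds
    println(length(ds))  # work with the datastore inside the block
end
```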
Using the macros and string literals
If you try to open a datastore with the open method as above and you provide the wrong type, you will get an error.
This protects you from, say, trying to open a long read datastore as a paired read datastore, but it's not always convenient.
For example, if you have a paired read datastore storing reads in 2-bit format and you try to open it as a PairedReads{DNAAlphabet{4}} type, you will still get an error.
This is obviously correct behaviour: you don't want to load sequences using a different encoding to the one they were saved with!
However, in a practical setting this can get annoying: maybe you want to use some long reads you put in a datastore a while ago, but don't remember whether your datastore file is a LongReads{DNAAlphabet{4}} or a LongReads{DNAAlphabet{2}}. Or maybe you get a somefile.prseq file from a colleague, and from the extension you deduce it contains paired reads, but even then that's not guaranteed.
To solve this problem a few convenience macros are provided for you, so you can load datastores without specifying the datastore type, yet still avoid type uncertainty in your generated code.
The macro @openreads accepts a filepath and, optionally, a datastore name. The macro is evaluated and expanded before your Julia code is compiled. During that time, the header of the datastore file is peeked at, the correct ReadDatastore subtype is deduced, and a correctly typed open statement is generated.
For example, if the file myreads.prseq were a 2-bit encoded paired read datastore, and you had the following statement in your script: ds = @openreads "myreads.prseq"
The statement would be expanded to: ds = open(PairedReads{DNAAlphabet{2}}, "myreads.prseq")
You can also open a datastore using a string literal, e.g. reads"myreads.prseq". When you do this, the type of the datastore is detected as with @openreads; however, rather than returning the expression open(PairedReads{DNAAlphabet{2}}, "myreads.prseq") as @openreads does, the open is executed and the literal returns the opened datastore.
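Side by side, the two convenience forms might be used like this (a sketch; myreads.prseq is a hypothetical 2-bit encoded paired read datastore file):

```julia
using ReadDatastores

# Macro form: expands to a correctly typed `open` call at macro
# expansion time, before the script is compiled.
ds1 = @openreads "myreads.prseq"

# String literal form: detects the type the same way, but returns
# the opened datastore directly.
ds2 = reads"myreads.prseq"
```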
In order for @openreads or the string literals to work properly in a script, the datastore file must exist before the script is run. This is unavoidable, because macros are evaluated and expanded before the resulting expanded code is compiled and run.
So creating a datastore file and then loading it again using @openreads within the same script will not work: @openreads will try to peek at the file and deduce its contents before the script can generate the datastore file, and you will get an error telling you the file can't be found / does not exist.
In an interactive setting, in which statements are entered, compiled and run by the REPL one at a time, this should rarely be a problem.
In summary: if in doubt about the datastore type of a file, simply use @openreads.