TriageTools Wiki

Tools for partitioning and prioritizing fastq data

Brought to you by: tkonopka

TriageSequence

Triage by sequence

The sequence-based tool partitions input reads according to whether or not they are similar to a specified target sequence. This is useful for targeted analysis of small genomic regions. It can save considerable disk space and computing resources compared to a traditional approach in which an aligner processes all data regardless of whether it is needed in downstream analysis.

The tool requires the desired target sequence as input in fasta format. It scans this target sequence and records all subsequences of length seedlength (default 14). It then scans the input data and determines how many of each read's bases match the target; these matching bases are termed hits. A read is classified as similar to the target if its hits exceed a threshold (default 38).
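The indexing and counting steps described above can be illustrated with a minimal Python sketch. This is not the tool's actual implementation: the function names are hypothetical, and the sketch assumes that a base counts as a hit when it lies inside at least one read window of length seedlength that also occurs in the target. It ignores reverse complements and any other details the tool may handle.

```python
def build_seed_index(target, seedlength=14):
    """Record every subsequence of length `seedlength` present in the target."""
    return {target[i:i + seedlength] for i in range(len(target) - seedlength + 1)}


def count_hits(read, seeds, seedlength=14):
    """Count read bases covered by at least one window found in the seed index."""
    covered = [False] * len(read)
    for i in range(len(read) - seedlength + 1):
        if read[i:i + seedlength] in seeds:
            for j in range(i, i + seedlength):
                covered[j] = True
    return sum(covered)


def is_similar(read, seeds, seedlength=14, hits=38):
    """Assumed rule: a read is similar when its hits exceed the threshold."""
    return count_hits(read, seeds, seedlength) > hits


# Toy example with a short seed length so the result is easy to check by hand.
seeds = build_seed_index("ACGTTGCAAGGCTTAC", seedlength=4)
print(count_hits("TTGCAAGGTTTTTT", seeds, seedlength=4))  # 8
```

Only membership in the seed set is stored, which matches the description that the tool tracks the existence of subsequences rather than their positions.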

The algorithm works similarly to hash-table based aligners like GSNAP or BLAT, but it only tracks whether subsequences exist in the target and not where they are located in a genome. The tool can classify raw data quickly primarily because it defers the computationally difficult task of local alignment to other tools.

Single-end samples

To process a single-end sample:

java -jar triagetools.jar sequence --target mytarget.fa.gz \
    -i allreads.txt.gz -o myreads.txt.gz

This will create one output file myreads-hits.txt.gz. By default, reads that are classified as not similar to the target are omitted from the output. To write these reads to a file as well, add the --all option:

java -jar triagetools.jar sequence --target mytarget.fa.gz --all \
    -i allreads.txt.gz -o myreads.txt.gz

This will create two output files myreads-hits.txt.gz and myreads-nothits.txt.gz.

Two important options for sequence-based triage are --seedlength and --hits. These determine the criteria that a read has to meet in order to be classified as a hit. To obtain a smaller result set, use higher values for both:

java -jar triagetools.jar sequence --target mytarget.fa.gz \
    --seedlength 14 --hits 50 \
    -i allreads.txt.gz -o myreads.txt.gz

This will use subsequences of length 14 as seeds and classify a read as a hit if more than 50 of its bases, in contiguous stretches at least 14 bases long, are identical to some part of the target sequence.

It is also possible to specify the hits as a fraction of the read length, e.g.

java -jar triagetools.jar sequence --target mytarget.fa.gz \
    --seedlength 14 --hits 0.7 \
    -i allreads.txt.gz -o myreads.txt.gz

Here, at least 70% of a read's bases must be identical with the target.

It is possible to specify multiple target files in a comma-separated list:

java -jar triagetools.jar sequence --target mytarget1.fa.gz,mytarget2.fa.gz \
    -i allreads.txt.gz -o myreads.txt.gz

The individual files are used to construct separate indexes. Multiple target files thus increase memory use, but provide a means of targeting larger regions. Note that the output file myreads-hits.txt.gz will contain reads that are similar to target sequences in any of the target files.
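One way to picture the use of separate indexes is the following Python sketch. It is an illustration under assumptions, not the tool's code: the function names are hypothetical, and it assumes that a read is tested against each index independently and kept if it passes the threshold for at least one of them.

```python
def build_seed_index(target, seedlength=14):
    """Record every subsequence of length `seedlength` present in one target."""
    return {target[i:i + seedlength] for i in range(len(target) - seedlength + 1)}


def count_hits(read, seeds, seedlength=14):
    """Count read bases covered by at least one window found in the seed index."""
    covered = set()
    for i in range(len(read) - seedlength + 1):
        if read[i:i + seedlength] in seeds:
            covered.update(range(i, i + seedlength))
    return len(covered)


def matches_any_target(read, indexes, seedlength=14, hits=38):
    """Assumed rule: keep the read if it exceeds the threshold for any one index."""
    return any(count_hits(read, seeds, seedlength) > hits for seeds in indexes)


# One index per target sequence; each index costs additional memory.
indexes = [build_seed_index(t, seedlength=4) for t in ("ACGTTGCAAGGC", "TTTTCCCCGGGG")]
print(matches_any_target("CCCCGGGG", indexes, seedlength=4, hits=5))  # True
```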

Note: multiple target files should be separated with commas and without spaces.

Paired-end samples

Paired-end samples can be processed by using two -i and two -o options, e.g.

java -jar triagetools.jar sequence --target mytarget.fa.gz \
    -i allreads_1.txt.gz -i allreads_2.txt.gz \
    -o myreads_1.txt.gz -o myreads_2.txt.gz

The hits parameter can again be set to a fraction, e.g.

java -jar triagetools.jar sequence --target mytarget.fa.gz --hits 1.1 \
    -i allreads_1.txt.gz -i allreads_2.txt.gz \
    -o myreads_1.txt.gz -o myreads_2.txt.gz

In this case, the tool computes the length of the shorter mate and multiplies it by the hits fraction. This gives the absolute number of hits required to classify the read pair as a hit. These hits can arise from either mate or from a combination of the two. In the example above, the fraction 1.1 is greater than 1, so when both mates have the same length neither mate can reach the threshold alone and the hits must come from a combination of the mates.
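The paired-end threshold arithmetic can be sketched in Python as follows. The function names are hypothetical, and the strict "greater than" comparison is an assumption based on the description that hits must exceed the threshold; the tool's exact comparison and rounding are not documented here.

```python
def required_pair_hits(len_mate1, len_mate2, hits_fraction):
    """Absolute hits needed for a pair: shorter mate length times the fraction."""
    return min(len_mate1, len_mate2) * hits_fraction


def pair_is_hit(hits_mate1, hits_mate2, len_mate1, len_mate2, hits_fraction):
    """Pool the hits from both mates and compare them with the pair threshold."""
    needed = required_pair_hits(len_mate1, len_mate2, hits_fraction)
    return hits_mate1 + hits_mate2 > needed


# Two 100-base mates with --hits 1.1 need roughly 110 hits in total.
print(pair_is_hit(100, 0, 100, 100, 1.1))  # False: one perfect mate is not enough
print(pair_is_hit(60, 60, 100, 100, 1.1))  # True: hits pooled from both mates
```

With mates of 100 and 90 bases, the same fraction gives a threshold of about 99 hits, since only the shorter mate's length enters the calculation.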