TriageCommon

Tomasz

Common options


Data formats

The default file input/output format is fastq with four lines per read (an id, a sequence, a quality id, and a quality string). The tools can also handle data with two (an id and a sequence), three (an id and two paired sequences), or six (an id, two paired sequences, a quality id, and two paired quality strings) lines per read.
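As a rough illustration of what "lines per read" means, the sketch below counts the reads in an uncompressed file from its line count. The function name count_reads is hypothetical and not part of the tools; it assumes every read occupies exactly the stated number of lines and that the file ends with a newline.

```shell
# count_reads FILE LINES_PER_READ
# Prints the number of reads in a plain-text file, assuming each read
# occupies exactly LINES_PER_READ lines (2, 3, 4, or 6 as listed above).
# Hypothetical helper for illustration; not part of triagetools.
count_reads() {
    lines=$(wc -l < "$1")      # total number of newline-terminated lines
    echo $(( lines / $2 ))     # integer division: lines per file / lines per read
}
```

For a standard four-line fastq file, count_reads allreads.txt 4 would print the number of reads; for a six-line paired file, the second argument would be 6.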

An example where input and output data are both in the same format:

java -jar triagetools.jar parts --parts 0.05 --chunksize 6 -i allreads.txt.gz -o myreads.txt.gz

The command above shows an alternative way of processing paired reads, different from the one described on the individual tools' pages.

In some cases, it may be desirable to obtain output in a different format than the input, e.g.

java -jar triagetools.jar parts --parts 1 --chunksizeout 2 -i allreads.txt.gz -o myreads.txt.gz

This command essentially converts a fastq file into a fasta file.
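To make the effect of this conversion concrete, here is a minimal sketch that keeps only the first two lines (id and sequence) of every four-line record. It illustrates the expected shape of the output only and is not how the tools are implemented; the function name keep_id_and_sequence is hypothetical. Note that strict fasta headers start with ">" whereas fastq ids start with "@"; this sketch leaves the id line unchanged.

```shell
# keep_id_and_sequence: read four-line records on stdin and write only
# lines 1 and 2 of each record (id and sequence) to stdout.
# Hypothetical helper for illustration; not part of triagetools.
keep_id_and_sequence() {
    awk 'NR % 4 == 1 || NR % 4 == 2'
}
```

For example, cat allreads.txt | keep_id_and_sequence prints two lines per read and drops the quality id and quality string.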

Note: the regex tool works slightly differently from all the others and does not support changes between input and output formats.


Compression

All tools can read/write from/to gzip (extension gz), bzip2 (extension bz2), or plain text (all other extensions) files.
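The extension rule can be pictured with the following sketch, which picks a decoder from the file name alone. It is an illustration of the rule stated above, assuming the standard gzip and bzip2 command-line programs are installed; it is not the tools' own code, and the function name open_reads is hypothetical.

```shell
# open_reads FILE: write the decompressed contents of FILE to stdout,
# choosing the decoder from the file extension only.
# Hypothetical helper for illustration; not part of triagetools.
open_reads() {
    case "$1" in
        *.gz)  gzip -dc "$1" ;;    # gzip-compressed
        *.bz2) bzip2 -dc "$1" ;;   # bzip2-compressed
        *)     cat "$1" ;;         # any other extension: plain text
    esac
}
```

A consequence of choosing by extension is that a compressed file with a misleading name, such as a gzip file called reads.txt, would be treated as plain text in this sketch.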


Using standard input and standard output

All tools can also read from standard input and write to standard output. Indeed, stdin and stdout are the defaults for all -i and -o options. The tools can therefore be chained with other commands by piping, or their output can be redirected into files.

Standard input can be used either by omitting the relevant option or by declaring it as stdin. For example, the following two commands are equivalent:

cat allreads.txt | java -jar triagetools.jar length --length 50 -o myreads.txt.gz
cat allreads.txt | java -jar triagetools.jar length --length 50 -i stdin -o myreads.txt.gz

Standard output is specified in an analogous manner, e.g.

java -jar triagetools.jar length --length 50 -i allreads.txt.gz > myreads.txt
java -jar triagetools.jar length --length 50 -i allreads.txt.gz -o stdout > myreads.txt

Note: some tools require more than one input or output stream. Stdin and stdout can be used even in such situations, but the results must then be interpreted with great care.
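Because stdin and stdout are the defaults, a tool can sit in the middle of a pipeline with decompression and compression handled by other programs. The sketch below wraps such a chain in a function. The function name filter_gz and the TRIAGE variable are hypothetical conveniences introduced here; the default filter is the length command used in the examples above, and setting TRIAGE=cat lets the plumbing be tried without the jar.

```shell
# filter_gz IN.gz OUT.gz
# Decompress IN.gz, pass the reads through a filter on stdin/stdout,
# and compress the result into OUT.gz.
# TRIAGE is a hypothetical override; by default the filter is the
# length command from the examples above.
filter_gz() {
    triage=${TRIAGE:-java -jar triagetools.jar length --length 50}
    gzip -dc "$1" | $triage | gzip -c > "$2"
}
```

Calling filter_gz allreads.txt.gz myreads.txt.gz should then be comparable to giving the tool -i and -o directly, with the compression work moved to separate processes.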


Multi-threading

As of version 0.2.0, all tools can be used in single-threaded or multi-threaded mode, which is set using the --multithreaded option. The two commands below perform the same computation, the first in single-threaded and the second in multi-threaded mode.

java -jar triagetools.jar duplicates --multithreaded false \
        -i allreads.txt.gz -o myreads.txt.gz
java -jar triagetools.jar duplicates --multithreaded true \
        -i allreads.txt.gz -o myreads.txt.gz

Multi-threading is implemented so that the outputs from these commands are exactly the same. In particular, the ordering of reads in the output files is identical.
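One way to check this on your own data is to write the two runs to different files and compare their decompressed contents. The sketch below is a hypothetical helper (same_reads is not part of the tools); it compares decompressed content rather than the compressed bytes, so the check does not depend on the two gzip streams being byte-identical.

```shell
# same_reads A.gz B.gz
# Succeeds (exit status 0) if the two gzip files decompress to identical
# content, including the order of reads; returns 1 if they differ and
# 2 if a temporary file could not be created or A.gz could not be read.
# Hypothetical helper for illustration; not part of triagetools.
same_reads() {
    tmp=$(mktemp) || return 2
    gzip -dc "$1" > "$tmp" || { rm -f "$tmp"; return 2; }
    gzip -dc "$2" | cmp -s "$tmp" -    # "-" makes cmp read the second input from stdin
    status=$?
    rm -f "$tmp"
    return $status
}
```

For instance, after running the two commands above with outputs named single.txt.gz and multi.txt.gz (names chosen here for illustration), same_reads single.txt.gz multi.txt.gz should succeed.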

Internally, parallelism is implemented in two ways: via a read-classify-write chain, and by processing (i.e. compressing) paired output files in parallel.

In practice, the largest performance benefits are observed when processing paired reads and writing to separate compressed files. In such cases the speedup over single-threaded mode can be about a factor of two. When the output is very small or the reads are single-ended, multi-threading may not run any faster.

