Faster2
Faster2 is an extensible C++11 framework and program for efficient access to, and extraction of, DNA/RNA and protein sequences from FASTA and FASTQ files. It works with large collections of raw as well as compressed files, and is built around a set of filters that can be organized into a pipeline. Faster2 indexes its input data in order to accelerate all supported operations. It can be easily customized and extended with new filters, and its pipeline-building subsystem can be incorporated into other tools. Faster2 is neither a database system nor a data analytics tool. Its sole purpose is to simplify tedious operations that bioinformaticians and computational biologists perform routinely, and that otherwise often require writing specialized text-processing scripts.
Faster2 is written in C++11 and extensively uses the Boost library.
The latest version of Faster2 is 0.31 (2014-09-09), and it can be obtained from here.
The tarball provides a README file that explains how to install Faster2.
If you are a Linux user, you can download our 64-bit executable from here.
You can download pre-release updates from GitLab.
In this tutorial I will first show you how to download and install Faster2, and then how
to use some of its basic filters. At the end I will explain how to write a simple filter
for Faster2 and how to add it to the main program.
Let us start with downloading and compiling Faster2. In your favorite
terminal, enter the directory that will be our working location. For
example, I will use the folder tmp in my home directory:
$ cd
$ cd tmp/
Now run wget to get the latest Faster2 archive:
$ wget http://www.jzola.org/faster2/faster2-current.tar.bz2
$ ls -la
total 20
drwxr-xr-x 2 zola zola 4096 xxx x xx:xx .
drwxr-xr-x 68 zola zola 4096 xxx x xx:xx ..
-rw-r--r-- 1 zola zola 10571 xxx x xx:xx faster2-current.tar.bz2
The next step is to uncompress the archive with tar and bzip2:
$ tar xfjv faster2-current.tar.bz2
faster2/
faster2/bio/
faster2/bio/fastx_iterator.hpp
faster2/AbstractFilter.hpp
faster2/faster2.cpp
faster2/index.hpp
faster2/NamesFilter.hpp
faster2/jaz/
faster2/jaz/string.hpp
faster2/jaz/LICENSE
faster2/jaz/algorithm.hpp
faster2/jaz/files.hpp
faster2/Makefile
faster2/SampleFilter.hpp
faster2/SelectFilter.hpp
faster2/README
faster2/FilterFilter.hpp
faster2/LICENSE
faster2/ReportFilter.hpp
faster2/FilterFactory.hpp
faster2/pipe.hpp
faster2/stream.hpp
faster2/PrintFilter.hpp
Voila, we are ready to compile. Faster2 makes extensive use of C++11
features and the Boost library. These days Boost is routinely
provided with any Linux distro, so just make sure that you have it
installed. C++11 support, on the other hand, varies between compilers,
but GCC and Clang seem to be the most advanced. I tested Faster2
with g++ 4.6 and clang++ 3.1, and so I recommend that you use one of
them. Let us take a look at the Faster2 Makefile:
$ cd faster2
$ head -n 10 Makefile
CXX=g++
#BOOST_INCLUDE=-I/usr/local/boost/include
#BOOST_LIB=-L/usr/local/boost/lib
CXXFLAGS=-std=c++0x -O3 -I. $(BOOST_INCLUDE)
#CXXFLAGS=-std=c++0x -stdlib=libc++ -O3 -I. $(BOOST_INCLUDE)
LDLIBS=$(BOOST_LIB) -lboost_iostreams -lboost_system -lboost_filesystem
This Makefile requires no tuning if you have the standard Boost
installation and you are fine with using GCC. If you want to change
the compiler, simply edit the CXX variable and, if you need to set
compiler-specific options, CXXFLAGS as well. For instance, to use
Clang I make the following changes:
$ head -n 10 Makefile
CXX=clang++
#BOOST_INCLUDE=-I/usr/local/boost/include
#BOOST_LIB=-L/usr/local/boost/lib
#CXXFLAGS=-std=c++0x -O3 -I. $(BOOST_INCLUDE)
CXXFLAGS=-std=c++0x -stdlib=libc++ -O3 -I. $(BOOST_INCLUDE)
LDLIBS=$(BOOST_LIB) -lboost_iostreams -lboost_system -lboost_filesystem
This enables the clang++ compiler and makes it use the libc++ standard
library (which must be available in the system). If the Boost library is
not installed in the default location, you should uncomment and
customize the BOOST_INCLUDE and BOOST_LIB variables. For instance, in
the Makefile distributed with Faster2 both variables are prepared for
the case where Boost has been installed in the /usr/local/boost
directory. OK, we are ready to run make:
$ make
clang++ -std=c++0x -stdlib=libc++ -O3 -I. -I/usr/local/boost/include faster2.cpp
-o faster2 -L/usr/local/boost/lib -lboost_iostreams -lboost_system -lboost_filesystem
If everything ran smoothly, we can launch faster2:
$ ./faster2
Version: faster2 0.1 2012-06-22
Copyright: (c) 2012 Jaroslaw Zola <jaroslaw.zola@gmail.com>
License: Distributed under the MIT License
Usage: faster2 DIR COMMAND|FILTER[,FILTER1,FILTER2,...]
where DIR is the database directory
and COMMAND is one of:
index ['fasta'|'fastq'] create database index
and FILTER is any of:
filter <'N'|size> filter by string or size
names [file] write names of sequences
print [file] ['fasta'|'fastq'] write sequences
report [file] write report
sample <size> [seed] create sample without replacement
select <name> [name1 name2 ...] select by name
Congratulations, the installation is complete. Now you can (but do not
have to) copy the faster2 executable to a more convenient location, for
instance the bin/ directory in your home folder (if you have one), and
you can remove the unpacked faster2/ directory.
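If you decide to copy the executable, the step can be scripted. The helper below is only a sketch: install_binary is a name made up for this example, and the default destination $HOME/bin is an assumption that you should adjust to your setup.

```shell
# Hypothetical helper (not part of Faster2): copy an executable into a
# destination directory, creating the directory if it does not exist.
install_binary() {
  src=$1
  dest=${2:-$HOME/bin}          # assumed default location
  mkdir -p "$dest" || return 1
  cp "$src" "$dest/" || return 1
  chmod +x "$dest/$(basename "$src")"
}
```

For example, install_binary ./faster2 copies the program to bin/ in your home directory. Make sure that this directory is listed in your PATH, otherwise the shell will not find faster2.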
Faster2 relies on data indexing. The principal idea is the following: first we create an index of the FASTA or FASTQ files in a given directory, which is a one-time effort. Next, we pass the index to filters, which in turn are organized into a pipeline implementing the task of interest. Faster2 works with raw text files as well as gzip- and bzip2-compressed files. However, it does not support directories that contain FASTA and FASTQ files at the same time. In other words, it can index either FASTA or FASTQ files, but not both together. Consider the following data directory with FASTA files:
$ ls -la data/
drwxr-xr-x 2 zola zola xxxx xxx x xx:xx .
drwxr-xr-x 5 zola zola xxxx xxx x xx:xx ..
-rw-r--r-- 1 zola zola 8005 xxx x xx:xx file0.fa
-rw-r--r-- 1 zola zola 2682 xxx x xx:xx file1.fa.gz
-rw-r--r-- 1 zola zola 2620 xxx x xx:xx file2.fa.bz2
To index this directory we call the index command and specify the type
of sequence. Here, we can use either "nt" for DNA/RNA or "aa" for proteins:
$ ./faster2 data/ index nt
$ ls -la data/
total 28
drwxr-xr-x 2 zola zola 4096 xxx x xx:xx .
drwxr-xr-x 5 zola zola 4096 xxx x xx:xx ..
-rw-r--r-- 1 zola zola 219 xxx x xx:xx .f2index
-rw-r--r-- 1 zola zola 8005 xxx x xx:xx file0.fa
-rw-r--r-- 1 zola zola 2682 xxx x xx:xx file1.fa.gz
-rw-r--r-- 1 zola zola 2620 xxx x xx:xx file2.fa.bz2
Observe that a new file has been created in the indexed directory. This
is a binary file that stores the actual index. Note also that Faster2
transparently handled different data compression formats.
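Since a directory that mixes FASTA and FASTQ files cannot be indexed, it may help to check a directory before indexing it. The sketch below guesses the format of each file from its first character ('>' for FASTA, '@' for FASTQ), which is a common convention and not anything specific to Faster2. The function names are hypothetical, and the check assumes that gzip and bzip2 are installed if compressed files are present.

```shell
# Print the first character of a plain, gzip or bzip2 file.
first_char() {
  case "$1" in
    *.gz)  gzip -dc "$1" ;;
    *.bz2) bzip2 -dc "$1" ;;
    *)     cat "$1" ;;
  esac | head -c 1
}

# Print "fasta", "fastq", "mixed" or "unknown" for the files in directory $1.
# Hidden files (such as an existing index) are skipped by the glob.
guess_format() {
  fasta=0
  fastq=0
  for f in "$1"/*; do
    [ -f "$f" ] || continue
    case "$(first_char "$f")" in
      '>') fasta=1 ;;
      '@') fastq=1 ;;
    esac
  done
  case "$fasta$fastq" in
    10) echo fasta ;;
    01) echo fastq ;;
    11) echo mixed ;;
    *)  echo unknown ;;
  esac
}
```

Running guess_format data/ on the directory above should print fasta. A result of mixed means the files need to be split into separate directories before indexing.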
By default, Faster2 assumes that files are in the FASTA format. To create
an index of FASTQ files it is sufficient to add the fastq option to the
indexing command:
$ ./faster2 data/ index nt fastq
Error: failed to build index
One important thing to keep in mind is that the index is static and captures the state of the indexed directory as it was at the time of indexing. Hence, whenever the content of the directory changes, the index must be recreated. Finally, Faster2 is very flexible and handles all possible variants of FASTA and FASTQ files.
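Because the index is static, it is easy to forget to rebuild it. A simple timestamp comparison can serve as a reminder. The sketch below assumes that the index is stored in the file .f2index inside the indexed directory, as in the listing above, and that only files directly in that directory matter. index_is_stale is a hypothetical helper, and it cannot detect files that were deleted after indexing.

```shell
# Succeed (exit status 0) if directory $1 has no index, or if any regular
# file in it was modified after the index was written.
index_is_stale() {
  dir=$1
  idx=$dir/.f2index             # index file name as seen in the listing
  [ -f "$idx" ] || return 0     # no index at all: treat as stale
  newer=$(find "$dir" -maxdepth 1 -type f ! -name .f2index -newer "$idx")
  [ -n "$newer" ]
}
```

A possible use is: if index_is_stale data/; then ./faster2 data/ index nt; fi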
Once the index has been created, we are ready to start using filters. In general, we can specify as many filters as we want, in any order we like. Filters must be separated by commas. Collectively, the specified filters form a pipeline in which the output of one filter is passed as input to the next. Faster2 implicitly adds to the beginning of the pipeline a filter that selects all sequences. The example below demonstrates a simple pipeline that selects sequences that are strictly DNA/RNA or protein (e.g. DNA/RNA sequences with no 'N' or 'X' bases), then creates a random sample of size 10 from the selected sequences, and finally writes it to a FASTA file.
$ ./faster2 data/ filter N, sample 10, print sample10.fa
The most basic filter we can run is report. This filter summarizes the
output of the previous filter in the pipeline. For instance, the
following command will generate a summary of the entire indexed data:
$ ./faster2 data/ report
total files 3
total reads 3
quality scores no
clean sequences 3
average sequence 7915.67
shortest sequence 7788
longest sequence 8166
By default, the report is written to the standard output. However, you
can specify the name of a file in which the report should be stored. Keep
in mind that the special name '-' denotes the standard output. In fact,
this is true for other filters that write output, such as print and names.
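To sanity-check a report, similar numbers can be recomputed for a single uncompressed FASTA file with awk. This is an independent sketch and not how Faster2 computes its report. It assumes that the average, shortest and longest values in the report refer to sequence lengths, and fasta_stats is a hypothetical name.

```shell
# Print the number of sequences and their average, shortest and longest
# length for one plain-text FASTA file given as $1.
fasta_stats() {
  awk '
    # Account for the sequence collected so far (if any).
    function tally() {
      if (!seen) return
      n++
      total += len
      if (n == 1 || len < min) min = len
      if (len > max) max = len
    }
    /^>/ { tally(); seen = 1; len = 0; next }   # header starts a new record
         { len += length($0) }                  # sequence may span lines
    END {
      tally()
      if (n) printf "sequences %d\naverage %.2f\nshortest %d\nlongest %d\n",
                    n, total / n, min, max
    }
  ' "$1"
}
```

For the example directory, fasta_stats data/file0.fa covers only the uncompressed file. Compressed files would have to be decompressed first, for instance with gzip -dc.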
As we already explained, data flows between filters sequentially.
At any point in the pipeline we can insert a print filter that will
write the result of the processing to a FASTA or FASTQ file. For
instance, to write to the standard output a randomly selected sequence
from the pool of all sequences, we can combine the sample and print
filters:
$ ./faster2 data/ sample 1, print
anonymize
Example: anonymize
compact
Example: compact, print
filter <'N'|size>
size specifies the threshold length. It must be prefixed with '-' to
remove sequences longer than the given threshold, or with '+' to remove
sequences shorter than the given threshold. If 'N' is specified,
sequences that are not strictly DNA/RNA (i.e. contain bases other than
A, C, T, G) or are not strictly protein will be removed. Note that the
sequence type (i.e. DNA/RNA vs. protein) is decided during indexing.
Examples:
filter N
filter +250
filter -425, filter +425
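One reading of the filter rules can be written down as two small shell predicates. They only illustrate the semantics on a single length or sequence string and are not Faster2 code. The names keeps_length and is_strict_nt are made up, and keeping a sequence whose length equals the threshold is an assumption based on the wording "longer than" and "shorter than".

```shell
# Succeed if a sequence of length $2 would survive "filter $1".
keeps_length() {
  spec=$1
  len=$2
  n=${spec#[+-]}                 # threshold without its sign
  case "$spec" in
    -*) [ "$len" -le "$n" ] ;;   # '-' removes sequences longer than n
    +*) [ "$len" -ge "$n" ] ;;   # '+' removes sequences shorter than n
    *)  return 2 ;;              # size must carry a '+' or '-' prefix
  esac
}

# Succeed if $1 is non-empty and consists only of A, C, G, T (any case).
is_strict_nt() {
  case "$1" in
    ''|*[!ACGTacgt]*) return 1 ;;
    *) return 0 ;;
  esac
}
```

Under this reading, both keeps_length -425 425 and keeps_length +425 425 succeed, so chaining filter -425, filter +425 should leave only sequences of length exactly 425.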
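names [file]
file specifies the name of the output file, where '-' represents the
standard output. If no file name is specified, the output is written to
the standard output. Names are written one per line.
Example with two indexed directories, dirB and dirA (print the sequences in dirB whose names also occur in dirA):
faster2 dirA/ names | read -a LIST
faster2 dirB/ select "${LIST[@]}", print
If read -a does not work (in bash, for example, a read at the end of a pipeline runs in a subshell, so LIST is empty afterwards), try:
faster2 dirA/ names > names.txt
mapfile -t LIST < names.txt
faster2 dirB/ select "${LIST[@]}", print
Alternatively, write the names to a file and pass that file to select with the @ prefix:
faster2 dirA/ names names.txt
faster2 dirB/ select @names.txt, print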
print [file] ['fasta'|'fastq']
file specifies the name of the output file, where '-' represents the
standard output. If no file name is specified, the output is written to
the standard output. By default the output is written in FASTA format.
To select the format explicitly, 'fasta' for FASTA or 'fastq' for FASTQ
can be specified as the second argument. Note that 'fastq' can be used
only if the indexed data is in the FASTQ format. Each output sequence is
written on a single line, in uppercase. Sequences are printed in the
order in which they are stored in the original files (to speed up extraction).
Examples:
Write to the standard output in FASTA format: print
Write in FASTQ format to example.fq: print example.fq fastq
report [file]
file specifies the name of the output file, where '-' represents the
standard output. If no file name is specified, the output is written to
the standard output.
Example: report
sample <size> [seed]
Creates a random sample, without replacement, from the selected
sequences. size denotes the size of the sample. seed can be specified to
initialize the random number generator used to extract the sample. If
seed is not specified, the generator is initialized from a system random device.
Example: sample 8
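Sampling without replacement with an optional seed can be illustrated with reservoir sampling over lines of text, for example over the output of the names filter. This is a generic technique, and nothing here says which algorithm Faster2 uses internally. sample_lines is a hypothetical helper. Unlike Faster2, awk seeds its generator from the current time, not from a system random device, when no seed is given.

```shell
# Print up to $1 distinct lines chosen at random from standard input.
# An optional seed $2 makes the choice repeatable for a given awk.
sample_lines() {
  awk -v k="$1" -v seed="$2" '
    BEGIN { if (seed != "") srand(seed); else srand() }
    NR <= k { pick[NR] = $0; next }          # fill the reservoir first
    {
      j = int(rand() * NR) + 1               # uniform in 1..NR
      if (j <= k) pick[j] = $0               # replace with probability k/NR
    }
    END {
      m = (NR < k) ? NR : k
      for (i = 1; i <= m; i++) print pick[i]
    }
  '
}
```

For instance, faster2 dirA/ names | sample_lines 8 42 would print eight distinct names. Repeating the command with the same seed and the same awk implementation gives the same eight names.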
select '@'<name> | <name> [name1 name2 ...]
name, if prefixed with @, is a path to a file storing the names of the sequences to extract (one name per line).
Otherwise it is a string representing a sequence name; in that case it can contain spaces and any characters permitted in the FASTA and FASTQ formats. If a name contains spaces, it should be delimited by double quotes.
Note: in versions prior to 0.25 commas are not handled properly! If a sequence name is listed multiple times, it will be extracted only once.
Examples:
Select sequences named Seq A and Seq B: select "Seq A" "Seq B"
Select sequences named Seq, A and B: select Seq A B
Select sequences whose names are listed in /tmp/tost.txt: select @/tmp/tost.txt
Names of sequences can be exported from one directory (see the names filter). You can then use the exported names
to query the other directory (see the select filter).
Copyright © 2012-2016 Jaroslaw Zola