Faster2

News

2014-09-09 Version 0.31 with minor updates and new filter released.
2014-03-17 Version 0.29 released. A new filter to anonymize names of sequences introduced.
2013-11-06 Important bug fix to the sample filter released. Please use the current version 0.28 instead of 0.27.

What is Faster2

Faster2 is an extensible C++11 framework and program for efficient access to and extraction of DNA/RNA and protein sequences from FASTA and FASTQ files. It works with large file collections of raw as well as compressed data, and is built around a set of filters that can be organized into a pipeline. Faster2 indexes the input data in order to accelerate all supported operations. It can be easily customized and extended with new filters, and its pipeline-building subsystem can be incorporated into other tools. Faster2 is neither a database system nor a data analytics tool. Its sole purpose is to simplify tedious operations that bioinformaticians and computational biologists perform routinely, and that otherwise often require writing specialized text-processing scripts.

Requirements

Faster2 is written in C++11 and extensively uses the Boost library:

A C++11-capable compiler with support for features such as initializer lists and lambdas.
I recommend using either GCC ≥4.6 or Clang ≥3.1.
The Boost library with support for iostreams and filesystem.

Download

The latest version of Faster2 is 0.31 (2014-09-09), and it can be obtained from here.
The tarball provides a README file that explains how to install Faster2.
If you are a Linux user, you can download our 64-bit executable from here.
You can download pre-release updates from GitLab.

Tutorial

In this tutorial I will first show you how to download and install Faster2, and then how to use some of its basic filters. At the end I will explain how to write a simple filter for Faster2 and how to add it to the main program.

Installation

Let us start by downloading and compiling Faster2. In your favorite terminal, enter the directory that will be our working location. For example, I will use the folder tmp in my home directory:

$ cd
$ cd tmp/

Now run wget to get the latest Faster2 archive:

$ wget http://www.jzola.org/faster2/faster2-current.tar.bz2
$ ls -la
total 20
drwxr-xr-x 2 zola zola 4096 xxx x xx:xx .
drwxr-xr-x 68 zola zola 4096 xxx x xx:xx ..
-rw-r--r-- 1 zola zola 10571 xxx x xx:xx faster2-current.tar.bz2

The next step is to uncompress the archive with tar and bzip2:

$ tar xfjv faster2-current.tar.bz2
faster2/
faster2/bio/
faster2/bio/fastx_iterator.hpp
faster2/AbstractFilter.hpp
faster2/faster2.cpp
faster2/index.hpp
faster2/NamesFilter.hpp
faster2/jaz/
faster2/jaz/string.hpp
faster2/jaz/LICENSE
faster2/jaz/algorithm.hpp
faster2/jaz/files.hpp
faster2/Makefile
faster2/SampleFilter.hpp
faster2/SelectFilter.hpp
faster2/README
faster2/FilterFilter.hpp
faster2/LICENSE
faster2/ReportFilter.hpp
faster2/FilterFactory.hpp
faster2/pipe.hpp
faster2/stream.hpp
faster2/PrintFilter.hpp

Voila, we are ready to compile. Faster2 makes extensive use of C++11 features and the Boost library. These days Boost is routinely provided with any Linux distro, so just make sure that you have it installed. C++11 support, on the other hand, varies between compilers, but GCC and Clang seem to be the most advanced. I tested Faster2 with g++ 4.6 and clang++ 3.1, so I recommend that you use one of them. Let us take a look at the Faster2 Makefile:

$ cd faster2
$ head -n 10 Makefile
CXX=g++

#BOOST_INCLUDE=-I/usr/local/boost/include
#BOOST_LIB=-L/usr/local/boost/lib

CXXFLAGS=-std=c++0x -O3 -I. $(BOOST_INCLUDE)
#CXXFLAGS=-std=c++0x -stdlib=libc++ -O3 -I. $(BOOST_INCLUDE)

LDLIBS=$(BOOST_LIB) -lboost_iostreams -lboost_system -lboost_filesystem

This Makefile requires no tuning if you have a standard Boost installation and you are fine with using GCC. If you want to change the compiler, simply edit the CXX variable and, if you need compiler-specific options, CXXFLAGS. For instance, to use Clang I make the following changes:

$ head -n 10 Makefile
CXX=clang++

#BOOST_INCLUDE=-I/usr/local/boost/include
#BOOST_LIB=-L/usr/local/boost/lib

#CXXFLAGS=-std=c++0x -O3 -I. $(BOOST_INCLUDE)
CXXFLAGS=-std=c++0x -stdlib=libc++ -O3 -I. $(BOOST_INCLUDE)

LDLIBS=$(BOOST_LIB) -lboost_iostreams -lboost_system -lboost_filesystem

This enables the clang++ compiler and makes it use the libc++ standard library (which must be available on the system). If the Boost library is not installed in the default location, you should uncomment and customize the BOOST_INCLUDE and BOOST_LIB variables. For instance, in the Makefile distributed with Faster2 both variables are prepared for the case where Boost has been installed in the /usr/local/boost directory. OK, we are ready to run make.

$ make
clang++ -std=c++0x -stdlib=libc++ -O3 -I. -I/usr/local/boost/include faster2.cpp
-o faster2 -L/usr/local/boost/lib -lboost_iostreams -lboost_system -lboost_filesystem

If everything ran smoothly we can launch faster2:

$ ./faster2
Version: faster2 0.1 2012-06-22
Copyright: (c) 2012 Jaroslaw Zola <jaroslaw.zola@gmail.com>
License: Distributed under the MIT License

Usage: faster2 DIR COMMAND|FILTER[,FILTER1,FILTER2,...]
where DIR is the database directory
and COMMAND is one of:
   index ['fasta'|'fastq']           create database index
and FILTER is any of:
   filter <'N'|size>                 filter by string or size
   names [file]                      write names of sequences
   print [file] ['fasta'|'fastq']    write sequences
   report [file]                     write report
   sample <size> [seed]              create sample without replacement
   select <name> [name1 name2 ...]   select by name

Congratulations, the installation is complete. Now you can (but do not have to) copy the faster2 executable to some more convenient location, for instance the bin/ directory in your home folder (if you have one), and you can remove the unpacked faster2/ directory.

Creating Index

Faster2 relies on data indexing. The principal idea is as follows: first we create an index of the FASTA or FASTQ files in a given directory, which is a one-time effort. Next, we pass the index to filters, which in turn are organized into a pipeline implementing the task of interest. Faster2 works with raw text files as well as gzip- and bzip2-compressed files. However, it does not support directories that contain FASTA and FASTQ files at the same time. In other words, it can index either FASTA or FASTQ files, but not both together. Consider the following data directory with FASTA files:

$ ls -la data/
drwxr-xr-x 2 zola zola xxxx xxx x xx:xx .
drwxr-xr-x 5 zola zola xxxx xxx x xx:xx ..
-rw-r--r-- 1 zola zola 8005 xxx x xx:xx file0.fa
-rw-r--r-- 1 zola zola 2682 xxx x xx:xx file1.fa.gz
-rw-r--r-- 1 zola zola 2620 xxx x xx:xx file2.fa.bz2

To index this directory we call the index command and specify the type of sequence. Here, we can use either "nt" for DNA/RNA or "aa" for proteins:

$ ./faster2 data/ index nt
$ ls -la data/
total 28
drwxr-xr-x 2 zola zola 4096 xxx x xx:xx .
drwxr-xr-x 5 zola zola 4096 xxx x xx:xx ..
-rw-r--r-- 1 zola zola 219 xxx x xx:xx .f2index
-rw-r--r-- 1 zola zola 8005 xxx x xx:xx file0.fa
-rw-r--r-- 1 zola zola 2682 xxx x xx:xx file1.fa.gz
-rw-r--r-- 1 zola zola 2620 xxx x xx:xx file2.fa.bz2

Observe that a new file has been created in the indexed directory. This is a binary file that stores the actual index. Note also that Faster2 transparently handled the different data compression formats. By default, Faster2 assumes that files are in the FASTA format. To create an index of FASTQ files, it is sufficient to add the fastq option to the indexing command:

$ ./faster2 data/ index nt fastq
Error: failed to build index

One important thing to keep in mind is that the index is static and captures the state of the indexed directory at the time of indexing. Hence, whenever the content of the directory changes, the index must be recreated. Finally, Faster2 is very flexible and handles all possible variants of FASTA and FASTQ files.
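
Because the index is static, a script that wraps Faster2 may want to check whether the index is out of date before running a pipeline. The following is a minimal Python sketch and not part of Faster2. The helper index_is_stale is hypothetical, the index file name .f2index is taken from the listing above, and comparing modification times is only an assumption about what counts as a change (files removed from the directory are not detected).

```python
import os

INDEX_NAME = ".f2index"  # index file name as shown in the directory listing


def index_is_stale(directory):
    """Return True if the index is missing or older than any data file.

    Assumption: modification times are a good enough proxy for
    "the content of the directory changed".
    """
    index_path = os.path.join(directory, INDEX_NAME)
    if not os.path.isfile(index_path):
        return True
    index_mtime = os.path.getmtime(index_path)
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.name == INDEX_NAME or not entry.is_file():
                continue
            if entry.stat().st_mtime > index_mtime:
                return True  # a data file was modified after indexing
    return False
```

If the function returns True, re-run the index command as shown above, e.g. ./faster2 data/ index nt.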

Using Filters

Once the index has been created, we are ready to start using filters. In general, we can specify as many filters as we want, in any order we like. Filters must be separated by commas. Together, the specified filters form a pipeline in which the output of one filter is passed as input to the next. Faster2 implicitly adds to the beginning of the pipeline a filter that selects all sequences. The example below demonstrates a simple pipeline that selects sequences that are strictly DNA/RNA or protein (e.g. DNA/RNA sequences with no 'N' or 'X' bases), then creates a random sample of size 10 from the selected sequences, and writes it to a FASTA file.

$ ./faster2 data/ filter N, sample 10, print sample10.fa
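
When Faster2 is called from a script rather than from an interactive shell, the pipeline has to be turned into an argument list. The Python sketch below shows one way to do this. The helper names build_command and run_pipeline are hypothetical, and the only thing assumed about Faster2 is the command-line form used in the example above, where the shell passes the comma attached to the last word of each filter.

```python
import subprocess


def build_command(directory, filters, program="faster2"):
    """Build the argument list for a Faster2 pipeline.

    `filters` is a list of token lists, e.g. [["filter", "N"], ["sample", 10]].
    A comma is appended to the last token of every filter except the final
    one, which is what the shell passes for "filter N, sample 10, print".
    """
    argv = [program, directory]
    for position, tokens in enumerate(filters):
        if not tokens:
            raise ValueError("empty filter in pipeline")
        tokens = [str(token) for token in tokens]
        if position < len(filters) - 1:
            tokens[-1] += ","
        argv.extend(tokens)
    return argv


def run_pipeline(directory, filters, program="faster2"):
    """Run the pipeline and return whatever it wrote to standard output."""
    result = subprocess.run(build_command(directory, filters, program),
                            check=True, capture_output=True, text=True)
    return result.stdout
```

For instance, build_command("data/", [["filter", "N"], ["sample", 10], ["print", "sample10.fa"]]) yields the same arguments as the command line above, with faster2 in place of ./faster2.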

Creating Report

The most basic filter we can run is report. This filter summarizes the output of the previous filter in the pipeline. For instance, the following command will generate a summary of the entire indexed data:

$ ./faster2 data/ report
total files     3
total reads     3
quality scores  no
clean sequences     3
average sequence    7915.67
shortest sequence   7788
longest sequence    8166

By default, the report is written to the standard output. However, you can specify the name of a file in which the report should be stored. Keep in mind that the special name '-' denotes the standard output. The same is true for other filters that write output, such as print and names.
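
If the report is consumed by another script, it can be convenient to turn it into a dictionary. The sketch below is a hypothetical Python helper, parse_report, and it assumes that the report keeps the "label value" layout of the sample output above, with the value as the last whitespace-separated field on each line.

```python
def parse_report(text):
    """Parse 'label   value' lines into a dict, converting numeric values."""
    report = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)  # the value is the last field on the line
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        label, value = parts[0].strip(), parts[1]
        try:
            report[label] = int(value)
        except ValueError:
            try:
                report[label] = float(value)
            except ValueError:
                report[label] = value  # e.g. "no" for quality scores
    return report
```

With the sample output above, parse_report(text)["total reads"] is the integer 3 and parse_report(text)["quality scores"] is the string "no".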

Writing Selected Sequences

As we already explained, data flows between filters sequentially. At any point in the pipeline we can insert the print filter, which writes the result of the processing to a FASTA or FASTQ file. For instance, to write a randomly selected sequence from the pool of all sequences to the standard output, we can combine the sample and print filters:

$ ./faster2 data/ sample 1, print

Filters

anonymize
anonymize sequences by masking their original names with numbers.

Examples:
anonymize: anonymize

compact
remove sequences with duplicate names. If two sequences have the same name only the first occurrence will be retained. Note that this filter does not compare the actual sequences.

Examples:
print unique sequences: compact, print

filter <'N'|size>
filter sequences by their content or length.

size specifies a threshold length that must be prefixed with '-' to remove sequences longer than the given threshold, or with '+' to remove sequences shorter than the given threshold. If 'N' is specified, sequences that are not strictly DNA/RNA (i.e. contain bases other than A, C, G, T) or are not strictly protein will be removed. Note that the sequence type (i.e. DNA/RNA vs. protein) is decided during indexing.

Examples:
remove incomplete sequences: filter N
remove sequences shorter than 250 bp: filter +250
extract sequences of exactly 425 bp: filter -425, filter +425

names [file]
print names of selected sequences.

file specifies the name of the output file, where '-' represents the standard output. If no file name is specified the output is written to the standard output. Names are written line by line.

Examples:
this shows how, using Bash and Faster2, one can extract sequences that are common to the directories dirA and dirB:
faster2 dirA/ names | read -a LIST
faster2 dirB/ select "${LIST[@]}", print
If read -a does not work (in Bash, read at the end of a pipeline normally runs in a subshell, so LIST is empty afterwards), try:
faster2 dirA/ names > names.txt
mapfile -t LIST < names.txt
faster2 dirB/ select "${LIST[@]}", print
Since version 0.27 you can use (recommended):
faster2 dirA/ names names.txt
faster2 dirB/ select @names.txt, print

print [file] ['fasta'|'fastq']
print selected sequences in FASTA or FASTQ format.

file specifies the name of the output file, where '-' represents the standard output. If no file name is specified, the output is written to the standard output. By default the output is written in FASTA format. To change the format, 'fasta' for FASTA or 'fastq' for FASTQ can be specified as the second argument. Note that 'fastq' can be used only if the indexed data is in the FASTQ format. Each output sequence is written on a single line, in uppercase. Sequences are printed in the order in which they are stored in the original files (to speed up extraction).

Examples:
write FASTA to the standard output: print
write FASTQ to example.fq: print example.fq fastq

report [file]
generate comprehensive report about selected sequences.

file specifies the name of the output file, where '-' represents the standard output. If no file name is specified the output is written to the standard output.

Examples:
write report to the standard output: report

sample <size> [seed]
create a random sample (without replacement) of size size from selected sequences.

size denotes the size of the sample. seed can be specified to initialize the random number generator used to extract the sample. If seed is not specified, the generator is initialized from a system random device.

Examples:
create random sample of size 8: sample 8

select '@'<name> | <name> [name1 name2 ...]
extract sequences that exactly match one of the specified names.

name, if prefixed with @, is a path to a file storing the names of sequences to extract (one name per line). Otherwise it is a string representing a sequence name. In that case it can contain spaces and any characters permitted in the FASTA and FASTQ formats. If a given name contains spaces, it should be delimited by double quotes. Note: in versions prior to 0.25 commas are not handled properly! If a given sequence name is listed multiple times, it will be extracted only once.

Examples:
extract sequences named Seq A and Seq B: select "Seq A" "Seq B"
extract sequences named Seq, A and B: select Seq A B
extract sequences with names stored in file /tmp/tost.txt: select @/tmp/tost.txt
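
To make the filter descriptions above concrete, here is a toy Python model of four of them. It is only a sketch of the documented behavior, not Faster2's implementation: it does not read FASTA/FASTQ files or call Faster2, records are modeled as (name, sequence) pairs, and all function names are hypothetical. Keeping the original record order in sample is a modeling choice, motivated by the note that print writes sequences in file order.

```python
import random


def length_filter(records, spec):
    """Model of `filter -N` / `filter +N`.

    '-N' removes sequences longer than N, '+N' removes sequences shorter than N.
    """
    sign, threshold = spec[0], int(spec[1:])
    if sign == "-":
        return [r for r in records if len(r[1]) <= threshold]
    if sign == "+":
        return [r for r in records if len(r[1]) >= threshold]
    raise ValueError("threshold must be prefixed with '-' or '+'")


def compact(records):
    """Model of `compact`: keep only the first record for every name."""
    seen, kept = set(), []
    for name, seq in records:
        if name not in seen:
            seen.add(name)
            kept.append((name, seq))
    return kept


def select(records, names):
    """Model of `select`: keep records whose name exactly matches a listed name."""
    wanted = set(names)  # a name listed several times counts once
    return [r for r in records if r[0] in wanted]


def sample(records, size, seed=None):
    """Model of `sample`: draw `size` records without replacement."""
    rng = random.Random(seed)  # seed=None lets Python choose its own seed
    chosen = set(rng.sample(range(len(records)), size))
    return [r for i, r in enumerate(records) if i in chosen]
```

For example, length_filter(length_filter(records, "-425"), "+425") keeps only records of exactly 425 residues, mirroring the pipeline filter -425, filter +425.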

FAQ

How can I cite Faster2?
Please cite this web page.

My C++ compiler does not compile Faster2, what should I do?
Unfortunately, the only option is to change (or update) your compiler (you will have to do this sooner or later anyway).

I miss certain functionality, what can I do?
You may try to implement your own filter. If you find this too difficult please contact me and I will consider adding it to the next Faster2 release.

Can I use Faster2 with paired reads?
The short answer is yes, but it is a bit tricky. You have to create two directories each storing one part of paired reads. You apply filters you want to one of them, and then export resulting names (see names filter). You can now use exported names to query the other directory (see select filter).
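
The procedure just described can be scripted. The Python sketch below builds and runs the two commands. The function names, directory names and the names file are hypothetical, and the sketch assumes that both directories are already indexed, that mates carry identical names in both directories (select matches names exactly), and that the file path contains no spaces.

```python
import subprocess


def paired_commands(dir_first, dir_second, filters, names_file, program="faster2"):
    """Return the two Faster2 commands for the paired-read procedure.

    `filters` is a list of filter strings such as ["filter N", "sample 100 42"].
    """
    first = filters + ["names " + names_file]      # filter, then export names
    second = ["select @" + names_file, "print"]    # query the other directory
    return [
        [program, dir_first] + ", ".join(first).split(),
        [program, dir_second] + ", ".join(second).split(),
    ]


def run_paired(dir_first, dir_second, filters, names_file, run=subprocess.run):
    """Run both commands in order; the second prints the mates to standard output."""
    for command in paired_commands(dir_first, dir_second, filters, names_file):
        run(command, check=True)
```

Note that the sequences selected in the first directory still have to be written out separately, for example by adding a print filter with a file name to the first pipeline.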