PPSeq 1.3 contains a parallel indexer, ppbwt-1.0, and a parallel aligner, ppalgn-1.3, released on October 1, 2013.
Download and extract the appropriate PPSeq-1.3 binary release and a sample archive into a fresh directory. Change to that directory.
>> Obtaining the required files: go to the PPSeq-1.3 release directory and download ppseq-1.3-Linux-x86_64-pthread.gz and ppseq-example-01.tar.gz. The first is the PPSeq-1.3 binary release and the second contains the sample files for the tutorial.
>> Installation: Extract the binary release and the sample archive into a fresh directory and then change to that directory. Do the following:
>> >> Extract the binary release like this:
gzip -d ppseq-1.3-Linux-x86_64-pthread.gz
>> >> Extract the sample archive like this:
tar zxvf ppseq-example-01.tar.gz
>> Check: make sure you have the binary ppseq-1.3-Linux-x86_64-pthread-free and the folder ppseq-1.3-Linux-x86_64-pthread-free-sample.
>> >> Check the shared libraries required by ppseq-1.3-Linux-x86_64-pthread-free like this:
user@LinuxBox:~/bin$ ldd ppseq-1.3-Linux-x86_64-pthread-free
linux-vdso.so.1 => (0x00007fff971ff000)
libmpi_cxx.so.0 => /usr/local/pkg/openmpi/lib/libmpi_cxx.so.0 (0x00007faac524e000)
libmpi.so.0 => /usr/local/pkg/openmpi/lib/libmpi.so.0 (0x00007faac4e6b000)
libopen-rte.so.0 => /usr/local/pkg/openmpi/lib/libopen-rte.so.0 (0x00007faac4be0000)
libopen-pal.so.0 => /usr/local/pkg/openmpi/lib/libopen-pal.so.0 (0x00007faac497e000)
libtorque.so.2 => /usr/local/pkg/torque-3.0.4/lib/libtorque.so.2 (0x00007faac4679000)
libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007faac4466000)
libnsl.so.1 => /lib/x86_64-linux-gnu/libnsl.so.1 (0x00007faac424e000)
libutil.so.1 => /lib/x86_64-linux-gnu/libutil.so.1 (0x00007faac404a000)
libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007faac3dc8000)
libstdc++.so.6 => /usr/lib/x86_64-linux-gnu/libstdc++.so.6 (0x00007faac3ac1000)
libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007faac38aa000)
libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007faac368e000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007faac3304000)
/lib64/ld-linux-x86-64.so.2 (0x00007faac546a000)
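The point of the ldd check is to confirm that every shared library (in particular the Open MPI and Torque libraries) resolves to a path on your system. As a minimal sketch, the same check can be automated by scanning the ldd output for entries reported as "not found". The function names missing_libraries and check_binary are hypothetical helpers, not part of PPSeq.

```python
import subprocess


def missing_libraries(ldd_output: str) -> list:
    """Return the names of shared libraries that ldd reports as 'not found'."""
    missing = []
    for line in ldd_output.splitlines():
        # ldd prints an unresolved dependency as "libfoo.so.0 => not found".
        if "=>" in line and "not found" in line:
            missing.append(line.split("=>")[0].strip())
    return missing


def check_binary(path: str) -> list:
    """Run ldd on the binary at `path` and return its unresolved libraries."""
    result = subprocess.run(["ldd", path], capture_output=True, text=True, check=True)
    return missing_libraries(result.stdout)


if __name__ == "__main__":
    # Illustrative ldd output, not taken from a real run.
    sample = (
        "libmpi.so.0 => not found\n"
        "libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007faac3304000)\n"
    )
    print(missing_libraries(sample))
```

An empty list from check_binary("ppseq-1.3-Linux-x86_64-pthread-free") would suggest that all dependencies are resolved, as in the listing above.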
>> Running ppbwt on a cluster with a PBS scheduler:
cd ppseq-1.3-Linux-x86_64-pthread-free-sample/bwt_chrU
qsub test_bwt.qsub
This directory contains the necessary input files:
- test_bwt.qsub: an example OpenPBS script for submitting the job to a cluster.
- ppbwt.config: the input script of ppbwt, in which the options are defined.
- init_data_bwt.cfg: defines all of the initial data files, such as the genome and short-read files.
After the job finishes, it also contains the result files:
- chrU.[1-4].bt2: the forward index of the reference genome.
- chrU.rev.[1-2].bt2: the mirror index of the reference genome.
- ./res contains the logfile.
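After the job finishes, it can be useful to confirm that all six index files listed above were written. The following is a minimal sketch under the assumption that the files follow exactly the naming pattern given above (chrU.[1-4].bt2 and chrU.rev.[1-2].bt2). The helper names are hypothetical and not part of PPSeq.

```python
from pathlib import Path


def expected_index_files(prefix: str) -> list:
    """List the six index file names for a reference with the given prefix."""
    forward = [f"{prefix}.{i}.bt2" for i in range(1, 5)]     # forward index, parts 1-4
    mirror = [f"{prefix}.rev.{i}.bt2" for i in range(1, 3)]  # mirror index, parts 1-2
    return forward + mirror


def missing_index_files(directory: str, prefix: str) -> list:
    """Return the expected index files that are not present in `directory`."""
    base = Path(directory)
    return [name for name in expected_index_files(prefix) if not (base / name).is_file()]


if __name__ == "__main__":
    # Run from inside the bwt_chrU sample directory after the job completes.
    print(missing_index_files(".", "chrU"))
```

An empty list means that every expected file exists in the directory.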
>> Running ppalgn on a cluster with a PBS scheduler:
cd ppseq-1.3-Linux-x86_64-pthread-free-sample/algn_chrU
qsub test_algn.qsub
This directory contains the necessary input files:
- test_algn.qsub: an example OpenPBS script for submitting the job to a cluster.
- ppalgn_step_X.config: the input script of stage X of ppalgn, where X ranges from 1 to 4. X=1/2/4 means N=1/2/4 and X=3 means N<=1, where N is the number of mismatches in a seed of a multiseed alignment.
- init_data_algn_stpX.cfg: defines the corresponding initial data files for ppalgn_step_X.config, where X ranges from 1 to 4.
After the job finishes, it also contains the result files:
- results/stp[1-4]_Y.sam: the output SAM files for ppalgn_step_X.config, where Y ranges from 0 to (# of data managers).
- results/stp[1-4]_Y_paired1.fq and results/stpX_Y_paired.fq are the unmapped paired reads from a specific stage.
- results/stp[1-4]_Y_unpaired.fq are the unmapped unpaired reads from a specific stage.
- res/ contains the logfiles.
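Because each stage writes one SAM file per data manager, a quick way to see what a run produced is to group the SAM file names by stage. The sketch below assumes only the stp[1-4]_Y.sam naming pattern described above. The function name group_sam_by_stage is a hypothetical helper, and the unmapped-read .fq files are deliberately ignored.

```python
import re
from collections import defaultdict

# Matches names such as "stp2_0.sam": stage number 1-4, then the index Y.
SAM_NAME = re.compile(r"^stp([1-4])_(\d+)\.sam$")


def group_sam_by_stage(filenames) -> dict:
    """Map each stage number to the sorted list of Y indices found for it."""
    stages = defaultdict(list)
    for name in filenames:
        match = SAM_NAME.match(name)
        if match:
            stages[int(match.group(1))].append(int(match.group(2)))
    return {stage: sorted(ids) for stage, ids in sorted(stages.items())}


if __name__ == "__main__":
    # Illustrative file names, not taken from a real run.
    names = ["stp1_0.sam", "stp1_1.sam", "stp2_0.sam", "stp1_0_unpaired.fq"]
    print(group_sam_by_stage(names))
```

In practice the input could be os.listdir("results") from inside the algn_chrU sample directory.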
Manual (still under construction, with more to come...)
ppbwt is a parallel indexer that builds an index from a set of DNA sequences. It reads in FASTA files with the extension .fa and generates six index files. The index is based on the FM-index, which in turn is based on the Burrows-Wheeler transform (BWT).
Options:
- num_tmgr < total number of task managers >
Acceptable Values: positive integer
Default Value: n/a
Description: The number of task managers must be equal to that defined in datafiles.
- datafiles < define the input data files >
Acceptable Values: non-empty string
Default Value: n/a
Description: Each line defines one initial data file. Multiple data files are acceptable, but at least one must be defined.
- bwt_pattern_length < the length of a seed in the indexer algorithm >
Acceptable Values: positive integer
Default Value: n/a
Description: The length of a seed for the task manager in the indexer. A task manager will create 4^K seeds, where K is bwt_pattern_length.
- bwt_work_iteration < how many iterations is an engine expected to run on average? >
Acceptable Values: positive number, no less than 1.0
Default Value: 1.0
Description: At the beginning, a task manager computes S = FLOOR(4^K / (E*W)), where K is bwt_pattern_length, E is the number of participating engines, and W is bwt_work_iteration. Then the task manager assigns no more than S seeds to the next available engine, until the indexing of the reference genome is complete.
- compression < turn on the indexer? >
Acceptable Values: on or off
Default Value: off
Description: Turn the indexer on or off. To run the indexer, this option must be set to "ON" and the option "alignment" must be set to "OFF". The indexer and the aligner are mutually exclusive.
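To make the interaction between bwt_pattern_length and bwt_work_iteration concrete, the sketch below computes the number of seeds handed out per assignment. It assumes the formula is read as S = FLOOR(4^K / (E*W)), that is, with E*W as the denominator, which matches the description of W as the average number of iterations per engine. The function name seeds_per_assignment is hypothetical and not part of PPSeq.

```python
import math


def seeds_per_assignment(pattern_length: int, engines: int, work_iteration: float = 1.0) -> int:
    """Compute S = floor(4^K / (E * W)) for K = pattern_length, E = engines, W = work_iteration."""
    if pattern_length < 1 or engines < 1:
        raise ValueError("pattern_length and engines must be positive integers")
    if work_iteration < 1.0:
        raise ValueError("work_iteration must be no less than 1.0")
    total_seeds = 4 ** pattern_length  # a task manager creates 4^K seeds
    return math.floor(total_seeds / (engines * work_iteration))


if __name__ == "__main__":
    # K = 3 gives 4^3 = 64 seeds; with 4 engines and W = 2.0, each assignment holds 8 seeds.
    print(seeds_per_assignment(3, 4, 2.0))
```

Under this reading, a larger bwt_work_iteration yields smaller batches, so each engine returns to the task manager more often.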
ppalgn is a parallel aligner that reads in both genome index files prebuilt by ppbwt and short-read files in the FASTQ format. The final output can be SAM/BAM files, or other user-defined formats.
Options:
- num_tmgr < total number of task managers >
Acceptable Values: positive integer
Default Value: n/a
Description: The number of task managers must be equal to that defined in datafiles.
- num_dmgr < total number of data managers >
Acceptable Values: positive integer
Default Value: 1
Description: The number of data managers. These data managers will collectively produce the final output files.
- datafiles < define the input data files >
Acceptable Values: non-empty string
Default Value: n/a
Description: Each line defines one initial data file. Multiple data files are acceptable, but at least one must be defined.
- algn_readformat < input reads file format >
Acceptable Values: fastq or fasta
Default Value: fastq
Description: The format of the input reads file. FASTQ and FASTA are supported.
- algn_job_size <how many short reads will be sent to an engine at a time? >
Acceptable Values: positive integer
Default Value: n/a
Description: The number of short reads that will be sent to an engine at a time.
- algn_seed_length <the length of the seed? >
Acceptable Values: positive integer
Default Value: 20
Description: The length of the seed.
- algn_seed_number <the number of re-seeding rounds? >
Acceptable Values: positive integer
Default Value: 1
Description: The number of times PPAlgn will re-seed reads with repetitive seeds.
- algn_dp_search_depth <the search depth ? >
Acceptable Values: positive integer
Default Value: 15
Description: The search depth, namely the maximum number of consecutive seed extension attempts.
- algn_read_ids <the name of the reads index file? >
Acceptable Values: string
Default Value: n/a
Description: If algn_strategy_hash is set to "1", a reads index file is generated in which the short reads are "hashed".
- algn_dmgr_prefix <the prefix of the output file generated by the data managers? >
Acceptable Values: string
Default Value: n/a
Description:
- algn_stage <the alignment stage? >
Acceptable Values: 1, 2, 3, 4
Default Value: 1
Description: The alignment stage is defined in the paper. 1/2/4 means N =1/2/4. 3 means N <= 1. Here, N is the number of mismatches allowed in a seed in the multiseed algorithm.
- algn_k_mode <how many distinct alignments will be searched at most? >
Acceptable Values: positive integer
Default Value: 1
Description: Search for at most K distinct alignments. K = algn_k_mode.
- algn_func_type <specify a function rather than an individual number or setting? >
Acceptable Values: N, C, L, S, G
Default Value: N
Description: Specify a function type to generate the reads. N - use the default setting, C - constant, L - linear, S - square root, G - natural log.
- algn_intercept <the intercept of the function defined in algn_func_type? >
Acceptable Values: floating-point number
Default Value: 1.0
Description: This option is used to specify the intercept of the function in algn_func_type.
- algn_slope <the slope of the function defined in algn_func_type? >
Acceptable Values: floating-point number
Default Value: 2.5
Description: This option is used to specify the slope of the function in algn_func_type.
- algn_tmpdir <temporary directory for the data manager >
Acceptable Values: string
Default Value: n/a
Description: Temporary directory for the data manager.
- algn_strategy_hash <how are the short reads identified? >
Acceptable Values: 1 or 2
Default Value: 1
Description: 1 means that a hash function identifies each short read by converting its name to an integer. 2 means that each short read is identified throughout the alignment algorithm by the sequential number in which it was read in.
- algn_threads <the number of threads in an engine that align simultaneously? >
Acceptable Values: positive integer
Default Value: 1
Description: The number of threads in an engine that align simultaneously.
- algn_FilterThreads <the number of threads in the data manager that assemble the results simultaneously? >
Acceptable Values: positive integer
Default Value: 1
Description: The number of threads in the data manager that assemble the results in parallel.
- algn_SizeOf <the maximum number of result messages buffered at a data manager? >
Acceptable Values: positive integer
Default Value: 1
Description: The maximum number of result messages that can be buffered at the data manager. The larger the number, the faster the alignment but the more memory is used.
- alignment < turn on the aligner? >
Acceptable Values: on or off
Default Value: off
Description: Turn the aligner on or off. To run the aligner, this option must be set to "ON" and the option "compression" must be set to "OFF". The indexer and the aligner are mutually exclusive.
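The options algn_func_type, algn_intercept and algn_slope together describe a function rather than a single number. The manual does not spell out the functional form, so the sketch below rests on an assumption: that the function is f(x) = intercept + slope * g(x), with g chosen by the type letter, which appears to mirror the convention used for function-valued options in Bowtie 2. What x stands for is also not stated here and is left generic. The name evaluate_func is hypothetical and not part of PPSeq.

```python
import math


def evaluate_func(func_type: str, x: float, intercept: float = 1.0, slope: float = 2.5):
    """Evaluate the assumed form f(x) = intercept + slope * g(x) for a given type letter."""
    if func_type == "N":
        return None  # N: no function is applied; the default setting is used instead
    if func_type == "C":
        return intercept  # constant: the slope is ignored (assumption)
    if func_type == "L":
        return intercept + slope * x
    if func_type == "S":
        return intercept + slope * math.sqrt(x)
    if func_type == "G":
        return intercept + slope * math.log(x)
    raise ValueError(f"unknown function type: {func_type!r}")


if __name__ == "__main__":
    # With the default intercept 1.0 and slope 2.5, evaluated at x = 100.
    for kind in ("C", "L", "S", "G"):
        print(kind, evaluate_func(kind, 100.0))
```

The defaults in the signature are the documented defaults of algn_intercept (1.0) and algn_slope (2.5).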
The performance tuning options are available for both the indexer and the aligner.
- asgn_mthd < how the managers will be assigned? >
Acceptable Values: naive or jump
Default Value: naive
Description: If it is set to "naive", the task and data managers are assigned in a row. If it is set to "jump", the task and data managers are assigned in an interlaced mode.
- load_blnc < turn on the load balancer? >
Acceptable Values: on or off
Default Value: off
Description: Turn the load balancing algorithm on or off.
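Several of the options above constrain one another, most notably "compression" and "alignment", which are exclusive. As a minimal sketch, the acceptable values listed in this manual can be checked before a job is submitted. The dictionary used here is a hypothetical in-memory representation and says nothing about the actual syntax of the .config files. Treating "both off" as an error is also an assumption. The name validate_options is not part of PPSeq.

```python
def validate_options(opts: dict) -> list:
    """Return a list of problems found in a dictionary of PPSeq option values."""
    errors = []
    compression = str(opts.get("compression", "off")).lower()
    alignment = str(opts.get("alignment", "off")).lower()

    # The indexer and the aligner are exclusive: exactly one should be on (assumption).
    if (compression == "on") == (alignment == "on"):
        errors.append("exactly one of 'compression' and 'alignment' must be on")

    num_tmgr = opts.get("num_tmgr")
    if not isinstance(num_tmgr, int) or num_tmgr < 1:
        errors.append("num_tmgr must be a positive integer")

    if compression == "on" and opts.get("bwt_work_iteration", 1.0) < 1.0:
        errors.append("bwt_work_iteration must be no less than 1.0")

    if alignment == "on":
        if opts.get("algn_stage", 1) not in (1, 2, 3, 4):
            errors.append("algn_stage must be 1, 2, 3 or 4")
        if opts.get("algn_func_type", "N") not in ("N", "C", "L", "S", "G"):
            errors.append("algn_func_type must be one of N, C, L, S, G")
        if opts.get("algn_strategy_hash", 1) not in (1, 2):
            errors.append("algn_strategy_hash must be 1 or 2")

    if opts.get("asgn_mthd", "naive") not in ("naive", "jump"):
        errors.append("asgn_mthd must be 'naive' or 'jump'")
    if str(opts.get("load_blnc", "off")).lower() not in ("on", "off"):
        errors.append("load_blnc must be 'on' or 'off'")
    return errors


if __name__ == "__main__":
    # Illustrative option values only.
    print(validate_options({"num_tmgr": 2, "compression": "on", "alignment": "on"}))
```

An empty list means that none of the checked constraints is violated. Options without a documented range, such as algn_tmpdir, are not checked.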