PacBioToCA is a module in the Celera Assembler software package that performs error correction on PacBio long reads by mapping shorter, high accuracy reads onto the long reads. The error corrected reads can then be assembled using a long-read assembler, such as Celera Assembler, Mira, or Allora.
Installing and using PacBioToCA and Celera Assembler with PacBio data is documented extensively on the pacBioToCA wiki page (and the software is bundled in SMRT Analysis), so this page will serve as a brief tutorial.
The following hybrid assembly of lambda sequence uses two fastq files, one containing long reads, another containing short reads (e.g. Illumina, 454, CCS). For example data, please see the pacBioToCA wiki page under the “Example with Phage Reads” section.
First, create a frg file based on the short read fastq data, which contains metadata about the data (technology, whether it’s paired end, etc.):
/tools/wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname illumina -technology illumina -reads illumina.fastq > illumina.frg
Some points to consider:
The documentation is presented in full in the following:
usage: /tools/wgs-7.0/Linux-amd64/bin/fastqToCA [-insertsize <mean> <stddev>] [-libraryname <name>]
-insertsize i d Mates are on average i +- d bp apart.
-libraryname n The UID of the library these reads are added to.
-technology p What instrument were these reads generated on ('illumina' is the default):
'sanger' --
'454' --
'illumina' --
'pacbio' --
-type t What type of fastq ('sanger' is the default):
'sanger' -- QV's are PHRED, offset=33 '!', NCBI SRA data.
'solexa' -- QV's are Solexa, early Solexa data.
'illumina' -- QV's are PHRED, offset=64 '@', Illumina reads from version 1.3 on.
See Cock, et al., 'The Sanger FASTQ file format for sequences with quality scores, and
the Solexa/Illumina FASTQ variants', doi:10.1093/nar/gkp1137
-innie The paired end reads are 5'-3' <-> 3'-5' (the usual case) (default)
-outtie The paired end reads are 3'-5' <-> 5'-3' (for Illumina Mate Pair reads)
This switch will reverse-complement every read, transforming outtie-oriented
mates into innie-oriented mates. This trick only works if all reads are the
same length.
-reads A Single ended reads, in fastq format.
-mates A Mated reads, interlaced, in fastq format.
-mates A,B Mated reads, in fastq format.
Since PacBioToCA uses the AMOS package, which is bundled with SMRT Analysis, ensure you have setup the SMRT Analysis environment:
source /opt/smrtanalysis/etc/setup.sh
Given a frg file of short reads (illumina.frg), and a fastq file of long reads (pacbio.filtered_subreads.fastq), the following performs error correction:
/tools/wgs-7.0/Linux-amd64/bin/pacBioToCA -length 500 -partitions 200 -l ec_pacbio -t 16 -s pacbio.spec \
-fastq pacbio.filtered_subreads.fastq illumina.frg > run.out 2>&1
Some points to consider:
The error corrected reads can then be assembled using any long read assembler. This is what you would call using Celera Assembler:
/tools/wgs-7.0/Linux-amd64/bin/runCA -p asm -d asm -s asm.spec ec_pacbio.frg > asm.out 2>&1
Some notes:
There are two spec files used in Celera Assembler hybrid assembly. The first, pacbio.spec, is used in error correction. The second, asm.spec, is used for assembly.
These files contain both algorithm-specific parameters as well as hardware-specific parameters. In general the algorithm parameters can be left alone, but the hardware-specific parameters need to be adjusted to your computing environment. The sample data found on the PacBioToCA wiki page contains reasonable spec files.
Here is a pacbio.spec file tuned to our cluster environment:
stopAfter=overlapper
# original asm settings
utgErrorRate = 0.25
utgErrorLimit = 4.5
cnsErrorRate = 0.25
cgwErrorRate = 0.25
ovlErrorRate = 0.25
merSize=14
merylMemory = 128000
merylThreads = 16
ovlStoreMemory = 8192
# grid info
useGrid = 1
scriptOnGrid = 1
frgCorrOnGrid = 1
ovlCorrOnGrid = 1
sge = -V -S /bin/sh
#sge = -V -A assembly
sgeScript = -pe smp 16
sgeConsensus = -pe smp 1
sgeOverlap = -pe smp 4
sgeFragmentCorrection = -pe smp 2
sgeOverlapCorrection = -pe smp 1
#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 4
ovlHashBlockLength = 20000000
ovlRefBlockSize = 50000000
# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000
frgCorrThreads = 2
frgCorrBatchSize = 100000
ovlCorrBatchSize = 100000
# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4
ovlStoreMemory = 8192
ovlConcurrency = 6
cnsConcurrency = 16
merOverlapperThreads = 2
merOverlapperSeedConcurrency = 6
merOverlapperExtendConcurrency = 6
frgCorrConcurrency = 8
ovlCorrConcurrency = 16
cnsConcurrency = 16
Here is an asm.spec tuned to our cluster environment:
cnsErrorRate = 0.10
ovlErrorRate = 0.10
overlapper = ovl
unitigger = bogart
utgBubblePopping = 1
merSize = 14
merylMemory = 128000
merylThreads = 16
ovlStoreMemory = 8192
# grid info
useGrid = 1
scriptOnGrid = 1
frgCorrOnGrid = 1
ovlCorrOnGrid = 1
sge = -V -S /bin/sh
sgeScript = -pe smp 16
sgeConsensus = -pe smp 1
sgeOverlap = -pe smp 4
sgeFragmentCorrection = -pe smp 2
sgeOverlapCorrection = -pe smp 1
#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 6
ovlHashBlockLength = 20000000
ovlRefBlockSize = 5000000
# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000
frgCorrThreads = 2
frgCorrBatchSize = 100000
ovlCorrBatchSize = 100000
# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 12
ovlStoreMemory = 8192
ovlConcurrency = 8
merOverlapperThreads = 6
merOverlapperSeedConcurrency = 2
merOverlapperExtendConcurrency = 2
frgCorrConcurrency = 8
ovlCorrConcurrency = 16
cnsConcurrency = 16
doToggle=0
toggleNumInstances = 0
toggleUnitigLength = 2000
doOverlapBasedTrimming = 1
doExtendClearRanges = 2