pacBioToCA (error correction via Celera Assembler)

PacBioToCA is a module in the Celera Assembler software package that performs error correction on PacBio long reads by mapping shorter, high accuracy reads onto the long reads. The error corrected reads can then be assembled using a long-read assembler, such as Celera Assembler, Mira, or Allora.

Installing and using PacBioToCA and Celera Assembler with PacBio data is documented extensively on the pacBioToCA wiki page (and the software is bundled in SMRT Analysis), so this page will serve as a brief tutorial.

The following hybrid assembly of lambda sequence uses two fastq files, one containing long reads, another containing short reads (e.g. Illumina, 454, CCS). For example data, please see the pacBioToCA wiki page under the “Example with Phage Reads” section.

Preparing the short read data

First, create a frg file based on the short read fastq data, which contains metadata about the data (technology, whether it’s paired end, etc.):

/tools/wgs-7.0/Linux-amd64/bin/fastqToCA -libraryname illumina -technology illumina -reads illumina.fastq > illumina.frg

Some points to consider:

  • The -libraryname is required, but isn’t important for any downstream analysis.
  • Verify that the QV values are offset 33; otherwise use the -type parameter.
  • If paired end data, make sure -insertsize and -innie/-outie are correctly specified.
  • The frg file contains the absolute path of the fastq, so moving either will corrupt the frg file.

The documentation is presented in full in the following:

usage: /tools/wgs-7.0/Linux-amd64/bin/fastqToCA [-insertsize <mean> <stddev>] [-libraryname <name>]

  -insertsize i d    Mates are on average i +- d bp apart.

  -libraryname n     The UID of the library these reads are added to.

  -technology p      What instrument were these reads generated on ('illumina' is the default):
                       'sanger'   --
                       '454'      --
                       'illumina' --
                       'pacbio'   --

  -type t            What type of fastq ('sanger' is the default):
                       'sanger'   -- QV's are PHRED, offset=33 '!', NCBI SRA data.
                       'solexa'   -- QV's are Solexa, early Solexa data.
                       'illumina' -- QV's are PHRED, offset=64 '@', Illumina reads from version 1.3 on.
                     See Cock, et al., 'The Sanger FASTQ file format for sequences with quality scores, and
                     the Solexa/Illumina FASTQ variants', doi:10.1093/nar/gkp1137

  -innie             The paired end reads are 5'-3' <-> 3'-5' (the usual case) (default)

  -outtie            The paired end reads are 3'-5' <-> 5'-3' (for Illumina Mate Pair reads)
                     This switch will reverse-complement every read, transforming outtie-oriented
                     mates into innie-oriented mates.  This trick only works if all reads are the
                     same length.

  -reads A           Single ended reads, in fastq format.
  -mates A           Mated reads, interlaced, in fastq format.
  -mates A,B         Mated reads, in fastq format.

Error correction with PacBioToCA

Since PacBioToCA uses the AMOS package, which is bundled with SMRT Analysis, ensure you have setup the SMRT Analysis environment:

source /opt/smrtanalysis/etc/setup.sh

Given a frg file of short reads (illumina.frg), and a fastq file of long reads (pacbio.filtered_subreads.fastq), the following performs error correction:

/tools/wgs-7.0/Linux-amd64/bin/pacBioToCA -length 500 -partitions 200 -l ec_pacbio -t 16 -s pacbio.spec \
    -fastq pacbio.filtered_subreads.fastq illumina.frg > run.out 2>&1

Some points to consider:

  • In general, the -length, -partitions, and -t parameters can be used without modification.
  • -l specifies the error corrected output filename prefix. The output of the example above will be ec_pacbio.frg, ec_pacbio.fasta, and ec_pacbio.qual.
  • Celera Assembler works best on a cluster, so the SGE-enabled version of pacbio.spec should be used. This is discussed in more detail below.

Assembly of error corrected reads

The error corrected reads can then be assembled using any long read assembler. This is what you would call using Celera Assembler:

/tools/wgs-7.0/Linux-amd64/bin/runCA -p asm -d asm -s asm.spec ec_pacbio.frg > asm.out 2>&1

Some notes:

  • The assembled contigs will be at asm/9-terminator/asm.ctg.fasta.
  • -p asm can be used without changes.
  • -d is the output directory.
  • Similar to the pacbio.spec, the asm.spec can be tuned for a cluster environment.

Other notes

  • If using Illumina data for the short reads, look at the Optional Parameters on the PacBioToCA page for some alternative parameters.
  • In the CA 7.0 release, there is a bug that may cause crashes in the PacBioToCA pipeline. The Known Issues section on the PacBioToCA page has details. I have had good luck compiling from source, which addresses this issue.

Spec files

There are two spec files used in Celera Assembler hybrid assembly. The first, pacbio.spec, is used in error correction. The second, asm.spec, is used for assembly.

These files contain both algorithm-specific parameters as well as hardware-specific parameters. In general the algorithm parameters can be left alone, but the hardware-specific parameters need to be adjusted to your computing environment. The sample data found on the PacBioToCA wiki page contains reasonable spec files.

Here is a pacbio.spec file tuned to our cluster environment:

stopAfter=overlapper

# original asm settings
utgErrorRate = 0.25
utgErrorLimit = 4.5

cnsErrorRate = 0.25
cgwErrorRate = 0.25
ovlErrorRate = 0.25

merSize=14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 1
scriptOnGrid = 1
frgCorrOnGrid = 1
ovlCorrOnGrid = 1

sge = -V -S /bin/sh
#sge = -V -A assembly
sgeScript = -pe smp 16
sgeConsensus = -pe smp 1
sgeOverlap = -pe smp 4
sgeFragmentCorrection = -pe smp 2
sgeOverlapCorrection = -pe smp 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 4
ovlHashBlockLength = 20000000
ovlRefBlockSize =  50000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 4

ovlStoreMemory = 8192

ovlConcurrency = 6

cnsConcurrency = 16

merOverlapperThreads = 2
merOverlapperSeedConcurrency = 6
merOverlapperExtendConcurrency = 6

frgCorrConcurrency = 8
ovlCorrConcurrency = 16
cnsConcurrency = 16

Here is an asm.spec tuned to our cluster environment:

cnsErrorRate = 0.10
ovlErrorRate = 0.10

overlapper = ovl
unitigger = bogart
utgBubblePopping = 1

merSize = 14

merylMemory = 128000
merylThreads = 16

ovlStoreMemory = 8192

# grid info
useGrid = 1
scriptOnGrid = 1
frgCorrOnGrid = 1
ovlCorrOnGrid = 1

sge = -V -S /bin/sh
sgeScript = -pe smp 16
sgeConsensus = -pe smp 1
sgeOverlap = -pe smp 4
sgeFragmentCorrection = -pe smp 2
sgeOverlapCorrection = -pe smp 1

#ovlMemory=8GB --hashload 0.7
ovlHashBits = 25
ovlThreads = 6
ovlHashBlockLength = 20000000
ovlRefBlockSize =  5000000

# for mer overlapper
merCompression = 1
merOverlapperSeedBatchSize = 500000
merOverlapperExtendBatchSize = 250000

frgCorrThreads = 2
frgCorrBatchSize = 100000

ovlCorrBatchSize = 100000

# non-Grid settings, if you set useGrid to 0 above these will be used
merylMemory = 128000
merylThreads = 12

ovlStoreMemory = 8192

ovlConcurrency = 8

merOverlapperThreads = 6
merOverlapperSeedConcurrency = 2
merOverlapperExtendConcurrency = 2

frgCorrConcurrency = 8

ovlCorrConcurrency = 16
cnsConcurrency = 16

doToggle=0
toggleNumInstances = 0
toggleUnitigLength = 2000

doOverlapBasedTrimming = 1
doExtendClearRanges = 2