P_ErrorCorrection

P_ErrorCorrection is a module in SMRT Pipe versions 1.3.1 to 1.3.3 that performs error correction on PacBio long reads by mapping shorter, high-accuracy reads onto the long reads. The error-corrected reads can then be assembled using a long-read assembler, such as Celera Assembler, Mira, or Allora.

If you are using SMRT Analysis 1.4, please consider using the HGAP approach instead.

Unlike the PacBioToCA module in Celera Assembler, P_ErrorCorrection has the ability to keep long reads together in regions where there is no short read coverage.

P_ErrorCorrection requires a params.xml and an input.xml.

params.xml

P_ErrorCorrection can use fasta/fastq files as input if the options useFastqAsShortReads and useFastaAsLongReads are set to true. The following shows this configuration:

<?xml version="1.0"?>
<smrtpipeSettings>
    <module id="P_ErrorCorrection">
        <description>Error Correction</description>
        <param name="useFastqAsShortReads" hidden="true">
            <value>True</value>
        </param>
        <param name="useFastaAsLongReads" hidden="true">
            <value>True</value>
        </param>
        <param name="useLongReadsInConsensus" hidden="true">
            <value>False</value>
        </param>
        <param name="useUnalignedReadsInConsensus" hidden="true">
            <value>False</value>
        </param>
        <param name="blasrOpts">
            <value>-advanceHalf -noSplitSubreads -ignoreQuality -minMatch 10 -minPctIdentity 70 -bestn 20</value>
        </param>
        <param name="layoutOpts">
            <value>--overlapTolerance=25</value>
        </param>
    </module>
</smrtpipeSettings>

bas.h5 files can also be used directly if either useFastqAsShortReads or useFastaAsLongReads is set to false. The P_Fetch and P_Filter modules are needed in this scenario, and the params.xml needs to be adjusted accordingly; this is detailed in the SMRT Pipe Reference Guide.

Additional parameters are described in the SMRT Pipe Reference Guide.
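
Before launching a job, it can help to confirm which options a params.xml actually sets. The following is a minimal Python sketch, not part of SMRT Pipe: the helper names read_module_params and uses_fasta_fastq_input are hypothetical, and the case-insensitive comparison of "True"/"true" is an assumption made only for this check.

```python
import xml.etree.ElementTree as ET


def read_module_params(xml_text, module_id="P_ErrorCorrection"):
    """Return {param name: value text} for one module in a params.xml string."""
    root = ET.fromstring(xml_text)
    for module in root.iter("module"):
        if module.get("id") == module_id:
            return {
                param.get("name"): (param.findtext("value") or "").strip()
                for param in module.iter("param")
            }
    # Module not present in this params.xml.
    return {}


def uses_fasta_fastq_input(params):
    """True when both file-input options are set to true.

    Assumption: values are compared case-insensitively.
    """
    keys = ("useFastqAsShortReads", "useFastaAsLongReads")
    return all(params.get(key, "").lower() == "true" for key in keys)
```

For example, read_module_params(open("params.xml").read()) returns a dictionary that includes the blasrOpts and layoutOpts strings, which can then be inspected or logged alongside the job.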

input.xml

You can specify the actual data to be error corrected in the input.xml file:

<?xml version="1.0"?>
<pacbioAnalysisInputs>
  <dataReferences>
    <url ref="fasta:pacbio.filtered_subreads.fasta" />
    <url ref="fastq:illumina.fastq" />
  </dataReferences>
</pacbioAnalysisInputs>

As the params.xml options suggest, the long-read data need to be in fasta format, and the short-read data need to be in fastq format.
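
To check that an input.xml pairs a fasta entry with a fastq entry, the ref attributes can be grouped by the prefix before the colon. This is a small Python sketch under that assumption; the function name refs_by_prefix is hypothetical and not part of SMRT Pipe.

```python
import xml.etree.ElementTree as ET


def refs_by_prefix(xml_text):
    """Group the ref attributes of input.xml <url> entries by their prefix."""
    grouped = {}
    for url in ET.fromstring(xml_text).iter("url"):
        # "fasta:reads.fasta" splits into prefix "fasta" and target "reads.fasta".
        prefix, _, target = url.get("ref", "").partition(":")
        grouped.setdefault(prefix, []).append(target)
    return grouped
```

Applied to the input.xml above, the result has one key for fasta and one for fastq, each listing the corresponding file name.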

If referring to bas.h5 files, you can use the following format:

<url ref="run:0000000-0000"><location>/path/to/bas.h5</location></url>

The run id needs to be unique. If there are a number of bas.h5 files, the input.xml can be autogenerated using the fofnToSmrtpipeInput.py script. This script takes a fofn file, which contains a list of filenames, one per line, and outputs a properly formatted input.xml.
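
To illustrate the kind of output described above, here is a minimal Python sketch that turns a list of bas.h5 paths into an input.xml. It is not the fofnToSmrtpipeInput.py script and may differ from it; the function name input_xml_from_fofn and the counter-based run id scheme are assumptions chosen only to keep each id unique.

```python
import xml.etree.ElementTree as ET


def input_xml_from_fofn(fofn_text):
    """Build an input.xml string with one <url> entry per bas.h5 path."""
    # One path per line; blank lines are skipped.
    paths = [line.strip() for line in fofn_text.splitlines() if line.strip()]
    root = ET.Element("pacbioAnalysisInputs")
    refs = ET.SubElement(root, "dataReferences")
    for index, path in enumerate(paths):
        # Hypothetical id scheme: the counter only guarantees uniqueness.
        url = ET.SubElement(refs, "url", ref="run:0000000-%04d" % index)
        ET.SubElement(url, "location").text = path
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")
```

Calling input_xml_from_fofn(open("input.fofn").read()) and writing the result to input.xml gives a file in the same shape as the single-entry example above, with one url element per line of the fofn.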

job.sh

The job.sh script contains the command to run the P_ErrorCorrection module using SMRT Pipe:

smrtpipe.py --params=params.xml xml:input.xml

Running P_ErrorCorrection

First, ensure you have set up the SMRT Analysis environment:

source /opt/smrtanalysis/etc/setup.sh

Then run job.sh:

source job.sh
