HGAP

This page describes the use of PreAssembler, Celera® Assembler and Quiver for the purpose of Hierarchical Genome Assembly in SMRT® Analysis version 1.4 with special emphasis on SMRT Portal, the web interface for SMRT Analysis 1.4.

If you are interested in executing the entire workflow on the smrtpipe command line please refer to the HGAP wrapper script developed for that purpose.

https://github.com/PacificBiosciences/Bioinformatics-Training/tree/master/hgap

HGAP in SMRT Analysis 1.4 has been tested on microbial sized genomes, so while it will run on larger genomes there is no official support for larger-sized genomes.

## Introduction The Hierarchical Genome Assembly Process (HGAP) was developed to improve the performance, accuracy and efficiency of _de novo_ assemblies from long single-pass reads generated by the PacBio® sequencing platform. The process consists of 3 main procedures:

  1. Pre-assembly of PacBio single-pass reads;
  2. The assembly of trimmed pre-assembled reads via long read assemblers; and
  3. Polishing the assembly to the highest consensus accuracy via Quiver.

## Pre Assembly workflow The most recent algorithmic advance, HGAP, reduces the sample preparation requirements to a single large insert library. Instead of using just the longest reads in a large insert library (as is the case with error correction), the HGAP approach incorporates almost all the reads. This is accomplished by aligning all subreads above the minimum subread length against the longest seed reads and generating a consensus sequence of the mapped subreads. These pre-assembled reads are both long and high accuracy and can be used by long read assemblers for de novo genome assembly utilizing only PacBio data.

Mapping long subreads to even longer seed reads is able to boost the accuracy of the preassembled reads even in difficult regions of the genome (low complexity, repetitive). SMRT® sequencing’s lack of amplification bias, its ability to span difficult regions of genomes and the random nature of the error model contribute to producing accurate and complete genome assemblies that previously had not been possible or only with substantial investments in manual genome finishing.

### Filtering After preparing a SMRT®bell insert library that is as long as the quality of the genomic DNA allows (preferably longer than 10kb) and generating single pass reads from the longest movie length possible (currently 120min) the first step before starting preassembly is to evaluate the subread length distribution and cumulative yield starting with the longest subreads.

One would select all the SMRT®cells for the genome to assemble and run the data through the RS_Filter_Only protocol at the minimum RQ (0.8) and minimum subread (500 bp) that will be employed by the RS_PreAssembler protocol in the next step. Given an estimated genome size for the organism of interest one would use the Subread Filtering graph to identify a long subread yield that would correspond to 20-30X genome coverage. For a 5Mb genome one would want to identify the minimum subread cutoff that would generate 100-150Mb coverage which should typically be in the 4kb-6kb range (Fig 1). We will use this subread cutoff as the seed read length for the PreAssembly step.

[[/images/HGAP-Filtering-Plot.png]]

Fig. 1.

### PreAssembly The RS_PreAssembler protocol utilizes the subreads that are longer than the seed read length cutoff as reference to map all other single pass subreads to their respective best n locations. All overlapping subreads that map to a given seed read are then used to generate a consensus sequence which in effect results in a seed read with significantly higher accuracy in regions of high coverage. The stringency and efficiency of the mapping step can be controlled through the advanced choices box that allows the pass through of all BLASR options to the command line (Fig 2).

Prepare the RS_PreAssembler protocol in SMRT®Portal by specifying a Minimum Seed Read Length as identified from the subread filtering graph. When using data generated from the XL enzyme in combination with the C2 sequencing kit it is suggested to change the Advanced BLASR option of -maxLCPLength from the default 16 to 14 to accommodate the slightly altered error profile between the C2 and XL polymerase. If one is interested in carrying out a traditional error correction step that utilizes CCS data to map to the seed reads the Use CCS box would need to be checked (Fig 2).

[[/images/PreAssembler-SMRT-Portal.png]]

Fig. 2.

### Fastq Trimming The consensus algorithm of the PreAssembler protocol reports the confidence in the mapped coverage in the form of a Quality Value (QV) Score that is encoded in the corrected.fastq. One can therefore use minimum QV filtering to remove regions of low coverage like low quality subreads, seed read edges, merge points of chimeric seed reads and inverted repeats of unidentified adapter sequences.

Choosing automatic fastq filtering in RS_PreAssembler protocol uses a QV cutoff of 59.5 and a length cutoff of 500bp (Fig. 2). If automatic fastq filtering is selected the filtered corrected.fastq will replace the original corrected.fastq preventing additional filtering with lower stringency to be carried out. You may want to do manual fastq trimming to fine tune filtering to the actual library insert size, total coverage and the specifics of the genome (repeat content, complexity etc.).

Empirical observations suggest that carrying out assemblies at different QV cutoffs (20, 35 or 50) at similar coverage levels (different lengths cutoffs) will help in identifying the best trimming parameters for a given data set. For manual fastq filtering use the trimFastqByQVWindow.py script on the command line.

source /opt/smrtanalysis/etc/setup.sh

trimFastqByQVWindow.py -h Usage: trimFastqByQVWindow.py [–help] [options] ARG1 Options: -h, –help show this help message and exit -l LOGFILE, Specify a file to log to. Defaults to stderr. -d, –debug Increases verbosity of logging -i, –info Display informative log entries -p, –profile Profile this script, dumping to<name>.profile –qvCut=QVCUT quality value cutoff –trimFront=TRIMFRONT beginning base pairs to remove(gmin) –out=OUT out file(gdefaults to stdout) –fastaOut=FASTAOUT out fasta file(gdefaults to None) –minSeqLen=MINSEQLEN min seq len to keep>

## Celera Assembly workflow The use of Celera Assembler is well documented. In short the first step is the preparation of a frg file to deliver the preassembled reads to the Celera Assembler pipeline:

fastqToCA -libraryname reads -technology sanger -reads corrected.fastq > preassembled.frg

Some points to consider: The -libraryname is required, but isn’t important for any downstream analysis. The QV values are offset by 33 by default so the -type sanger parameter can be omitted. The frg file contains the absolute path of the fastq, so moving either will corrupt the frg file.

The frg file is then used to specify the input data for Celera Assemblers runCA command:

runCA -p asm -d asm -s asm.spec preassembled.frg > asm.out 2>&1

Some points to consider: The assembled contigs will be at asm/9-terminator/asm.ctg.fasta. -p asm specifies the prefix for the results files and can be used without changes. -d asm specifiesis the output directory.

The asm.spec file can be tuned for the compute and cluster environment (see Celera Assembler homepage for details).

## Quiver workflow The quiver consensus algorithm is now the default for the RS_Resequencing protocol in SMRT®Portal 1.4. To improve the consensus accuracy of the assembly one would need to copy the output of the Celera Assembler to the common/reference_dropbox to generate a reference in SMRT®Portal. The next step is to carry out the RS_Resequencing protocol either with the same QV and subread filter settings used for Preassembler or less stringent for additional coverage.

The default behavior of the quiver algorithm is geared towards generating the highest quality consensus sequence of the reference – the _de novo_ assembly in this case – which limits coverage only to unambiguously mapped reads (Fig. 3). For non-repetitive regions with high unique coverage the consensus algorithm is able to achieve accuracies above QV50 but for repetitive regions the consensus accuracy could be reduced due to lower unique coverage or in extreme cases insufficient coverage for Quiver.

[[/images/Assembly-Polishing-SMRT-Portal.png]]

Fig. 3

In regions without sufficient unique coverage the default behavior is to insert _Ns_ into the consensus sequence. One way to combat this behavior is to allow ambiguously mapped long reads to participate in the consensus calculation by deselecting _Use only unambiguously mapped reads_. The alternative would be to carry out the default RS_Resequencing protocol and then re-running the consensus.fastq export command with an alternate option that will insert the original reference sequence in regions that otherwise would be filled with Ns.

quiver –noEvidenceConsensusCall=reference –algorithm=quiver data/aligned_reads.cmp.h5 -r reference.fasta -o consensus.fastq

This Page