Evaluating assemblies
=====================

Aligning reads against a reference
----------------------------------

The following commands create a sorted bam file that can be viewed in IGV, mapping the lambda PacBio raw reads against the reference::

    /tools/bwa/bwa index lambda.fasta
    nice -n 15 /tools/bwa/bwa bwasw -t 16 lambda.fasta pacbio.filtered_subreads.fasta > pacbio.filtered_subreads.sam
    /tools/samtools/samtools view -bS pacbio.filtered_subreads.sam > pacbio.filtered_subreads.bam
    /tools/samtools/samtools sort pacbio.filtered_subreads.bam pacbio.filtered_subreads.sorted
    /tools/samtools/samtools index pacbio.filtered_subreads.sorted.bam

The software can be downloaded here:

* samtools: http://samtools.sourceforge.net/
* bwa: http://bio-bwa.sourceforge.net/

The bam files can be viewed using this software:

* IGV: http://www.broadinstitute.org/software/igv/download
* Tablet: http://bioinf.scri.ac.uk/tablet/

Running Mauve
-------------

Mauve needs to be called from within its root directory, so first let's change directories::

    cd /tools/mauve

Now we can run Mauve::

    java -cp Mauve.jar org.gel.mauve.assembly.ScoreAssembly -reference /data/assembly/lambda.fasta -assembly ~/training/assembly/celera/asm/9-terminator/asm.ctg.fasta -reorder -outputDir ~/training/assembly/mauve_celera/

Note that we need to specify both a ``-reference`` and an ``-assembly``, as well as the output directory ``-outputDir``. ``-reorder`` allows Mauve to reorder and flip the strand of the contigs to best align the contigs to the reference. A recent build of Mauve is required to use this functionality.

More details about scoring genome assemblies with Mauve can be found here: http://code.google.com/p/ngopt/wiki/How_To_Score_Genome_Assemblies_with_Mauve

Mauve includes a Java-based browser of the genome alignments, downloadable at http://asap.ahabs.wisc.edu/mauve/ . To browse an alignment, open one of files in the subdirectories in the output directory, e.g. alignment5/alignment5

Nucmer: Contig accuracy and dotplots
------------------------------------

Using ``nucmer`` from the MUMmer package, we can calculate the per base accuracy of contigs against a reference::

    /tools/MUMmer/nucmer --maxmatch -c 100 -p celera lambda.fasta celera-assembly.fasta
    /tools/MUMmer/show-coords -c celera.delta > celera.coords

Notes:

* ``--maxmatch -c 100`` tunes nucmer to look exhaustively for longer matches.
* ``-p celera`` specifies ``celera`` to be the output prefix, in this case generating ``celera.delta``
* Here we placed the reference fasta before the assembly fasta.

``celera.coords`` contains the results::

    /data/assembly/lambda.fasta /home/lhon/training/assembly/celera/asm/9-terminator/asm.ctg.fasta
    NUCMER

        [S1]     [E1]  |     [S2]     [E2]  |  [LEN 1]  [LEN 2]  |  [% IDY]  |  [COV R]  [COV Q]  | [TAGS]
    ==========================================================================================================
          42    48496  |    48452        1  |    48455    48452  |    99.93  |    99.90   100.00  | ref000001|lambda_NEB3011        ctg7180000000001

In this case, the single contig that Celera Assembler generated had a 99.93% accuracy with 99.9% coverage of the genome.

Now let's visualize the alignment, using ``celera.delta`` generated by the nucmer command above::

    /tools/MUMmer/mummerplot --png -p celera celera.delta -R lambda.fasta -Q celera-assembly.fasta