The Genome Factory: April 2016

Introduction

As of April 2016, there are about 70,000 genome assemblies in GenBank (draft and complete), the majority of them bacterial. For genomes submitted in the NGS era, the COMMENT section of the GenBank file header has machine-readable information about the sequencing technology, depth of coverage, and software used.

For example, the entry for Enterococcus faecium OC2A-1 contains this:

##Genome-Assembly-Data-START##
Finishing Goal           :: High-Quality Draft
Current Finishing Status :: High-Quality Draft
Assembly Method          :: Velvet v. 1.1.06
Genome Coverage          :: 104x
Sequencing Technology    :: Illumina
##Genome-Assembly-Data-END##
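
A block like this is easy to read with a few lines of Python. The following is a minimal sketch, not the script used for this post: the function names are made up for illustration, and it assumes each field sits on one line in the "key :: value" form shown above (values that wrap onto a continuation line would need extra handling).

```python
import gzip

START = "##Genome-Assembly-Data-START##"
END = "##Genome-Assembly-Data-END##"


def parse_assembly_data(lines):
    """Return the key/value pairs of the first Genome-Assembly-Data block."""
    fields = {}
    inside = False
    for line in lines:
        if START in line:
            inside = True
        elif END in line:
            break
        elif inside and "::" in line:
            # Split on the first "::" only, in case the value contains one.
            key, value = line.split("::", 1)
            fields[key.strip()] = value.strip()
    return fields


def parse_gbff(path):
    """Hypothetical helper: read the block straight from a .gbff.gz file."""
    with gzip.open(path, "rt") as handle:
        return parse_assembly_data(handle)


example = """\
##Genome-Assembly-Data-START##
Finishing Goal           :: High-Quality Draft
Current Finishing Status :: High-Quality Draft
Assembly Method          :: Velvet v. 1.1.06
Genome Coverage          :: 104x
Sequencing Technology    :: Illumina
##Genome-Assembly-Data-END##
""".splitlines()

fields = parse_assembly_data(example)
print(fields["Assembly Method"])
```

Matching the markers with "in" rather than "==" means the leading COMMENT keyword and indentation of a real GenBank file do not get in the way.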

Method

I decided to parse this header for all the bacterial .gbff.gz (GenBank File Format, aka .gbk) files available on the NCBI FTP site to see what genome assembly software is being used for bacterial genomes. Now, like any user-provided information, there is a lot of junk in this field, so I wrote some curated regexps to categorise the entries into cleaner bins. If more than one method was listed, I binned it as Hybrid/Mixed. If it was too minor or probably wrong, I binned it as Could not parse.
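
The binning rules can be sketched roughly as follows. The patterns here are illustrative assumptions, not the curated regexps actually used for this post, and only a handful of bins are shown.

```python
import re
from collections import Counter

# Hypothetical patterns for a few bins; a real list would be much longer.
BINS = [
    ("AllPaths", re.compile(r"all\s*paths", re.I)),
    ("Newbler", re.compile(r"newbler", re.I)),
    # \b keeps "VelvetOptimiser" out of the plain Velvet bin.
    ("Velvet", re.compile(r"\bvelvet\b", re.I)),
    ("Spades", re.compile(r"spades", re.I)),
]


def bin_method(raw):
    """Map a free-text Assembly Method value to a single bin name."""
    if raw is None or not raw.strip():
        return "Not provided"
    hits = [name for name, pattern in BINS if pattern.search(raw)]
    if len(hits) > 1:
        return "Hybrid/Mixed"
    if len(hits) == 1:
        return hits[0]
    return "Could not parse"


methods = [
    "Velvet v. 1.1.06",
    "ALLPATHS-LG v. 44837",
    "Velvet v. 1.2.10; SPAdes v. 3.1",
    "MATLAB v. R2013a",
    "",
]
counts = Counter(bin_method(m) for m in methods)
for name, count in counts.most_common():
    print(count, name, sep="\t")
```

Anything that matches no known pattern falls through to Could not parse, which is how oddities like the ones listed later in this post surface.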

Results

Count	Assembler Software
23725	Not provided
9883	AllPaths
5325	Newbler
3783	Velvet
3585	CLC Genomics Workbench
3347	Spades
2610	IDBA
2477	Celera Assembler
2082	ABYSS
1815	CLC NGS Cell
1782	SOAPdenovo
1370	Could not parse
1119	HGAP
870	MaSuRCA
853	MIRA
793	A5-MiSeq
308	Ray
149	Phred/Phrap/Consed
132	Geneious
110	SeqMan
109	HGAP3
98	Edena
69	Hybrid/Mixed
59	DNAstar
55	Platanus
53	NextGene
20	Arachne
19	DISCOVAR
9	VelvetOptimiser
5	Falcon
4	Megahit
66618	Total

Discussion

I was a little surprised to see ALLPATHS top the list, given its particular requirements for DNA library construction (overlapping PE + long mate pair), but the Broad Institute does do a lot of sequencing. A lot of people are using Velvet and Spades, but roughly as many are using CLC Genomics Workbench or the NGS Cell product.

The most disturbing and funniest entries in the Could not parse division are listed below.

in-house software v. 10/18/2012
Unknown program v. before 2013-07-02
Direct Sequencing
DNASTAR SeqMan NGen v. 4.0.0
GS Reference Mapper v. September 2013
Trimmomatic v. 0.32;
Ion Torrent PGM
Artimis v. 10.1 
artimist v. 10.1
De Bruijn graph v. Apr-2011
BCFtools Consensus
BLASTN v. actual
BOWTIE v. Version 2.1.0
BWA v. 0.5.1
BioNumerics v. 6.6
ELAND alignment algorithm
Galaxy v. May 2012
de Bruijn graphs v. Mar-2013
MAQ v. 0.7.1
MATLAB v. R2013a

At the top we have in-house software (with a version number!). The Direct Sequencing entry could be a single perfect read of a full chromosome from a really lucky Oxford Nanopore user. Is there anything Artimist (aka Artemis) cannot do? I need to upgrade my version of Trimmomatic and "actual" BLASTN too.

Conclusion

My main concern is the number of read aligners listed. Others and I have encountered draft genomes where it appears the submitters simply aligned the reads to a close reference and submitted the consensus sequence as the assembly. These "genomes" sometimes cause problems in population studies, and I'd rather the reads be available instead.
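
One way to screen for such entries is to flag any Assembly Method value that names a read aligner or consensus caller. This is a rough sketch: the tool names are taken from the Could not parse list above, the function name is hypothetical, and a match only suggests a reference-based consensus, it does not prove one.

```python
import re

# Aligner / mapping-consensus tools seen in the "Could not parse" entries.
ALIGNER_PATTERN = re.compile(
    r"\b(bowtie2?|bwa|maq|eland|gs reference mapper|bcftools)\b",
    re.IGNORECASE,
)


def looks_reference_based(method):
    """Return True if the method text names a read aligner."""
    return bool(ALIGNER_PATTERN.search(method))


suspects = [
    m
    for m in ["BWA v. 0.5.1", "BOWTIE v. Version 2.1.0", "Velvet v. 1.1.06"]
    if looks_reference_based(m)
]
print(suspects)
```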

The Genome Factory

Monday 11 April 2016

What bacterial genome assemblers are people using?
