The Genome Factory: September 2012

Yesterday we got an email from a Nesoni user who said that the SHRiMP aligner was failing on his FASTQ data. Then again today we got another similar report from our trusted colleague Tim Stinear with the same issue. The evidence was mounting for a bug either in Nesoni or in the FASTQ file, rather than user error. Tim had recently upgraded his PGM Sequencer to Torrent Suite v3.0 (point zero release alarm bells!), and Nesoni saved the shrimp_log.txt file and it contained this:

note: quality value format set to PHRED+33
done r/hr r/core-hr
The qv-offset might be set incorrectly! Currenty qvs are interpreted as PHRED+33 and a qv of 62 was observed. To disable this error, etiher set the offset correctly or disable this check (see README).

Wow! The PGM has improved dramatically, calling some bases at Q62, that's better than 1 in 1 million error rate... Here's a breakdown of the Q values in the file:

Symbol ASCII Q+33 Frequency
+ 43 10 1105774
, 44 11 347753
- 45 12 1167099
. 46 13 673276
/ 47 14 137220
0 48 15 225893
1 49 16 1621714
2 50 17 1731775
3 51 18 2726736
4 52 19 4280447
5 53 20 4272951
6 54 21 2556953
7 55 22 7783535
8 56 23 5153631
9 57 24 2362016
: 58 25 2406869
; 59 26 2517450
< 60 27 5762153
= 61 28 4334469
> 62 29 7109066
? 63 30 11113780
@ 64 31 13934507
A 65 32 9227417
B 66 33 12758868
C 67 34 8228985
D 68 35 9935410
E 69 36 1459950
F 70 37 2358692
G 71 38 682190
H 72 39 158322
I 73 40 168311
J 74 41 269
K 75 42 199121
L 76 43 83457
M 77 44 4971
N 78 45 464143
O 79 46 8657
P 80 47 0
Q 81 48 0
R 82 49 0
S 83 50 0
T 84 51 0
U 85 52 0
V 86 53 0
W 87 54 0
X 88 55 0
Y 89 56 0
Z 90 57 0
[ 91 58 0
\ 92 59 0
] 93 60 0
^ 94 61 0
_ 95 62 746

Ok, there really are 746 bases with Q62. Some grep work shows me they are all occuring alone in 746 reads, all in position 1 in the read, like this:

@QWRK0:02580:02076
GGGAATCAAAACGCTGATTTTTGATGAACAGAATAACGAA
+
_8,59=<<@@6@BCCDDDFFF3FEEEFAEC@@;77-7777

The table also shows quite a few >Q41 bases, which aren't typically seen in FASTQ files. These ones are probably ok (?) but the Q62 ones surely must be some artifact or bug in the version 3.0 of Torrent Suite.

In the meantime, our solution has been this:

sed 's/^_/H/' < dodgy.fastq > fixed.fastq

Be interested to see if others have encountered this problem.

Introduction

Illumina sequencing instruments (HiSeq, MiSeq, Genome Analyzer) can produce three main types of reads when sequencing genomic DNA:

Single-end
Each "read" is a single sequence from one end of a DNA fragment. The fragment is usually 200-800bp long, with the amount being read can be chosen between 50 and 250 bp.
Paired-end
Each "read" is two sequences (a pair) from each end of the same genomic DNA fragment (more info). The distance between the reads on the original genome sequence is equal to the length of the DNA fragment that was sequenced (usually 200-800 bp).
Mate-pair:
Like paired-end reads, each "read" is two sequences from each end of the same DNA fragment, but the DNA fragment has been engineered from a circularization process (more info) such that the distance between the reads on the original genome sequence is much longer (say 3000-10000 bp) than the proxy DNA fragment (200-800 bp).

Single-end library ("SE")

When we got the original Illumina Genome Analyzer, all it could do was 36 bp single-end reads, and each lane gave us a massive 250 Mbp, and we had to walk 7 miles through snow in the dark to get it. Ok, that last bit is clearly false as we don't get snow in Australia and we speak metric here, but the point is that there is still plenty of legacy SE data around, and SE reads are still used in RNA-Seq sometimes. Let's imagine our data was provided as a standard FASTQ file called SE.fq:

velveth Dir 31 -short -fastq SE.fq

velvetg Dir -exp_cov auto -cov_cutoff auto

I strongly recommend enabling the -exp_cov auto and -cov_cutoff auto options. They will almost always improve the quality of your assemblies.

Paired-end library ("PE")

Paired-end reads are the standard output of most Illumina sequencers these days, currently 2x100bp for the HiSeq and 2x150bp for the GAIIx and MiSeq, but they are all migrating to 2x250bp soon. The two sequences per paired read are typically distributed in two separate files, the "1" file contains all the "left" reads and the "2" file contains all the corresponding "right" reads. Let's imagine our paired-end run gave us two files in standard FASTQ format, PE_1.fq and PE_2.fq:

velveth Dir 31 -shortPaired -separate -fastq PE_1.fq PE_2.fq

velvetg Dir -exp_cov auto -cov_cutoff auto

Previously you had to interleaved the left and right files for Velvet, but we recently added support to Velvet for the -separate option which we hope is now saving time and disk space throughout the Velvetsphere!

Mate-pair library ("MP")

Mate-pair reads are extremely valuable in a de novo setting as they provide long-range information about the genome, and can help link contigs together into larger scaffolds. They have been used reliably for years on the 454 FLX platform, but used less often on the Illumina platform. I think the main reasons for this are the poorer reliability of the Illumina mate-pair protocol and the larger amount of DNA required compared to a PE library.

We can consider MP reads as the same as PE reads, but with a larger distance between them ("insert size"). But there is one technical difference due to the circularization procedure used in their preparation. PE reads are oriented "opp-in" (L=>.....<=R), whereas MP reads are oriented "opp-out" (L<=.....=>R). Velvet likes its paired reads to be in opp-in orientation, so we need to reverse-complement all our MP reads first, which I do here with the EMBOSS "revseq" tool.

revseq -sequence MP_1.fq -outseq rcMP_1.fq -notag

revseq -sequence MP_2.fq -outseq rcMP_2.fq -notag

velveth Dir 31 -shortPaired -separate -fastq rcMP_1.fq rcMP_2.fq

velvetg Dir -exp_cov auto -cov_cutoff auto -shortMatePaired yes

Early Illumina MP libraries are often contaminated with PE reads (the so-called shadow library) which are the result of imperfect selection of biotin-marked fragments in the circularization process. There is a special option in velvetg (not velveth) called -shortMatePaired added by Sylvain Foret which informs Velvet that a paired channel is MP but may contain PE reads, which helps it to account for them and avoid mis-scaffolding. I recommend using this no matter how pure you think your MP library is.

Combining them all! (SE + PE + MP)

When de novo assembling multiple libraries in Velvet, you should order them from smallest insert size to largest insert size. For our case, this means the SE first, then the PE, then the MP. Each library must go in its own "channel" as it comes from a differently prepared DNA library. Channels are specified in Velvet with a numerical suffix on the read-type parameter (absence means channel 0):

velveth \

Dir 31 \

-short -fastq SE.fq \

-shortPaired2 -separate -fastq PE_1.fq PE_2.fq \

-shortPaired3 -separate -fastq rcMP_1.fq rcMP_2.fq

velvetg \

Dir \

-exp_cov auto -cov_cutoff auto \

-shortMatePaired3 yes

Note that the -shortMatePaired option has been allocated to channel 3 now (the -shortPaired3 library) as that is the MP channel.

Conclusions

It's relatively to get up and running with Velvet, but when your projects become more complicated, the methods in this post should help you. But if you prefer a nice GUI to take care of most of the issues discussed here, I recommend using our Velvet GUI called VAGUE (Velvet Assembler Graphical User Environment).

The Genome Factory

Friday 21 September 2012

Problematic FASTQ output from Ion TorrentSuite 3.0

Saturday 8 September 2012

Using Velvet with mate-pair sequences