Thursday 10 November 2011

Compressing FASTQ reads by splitting into homogeneous streams


Today I took a FASTQ file with 3.5M reads, which was Read1 from a paired-end Illumina 100bp run - it was about 883 MB in size. As many have shown before me, GZIP compresses it to about 1/4 of the original size, and BZIP2 to about 1/5 (a sketch of the commands is given after the size list below).
  • 883252 R1.fastq
  • 233296 R1.fastq.gz
  • 182056 R1.fastq.bz2
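For reference, a minimal sketch of how the two baselines can be produced - the compression levels are an assumption, and the sizes above appear to be kilobytes, e.g. as reported by du -k:

% gzip -c R1.fastq > R1.fastq.gz      # default GZIP level, original kept
% bzip2 -c R1.fastq > R1.fastq.bz2    # default BZIP2 level, original kept
% du -k R1.fastq R1.fastq.gz R1.fastq.bz2   # sizes in kB
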
I then split the read file into 3 separate files: (1) the ID line, but with the mandatory '@' removed, (2) the sequence line, but uppercased for consistency, and (3) the quality line, unchanged. I ignored the 3rd line of each FASTQ entry (the '+' line), as it is redundant; one way to do the split is sketched after the size list below. This knocked 1% off the total size.
  • 189588 id.txt
  • 341756 seq.txt
  • 341756 qual.txt
  • 873100 TOTAL
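For the record, one way to do the split using plain AWK (sed or Perl would do just as well):

% awk 'NR % 4 == 1 { print substr($0,2) }' R1.fastq > id.txt     # ID line, leading '@' stripped
% awk 'NR % 4 == 2 { print toupper($0) }' R1.fastq > seq.txt     # sequence line, uppercased
% awk 'NR % 4 == 0' R1.fastq > qual.txt                          # quality line, unchanged
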
Now, I compressed each of the three streams (ID, Sequence, Quality) independently with GZIP (the one-liner is shown after the list below). The idea is that these general-purpose compression schemes work better on homogeneous data streams than on the three types interleaved in one stream. As you can see this does improve things, by about 11%, but it is still not as good as plain BZIP2 on the original interleaved file.
  •  20608 id.txt.gz
  •  84096 qual.txt.gz
  • 102040 seq.txt.gz
  • 206644 TOTAL (was 233296 combined)
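Compressing each stream is just the obvious loop (default GZIP settings assumed):

% for f in id.txt seq.txt qual.txt; do gzip -c $f > $f.gz; done
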
If we use BZIP2 to compress the three separate streams, the total is only about 3% smaller than when it compressed the single interleaved file. This is a testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP does.
  •  16560 id.txt.bz2
  •  66812 qual.txt.bz2
  •  93564 seq.txt.bz2
  • 176936 TOTAL (was 182056 combined)
So in summary, we've re-learnt that BZIP2 is better than GZIP, and that they are both doing quite well adapting to the three interleaved data types in a FASTQ file.


Friday 30 September 2011

Counting sequences with Unix tools

Every command-line bioinformatician has a suite of specific utility tools in their $PATH for doing core operations on common data files, such as counting the number of sequences in a .fasta file. Sometimes, however, you end up on someone else's server where those tools are not available. It's in this situation that a good working knowledge of the standard Unix tools becomes valuable. Here's a short list of the most useful ones for chaining together:

  • cat - show
  • grep - filter
  • sed - modify
  • wc - count
  • cut - extract columns
  • sort - sort
  • uniq - identify duplicates
  • head - extract start
  • tail - extract end
  • expr - arithmetic
The classic example is counting the number of sequences in a .fasta file:

% grep '>' in.fasta | wc -l

Because each sequence's ID line contains a ">" character, the grep selects all the ID lines, and the number of ID lines equals the number of sequences. The wc command counts its input, and -l tells it to count lines rather than characters or words.

Most of us will be using modern systems with GNU grep, which has a -c option that counts matching lines rather than printing them out one by one, removing the need for wc:

% grep -c '>' in.fasta

Make sure you put quote characters around the ">", otherwise the shell will think you want to redirect the output of grep into in.fasta, which will result in in.fasta being truncated to a zero-length file. Not ideal.
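
To see why, this is what the shell would see without the quotes (do NOT run this):

% grep -c > in.fasta    # '> in.fasta' is parsed as a redirection, so in.fasta is emptied before grep even runs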

More common today is the .fastq file, used for storing millions of short reads from high-throughput sequencing machines. An example of one entry is shown below:

@HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1
TCAAAAATCAGCCGTCACCGAGTATTACCTGAATCACGGCAAATGGCCGGAAAACAC
+HWUSI-EAS-100R_0002:7:1:2596:12829#ACAGTG/1
ffffffffffdfffedaddbaa\ba\Yb`]_a`a```_^`_]YT\Q`]]TT]^BBBB

At first glance, the simplest way to count entries in a .fastq file would be to extend what we did for .fasta files, but using the "@" character to match the ID line.

% grep -c '@' in.fastq

Unfortunately, this doesn't always work, as "@" is also a valid symbol in the encoded quality string! No worries, let's use the "+" character on the second ID line instead. Arrggh, it's also a valid quality symbol! WTF? At this stage you start muttering expletives about moron file format designers, but then calm down when you realise it could have been much worse, i.e. XML.
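
You might be tempted to anchor the match to the start of the line, but that is still not reliable, because a quality string can legitimately begin with "@" too:

% grep -c '^@' in.fastq    # still unsafe: a quality line can start with '@'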

Technically, the sequence and quality parts of a .fastq entry can span multiple lines, just like the 60-column wrapped .fasta files you often see. However, most vendors stick to 4 lines per entry, putting the sequence and quality strings on a single line each. This means we can count entries by dividing the number of lines in the file by four:

% LINES=`cat in.fastq | wc -l`
% READS=`expr $LINES / 4`
% echo $READS

The above is traditional Bourne shell, but most people are using modern-ish shells like BASH, where this can be written more concisely as:

% expr $(cat in.fastq | wc -l) / 4
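
If you are definitely in BASH, you can also skip expr and use the built-in arithmetic expansion:

% echo $(( $(wc -l < in.fastq) / 4 ))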

That's it for now. More posts to follow.


Wednesday 21 September 2011

The easy way to close a genome

I don't know what all the fuss is about. If molecular biologists simply learnt some basic Unix command line tools, they could rid themselves of messy PCRs, oligo design, optical maps, primer walking etc.

% echo ">ClosedGenome" > closed.fasta
% grep -v '>' 454AllContigs.fna | sed 's/[^AGTC]//gi' >> closed.fasta
% mail -s "Genbank genome submission" genomes@ncbi.nlm.nih.gov < closed.fasta

Too easy! :-P