Thursday 10 November 2011

Compressing FASTQ reads by splitting into homogeneous streams


Today I took a FASTQ file with 3.5M reads, which was Read 1 from a paired-end Illumina 100bp run. It was about 883 MB in size. As many have shown before me, GZIP compresses it to about 1/4 of the original size, and BZIP2 to about 1/5 (sizes below are in KB).
  • 883252 R1.fastq
  • 233296 R1.fastq.gz
  • 182056 R1.fastq.bz2
I then split the read file into 3 separate files: (1) the ID line, but with the mandatory '@' removed, (2) the sequence line, but uppercased for consistency, and (3) the quality line, unchanged. I ignored the 3rd line of each FASTQ entry (the '+' line), as it is redundant. This knocked about 1% off the total size.
  • 189588 id.txt
  • 341756 seq.txt
  • 341756 qual.txt
  • 873100 TOTAL
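The split described above can be sketched in a few lines of Python. This is only an illustration, not the script I actually ran: the function name `split_fastq` is made up, and it assumes strict 4-line FASTQ records (no wrapped sequence or quality lines).

```python
import io


def split_fastq(handle):
    """Split FASTQ text into three lists: IDs, sequences, qualities.

    Assumes each record is exactly 4 lines (hypothetical helper).
    """
    ids, seqs, quals = [], [], []
    while True:
        header = handle.readline()
        if not header:           # end of file
            break
        if not header.strip():   # tolerate stray blank lines
            continue
        seq = handle.readline()
        handle.readline()        # the '+' line is redundant, so drop it
        qual = handle.readline()
        if not header.startswith("@"):
            raise ValueError("expected '@' at start of record: %r" % header)
        ids.append(header.rstrip("\n")[1:])    # strip the mandatory '@'
        seqs.append(seq.rstrip("\n").upper())  # uppercase for consistency
        quals.append(qual.rstrip("\n"))        # quality left unchanged
    return ids, seqs, quals


if __name__ == "__main__":
    sample = "@r1/1\nacgtn\n+\nIIII#\n@r2/1\nGGCCA\n+r2/1\n#IIII\n"
    ids, seqs, quals = split_fastq(io.StringIO(sample))
    print(ids, seqs, quals)
```

Writing each list out with one entry per line gives the equivalent of id.txt, seq.txt and qual.txt.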
Next, I compressed each of the three streams (ID, Sequence, Quality) independently with GZIP. The idea is that a dictionary-based compression scheme like GZIP should work better on homogeneous data streams than on the same data interleaved in one stream. As you can see, this does improve things by about 11%, but it is still not as good as BZIP2 without de-interleaving.
  •  20608 id.txt.gz
  •  84096 qual.txt.gz
  • 102040 seq.txt.gz
  • 206644 TOTAL (was 233296 combined)
If we use BZIP2 to compress the de-interleaved streams, it does only about 3% better than when it was a single stream. This is testament to BZIP2's ability to cope with heterogeneous data streams better than GZIP.
  •  16560 id.txt.bz2
  •  66812 qual.txt.bz2
  •  93564 seq.txt.bz2
  • 176936 TOTAL (was 182056 combined)
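For anyone wanting to repeat the comparison, here is a rough sketch using Python's standard `gzip` and `bz2` modules instead of the command-line tools. The function name `compare_split_vs_combined` is made up for illustration, the sizes it reports are in bytes, and the library defaults may not match the command-line settings, so the numbers will not agree exactly with the ones above.

```python
import bz2
import gzip


def compare_split_vs_combined(interleaved, streams):
    """Compare compressing one interleaved blob against separate streams.

    interleaved: bytes of the original FASTQ.
    streams: dict mapping a stream name to its bytes (e.g. id, seq, qual).
    Returns {compressor: (combined_size, split_total_size)} in bytes.
    """
    results = {}
    for label, compress in (("gzip", gzip.compress), ("bzip2", bz2.compress)):
        combined = len(compress(interleaved))
        split_total = sum(len(compress(data)) for data in streams.values())
        results[label] = (combined, split_total)
    return results


if __name__ == "__main__":
    # Tiny synthetic input; real reads will behave very differently.
    interleaved = b"@r1\nACGT\n+\nIIII\n" * 100
    streams = {
        "id": b"r1\n" * 100,
        "seq": b"ACGT\n" * 100,
        "qual": b"IIII\n" * 100,
    }
    for label, (combined, split_total) in compare_split_vs_combined(
        interleaved, streams
    ).items():
        saving = 100.0 * (combined - split_total) / combined
        print("%s: combined=%d split=%d saving=%.1f%%"
              % (label, combined, split_total, saving))
```

The saving is computed relative to the combined size, which is how the percentages in this post are worked out: (233296 - 206644) / 233296 for GZIP and (182056 - 176936) / 182056 for BZIP2.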
So in summary, we've re-learnt that BZIP2 is better than GZIP, and that they are both doing quite well adapting to the three interleaved data types in a FASTQ file.