Monday, October 01, 2012

code snip to decide Phred encoding of FASTQ file

It's a common task to check the Phred quality score encoding of fastq file. Sure, it's fairly easy to check this by eyes (according to the rul below), and also there might be already some tools for that purpose.  The task is essentially a problem of converting ASCII character into its decimal format.


  SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS.....................................................
  ..........................XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX......................
  ...............................IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................
  .................................JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ......................
  LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL....................................................
  !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
  |                         |    |        |                              |                     |
 33                        59   64       73                            104                   126

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
    (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)


To be simple, I just wrote a line of code for that, by only looking the first lines of reads. Here it is:

head input.fastq | awk '{if(NR%4==0) printf("%s",$0);}' |  od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding!";}'

To use it, you can also save it into a bash, like

#!/bin/sh

inputfile=$1

# Update: The following part for checking the file extension can be simplified (Thanks to the comment from Unknown) 
[[ "$inputfile" =~ ".*\.(fq\.gz$|fastq\.gz$)" ]] && zcat $inputfile | head -n40 ...
[[ "$inputfile" =~ ".*\.(fq$|fastq$)" ]] && head -n40 $inputfile | ...

less $inputfile | head -n40 | awk '{if(NR%4==0) printf("%s",$0);}' |  od -A n -t u1 | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i<min) min=$i;}}END{if(max<=74 && min<59) print "Phred+33"; else if(max>73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding!";}'

The only trick is the command od, which is to display/convert file content in a specific format. Here is the more detail for od (http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.cmds/doc/aixcmds4/od.htm).

5 comments:

  1. Anonymous4:56 AM

    I think my fastq file is Phred+64 (based on the DetermineFastqQualityEncoding.pl script available from this URL:
    http://www.mcdonaldlab.biology.gatech.edu/bioinformatics/DetermineFastqQualityEncoding.pl

    But, when I give the 40 lines of fastq below to your code it says phred+33. If I give it 400 lines from the same file it gives me "Unknown score encoding!". If I give the 400 lines to DetermineFastqQualityEncoding.pl it give sme phred+64. The problem is garbage reads in the top of this file. Seems like you should at least update your "head -n40" to "head -n400" since low quality in the first few lines is common. But maybe you could also add a check for very low quality sequences into your awk code?

    Here are the 40 lines of FASTQ:
    @HWUSI-EAS1814:10:1:1:1131:942#0/1NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN+HWUSI-EAS1814:10:1:1:1131:942#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1164:943#0/1
    NANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1164:943#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1179:930#0/1
    NANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1179:930#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1233:932#0/1
    NANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1233:932#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1285:947#0/1
    NGANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1285:947#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1314:941#0/1
    NAANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1314:941#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1365:949#0/1
    NGACNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1365:949#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1409:944#0/1
    NAGCANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1409:944#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1428:940#0/1
    NTTATATNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
    +HWUSI-EAS1814:10:1:1:1428:940#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
    @HWUSI-EAS1814:10:1:1:1525:933#0/1
    NACCAAGTATGGAATTGTTTGACAGCTTGGGAGGGAACCTTCAATTNNNGNNNNNNNNNNNTNNNNNAGAGAAAATTATCCCTNNTGTG
    +HWUSI-EAS1814:10:1:1:1525:933#0/1
    BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB

    ReplyDelete
    Replies
    1. Thanks for the comment. Definitely increasing N to a larger number will help. Can you post your min and max score in your first 400 reads? If they are between 64 and 73, it's impossible to tell if it's Solexa+64 or Phred+64, according to the definition.

      Delete
    2. Anonymous1:42 PM

      I actually just ran into this same problem. It turns out that od will return * if it sees too many of the same character in the row. This frequently happens at the end of a low quality read where you will see all Bs (as in the example above). The easy solution is to just add -v to the od command:

      head input.fastq | awk '{if(NR%4==0) printf("%s",$0);}' | od -A n -t u1 -v | awk 'BEGIN{min=100;max=0;}{for(i=1;i<=NF;i++) {if($i>max) max=$i; if($i73 && min>=64) print "Phred+64"; else if(min>=59 && min<64 && max>73) print "Solexa+64"; else print "Unknown score encoding!";}'

      Delete
  2. Just a little "trick" to avoid the more complicated way you describe your fastq.gz method:
    replacing head by

    less sequences.fastq.gz | head| ....

    It works as well

    ReplyDelete