Quality score encoding #147

Open
rjg2186 opened this issue Feb 7, 2025 · 1 comment

Comments

@rjg2186

rjg2186 commented Feb 7, 2025

Hi @s-andrews,

I have a few thousand reads in which every base has a quality of >= 30; the majority of the bases have quality 34. When I run them through FastQC, the report gives the quality encoding as "Illumina 1.5", but these should be Illumina 1.9 (Phred+33). Is there any way to pass the quality encoding to FastQC as a parameter? Below are a few example reads:

@VH00243:66:AAGGMYWM5:1:1101:66233:4502 1:N:0:NGTCAGACGA+TGTCGCTGGT
ACCTTACGGGACTTTCCTACTTGGCAGTACATCTACGTA
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH00243:66:AAGGMYWM5:1:1101:40159:4900 1:N:0:NGTCAGACGA+TGTCGCTGGT
CACTGAGGCCGCCCGGGCAAAGCCCGGGCGTCGGG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
@VH00243:66:AAGGMYWM5:1:1101:56954:13930 1:N:0:NGTCAGACGA+TGTCGCTGGT
CAGTACGCCTTTGTCACTTTCTTACACTGTCTCCTATAG
+
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

Thanks

@s-andrews
Owner

Something odd is going on with this data. This type of misdetection is certainly possible in FastQC, but it would only happen if no base anywhere in the file had a Phred score below 31 (ASCII char < 64), which would make for a fairly remarkable dataset unless it has been heavily filtered.
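To make the failure mode concrete, here is a minimal Python sketch of a lowest-character heuristic of the kind described above. Only the cutoff of 64 comes from this comment; the function names are hypothetical, the input is assumed to be unwrapped 4-line FASTQ records, and none of this is FastQC's actual code.

```python
import io

def lowest_quality_char(fastq_handle):
    """Return the smallest ASCII code seen on any quality line.

    Assumes plain 4-line FASTQ records, so every 4th line holds qualities.
    """
    lowest = None
    for i, line in enumerate(fastq_handle):
        if i % 4 == 3:
            quals = line.rstrip("\r\n")
            if quals:
                line_min = min(map(ord, quals))
                lowest = line_min if lowest is None else min(lowest, line_min)
    return lowest

def guess_offset(lowest):
    """Guess the Phred offset from the lowest quality character."""
    # A character below '@' (ASCII 64) cannot occur in Phred+64 data,
    # so seeing one is the only evidence that the file is Phred+33.
    return 33 if lowest < 64 else 64

# Two toy records whose qualities are all 'C', as in the reads posted above.
reads = io.StringIO(
    "@read1\nACGT\n+\nCCCC\n"
    "@read2\nACGT\n+\nCCCC\n"
)
lowest = lowest_quality_char(reads)
print(lowest, guess_offset(lowest))  # 67 64
```

With nothing below ASCII 64 anywhere in the input, a heuristic like this has no evidence for Phred+33 and falls back to offset 64.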

In your case it's even weirder, because it appears that every base call in every read has exactly the same quality (ASCII character C, code 67: Phred+33 = 34, Phred+64 = 3). That would be very unlikely in any real dataset, so either you're looking at a highly selected subset of reads, or something has messed with your quality scores before they got here.
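The arithmetic behind those two numbers can be checked in a couple of lines of Python. The helper name `phred` is only illustrative.

```python
def phred(char, offset):
    """Decode one FASTQ quality character using the given ASCII offset."""
    return ord(char) - offset

print(ord("C"))        # 67
print(phred("C", 33))  # 34
print(phred("C", 64))  # 3
```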

There isn't an option in FastQC to bypass the auto-detection. In theory this could be added, but I've never seen a real dataset where it was needed. Adding it would also require a bunch of other sanity checks, because if it was applied incorrectly it would allow really stupid Phred scores to be calculated, which could break other parts of the code.
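As a rough illustration of the kind of sanity check a forced offset would need, the Python sketch below rejects a quality line that decodes to impossible scores. The function `check_forced_offset` and the upper bound of 93 (the highest score expressible in printable ASCII under Phred+33) are assumptions for this example; no such option or check exists in FastQC.

```python
def check_forced_offset(quality_line, offset, max_score=93):
    """Decode a quality line with a user-forced offset, rejecting impossible scores."""
    scores = [ord(c) - offset for c in quality_line]
    bad = [s for s in scores if s < 0 or s > max_score]
    if bad:
        raise ValueError(
            f"offset {offset} gives out-of-range Phred scores, e.g. {bad[0]}"
        )
    return scores

print(check_forced_offset("CCCC", 33))  # [34, 34, 34, 34]

try:
    # '5' is ASCII 53, so forcing offset 64 would give 53 - 64 = -11.
    check_forced_offset("5555", 64)
except ValueError as err:
    print(err)  # offset 64 gives out-of-range Phred scores, e.g. -11
```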
