Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sequencing pileup #98

Open
wants to merge 13 commits into
base: main
Choose a base branch
from
Open

sequencing pileup #98

wants to merge 13 commits into from

Conversation

Koeng101
Copy link
Owner

@Koeng101 Koeng101 commented Sep 5, 2024

this is code for doing real pileup sequence mutation calling. Attached are over 100 test sequencing pileups from a real run of mine, so this is what ACTUAL data will look like, with manual annotation of what I think is going on.

@Koeng101
Copy link
Owner Author

Alrighty. So now I have a mutation caller than I think works. Now I just need to populate it with real data.

@Koeng101
Copy link
Owner Author

Aight, so now I have a wtf moment. The pileup and sequence analysis from cs optional flags give me different results, but upon manual analysis, I trust the cs optional flags more. Maybe it's the quality scores that are impacting alignments?

@Koeng101
Copy link
Owner Author

Now, here is the problem, and how I am thinking about it:

Bcftools and samtools are implementing variant calling on a base level - using base quality scores, mapping quality, read depth, bias, etc. However, most of those don't matter to me:

  1. Mapping quality will be relatively high, because we identified the proper expected template elsewhere.
  2. Base quality CAN matter, but more on that below.
  3. We take into account in plasmidcall all the other biases

So really, the difference is that we can do allele analysis to a greater depth with vcf and the like. But we're not looking for variants, we're looking for if something is sequenced correctly or not with nanopore sequencing. With the newer R10 cells, the errors are usually not random - effectively, I don't care about random errors. I do care about nanopore sequencing errors - then the question is, how much does the quality score of different non-random nanopore errors compare to the correct scores? Let's do an analysis of this line:

A_6WeTo1Txn3tfRs8BVQ484s    165 A   677 ...........G.....G..G..G.G.G..G..G.G..........GG...G..GGG.G...GGC.G.G.G..G..G...G.....G.GG..........G..G...G..G.....G.....GG.G...G....GGGG...G.G..GGGG..G.....G...GGGGGG.....GGGGG..GGGG.GGGGG..G.GGGGGG..GGGGG.GGG...GGG.GGGGG.......G......G......G.................GG.G..G.......................G.....G...........G.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,....G.G.G.GG..,,,,,,,*,,,..,,,,,,...GG....GG...G,..G..G.,G,.   FB8EFB6/CDB3C61A71GA1?A9:7?J?G686.C4:4E;3S=5<A<<9<JLFF41B6?SD8:.1;4D:13@>44>CF.0@<:.;?953/4GDE48=;4I3158>?53E::08<B>=<A654363/A?>:8@68.:@79.>0>?1DK:EB2;B3/0873D/?.1>5F59/EL?<./J9>F=05:3.8<0B><.><94F;19@@@D0G11C<93<<>F5?7A@:?B07FAGES99;8177>@@B5C:BC21HF4C1A0I?8J;00@?=@/.G<=>4J:/EB/:1>97:9>37D3H>:09GF;<SB;C1B9;.5S4ESDLAGSGA9=6CQB@>/@:C=EJ:0HGF>O8H;HFLQ2.GAE8DB8F54FF=B9>FGEFIEDFJJN0EAGCFJM<BAHDO8ED7SKNEGF?C20B3F:;2ASS8I8=G:IHF>AJ4IH@;7F?FK0ECLS4A5IAB446S0:GFI66AS;7IIL<IA>4CD2F4/FH==F=G64F>JI161HS0A9F:?AG;2GN@;8HGA=//1>81F1A9J;B6E8?5;DCC4A6;@FE/7E9S/>6OD=8CFEFM<JCDC41KSSF:SJL;/GHESKACHS>QOJ;DG=PSSEIBFAIEDSEEDSEKJS3;?HS9=7BBL0E:6EDIELA6;CSH:SAFP036.E4FC?0926CB.?J=.0J.E/J<SE

Now the problem is that they're the base-level quality scores aren't that different:

For the base G (potential error): 24.19
For the base . (correct match on forward strand): 27.24
For the base , (correct match on reverse strand): 32.45

Furthermore:

For the base G with a quality score of 24.19, the error probability P(∼A)P(∼A) is approximately 0.00381. This suggests there's about a 0.381% chance that this base is incorrectly called.
For the base . with a quality score of 27.24, the error probability P(∼A)P(∼A) is about 0.00189. This indicates roughly a 0.189% chance of an error in this call.
For the base , with a quality score of 32.45, the error probability P(∼A)P(∼A) is around 0.000569. This corresponds to an extremely low error likelihood of about 0.057%.

Essentially, the basecaller has no fucking clue, it still assigns a 99.6% correctness score to DNA which is clearly not correct. The base quality information, in this case, is simply not useful information.

Thus, we gain more information from the strand analysis. I believe plasmidcall is sufficient, but now looking at doing a little bit more robust testing.

@Koeng101
Copy link
Owner Author

Ok, now I have it mostly working with pileup files, but there are some exception cases. But those don't matter as much. I think the current version is working well enough to deploy:

  1. Check vcf file. If there is nothing, great! We say the sequence is correct
  2. Check pileup file. If it is confirmed, great! We say the sequence is correct
  3. If neither are true, flag the sequence, and continue along.

This PR just needs some final finishing details and should be ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant