sequencing pileup #98

Koeng101 · 2024-09-05T03:47:42Z

this is code for doing real pileup sequence mutation calling. Attached are over 100 test sequencing pileups from a real run of mine, so this is what ACTUAL data will look like, with manual annotation of what I think is going on.

Koeng101 · 2024-09-11T17:37:31Z

Alrighty. So now I have a mutation caller than I think works. Now I just need to populate it with real data.

Koeng101 · 2024-09-12T16:29:50Z

Aight, so now I have a wtf moment. The pileup and sequence analysis from cs optional flags give me different results, but upon manual analysis, I trust the cs optional flags more. Maybe it's the quality scores that are impacting alignments?

Koeng101 · 2024-09-24T17:36:39Z

Now, here is the problem, and how I am thinking about it:

Bcftools and samtools are implementing variant calling on a base level - using base quality scores, mapping quality, read depth, bias, etc. However, most of those don't matter to me:

Mapping quality will be relatively high, because we identified the proper expected template elsewhere.
Base quality CAN matter, but more on that below.
We take into account in plasmidcall all the other biases

So really, the difference is that we can do allele analysis to a greater depth with vcf and the like. But we're not looking for variants, we're looking for if something is sequenced correctly or not with nanopore sequencing. With the newer R10 cells, the errors are usually not random - effectively, I don't care about random errors. I do care about nanopore sequencing errors - then the question is, how much does the quality score of different non-random nanopore errors compare to the correct scores? Let's do an analysis of this line:

A_6WeTo1Txn3tfRs8BVQ484s    165 A   677 ...........G.....G..G..G.G.G..G..G.G..........GG...G..GGG.G...GGC.G.G.G..G..G...G.....G.GG..........G..G...G..G.....G.....GG.G...G....GGGG...G.G..GGGG..G.....G...GGGGGG.....GGGGG..GGGG.GGGGG..G.GGGGGG..GGGGG.GGG...GGG.GGGGG.......G......G......G.................GG.G..G.......................G.....G...........G.,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,*,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,....G.G.G.GG..,,,,,,,*,,,..,,,,,,...GG....GG...G,..G..G.,G,.   FB8EFB6/CDB3C61A71GA1?A9:7?J?G686.C4:4E;3S=5<A<<9<JLFF41B6?SD8:.1;4D:13@>44>CF.0@<:.;?953/4GDE48=;4I3158>?53E::08<B>=<A654363/A?>:8@68.:@79.>0>?1DK:EB2;B3/0873D/?.1>5F59/EL?<./J9>F=05:3.8<0B><.><94F;19@@@D0G11C<93<<>F5?7A@:?B07FAGES99;8177>@@B5C:BC21HF4C1A0I?8J;00@?=@/.G<=>4J:/EB/:1>97:9>37D3H>:09GF;<SB;C1B9;.5S4ESDLAGSGA9=6CQB@>/@:C=EJ:0HGF>O8H;HFLQ2.GAE8DB8F54FF=B9>FGEFIEDFJJN0EAGCFJM<BAHDO8ED7SKNEGF?C20B3F:;2ASS8I8=G:IHF>AJ4IH@;7F?FK0ECLS4A5IAB446S0:GFI66AS;7IIL<IA>4CD2F4/FH==F=G64F>JI161HS0A9F:?AG;2GN@;8HGA=//1>81F1A9J;B6E8?5;DCC4A6;@FE/7E9S/>6OD=8CFEFM<JCDC41KSSF:SJL;/GHESKACHS>QOJ;DG=PSSEIBFAIEDSEEDSEKJS3;?HS9=7BBL0E:6EDIELA6;CSH:SAFP036.E4FC?0926CB.?J=.0J.E/J<SE

Now the problem is that they're the base-level quality scores aren't that different:

For the base G (potential error): 24.19
For the base . (correct match on forward strand): 27.24
For the base , (correct match on reverse strand): 32.45

Furthermore:

For the base G with a quality score of 24.19, the error probability P(∼A)P(∼A) is approximately 0.00381. This suggests there's about a 0.381% chance that this base is incorrectly called.
For the base . with a quality score of 27.24, the error probability P(∼A)P(∼A) is about 0.00189. This indicates roughly a 0.189% chance of an error in this call.
For the base , with a quality score of 32.45, the error probability P(∼A)P(∼A) is around 0.000569. This corresponds to an extremely low error likelihood of about 0.057%.

Essentially, the basecaller has no fucking clue, it still assigns a 99.6% correctness score to DNA which is clearly not correct. The base quality information, in this case, is simply not useful information.

Thus, we gain more information from the strand analysis. I believe plasmidcall is sufficient, but now looking at doing a little bit more robust testing.

Koeng101 · 2024-09-25T03:40:49Z

Ok, now I have it mostly working with pileup files, but there are some exception cases. But those don't matter as much. I think the current version is working well enough to deploy:

Check vcf file. If there is nothing, great! We say the sequence is correct
Check pileup file. If it is confirmed, great! We say the sequence is correct
If neither are true, flag the sequence, and continue along.

This PR just needs some final finishing details and should be ready.

Koeng101 added 7 commits September 4, 2024 20:45

init

9a0e546

make linter happy

4c1108b

Add cs alignment and cs example to sam

cc83cb5

add plasmidcall

6ef8cd8

fix linter

ab81809

update artifact

ba7d905

remove pileup

ee97a32

hmmm

edcf4ea

Koeng101 added 5 commits September 24, 2024 16:48

remove plasmidcall

e4c06c0

updated pileup stuffs

42a222f

z

fb21046

Merge branch 'main' into sequencingPileup

114470c

add docs

4c5c3b9

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sequencing pileup #98

sequencing pileup #98

Koeng101 commented Sep 5, 2024

Koeng101 commented Sep 11, 2024

Koeng101 commented Sep 12, 2024

Koeng101 commented Sep 24, 2024

Koeng101 commented Sep 25, 2024

sequencing pileup #98

Are you sure you want to change the base?

sequencing pileup #98

Conversation

Koeng101 commented Sep 5, 2024

Koeng101 commented Sep 11, 2024

Koeng101 commented Sep 12, 2024

Koeng101 commented Sep 24, 2024

Koeng101 commented Sep 25, 2024