Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing values for multi-allelic variants result in large negative numbers in .format("AD"), not all the same #304

Closed
blaiseli opened this issue Apr 24, 2024 · 1 comment

Comments

@blaiseli
Copy link

blaiseli commented Apr 24, 2024

I came upon a vcf record that failed to pass a sanity check I had put in my code.

Here is a minimal example using test.vcf.gz:

$ cat test.py
import numpy as np
from cyvcf2 import VCF

vcf_in = VCF("test.vcf.gz")
rec = next(vcf_in)
depths = rec.format("AD")
print(rec.gt_ref_depths)
print(rec.gt_alt_depths)
print(depths)
assert np.alltrue(depths[:, 1:].sum(axis=1) == rec.gt_alt_depths)
$ python3 test.py 
[-1  3  2  2  4  5]
[2 3 0 2 0 1]
[[-2147483648 -2147483647 -2147483647]
 [          3           1           2]
 [          2           0           0]
 [          2           2           0]
 [          4           0           0]
 [          5           0           1]]
Traceback (most recent call last):
  File "/pasteur/appa/homes/bli/test/test.py", line 10, in <module>
    assert np.alltrue(depths[:, 1:].sum(axis=1) == rec.gt_alt_depths)
AssertionError

If I understand correctly, -2147483648 corresponds to missing values (#195 (comment)), but here I also have -2147483647.

If the values are missing, it would make sense to me to either set them as 0 or as explicit missing values.
What would be a reliable and clean way to do that ?

The record:

$ zcat test.vcf.gz | tail -1
5       1065    .       C       A,T     0.00292918      .       AB=0,0;ABP=0,0;AC=0,0;AF=0,0;AN=10;AO=3,3;CIGAR=1X,1X;DP=22;DPB=22;DPRA=1.25,1.8;EPP=3.73412,9.52472;EPPR=16.582;GTI=0;LEN=1,1;MEANALT=1.5,1.5;MQM=4.66667,9.33333;MQMR=17.8125;NS=5;NUMALT=2;ODDS=8.77585;PAIRED=1,1;PAIREDR=0.875;PAO=0,0;PQA=0,0;PQR=0;PRO=0;QA=119,123;QR=581;RO=16;RPL=0,0;RPP=9.52472,9.52472;RPPR=7.89611;RPR=3,3;RUN=1,1;SAF=2,3;SAP=3.73412,9.52472;SAR=1,0;SRF=10;SRP=5.18177;SRR=6;TYPE=snp,snp;ANN=A|intergenic_region|MODIFIER|CHR_START-Ld1S_050809800|CHR_START-Ld1S_050809800|intergenic_region|CHR_START-Ld1S_050809800|||n.1065C>A||||||,T|intergenic_region|MODIFIER|CHR_START-Ld1S_050809800|CHR_START-Ld1S_050809800|intergenic_region|CHR_START-Ld1S_050809800|||n.1065C>T||||||        GT:DP:AD:RO:QR:AO:QA:GL .       0/0:6:3,1,2:3:115:1,2:41,82:0,-1.04247,-8.64321,-0.118841,-8.29491,-7.49664      0/0:2:2,0,0:2:82:0,0:0,0:0,-0.60206,-0.474948,-0.60206,-0.474948,-0.474948      0/0:4:2,2,0:2:73:2,0:78,0:0,-0.0648794,-2.54583,-0.60206,-3.14789,-3.55812       0/0:4:4,0,0:4:164:0,0:0,0:0,-1.20412,-3.88208,-1.20412,-3.88208,-3.88208 0/0:6:5,0,1:5:147:0,1:0,41:0,-1.50515,-4.99707,-0.506833,-4.1256,-3.82457

(This is extracted from a file produced by freebayes, then snpEFF, if that matters.)

@brentp
Copy link
Owner

brentp commented Apr 24, 2024

Hi, for speed, cyvcf2 returns a numpy array where all rows must have the same dimension.
-2147483648 is missing and -2147483647 is vector end.
Using that, you can go through with a list comprehension or numpy expression depths[depths < 0] = 0 but that's not done internally again for speed. I just reads directly from htslib which reads directly from the file.
remember that the rec.gt_* attributes only work for bi-allelic variants and should not be used for multi-allelics.

@brentp brentp closed this as completed Apr 24, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants