-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More options for control of behavior in read_genotype #33
Comments
Thanks @CarlKCarlK! Let's see:
Looks to be the same: https://github.com/limix/cbgen/blob/master/cbgen/_bgen_file.py#L70
👍
Woa, where'd that come from?! I've been wondering that very thing recently.
How does the int8 option work? If there was a way to get the encoded values (i.e. before they're converted back to probabilities), that would be amazing. I'm basically working through how to redo the encoding in sgkit-dev/sgkit-bgen#14 so bypassing that altogether would be great.
Very cool. I figured that would be a pain. How do you do it out of curiosity? |
Just to clarify that it is Plink Bed-reader, not the Bgen reader that offers direct support for {float32,float64,int8}x{F,C}
(The names are confusing. I’ll start prefacing “bed” with “plink bed”.]
From: Eric Czech<mailto:[email protected]>
Sent: Monday, August 24, 2020 9:02 AM
To: limix/bgen-reader-py<mailto:[email protected]>
Cc: Carl Kadie<mailto:[email protected]>; Mention<mailto:[email protected]>
Subject: Re: [limix/bgen-reader-py] More options for control of behavior in read_genotype (#33)
Thanks @CarlKCarlK<https://eur05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2FCarlKCarlK&data=02%7C01%7C%7C59371dab68e54589831108d848471d48%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637338817548935215&sdata=5h1D0wKfAw3La1ARm5ZmesklQr%2FjucsrADsNO9XCxGc%3D&reserved=0>! Let's see:
I haven’t look at the new CBGEN API
Looks to be the same: https://github.com/limix/cbgen/blob/master/cbgen/_bgen_file.py#L70<https://eur05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fcbgen%2Fblob%2Fmaster%2Fcbgen%2F_bgen_file.py%23L70&data=02%7C01%7C%7C59371dab68e54589831108d848471d48%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637338817548945209&sdata=Y4%2FCLjjGXbKDv6Yh2sDHJw2frOf4zG2f%2B2n5rm%2FWFKo%3D&reserved=0>
Bgen2 allocates a (relatively tiny) float64 buffer that it reuses across samples [until/unless it sees the number of combinations changes, which often never happens.] It then copies from that buffer into an array of whatever dtype and order the user wants.
👍
“If you know the compression level of your BGEN file, you can sometimes save 50% or 75% on memory with dtype.
(Test with your data to confirm you are not losing any precision.) The approximate relationship is:
* BGEN compression 1 to 10 bits: dtype ='float16'
* BGEN compression 11 to 23 bits: dtype ='float32'
* BGEN compression 24 to 32 bits: dtype ='float64' (default)”
Woa, where'd that come from?! I've been wondering that very thing recently.
Aside: The Bed reader’s C++ code offers direct support for {float32,float64,int8}x{F,C}
How does the int8 option work? If there was a way to get the encoded values (i.e. before they're converted back to probabilities), that would be amazing. I'm basically working through how to redo the encoding in sgkit-dev/sgkit-bgen#14<https://eur05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fpystatgen%2Fsgkit-bgen%2Fissues%2F14&data=02%7C01%7C%7C59371dab68e54589831108d848471d48%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637338817548945209&sdata=0NaU7HAQ2CTUlcp%2FQUra6xOzHJeX3TpIb4dvi%2F%2B5vfo%3D&reserved=0> so bypassing that altogether would be great.
It was a pain to “templatize” the C++ code with macros.
Very cool. I figured that would be a pain. How do you do it out of curiosity?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://eur05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F33%23issuecomment-679216847&data=02%7C01%7C%7C59371dab68e54589831108d848471d48%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637338817548955203&sdata=mIZZL8cI%2B6IdsuFWkvJJQuynGoIW2usypqc9WoGSz20%3D&reserved=0>, or unsubscribe<https://eur05.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P5WQY7DMTH6UYNTKJDSCKFJTANCNFSM4QJM4RWA&data=02%7C01%7C%7C59371dab68e54589831108d848471d48%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637338817548955203&sdata=ebEbEQxlofPiVfiBdP3hoRadyth1YgWQNTnkKIOjXdM%3D&reserved=0>.
|
Whoops, read that too quickly. I see. |
I think I did some crazy thing with my BGEN writer (which uses QCTOOL under the covers) and Excel. However, you can shortcut that (at least to +- 1) with this table in Wikipedia https://en.wikipedia.org/wiki/Floating-point_arithmetic#Internal_representation and its "bit precision" column. |
(This is an aside that refers to the PLINK Bed Reader)
https://github.com/fastlmm/bed-reader/blob/master/bed_reader/CPlinkBedFile.cpp e.g.
|
As Carl mentioned, it is possible. His API already does that. |
As Carl hinted to, bgen file types uses a variable number of bits to encode probability: from 1 bits to 32 bits. BGEN format specify how to convert that sequence of bits into a real number. It does not specify the floating point format but I take it as being double-precision:
And I take the suggestion given by their specification (https://www.well.ox.ac.uk/~gav/bgen_format/spec/latest.html) to renormalize the resulting number to sum to one:
I don't like that part of bgen because it is not precise enough: someone can create another bgen reader that will give slightly different probability numbers. |
Thanks @CarlKCarlK. Very handy reference.
I'm a little confused on that one @horta -- don't all readers have to create one of the genotype probabilities as 1 - the sum of the other probabilities since one of them is always left out of the file? Or do you mean that another reader might choose to decode the ints in the file as float32 instead of say float64 before doing the subtraction? |
Oh and @horta, do you think skipping reads of phasing, ploidy, and missing fields is still worth adding (the probability precision discussion aside)? |
The bgen specification does not tell what a floating-point to use (I guess they had in mind binary64 of IEEE 754 standard, which is the common one when we use double in C) nor the order of the arithmetic operations on the floating-point. And at the end it says
Should we normalize it or not? If so, still the arithmetic operations order is important to guarantee determinism. |
Yes, I will work on that tonight. Should be quite easy to implement. And then finally start using cbgen on bgen-reader. |
Eric writes:
* don't all readers have to create one of the genotype probabilities as 1 - the sum of the other probabilities since one of them is always left out of the file?
Eric, my understanding is that surprisingly, they don’t leave that one probability out of the file!
Danilo says they include them all (which for phased sugarcane could be 100’s). They expect them to add up to about 1.0, but it won’t be exact because of precision limits and rounding. So at the end, there is a normalization step to make the distribution “more right”.
* Carl
|
cbgen now accepts probability precision (either 32 or 64 bits) and is able to read probability only if desired: https://cbgen.readthedocs.io/en/stable/bgen_file.html#cbgen.bgen_file.read_probability |
Hm is this perhaps true for phased probabilities but not unphased? I'm looking at this part of the spec:
Gotcha, I can see how that makes sense after the decoding now. Though I'm still thinking of this as being implicit in the 1 - other probabilities for unphased calls but as a "divide by the sum" operation for phased calls. Let me know if that's wrong.
😁 thanks! |
Eric,
Whoops, I mis-remembered. Thanks for checking the spec.
* Carl
From: Eric Czech <[email protected]>
Sent: Thursday, September 03, 2020 11:16 AM
To: limix/bgen-reader-py <[email protected]>
Cc: Carl Kadie <[email protected]>; Mention <[email protected]>
Subject: Re: [limix/bgen-reader-py] More options for control of behavior in read_genotype (#33)
Eric, my understanding is that surprisingly, they don’t leave that one probability out of the file!
Hm is this perhaps true for phased probabilities but not unphased? I'm looking at this part of the spec:
For unphased data ... Probabilities are stored in colex order of these vectors. The last probability (corresponding the the K-th allele homozygotes) is not stored
So at the end, there is a normalization step to make the distribution “more right”.
Gotcha, I can see how that makes sense after the decoding now. Though I'm still thinking of this as being implicit in the 1 - other probabilities for unphased calls but as a "divide by the sum" operation for phased calls. Let me know if that's wrong.
cbgen now accepts probability precision (either 32 or 64 bits) and is able to read probability only if desired
😁 thanks!
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Flimix%2Fbgen-reader-py%2Fissues%2F33%23issuecomment-686664420&data=02%7C01%7C%7C3fb2ec2a24ff45b3a90608d850356ca5%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637347537662213895&sdata=fnJYEkdui0rcYPJw6IrAEFz3cnj6V4tGMtM%2FKNqQ7So%3D&reserved=0>, or unsubscribe<https://nam12.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fnotifications%2Funsubscribe-auth%2FABR65P6HIDXKAMSAS3TZPTLSD7MOLANCNFSM4QJM4RWA&data=02%7C01%7C%7C3fb2ec2a24ff45b3a90608d850356ca5%7C84df9e7fe9f640afb435aaaaaaaaaaaa%7C1%7C0%7C637347537662223888&sdata=MBHcvcpv8EYEP2G0Ty7ZZzxX67xexfuI50La7wb9A0I%3D&reserved=0>.
|
Given that it seems like the intention with bgen in practice is to often use less than 20 bits for storage of discretized probabilities, per the paper, do you think it would be reasonable to use 32-bit floats instead of 64 for the numpy arrays those probabilities are read into @horta?
I'm looking at this and wondering what it takes to make that dtype a parameter, or if it shouldn't just be float32 all the time:
bgen-reader-py/bgen_reader/_bgen_file.py
Lines 70 to 71 in eceda00
I'm also wondering if making it possible to skip reading some of these other fields would speed things up appreciably:
bgen-reader-py/bgen_reader/_bgen_file.py
Lines 73 to 79 in eceda00
The text was updated successfully, but these errors were encountered: