Add create_genotype_dosage_dataset #38

tomwhite · 2020-07-13T09:44:11Z

A dosage representation is needed for sgkit-dev/sgkit-bgen#1. This PR is an example to help the discussion in #21.

alimanfoo · 2020-07-15T16:37:38Z

Just wondering, do we need separate concepts of "genotype call dataset" and "genotype dosage dataset", or would it be more natural to have a single concept (e.g., "variation dataset") which could contain either a genotype call array or a genotype dosage array (or a genotype likelihoods array or a genotype probabilities array)?

tomwhite · 2020-07-16T13:26:43Z

The reason I added a separate function was so it didn't have to check that one of the required arrays was supplied. Could unify into one though.
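To make the trade-off concrete, here is a minimal sketch of the check a unified constructor would need. The name `create_variation_dataset` and its keyword arguments are hypothetical, chosen only for illustration, and the function returns a plain dict rather than a real dataset.

```python
from typing import Any, Dict, Optional


def create_variation_dataset(
    *,
    call_genotype: Optional[Any] = None,
    call_dosage: Optional[Any] = None,
) -> Dict[str, Any]:
    """Hypothetical unified constructor; not an sgkit function."""
    # Keep only the arrays the caller actually supplied.
    supplied = {
        name: array
        for name, array in (
            ("call_genotype", call_genotype),
            ("call_dosage", call_dosage),
        )
        if array is not None
    }
    # This is the check that a dedicated per-variable function avoids.
    if not supplied:
        raise ValueError(
            "At least one of call_genotype or call_dosage must be supplied"
        )
    return supplied
```

With one function per variable, the array is simply a required argument, so no such guard is needed.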

eric-czech · 2020-07-16T16:48:13Z

On the call today I think I was trying to communicate the same thing as @alimanfoo's "variation dataset" idea, where this could be a common point of convergence for many workflows:

create_genotype_dataset(
  ...,
  # From VCF.GP or bgen 
  call_genotype_probability: Optional[float[VARIANTS, SAMPLES, GENOTYPES]],
  # Genotypes derived from probabilities/dosages or provided directly 
  call_genotype: Optional[int[VARIANTS, SAMPLES, PLOIDY]],
  # Dosages derived from probabilities, which would come from imputation or sequencing
  call_dosage: Optional[float[VARIANTS, SAMPLES]]
)

I'm not aware of many workflows that make use of genotype probabilities other than using them to create hard calls or dosages, but this would group together everything I've ever seen get used as a starting point for GWAS workflows.
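As a rough illustration of those two derivations, the sketch below assumes diploid, biallelic variants with the last axis ordered as (hom-ref, het, hom-alt) probabilities. That ordering and the helper names are assumptions for this example, not something defined in this PR.

```python
import numpy as np


def probabilities_to_dosage(gp: np.ndarray) -> np.ndarray:
    """Expected alternate-allele count: P(het) + 2 * P(hom-alt)."""
    return gp[..., 1] + 2.0 * gp[..., 2]


def probabilities_to_hard_calls(gp: np.ndarray) -> np.ndarray:
    """Index of the most probable genotype (0, 1 or 2) per call."""
    return gp.argmax(axis=-1)


# Example probabilities with shape (variants=2, samples=2, genotypes=3).
gp = np.array(
    [
        [[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]],
        [[0.1, 0.8, 0.1], [1.0, 0.0, 0.0]],
    ]
)
dosage = probabilities_to_dosage(gp)  # shape (variants, samples)
calls = probabilities_to_hard_calls(gp)  # shape (variants, samples)
```

Note that the hard calls here are genotype indices per sample, not the per-allele `[VARIANTS, SAMPLES, PLOIDY]` layout shown in the signature above; converting between the two would be a separate step.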

If I had to pick one though, I would lean towards a single method for each genotype variable, as you already did @tomwhite, since the results would be easy to merge and it takes the ambiguity out of all the optional fields.

tomwhite · 2020-07-20T14:23:07Z

Rebased and added a test.

> I would lean towards a single method for each genotype variable, as you already did @tomwhite, since the results would be easy to merge and it takes the ambiguity out of all the optional fields.

Let's go ahead this way. We can add a unifying function later if this approach becomes too cumbersome.

tomwhite mentioned this pull request Jul 13, 2020

Genotype call array to dosage #21

Open

eric-czech approved these changes Jul 16, 2020

Add create_genotype_dosage_dataset

acc76f6

tomwhite force-pushed the dosages branch from 9b9cd2a to acc76f6 on July 20, 2020 14:21

tomwhite merged commit 9c46ac6 into sgkit-dev:master Jul 20, 2020

tomwhite deleted the dosages branch July 20, 2020 15:25
