Add create_genotype_dosage_dataset #38

Merged 1 commit on Jul 20, 2020

Conversation

tomwhite
Collaborator

A dosage representation is needed for sgkit-dev/sgkit-bgen#1 - this is an example to help discussion in #21

@alimanfoo
Collaborator

Just wondering, do we need separate concepts of "genotype call dataset" and "genotype dosage dataset", or would it be more natural to have a single concept (e.g., "variation dataset") which could contain either a genotype call array or a genotype dosage array (or a genotype likelihoods array or a genotype probabilities array)?

@tomwhite
Collaborator Author

The reason I added a separate function was so it didn't have to check that one of the required arrays was supplied. Could unify into one though.
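To make the trade-off concrete, here is a minimal sketch of the check a unified function would need. The name `create_variation_dataset`, its keyword arguments, and the plain-dict return value are hypothetical and chosen only for illustration; they are not part of this PR or of sgkit's API.

```python
from typing import Any, Dict, Optional, Sequence


def create_variation_dataset(
    *,
    call_genotype: Optional[Sequence[Any]] = None,
    call_dosage: Optional[Sequence[Any]] = None,
    call_genotype_probability: Optional[Sequence[Any]] = None,
) -> Dict[str, Sequence[Any]]:
    """Hypothetical unified constructor (illustration only, not sgkit API)."""
    candidates = {
        "call_genotype": call_genotype,
        "call_dosage": call_dosage,
        "call_genotype_probability": call_genotype_probability,
    }
    # Keep only the arrays the caller actually passed.
    supplied = {name: arr for name, arr in candidates.items() if arr is not None}

    # This is the check that a dedicated per-variable function avoids.
    if not supplied:
        raise ValueError(
            "at least one of " + ", ".join(candidates) + " must be supplied"
        )

    # All supplied arrays must agree on the number of variants (first dimension).
    variant_counts = {len(arr) for arr in supplied.values()}
    if len(variant_counts) > 1:
        raise ValueError("supplied arrays disagree on the number of variants")

    return supplied
```

With separate functions, each array is a required argument, so neither check on optional inputs is needed.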

@eric-czech
Collaborator

On the call today I think I was trying to communicate the same thing as @alimanfoo's "variation dataset" idea, where this could be a common point of convergence for many workflows:

create_genotype_dataset(
  ...,
  # From VCF.GP or bgen 
  call_genotype_probability: Optional[float[VARIANTS, SAMPLES, GENOTYPES]],
  # Genotypes derived from probabilities/dosages or provided directly 
  call_genotype: Optional[int[VARIANTS, SAMPLES, PLOIDY]],
  # Dosages derived from probabilities, which would come from imputation or sequencing
  call_dosage: Optional[float[VARIANTS, SAMPLES]]
)

I'm not aware of many workflows that make use of genotype probabilities other than using them to create hard calls or dosages, but this would group together everything I've ever seen get used as a starting point for GWAS workflows.
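As a rough illustration of how dosages and hard calls can be derived from genotype probabilities, the sketch below assumes diploid, biallelic variants with probabilities ordered as (hom-ref, het, hom-alt). The function names, the 0.9 threshold, and the use of -1 for missing alleles are assumptions made for this example, not code from the PR.

```python
from typing import List

# Allele pairs for the three genotypes, in the assumed order
# (hom-ref, het, hom-alt). Diploid, biallelic variants only.
GENOTYPE_ALLELES = [[0, 0], [0, 1], [1, 1]]


def dosage_from_probabilities(gp: List[List[List[float]]]) -> List[List[float]]:
    """Expected alternate-allele count: P(het) + 2 * P(hom-alt).

    gp has shape [VARIANTS][SAMPLES][GENOTYPES]; the result has
    shape [VARIANTS][SAMPLES].
    """
    return [[p[1] + 2.0 * p[2] for p in row] for row in gp]


def hard_calls_from_probabilities(
    gp: List[List[List[float]]], threshold: float = 0.9
) -> List[List[List[int]]]:
    """Pick the most probable genotype per call, or mark the call missing.

    The threshold and the -1 missing marker are illustrative assumptions.
    The result has shape [VARIANTS][SAMPLES][PLOIDY].
    """
    calls = []
    for row in gp:
        call_row = []
        for p in row:
            best = max(range(len(GENOTYPE_ALLELES)), key=lambda g: p[g])
            if p[best] >= threshold:
                call_row.append(list(GENOTYPE_ALLELES[best]))
            else:
                call_row.append([-1, -1])  # no genotype is confident enough
        calls.append(call_row)
    return calls
```

Under these assumptions, a call with probabilities (0.25, 0.5, 0.25) still yields a usable dosage of 1.0, while its hard call is marked missing because no single genotype reaches the threshold.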

If I had to pick one though, I would lean towards the single method for each genotype variable like you already did, @tomwhite, since the results would be easy to merge and it takes the ambiguity out of all the optional fields.

@tomwhite
Collaborator Author

Rebased and added a test.

> I would lean towards the single method for each genotype variable like you already did, @tomwhite, since the results would be easy to merge and it takes the ambiguity out of all the optional fields.

Let's go ahead this way; we can add a unifying function later if this approach becomes too cumbersome.

@tomwhite merged commit 9c46ac6 into sgkit-dev:master on Jul 20, 2020
@tomwhite deleted the dosages branch on July 20, 2020 15:25