dgenomes: subdataset for bioinformatics #38

Open
wants to merge 1 commit into
base: master


@apraga apraga commented Jan 8, 2025

Hi,

Context

For bioinformatics, there is a pressing need for a centralized resource for the databases required by the different pipelines. The data is already available on public sites (FTP, AWS...) but without a central hub.
Datalad could fill this gap.

This PR is useful for human germline analysis and offers major publicly available databases: human genomes and clinical databases for variant annotation.

Technical details

I've chosen to simply mirror existing resources without renaming them, for the sake of simplicity. Also, several files are archives and must be decompressed.

To try to follow YODA principles, this is a metadataset, with a subdataset for each database. Each of these subdatasets has one branch per major genome version, to be able to switch between them. At the moment, only the current genome version (GRCh38) is implemented, but I plan to add T2T support as a separate branch.
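
As a rough illustration of the branch-per-genome layout, switching every subdataset to the same genome version could look like the sketch below. The subdataset paths and the branch names (`GRCh38`, `T2T`) are assumptions for the example, not necessarily what this PR uses, and the helper only builds the `git` commands without running them.

```python
def checkout_commands(subdataset_paths, genome_branch):
    """Build one `git checkout` command per subdataset (nothing is executed).

    subdataset_paths: local paths of the installed subdatasets (hypothetical).
    genome_branch: branch named after the genome version (assumed naming).
    """
    return [
        ["git", "-C", path, "checkout", genome_branch]
        for path in subdataset_paths
    ]


if __name__ == "__main__":
    # Hypothetical layout: one subdataset per database under the metadataset.
    for cmd in checkout_commands(["dgenomes/genome", "dgenomes/annotation"], "GRCh38"):
        print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the subdatasets are installed locally.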

Possible improvements

  • Database versions are currently stored in git commit messages. A custom extractor that reads the version from the data itself or from the URL would be better.
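
A minimal sketch of what extracting the version from the URL could look like. The patterns and the `example.org` URLs are made up for illustration; the real databases each have their own naming scheme, so the pattern list would need adapting per resource.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns, tried in order; adjust to each database's naming.
_VERSION_PATTERNS = [
    re.compile(r"release[-_](\d+)"),                    # .../release-113/...
    re.compile(r"(?<!\d)(\d{4}-?\d{2}-?\d{2})(?!\d)"),  # dated file names
    re.compile(r"[._-]v(\d+(?:\.\d+)*)"),               # name.v4.1.tsv.gz
]


def version_from_url(url):
    """Return the first version-like token found in the URL path, or None."""
    path = urlparse(url).path
    for pattern in _VERSION_PATTERNS:
        match = pattern.search(path)
        if match:
            return match.group(1)
    return None


if __name__ == "__main__":
    print(version_from_url("https://example.org/pub/release-113/genome.fa.gz"))
```

Returning `None` when nothing matches leaves room to fall back to the data itself (for example a header line in the file) instead of guessing.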

This is a draft PR to check whether it is suitable for integration into the main DataLad datasets. I will test it in production if the content and layout fit the project :)

Thanks,

@apraga apraga marked this pull request as draft January 8, 2025 23:01
@apraga apraga marked this pull request as ready for review January 14, 2025 22:23
Genome and annotation for germline pipelines
@yarikoptic
Member

Hi @apraga, thanks for the prospective contribution!

I would wholeheartedly support including your "superdataset" in the datasets.datalad.org "distribution". I guess we would need to add a rule for it to get auto-updated each time I become brave enough to update them ;) The ad-hoc script for that is https://github.com/datalad/datasets.datalad.org/blob/master/.datalad/utils/cron_update#L149 , so I guess the desired behavior here is just `datalad_update_ff_r dgenomes -R1;;` so that we would just follow what you have and carry only the leading superdataset... or we could mirror the entire hierarchy (I didn't look yet into how heavy it is) -- what would you prefer?

@yarikoptic
Member

NB if I fail to reply promptly, feel welcome to ping me ...
