dgenomes: subdataset for bioinformatics #38

apraga · 2025-01-08T23:00:57Z

Hi,

Context

For bioinformatics, there is a pressing need for a centralized resource for the databases required by the different pipelines. Existing data is already available on public websites (FTP, AWS...) but without a central hub.
Datalad could fill this gap.

This PR is useful for germline human analysis and offers major publicly available databases: human genomes and clinical databases for variant annotation.

Technical details

I chose to simply mirror existing resources without renaming them, for the sake of simplicity. Also, several files are archives and must be decompressed.

To try to follow YODA principles, this is a metadataset with a subdataset for each database. Each of these subdatasets has one branch per major genome version, so that users can switch between them. At the moment, only the current genome version (GRCh38) is implemented, but I plan to add T2T support as a separate branch.
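To illustrate how this layout might be consumed, here is a minimal sketch that only builds the commands a user would run to fetch one database for one genome version. The function name `plan_checkout`, the URL, the subdataset name `clinvar`, and the branch name `GRCh38` are assumptions for illustration; the actual names in the dataset may differ.

```python
from pathlib import PurePosixPath


def plan_checkout(superdataset_url, subdataset, genome_branch, clone_dir="dgenomes"):
    """Return the commands (as argv lists) to fetch one database for one genome version.

    Nothing is executed here; each list can be passed to subprocess.run().
    All names passed in are hypothetical placeholders.
    """
    sub_path = str(PurePosixPath(clone_dir) / subdataset)
    return [
        # Clone the top-level dataset; subdatasets are registered but not yet installed.
        ["datalad", "clone", superdataset_url, clone_dir],
        # Install the subdataset without downloading file content (-n is --no-data).
        ["datalad", "get", "-n", "-d", clone_dir, sub_path],
        # Switch the subdataset to the branch of the wanted genome version.
        ["git", "-C", sub_path, "checkout", genome_branch],
        # Download the file content of that branch.
        ["datalad", "get", "-d", sub_path, sub_path],
    ]


if __name__ == "__main__":
    for argv in plan_checkout("https://example.org/dgenomes", "clinvar", "GRCh38"):
        print(" ".join(argv))
```

Keeping one branch per genome version means that switching versions is a plain `git checkout` inside the subdataset, followed by a `datalad get` for the files of that branch.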

Possible improvements

Database versions are stored in git commit messages. Using a custom extractor that reads the version from the data itself or from the URL would be better.
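As a rough idea of what the URL-based variant could look like, here is a small sketch of the parsing step only. It is not a DataLad metadata extractor, and the patterns (a genome build such as `GRCh38`, an 8-digit release date) and the example URL are assumptions; real database URLs may encode versions differently.

```python
import re
from urllib.parse import urlparse

# Hypothetical patterns, chosen for illustration only.
BUILD_RE = re.compile(r"(GRCh\d+|hg\d+)", re.IGNORECASE)
DATE_RE = re.compile(r"(?<!\d)(\d{8})(?!\d)")  # exactly 8 consecutive digits


def extract_version(url):
    """Guess the genome build and release date from the path part of a URL.

    Returns a dict whose values are None when nothing matches.
    """
    path = urlparse(url).path
    build = BUILD_RE.search(path)
    date = DATE_RE.search(path)
    return {
        "genome_build": build.group(1) if build else None,
        "release": date.group(1) if date else None,
    }


if __name__ == "__main__":
    # Made-up URL, not a real resource.
    print(extract_version("https://example.org/pub/vcf_GRCh38/annotations_20250106.vcf.gz"))
```

A result like this could be written into the commit message automatically or stored as metadata, instead of being typed by hand.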

This is a draft PR to check whether it suits integration into the main datalad datasets. I will test it in production if the content and layout fit the project :)

Thanks,

Genome and annotation for germline pipelines

yarikoptic · 2025-01-16T02:36:01Z

Hi @apraga, thanks for the prospective contribution!

I would wholeheartedly support inclusion of your "superdataset" into the datasets.datalad.org "distribution". I guess we would need to add a rule for it to get auto-updated each time I become brave enough to update them ;) The ad-hoc script for that is https://github.com/datalad/datasets.datalad.org/blob/master/.datalad/utils/cron_update#L149 , so I guess the desired behavior here is just `datalad_update_ff_r dgenomes -R1;;`, so that we would just follow what you have and carry only the leading superdataset... or we could mirror the entire hierarchy (I didn't look yet into how heavy it is). What would you prefer?

yarikoptic · 2025-01-16T02:36:34Z

NB if I fail to reply promptly, feel welcome to ping me ...

apraga marked this pull request as draft January 8, 2025 23:01

apraga marked this pull request as ready for review January 14, 2025 22:23

apraga force-pushed the dgenomes branch from 3c9a8e9 to d1bf4cc on January 14, 2025 22:27

dgenomes: subdataset for human bionformatics

7029587

Genome and annotation for germline pipelines

apraga force-pushed the dgenomes branch from d1bf4cc to 7029587 on January 14, 2025 22:28
