This repository contains scripts for downloading, analyzing, and enriching dataset profile collections. It provides the basis for an experimental evaluation of novel research ideas in the field of metadata-driven dataset search (also known as dataset search over decentralized data repositories). The scope of this repository includes dataset collections that:
- are publicly available
- provide metadata in a standardized format (e.g., Croissant)
- make the raw data available for download, so that we can enrich dataset profiles with additional information (such as synopses)
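As a rough illustration of the last two criteria, the sketch below checks a Croissant-style metadata record for a standard declaration and at least one download URL. It is not code from this repository: the function names download_urls and in_scope and the example record are hypothetical, and the field names (conformsTo, distribution, contentUrl) are assumed to follow the usual Croissant JSON-LD layout. Public availability cannot be judged from the metadata alone and is not checked here.

```python
import json

# Hypothetical, hand-written Croissant-style record (not from a real collection).
EXAMPLE_PROFILE = json.loads("""
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "conformsTo": "http://mlcommons.org/croissant/1.0",
  "distribution": [
    {
      "@type": "cr:FileObject",
      "name": "data.csv",
      "contentUrl": "https://example.org/data.csv",
      "encodingFormat": "text/csv"
    }
  ]
}
""")


def download_urls(profile: dict) -> list[str]:
    """Collect the contentUrl of every distribution entry that has one."""
    distribution = profile.get("distribution", [])
    # JSON-LD allows a single object where a list is expected.
    if isinstance(distribution, dict):
        distribution = [distribution]
    return [
        entry["contentUrl"]
        for entry in distribution
        if isinstance(entry, dict) and entry.get("contentUrl")
    ]


def in_scope(profile: dict) -> bool:
    """Rough check: standardized metadata plus at least one downloadable file."""
    # Assumption: the record declares the standard it follows via conformsTo.
    has_standard_metadata = "croissant" in str(profile.get("conformsTo", "")).lower()
    return has_standard_metadata and bool(download_urls(profile))


print(download_urls(EXAMPLE_PROFILE))
print(in_scope(EXAMPLE_PROFILE))
```

A record without any downloadable file would fail this check, since its profile could not be enriched with information derived from the raw data.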
The scripts for each dataset collection are in their own directory, with shared utilities located in dataset_scrapers/.
More details on our profile enrichment and scraping statistics are located in docs/.
All scripts are written in Python with the dependencies specified in pyproject.toml.
We recommend using uv to install and manage the project dependencies.
To set up a new virtual environment, clone the repository and run uv sync.
After that, the virtual environment is located in .venv and can be activated with source .venv/bin/activate.
Instructions for how to reproduce a dataset collection are located in docs/.