dataset-scrapers

This repository contains scripts for downloading, analyzing, and enriching dataset profile collections. It provides the basis for an experimental evaluation of novel research ideas in the field of metadata-driven dataset search (also known as dataset search over decentralized data repositories). The scope of this repository includes dataset collections that:

are publicly available
provide metadata in a standardized format (e.g., Croissant)
make the raw data available for download so that we can enrich dataset profiles with additional information (such as synopses)
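
To make the enrichment idea in the list above concrete, here is a minimal sketch of attaching per-column synopses to a Croissant-style metadata record. It is not the code in this repository: the trimmed JSON-LD document, the inline CSV, the helper names column_synopsis and enrich_profile, and the "synopses" key are all assumptions made for illustration, and the synopsis itself (counts plus a numeric range) is just one simple choice.

```python
import csv
import io
import json

# Hypothetical, heavily trimmed Croissant-style (JSON-LD) metadata record.
# A real Croissant file also carries an @context and record set definitions.
CROISSANT_JSON = """
{
  "@type": "sc:Dataset",
  "name": "example-dataset",
  "distribution": [
    {"@type": "cr:FileObject", "name": "data.csv",
     "contentUrl": "data.csv", "encodingFormat": "text/csv"}
  ]
}
"""

# Stand-in for the raw data that would normally be downloaded.
RAW_CSV = "id,price\n1,9.5\n2,12.0\n3,7.25\n"


def column_synopsis(values):
    """Summarize one column: value count, distinct count, numeric range if any."""
    synopsis = {"count": len(values), "distinct": len(set(values))}
    try:
        numbers = [float(v) for v in values]
    except ValueError:
        return synopsis  # non-numeric column: keep only the counts
    if numbers:
        synopsis["min"] = min(numbers)
        synopsis["max"] = max(numbers)
    return synopsis


def enrich_profile(metadata, raw_csv):
    """Return a copy of the metadata with a per-column synopsis attached."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    columns = list(rows[0].keys()) if rows else []
    profile = dict(metadata)
    # "synopses" is a made-up key for this sketch, not a Croissant field.
    profile["synopses"] = {
        col: column_synopsis([row[col] for row in rows]) for col in columns
    }
    return profile


profile = enrich_profile(json.loads(CROISSANT_JSON), RAW_CSV)
print(json.dumps(profile["synopses"], indent=2))
```

The sketch only needs the standard library, which keeps it independent of whichever Croissant tooling the actual scripts use.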

The scripts for each dataset collection are in their own directory, and shared utilities are located in dataset_scrapers/. More details on our profile enrichment and scraping statistics are located in docs/.

Setup

All scripts are written in Python with the dependencies specified in pyproject.toml. We recommend using uv to install and manage the project dependencies. To set up a new virtual environment, clone the repository and run uv sync. After that, the virtual environment is located in .venv/ and can be activated via .venv/bin/activate.
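
As a quick sanity check after setup, the following sketch reports whether the current Python interpreter is running inside a virtual environment. It is not part of this repository; the function name in_virtualenv is hypothetical, and the check relies only on the standard sys module.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

If the first line reports False, the scripts are likely running against the system interpreter rather than the environment created by uv sync.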

Instructions for reproducing a dataset collection are located in docs/.

Dataset Collections

Available

Kaggle

Roadmap

OpenML
