This utility aggregates one or more LeRobot datasets from the Hugging Face Hub using explicit local roots, then optionally uploads the result back to the Hub.
## Why this works well

- Pre-downloads data and videos locally → robust and fast
- Uses LeRobot's `aggregate_datasets` with explicit local roots → correct
- Writes to an ABSOLUTE output path → avoids ffconcat path quirks
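The absolute-path point is easy to reproduce in a couple of lines; a minimal sketch (the helper name `resolve_out` is hypothetical, not taken from the script):

```python
from pathlib import Path

def resolve_out(out: str) -> str:
    """Return an absolute version of the output path.

    ffconcat playlists embed file paths; resolving the output directory
    up front keeps those paths valid regardless of the process cwd.
    """
    return str(Path(out).expanduser().resolve())
```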
## Dependencies (CPU-only example)

- `pip install datasets huggingface_hub pandas pyarrow numpy pillow tqdm av`
- `pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu`
- LeRobot: either `pip install lerobot` (if available) or `pip install "git+https://github.com/huggingface/lerobot"`
## Usage

- Aggregate two repos and upload automatically (requires HF login/token):
  `python aggregate_hf_datasets.py yourname/repoA yourname/repoB`
- Aggregate N repos in order, with a custom output path and repo id:
  `python aggregate_hf_datasets.py user/repoA user/repoB user/repoC --out ./aggregated_demo --aggr-repo-id yourname/multi-merge-demo`
- Skip the upload:
  `python aggregate_hf_datasets.py user/repoA user/repoB --no-upload`
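The flags above imply a CLI along these lines; a hedged sketch of the argument parser (argument names are taken from the examples, defaults are assumptions):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Positional repos are aggregated in the order given; this order also
    # drives the unified task_index mapping.
    p = argparse.ArgumentParser(
        description="Aggregate LeRobot datasets from the Hugging Face Hub"
    )
    p.add_argument("repo_ids", nargs="+",
                   help="Hub repo ids, e.g. user/repoA user/repoB")
    p.add_argument("--out", default="./aggregated_demo",  # assumed default
                   help="local output root for the aggregated dataset")
    p.add_argument("--aggr-repo-id", default=None,
                   help="target Hub repo id; defaults to "
                        "<username>/multi-merge-<timestamp> when logged in")
    p.add_argument("--no-upload", action="store_true",
                   help="skip pushing the result to the Hub")
    return p
```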
## Notes

- Task order (the `task_index` mapping) follows the order of repos passed on the command line.
- If you're logged in to the Hub, `--aggr-repo-id` is optional and defaults to `<username>/multi-merge-<timestamp>`.
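The order-dependence of `task_index` can be illustrated with a small order-preserving merge (a sketch of the idea; the script's actual internals may differ):

```python
def merge_task_indices(task_lists: list[list[str]]) -> dict[str, int]:
    """Build a unified task -> task_index mapping.

    Tasks are numbered in first-seen order, walking the per-repo task
    lists in the same order the repos were passed on the command line.
    """
    mapping: dict[str, int] = {}
    for tasks in task_lists:
        for task in tasks:
            if task not in mapping:
                mapping[task] = len(mapping)
    return mapping
```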
## Files produced

- `meta/info.json` (summary)
- `meta/tasks.parquet` (unified task mapping)
- `meta/episodes/chunk-000/file-000.parquet` (episode-level metadata)
- `data/chunk-XXX/file-XXX.parquet` (frames)
- `videos/<camera>/chunk-XXX/file-XXX.mp4` (video files)
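To sanity-check an aggregated output, you can verify that the fixed parts of this layout exist (a sketch; `check_layout` is a hypothetical helper, not part of the script, and it only covers the paths that don't depend on chunk count):

```python
from pathlib import Path

def check_layout(root: str) -> list[str]:
    """Return the fixed metadata files missing under an aggregated root."""
    expected = [
        "meta/info.json",
        "meta/tasks.parquet",
        "meta/episodes/chunk-000/file-000.parquet",
    ]
    base = Path(root)
    return [rel for rel in expected if not (base / rel).is_file()]
```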