abhishekloiwal/aggregate-hf-datasets

Aggregate HF Datasets (LeRobot)

This utility aggregates one or more LeRobot datasets from the Hugging Face Hub using explicit local roots, then optionally uploads the result back to the Hub.

Why this works well

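  • Pre-downloads data and videos locally, so aggregation reads from disk rather than from the network → robust and fast
  • Passes explicit local roots to LeRobot’s aggregate_datasets, so each source is read from a known directory → correct
  • Writes to an ABSOLUTE output path → avoids ffconcat path quirks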
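
The three points above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the helper names (plan_local_roots, download_sources, aggregate) are hypothetical, and the import path and keyword names used for aggregate_datasets are assumptions that may differ between LeRobot versions, so check them against your installed copy.

```python
from pathlib import Path


def plan_local_roots(repo_ids, work_dir):
    """Map each repo id to an absolute local directory under work_dir."""
    base = Path(work_dir).resolve()
    # "user/repoA" -> <work_dir>/user__repoA, keeping command-line order.
    return {repo_id: base / repo_id.replace("/", "__") for repo_id in repo_ids}


def download_sources(roots):
    """Pre-download every source dataset (data and videos) to its local root."""
    from huggingface_hub import snapshot_download  # imported lazily

    for repo_id, root in roots.items():
        snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=str(root))


def aggregate(roots, aggr_repo_id, out_dir):
    """Aggregate the local copies into one dataset at an absolute output path."""
    # ASSUMPTION: module path and keyword names; verify for your LeRobot version.
    from lerobot.datasets.aggregate import aggregate_datasets

    aggregate_datasets(
        repo_ids=list(roots),
        aggr_repo_id=aggr_repo_id,
        roots=list(roots.values()),
        aggr_root=Path(out_dir).resolve(),  # absolute, per the note on ffconcat paths
    )
```

Calling plan_local_roots first keeps the directory layout decision separate from the network and LeRobot calls, which makes it easy to inspect the planned roots before anything is downloaded.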

Dependencies (CPU-only example)

  • pip install datasets huggingface_hub pandas pyarrow numpy pillow tqdm av
  • pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu
  • LeRobot: either pip install lerobot (if available) or pip install "git+https://github.com/huggingface/lerobot"

Usage

  • Aggregate two repos and upload automatically (requires HF login/token):
      python aggregate_hf_datasets.py \
        yourname/repoA \
        yourname/repoB

  • Aggregate N repos in order, custom output and repo id:
      python aggregate_hf_datasets.py \
        user/repoA user/repoB user/repoC \
        --out ./aggregated_demo \
        --aggr-repo-id yourname/multi-merge-demo

  • Skip upload:
      python aggregate_hf_datasets.py user/repoA user/repoB --no-upload
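
For readers who want to see how these arguments fit together, here is a hedged argparse sketch of a command line with the same shape. The flag names come from the usage examples above; the defaults, help strings, and the build_parser name are assumptions and may not match the real script.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented command-line shape (a sketch)."""
    parser = argparse.ArgumentParser(prog="aggregate_hf_datasets.py")
    # One or more Hub repo ids; their order is preserved.
    parser.add_argument("repos", nargs="+", help="source dataset repo ids, in order")
    parser.add_argument("--out", default=None, help="output directory (assumed default: None)")
    parser.add_argument("--aggr-repo-id", default=None, help="target repo id on the Hub")
    parser.add_argument("--no-upload", action="store_true", help="skip the upload step")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.repos, args.out, args.aggr_repo_id, args.no_upload)
```

Because upload is the default behaviour, the sketch models it as an opt-out store_true flag rather than an opt-in one.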

Notes

  • Task order (task_index mapping) follows the order of repos passed on the command line.
  • If you’re logged in to the Hub, --aggr-repo-id is optional and will default to <username>/multi-merge-<timestamp>.
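
Both notes can be illustrated with a small sketch. Everything here is an assumption made for illustration: the function names are hypothetical, the timestamp format is a guess (the README only says <timestamp>), and the rule that a task seen in an earlier repo keeps its first index is one plausible reading of "follows the order of repos".

```python
from datetime import datetime, timezone


def default_aggr_repo_id(username, now=None):
    """Build <username>/multi-merge-<timestamp>; the timestamp format is assumed."""
    now = now or datetime.now(timezone.utc)
    return f"{username}/multi-merge-{now.strftime('%Y%m%d-%H%M%S')}"


def unified_task_index(tasks_per_repo):
    """Assign task_index values by first appearance, walking repos in CLI order.

    tasks_per_repo is a list of task-name lists, one list per repo.
    """
    mapping = {}
    for tasks in tasks_per_repo:
        for task in tasks:
            # A task already seen in an earlier repo keeps its original index.
            mapping.setdefault(task, len(mapping))
    return mapping


def current_username():
    """Look up the logged-in Hub user; requires a valid login or token."""
    from huggingface_hub import whoami  # imported lazily

    return whoami()["name"]
```

With this reading, swapping the order of two repos on the command line changes which tasks receive the lowest indices, so the argument order is worth keeping stable between runs.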

Files produced

  • meta/info.json (summary)
  • meta/tasks.parquet (unified task mapping)
  • meta/episodes/chunk-000/file-000.parquet (episode-level metadata)
  • data/chunk-XXX/file-XXX.parquet (frames)
  • videos/<camera>/chunk-XXX/file-XXX.mp4 (video files)
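
A quick way to check that an output directory matches this layout is to walk it with pathlib. The sketch below is an assumption-laden helper (summarize_output is a hypothetical name, not part of the script); it only checks file names and lists the top-level keys of meta/info.json without assuming what those keys are.

```python
import json
from pathlib import Path


def summarize_output(out_dir):
    """Report which of the expected files exist under an aggregated dataset."""
    root = Path(out_dir).resolve()

    def rel(pattern):
        # Paths relative to the dataset root, sorted for stable output.
        return sorted(p.relative_to(root).as_posix() for p in root.glob(pattern))

    info_path = root / "meta" / "info.json"
    info = json.loads(info_path.read_text()) if info_path.exists() else None
    return {
        "has_info": info is not None,
        "info_keys": sorted(info) if info else [],
        "has_tasks": (root / "meta" / "tasks.parquet").exists(),
        "episode_files": rel("meta/episodes/chunk-*/file-*.parquet"),
        "data_files": rel("data/chunk-*/file-*.parquet"),
        "video_files": rel("videos/*/chunk-*/file-*.mp4"),
    }


if __name__ == "__main__":
    # "./aggregated_demo" matches the --out value used in the usage example.
    for key, value in summarize_output("./aggregated_demo").items():
        print(f"{key}: {value}")
```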

