I recently downloaded The Stack (the-stack-dedup) from Hugging Face via Git LFS. I have two questions that I need help with:

1. The size on disk of the dedup dataset is only around 900GB, much smaller than the 1.5TB indicated on the data card (https://huggingface.co/datasets/bigcode/admin/resolve/main/the-stack-infographic-v11.png).
2. Is there somewhere where the file counts are listed in full for each dataset by language (dedup and full)?

Essentially I am looking to make sure that I have downloaded the entirety of the dataset, so I either need to understand the size difference, or know how many files there should be for each language so I can validate my download. Ideally both.

Thanks in advance!
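For anyone wanting to run the same check on a local copy, a minimal sketch of a per-language file-count and size report follows. The root path and the `data/<language>/*.parquet` layout are assumptions about where a Git LFS clone ends up; adjust them to the actual checkout.

```python
from pathlib import Path

# Hypothetical location of the local clone; adjust to the actual checkout path.
ROOT = Path("the-stack-dedup/data")

# Assumed layout: one data/<language>/ directory of parquet files per language.
# Print file count and on-disk bytes so they can be compared against any
# published per-language listing.
for lang_dir in sorted(p for p in ROOT.iterdir() if p.is_dir()):
    files = list(lang_dir.glob("*.parquet"))
    size_gb = sum(f.stat().st_size for f in files) / 1e9
    print(f"{lang_dir.name:25s} {len(files):6d} files  {size_gb:9.2f} GB")
```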
The dataset was compressed with parquet + snappy when uploaded to the Hub. Here is a before-and-after deduplication comparison in terms of physical size (without compression) and number of files: bquxjob_25a65048_188085aa72f.csv
The last line is the total change; here is a screenshot for quick reference:
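The compression mentioned above can also be checked directly from the parquet footers: pyarrow exposes per-column-chunk compressed and uncompressed byte counts. A rough sketch, assuming the same `data/<language>/*.parquet` layout and a hypothetical path, shows how much of the ~900GB-vs-1.5TB gap snappy accounts for:

```python
from pathlib import Path

import pyarrow.parquet as pq

# Hypothetical path to one language's shard directory in the local clone.
LANG_DIR = Path("the-stack-dedup/data/python")

compressed = uncompressed = 0
for path in LANG_DIR.glob("*.parquet"):
    meta = pq.ParquetFile(path).metadata
    # Sum the per-column-chunk sizes recorded in each file's footer.
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size

print(f"on disk (snappy-compressed): {compressed / 1e9:8.2f} GB")
print(f"logical (uncompressed):      {uncompressed / 1e9:8.2f} GB")
```

Summed over every language directory, the uncompressed total should sit near the size quoted on the data card, while the compressed total should be close to what `du` reports for the clone.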