I recently downloaded The Stack (the-stack-dedup) from Hugging Face via Git LFS. I have two questions that I need help with:

1. The size on disk of the dedup dataset is only around 900GB, much smaller than the 1.5TB indicated on the data card (https://huggingface.co/datasets/bigcode/admin/resolve/main/the-stack-infographic-v11.png).
2. Is there somewhere where the file counts are listed in full for each dataset by language (dedup and full)?

Essentially I am looking to make sure that I have downloaded the entirety of the dataset, so I either need to understand the size difference, or know how many files there should be for each language so I can validate my download. Ideally both.

Thanks in advance!
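For anyone wanting to run the same check on a local copy, a minimal sketch of a per-language file-count and size report follows. The root path and the `data/<language>/*.parquet` layout are assumptions about where a Git LFS clone ends up; adjust them to the actual checkout.

```python
from pathlib import Path

# Hypothetical location of the local clone; adjust to the actual checkout path.
ROOT = Path("the-stack-dedup/data")

# Assumed layout: one data/<language>/ directory of parquet files per language.
# Print file count and on-disk bytes so they can be compared against any
# published per-language listing.
for lang_dir in sorted(p for p in ROOT.iterdir() if p.is_dir()):
    files = list(lang_dir.glob("*.parquet"))
    size_gb = sum(f.stat().st_size for f in files) / 1e9
    print(f"{lang_dir.name:25s} {len(files):6d} files  {size_gb:9.2f} GB")
```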
The dataset was compressed with parquet + snappy when uploaded to the Hub. Here is a before-and-after deduplication comparison in terms of physical size (without compression) and number of files: bquxjob_25a65048_188085aa72f.csv
The last line is the total change; here is a screenshot for quick reference:
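The compression mentioned above can also be checked directly from the parquet footers: pyarrow exposes per-column-chunk compressed and uncompressed byte counts. A rough sketch, assuming the same `data/<language>/*.parquet` layout and a hypothetical path, shows how much of the ~900GB-vs-1.5TB gap snappy accounts for:

```python
from pathlib import Path

import pyarrow.parquet as pq

# Hypothetical path to one language's shard directory in the local clone.
LANG_DIR = Path("the-stack-dedup/data/python")

compressed = uncompressed = 0
for path in LANG_DIR.glob("*.parquet"):
    meta = pq.ParquetFile(path).metadata
    # Sum the per-column-chunk sizes recorded in each file's footer.
    for rg in range(meta.num_row_groups):
        row_group = meta.row_group(rg)
        for col in range(row_group.num_columns):
            chunk = row_group.column(col)
            compressed += chunk.total_compressed_size
            uncompressed += chunk.total_uncompressed_size

print(f"on disk (snappy-compressed): {compressed / 1e9:8.2f} GB")
print(f"logical (uncompressed):      {uncompressed / 1e9:8.2f} GB")
```

Summed over every language directory, the uncompressed total should sit near the size quoted on the data card, while the compressed total should be close to what `du` reports for the clone.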