This is the official repository for Generative AI for Math: Part I - MathPile: A Billion-Token-Scale Pretraining Corpus for Math
Homepage | Datasets | Paper | Limitations | Statement & License | Citation | Featured By AK
Please be aware that our corpus may be updated (we will announce each release). It is advisable to use the latest version.
- [2024/09/26] Our paper has been accepted by NeurIPS D&B Track 2024 🎉
- [2024/06/23] We released the data processing scripts, available in the `src` directory.
- [2024/01/06] We released the commercial-use version of MathPile, namely MathPile_Commercial.
- [2024/01/06] We released a new version (`v0.2`, a cleaner version) of MathPile. See our HF dataset homepage for more details.
- [2023/12/30] MathPile was featured on the Hugging Face Datasets trending list. [snapshot]
- [2023/12/29] We released MathPile, a 9.5B-token, high-quality and diverse math-centric pre-training corpus.
- [2023/12/28] We released the technical report of MathPile.
High-quality, large-scale corpora are the cornerstone of building powerful foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Our work differs significantly from previous work in the following characteristics:
- Math-centric: MathPile uniquely caters to the math domain, unlike general-domain corpora like Pile and RedPajama, or multilingual-focused ones like ROOTS and The Stack. While there are math-centric corpora, they are often either closed-source, like Google's Minerva and OpenAI's MathMix, or lack diversity, such as ProofPile and OpenWebMath.
- Diversity: MathPile draws from a wide range of sources: Textbooks (including lecture notes), arXiv, Wikipedia, ProofWiki, StackExchange, and Web Pages. It encompasses mathematical content suitable for K-12, college, postgraduate levels, and math competitions. This diversity is a first, especially with our release of a significant collection of high-quality textbooks (~0.19B tokens).
- High-Quality: We adhered to the principle of "less is more", firmly believing in the supremacy of data quality over quantity, even in the pre-training phase. Our meticulous data collection and processing efforts included a complex suite of preprocessing, prefiltering, cleaning, filtering, and deduplication, ensuring the high quality of our corpus.
- Data Documentation: To enhance transparency, we've extensively documented MathPile. This includes a dataset sheet (see Table 5 in our paper) and quality annotations for web-sourced documents, like language identification scores and symbol-to-word ratios. This gives users flexibility to tailor the data to their needs. We've also performed data contamination detection to eliminate duplicates from benchmark test sets like MATH and MMLU-STEM.
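As a sketch of how such per-document quality annotations can be used, the snippet below filters web-sourced documents by a language-identification score and a symbol-to-word ratio. Note that the field names (`lang_id_score`, `symbol_to_word_ratio`) and thresholds here are illustrative assumptions, not MathPile's actual schema:

```python
# Illustrative filtering of web documents by quality annotations.
# Field names and thresholds are assumptions for the sketch, not
# MathPile's released schema.

def keep_document(doc, min_lang_score=0.8, max_symbol_ratio=0.5):
    """Return True if a document passes simple quality thresholds."""
    return (doc["lang_id_score"] >= min_lang_score
            and doc["symbol_to_word_ratio"] <= max_symbol_ratio)

docs = [
    {"text": "Pythagorean theorem: a^2 + b^2 = c^2.",
     "lang_id_score": 0.95, "symbol_to_word_ratio": 0.30},
    {"text": "@@## $$ %% spam %% $$ ##@@",
     "lang_id_score": 0.40, "symbol_to_word_ratio": 0.90},
]

filtered = [d for d in docs if keep_document(d)]
print(len(filtered))  # only the first document survives
```

Because the annotations ship with the data rather than being baked into it, users can pick their own thresholds (or ignore the annotations entirely) to suit their training setup.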
We hope MathPile can help enhance the mathematical reasoning abilities of language models. See our paper for more technical details.
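To make the deduplication step of the processing pipeline above concrete, here is a minimal sketch of exact-match document deduplication via content hashing. This is a simplified assumption for illustration; the actual pipeline described in the paper is more involved (e.g., it also handles near-duplicates):

```python
# Minimal exact-deduplication sketch: hash whitespace- and
# case-normalized document text, keep the first occurrence.
# This is an illustrative simplification of a dedup step, not
# the exact procedure used to build MathPile.
import hashlib

def dedup_exact(documents):
    """Drop exact duplicates (after normalization), preserving order."""
    seen, unique = set(), []
    for text in documents:
        normalized = " ".join(text.split()).lower()
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique

corpus = ["Let x = 1.", "let  x = 1.", "Prove that 2 + 2 = 4."]
print(len(dedup_exact(corpus)))  # 2: first two collapse to one
```

Hashing a normalized form keeps memory bounded (one digest per document) even at billion-token scale, at the cost of missing near-duplicates, which require fuzzier techniques such as MinHash.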
- The decisions made during the data collection and processing phases might not always be optimal.
- Some documents in MathPile may not always be of the highest quality. We are committed to continually refining and optimizing this corpus.
- These invaluable corpora are the culmination of human intellect and should be utilized for the betterment of humanity, aiding in the improvement of human life. We strongly urge all users to refrain from using our corpus for any activities that may harm national or social security or violate the law.
- We have done our utmost to ensure the high quality and lawful use of the data. However, unforeseen issues may still arise, including but not limited to data security concerns and any risks or problems stemming from misuse. We shall not be held responsible for any such issues.
If the source data of MathPile is governed by a license more restrictive than CC BY-NC-SA 4.0, MathPile adheres to that stricter licensing. In all other cases, it operates under the CC BY-NC-SA 4.0 license. We also plan to release a commercially usable version of the dataset soon.
Below are some projects that use MathPile, covering scenarios including but not limited to pre-training, data synthesis, and benchmarking:
- Quality or Quantity? Comparing Domain-Adaptive Pre-training Approaches for Language Models with Mathematical Understanding [Stanford CS224N Custom Project]
- JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models
- Task Oriented In-Domain Data Augmentation
- Great Memory, Shallow Reasoning: Limits of $k$NN-LMs
- BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts
- SciDFM: A Large Language Model with Mixture-of-Experts for Science
- MIND: Math Informed syNthetic Dialogues for Pretraining LLMs
- ...
If you find our work useful or use MathPile, please cite our paper:
```bibtex
@article{wang2023mathpile,
  title={Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
  author={Wang, Zengzhi and Xia, Rui and Liu, Pengfei},
  journal={arXiv preprint arXiv:2312.17120},
  year={2023}
}
```