Data Collection and How to Contribute
As part of building JoeyLLM, Australia’s first open-source language model, we are actively collecting and processing high-quality textual data. This page outlines the current approach to data gathering, filtering, and tokenization, and how you can contribute.
Our goal is to collect at least 10 TB of high-quality English-language text, with an emphasis on Australian-specific content where possible. As of now, we have processed approximately 1 TB, and we’re calling on the community to help us scale further.
We’ve created a centralised GitHub repository under the SouthernCrossAI organization:
👉 SouthernCrossAI/CentralisedData
Each dataset is managed under its own branch in this repo. For example:
- `main`
- `fineweb` – contains scripts and data filtered from the FineWeb dataset
- Future datasets (e.g., Common Crawl, Wikipedia) will each have their own branch
We use the tiktoken tokenizer (GPT-2 encoding) to process and shard datasets into .npy files for training.
The tokenizer script performs the following steps:
- Load a dataset from Hugging Face using `load_dataset`
- Tokenize each document by appending `<|endoftext|>` and encoding with GPT-2 tokens
- Shard the tokenized data into 100-million-token blocks
- Save each shard as a NumPy array (`.npy`)
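For orientation, here is a minimal sketch of that flow, assuming a streaming Hugging Face dataset and a simple shard-naming scheme. It is not necessarily the exact contents of `tokenize_fineweb.py` (the CLI flags and output names here are assumptions), so check the `fineweb` branch for the authoritative script.

```python
# Sketch of the load -> tokenize -> shard -> save pipeline described above.
# Flag names, shard naming, and the streaming setup are assumptions.
import argparse

import numpy as np
import tiktoken
from datasets import load_dataset

SHARD_SIZE = 100_000_000  # 100 million tokens per shard


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset", required=True, help="Hugging Face dataset path")
    parser.add_argument("--field", default="text", help="column holding the raw text")
    args = parser.parse_args()

    enc = tiktoken.get_encoding("gpt2")
    # Token id for <|endoftext|>, appended after every document.
    eot = enc.encode("<|endoftext|>", allowed_special={"<|endoftext|>"})[0]

    # Stream so the full dataset never has to fit in memory.
    ds = load_dataset(args.dataset, split="train", streaming=True)

    tokens, shard_idx = [], 0
    for doc in ds:
        tokens.extend(enc.encode_ordinary(doc[args.field]))
        tokens.append(eot)
        while len(tokens) >= SHARD_SIZE:
            shard = np.array(tokens[:SHARD_SIZE], dtype=np.uint16)  # GPT-2 ids fit in uint16
            np.save(f"shard_{shard_idx:05d}.npy", shard)
            tokens = tokens[SHARD_SIZE:]
            shard_idx += 1

    if tokens:  # flush the final, partially filled shard
        np.save(f"shard_{shard_idx:05d}.npy", np.array(tokens, dtype=np.uint16))


if __name__ == "__main__":
    main()
```

A real run would typically parallelise tokenization across worker processes, but the shape of the pipeline stays the same.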
Example invocation:

`python tokenize_fineweb.py --dataset fineweb --field <field_name>`

We need your help to grow the dataset! Here’s how:
- Fork or clone `centraliseddata`
- Create a new branch for your dataset
- Adapt `tokenize_fineweb.py` to your data source
- Push the tokenized `.npy` files to your branch
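In concrete terms, the Git workflow might look like the sketch below. The clone URL and branch name are placeholders, and very large `.npy` shards may need Git LFS rather than a plain commit.

```bash
# Assumed commands -- substitute your fork URL and your own branch/dataset name.
git clone https://github.com/SouthernCrossAI/CentralisedData.git
cd CentralisedData
git checkout -b my-dataset                  # one branch per dataset
# ...adapt tokenize_fineweb.py, run it, and collect the resulting shards...
git add tokenize_my_dataset.py shard_*.npy DATA_SOURCE.md
git commit -m "Add tokenized my-dataset shards"
git push origin my-dataset
```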
Please include a Markdown file in your branch (e.g., DATA_SOURCE.md) or update this Wiki. Be sure to describe:
- Dataset source (e.g., Hugging Face link)
- Filtering scheme (keywords, geolocation, etc.)
- Whether the data is:
- 🇦🇺 Australia-specific
- 🌏 General English-language
- Any cleaning or preprocessing steps taken
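As a starting point, a `DATA_SOURCE.md` skeleton along these lines (headings are only a suggestion) covers everything above:

```markdown
# My Dataset

## Source
Link to the Hugging Face dataset or original site

## Filtering scheme
Keywords, geolocation, or other criteria used

## Scope
🇦🇺 Australia-specific or 🌏 general English-language

## Cleaning and preprocessing
Steps taken before tokenization (deduplication, boilerplate removal, etc.)
```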
- Submit a PR to the `SouthernCrossAI/centraliseddata` repo
- Make sure your PR includes:
  - Tokenized `.npy` files
  - Cleaning and filtering documentation
  - Notes/logs from preprocessing (optional but helpful)
Before tokenization, please make sure that you:
- Remove non-English or low-quality content
- Filter out spam, boilerplate, and duplicated entries
- Prioritise long-form, informative content (articles, essays, discussions, code comments)
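There is no mandated cleaning pipeline; a lightweight pre-filter such as the sketch below can handle the basics, with thresholds and heuristics that are purely illustrative:

```python
# Illustrative pre-tokenization filter: drops stubs, exact duplicates,
# and text that is unlikely to be English. Thresholds are assumptions --
# tune them (or swap in a proper language detector) for your source.
import hashlib


def looks_english(text: str, min_ascii_ratio: float = 0.95) -> bool:
    """Cheap heuristic: mostly-ASCII text is probably English."""
    if not text:
        return False
    ascii_chars = sum(ch.isascii() for ch in text)
    return ascii_chars / len(text) >= min_ascii_ratio


def clean_documents(docs):
    """Yield documents that pass length, language, and duplicate checks."""
    seen = set()
    for text in docs:
        text = text.strip()
        if len(text) < 200:            # skip very short / boilerplate fragments
            continue
        if not looks_english(text):    # skip likely non-English documents
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:             # skip exact duplicates
            continue
        seen.add(digest)
        yield text
```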
If you’re unsure how to begin, refer to the `fineweb` branch or ask questions in our discussion board.
This process is community-driven and open to contributors of all backgrounds. Whether you're uploading datasets, helping clean them, or improving documentation — you’re part of the mission.
If you're interested in contributing to the data pipeline, start with the `fineweb` branch and follow its structure.
Let’s build a truly representative Australian language model — together. 🇦🇺