Data Collection and How to Contribute

📦 Data Collection & Tokenization Process for JoeyLLM

As part of building JoeyLLM, Australia’s first open-source language model, we are actively collecting and processing high-quality textual data. This page outlines the current approach to data gathering, filtering, and tokenization, and how you can contribute.


🔍 Overview

Our goal is to collect at least 10 TB of high-quality English-language text, with an emphasis on Australian-specific content where possible. As of now, we have processed approximately 1 TB, and we’re calling on the community to help us scale further.


📁 Centralised Data Repository

We’ve created a centralised GitHub repository under the SouthernCrossAI organization:

👉 SouthernCrossAI/CentralisedData

Each dataset is managed under its own branch in this repo. For example:

  • main
  • fineweb – contains scripts and data filtered from the FineWeb dataset
  • Future datasets (e.g., Common Crawl, Wikipedia) will each have their own branch

🧪 Data Tokenization Pipeline

We use the tiktoken tokenizer (GPT-2 encoding) to process and shard datasets into .npy files for training.

✅ Script Summary

The tokenizer script performs the following steps (a condensed sketch follows the list):

  1. Load a dataset from Hugging Face using load_dataset
  2. Tokenize each document with the GPT-2 encoding, appending an <|endoftext|> token
  3. Shard tokenized data into 100 million-token blocks
  4. Save each shard as a NumPy array (.npy)
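
To make these steps concrete, here is a condensed sketch of that logic. It is not the actual tokenize_fineweb.py; the dataset name, text field, and output filenames are placeholders, and streaming is used only so the full dataset never has to fit in memory.

```python
# Sketch of the tokenize-and-shard pipeline described above (placeholders, not the real script).
import numpy as np
import tiktoken
from datasets import load_dataset

SHARD_SIZE = 100_000_000  # 100 million tokens per shard
enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # id of <|endoftext|>

def tokenize_doc(text):
    # Encode the document with GPT-2 BPE and append the <|endoftext|> delimiter
    return enc.encode_ordinary(text) + [eot]

# Placeholder dataset/field -- replace with your own source and --field value
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

buffer, shard_index = [], 0
for example in ds:
    buffer.extend(tokenize_doc(example["text"]))
    while len(buffer) >= SHARD_SIZE:
        # GPT-2 token ids (< 50,257) fit comfortably in uint16
        shard = np.array(buffer[:SHARD_SIZE], dtype=np.uint16)
        np.save(f"fineweb_shard_{shard_index:05d}.npy", shard)
        buffer = buffer[SHARD_SIZE:]
        shard_index += 1

# Save any leftover tokens as a final, smaller shard
if buffer:
    np.save(f"fineweb_shard_{shard_index:05d}.npy", np.array(buffer, dtype=np.uint16))
```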

🔧 How to Run

python tokenize_fineweb.py --dataset fineweb --field <field_name>
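
After a run finishes, you can sanity-check a shard by loading it back and decoding a few tokens. The filename below is illustrative; point it at whatever your run produced.

```python
# Quick sanity check on a generated shard (filename is illustrative)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
shard = np.load("fineweb_shard_00000.npy")

print(f"{shard.shape[0]:,} tokens, dtype={shard.dtype}")
print(enc.decode(shard[:200].tolist()))  # first ~200 tokens back as readable text
```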

🤝 How You Can Contribute

We need your help to grow the dataset! Here’s how:

1. Add a New Dataset

  • Fork or clone the CentralisedData repo
  • Create a new branch for your dataset
  • Adapt tokenize_fineweb.py to your data source (see the sketch after this list for finding the right text field)
  • Push tokenized .npy files to your branch
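
Before adapting the script, it helps to confirm which field of your source dataset actually holds the raw text. One quick way to check is sketched below; the dataset name and config are only examples, not a required choice.

```python
# Inspect a candidate Hugging Face dataset to find its text field
# ("wikimedia/wikipedia" and "20231101.en" are illustrative placeholders).
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

first = next(iter(ds))
print(first.keys())         # lists the available fields, e.g. 'id', 'url', 'title', 'text'
print(first["text"][:500])  # preview the field you would pass as --field
```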

2. Document Your Process

Please include a Markdown file in your branch (e.g., DATA_SOURCE.md) or update this Wiki. Be sure to describe:

  • Dataset source (e.g., Hugging Face link)
  • Filtering scheme (keywords, geolocation, etc.)
  • Whether the data is:
    • 🇦🇺 Australia-specific
    • 🌏 General English-language
  • Any cleaning or preprocessing steps taken

3. Submit a Pull Request

  • Submit a PR to the SouthernCrossAI/CentralisedData repo
  • Make sure your PR includes:
    • Tokenized .npy files
    • Cleaning and filtering documentation
    • Notes/logs from preprocessing (optional but helpful)

🧼 Cleaning Requirements

Before tokenization, please make sure your dataset:

  • Contains no non-English or low-quality content
  • Has been filtered for spam, boilerplate, and duplicated entries
  • Prioritizes long-form, informative content (articles, essays, discussions, code comments)
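
As a starting point, a minimal cleaning pass might look like the sketch below. The ASCII-ratio heuristic and the thresholds are assumptions for illustration, not project requirements; proper language identification and fuzzy deduplication would do better in practice.

```python
# Minimal cleaning sketch: heuristic English filter, exact-duplicate removal,
# and a minimum-length cut. Thresholds here are illustrative assumptions.
import hashlib

MIN_CHARS = 500        # favour long-form documents
MIN_ASCII_RATIO = 0.9  # crude stand-in for real language identification

def looks_english(text: str) -> bool:
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    return ascii_chars / max(len(text), 1) >= MIN_ASCII_RATIO

def clean(docs):
    seen = set()
    for text in docs:
        text = text.strip()
        if len(text) < MIN_CHARS or not looks_english(text):
            continue  # drop short or likely non-English documents
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        yield text
```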

If you’re unsure how to begin, refer to the fineweb branch or ask questions in our discussion board.


💡 Final Notes

This process is community-driven and open to contributors of all backgrounds. Whether you're uploading datasets, helping clean them, or improving documentation — you’re part of the mission.

If you're interested in contributing to the data pipeline, start with the fineweb branch and follow its structure.

Let’s build a truly representative Australian language model — together. 🇦🇺
