Data Collection and How to Contribute

📦 Data Collection & Tokenization Process for JoeyLLM

As part of building JoeyLLM, Australia’s first open-source language model, we are actively collecting and processing high-quality textual data. This page outlines the current approach to data gathering, filtering, and tokenization, and how you can contribute.


🔍 Overview

Our goal is to collect at least 10 TB of high-quality English-language text, with an emphasis on Australian-specific content where possible. As of now, we have processed approximately 1 TB, and we’re calling on the community to help us scale further.


📁 Centralised Data Repository

We’ve created a centralised GitHub repository under the SouthernCrossAI organization:

👉 SouthernCrossAI/CentralisedData

Each dataset is managed under its own branch in this repo. For example:

  • main
  • fineweb – contains scripts and data filtered from the FineWeb dataset
  • Future datasets (e.g., Common Crawl, Wikipedia) will each have their own branch

🧪 Data Tokenization Pipeline

We use the tiktoken tokenizer (GPT-2 encoding) to process and shard datasets into .npy files for training.

✅ Script Summary

The tokenizer script performs the following steps (a condensed sketch follows the list):

  1. Load a dataset from Hugging Face using load_dataset
  2. Tokenize each document with the GPT-2 encoding, appending an <|endoftext|> token
  3. Shard tokenized data into 100 million-token blocks
  4. Save each shard as a NumPy array (.npy)
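
To make these steps concrete, here is a condensed sketch of that logic. It is not the actual tokenize_fineweb.py; the dataset name, text field, and output filenames are placeholders, and streaming is used only so the full dataset never has to fit in memory.

```python
# Sketch of the tokenize-and-shard pipeline described above (placeholders, not the real script).
import numpy as np
import tiktoken
from datasets import load_dataset

SHARD_SIZE = 100_000_000  # 100 million tokens per shard
enc = tiktoken.get_encoding("gpt2")
eot = enc.eot_token  # id of <|endoftext|>

def tokenize_doc(text):
    # Encode the document with GPT-2 BPE and append the <|endoftext|> delimiter
    return enc.encode_ordinary(text) + [eot]

# Placeholder dataset/field -- replace with your own source and --field value
ds = load_dataset("HuggingFaceFW/fineweb", split="train", streaming=True)

buffer, shard_index = [], 0
for example in ds:
    buffer.extend(tokenize_doc(example["text"]))
    while len(buffer) >= SHARD_SIZE:
        # GPT-2 token ids (< 50,257) fit comfortably in uint16
        shard = np.array(buffer[:SHARD_SIZE], dtype=np.uint16)
        np.save(f"fineweb_shard_{shard_index:05d}.npy", shard)
        buffer = buffer[SHARD_SIZE:]
        shard_index += 1

# Save any leftover tokens as a final, smaller shard
if buffer:
    np.save(f"fineweb_shard_{shard_index:05d}.npy", np.array(buffer, dtype=np.uint16))
```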

🔧 How to Run

python tokenize_fineweb.py --dataset fineweb --field <field_name>
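
After a run finishes, you can sanity-check a shard by loading it back and decoding a few tokens. The filename below is illustrative; point it at whatever your run produced.

```python
# Quick sanity check on a generated shard (filename is illustrative)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")
shard = np.load("fineweb_shard_00000.npy")

print(f"{shard.shape[0]:,} tokens, dtype={shard.dtype}")
print(enc.decode(shard[:200].tolist()))  # first ~200 tokens back as readable text
```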

🤝 How You Can Contribute

We need your help to grow the dataset! Here’s how:

1. Add a New Dataset

  • Fork or clone the CentralisedData repo
  • Create a new branch for your dataset
  • Adapt tokenize_fineweb.py to your data source (see the sketch after this list for finding the right text field)
  • Push tokenized .npy files to your branch
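
Before adapting the script, it helps to confirm which field of your source dataset actually holds the raw text. One quick way to check is sketched below; the dataset name and config are only examples, not a required choice.

```python
# Inspect a candidate Hugging Face dataset to find its text field
# ("wikimedia/wikipedia" and "20231101.en" are illustrative placeholders).
from datasets import load_dataset

ds = load_dataset("wikimedia/wikipedia", "20231101.en", split="train", streaming=True)

first = next(iter(ds))
print(first.keys())         # lists the available fields, e.g. 'id', 'url', 'title', 'text'
print(first["text"][:500])  # preview the field you would pass as --field
```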

2. Document Your Process

Please include a Markdown file in your branch (e.g., DATA_SOURCE.md) or update this Wiki. Be sure to describe:

  • Dataset source (e.g., Hugging Face link)
  • Filtering scheme (keywords, geolocation, etc.)
  • Whether the data is:
    • 🇦🇺 Australia-specific
    • 🌏 General English-language
  • Any cleaning or preprocessing steps taken

3. Submit a Pull Request

  • Submit a PR to the SouthernCrossAI/CentralisedData repo
  • Make sure your PR includes:
    • Tokenized .npy files
    • Cleaning and filtering documentation
    • Notes/logs from preprocessing (optional but helpful)

🧼 Cleaning Requirements

Before tokenization, please make sure your dataset:

  • Contains no non-English or low-quality content
  • Has been filtered for spam, boilerplate, and duplicated entries
  • Prioritizes long-form, informative content (articles, essays, discussions, code comments)
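
As a starting point, a minimal cleaning pass might look like the sketch below. The ASCII-ratio heuristic and the thresholds are assumptions for illustration, not project requirements; proper language identification and fuzzy deduplication would do better in practice.

```python
# Minimal cleaning sketch: heuristic English filter, exact-duplicate removal,
# and a minimum-length cut. Thresholds here are illustrative assumptions.
import hashlib

MIN_CHARS = 500        # favour long-form documents
MIN_ASCII_RATIO = 0.9  # crude stand-in for real language identification

def looks_english(text: str) -> bool:
    ascii_chars = sum(1 for c in text if ord(c) < 128)
    return ascii_chars / max(len(text), 1) >= MIN_ASCII_RATIO

def clean(docs):
    seen = set()
    for text in docs:
        text = text.strip()
        if len(text) < MIN_CHARS or not looks_english(text):
            continue  # drop short or likely non-English documents
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # drop exact duplicates
        seen.add(digest)
        yield text
```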

If you’re unsure how to begin, refer to the fineweb branch or ask questions in our discussion board.


💡 Final Notes

This process is community-driven and open to contributors of all backgrounds. Whether you're uploading datasets, helping clean them, or improving documentation — you’re part of the mission.

If you're interested in contributing to the data pipeline, start with the fineweb branch and follow its structure.

Let’s build a truly representative Australian language model — together. 🇦🇺
