Project Structure


UneshNeupane edited this page May 21, 2025 · 3 revisions

📁 Below is an overview of the top-level folders and files in the JoeyLLM repository, and what each contains:

[NOTE] THIS PAGE IS STILL UNDER CONSTRUCTION.

.github

Purpose: GitHub-specific configuration
Contains:
- Issue & PR templates
- Workflow configurations (e.g. CI/CD pipelines, code checks)

configs

Purpose: Hydra configuration groups & Pydantic schemas
Contains:
- config.yaml (master defaults)
- Sub-folders:
  - model/ (e.g. joey_12.yaml)
  - data/ (e.g. gutenberg_au.yaml)
  - train/ (e.g. 1GPU.yaml)
- config.py (Pydantic AppConfig, ModelConfig, DataConfig, TrainConfig)
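
As a rough illustration of how the Pydantic schemas could validate a composed config, here is a minimal sketch. The class names come from config.py as listed above, but every field name and constraint is a hypothetical placeholder, and `validate_config` is an illustrative helper rather than a function from the repository.

```python
from pydantic import BaseModel, Field, ValidationError


class ModelConfig(BaseModel):
    # Hypothetical fields; the real schema lives in configs/config.py.
    vocab_size: int = Field(gt=0)
    embed_dim: int = Field(gt=0)
    num_layers: int = Field(gt=0)


class DataConfig(BaseModel):
    # Hypothetical fields.
    dataset_name: str
    batch_size: int = Field(gt=0)


class TrainConfig(BaseModel):
    # Hypothetical fields.
    epochs: int = Field(gt=0)
    learning_rate: float = Field(gt=0)


class AppConfig(BaseModel):
    # One entry per config group (model/, data/, train/).
    model: ModelConfig
    data: DataConfig
    train: TrainConfig


def validate_config(raw: dict) -> AppConfig:
    """Validate a plain dict against AppConfig; raises ValidationError on bad input.

    With Hydra, the dict would typically be produced by
    OmegaConf.to_container(cfg, resolve=True) before being passed in here.
    """
    return AppConfig(**raw)


if __name__ == "__main__":
    example = {
        "model": {"vocab_size": 50000, "embed_dim": 512, "num_layers": 12},
        "data": {"dataset_name": "gutenberg_au", "batch_size": 8},
        "train": {"epochs": 1, "learning_rate": 3e-4},
    }
    cfg = validate_config(example)
    print(cfg.model.num_layers)

    try:
        # epochs must be positive, so this input is rejected.
        validate_config({**example, "train": {"epochs": 0, "learning_rate": 3e-4}})
    except ValidationError as err:
        print("invalid config:", type(err).__name__)
```

Validating once at start-up means a typo or an out-of-range value in a YAML file fails before any GPU time is spent.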

data

Purpose: Data loading & preprocessing
Contains:
- dataset.py (factory that builds PyTorch DataLoaders from Hugging Face datasets)
- chunk.py (scripts to split/tokenize & push chunks to HF Hub)
- test_data.py (unit tests for the data pipeline)
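
The splitting step mentioned for chunk.py can be pictured as cutting a long stream of token IDs into fixed-length blocks. The sketch below uses only the standard library. `chunk_tokens`, `block_size`, and `drop_last` are hypothetical names chosen for illustration and may not match what chunk.py actually does.

```python
from typing import List, Sequence


def chunk_tokens(
    token_ids: Sequence[int], block_size: int, drop_last: bool = True
) -> List[List[int]]:
    """Split a flat token sequence into consecutive blocks of block_size tokens.

    When drop_last is True, a trailing block shorter than block_size is
    discarded so that every returned block has the same length.
    """
    if block_size <= 0:
        raise ValueError("block_size must be positive")

    chunks = [
        list(token_ids[start:start + block_size])
        for start in range(0, len(token_ids), block_size)
    ]
    if drop_last and chunks and len(chunks[-1]) < block_size:
        chunks.pop()
    return chunks


if __name__ == "__main__":
    tokens = list(range(10))  # stand-in for real tokenizer output
    print(chunk_tokens(tokens, block_size=4))
    print(chunk_tokens(tokens, block_size=4, drop_last=False))
```

Equal-length blocks can be stacked into a batch tensor without padding, which is one reason a pipeline might discard the short remainder.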

model

Purpose: Core model implementation & tests
Contains:
- joeyllm.py (PyTorch JoeyLLM transformer-decoder model)
- test_model.py (shape-checking & forward-pass smoke tests)

tests/configs

Purpose: Ensure config schemas & Hydra integration work
Contains:
- PyTest files that load Hydra configs and validate with Pydantic

tokenizer

Purpose: Tokenizer integration & custom scripts
Contains:
- test_tokenizer.py (validate tiktoken.get_encoding("cl100k_base"))
- train_tokenizer.py (placeholder for a future BPE training pipeline)

train

Purpose: Training orchestration & trainer classes
Contains:
- train_loop.py or JoeyLLMTrainer (multi/single-GPU loop, W&B integration)
- Supporting utils (checkpointing, optimizer setup)

🔧 Root Files

- .dockerignore / Dockerfile – container build rules & base image
- .gitignore – files excluded from Git tracking
- README.md – high-level project overview & quickstart
- main.py – Hydra entry-point for full training run
- pre_run_test.py – smoke-test script for configs, model & data
- requirements.txt – pinned Python dependencies

← Back to Southern Cross AI – JoeyLLM Repository • Join Our Discord • Report an Issue