Project Structure

UneshNeupane edited this page May 21, 2025 · 3 revisions

📁 Below is an overview of the top-level folders and files in the JoeyLLM repository, and what each contains:

Note: This page is still under construction.


.github

  • Purpose: GitHub-specific configuration
  • Contains:
    • Issue & PR templates
    • Workflow configurations (e.g. CI/CD pipelines, code checks)

configs

  • Purpose: Hydra configuration groups & Pydantic schemas
  • Contains:
    • config.yaml (master defaults)
    • Sub-folders:
      • model/ (e.g. joey_12.yaml)
      • data/ (e.g. gutenberg_au.yaml)
      • train/ (e.g. 1GPU.yaml)
    • config.py (Pydantic AppConfig, ModelConfig, DataConfig, TrainConfig)
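As a rough illustration of how nested Pydantic schemas can validate a composed configuration, here is a minimal sketch. The class names match those listed for config.py, but every field name and value below is a hypothetical placeholder, not the repository's actual schema, and the `raw` dict only stands in for a Hydra config that has been converted to a plain dictionary.

```python
from pydantic import BaseModel, ValidationError


# NOTE: all fields below are illustrative assumptions; the real
# definitions live in configs/config.py and may differ.
class ModelConfig(BaseModel):
    n_layers: int
    d_model: int


class DataConfig(BaseModel):
    dataset_name: str
    batch_size: int


class TrainConfig(BaseModel):
    max_steps: int
    learning_rate: float


class AppConfig(BaseModel):
    model: ModelConfig
    data: DataConfig
    train: TrainConfig


# Stand-in for a composed Hydra config converted to a plain dict.
raw = {
    "model": {"n_layers": 12, "d_model": 768},
    "data": {"dataset_name": "gutenberg_au", "batch_size": 8},
    "train": {"max_steps": 1000, "learning_rate": 3e-4},
}

# Nested dicts are validated and converted into the nested models.
cfg = AppConfig(**raw)


def is_valid(candidate: dict) -> bool:
    """Return True if the candidate dict satisfies the AppConfig schema."""
    try:
        AppConfig(**candidate)
        return True
    except ValidationError:
        return False
```

Validating at start-up this way surfaces a mistyped or missing value before any training resources are allocated.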

data

  • Purpose: Data loading & preprocessing
  • Contains:
    • dataset.py (DataLoader factory that wraps Hugging Face datasets in PyTorch DataLoaders)
    • chunk.py (scripts to split/tokenize & push chunks to HF Hub)
    • test_data.py (unit tests for the data pipeline)
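The splitting step mentioned for chunk.py could look something like the sketch below. It is only an assumption about the general technique (cutting a token sequence into fixed-length blocks); the function name `chunk_tokens` and its parameters are hypothetical and are not taken from the repository.

```python
from typing import Iterator, List, Sequence


def chunk_tokens(
    token_ids: Sequence[int],
    chunk_size: int,
    drop_last: bool = True,
) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping chunks of `chunk_size` token ids.

    Hypothetical helper: when `drop_last` is True, a trailing chunk
    shorter than `chunk_size` is discarded so every chunk has equal length.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        chunk = list(token_ids[start:start + chunk_size])
        if len(chunk) < chunk_size and drop_last:
            return
        yield chunk


# Ten placeholder token ids split into blocks of four.
chunks = list(chunk_tokens(list(range(10)), chunk_size=4))
```

Dropping the short trailing chunk keeps every training example the same length, at the cost of discarding a few tokens per document.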

model

  • Purpose: Core model implementation & tests
  • Contains:
    • joeyllm.py (PyTorch JoeyLLM transformer-decoder model)
    • test_model.py (shape-checking & forward-pass smoke tests)
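A shape-checking smoke test of the kind described for test_model.py might be sketched as follows. `TinyDecoderStandIn` is a hypothetical placeholder used only so the example runs on its own; it is not the JoeyLLM architecture, and the helper name and sizes are assumptions.

```python
import torch
from torch import nn


class TinyDecoderStandIn(nn.Module):
    """Hypothetical stand-in for a language model: token ids in, logits out."""

    def __init__(self, vocab_size: int, d_model: int) -> None:
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.embed(input_ids))


def check_forward_shape(model: nn.Module, vocab_size: int,
                        batch_size: int = 2, seq_len: int = 16) -> torch.Size:
    """Run one forward pass on random token ids and check the logits."""
    input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
    with torch.no_grad():
        logits = model(input_ids)
    # A decoder should emit one vocabulary-sized logit vector per position.
    assert logits.shape == (batch_size, seq_len, vocab_size)
    assert torch.isfinite(logits).all()
    return logits.shape


shape = check_forward_shape(TinyDecoderStandIn(100, 32), vocab_size=100)
```

A check like this is cheap enough to run on CPU before launching a full training job.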

tests/configs

  • Purpose: Ensure config schemas & Hydra integration work
  • Contains:
    • pytest files that load the Hydra configs and validate them against the Pydantic schemas

tokenizer

  • Purpose: Tokenizer integration & custom scripts
  • Contains:
    • test_tokenizer.py (validates tiktoken.get_encoding("cl100k_base"))
    • train_tokenizer.py (placeholder for a future BPE training pipeline)

train

  • Purpose: Training orchestration & trainer classes
  • Contains:
    • train_loop.py or JoeyLLMTrainer (single- and multi-GPU training loop, W&B integration)
    • Supporting utils (checkpointing, optimizer setup)

🔧 Root Files

  • .dockerignore / Dockerfile – container build rules & base image
  • .gitignore – files excluded from Git tracking
  • README.md – high-level project overview & quickstart
  • main.py – Hydra entry point for a full training run
  • pre_run_test.py – smoke-test script for configs, model & data
  • requirements.txt – pinned Python dependencies
