Project Structure
UneshNeupane edited this page May 21, 2025 · 3 revisions
📁 Below is an overview of the top-level folders and files in the JoeyLLM repository, and what each contains:
[NOTE] THIS PAGE IS STILL UNDER CONSTRUCTION.
- Purpose: GitHub-specific configuration
- Contains:
  - Issue & PR templates
  - Workflow configurations (e.g. CI/CD pipelines, code checks)
- Purpose: Hydra configuration groups & Pydantic schemas
- Contains:
  - `config.yaml` (master defaults)
  - Sub-folders:
    - `model/` (e.g. `joey_12.yaml`)
    - `data/` (e.g. `gutenberg_au.yaml`)
    - `train/` (e.g. `1GPU.yaml`)
  - `config.py` (Pydantic `AppConfig`, `ModelConfig`, `DataConfig`, `TrainConfig`)
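To illustrate how Pydantic schemas can guard a nested configuration of this shape, here is a minimal sketch. The class names match those listed for `config.py`, but every field name and value below is a hypothetical placeholder, not the repository's actual schema.

```python
from pydantic import BaseModel, ValidationError


class ModelConfig(BaseModel):
    # Hypothetical fields, chosen only for illustration.
    n_layers: int
    d_model: int


class DataConfig(BaseModel):
    # Hypothetical fields, chosen only for illustration.
    dataset_name: str
    batch_size: int


class TrainConfig(BaseModel):
    # Hypothetical fields, chosen only for illustration.
    lr: float
    epochs: int


class AppConfig(BaseModel):
    # One sub-schema per config group (model/, data/, train/).
    model: ModelConfig
    data: DataConfig
    train: TrainConfig


def validate_config(raw: dict) -> AppConfig:
    """Build an AppConfig from a plain nested dict, raising ValidationError on bad input."""
    return AppConfig(**raw)


# A plain dict standing in for a composed configuration (values are made up).
raw_cfg = {
    "model": {"n_layers": 12, "d_model": 768},
    "data": {"dataset_name": "example_dataset", "batch_size": 8},
    "train": {"lr": 3e-4, "epochs": 1},
}
cfg = validate_config(raw_cfg)
print(cfg.model.n_layers, cfg.train.lr)
```

When the configuration comes from Hydra, one common approach is to convert the composed `DictConfig` into a plain dict with `OmegaConf.to_container(cfg, resolve=True)` before passing it to the schema. Whether the repository does it this way is an assumption.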
- Purpose: Data loading & preprocessing
- Contains:
  - `dataset.py` (`Dataloaders` factory for HF datasets → PyTorch `DataLoader`)
  - `chunk.py` (scripts to split/tokenize & push chunks to HF Hub)
  - `test_data.py` (unit tests for the data pipeline)
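The splitting step that a script like `chunk.py` performs can be sketched as follows. This is a generic illustration of cutting a token-id sequence into fixed-length blocks, not the repository's code: the function name `chunk_tokens`, its parameters, and the choice to drop a short trailing chunk by default are all assumptions.

```python
from typing import Iterator, List


def chunk_tokens(
    token_ids: List[int], chunk_size: int, drop_last: bool = True
) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping chunks of `chunk_size` token ids.

    If `drop_last` is True, a final chunk shorter than `chunk_size` is discarded
    so that every yielded chunk has the same length.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(token_ids), chunk_size):
        chunk = token_ids[start:start + chunk_size]
        if len(chunk) < chunk_size and drop_last:
            break
        yield chunk


# Ten made-up token ids split into blocks of four.
tokens = list(range(10))
print(list(chunk_tokens(tokens, 4)))                   # two full chunks
print(list(chunk_tokens(tokens, 4, drop_last=False)))  # also keeps the short tail
```

Equal-length chunks are convenient because they can be batched without padding; keeping the tail instead would require padding or masking later in the pipeline.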
- Purpose: Core model implementation & tests
- Contains:
  - `joeyllm.py` (PyTorch `JoeyLLM` transformer-decoder model)
  - `test_model.py` (shape-checking & forward-pass smoke tests)
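A shape-checking smoke test of the kind mentioned for `test_model.py` might look like the sketch below. Because the real `JoeyLLM` constructor is not shown on this page, the example uses a hypothetical stand-in, `TinyDecoder`, with made-up hyperparameters; only the testing pattern is the point.

```python
import torch
import torch.nn as nn


class TinyDecoder(nn.Module):
    """Hypothetical stand-in for a decoder-only model (NOT the real JoeyLLM)."""

    def __init__(self, vocab_size=100, d_model=32, n_heads=4, n_layers=2, max_len=16):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        _, seq_len = ids.shape
        positions = torch.arange(seq_len, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(positions)
        # Causal mask: True marks positions a token is NOT allowed to attend to.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=ids.device),
            diagonal=1,
        )
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (batch, seq_len, vocab_size)


def forward_smoke_test(model: nn.Module, vocab_size: int, batch: int, seq_len: int) -> torch.Tensor:
    """Run one forward pass on random token ids and check the output shape and values."""
    model.eval()
    ids = torch.randint(0, vocab_size, (batch, seq_len))
    with torch.no_grad():
        logits = model(ids)
    assert logits.shape == (batch, seq_len, vocab_size)
    assert torch.isfinite(logits).all()
    return logits


torch.manual_seed(0)
logits = forward_smoke_test(TinyDecoder(), vocab_size=100, batch=2, seq_len=8)
print(tuple(logits.shape))
```

A test like this does not verify model quality; it only catches wiring mistakes such as mismatched dimensions or NaNs before a long training run starts.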
- Purpose: Ensure config schemas & Hydra integration work
- Contains:
  - pytest files that load Hydra configs and validate them with Pydantic
- Purpose: Tokenizer integration & custom scripts
- Contains:
  - `test_tokenizer.py` (validate `tiktoken.get_encoding("cl100k_base")`)
  - `train_tokenizer.py` (placeholder for future BPE-training pipeline)
- Purpose: Training orchestration & trainer classes
- Contains:
  - `train_loop.py` or `JoeyLLMTrainer` (multi/single-GPU loop, W&B integration)
  - Supporting utils (checkpointing, optimizer setup)
- `.dockerignore` / `Dockerfile` – container build rules & base image
- `.gitignore` – files excluded from Git tracking
- `README.md` – high-level project overview & quickstart
- `main.py` – Hydra entry-point for full training run
- `pre_run_test.py` – smoke-test script for configs, model & data
- `requirements.txt` – pinned Python dependencies