Changes from all commits (31 commits)
- 7124125 Update README.md (kamel-yamani, Jul 1, 2024)
- 0b0fde7 First Commit (kamel-yamani, Jul 1, 2024)
- 9ab01b1 Adding figure (kamel-yamani, Jul 1, 2024)
- 7ac0507 Delete TLMF.png (kamel-yamani, Jul 1, 2024)
- 6b9a579 Update README.md (kamel-yamani, Jul 1, 2024)
- f41a7e9 Update README.md (kamel-yamani, Jul 1, 2024)
- ab9fbe4 Update README.md (kamel-yamani, Jul 1, 2024)
- 02551c5 adding code completion eval (MarwaNair, Jul 2, 2024)
- a61c49c Update README.md (MarwaNair, Jul 2, 2024)
- 9c3e9e3 Update README.md (MarwaNair, Jul 2, 2024)
- e4710bb Update README.md (kamel-yamani, Jul 20, 2024)
- 4bf4378 Update README.md (kamel-yamani, Jul 21, 2024)
- 962cb56 Update README.md (kamel-yamani, Jul 22, 2024)
- 3cf8ebc Update README.md (kamel-yamani, Jul 22, 2024)
- 1416bdc TinyLM Starter Notebook Added (kamel-yamani, Jul 22, 2024)
- 37cb9e9 Update requirements.txt (kamel-yamani, Jul 22, 2024)
- 917551e Fixing TinyPy generator (kamel-yamani, Jul 30, 2024)
- 85cd7be Update README.md (kamel-yamani, Jul 30, 2024)
- 92b963b Update tinypy_generator.py (MarwaNair, Jul 30, 2024)
- 08e521d Update README.md (MarwaNair, Jul 30, 2024)
- 6cd85d9 added tasks folder (BenouaklilHodhaifa, Sep 25, 2024)
- a51cf0e Contributing the line execution count task files, by ibrahim-aboud (ibrahim-aboud, Sep 25, 2024)
- c68fca0 Merge pull request #1 from ibrahim-aboud/main (BenouaklilHodhaifa, Sep 25, 2024)
- 9ae30e3 first reorganised version of the repo (Oct 6, 2024)
- c192297 Merge pull request #4 from Modern-Compilers-Lab/reorg (YounesBoukacem, Oct 6, 2024)
- 5a4a26a added dataset-3 and created dataprep-1 on it (YounesBoukacem, Oct 7, 2024)
- ed353c7 created the first version of venus, an inhouse utility for managing t… (YounesBoukacem, Oct 7, 2024)
- e333748 Merge pull request #6 from Modern-Compilers-Lab/venus (YounesBoukacem, Oct 7, 2024)
- f93720f Operator Prediction Data Preparation Code (Ellzo, Oct 9, 2024)
- 1f31ae0 README added for data preparation for operator prediction task (Ellzo, Oct 9, 2024)
- 862acb8 README added for data preparation for operator prediction task (Ellzo, Oct 9, 2024)
44 changes: 36 additions & 8 deletions README.md
@@ -1,6 +1,12 @@
# Tiny Language Models Framework

This repository contains the implementation and resources for the Tiny Language Models Framework project. In this project, we developed small-scale language models to facilitate detailed research into various aspects of large language models (LLMs), particularly in the domain of code.

<p align="center">
<img src="https://github.com/Modern-Compilers-Lab/Tiny-Language-Models-Framework/assets/86785811/946011ac-90ca-454f-baeb-d74b09a1721c" width="500" >
</p>

We've also prepared a [TinyLM Starter Notebook on Kaggle](https://www.kaggle.com/code/nairmarwa/tinylm-starter-notebook). This notebook is designed to help you get started quickly with our project. It guides you through training a tiny language model from scratch using our dataset and evaluating its performance on code execution tasks.

## Project Structure

@@ -33,7 +39,11 @@ This repository contains the implementation and resources for the Tiny Language

- `demonstration.ipynb` : Jupyter notebook demonstrating the usage of the models and scripts.

- `eval.py` : Script to evaluate the trained models.
- `code_execution.py` : Script to evaluate the trained models on the code execution task.

- `token-level_code_completion.py` : Script to evaluate the trained models on the token-level code completion task.

- `line-level_code_completion.py` : Script to evaluate the trained models on the line-level code completion task.

- `model.py` : Contains the model architecture and related functions.

@@ -42,6 +52,7 @@ This repository contains the implementation and resources for the Tiny Language
- `train.py` : Script to train the models.

## Requirements
We've used Python 3.11.7.

To install the required packages, you can use the following:
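The exact install command is collapsed in this diff view; a typical invocation, assuming the repository's `requirements.txt` sits at the repo root, would be:

```bash
# Assumed install step: requirements.txt exists in the repository (see the
# "Update requirements.txt" commit above); the exact documented command is not shown in this diff.
pip install -r requirements.txt
```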

@@ -59,15 +70,15 @@ cd data/
python tinypy_generator.py --num_programs 1000 --level 1.1 --filename sample_data.txt --deduplicate
```

This generation command is just an example to get you started. If you want to train your own model, you'll likely need to generate significantly more data.
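As an illustration, a larger run could reuse the same flags with a higher program count (the values below are hypothetical, not settings documented in the repository):

```bash
# Hypothetical larger-scale generation run: same flags as the documented example,
# with an illustrative --num_programs value and output filename.
python tinypy_generator.py --num_programs 1000000 --level 1.1 --filename train_data.txt --deduplicate
```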

### Data Preparation
Prepare the data by running:
Prepare (tokenize and split) the data by running:

```bash
python prepare.py
```

This generation command is just an example to get you started. If you want to train your own model, you'll likely need to generate significantly more data.

### Training
Train the model using the following command:

@@ -78,10 +89,22 @@ python train.py --batch_size 64 --max_iters 35000 --learning_rate 0.01 --miles 0
```

### Evaluation
Evaluate the trained model by running:
Evaluate the trained model on code execution by running:

```bash
python eval.py --dataset_dir data --model_name arithmetics_level1_696K
python code_execution.py --dataset_dir data --model_name arithmetics_level1_696K
```

Evaluate the trained model on token-level code completion by running:

```bash
python token-level_code_completion.py --dataset_dir data --model_name arithmetics_level1_696K
```

Evaluate the trained model on line-level code completion by running:

```bash
python line-level_code_completion.py --dataset_dir data --model_name arithmetics_level1_696K
```

### Demonstration
@@ -108,9 +131,14 @@ python evaluate.py --checkpoint_dir models/code-llama-finetuned-level1 --test_fi
#### Demonstration
To see a demonstration of the model's capabilities, open the `generalization/demonstration.ipynb` notebook and follow the instructions within.

# Contact

- **Kamel Yamani**: [[email protected]](mailto:[email protected])
- **Marwa Naïr**: [[email protected]](mailto:[email protected])


# License
This project is licensed under the MIT License.

# Acknowledgements
Special thanks to all contributors and the community for their support and contributions.
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
16 changes: 16 additions & 0 deletions datasets/dataset-1/.readme.md
@@ -0,0 +1,16 @@
# DATA DESCRIPTION:
- Around 1M code snippets generated with the full random code generator script

# DATA ACQUISITION:

- Dataset obtained by executing `python full_random_code_generator.py --output_file ./data/data.txt` (see the copy-ready block below)
- Python version 3.10.14
- Requires a Unix-based OS (Linux/macOS)
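For convenience, here is the same generation command as a copy-ready block (script name and output path exactly as stated above; no additional flags are assumed):

```bash
# Regenerate dataset-1 with the full random code generator, as documented above.
python full_random_code_generator.py --output_file ./data/data.txt
```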

# META-DATA:
- Code snippets discarded due to overflow errors and similar issues: 0.00%
- Code snippets discarded due to zero-division errors: 0.94%
- Random state stored in frcg-random-states

# DATA LOCATION:
- Not yet uploaded
Binary file not shown.