Changes from all commits (31 commits)
- 7124125 Update README.md (kamel-yamani, Jul 1, 2024)
- 0b0fde7 First Commit (kamel-yamani, Jul 1, 2024)
- 9ab01b1 Adding figure (kamel-yamani, Jul 1, 2024)
- 7ac0507 Delete TLMF.png (kamel-yamani, Jul 1, 2024)
- 6b9a579 Update README.md (kamel-yamani, Jul 1, 2024)
- f41a7e9 Update README.md (kamel-yamani, Jul 1, 2024)
- ab9fbe4 Update README.md (kamel-yamani, Jul 1, 2024)
- 02551c5 adding code completion eval (MarwaNair, Jul 2, 2024)
- a61c49c Update README.md (MarwaNair, Jul 2, 2024)
- 9c3e9e3 Update README.md (MarwaNair, Jul 2, 2024)
- e4710bb Update README.md (kamel-yamani, Jul 20, 2024)
- 4bf4378 Update README.md (kamel-yamani, Jul 21, 2024)
- 962cb56 Update README.md (kamel-yamani, Jul 22, 2024)
- 3cf8ebc Update README.md (kamel-yamani, Jul 22, 2024)
- 1416bdc TinyLM Starter Notebook Added (kamel-yamani, Jul 22, 2024)
- 37cb9e9 Update requirements.txt (kamel-yamani, Jul 22, 2024)
- 917551e Fixing TinyPy generator (kamel-yamani, Jul 30, 2024)
- 85cd7be Update README.md (kamel-yamani, Jul 30, 2024)
- 92b963b Update tinypy_generator.py (MarwaNair, Jul 30, 2024)
- 08e521d Update README.md (MarwaNair, Jul 30, 2024)
- 6cd85d9 added tasks folder (BenouaklilHodhaifa, Sep 25, 2024)
- a51cf0e Contributing the line execution count task files, by ibrahim-aboud (ibrahim-aboud, Sep 25, 2024)
- c68fca0 Merge pull request #1 from ibrahim-aboud/main (BenouaklilHodhaifa, Sep 25, 2024)
- 9ae30e3 first reorganised version of the repo (Oct 6, 2024)
- c192297 Merge pull request #4 from Modern-Compilers-Lab/reorg (YounesBoukacem, Oct 6, 2024)
- 5a4a26a added dataset-3 and created dataprep-1 on it (YounesBoukacem, Oct 7, 2024)
- ed353c7 created the first version of venus, an inhouse utility for managing t… (YounesBoukacem, Oct 7, 2024)
- e333748 Merge pull request #6 from Modern-Compilers-Lab/venus (YounesBoukacem, Oct 7, 2024)
- f93720f Operator Prediction Data Preparation Code (Ellzo, Oct 9, 2024)
- 1f31ae0 README added for data preparation for operator prediction task (Ellzo, Oct 9, 2024)
- 862acb8 README added for data preparation for operator prediction task (Ellzo, Oct 9, 2024)
44 changes: 36 additions & 8 deletions README.md
@@ -1,6 +1,12 @@
# Tiny Language Models Framework

This repository contains the implementation and resources for the Tiny Language Models Framework project. In this project, we developed small-scale language models to facilitate detailed research into various aspects of large language models (LLMs), particularly in the domain of code.

<p align="center">
<img src="https://github.com/Modern-Compilers-Lab/Tiny-Language-Models-Framework/assets/86785811/946011ac-90ca-454f-baeb-d74b09a1721c" width="500" >
</p>

We've also prepared a [TinyLM Starter Notebook on Kaggle](https://www.kaggle.com/code/nairmarwa/tinylm-starter-notebook). This notebook is designed to help you get started quickly with our project. It guides you through training a tiny language model from scratch using our dataset and evaluating its performance on code execution tasks.

## Project Structure

@@ -33,7 +39,11 @@ This repository contains the implementation and resources for the Tiny Language

- `demonstration.ipynb` : Jupyter notebook demonstrating the usage of the models and scripts.

- `eval.py` : Script to evaluate the trained models.
- `code_execution.py` : Script to evaluate the trained models on the code execution task.

- `token-level_code_completion.py` : Script to evaluate the trained models on the token-level code completion task.

- `line-level_code_completion.py` : Script to evaluate the trained models on the line-level code completion task.

- `model.py` : Contains the model architecture and related functions.

@@ -42,6 +52,7 @@ This repository contains the implementation and resources for the Tiny Language
- `train.py` : Script to train the models.

## Requirements
We've used Python 3.11.7.

To install the required packages, you can use the following:
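The exact install command is collapsed in this diff view; a typical invocation, assuming the repository's `requirements.txt` sits at the repo root, would be:

```bash
# Assumed install step: requirements.txt exists in the repository (see the
# "Update requirements.txt" commit above); the exact documented command is not shown in this diff.
pip install -r requirements.txt
```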

@@ -59,15 +70,15 @@ cd data/
python tinypy_generator.py --num_programs 1000 --level 1.1 --filename sample_data.txt --deduplicate
```

This generation command is just an example to get you started. If you want to train your own model, you'll likely need to generate significantly more data.
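As an illustration, a larger run could reuse the same flags with a higher program count (the values below are hypothetical, not settings documented in the repository):

```bash
# Hypothetical larger-scale generation run: same flags as the documented example,
# with an illustrative --num_programs value and output filename.
python tinypy_generator.py --num_programs 1000000 --level 1.1 --filename train_data.txt --deduplicate
```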

### Data Preparation
Prepare the data by running:
Prepare (tokenize and split) the data by running:

```bash
python prepare.py
```

This generation command is just an example to get you started. If you want to train your own model, you'll likely need to generate significantly more data.

### Training
Train the model using the following command:

@@ -78,10 +89,22 @@ python train.py --batch_size 64 --max_iters 35000 --learning_rate 0.01 --miles 0
```

### Evaluation
Evaluate the trained model by running:
Evaluate the trained model on code execution by running:

```bash
python eval.py --dataset_dir data --model_name arithmetics_level1_696K
python code_execution.py --dataset_dir data --model_name arithmetics_level1_696K
```

Evaluate the trained model on token-level code completion by running:

```bash
python token-level_code_completion.py --dataset_dir data --model_name arithmetics_level1_696K
```

Evaluate the trained model on line-level code completion by running:

```bash
python line-level_code_completion.py --dataset_dir data --model_name arithmetics_level1_696K
```

### Demonstration
@@ -108,9 +131,14 @@ python evaluate.py --checkpoint_dir models/code-llama-finetuned-level1 --test_fi
#### Demonstration
To see a demonstration of the model's capabilities, open the `generalization/demonstration.ipynb` notebook and follow the instructions within.

# Contact

- **Kamel Yamani**: [[email protected]](mailto:[email protected])
- **Marwa Naïr**: [[email protected]](mailto:[email protected])


# License
This project is licensed under the MIT License.

# Acknowledgements
Special thanks to all contributors and the community for their support and contributions.
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
16 changes: 16 additions & 0 deletions datasets/dataset-1/.readme.md
@@ -0,0 +1,16 @@
# DATA DESCRIPTION:
- Around 1M code snippets generated with the full random code generator script

# DATA ACQUISITION:

- Dataset obtained by executing `python full_random_code_generator.py --output_file ./data/data.txt` (see the copy-ready block below)
- Python version 3.10.14
- Requires a Unix-based OS (Linux/macOS)
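For convenience, here is the same generation command as a copy-ready block (script name and output path exactly as stated above; no additional flags are assumed):

```bash
# Regenerate dataset-1 with the full random code generator, as documented above.
python full_random_code_generator.py --output_file ./data/data.txt
```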

# META-DATA:
- Code snippets discarded due to overflow errors and similar issues: 0.00%
- Code snippets discarded due to zero-division errors: 0.94%
- Random state stored in frcg-random-states

# DATA LOCATION:
- Not yet uploaded
Binary file not shown.