Commits (31)
7124125
Update README.md
kamel-yamani Jul 1, 2024
0b0fde7
First Commit
kamel-yamani Jul 1, 2024
9ab01b1
Adding figure
kamel-yamani Jul 1, 2024
7ac0507
Delete TLMF.png
kamel-yamani Jul 1, 2024
6b9a579
Update README.md
kamel-yamani Jul 1, 2024
f41a7e9
Update README.md
kamel-yamani Jul 1, 2024
ab9fbe4
Update README.md
kamel-yamani Jul 1, 2024
02551c5
adding code completion eval
MarwaNair Jul 2, 2024
a61c49c
Update README.md
MarwaNair Jul 2, 2024
9c3e9e3
Update README.md
MarwaNair Jul 2, 2024
e4710bb
Update README.md
kamel-yamani Jul 20, 2024
4bf4378
Update README.md
kamel-yamani Jul 21, 2024
962cb56
Update README.md
kamel-yamani Jul 22, 2024
3cf8ebc
Update README.md
kamel-yamani Jul 22, 2024
1416bdc
TinyLM Starter Notebook Added
kamel-yamani Jul 22, 2024
37cb9e9
Update requirements.txt
kamel-yamani Jul 22, 2024
917551e
Fixing TinyPy generator
kamel-yamani Jul 30, 2024
85cd7be
Update README.md
kamel-yamani Jul 30, 2024
92b963b
Update tinypy_generator.py
MarwaNair Jul 30, 2024
08e521d
Update README.md
MarwaNair Jul 30, 2024
6cd85d9
added tasks folder
BenouaklilHodhaifa Sep 25, 2024
a51cf0e
Contributing the line execution count task files, by ibrahim-aboud
ibrahim-aboud Sep 25, 2024
c68fca0
Merge pull request #1 from ibrahim-aboud/main
BenouaklilHodhaifa Sep 25, 2024
1f5f057
Added operator finetuning script
Soapiane Oct 4, 2024
e79cca8
Cleaned files
Soapiane Oct 4, 2024
44c6363
Cleaned files
Soapiane Oct 4, 2024
1072fcd
Deleted some files
Soapiane Oct 4, 2024
e5355ec
Added gitignore
Soapiane Oct 4, 2024
e82353f
Updated gitignore
Soapiane Oct 4, 2024
b623593
Updated gitignore
Soapiane Oct 4, 2024
f6b9412
Files refactoring
Soapiane Oct 4, 2024
116 changes: 116 additions & 0 deletions .ipynb_checkpoints/README-checkpoint.md
@@ -0,0 +1,116 @@
# Tiny Language Models Framework

This repository contains the implementation and resources for the Tiny Language Models Framework project. In this project, we developed small-scale language models to facilitate detailed research into various aspects of large language models (LLMs), particularly in the domain of code.

## Project Structure

- `data/`
- `meta.pkl` : Metadata for the dataset.
- `prepare.py` : Script to prepare data for training.
- `sample_data.txt` : Sample data used for testing and demonstration.
- `test.bin` : Binary file containing test data.
- `test.txt` : Text file containing test data.
- `tinypy_generator.py` : Script to generate TinyPy data.
- `train.bin` : Binary file containing training data.
- `train.txt` : Text file containing training data.
- `val.bin` : Binary file containing validation data.
- `val.txt` : Text file containing validation data.

- `generalization/`
  - `data/` : Contains tokenized data for fine-tuning and evaluating the Code LLaMa model.
  - `models/` : Stores fine-tuned Code LLaMa models.
  - `results/` : Holds results from the evaluation.
  - `demonstration.ipynb` : Jupyter notebook demonstrating the fine-tuned Code LLaMa model's capabilities.
  - `evaluate.py` : Script to evaluate the fine-tuned Code LLaMa model.
  - `finetune.py` : Script for fine-tuning the Code LLaMa model.
  - `tokenizing.py` : Handles tokenization for the Code LLaMa model.

- `models/`
- `arithmetics_level1_696K.pth` : Pretrained model for arithmetic operations at level 1 with 696K parameters.

- `results/`
- Directory to store results of model evaluations and tests.

- `demonstration.ipynb` : Jupyter notebook demonstrating the usage of the models and scripts.

- `eval.py` : Script to evaluate the trained models.

- `model.py` : Contains the model architecture and related functions.

- `README.md` : This file.

- `train.py` : Script to train the models.

## Requirements

To install the required packages, you can use the following:

```bash
pip install -r requirements.txt
```

## Usage

### Data Generation
Generate the data using the TinyPy Generator by running:

```bash
cd data/
python tinypy_generator.py --num_programs 1000 --level 1.1 --filename sample_data.txt --deduplicate
```
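
For reference, each generated program is a short, self-contained snippet followed by its expected output encoded as comments. The illustrative example below is inferred from how the evaluation scripts parse numeric results after a `# output` marker; the exact shape varies by level:

```python
a = 3
b = 4
c = a + b
print(c)
# output
# 7
```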

### Data Preparation
Prepare the data by running:

```bash
python prepare.py
```
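
Under the hood, `prepare.py` produces the `train.bin`/`val.bin`/`test.bin` token arrays and the `meta.pkl` vocabulary file that the evaluation scripts expect (`uint16` token IDs plus `stoi`/`itos` mappings). Below is a minimal sketch of that flow, assuming a character-level vocabulary and an 80/10/10 split; the actual script may differ in these details:

```python
# Minimal sketch of the preparation flow; file formats match what the
# evaluation scripts read, but split ratios and details are assumptions.
import pickle
import numpy as np

with open('sample_data.txt', 'r') as f:
    data = f.read()

# Build a character-level vocabulary
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

# Assumed 80/10/10 split into train/val/test
n = len(data)
splits = {
    'train': data[: int(0.8 * n)],
    'val': data[int(0.8 * n): int(0.9 * n)],
    'test': data[int(0.9 * n):],
}
for name, text in splits.items():
    np.array(encode(text), dtype=np.uint16).tofile(f'{name}.bin')

with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'stoi': stoi, 'itos': itos}, f)
```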

Note that the generation command shown earlier is just an example to get you started. If you want to train your own model, you'll likely need to generate significantly more data.

### Training
Train the model using the following command:

```bash
cd ..
python train.py --batch_size 64 --max_iters 35000 --learning_rate 0.01 --miles 0.7 0.8 0.9 --eval_interval 10000 --eval_iters 500 --data_dir data
```
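
The `--miles` values are presumably fractions of `--max_iters` at which the learning rate is decayed, i.e. a MultiStepLR-style schedule with milestones at iterations 24500, 28000, and 31500 here. This interpretation, including the decay factor, is an assumption rather than a description of `train.py`:

```python
# Assumed interpretation of --miles: fractions of max_iters at which the
# learning rate decays, as in torch.optim.lr_scheduler.MultiStepLR.
# The decay factor (gamma) is an assumption; train.py may differ.
import torch

max_iters = 35000
milestones = [int(m * max_iters) for m in [0.7, 0.8, 0.9]]  # [24500, 28000, 31500]

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)

for step in range(max_iters):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```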

### Evaluation
Evaluate the trained model by running:

```bash
python eval.py --dataset_dir data --model_name arithmetics_level1_696K
```

### Demonstration
To see a demonstration of the model's capabilities, open the `demonstration.ipynb` notebook and follow the instructions within.

### Generalization
This section aims to test whether the results obtained from training tiny language models generalize to large language models. This is done by fine-tuning Code LLaMa.

#### Fine-tuning
Fine-tune Code LLaMa model using the following command:

```bash
cd generalization/
python finetune.py --train_dataset_path data/tokenized_train --val_dataset_path data/tokenized_val --output_dir models/code-llama-finetuned-demo
```
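
`tokenizing.py` produces the `data/tokenized_train` and `data/tokenized_val` datasets consumed by `finetune.py`. A minimal sketch of what such a step could look like with the Hugging Face stack follows; the model name, maximum length, and dataset layout are assumptions, not a description of the actual script:

```python
# Hypothetical tokenization step using the Hugging Face stack; the real
# tokenizing.py may use a different model name, max length, or layout.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

for split, out_dir in [("train", "data/tokenized_train"), ("val", "data/tokenized_val")]:
    ds = load_dataset("text", data_files={split: f"data/{split}.txt"})[split]
    ds = ds.map(tokenize, batched=True, remove_columns=["text"])
    ds.save_to_disk(out_dir)
```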

#### Evaluation
Evaluate the fine-tuned Code LLaMa model by running:

```bash
python evaluate.py --checkpoint_dir models/code-llama-finetuned-level1 --test_file data/test.txt --output_file results/result_llama.txt --csv_file results/results_llama.csv
```

#### Demonstration
To see a demonstration of the model's capabilities, open the `generalization/demonstration.ipynb` notebook and follow the instructions within.


## License
This project is licensed under the MIT License.

## Acknowledgements
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
164 changes: 164 additions & 0 deletions .ipynb_checkpoints/code_execution-checkpoint.py
@@ -0,0 +1,164 @@
import os
import pickle
import torch
import numpy as np
import pandas as pd
import re
from tqdm import tqdm
import argparse
from model import GPT

class ScriptEvaluator:
"""
Class to evaluate a GPT model on a dataset and save the results.
"""

def __init__(self, dataset_dir, model_name):
"""
Initialize ScriptEvaluator with dataset directory and model name.

Args:
- dataset_dir (str): Directory where the dataset is stored.
- model_name (str): Name of the pre-trained model (without .pth extension).
"""
self.dataset = dataset_dir
self.model_name = model_name
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(1337)
self.test_data, self.meta = self.load_dataset()
self.m = self.load_model()

    def load_dataset(self):
        """
        Load the test dataset and vocabulary metadata.
        """
        test_data = np.memmap(os.path.join(self.dataset, 'test.bin'), dtype=np.uint16, mode='r')
        meta_path = os.path.join(self.dataset, 'meta.pkl')
        # meta.pkl is required for encoding/decoding; fail early if missing
        if not os.path.exists(meta_path):
            raise FileNotFoundError(f"Metadata file '{meta_path}' not found.")
        with open(meta_path, 'rb') as f:
            meta = pickle.load(f)
        print(f"Found vocab_size = {meta['vocab_size']} (inside {meta_path})")

        return test_data, meta

    def load_model(self):
        """
        Load the pre-trained model named by `model_name`.
        """
        model_path = os.path.join('models', f"{self.model_name}.pth")
        if not os.path.exists(model_path):
            raise FileNotFoundError(f"Model file '{model_path}' not found.")

        model = GPT()
        print("Compiling the model...\n")
        try:
            model = torch.compile(model)  # requires PyTorch 2.0
        except Exception:
            pass  # fall back to eager mode on older PyTorch versions
        # map_location lets a checkpoint saved on GPU load on a CPU-only machine
        model.load_state_dict(torch.load(model_path, map_location=self.device))
        return model.to(self.device)

def encode(self, s):
"""
Encode string `s` into token IDs.
"""
return [self.stoi[c] for c in s]

def decode(self, l):
"""
Decode token IDs `l` into a string.
"""
return ''.join([self.itos[i] for i in l])

    def evaluate_example(self, example, max_new_tokens=30):
        """
        Evaluate a single example with the loaded model.
        """
        # Split the example at the output marker; programs without a `for`
        # loop produce shorter outputs, so fewer new tokens are needed
        split_example = example.split("# output")
        if "for" not in split_example[0]:
            max_new_tokens = 22

        # Encode the prompt (program text up to and including the marker)
        prompt_text = split_example[0] + "# output"
        encoded_example = torch.tensor(self.encode(prompt_text), dtype=torch.long).unsqueeze(0).to(self.device)
        result_example = split_example[-1]

        # Extract the ground-truth results from the example
        real_results = [float(match.group()) for match in re.finditer(r"(?<=# )-?\d+(\.\d+)?", result_example.split('\n\n')[0].replace("\n", ""))]

        # Generate a completion and extract the predicted results
        response = self.decode(self.m.generate(encoded_example, max_new_tokens=max_new_tokens)[0].tolist())
        result_response = response.split("# output")[-1]
        generated_results = [float(match.group()) for match in re.finditer(r"(?<=# )-?\d+(\.\d+)?", result_response.split('\n\n')[0].replace("\n", ""))]

        return prompt_text, real_results, generated_results

def write_results_to_file(self, output_file, prompt, real_results, generated_results):
"""
Write evaluation results to a CSV file.
"""
df = pd.DataFrame({
'Prompt': prompt,
'Real_Results': real_results,
'Generated_Results': generated_results
})
df.to_csv(output_file, index=False)

def main(self):
"""
Main evaluation function.
"""
# Extracting stoi and itos from meta
self.stoi = self.meta['stoi']
self.itos = self.meta['itos']

# Split examples and initialize lists for results
examples = self.decode(self.test_data).split("\n\n")
examples = [example for example in examples if example]

# Start evaluation process
print(f"Starting evaluation for model '{self.model_name}' on dataset '{self.dataset}'...")
prompt = []
real_results = []
generated_results = []

# Iterate through examples and evaluate each one
for example in tqdm(examples):
prompt_text, real_result, result = self.evaluate_example(example)
prompt.append(prompt_text)
real_results.append(real_result)
generated_results.append(result)

# Calculate and print accuracy
correct_count = sum(1 for real, generated in zip(real_results, generated_results) if real == generated)
accuracy = correct_count / len(generated_results)
print(f"Accuracy: {accuracy * 100:.2f}%")

        # Store accuracy in a file (create the results folder if needed)
        os.makedirs('results', exist_ok=True)
        accuracy_file = os.path.join('results', f"{self.model_name}_accuracy.txt")
        with open(accuracy_file, 'w') as f:
            f.write(f"Accuracy: {accuracy * 100:.2f}%\n")
        print(f"Accuracy saved to {accuracy_file}")

        # Store per-example results in a CSV file
        results_file = os.path.join('results', f"{self.model_name}_results.csv")
        self.write_results_to_file(results_file, prompt, real_results, generated_results)
        print(f"Results saved to {results_file}")

if __name__ == "__main__":
# Argument parsing
parser = argparse.ArgumentParser(description='Evaluate NanoGPT model on a dataset.')
parser.add_argument('--dataset_dir', type=str, default='data', help='Directory where the dataset is stored')
parser.add_argument('--model_name', type=str, required=True, help='Name of the pre-trained model (without .pth extension)')

# Parse the command-line arguments
args = parser.parse_args()

# Create ScriptEvaluator instance and run main function
evaluator = ScriptEvaluator(args.dataset_dir, args.model_name)
evaluator.main()
99 changes: 99 additions & 0 deletions .ipynb_checkpoints/line-level_code_completion-checkpoint.py
@@ -0,0 +1,99 @@
import os
import pickle
import argparse
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
from model import GPT

# Argument parsing
parser = argparse.ArgumentParser(description='Evaluate NanoGPT model on line-level code completion.')
parser.add_argument('--dataset_dir', type=str, default='data', help='Directory where the dataset is stored')
parser.add_argument('--model_name', type=str, required=True, help='Name of the pre-trained model (without .pth extension)')

# Parse the command-line arguments
args = parser.parse_args()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(1337)


# Constants for dataset and file paths
MODEL_FILE = f"models/{args.model_name}.pth"
ACCURACY_FILE = f"results/{args.model_name}_acc_line-level_code_completion.txt"
RESULTS_FILE = f"results/{args.model_name}_line-level_code_completion.csv"


data_dir = args.dataset_dir
test_data = np.memmap(os.path.join(data_dir, 'test.bin'), dtype=np.uint16, mode='r')


# Load the vocabulary metadata; stoi/itos are required for encoding/decoding
meta_path = os.path.join(data_dir, 'meta.pkl')
if not os.path.exists(meta_path):
    raise FileNotFoundError(f"Metadata file '{meta_path}' not found.")
with open(meta_path, 'rb') as f:
    meta = pickle.load(f)
print(f"found vocab_size = {meta['vocab_size']} (inside {meta_path})")

stoi = meta['stoi']
itos = meta['itos']
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

model = GPT()
print("Compiling model...")
try:
    model = torch.compile(model)  # requires PyTorch 2.0
except Exception:
    pass  # fall back to eager mode on older PyTorch versions
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load(MODEL_FILE, map_location=device))
m = model.to(device)

examples = decode(test_data).split("\n\n")
examples = [example for example in examples if example]

correct_predictions = 0
total_predictions = 0

results = []

for code_snippet in tqdm(examples):

lines = code_snippet.split('\n')
for i in range(1, len(lines)):

context_lines = lines[:i]
actual_next_line = lines[i]

context_tokens = torch.tensor(encode('\n'.join(context_lines) + '\n'), dtype=torch.long).unsqueeze(0).to(device)
actual_next_line_tokens = torch.tensor(encode(actual_next_line), dtype=torch.long).unsqueeze(0).to(device)

n = actual_next_line_tokens.shape[1] # Limit to length of actual next line
predicted_next_line_tokens = m.generate(context_tokens, max_new_tokens=n)
predicted_next_line_tokens = predicted_next_line_tokens[:, -n:]
is_correct = torch.equal(predicted_next_line_tokens, actual_next_line_tokens)

if is_correct:
correct_predictions += 1
        # Store decoded strings rather than raw tensors so the CSV is readable
        results.append({
            'context': decode(context_tokens[0].tolist()),
            'actual_next_line': actual_next_line,
            'predicted_next_line': decode(predicted_next_line_tokens[0].tolist()),
            'is_correct': is_correct
        })

total_predictions += 1

# Save per-prediction results and overall accuracy (create results/ if needed)
os.makedirs('results', exist_ok=True)

df = pd.DataFrame(results)
df.to_csv(RESULTS_FILE, index=False)

accuracy = (correct_predictions / total_predictions) * 100

# Store accuracy in a file
with open(ACCURACY_FILE, 'w') as f:
    f.write(f"Accuracy: {accuracy:.2f}%\n")

print(f"Accuracy: {accuracy:.2f}%")