Commits (31)
7124125
Update README.md
kamel-yamani Jul 1, 2024
0b0fde7
First Commit
kamel-yamani Jul 1, 2024
9ab01b1
Adding figure
kamel-yamani Jul 1, 2024
7ac0507
Delete TLMF.png
kamel-yamani Jul 1, 2024
6b9a579
Update README.md
kamel-yamani Jul 1, 2024
f41a7e9
Update README.md
kamel-yamani Jul 1, 2024
ab9fbe4
Update README.md
kamel-yamani Jul 1, 2024
02551c5
adding code completion eval
MarwaNair Jul 2, 2024
a61c49c
Update README.md
MarwaNair Jul 2, 2024
9c3e9e3
Update README.md
MarwaNair Jul 2, 2024
e4710bb
Update README.md
kamel-yamani Jul 20, 2024
4bf4378
Update README.md
kamel-yamani Jul 21, 2024
962cb56
Update README.md
kamel-yamani Jul 22, 2024
3cf8ebc
Update README.md
kamel-yamani Jul 22, 2024
1416bdc
TinyLM Starter Notebook Added
kamel-yamani Jul 22, 2024
37cb9e9
Update requirements.txt
kamel-yamani Jul 22, 2024
917551e
Fixing TinyPy generator
kamel-yamani Jul 30, 2024
85cd7be
Update README.md
kamel-yamani Jul 30, 2024
92b963b
Update tinypy_generator.py
MarwaNair Jul 30, 2024
08e521d
Update README.md
MarwaNair Jul 30, 2024
6cd85d9
added tasks folder
BenouaklilHodhaifa Sep 25, 2024
a51cf0e
Contributing the line execution count task files, by ibrahim-aboud
ibrahim-aboud Sep 25, 2024
c68fca0
Merge pull request #1 from ibrahim-aboud/main
BenouaklilHodhaifa Sep 25, 2024
1f5f057
Added operator finetuning script
Soapiane Oct 4, 2024
e79cca8
Cleaned files
Soapiane Oct 4, 2024
44c6363
Cleaned files
Soapiane Oct 4, 2024
1072fcd
Deleted some files
Soapiane Oct 4, 2024
e5355ec
Added gitignore
Soapiane Oct 4, 2024
e82353f
Updated gitignore
Soapiane Oct 4, 2024
b623593
Updated gitignore
Soapiane Oct 4, 2024
f6b9412
Files refactoring
Soapiane Oct 4, 2024
116 changes: 116 additions & 0 deletions .ipynb_checkpoints/README-checkpoint.md
@@ -0,0 +1,116 @@
# Tiny Language Models Framework

This repository contains the implementation and resources for the Tiny Language Models Framework project. In this project, we developed small-scale language models to facilitate detailed research into various aspects of large language models (LLMs), particularly in the domain of code.

## Project Structure

- `data/`
- `meta.pkl` : Metadata for the dataset.
- `prepare.py` : Script to prepare data for training.
- `sample_data.txt` : Sample data used for testing and demonstration.
- `test.bin` : Binary file containing test data.
- `test.txt` : Text file containing test data.
- `tinypy_generator.py` : Script to generate TinyPy data.
- `train.bin` : Binary file containing training data.
- `train.txt` : Text file containing training data.
- `val.bin` : Binary file containing validation data.
- `val.txt` : Text file containing validation data.

- `generalization/`
  - `data/` : Contains tokenized data for fine-tuning and evaluating the Code LLaMa model.
  - `models/` : Stores fine-tuned Code LLaMa models.
  - `results/` : Holds results from the evaluation.
  - `demonstration.ipynb` : Jupyter notebook demonstrating the fine-tuned Code LLaMa model's capabilities.
  - `evaluate.py` : Script to evaluate the fine-tuned Code LLaMa model.
  - `finetune.py` : Script for fine-tuning the Code LLaMa model.
  - `tokenizing.py` : Handles tokenization for the Code LLaMa model.

- `models/`
- `arithmetics_level1_696K.pth` : Pretrained model for arithmetic operations at level 1 with 696K parameters.

- `results/`
- Directory to store results of model evaluations and tests.

- `demonstration.ipynb` : Jupyter notebook demonstrating the usage of the models and scripts.

- `eval.py` : Script to evaluate the trained models.

- `model.py` : Contains the model architecture and related functions.

- `README.md` : This file.

- `train.py` : Script to train the models.

## Requirements

To install the required packages, you can use the following:

```bash
pip install -r requirements.txt
```

## Usage

### Data Generation
Generate the data using the TinyPy Generator by running:

```bash
cd data/
python tinypy_generator.py --num_programs 1000 --level 1.1 --filename sample_data.txt --deduplicate
```
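
For reference, each generated program is a short, self-contained snippet followed by its expected output encoded as comments. The illustrative example below is inferred from how the evaluation scripts parse numeric results after a `# output` marker; the exact shape varies by level:

```python
a = 3
b = 4
c = a + b
print(c)
# output
# 7
```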

### Data Preparation
Prepare the data by running:

```bash
python prepare.py
```
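
Under the hood, `prepare.py` produces the `train.bin`/`val.bin`/`test.bin` token arrays and the `meta.pkl` vocabulary file that the evaluation scripts expect (`uint16` token IDs plus `stoi`/`itos` mappings). Below is a minimal sketch of that flow, assuming a character-level vocabulary and an 80/10/10 split; the actual script may differ in these details:

```python
# Minimal sketch of the preparation flow; file formats match what the
# evaluation scripts read, but split ratios and details are assumptions.
import pickle
import numpy as np

with open('sample_data.txt', 'r') as f:
    data = f.read()

# Build a character-level vocabulary
chars = sorted(set(data))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

# Assumed 80/10/10 split into train/val/test
n = len(data)
splits = {
    'train': data[: int(0.8 * n)],
    'val': data[int(0.8 * n): int(0.9 * n)],
    'test': data[int(0.9 * n):],
}
for name, text in splits.items():
    np.array(encode(text), dtype=np.uint16).tofile(f'{name}.bin')

with open('meta.pkl', 'wb') as f:
    pickle.dump({'vocab_size': len(chars), 'stoi': stoi, 'itos': itos}, f)
```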

Note that the generation command shown earlier is just an example to get you started. If you want to train your own model, you'll likely need to generate significantly more data.

### Training
Train the model using the following command:

```bash
cd ..
python train.py --batch_size 64 --max_iters 35000 --learning_rate 0.01 --miles 0.7 0.8 0.9 --eval_interval 10000 --eval_iters 500 --data_dir data
```
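
The `--miles` values are presumably fractions of `--max_iters` at which the learning rate is decayed, i.e. a MultiStepLR-style schedule with milestones at iterations 24500, 28000, and 31500 here. This interpretation, including the decay factor, is an assumption rather than a description of `train.py`:

```python
# Assumed interpretation of --miles: fractions of max_iters at which the
# learning rate decays, as in torch.optim.lr_scheduler.MultiStepLR.
# The decay factor (gamma) is an assumption; train.py may differ.
import torch

max_iters = 35000
milestones = [int(m * max_iters) for m in [0.7, 0.8, 0.9]]  # [24500, 28000, 31500]

params = [torch.nn.Parameter(torch.zeros(1))]  # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=0.01)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=milestones, gamma=0.1)

for step in range(max_iters):
    # ... forward pass, loss.backward() ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```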

### Evaluation
Evaluate the trained model by running:

```bash
python eval.py --dataset_dir data --model_name arithmetics_level1_696K
```

### Demonstration
To see a demonstration of the model's capabilities, open the `demonstration.ipynb` notebook and follow the instructions within.

### Generalization
This section aims to test whether the results obtained from training tiny language models generalize to large language models. This is done by fine-tuning Code LLaMa.

#### Fine-tuning
Fine-tune Code LLaMa model using the following command:

```bash
cd generalization/
python finetune.py --train_dataset_path data/tokenized_train --val_dataset_path data/tokenized_val --output_dir models/code-llama-finetuned-demo
```
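
`tokenizing.py` produces the `data/tokenized_train` and `data/tokenized_val` datasets consumed by `finetune.py`. A minimal sketch of what such a step could look like with the Hugging Face stack follows; the model name, maximum length, and dataset layout are assumptions, not a description of the actual script:

```python
# Hypothetical tokenization step using the Hugging Face stack; the real
# tokenizing.py may use a different model name, max length, or layout.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("codellama/CodeLlama-7b-hf")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

for split, out_dir in [("train", "data/tokenized_train"), ("val", "data/tokenized_val")]:
    ds = load_dataset("text", data_files={split: f"data/{split}.txt"})[split]
    ds = ds.map(tokenize, batched=True, remove_columns=["text"])
    ds.save_to_disk(out_dir)
```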

#### Evaluation
Evaluate the fine-tuned Code LLaMa model by running:

```bash
python evaluate.py --checkpoint_dir models/code-llama-finetuned-level1 --test_file data/test.txt --output_file results/result_llama.txt --csv_file results/results_llama.csv
```

#### Demonstration
To see a demonstration of the model's capabilities, open the `generalization/demonstration.ipynb` notebook and follow the instructions within.


## License
This project is licensed under the MIT License.

## Acknowledgements
This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
164 changes: 164 additions & 0 deletions .ipynb_checkpoints/code_execution-checkpoint.py
@@ -0,0 +1,164 @@
import os
import pickle
import torch
import numpy as np
import pandas as pd
import re
from tqdm import tqdm
import argparse
from model import GPT

class ScriptEvaluator:
"""
Class to evaluate a GPT model on a dataset and save the results.
"""

def __init__(self, dataset_dir, model_name):
"""
Initialize ScriptEvaluator with dataset directory and model name.

Args:
- dataset_dir (str): Directory where the dataset is stored.
- model_name (str): Name of the pre-trained model (without .pth extension).
"""
self.dataset = dataset_dir
self.model_name = model_name
self.device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(1337)
self.test_data, self.meta = self.load_dataset()
self.m = self.load_model()

    def load_dataset(self):
        """
        Load the test dataset and vocabulary metadata.
        """
        test_data = np.memmap(os.path.join(self.dataset, 'test.bin'), dtype=np.uint16, mode='r')
        meta_path = os.path.join(self.dataset, 'meta.pkl')
        # meta.pkl is required for encoding/decoding; fail early if missing
        if not os.path.exists(meta_path):
            raise FileNotFoundError(f"Metadata file '{meta_path}' not found.")
        with open(meta_path, 'rb') as f:
            meta = pickle.load(f)
        print(f"Found vocab_size = {meta['vocab_size']} (inside {meta_path})")

        return test_data, meta

    def load_model(self):
        """
        Load the pre-trained model named by `model_name`.
        """
        model_path = os.path.join('models', f"{self.model_name}.pth")
        if not os.path.exists(model_path):
            raise FileNotFoundError(f"Model file '{model_path}' not found.")

        model = GPT()
        print("Compiling the model...\n")
        try:
            model = torch.compile(model)  # requires PyTorch 2.0
        except Exception:
            pass  # fall back to eager mode on older PyTorch versions
        # map_location lets a checkpoint saved on GPU load on a CPU-only machine
        model.load_state_dict(torch.load(model_path, map_location=self.device))
        return model.to(self.device)

def encode(self, s):
"""
Encode string `s` into token IDs.
"""
return [self.stoi[c] for c in s]

def decode(self, l):
"""
Decode token IDs `l` into a string.
"""
return ''.join([self.itos[i] for i in l])

    def evaluate_example(self, example, max_new_tokens=30):
        """
        Evaluate a single example with the loaded model.
        """
        # Split the example at the output marker; programs without a `for`
        # loop produce shorter outputs, so fewer new tokens are needed
        split_example = example.split("# output")
        if "for" not in split_example[0]:
            max_new_tokens = 22

        # Encode the prompt (program text up to and including the marker)
        prompt_text = split_example[0] + "# output"
        encoded_example = torch.tensor(self.encode(prompt_text), dtype=torch.long).unsqueeze(0).to(self.device)
        result_example = split_example[-1]

        # Extract the ground-truth results from the example
        real_results = [float(match.group()) for match in re.finditer(r"(?<=# )-?\d+(\.\d+)?", result_example.split('\n\n')[0].replace("\n", ""))]

        # Generate a completion and extract the predicted results
        response = self.decode(self.m.generate(encoded_example, max_new_tokens=max_new_tokens)[0].tolist())
        result_response = response.split("# output")[-1]
        generated_results = [float(match.group()) for match in re.finditer(r"(?<=# )-?\d+(\.\d+)?", result_response.split('\n\n')[0].replace("\n", ""))]

        return prompt_text, real_results, generated_results

def write_results_to_file(self, output_file, prompt, real_results, generated_results):
"""
Write evaluation results to a CSV file.
"""
df = pd.DataFrame({
'Prompt': prompt,
'Real_Results': real_results,
'Generated_Results': generated_results
})
df.to_csv(output_file, index=False)

def main(self):
"""
Main evaluation function.
"""
# Extracting stoi and itos from meta
self.stoi = self.meta['stoi']
self.itos = self.meta['itos']

# Split examples and initialize lists for results
examples = self.decode(self.test_data).split("\n\n")
examples = [example for example in examples if example]

# Start evaluation process
print(f"Starting evaluation for model '{self.model_name}' on dataset '{self.dataset}'...")
prompt = []
real_results = []
generated_results = []

# Iterate through examples and evaluate each one
for example in tqdm(examples):
prompt_text, real_result, result = self.evaluate_example(example)
prompt.append(prompt_text)
real_results.append(real_result)
generated_results.append(result)

# Calculate and print accuracy
correct_count = sum(1 for real, generated in zip(real_results, generated_results) if real == generated)
accuracy = correct_count / len(generated_results)
print(f"Accuracy: {accuracy * 100:.2f}%")

        # Store accuracy in a file (create the results folder if needed)
        os.makedirs('results', exist_ok=True)
        accuracy_file = os.path.join('results', f"{self.model_name}_accuracy.txt")
        with open(accuracy_file, 'w') as f:
            f.write(f"Accuracy: {accuracy * 100:.2f}%\n")
        print(f"Accuracy saved to {accuracy_file}")

        # Store per-example results in a CSV file
        results_file = os.path.join('results', f"{self.model_name}_results.csv")
        self.write_results_to_file(results_file, prompt, real_results, generated_results)
        print(f"Results saved to {results_file}")

if __name__ == "__main__":
# Argument parsing
parser = argparse.ArgumentParser(description='Evaluate NanoGPT model on a dataset.')
parser.add_argument('--dataset_dir', type=str, default='data', help='Directory where the dataset is stored')
parser.add_argument('--model_name', type=str, required=True, help='Name of the pre-trained model (without .pth extension)')

# Parse the command-line arguments
args = parser.parse_args()

# Create ScriptEvaluator instance and run main function
evaluator = ScriptEvaluator(args.dataset_dir, args.model_name)
evaluator.main()
99 changes: 99 additions & 0 deletions .ipynb_checkpoints/line-level_code_completion-checkpoint.py
@@ -0,0 +1,99 @@
import os
import pickle
import argparse
import torch
import numpy as np
import pandas as pd
from tqdm import tqdm
from model import GPT

# Argument parsing
parser = argparse.ArgumentParser(description='Evaluate NanoGPT model on line-level code completion.')
parser.add_argument('--dataset_dir', type=str, default='data', help='Directory where the dataset is stored')
parser.add_argument('--model_name', type=str, required=True, help='Name of the pre-trained model (without .pth extension)')

# Parse the command-line arguments
args = parser.parse_args()

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
torch.manual_seed(1337)


# Constants for dataset and file paths
MODEL_FILE = f"models/{args.model_name}.pth"
ACCURACY_FILE = f"results/{args.model_name}_acc_line-level_code_completion.txt"
RESULTS_FILE = f"results/{args.model_name}_line-level_code_completion.csv"


data_dir = args.dataset_dir
test_data = np.memmap(os.path.join(data_dir, 'test.bin'), dtype=np.uint16, mode='r')


# Load the vocabulary metadata; stoi/itos are required for encoding/decoding
meta_path = os.path.join(data_dir, 'meta.pkl')
if not os.path.exists(meta_path):
    raise FileNotFoundError(f"Metadata file '{meta_path}' not found.")
with open(meta_path, 'rb') as f:
    meta = pickle.load(f)
print(f"found vocab_size = {meta['vocab_size']} (inside {meta_path})")

stoi = meta['stoi']
itos = meta['itos']
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

model = GPT()
print("Compiling model...")
try:
    model = torch.compile(model)  # requires PyTorch 2.0
except Exception:
    pass  # fall back to eager mode on older PyTorch versions
# map_location lets a checkpoint saved on GPU load on a CPU-only machine
model.load_state_dict(torch.load(MODEL_FILE, map_location=device))
m = model.to(device)

examples = decode(test_data).split("\n\n")
examples = [example for example in examples if example]

correct_predictions = 0
total_predictions = 0

results = []

for code_snippet in tqdm(examples):

lines = code_snippet.split('\n')
for i in range(1, len(lines)):

context_lines = lines[:i]
actual_next_line = lines[i]

context_tokens = torch.tensor(encode('\n'.join(context_lines) + '\n'), dtype=torch.long).unsqueeze(0).to(device)
actual_next_line_tokens = torch.tensor(encode(actual_next_line), dtype=torch.long).unsqueeze(0).to(device)

n = actual_next_line_tokens.shape[1] # Limit to length of actual next line
predicted_next_line_tokens = m.generate(context_tokens, max_new_tokens=n)
predicted_next_line_tokens = predicted_next_line_tokens[:, -n:]
is_correct = torch.equal(predicted_next_line_tokens, actual_next_line_tokens)

if is_correct:
correct_predictions += 1
        # Store decoded strings rather than raw tensors so the CSV is readable
        results.append({
            'context': decode(context_tokens[0].tolist()),
            'actual_next_line': actual_next_line,
            'predicted_next_line': decode(predicted_next_line_tokens[0].tolist()),
            'is_correct': is_correct
        })

total_predictions += 1

# Save per-prediction results and overall accuracy (create results/ if needed)
os.makedirs('results', exist_ok=True)

df = pd.DataFrame(results)
df.to_csv(RESULTS_FILE, index=False)

accuracy = (correct_predictions / total_predictions) * 100

# Store accuracy in a file
with open(ACCURACY_FILE, 'w') as f:
    f.write(f"Accuracy: {accuracy:.2f}%\n")

print(f"Accuracy: {accuracy:.2f}%")