Description
Hi there,
Thank you for your impressive work on this project.
While exploring the dataset structure, I noticed some inconsistencies in the training data files for the GSM8K benchmark across different models. Most models align with the standard 7,472 samples (usually two files per directory), but I found the following discrepancies in specific directories:
- `gsm8k/Llama-2-70b-chat-hf`: This directory contains 4 files instead of the expected 2. It includes outputs and predictions for both "2085" and "7472": `run_2085_outputs.pkl` / `run_2085_predictions.npy` and `run_7472_outputs.pkl` / `run_7472_predictions.npy`.
- `gsm8k/Llama-2-7b-chat-hf/train`: The files here correspond to a count of 4,000 rather than the full set: `run_4000_outputs.pkl`, `run_4000_predictions.npy`.
- `gsm8k/gemma-7b/train`: The files here correspond to a count of 6,980: `run_6980_outputs.pkl`, `run_6980_predictions.npy`.
Could you please clarify the reasoning behind these different sample counts and file structures? Specifically:
- For Llama-2-70b, should I be using the `7472` files and ignoring the `2085` ones?
- For Llama-2-7b and Gemma-7b, do the files (`4000` and `6980`) represent the complete intended training set for this project, or are they partial checkpoints/subsets?
Any guidance on which files are the correct ones to use for reproduction would be greatly appreciated.
Thanks!