
Inconsistency in GSM8K training data files and sample counts across different models #3

@Pangyh2001

Description

Hi there,

Thank you for your impressive work on this project.

While exploring the dataset structure, I noticed some inconsistencies in the training data files for the GSM8K benchmark across different models. Most models align with the standard 7,472 training samples (usually two files per model), but I found the following discrepancies in specific directories (counts checked with the snippet after this list):

  1. gsm8k/Llama-2-70b-chat-hf:
    This directory contains 4 files instead of the expected 2. It includes outputs and predictions for both "2085" and "7472":

    • run_2085_outputs.pkl / run_2085_predictions.npy
    • run_7472_outputs.pkl / run_7472_predictions.npy
  2. gsm8k/Llama-2-7b-chat-hf/train:
    The files here correspond to a count of 4,000, rather than the full set:

    • run_4000_outputs.pkl
    • run_4000_predictions.npy
  3. gsm8k/gemma-7b/train:
    The files here correspond to a count of 6,980:

    • run_6980_outputs.pkl
    • run_6980_predictions.npy
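
For reference, this is a minimal sketch of how I'm checking the counts locally. It assumes the .npy file holds one prediction per sample and the .pkl file holds a list-like of raw outputs; the path and run number below are just examples, not a claim about the repo's loading code:

```python
import pickle
import numpy as np

# Hypothetical example path and run number -- substitute the directory in question.
base = "gsm8k/Llama-2-70b-chat-hf"
run = 7472

preds = np.load(f"{base}/run_{run}_predictions.npy", allow_pickle=True)
with open(f"{base}/run_{run}_outputs.pkl", "rb") as f:
    outputs = pickle.load(f)

# If the assumption holds, both lengths should match the run number in the filename.
print(len(preds), len(outputs))  # e.g. 7472 7472 for a complete GSM8K train split
```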

Could you please clarify the reasoning behind these different sample counts and file structures? Specifically:

  • For Llama-2-70b, should I be using the 7472 files and ignoring the 2085 ones?
  • For Llama-2-7b and Gemma-7b, do the files (4000 and 6980) represent the complete intended training set for this project, or are they partial checkpoints/subsets?

Any guidance on which files are the correct ones to use for reproduction would be greatly appreciated.

Thanks!
