Evaluating a Model with a Local Dataset in an Offline Environment #271

Open
ankush13r opened this issue Sep 12, 2024 · 9 comments

Comments

@ankush13r

Hello,
Is there currently a way to evaluate a model using a dataset from a local path, instead of fetching it directly from HuggingFace? We're working in a cluster environment without internet access, and we need to evaluate the model locally.

If this feature isn't available yet, it would be a great enhancement to consider. Implementing a solution that accepts a local dataset would allow evaluations to be run offline. A potential approach could involve adding a new script argument, such as --datasets-path, so the dataset can be loaded directly from the specified location.

@Vipitis

Vipitis commented Sep 13, 2024

Theoretically, it should be possible to set HF_HUB_OFFLINE=1 and load from the local cache or from a local path (if it matches the dataset checkpoint dir), since the base class uses datasets.load_dataset() here:

self.dataset = load_dataset(path=self.DATASET_PATH, name=self.DATASET_NAME)

@ankush13r
Author

But I couldn't find any way to set the path for the dataset. As you can see here https://github.com/search?q=repo%3Abigcode-project%2Fbigcode-evaluation-harness%20DATASET_PATH&type=code the dataset path is a constant defined directly in the code.

@Vipitis

Vipitis commented Sep 16, 2024

Those are the checkpoint dirs from the Hugging Face Hub. So clone the dataset repo to that exact path locally, and the load_dataset function will try the local path first.
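
A minimal sketch of the layout check this implies, assuming the evaluation is launched from the directory that should contain the cloned dataset folder. The helper name `find_local_dataset` is hypothetical and not part of the harness; it only verifies that a directory matching the Hub-style path exists, and does not call load_dataset itself.

```python
import os
import tempfile


def find_local_dataset(dataset_path, root="."):
    """Return the local directory matching a Hub-style dataset path, or None.

    dataset_path is a value such as "mbpp" or "org/name" (what a task's
    DATASET_PATH constant holds); root is the directory the run starts from.
    Hypothetical helper for illustration only.
    """
    candidate = os.path.join(root, *dataset_path.split("/"))
    return candidate if os.path.isdir(candidate) else None


# Demonstration in a throwaway directory standing in for the harness checkout.
with tempfile.TemporaryDirectory() as workdir:
    os.makedirs(os.path.join(workdir, "mbpp"))  # pretend the dataset repo was cloned here
    found = find_local_dataset("mbpp", root=workdir)
    missing = find_local_dataset("some-org/other-dataset", root=workdir)
    found_name = os.path.basename(found)
```

Running a check like this before launching can show early whether the folder name matches the constant used by the task.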

@ankush13r
Author

ankush13r commented Sep 19, 2024

Hello, thanks for your response.
I have tried what you suggested, but it hasn't worked for me. Below is the example I used to run the evaluation.
I have also downloaded the dataset to /home/user/dataset.

export HF_DATASETS_CACHE=/home/user/dataset
export HF_HUB_OFFLINE=1

accelerate launch  main.py \
  --model  /path/to/the/model \
  --tasks mbpp \
  --max_length_generation 1500 \
  --temperature 1.2 \
  --do_sample True \
  --n_samples 100 \
  --batch_size 10 \
  --allow_code_execution \
  --save_generations

The error I'm getting is:

AttributeError: 'MBPP' object has no attribute 'dataset'
/gpfs/home/bsc/bigcode-evaluation-harness/bigcode_eval/base.py:30: UserWarning: Loading the dataset failed with Couldn't reach the Hugging Face Hub for dataset 'mbpp': Offline mode is enabled.. This task will use a locally downloaded dataset, not from the HF hub. This is expected behavior for the DS-1000 benchmark but not for other benchmarks!

@Vipitis

Vipitis commented Sep 19, 2024

That seems to be an issue with the actual test in this case. MBPP used to have a vanity dataset name on the Hub, so there is no org. Maybe it works if you have the /mbpp/ dataset folder on the same level as main.py.

The error is actually misleading, since nothing happens afterwards. It is just a warning meant for the specific DS-1000 benchmark, and it only means the dataset couldn't be loaded. It sort of suppresses the real error message, which is more helpful.
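
A simplified sketch of why the warning is followed by an AttributeError. `SketchTask` and `failing_load_dataset` are stand-ins written for illustration, not the harness's real base class or the real datasets.load_dataset; the only assumption taken from the thread is that the load call is wrapped so that a failure becomes a warning.

```python
import warnings


def failing_load_dataset(path, name=None):
    """Stand-in for datasets.load_dataset that always fails, as it might offline."""
    raise ConnectionError(f"Couldn't reach the Hugging Face Hub for dataset '{path}'")


class SketchTask:
    """Simplified stand-in for the harness base class, not the real implementation."""

    DATASET_PATH = "mbpp"
    DATASET_NAME = None

    def __init__(self, loader=failing_load_dataset):
        try:
            self.dataset = loader(path=self.DATASET_PATH, name=self.DATASET_NAME)
        except Exception as e:
            # The failure is downgraded to a warning, so self.dataset is never set.
            warnings.warn(f"Loading the dataset failed with {e}.")

    def get_dataset(self):
        # First real use of the dataset: this is where the AttributeError surfaces.
        return self.dataset


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    task = SketchTask()  # only warns, does not raise

try:
    task.get_dataset()
except AttributeError as err:
    print(err)  # the later, less helpful error
```

The original exception text is only visible inside the warning message, which is why reading the warning closely is more useful than the AttributeError that comes after it.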

@ankush13r
Author

Thanks, it worked. I think it will work for all kinds of tasks, as long as the datasets are on the local machine. I would like to know if there is a way to change the path for these datasets, since we need to store them in another folder.

@Vipitis

Vipitis commented Sep 23, 2024

Maybe symlinks? But I am not too familiar with how the load_dataset() function resolves these. Perhaps there is a way to use the HF Hub cache instead, as that can be pointed anywhere.
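
A minimal sketch of the symlink idea, assuming a POSIX filesystem. The helper `link_dataset` and the directory names are hypothetical; the sketch only shows that the link appears as a directory at the expected path. Whether load_dataset() treats it exactly like a real folder is not confirmed in this thread.

```python
import os
import tempfile


def link_dataset(actual_dir, expected_path):
    """Create a symlink at expected_path pointing to actual_dir, unless one exists.

    Hypothetical helper for illustration only.
    """
    parent = os.path.dirname(expected_path)
    if parent:
        os.makedirs(parent, exist_ok=True)
    if not os.path.lexists(expected_path):
        os.symlink(actual_dir, expected_path, target_is_directory=True)
    return expected_path


with tempfile.TemporaryDirectory() as tmp:
    storage = os.path.join(tmp, "storage", "mbpp")  # where the data really lives
    os.makedirs(storage)
    harness_dir = os.path.join(tmp, "harness")      # stands in for the folder with main.py
    os.makedirs(harness_dir)

    link = link_dataset(storage, os.path.join(harness_dir, "mbpp"))
    link_is_dir = os.path.isdir(link)               # isdir follows symlinks
    same_target = os.path.realpath(link) == os.path.realpath(storage)
```

This keeps the data in the other folder while leaving the name next to main.py unchanged, so no constant in the task code has to be edited.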

@ankush13r
Author

Perfect, I'll figure it out. Thanks again!

@ggcr

ggcr commented Jan 17, 2025

You can always download the dataset locally and store it on disk:

from datasets import load_dataset

# url: the dataset's Hub id or path; dataset_dir: the local folder to write to
ds = load_dataset(url)
ds.save_to_disk(dataset_dir)

And then use that directory during the run:

from datasets import load_from_disk

ds = load_from_disk(dataset_dir)
