This repository contains evaluation tools for the Palm dataset, a culturally inclusive and linguistically diverse dataset for Arabic LLMs.
The Palm dataset provides a comprehensive benchmark for evaluating the performance of Large Language Models on Arabic language tasks across diverse dialects and cultural contexts. This repository focuses on the evaluation pipeline using the test set portion of the Palm dataset.
You can access the dataset on Hugging Face:
from datasets import load_dataset
dataset = load_dataset("UBC-NLP/palm")
- Python 3.8+
- PyTorch
- vLLM
- 4x A100 GPUs (as used in our experiments)
Install required packages:
pip install -r requirements.txt
Our evaluation pipeline consists of two main steps:
- Generate responses using your LLM
- Judge the quality of the responses using the LLM-as-Judge method
Before running either of the main scripts, you need to serve your LLM using vLLM:
./serve_llm.sh "llm_path" num_gpus port
Where:
- llm_path: Name or path of the LLM to serve (local path or HuggingFace model ID)
- num_gpus: Number of GPUs to use
- port: Port where vLLM will be served
Example:
./serve_llm.sh "models/Qwen2.5-7B-Instruct" 4 8000
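Before moving on, it can help to confirm that the server is up and to see which model ID it reports. The sketch below assumes that serve_llm.sh starts vLLM's OpenAI-compatible server, which exposes a /v1/models endpoint; the README does not state this, so treat it as an assumption. The helper names (models_url, parse_model_ids, served_model_ids) are hypothetical and not part of this repository.

```python
import json
import urllib.request


def models_url(port, host="localhost"):
    """Build the URL of the model-listing endpoint for a local server."""
    return f"http://{host}:{port}/v1/models"


def parse_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models JSON response."""
    return [entry["id"] for entry in json.loads(payload).get("data", [])]


def served_model_ids(port, host="localhost", timeout=5.0):
    """Return the model IDs reported by the server.

    Raises OSError (urllib's URLError is a subclass) if the server
    is not reachable, for example because it is still loading weights.
    """
    with urllib.request.urlopen(models_url(port, host), timeout=timeout) as resp:
        return parse_model_ids(resp.read().decode("utf-8"))
```

Calling served_model_ids(8000) once the server has finished loading should list the ID under which the model is served, which is presumably the value the scripts expect for vllm_model_id.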
Use the gen_responses.py script to generate responses for the instructions in the test set:
python gen_responses.py \
--model_path models/Qwen2.5-7B-Instruct \
--vllm_model_id models/Qwen2.5-7B-Instruct \
--data_path data/test.jsonl \
--max_length 4096 \
--vllm_port 8000
Parameters:
- model_path: Local path or HuggingFace model ID
- vllm_model_id: ID provided by vLLM when the model is served
- data_path: Path to the test data file
- max_length: Maximum sequence length for generation
- vllm_port: Server port of vLLM
The generated responses will be saved to responses/{model_id}.jsonl.
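To sanity-check the output before judging, a short script can count the records and list the fields they contain. This is a minimal sketch that assumes only that each non-empty line of the file is a JSON object; the actual field names are not documented here, so the code discovers them rather than hard-coding any. The functions read_jsonl and summarize are hypothetical helpers, not part of this repository.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Read a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def summarize(records):
    """Return the record count and the sorted union of top-level keys."""
    keys = sorted({key for record in records for key in record})
    return len(records), keys
```

For example, summarize(read_jsonl("responses/Qwen2.5-7B-Instruct.jsonl")) returns the number of generated responses and the field names found in the file.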
Next, serve your Judge LLM using the same serve_llm.sh script, then evaluate the generated responses:
python judge.py \
--output_judgements Qwen2.5-7B-Instruct \
--vllm_model_id models/Qwen2.5-72B-Instruct \
--preds_file responses/Qwen2.5-7B-Instruct.jsonl \
--max_length 2048 \
--vllm_port 8000
Parameters:
- output_judgements: Filename where judgments will be logged (without extension)
- vllm_model_id: ID assigned by vLLM for the judge model
- preds_file: File containing the generated responses from Step 1
- max_length: Maximum sequence length for judgment generation
- vllm_port: Server port of vLLM for the judge model
The judgments will be saved to judgements/{output_judgements}.jsonl.
# 1. Serve the response-generating LLM
./serve_llm.sh "models/Qwen2.5-7B-Instruct" 4 8000
# 2. Generate responses
python gen_responses.py \
--model_path models/Qwen2.5-7B-Instruct \
--vllm_model_id models/Qwen2.5-7B-Instruct \
--data_path data/test.jsonl \
--max_length 4096 \
--vllm_port 8000
# 3. Stop the first LLM server and serve the judge LLM
./serve_llm.sh "models/Qwen2.5-72B-Instruct" 4 8000
# 4. Judge the responses
python judge.py \
--output_judgements Qwen2.5-7B-Instruct \
--vllm_model_id models/Qwen2.5-72B-Instruct \
--preds_file responses/Qwen2.5-7B-Instruct.jsonl \
--max_length 2048 \
--vllm_port 8000
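When evaluating several models, the file names passed to the two scripts have to stay in sync. The sketch below derives both command lines from a model path and a judge path, using only the flags shown above. It assumes that {model_id} in responses/{model_id}.jsonl is the last component of the model path and that the served model ID equals the path passed to serve_llm.sh, as in the example; neither is confirmed by the scripts' documentation. build_commands is a hypothetical helper, not part of this repository.

```python
import shlex
from pathlib import PurePosixPath


def build_commands(model_path, judge_path, data_path="data/test.jsonl", port=8000):
    """Return (generation argv, judging argv) with consistent file names."""
    # Assumption: the response file is named after the last path component.
    model_id = PurePosixPath(model_path).name
    gen_cmd = [
        "python", "gen_responses.py",
        "--model_path", model_path,
        "--vllm_model_id", model_path,  # assumed equal to the served path
        "--data_path", data_path,
        "--max_length", "4096",
        "--vllm_port", str(port),
    ]
    judge_cmd = [
        "python", "judge.py",
        "--output_judgements", model_id,
        "--vllm_model_id", judge_path,
        "--preds_file", f"responses/{model_id}.jsonl",
        "--max_length", "2048",
        "--vllm_port", str(port),
    ]
    return gen_cmd, judge_cmd


gen_cmd, judge_cmd = build_commands(
    "models/Qwen2.5-7B-Instruct", "models/Qwen2.5-72B-Instruct"
)
# Print shell-ready command lines; run them manually or via subprocess.run.
print(shlex.join(gen_cmd))
print(shlex.join(judge_cmd))
```

The commands are only printed here because the server has to be switched from the response model to the judge model between the two steps.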
If you use this dataset or code in your research, please cite:
@misc{alwajih2025palmculturallyinclusivelinguistically,
title={Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs},
author={Fakhraddin Alwajih and Abdellah El Mekki and Samar Mohamed Magdy and Abdelrahim A. Elmadany and Omer Nacar and El Moatez Billah Nagoudi and Reem Abdel-Salam and Hanin Atwany and Youssef Nafea and others},
year={2025},
eprint={2503.00151},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.00151},
}
This project is licensed under the CC-BY-NC-ND-4.0 License.
For questions or feedback, please open an issue on this repository.