Palm: Evaluating LLMs on a Culturally Inclusive and Linguistically Diverse Arabic Dataset

This repository contains evaluation tools for the Palm dataset — a Culturally Inclusive and Linguistically Diverse dataset for Arabic LLMs.

🌟 Overview

The Palm dataset provides a comprehensive benchmark for evaluating the performance of Large Language Models on Arabic language tasks across diverse dialects and cultural contexts. This repository focuses on the evaluation pipeline using the test set portion of the Palm dataset.

🔍 Dataset Access

You can access the dataset on Hugging Face:

from datasets import load_dataset
dataset = load_dataset("UBC-NLP/palm")

🛠️ Requirements

Python 3.8+
PyTorch
vLLM
4x A100 GPUs (as used in our experiments)

Install required packages:

pip install -r requirements.txt

📋 Evaluation Pipeline

Our evaluation pipeline consists of two main steps:

Generate responses using your LLM
Judge the quality of responses using LLM-as-Judge method

Step 0: Serve the LLM

Before running either of the main scripts, you need to serve your LLM using vLLM:

./serve_llm.sh "llm_path" num_gpus port

Where:

llm_path: Name or path of the LLM to serve (local path or HuggingFace model ID)
num_gpus: Number of GPUs to use
port: Port where vLLM will be served

Example:

./serve_llm.sh "models/Qwen2.5-7B-Instruct" 4 8000

Step 1: Generate Responses

Use the gen_responses.py script to generate responses for the instructions in the test set:

python gen_responses.py \
  --model_path models/Qwen2.5-7B-Instruct \
  --vllm_model_id models/Qwen2.5-7B-Instruct \
  --data_path data/test.jsonl \
  --max_length 4096 \
  --vllm_port 8000

Parameters:

model_path: Local path or HuggingFace model ID
vllm_model_id: ID provided by vLLM when the model is served
data_path: Path to the test data file
max_length: Maximum sequence length for generation
vllm_port: Server port of vLLM

The generated responses will be saved to responses/{model_id}.jsonl.

Step 2: Judge Responses

Next, serve your Judge LLM using the same serve_llm.sh script, then evaluate the generated responses:

python judge.py \
  --output_judgements Qwen2.5-7B-Instruct \
  --vllm_model_id models/Qwen2.5-72B-Instruct \
  --preds_file responses/Qwen2.5-7B-Instruct.jsonl \
  --max_length 2048 \
  --vllm_port 8000

Parameters:

output_judgements: Filename where judgments will be logged (without extension)
vllm_model_id: ID assigned by vLLM for the judge model
preds_file: File containing the generated responses from Step 1
max_length: Maximum sequence length for judgment generation
vllm_port: Server port of vLLM for the judge model

The judgments will be saved to judgements/{output_judgements}.jsonl.

📊 Example Workflow

# 1. Serve the response-generating LLM
./serve_llm.sh "models/Qwen2.5-7B-Instruct" 4 8000

# 2. Generate responses
python gen_responses.py \
  --model_path models/Qwen2.5-7B-Instruct \
  --vllm_model_id models/Qwen2.5-7B-Instruct \
  --data_path data/test.jsonl \
  --max_length 4096 \
  --vllm_port 8000

# 3. Stop the first LLM server and serve the judge LLM
./serve_llm.sh "models/Qwen2.5-72B-Instruct" 4 8000

# 4. Judge the responses
python judge.py \
  --output_judgements Qwen2.5-7B-Instruct \
  --vllm_model_id models/Qwen2.5-72B-Instruct \
  --preds_file responses/Qwen2.5-7B-Instruct.jsonl \
  --max_length 2048 \
  --vllm_port 8000

📝 Citation

If you use this dataset or code in your research, please cite:

@misc{alwajih2025palmculturallyinclusivelinguistically,
      title={Palm: A Culturally Inclusive and Linguistically Diverse Dataset for Arabic LLMs}, 
      author={Fakhraddin Alwajih and Abdellah El Mekki and Samar Mohamed Magdy and Abdelrahim A. Elmadany and Omer Nacar and El Moatez Billah Nagoudi and Reem Abdel-Salam and Hanin Atwany and Youssef Nafea and others},
      year={2025},
      eprint={2503.00151},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00151}, 
}

📄 License

This project is licensed under the CC-BY-NC-ND-4.0 License.

🤝 Contact

For questions or feedback, please open an issue on this repository.

Name		Name	Last commit message	Last commit date
Latest commit History 11 Commits
README.md		README.md
examples.md		examples.md
gen_responses.py		gen_responses.py
guidelines.md		guidelines.md
judge.py		judge.py
palm_main.png		palm_main.png
requirements.txt		requirements.txt
serve_llm.sh		serve_llm.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Palm: Evaluating LLMs on a Culturally Inclusive and Linguistically Diverse Arabic Dataset

🌟 Overview

🔍 Dataset Access

🛠️ Requirements

📋 Evaluation Pipeline

Step 0: Serve the LLM

Step 1: Generate Responses

Step 2: Judge Responses

📊 Example Workflow

📝 Citation

📄 License

🤝 Contact

About

Releases

Packages

Languages

UBC-NLP/palm

Folders and files

Latest commit

History

Repository files navigation

Palm: Evaluating LLMs on a Culturally Inclusive and Linguistically Diverse Arabic Dataset

🌟 Overview

🔍 Dataset Access

🛠️ Requirements

📋 Evaluation Pipeline

Step 0: Serve the LLM

Step 1: Generate Responses

Step 2: Judge Responses

📊 Example Workflow

📝 Citation

📄 License

🤝 Contact

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages