A fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.
`pip install qa-metrics` is all you need!
- Version 0.2.42 Released! (06/20/2025)
  - RewardBert (ModernBERT base) now supports batch score prediction to speed up prediction for RL training.
- Version 0.2.35 Released! (06/18/2025)
  - RewardBert (ModernBERT base) is trained to evaluate both short-form and long-form generations.
  - RewardBert outputs a Likert-scale score between 1 and 5 or a normalized score between 0 and 1.
  - Turned off verbose NLTK download logs.
- Version 0.2.30 Released!
  - Enhanced PEDANTS with multi-pipeline support and improved edge-case handling
  - Introduced a trained tiny-bert model for QA evaluation (18 MB model size)
  - Added direct Huggingface model download support for TransformerMatcher
- Python >= 3.6
- openai >= 1.0
pip install qa-metrics
Our package offers six QA evaluation methods with varying strengths:
Method | Best For | Cost | Correlation with Human Judgment |
---|---|---|---|
RewardBert | General Text Generations | Free | Very High |
Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
PEDANTS | Both short & medium-form QA | Free | Very High |
Neural Evaluation | Both short & long-form QA | Free | High |
Open Source LLM Evaluation | All QA types | Free | High |
Black-box LLM Evaluation | All QA types | Paid | Highest |
Parameters
- `reference_answer` (str): The gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `tuple`: A tuple of the normalized and raw scores.
from qa_metrics.RewardBert import RewardBert
rb = RewardBert(device='cuda')
reference_answer = "The Frog Prince"
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
rb.compute_score(reference_answer, candidate_answer)
# (0.29113227128982544, 2.1645290851593018)
Parameters
- `reference_answers` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (list of str): A list of answers provided by a candidate that needs to be evaluated
- `batch_size` (int): Batch size to predict with (default: 1)

Returns
- `tuple`: A tuple of a list of normalized scores and a list of raw scores.
from qa_metrics.RewardBert import RewardBert
rb = RewardBert(device='cuda')
reference_answer = ["The Frog Prince"]
candidate_answer = ["The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""]
rb.compute_batch_scores(reference_answer, candidate_answer, batch_size=1)
# ([0.29113227128982544], [2.1645290851593018])
Parameters
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `boolean`: True if there are any exact normalized matches between gold and candidate answers
from qa_metrics.em import em_match
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
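Since the candidate here is a full sentence rather than the short answer string itself, normalized exact match should not fire for this pair; printing the result makes that explicit (the expected output is an assumption based on how normalization works, not a recorded run).
print("Exact Match: ", match_result)
# Expected: False -- the long candidate sentence does not normalize to either gold answer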
Parameters
- `reference_answer` (str): A gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `threshold` (float): F1 score threshold for considering a match (default: 0.5)

Returns
- `boolean`: True if F1 score exceeds threshold for any gold answer
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
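To inspect the individual components, read them out of the returned dictionary. The key names below (`f1`, `precision`, `recall`) are assumptions based on the description above; print `f1_stats` to confirm them.
# Hypothetical key names -- inspect f1_stats if they differ
print("F1:", f1_stats["f1"], "Precision:", f1_stats["precision"], "Recall:", f1_stats["recall"])
print("F1 match above threshold:", match_result)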
Parameters
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `float`: The similarity score between two strings (0 to 1)
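A minimal sketch of the single-pair scorer, assuming the method described above is exposed as `PEDANT.get_score` (verify the name against the package if it differs):
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
# Score one gold answer against one candidate answer for this question
score = pedant.get_score("The Frog Prince", "The Princess and the Frog", question)
print(score)  # a float between 0 and 1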
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score
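Continuing with the `pedant` instance above, a sketch of the highest-scoring-pair lookup, assuming the method is named `get_highest_score`:
# Returns the gold/candidate pair with the highest matching score
best_pair = pedant.get_highest_score(
    ["The Frog Prince", "The Princess and the Frog"],
    "The movie is loosely based off the Brother Grimm's Iron Henry",
    question,
)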
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `boolean`: True if the candidate answer matches any gold answer
Parameters
- `reference_answer` (list of str): List of gold answers
- `question` (str): The question being evaluated

Returns
- `list`: The type of the question (what, who, when, how, why, which, where)
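A sketch of the question-type lookup, assuming it is exposed as `get_question_type` with the signature documented above (the name is an assumption; check the `PEDANT` class if it differs):
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
# Hypothetical call: classify the question given its gold answers
q_type = pedant.get_question_type(
    ["The Frog Prince", "The Princess and the Frog"],
    "Which movie is loosely based off the Brother Grimm's Iron Henry?",
)
print(q_type)  # e.g. ['which']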
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `list`: A list of revised rules applicable for judging answer correctness
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
# The question the answers refer to (PEDANT's scoring is question-aware)
question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)
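The rules lookup documented above is not shown in the snippet; continuing with the same `pedant` instance, a sketch assuming the method is named `get_judgement_type` (the name is an assumption):
# Hypothetical call: fetch the revised judgment rules for this gold/candidate/question triple
rules = pedant.get_judgement_type(reference_answer, candidate_answer, question)
print(rules)  # a list of rules applicable for judging correctness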
Parameters
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `float`: The similarity score between two strings (0 to 1)
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer
from qa_metrics.transformerMatcher import TransformerMatcher
# Supported models: zli12321/roberta-large-qa-evaluator, zli12321/answer_equivalence_bert, zli12321/answer_equivalence_distilbert, zli12321/answer_equivalence_roberta, zli12321/answer_equivalence_distilroberta
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
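The scoring helpers documented above (single-pair score, highest-scoring pair, and all pair scores) are not shown in the snippet; a sketch using the same `tm` instance, assuming they are exposed as `get_score`, `get_highest_score`, and `get_scores`:
# Single gold/candidate pair score (float between 0 and 1)
score = tm.get_score(reference_answer[0], candidate_answer, question)
# Gold/candidate pair with the highest matching score
best_pair = tm.get_highest_score(reference_answer, candidate_answer, question)
# Matching scores for all gold/candidate pairs
all_scores = tm.get_scores(reference_answer, candidate_answer, question)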
Parameters
- `prompt` (str): The input prompt text
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response
from qa_metrics.prompt_llm import CloseLLM
model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
Parameters
- `prompt` (str): The input prompt text
- `model_engine` (str): Claude model to use
- `anthropic_version` (str): API version
- `max_tokens_to_sample` (int): Maximum tokens in response
- `temperature` (float): Controls randomness (0-1)
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
Parameters
- `message` (str): The input message text
- `model_engine` (str): Model to use
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response
from qa_metrics.prompt_open_llm import OpenLLM
model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
Our fine-tuned models are available on Huggingface under the `zli12321` namespace (e.g., `zli12321/answer_equivalence_tiny_bert`).
@misc{li2024pedantscheapeffectiveinterpretable,
title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
year={2024},
eprint={2402.11161},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2402.11161},
}
This project is licensed under the MIT License.
For questions or comments, please contact: [email protected]