A fast and lightweight Python package for evaluating question-answering models and for prompting black-box and open-source large language models.
`pip install qa-metrics` is all you need!
- Version 0.2.42 Released! (06/20/2025)
  - RewardBert (ModernBERT base) now supports batch score prediction to speed up prediction for RL training.
- Version 0.2.35 Released! (06/18/2025)
  - RewardBert (ModernBERT base) is trained to evaluate both short-form and long-form generations.
  - RewardBert outputs a Likert-scale score between 1 and 5 or a normalized score between 0 and 1.
  - Turned off verbose NLTK download logs.
- Version 0.2.30 Released!
  - Enhanced PEDANTS with multi-pipeline support and improved edge-case handling
  - Introduced a trained tiny-bert model for QA evaluation (18 MB model size)
  - Added direct Huggingface model download support for TransformerMatcher
- Python >= 3.6
- openai >= 1.0
pip install qa-metrics
Our package offers six QA evaluation methods with varying strengths:
Method | Best For | Cost | Correlation with Human Judgment |
---|---|---|---|
RewardBert | General Text Generations | Free | Very High |
Normalized Exact Match | Short-form QA (NQ-OPEN, HotpotQA, etc.) | Free | Good |
PEDANTS | Both short & medium-form QA | Free | Very High |
Neural Evaluation | Both short & long-form QA | Free | High |
Open Source LLM Evaluation | All QA types | Free | High |
Black-box LLM Evaluation | All QA types | Paid | Highest |
Parameters
- `reference_answer` (str): The gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `tuple`: A tuple of the normalized and raw scores.
from qa_metrics.RewardBert import RewardBert
rb = RewardBert(device='cuda')
reference_answer = "The Frog Prince"
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
rb.compute_score(reference_answer, candidate_answer)
# (0.29113227128982544, 2.1645290851593018)
Parameters
- `reference_answers` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (list of str): A list of answers provided by a candidate that needs to be evaluated
- `batch_size` (int): Batch size to predict with (default: 1)

Returns
- `tuple`: A tuple of a list of normalized scores and a list of raw scores.
from qa_metrics.RewardBert import RewardBert
rb = RewardBert(device='cuda')
reference_answer = ["The Frog Prince"]
candidate_answer = ["The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""]
rb.compute_batch_scores(reference_answer, candidate_answer, batch_size=1)
# ([0.29113227128982544], [2.1645290851593018])
Parameters
- `reference_answer` (list of str): A list of gold (correct) answers to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `boolean`: True if there are any exact normalized matches between gold and candidate answers
from qa_metrics.em import em_match
reference_answer = ["The Frog Prince", "The Princess and the Frog"]
candidate_answer = "The movie \"The Princess and the Frog\" is loosely based off the Brother Grimm's \"Iron Henry\""
match_result = em_match(reference_answer, candidate_answer)
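Since the candidate here is a full sentence rather than the short answer string itself, normalized exact match should not fire for this pair; printing the result makes that explicit (the expected output is an assumption based on how normalization works, not a recorded run).
print("Exact Match: ", match_result)
# Expected: False -- the long candidate sentence does not normalize to either gold answer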
Parameters
- `reference_answer` (str): A gold (correct) answer to the question
- `candidate_answer` (str): The answer provided by a candidate that needs to be evaluated

Returns
- `dictionary`: Contains the F1 score, precision, and recall between a gold and candidate answer
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `threshold` (float): F1 score threshold for considering a match (default: 0.5)

Returns
- `boolean`: True if F1 score exceeds threshold for any gold answer
from qa_metrics.f1 import f1_match, f1_score_with_precision_recall
f1_stats = f1_score_with_precision_recall(reference_answer[0], candidate_answer)
match_result = f1_match(reference_answer, candidate_answer, threshold=0.5)
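To inspect the individual components, read them out of the returned dictionary. The key names below (`f1`, `precision`, `recall`) are assumptions based on the description above; print `f1_stats` to confirm them.
# Hypothetical key names -- inspect f1_stats if they differ
print("F1:", f1_stats["f1"], "Precision:", f1_stats["precision"], "Recall:", f1_stats["recall"])
print("F1 match above threshold:", match_result)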
Parameters
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `float`: The similarity score between two strings (0 to 1)
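A minimal sketch of the single-pair scorer, assuming the method described above is exposed as `PEDANT.get_score` (verify the name against the package if it differs):
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
# Score one gold answer against one candidate answer for this question
score = pedant.get_score("The Frog Prince", "The Princess and the Frog", question)
print(score)  # a float between 0 and 1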
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score
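Continuing with the `pedant` instance above, a sketch of the highest-scoring-pair lookup, assuming the method is named `get_highest_score`:
# Returns the gold/candidate pair with the highest matching score
best_pair = pedant.get_highest_score(
    ["The Frog Prince", "The Princess and the Frog"],
    "The movie is loosely based off the Brother Grimm's Iron Henry",
    question,
)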
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `boolean`: True if the candidate answer matches any gold answer
Parameters
- `reference_answer` (list of str): List of gold answers
- `question` (str): The question being evaluated

Returns
- `list`: The type of the question (what, who, when, how, why, which, where)
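A sketch of the question-type lookup, assuming it is exposed as `get_question_type` with the signature documented above (the name is an assumption; check the `PEDANT` class if it differs):
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
# Hypothetical call: classify the question given its gold answers
q_type = pedant.get_question_type(
    ["The Frog Prince", "The Princess and the Frog"],
    "Which movie is loosely based off the Brother Grimm's Iron Henry?",
)
print(q_type)  # e.g. ['which']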
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `list`: A list of revised rules applicable for judging answer correctness
from qa_metrics.pedant import PEDANT

pedant = PEDANT()
# The question the answers refer to (PEDANT's scoring is question-aware)
question = "Which movie is loosely based off the Brother Grimm's Iron Henry?"
scores = pedant.get_scores(reference_answer, candidate_answer, question)
match_result = pedant.evaluate(reference_answer, candidate_answer, question)
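The rules lookup documented above is not shown in the snippet; continuing with the same `pedant` instance, a sketch assuming the method is named `get_judgement_type` (the name is an assumption):
# Hypothetical call: fetch the revised judgment rules for this gold/candidate/question triple
rules = pedant.get_judgement_type(reference_answer, candidate_answer, question)
print(rules)  # a list of rules applicable for judging correctness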
Parameters
- `reference_answer` (str): A gold answer
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `float`: The similarity score between two strings (0 to 1)
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains the gold answer and candidate answer pair with the highest matching score
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `dictionary`: Contains matching scores for all gold answer and candidate answer pairs
Parameters
- `reference_answer` (list of str): List of gold answers
- `candidate_answer` (str): Candidate answer to evaluate
- `question` (str): The question being evaluated

Returns
- `boolean`: True if the transformer model considers the candidate answer equivalent to any gold answer
from qa_metrics.transformerMatcher import TransformerMatcher
# Supported models: zli12321/roberta-large-qa-evaluator, zli12321/answer_equivalence_bert, zli12321/answer_equivalence_distilbert, zli12321/answer_equivalence_roberta, zli12321/answer_equivalence_distilroberta
tm = TransformerMatcher("zli12321/answer_equivalence_tiny_bert")
match_result = tm.transformer_match(reference_answer, candidate_answer, question)
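The scoring helpers documented above (single-pair score, highest-scoring pair, and all pair scores) are not shown in the snippet; a sketch using the same `tm` instance, assuming they are exposed as `get_score`, `get_highest_score`, and `get_scores`:
# Single gold/candidate pair score (float between 0 and 1)
score = tm.get_score(reference_answer[0], candidate_answer, question)
# Gold/candidate pair with the highest matching score
best_pair = tm.get_highest_score(reference_answer, candidate_answer, question)
# Matching scores for all gold/candidate pairs
all_scores = tm.get_scores(reference_answer, candidate_answer, question)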
Parameters
- `prompt` (str): The input prompt text
- `model_engine` (str): OpenAI model to use (e.g., 'gpt-3.5-turbo')
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response
from qa_metrics.prompt_llm import CloseLLM
model = CloseLLM()
model.set_openai_api_key(YOUR_OPENAI_KEY)
result = model.prompt_gpt(prompt=prompt, model_engine='gpt-3.5-turbo')
Parameters
- `prompt` (str): The input prompt text
- `model_engine` (str): Claude model to use
- `anthropic_version` (str): API version
- `max_tokens_to_sample` (int): Maximum tokens in response
- `temperature` (float): Controls randomness (0-1)
model = CloseLLM()
model.set_anthropic_api_key(YOUR_ANTHROPIC_KEY)
result = model.prompt_claude(prompt=prompt, model_engine='claude-v1')
Parameters
- `message` (str): The input message text
- `model_engine` (str): Model to use
- `temperature` (float): Controls randomness (0-1)
- `max_tokens` (int): Maximum tokens in response
from qa_metrics.prompt_open_llm import OpenLLM
model = OpenLLM()
model.set_deepinfra_key(YOUR_DEEPINFRA_KEY)
result = model.prompt(message=prompt, model_engine='mistralai/Mixtral-8x7B-Instruct-v0.1')
Our fine-tuned models are available on Huggingface under the `zli12321` namespace (e.g., `zli12321/answer_equivalence_tiny_bert`).
@misc{li2024pedantscheapeffectiveinterpretable,
title={PEDANTS: Cheap but Effective and Interpretable Answer Equivalence},
author={Zongxia Li and Ishani Mondal and Yijun Liang and Huy Nghiem and Jordan Lee Boyd-Graber},
year={2024},
eprint={2402.11161},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2402.11161},
}
This project is licensed under the MIT License.
For questions or comments, please contact: [email protected]