This repository contains the dataset, models, and evaluation code from our paper:
Which Questions Should I Ask? Utility Estimation of Questions with LLM-based Simulations
Authors: Dong-Ho Lee, Hyundong Cho, Jonathan May, Jay Pujara
Asking effective questions is essential for learning and comprehension. However, evaluating and generating useful questions is challenging due to the huge space of possible questions and the lack of direct measures of their impact.
Prior work relies on indirect proxies of question quality, which do not directly assess how much a question helps a learner.
We introduce QUEST (Question Utility Estimation with Simulated Tests), a simulation framework that quantifies the utility of a question, i.e., its contribution to learning outcomes, by modeling how it affects a simulated learner's understanding.
QUEST identifies high-utility questions and uses them to fine-tune question generation models via rejection sampling.
Across five textbook domains, we find that QUEST-trained models produce questions that lead to 20%+ higher exam scores compared to:
- prompt-based models grounded in instructional design literature, and
- models fine-tuned on indirect quality signals
- Inference → Generates questions and evaluates their overall utility
- Training → Supervised fine-tuning (SFT) and QUEST training
- Evaluation → Assesses the quality of each individual question generated in the inference step
Set your OpenAI API key:
export OPENAI_API_KEY=sk-...
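The scripts read the key from the environment. A quick illustrative check (not part of the repository) before running anything:

```python
# Illustrative check (not part of the repo) that the key is visible
# to Python before running the scripts below.
import os

assert os.environ.get("OPENAI_API_KEY"), "Set OPENAI_API_KEY before running."
```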
| Field | Description |
|---|---|
| `subject` | Subject name (e.g., `"chemistry"`) |
| `chapter` | Chapter ID (e.g., `"m50984"`) |
| `llm_parsed_results.sections` | Dictionary of section number → paragraph content |
| `llm_parsed_results.questions` | Dictionary of question info with the fields below: |
| └─ `question` | Question text |
| └─ `answer` | (Optional) Reference answer |
| └─ `relevant_sections` | List of related section numbers (LLM-inferred) |
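As a rough illustration of the layout above, a parsed chapter can be walked like this. The file path and the assumption of one JSON file per chapter are hypothetical; adjust to however the data is actually stored in this repository.

```python
# Sketch of walking one parsed chapter; the path and one-file-per-chapter
# layout are assumptions based on the field descriptions above.
import json

with open("data/chemistry/m50984.json") as f:  # hypothetical path
    chapter = json.load(f)

sections = chapter["llm_parsed_results"]["sections"]    # section number -> paragraph
questions = chapter["llm_parsed_results"]["questions"]  # question id -> question info

for sec_num, paragraph in sections.items():
    print(sec_num, paragraph[:80])

for q in questions.values():
    print(q["question"], q.get("answer"), q.get("relevant_sections"))
```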
To generate questions and evaluate their impact on learning outcomes, run:
python run_inference.py \
--subject chemistry \
--qg_model_name gpt-4o-mini \
--evaluate_model_name gpt-4o-mini \
--mode fewshot
You can specify different prompting strategies and models:
| Mode | Model | Description |
|---|---|---|
| `default` | `gpt-4o-mini` | Zero-shot baseline |
| `cot` | `gpt-4o-mini` | Chain-of-thought prompting |
| `fewshot` | `gpt-4o-mini` | Few-shot prompting with examples |
| `default` | `sft-trained-model` | SFT (Supervised Fine-Tuning) |
| `default` | `quest-trained-model` | QUEST (Utility-trained model) |
- `--subject`: required; one of `chemistry`, `biology`, `physics`, `economics`, etc.
- `--qg_model_name`: model used to generate questions
- `--evaluate_model_name`: model used to simulate learning
- `--mode`: prompting strategy (`default`, `cot`, `fewshot`)
- `--num_questions_per_section`: number of questions to generate per section (default: 1)
- `--use_document_for_simulate`: if set, uses the full document context during simulation
For subject `chemistry`, output files are saved as:

- `output/{model_name}/chemistry_{mode}_performance_results.jsonl`: simulated utility scores by chapter
- `output/{model_name}/chemistry_{mode}_qa_pairs.jsonl`: generated question-answer pairs
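To sanity-check a run, the per-chapter results can be aggregated with a few lines of Python. This is only a sketch: the `utility` field name is an assumption, so check the actual JSONL schema first.

```python
# Sketch: average the simulated utility scores across chapters.
# The "utility" field name is an assumption; check the actual schema.
import json

path = "output/gpt-4o-mini/chemistry_fewshot_performance_results.jsonl"
scores = []
with open(path) as f:
    for line in f:
        record = json.loads(line)
        if "utility" in record:
            scores.append(record["utility"])

if scores:
    print(f"Mean utility over {len(scores)} chapters: {sum(scores) / len(scores):.3f}")
```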
After training (see Training), use your fine-tuned model like this:
# Load fine-tuned model name
import json
metadata = json.load(open("metadata/train_metadata_all_sft.json"))
model_name = metadata["fine_tuned_model"]
print(model_name) # e.g., ft:gpt-4o-mini:your-team:custom-id
Then run:
python run_inference.py \
--subject chemistry \
--qg_model_name <fine_tuned_model_name> \
--evaluate_model_name gpt-4o-mini \
--mode default
We provide two training modes:
We fine-tune a base LLM (e.g., `gpt-4o-mini`) using human-authored exam questions from textbook sections.
The prompt asks the model to generate a question that helps a student understand a specific sentence in a passage.
To prepare and run SFT training:
python run_train_sft.py --subject all --model_name gpt-4o-mini-2024-07-18
- You can replace `all` with a specific subject, e.g., `--subject chemistry`; `all` is for the cross-subject setting.
- The script will create training data from all but the last 5 chapters of each subject.
- Metadata and fine-tuned model information are stored under `metadata/train_metadata_*_sft.json`.
Training prompt format:
article: <full previous context>
Student is currently reading the sentence: <anchor sentence>.
Generate a question that helps the student understand the sentence better.
Output in following JSON format:
{
"question": question
}
Each fine-tuning example is a chat message pair:

- `user`: the prompt above
- `assistant`: the expected `{"question": ...}` response
Training data is saved as:
`metadata/train_data_{subjects}_sft.jsonl`
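For reference, one way a single training example could be assembled in the OpenAI chat fine-tuning format is sketched below; the exact fields written by `run_train_sft.py` may differ.

```python
# Sketch of how one SFT example could be assembled in the OpenAI chat
# fine-tuning format. The actual fields emitted by run_train_sft.py may
# differ; this mirrors the prompt and chat-pair structure described above.
import json

def build_sft_example(article: str, anchor_sentence: str, question: str) -> dict:
    prompt = (
        f"article: {article}\n"
        f"Student is currently reading the sentence: {anchor_sentence}.\n"
        "Generate a question that helps the student understand the sentence better.\n"
        "Output in following JSON format:\n"
        '{\n    "question": question\n}'
    )
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": json.dumps({"question": question})},
        ]
    }

print(json.dumps(build_sft_example("<full previous context>",
                                   "<anchor sentence>",
                                   "Why does atomic radius shrink across a period?")))
```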
QUEST identifies high-utility questions using a simulated learner and fine-tunes the model only on these high-utility examples using rejection sampling.
Unlike SFT, which uses all questions, QUEST selectively fine-tunes using only questions that improve simulated comprehension scores.
python run_train_quest.py --subject all --model_name gpt-4o-mini-2024-07-18 --iterations 1 --threshold 0.1
- You can replace `all` with a specific subject (e.g., `--subject chemistry`).
- `--iterations`: how many fine-tuning + filtering loops to run (default: 1)
- `--threshold`: minimum utility score (between 0 and 1) for selecting questions
- Generate questions per section from textbook chapters using the current model.
- Evaluate the utility of each question using a simulated learner.
- Filter out low-utility questions (utility ≤ threshold).
- Fine-tune the model on the remaining high-utility examples.
- Repeat (if multiple iterations).
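A rough sketch of this loop is shown below. The helper callables are hypothetical placeholders, not the actual API of `run_train_quest.py`.

```python
# Rough sketch of the QUEST rejection-sampling loop (illustrative only;
# run_train_quest.py's actual structure and APIs may differ).
from typing import Callable, Iterable, List, Tuple

def quest_training(
    sections: Iterable[str],
    model: str,
    generate_questions: Callable[[str, str], List[str]],    # (model, section) -> questions
    simulate_utility: Callable[[str, str], float],           # (question, section) -> utility
    fine_tune: Callable[[str, List[Tuple[str, str]]], str],  # (model, examples) -> new model
    iterations: int = 1,
    threshold: float = 0.1,
) -> str:
    for _ in range(iterations):
        kept: List[Tuple[str, str]] = []
        for section in sections:
            for question in generate_questions(model, section):
                # Keep only questions whose simulated utility clears the threshold.
                if simulate_utility(question, section) > threshold:
                    kept.append((section, question))
        model = fine_tune(model, kept)  # train only on the high-utility examples
    return model
```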
Each run saves the following files:

- Training data per iteration: `metadata/train_data_{subject}_iter_{i}_thresh_{threshold}.jsonl`
- Trained model metadata: `metadata/train_metadata_{subject}_iter_{i}_thresh_{threshold}.json`
Each metadata file contains:
{
"iteration": 1,
"subjects": ["chemistry"],
"threshold": 0.1,
"base_model_used": "gpt-4o-mini-2024-07-18",
"fine_tuned_model": "ft:gpt-4o-mini:your-team:custom-id"
}
After generating questions with `run_inference.py`, you can evaluate the individual quality of each question using `run_eval.py`.
This helps analyze why certain models lead to better comprehension outcomes by inspecting each question's:
- Utility: Contribution to a simulated learner's performance
- Saliency: Relevance and centrality to the passage
- Expected Information Gain (EIG): Reduction in uncertainty after reading the answer
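As an illustration of the EIG metric, one common way to operationalize it is as the drop in entropy of the simulated learner's answer distribution after seeing the question-answer pair. The sketch below uses made-up probabilities; the exact formulation in `run_eval.py` may differ.

```python
# Illustration of EIG as an entropy drop; probabilities are made up and
# the exact computation in run_eval.py may differ.
import math

def entropy(probs):
    return -sum(p * math.log(p) for p in probs if p > 0)

# Simulated learner's belief over exam answer options, before vs. after
# seeing the question-answer pair.
prior = [0.4, 0.3, 0.2, 0.1]
posterior = [0.8, 0.1, 0.05, 0.05]

eig = entropy(prior) - entropy(posterior)  # reduction in uncertainty (nats)
print(f"EIG: {eig:.2f}")
```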
| Argument | Description |
|---|---|
| `--qa_file` | Path to the generated QA pairs file (from `run_inference.py`) |
| `--model_name` | Model used for utility simulation and scoring (default: `gpt-4o-mini`) |
| `--include_saliency` | If set, computes saliency scores via LLM |
| `--include_eig` | If set, computes Expected Information Gain (EIG) via LLM logprobs |
| `--output_dir` | Directory to save the evaluation results (default: `q_metrics/`) |
Each input file should be a `.jsonl` file with entries like:
{
"subject": "chemistry",
"chapter": "m50984",
"section": 3,
"question": "Why is the atomic radius smaller across a period?",
"answer": "Because the increased nuclear charge pulls electrons closer to the nucleus."
}
These are typically generated from:
output/{model_name}/{subject}_{mode}_qa_pairs.jsonl
Evaluated metrics are saved in a single consolidated file:
q_metrics/{subject}_{mode}_question_metrics.jsonl
Each entry includes:
{
"subject": "chemistry",
"chapter": "m50984",
"section": 3,
"question": "Why is the atomic radius smaller across a period?",
"answer": "Because the increased nuclear charge pulls electrons closer to the nucleus.",
"utility": 0.41,
"saliency": 4,
"eig": 0.72
}
If `--include_saliency` or `--include_eig` is not used, the corresponding fields will be omitted.
Evaluate utility only:
python run_eval.py \
--qa_file output/gpt-4o-mini/chemistry_fewshot_qa_pairs.jsonl
Evaluate utility + saliency:
python run_eval.py \
--qa_file output/gpt-4o-mini/chemistry_fewshot_qa_pairs.jsonl \
--include_saliency
Evaluate utility + saliency + EIG:
python run_eval.py \
--qa_file output/gpt-4o-mini/chemistry_fewshot_qa_pairs.jsonl \
--include_saliency \
--include_eig
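After any of the runs above, the consolidated metrics file can be summarized with a few lines of Python, e.g. (a sketch that simply tolerates the optional fields being absent):

```python
# Sketch: summarize a consolidated metrics file produced by run_eval.py,
# tolerating the saliency/eig fields being absent when those flags were not set.
import json
from statistics import mean

path = "q_metrics/chemistry_fewshot_question_metrics.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f]

print("mean utility:", round(mean(r["utility"] for r in records), 3))
saliency = [r["saliency"] for r in records if "saliency" in r]
if saliency:
    print("mean saliency:", round(mean(saliency), 2))
eig = [r["eig"] for r in records if "eig" in r]
if eig:
    print("mean EIG:", round(mean(eig), 2))
```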
If you find this work useful in your research, please cite:
@article{lee2025good,
title={What is a good question? utility estimation with llm-based simulations},
author={Lee, Dong-Ho and Cho, Hyundong and May, Jonathan and Pujara, Jay},
journal={arXiv preprint arXiv:2502.17383},
year={2025}
}