TL;DR: We introduce M2A and GeoFact-X to evaluate and improve multilingual reasoning in LLMs by aligning internal reasoning with the input language using language-consistency rewards.
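The core mechanism is a language-consistency reward that checks whether the model's reasoning trace is written in the same language as the input question. Below is a minimal, illustrative sketch; it assumes the off-the-shelf `langdetect` package, and the actual reward implemented in `train/math_m2a.py` may be shaped differently (e.g., sentence-level scoring).

```python
# Illustrative sketch only: the reward actually used in train/math_m2a.py
# may differ. `langdetect` is an assumed off-the-shelf detector.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic


def language_consistency_reward(question: str, reasoning_trace: str) -> float:
    """1.0 if the reasoning trace is in the same language as the question, else 0.0."""
    try:
        return float(detect(reasoning_trace) == detect(question))
    except Exception:
        # Detection can fail on very short or mixed-language text.
        return 0.0
```

During training, a reward of this form can be combined with the task reward (e.g., answer correctness) so the policy is optimized to reason in the input language without sacrificing accuracy.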
Repository structure:
- eval/: Evaluation tools for mathematical reasoning
- dataset/: GeoFact-X dataset
- factual_evaluation/: Factual reasoning evaluation scripts (GeoFact-X)
- data/: Synthetic data generation scripts
- scripts/: Shell scripts for training and evaluation
- train/: Python training scripts
- utils/: Utility functions and helpers
Use the scripts in scripts/
to launch training.
Hardware recommendations:
- Factual reasoning: ≥ 4 A100 GPUs
- Mathematical reasoning: ≥ 8 A100 GPUs
export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
export MASTER_PORT=29500 # Change if needed
export NNODES=$SLURM_NNODES
GPUS_PER_NODE=$(nvidia-smi -L | wc -l)
export WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
export NODE_RANK=$SLURM_NODEID
echo "Master Node: $MASTER_ADDR"
echo "Running on $WORLD_SIZE GPUs across $NNODES nodes"
uid="$(date +%Y%m%d_%H%M%S)"
model_size=7
base_model="Qwen/Qwen2.5-${model_size}B-Instruct"
lr=1e-5
epochs=5
weight_decay=1e-4
micro_batch_size=1
gradient_accumulation_steps=1
push_to_hub=false
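
# Launch training on all nodes via srun; Accelerate reads the DeepSpeed ZeRO-3 config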
srun accelerate launch \
--config_file deepspeed_zero3.yaml \
--num_processes $WORLD_SIZE \
--num_machines $NNODES \
--main_process_ip $MASTER_ADDR \
--machine_rank $NODE_RANK \
--main_process_port $MASTER_PORT \
--rdzv_backend c10d \
train/math_m2a.py \
--block_size=20000 \
--train_file_path="simplescaling/s1K-1.1_tokenized" \
--per_device_train_batch_size=${micro_batch_size} --per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} --num_train_epochs=${epochs} \
--model_name=${base_model} \
--bf16=True --eval_strategy="no" --logging_steps=1 --save_strategy="no" \
--lr_scheduler_type="cosine" --learning_rate=${lr} --weight_decay=${weight_decay} \
--adam_beta1=0.9 --adam_beta2=0.95 \
--output_dir="ckpts/M2A-${model_size}b-ep${epochs}-${uid}" --push_to_hub=${push_to_hub} --save_only_model=True \
--use-liger-kernel --gradient_checkpointing=True \
--use_grpo --grpo_loss_coeff=0.01 \
--sentence_level='mean+randomNegCos' --metric='randomNegCos' --mt5_max_len=15000
Update model paths in the scripts as needed:
sh scripts/eval_geofact-x.sh
We cloned lm-evaluation-harness at commit 4cec66e4e468d15789473d6d63c3a61a751fa524
and modified it. Setup:
cd eval/lm-evaluation-harness
pip install -e .[math,vllm]
To compute statistics (e.g., average thinking tokens) for an evaluation run, use:
python eval/compute_sample_stats.py path_to_samples_file.jsonl
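For a quick manual check, a rough equivalent can be sketched as follows. The `resps` field name and the one-sample-per-line JSONL layout are assumptions about the samples format; `eval/compute_sample_stats.py` is the authoritative implementation.

```python
# Rough sketch: average number of generated tokens in a --log_samples file.
# Assumes each JSONL line has a "resps" field with generated completions;
# adjust the field names to the actual samples schema.
import json
import sys

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lengths = []
with open(sys.argv[1]) as f:
    for line in f:
        sample = json.loads(line)
        for resp in sample.get("resps", []):
            text = resp[0] if isinstance(resp, list) else resp
            lengths.append(len(tokenizer(text)["input_ids"]))

print(f"avg generated tokens per response: {sum(lengths) / len(lengths):.1f}")
```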
All our evaluation result files are at: https://hf.co/datasets/simplescaling/results
To run REBASE, see the commands in eval/rebase/run.sh.
tasks='mgsm_native_cot_bn,mgsm_native_cot_de,mgsm_native_cot_es,mgsm_native_cot_fr,mgsm_native_cot_ru,mgsm_native_cot_sw,mgsm_native_cot_te,mgsm_native_cot_th,mgsm_native_cot_zh,mgsm_native_cot_en,mgsm_native_cot_ja'
lm_eval --model vllm \
  --model_args pretrained=${model_name},dtype=auto,tensor_parallel_size=${num_gpu},gpu_memory_utilization=0.90,max_model_len=20000 \
  --tasks $tasks --batch_size auto --apply_chat_template \
  --output_path ${output_dir} --log_samples --gen_kwargs "max_gen_toks=20000"
To detect the language of model outputs (e.g., to check that the reasoning stays in the input language), run:
python3 utils/language_detector.py
This codebase is based on https://github.com/simplescaling/s1.
If you use this code for your research, please cite our paper.
@article{hwang2025learn,
  title={Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning},
  author={Hwang, Jaedong and Tanmay, Kumar and Lee, Seok-Jin and Agrawal, Ayush and Palangi, Hamid and Ayush, Kumar and Fiete, Ila R and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2507.05418},
  year={2025}
}