TL;DR: We introduce M2A and GeoFact-X to evaluate and improve multilingual reasoning in LLMs by aligning internal reasoning with the input language using language-consistency rewards.
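The core mechanism is a language-consistency reward that checks whether the model's reasoning trace is written in the same language as the input question. Below is a minimal, illustrative sketch; it assumes the off-the-shelf `langdetect` package, and the actual reward implemented in `train/math_m2a.py` may be shaped differently (e.g., sentence-level scoring).

```python
# Illustrative sketch only: the reward actually used in train/math_m2a.py
# may differ. `langdetect` is an assumed off-the-shelf detector.
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make langdetect deterministic


def language_consistency_reward(question: str, reasoning_trace: str) -> float:
    """1.0 if the reasoning trace is in the same language as the question, else 0.0."""
    try:
        return float(detect(reasoning_trace) == detect(question))
    except Exception:
        # Detection can fail on very short or mixed-language text.
        return 0.0
```

During training, a reward of this form can be combined with the task reward (e.g., answer correctness) so the policy is optimized to reason in the input language without sacrificing accuracy.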
Repository structure:
- eval/: Evaluation tools for mathematical reasoning
- dataset/: GeoFact-X dataset
- factual_evaluation/: Factual reasoning evaluation scripts (GeoFact-X)
- data/: Synthetic data generation scripts
- scripts/: Shell scripts for training and evaluation
- train/: Python training scripts
- utils/: Utility functions and helpers
Use the scripts in scripts/
to launch training.
Hardware recommendations:
- Factual reasoning: ≥ 4 A100 GPUs
- Mathematical reasoning: ≥ 8 A100 GPUs
export MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
export MASTER_PORT=29500 # Change if needed
export NNODES=$SLURM_NNODES
GPUS_PER_NODE=$(nvidia-smi -L | wc -l)
export WORLD_SIZE=$((GPUS_PER_NODE * NNODES))
export NODE_RANK=$SLURM_NODEID
echo "Master Node: $MASTER_ADDR"
echo "Running on $WORLD_SIZE GPUs across $NNODES nodes"
uid="$(date +%Y%m%d_%H%M%S)"
model_size=7
base_model="Qwen/Qwen2.5-${model_size}B-Instruct"
lr=1e-5
epochs=5
weight_decay=1e-4
micro_batch_size=1
gradient_accumulation_steps=1
push_to_hub=false
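
# Launch training on all nodes via srun; Accelerate reads the DeepSpeed ZeRO-3 config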
srun accelerate launch \
--config_file deepspeed_zero3.yaml \
--num_processes $WORLD_SIZE \
--num_machines $NNODES \
--main_process_ip $MASTER_ADDR \
--machine_rank $NODE_RANK \
--main_process_port $MASTER_PORT \
--rdzv_backend c10d \
train/math_m2a.py \
--block_size=20000 \
--train_file_path="simplescaling/s1K-1.1_tokenized" \
--per_device_train_batch_size=${micro_batch_size} --per_device_eval_batch_size=${micro_batch_size} \
--gradient_accumulation_steps=${gradient_accumulation_steps} --num_train_epochs=${epochs} \
--model_name=${base_model} \
--bf16=True --eval_strategy="no" --logging_steps=1 --save_strategy="no" \
--lr_scheduler_type="cosine" --learning_rate=${lr} --weight_decay=${weight_decay} \
--adam_beta1=0.9 --adam_beta2=0.95 \
--output_dir="ckpts/M2A-${model_size}b-ep${epochs}-${uid}" --push_to_hub=${push_to_hub} --save_only_model=True \
--use-liger-kernel --gradient_checkpointing=True \
--use_grpo --grpo_loss_coeff=0.01 \
--sentence_level='mean+randomNegCos' --metric='randomNegCos' --mt5_max_len=15000
Update model paths in the scripts as needed:
sh scripts/eval_geofact-x.sh
We cloned lm-evaluation-harness at commit 4cec66e4e468d15789473d6d63c3a61a751fa524
and modified it. Setup:
cd eval/lm-evaluation-harness
pip install -e .[math,vllm]
To compute statistics (e.g., average thinking tokens) for an evaluation run, use:
python eval/compute_sample_stats.py path_to_samples_file.jsonl
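For a quick manual check, a rough equivalent can be sketched as follows. The `resps` field name and the one-sample-per-line JSONL layout are assumptions about the samples format; `eval/compute_sample_stats.py` is the authoritative implementation.

```python
# Rough sketch: average number of generated tokens in a --log_samples file.
# Assumes each JSONL line has a "resps" field with generated completions;
# adjust the field names to the actual samples schema.
import json
import sys

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

lengths = []
with open(sys.argv[1]) as f:
    for line in f:
        sample = json.loads(line)
        for resp in sample.get("resps", []):
            text = resp[0] if isinstance(resp, list) else resp
            lengths.append(len(tokenizer(text)["input_ids"]))

print(f"avg generated tokens per response: {sum(lengths) / len(lengths):.1f}")
```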
All our evaluation result files are at: https://hf.co/datasets/simplescaling/results
To run REBASE, see the commands in eval/rebase/run.sh.
tasks='mgsm_native_cot_bn,mgsm_native_cot_de,mgsm_native_cot_es,mgsm_native_cot_fr,mgsm_native_cot_ru,mgsm_native_cot_sw,mgsm_native_cot_te,mgsm_native_cot_th,mgsm_native_cot_zh,mgsm_native_cot_en,mgsm_native_cot_ja'
lm_eval --model vllm \
  --model_args pretrained=${model_name},dtype=auto,tensor_parallel_size=${num_gpu},gpu_memory_utilization=0.90,max_model_len=20000 \
  --tasks $tasks --batch_size auto --apply_chat_template \
  --output_path ${output_dir} --log_samples --gen_kwargs "max_gen_toks=20000"
To detect the language of model outputs (e.g., to check that the reasoning stays in the input language), run:
python3 utils/language_detector.py
This codebase is based on https://github.com/simplescaling/s1.
If you use this code for your research, please cite our paper.
@article{hwang2025learn,
  title={Learn Globally, Speak Locally: Bridging the Gaps in Multilingual Reasoning},
  author={Hwang, Jaedong and Tanmay, Kumar and Lee, Seok-Jin and Agrawal, Ayush and Palangi, Hamid and Ayush, Kumar and Fiete, Ila R and Liang, Paul Pu},
  journal={arXiv preprint arXiv:2507.05418},
  year={2025}
}