Does every token in the CoT output contribute equally to deriving the answer? —— We say NO!
We introduce TokenSkip, a simple yet effective approach that enables LLMs to selectively skip redundant tokens during Chain-of-Thought generation and learn shortcuts between critical reasoning tokens, thereby allowing for controllable CoT compression with adjustable ratios.
TokenSkip constructs compressed CoT training data at various compression ratios by pruning unimportant tokens from the original CoT trajectories. It then performs standard supervised fine-tuning on the target LLM with this data, enabling the model to automatically trim redundant tokens during reasoning.
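As a rough illustration of the data-construction step, the sketch below builds a single ratio-conditioned training pair. The token pruner and the prompt layout here are simplified placeholders, not the released pipeline (which uses LLMLingua-2 and the exact format produced by `get_llamafactory_input.py`).

```python
# Minimal sketch of ratio-conditioned training-data construction.
# prune_cot() is a placeholder for the importance-based pruning done by
# LLMLingua-2, and the prompt layout is a hypothetical illustration of how a
# target compression ratio can be exposed to the model during SFT.

def prune_cot(cot: str, ratio: float) -> str:
    """Keep roughly `ratio` of the tokens (placeholder for LLMLingua-2)."""
    tokens = cot.split()
    keep = max(1, int(len(tokens) * ratio))
    return " ".join(tokens[:keep])

def build_example(question: str, cot: str, answer: str, ratio: float) -> dict:
    compressed = prune_cot(cot, ratio)
    return {
        "prompt": f"{question}\n[compression ratio: {ratio}]",  # hypothetical format
        "target": f"{compressed}\nThe answer is {answer}.",
    }

example = build_example(
    question="Natalia sold 48 clips in April and half as many in May. "
             "How many clips did she sell in total?",
    cot="In April she sold 48 clips. In May she sold 48 / 2 = 24 clips. "
        "48 + 24 = 72 clips in total.",
    answer="72",
    ratio=0.6,
)
print(example)
```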
This method is distinguished by its low training cost. For Qwen2.5-14B-Instruct, TokenSkip fine-tunes only 0.2% of the model's parameters using LoRA. The compressed CoT training data is no larger than the original training set, with 7,473 examples for GSM8K and 7,500 for MATH, and training the 14B model takes roughly 2.5 hours on two 3090 GPUs. These characteristics make TokenSkip an efficient, reproducible approach well suited to cost-effective LLM deployment.
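To get a feel for the LoRA parameter footprint, the following PEFT sketch counts trainable parameters. The rank and target modules are illustrative assumptions; the hyperparameters actually used for TokenSkip are in `configs/examples/train_lora`.

```python
# Rough check of the LoRA parameter footprint with Hugging Face PEFT.
# The rank and target modules below are illustrative assumptions, not the
# TokenSkip training configuration.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct", torch_dtype="auto"  # large download; needs ample RAM
)
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts
```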
We observe that as the model scale increases, there is less performance degradation at higher compression ratios, indicating that larger LLMs are better at identifying shortcuts between critical reasoning tokens, enabling more efficient CoT generation. Notably, Qwen2.5-14B-Instruct exhibits almost NO performance drop (less than 0.4%) with 40% token trimming. Even at a compression ratio of 0.5, the model maintains strong reasoning capabilities, with only 2% performance degradation.
2025.2.22: We have released the code for CoT compression and the instructions for SFT🔥!
2025.2.17: We have released the evaluation scripts and checkpoints for TokenSkip. Check it out!
- Release checkpoints for Qwen2.5-Instruct series
- Release evaluation code on GSM8K and MATH-500
- Release code for compressed CoT data construction
- Add instructions for SFT (LoRA) on LLaMA-Factory
- Investigations on TokenSkip with larger model scales
Download the corresponding model weights and modify the checkpoint path in `eval.sh`.
| LoRA Adapter | Link |
|---|---|
| TokenSkip-Qwen2.5-3B-Instruct-GSM8K | huggingface |
| TokenSkip-Qwen2.5-7B-Instruct-GSM8K | huggingface |
| TokenSkip-Qwen2.5-14B-Instruct-GSM8K | huggingface |
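The evaluation script attaches an adapter for you via `--adapter-path`, but if you want to load a released adapter manually, a minimal PEFT sketch (with placeholder paths) looks like this:

```python
# Loading a released TokenSkip LoRA adapter manually with PEFT.
# Both paths are placeholders; point them at your downloaded base model
# and adapter, as in eval.sh.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_path = "/your_model_path/Qwen2.5-7B-Instruct"
adapter_path = "/your_model_path/TokenSkip-Qwen2.5-7B-Instruct-GSM8K"

tokenizer = AutoTokenizer.from_pretrained(base_path)
base = AutoModelForCausalLM.from_pretrained(base_path, torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, adapter_path)  # attach the LoRA weights
model.eval()
```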
```bash
conda create -n tokenskip python=3.12
conda activate tokenskip
cd TokenSkip
pip install -r requirements.txt
```
1. Obtain the original CoT outputs of the training data using the target LLM

Modify the command lines in `eval.sh` (e.g., set `DATA_TYPE` to `train`) and run the evaluation.
```bash
python ./evaluation.py --output-dir "outputs/Qwen2.5-7B-Instruct/gsm8k/" \
    --model-path "/your_model_path/Qwen2.5-7B-Instruct" --tokenizer-path ${MODEL_PATH} \
    --model-size "7b" --model-type "qwen" --data-type "train" \
    --max_num_examples 100000000000000 --max_new_tokens 512 \
    --eval_batch_size 32 --temperature 0.0 --seed 42 --benchmark "gsm8k"
```
The original CoT outputs of the target LLM will be stored in `outputs/.../Original`.
2. Prune the original CoTs using LLMLingua

Download the model weights for LLMLingua-2 and modify the checkpoint path in `LLMLingua.py`. Run `LLMLingua.py` to obtain compressed CoTs at various compression ratios.
```bash
python ./LLMLingua.py
```
The compressed CoTs will be stored in `outputs/.../Compression`.
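If you want to try the compression interactively before running the batch script, a minimal LLMLingua-2 sketch looks like the following. The checkpoint name, rate, and forced tokens are illustrative and may differ from the settings in `LLMLingua.py`.

```python
# Interactive sketch of CoT pruning with LLMLingua-2 (the llmlingua package).
# Checkpoint name, rate, and force_tokens are illustrative; see LLMLingua.py
# for the settings used to build the released training data.
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)

cot = ("In April she sold 48 clips. In May she sold 48 / 2 = 24 clips. "
       "48 + 24 = 72 clips in total.")
result = compressor.compress_prompt(cot, rate=0.5, force_tokens=["\n", "="])
print(result["compressed_prompt"])  # CoT pruned to roughly 50% of its tokens
```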
3. Convert the training data to the LLaMA-Factory format

Run `get_llamafactory_input.py` to convert the training data into the format expected by LLaMA-Factory.
```bash
python ./get_llamafactory_input.py
```
The converted data will be stored in `outputs/mydataset.json`. For reference, we provide our processed training data in `datasets/gsm8k/llamafactory_inputs/`.
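For orientation, the converted file follows LLaMA-Factory's alpaca-style layout (`instruction` / `input` / `output`). The record below is a hand-written illustration rather than actual output of `get_llamafactory_input.py`.

```python
# Hand-written illustration of LLaMA-Factory's alpaca-style record layout
# (instruction / input / output). The real converted data is produced by
# get_llamafactory_input.py; see datasets/gsm8k/llamafactory_inputs/.
import json
import os

records = [
    {
        "instruction": "Natalia sold 48 clips in April and half as many in May. "
                       "How many clips did she sell in total?",
        "input": "",
        "output": "In May she sold 48 / 2 = 24 clips. 48 + 24 = 72. The answer is 72.",
    },
]

os.makedirs("outputs", exist_ok=True)
with open("outputs/mydataset.json", "w") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```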
TokenSkip follows the general LoRA SFT pipeline of LLaMA-Factory. Here's how to set it up:
- Git clone LLaMA-Factory and install the required environment.
- Place the training data under `LLaMA-Factory/data/` and register it in `data/dataset_info.json` (see the registration sketch below).
- To fine-tune the target LLM with LoRA, run the following command:
```bash
CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/train_lora/myllama3_lora_sft_compressed_gsm8k_llmlingua2_qwen.yaml
```
We provide our training configs in `configs/examples/train_lora` for your reference.
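As referenced above, registering the converted dataset only requires a small entry in `data/dataset_info.json`. The sketch below assumes the dataset key `mydataset` and an alpaca-style file placed at `LLaMA-Factory/data/mydataset.json`.

```python
# Minimal registration of the converted dataset in LLaMA-Factory.
# The dataset key "mydataset" is an assumption; alpaca-style files only
# need a "file_name" entry.
import json
from pathlib import Path

info_path = Path("LLaMA-Factory/data/dataset_info.json")
info = json.loads(info_path.read_text())
info["mydataset"] = {"file_name": "mydataset.json"}  # minimal alpaca-style entry
info_path.write_text(json.dumps(info, ensure_ascii=False, indent=2))
```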
Modify and run the command lines in `eval.sh`; the results will be stored in `outputs/`.
```bash
python ./evaluation.py --output-dir "outputs/Qwen2.5-7B-Instruct/gsm8k/" \
    --model-path "/your_model_path/Qwen2.5-7B-Instruct" --tokenizer-path ${MODEL_PATH} \
    --model-size "7b" --model-type "qwen" --data-type "test" \
    --max_num_examples 100000000000000 --max_new_tokens 512 \
    --eval_batch_size 32 --temperature 0.0 --seed 42 --benchmark "gsm8k" \
    --adapter-path "/your_model_path/TokenSkip-Qwen2.5-7B-Instruct-GSM8K" \
    --compression_ratio 0.5 --use_adapter
```
Frequently asked questions about re-implementing TokenSkip can be found in the Q&A.
We warmly welcome contributions and discussions related to TokenSkip! If you have suggestions for improvements or ideas you'd like to discuss, please don't hesitate to open an issue so we can collaborate on them in detail.
This codebase is built upon DeepSeek-Math and LLMLingua.
If you find the resources in this repository useful, please cite our paper:
```bibtex
@misc{xia2025tokenskip,
      title={TokenSkip: Controllable Chain-of-Thought Compression in LLMs},
      author={Heming Xia and Yongqi Li and Chak Tou Leong and Wenjie Wang and Wenjie Li},
      year={2025},
      eprint={2502.12067},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.12067},
}
```