SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning

SAFE-QAQ is an end-to-end framework for audio-text fraud detection that leverages reinforcement learning to enable slow-thinking decision-making. Below are instructions for setting up the environment, training the model, and running experiments.

News

[2026.01] SAFE-QAQ has been accepted to ACL 2026.

Overview

This repository contains the source code for SAFE-QAQ, which consists of three main stages:

Rule-Based Reinforcement Learning (Stage 1): Train the model with reinforcement learning guided by rule-based rewards.
Rejection Sampling Fine-Tuning (RSFT) and Length-Constrained Reinforcement Learning (LCRL) (Stage 2): Refine the model using rejection sampling and LCRL techniques.
Real-Time Fine-Tuning (Stage 3): Fine-tune the model for real-time inference.

The prompts for both real-time inference and training are defined in prompt.py.

Resources

SAFE-QAQ uses the TeleAntiFraud audio-text fraud detection dataset (see the TeleAntiFraud-28k entry under Citation). Dataset downloads, benchmark resources, and evaluation utilities are available in the TeleAntiFraud repository.

Environment Setup

To set up the environment, follow the installation instructions in the ms-swift documentation.

Training and Inference Pipeline

Stage 1: Rule-Based Reinforcement Learning

Train the initial rule-based RL model with:

```
bash run_swift_grpo_stage1.sh
```
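
For readers new to rule-based RL, the sketch below shows what a rule-based reward can look like in general: a fixed rule scores each completion, with no learned reward model. It is not the reward used by this repository. The `<think>`/`<answer>` tag format, the label strings, and the score values are all assumptions made for illustration.

```python
import re

# Hypothetical output format: reasoning inside <think>...</think>,
# followed by a final verdict inside <answer>...</answer>.
PATTERN = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def rule_based_reward(completion: str, gold_label: str) -> float:
    """Score a completion with fixed rules (illustrative values only)."""
    match = PATTERN.match(completion.strip())
    if match is None:
        return 0.0  # malformed output earns nothing
    reward = 0.5  # format rule satisfied
    predicted = match.group(2).strip().lower()
    if predicted == gold_label.strip().lower():
        reward += 1.0  # verdict matches the label
    return reward
```

Under these assumptions, a well-formed completion with the correct verdict scores higher than a well-formed but wrong one, and any output that ignores the format scores zero.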

Stage 2: Rejection Sampling Fine-Tuning (RSFT) and Length-Constrained Reinforcement Learning (LCRL)

Rejection Sampling: Generate samples with:
```
bash sample.sh
```
Then process the sampled data with:
```
bash process_samples.sh
```
Fine-Tuning with RSFT: Fine-tune the model on the processed data:
```
bash run_swift_sft_stage2_RSFT.sh
```
Length-Constrained Reinforcement Learning (LCRL): Further refine the model with LCRL:
```
bash run_swift_grpo_stage2_LCRL.sh
```
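
The two ideas in this stage can be sketched generically as follows. This is a minimal illustration, not the logic in `process_samples.sh` or the LCRL script: the sample record layout (`response`, `verdict`), the label strings, the token budget, and the linear penalty are all hypothetical choices.

```python
from typing import Dict, List


def select_for_rsft(
    samples: List[Dict[str, str]], gold_label: str
) -> List[Dict[str, str]]:
    """Rejection sampling: keep sampled responses whose verdict matches the label."""
    target = gold_label.strip().lower()
    return [s for s in samples if s["verdict"].strip().lower() == target]


def length_constrained_reward(
    base_reward: float,
    n_tokens: int,
    budget: int,
    penalty_per_token: float = 0.01,
) -> float:
    """Subtract a linear penalty for every token beyond the budget (assumed form)."""
    overflow = max(0, n_tokens - budget)
    return base_reward - penalty_per_token * overflow
```

In this sketch, the responses that survive `select_for_rsft` would form the fine-tuning set for RSFT, while `length_constrained_reward` leaves responses within the budget untouched and lowers the reward of longer ones.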

Stage 3: Real-Time Fine-Tuning

Run real-time fine-tuning with:

```
bash run_swift_grpo_stage3.sh
```
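
The full pipeline runs the scripts above in a fixed order. The helper below is a hypothetical convenience wrapper, not part of the repository. It assumes each script has already been configured (model and data paths) to pick up the output of the previous step, which this README does not state. It defaults to a dry run that only prints the commands.

```python
import subprocess
from pathlib import Path

# Script names come from this README; the order follows Stages 1 to 3.
PIPELINE = [
    "run_swift_grpo_stage1.sh",
    "sample.sh",
    "process_samples.sh",
    "run_swift_sft_stage2_RSFT.sh",
    "run_swift_grpo_stage2_LCRL.sh",
    "run_swift_grpo_stage3.sh",
]


def missing_scripts(repo_dir="."):
    """Return the pipeline scripts that are not present in repo_dir."""
    return [s for s in PIPELINE if not (Path(repo_dir) / s).is_file()]


def run_pipeline(repo_dir=".", dry_run=True):
    """Run (or, by default, just print) every stage in order."""
    missing = missing_scripts(repo_dir)
    if missing:
        raise FileNotFoundError(f"missing scripts: {missing}")
    commands = [["bash", s] for s in PIPELINE]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the pipeline at the first failing stage.
            subprocess.run(cmd, cwd=repo_dir, check=True)
    return commands
```

Calling `run_pipeline(dry_run=False)` from the repository root would execute the stages one after another and stop at the first script that exits with an error.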

Additional Notes

The prompt.py file contains the definitions of prompts used during training and real-time inference.
Ensure all dependencies are installed as per the ms-swift documentation before running the scripts.

Citation

@inproceedings{ma2025teleantifraud,
  title={TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection},
  author={Ma, Zhiming and Wang, Peidong and Huang, Minhua and Wang, Jinpeng and Wu, Kai and Lv, Xiangzhao and Pang, Yachun and Yang, Yin and Tang, Wenjie and Kang, Yuchen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={5853--5862},
  year={2025}
}

@article{wang2026safe,
  title={SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning},
  author={Wang, Peidong and Ma, Zhiming and Dai, Xin and Liu, Yongkang and Feng, Shi and Yang, Xiaocui and Hu, Wenxing and Wang, Zhihao and Pan, Mingjun and Yuan, Li and others},
  journal={arXiv preprint arXiv:2601.01392},
  year={2026}
}
