SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning

SAFE-QAQ is an end-to-end framework for audio-text fraud detection that leverages reinforcement learning to enable slow-thinking decision-making. Below are instructions for setting up the environment, training the model, and running experiments.

arXiv | Hugging Face Dataset | ModelScope Dataset | TeleAntiFraud Repository


News

  • [2026.01] SAFE-QAQ has been accepted by ACL 2026.

Overview

This repository contains the source code for SAFE-QAQ, which consists of three main stages:

  1. Rule-Based Reinforcement Learning (Stage 1): Train the initial model with rule-based RL.
  2. Rejection Sampling Fine-Tuning (RSFT) and Length-Constrained Reinforcement Learning (LCRL) (Stage 2): Refine the model using rejection sampling and LCRL techniques.
  3. Real-Time Fine-Tuning (Stage 3): Fine-tune the model for real-time inference.

The prompts for both real-time inference and training are defined in prompt.py.

Resources

SAFE-QAQ uses the TeleAntiFraud audio-text fraud detection dataset. Dataset downloads, benchmark resources, and evaluation utilities are available in the TeleAntiFraud repository linked above.


Environment Setup

To set up the environment, follow the installation instructions in the ms-swift documentation.
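
As a quick sanity check before launching any stage, the sketch below tests whether ms-swift is visible to pip. It is an illustration only: the `has_pkg` helper is hypothetical, the PyPI install command in the comment is an assumption about the install route, and `python` may need to be `python3` on some systems. The ms-swift documentation remains the authority on supported versions and extra dependencies.

```shell
# Assumed install route (PyPI); defer to the ms-swift documentation for the
# supported versions and any extra dependencies.
# python -m pip install -U ms-swift

# has_pkg NAME: succeed if pip reports NAME as installed (hypothetical helper).
has_pkg() {
  python -m pip show "$1" >/dev/null 2>&1
}

if has_pkg ms-swift; then
  echo "ms-swift is installed"
else
  echo "ms-swift not found; install it before running the stage scripts"
fi
```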


Training and Inference Pipeline

Stage 1: Rule-Based Reinforcement Learning

Train the initial rule-based RL model with:

bash run_swift_grpo_stage1.sh

Stage 2: Rejection Sampling Fine-Tuning (RSFT) and Length-Constrained Reinforcement Learning (LCRL)

  1. Rejection Sampling: Generate samples with:

    bash sample.sh

    Then process the sampled data with:

    bash process_samples.sh
  2. Fine-Tuning with RSFT: Fine-tune the model on the processed data:

    bash run_swift_sft_stage2_RSFT.sh
  3. Length-Constrained Reinforcement Learning (LCRL): Further refine the model with LCRL:

    bash run_swift_grpo_stage2_LCRL.sh
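
The Stage 2 steps above run in a fixed order, so one way to chain them is a small wrapper that stops at the first failure. This is a minimal sketch, not part of the repository: `run_stage2` is a hypothetical helper, and it assumes the four scripts sit in the current directory and signal failure through a non-zero exit status.

```shell
# run_stage2 [RUNNER]: run the Stage 2 scripts in the documented order
# with RUNNER (default: bash), stopping at the first failure.
run_stage2() {
  runner="${1:-bash}"
  for script in sample.sh process_samples.sh \
                run_swift_sft_stage2_RSFT.sh run_swift_grpo_stage2_LCRL.sh; do
    echo "==> ${script}" >&2
    "$runner" "$script" || { echo "failed: ${script}" >&2; return 1; }
  done
}
```

Calling `run_stage2` with no argument runs the scripts with `bash`; `run_stage2 echo` only prints the script names in order, which is handy as a dry run.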

Stage 3: Real-Time Fine-Tuning

Run real-time fine-tuning with:

bash run_swift_grpo_stage3.sh
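
Since each stage can run for a long time, it may help to confirm up front that every script named in the three stages is present. The sketch below is a hypothetical pre-flight check (the `missing_scripts` helper and `PIPELINE_SCRIPTS` variable are not part of the repository) and assumes a POSIX `sh` or `bash` shell and that the scripts live in one directory.

```shell
# Script names as listed in Stages 1 to 3 above.
PIPELINE_SCRIPTS="run_swift_grpo_stage1.sh sample.sh process_samples.sh run_swift_sft_stage2_RSFT.sh run_swift_grpo_stage2_LCRL.sh run_swift_grpo_stage3.sh"

# missing_scripts [DIR]: print each pipeline script not found in DIR (default: .).
missing_scripts() {
  dir="${1:-.}"
  for script in $PIPELINE_SCRIPTS; do
    [ -f "$dir/$script" ] || echo "$script"
  done
}
```

Running `missing_scripts` from the repository root prints nothing when all scripts are in place; any name it prints is a script that could not be found.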

Additional Notes

  • The prompt.py file contains the definitions of prompts used during training and real-time inference.
  • Ensure all dependencies are installed as per the ms-swift documentation before running the scripts.

Citation

@inproceedings{ma2025teleantifraud,
  title={TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection},
  author={Ma, Zhiming and Wang, Peidong and Huang, Minhua and Wang, Jinpeng and Wu, Kai and Lv, Xiangzhao and Pang, Yachun and Yang, Yin and Tang, Wenjie and Kang, Yuchen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={5853--5862},
  year={2025}
}

@article{wang2026safe,
  title={SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning},
  author={Wang, Peidong and Ma, Zhiming and Dai, Xin and Liu, Yongkang and Feng, Shi and Yang, Xiaocui and Hu, Wenxing and Wang, Zhihao and Pan, Mingjun and Yuan, Li and others},
  journal={arXiv preprint arXiv:2601.01392},
  year={2026}
}
