Nam V. Nguyen, Dien X. Tran, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le
📌 About • 🔍 Checkpoints • 🚀 Quick Start • 🏋️♂️ Training • 🧪 Pipeline • 📖 Citation
The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation.
To address these challenges, we introduce SemViQA, a novel Vietnamese fact-checking framework integrating:
- Semantic-based Evidence Retrieval (SER): Combines TF-IDF with a Question Answering Token Classifier (QATC) to enhance retrieval precision while reducing inference time.
- Two-step Verdict Classification (TVC): Uses hierarchical classification optimized with Cross-Entropy and Focal Loss, improving claim verification across three categories:
- Supported ✅
- Refuted ❌
- Not Enough Information (NEI) 🤷♂️
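The retrieval stage can be pictured as a coarse TF-IDF pass that narrows the context down to a few candidate sentences, which the QATC model then refines into the final evidence. The snippet below sketches only that first pass with scikit-learn; the function name and `top_k` value are illustrative choices, not the repository's API.

```python
# Illustrative sketch of the coarse TF-IDF stage of Semantic-based Evidence
# Retrieval (SER). Names and top_k are hypothetical; the real implementation
# additionally refines these candidates with the QATC model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def tfidf_candidates(claim: str, sentences: list[str], top_k: int = 3) -> list[str]:
    """Rank context sentences by TF-IDF cosine similarity to the claim."""
    vectorizer = TfidfVectorizer()
    # Fit on the claim and all candidate sentences so they share one vocabulary.
    matrix = vectorizer.fit_transform([claim] + sentences)
    claim_vec, sent_vecs = matrix[0], matrix[1:]
    scores = cosine_similarity(claim_vec, sent_vecs).ravel()
    ranked = sorted(zip(scores, sentences), key=lambda x: x[0], reverse=True)
    return [sentence for _, sentence in ranked[:top_k]]
```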
- 1st place in the UIT Data Science Challenge 🏅
- State-of-the-art performance on:
- ISE-DSC01 → 78.97% strict accuracy
- ViWikiFC → 80.82% strict accuracy
- SemViQA Faster: 7x speed improvement over the standard model 🚀
These results establish SemViQA as a benchmark for Vietnamese fact verification, advancing efforts to combat misinformation and ensure information integrity.
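The Two-step Verdict Classification stage is optimized with a combination of Cross-Entropy and Focal Loss so that hard, ambiguous claims contribute more to the gradient. For readers unfamiliar with Focal Loss, the snippet below is a minimal PyTorch sketch; the `gamma` value and the three-class setup are assumptions for illustration, not the exact training configuration.

```python
# Minimal Focal Loss sketch for the three verdict classes (Supported,
# Refuted, NEI). gamma=2.0 is a common default, not necessarily the paper's setting.
import torch
import torch.nn.functional as F


def focal_loss(logits: torch.Tensor, targets: torch.Tensor, gamma: float = 2.0) -> torch.Tensor:
    """logits: (batch, 3); targets: (batch,) with class indices 0..2."""
    log_probs = F.log_softmax(logits, dim=-1)
    ce = F.nll_loss(log_probs, targets, reduction="none")  # per-sample cross-entropy
    pt = torch.exp(-ce)                                     # probability of the true class
    return ((1.0 - pt) ** gamma * ce).mean()                # down-weight easy examples
```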
We are making our SemViQA experiment checkpoints publicly available to support the Vietnamese fact-checking research community. By sharing these models, we aim to:
- Facilitate reproducibility: Allow researchers and developers to validate and build upon our results.
- Save computational resources: Enable fine-tuning or transfer learning on top of pre-trained and fine-tuned models instead of training from scratch.
- Encourage further improvements: Provide a strong baseline for future advancements in Vietnamese misinformation detection.
| Method | Model | ViWikiFC | ISE-DSC01 |
|---|---|---|---|
| TC | InfoXLM<sub>large</sub> | Link | Link |
| TC | XLM-R<sub>large</sub> | Link | Link |
| TC | Ernie-M<sub>large</sub> | Link | Link |
| BC | InfoXLM<sub>large</sub> | Link | Link |
| BC | XLM-R<sub>large</sub> | Link | Link |
| BC | Ernie-M<sub>large</sub> | Link | Link |
| QATC | InfoXLM<sub>large</sub> | Link | Link |
| QATC | ViMRC<sub>large</sub> | Link | Link |
| QA origin | InfoXLM<sub>large</sub> | Link | Link |
| QA origin | ViMRC<sub>large</sub> | Link | Link |
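Once downloaded, a TC/BC classification checkpoint can be loaded with Hugging Face Transformers in the usual way; the identifier below is a placeholder for whichever checkpoint you obtain from the links above (QATC checkpoints use a different head, so they may require the repository's own model class rather than this Auto class):

```python
# Load a verdict-classification checkpoint; replace the placeholder with the
# local path or hub ID of the checkpoint downloaded from the table above.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "path/to/downloaded-checkpoint"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
```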
git clone https://github.com/DAVID-NGUYEN-S16/SemViQA.git
cd SemViQA
We recommend using Python 3.11 in a virtual environment (`venv`) or Anaconda.
Using `venv`:
python -m venv semviqa_env
source semviqa_env/bin/activate # On MacOS/Linux
semviqa_env\Scripts\activate # On Windows
Using Anaconda:
conda create -n semviqa_env python=3.11 -y
conda activate semviqa_env
pip install --upgrade pip
pip install transformers==4.42.3
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
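A quick sanity check, in Python, that the pinned versions and the CUDA build were picked up:

```python
# Verify the environment installed above: library versions and GPU visibility.
import torch
import transformers

print("transformers:", transformers.__version__)  # expected 4.42.3
print("torch:", torch.__version__)                 # expected 2.3.0+cu118
print("CUDA available:", torch.cuda.is_available())
```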
Train different components of SemViQA using the provided scripts:
bash scripts/tc.sh
bash scripts/bc.sh
bash scripts/qatc.sh
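The scripts above encapsulate the actual training configuration. As a rough, standalone illustration of what fine-tuning one of the classification components involves (here a three-class verdict classifier with the Hugging Face `Trainer`; the base model, label order, toy data, and hyperparameters are assumptions, not the scripts' settings):

```python
# Rough illustration (not the repository's training script) of fine-tuning a
# three-class claim-verification model with Hugging Face Trainer.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

base_model = "microsoft/infoxlm-large"  # one of the encoders used by SemViQA
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForSequenceClassification.from_pretrained(base_model, num_labels=3)

# Toy data: (claim, evidence) pairs with labels 0=SUPPORTED, 1=REFUTED, 2=NEI
# (label order assumed for illustration).
data = Dataset.from_dict({
    "claim": ["Hà Nội là thủ đô của Việt Nam."],
    "evidence": ["Hà Nội là thủ đô của nước Việt Nam."],
    "label": [0],
})

def tokenize(batch):
    return tokenizer(batch["claim"], batch["evidence"], truncation=True, max_length=256)

data = data.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="tc_checkpoint", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=data,
    tokenizer=tokenizer,  # enables the default padding collator
)
trainer.train()
```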
Use the trained models to generate predictions on the test data:
bash scripts/pipeline.sh
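Outside the script, a single end-to-end example conceptually amounts to retrieving the most relevant sentence and then classifying the claim–evidence pair. The sketch below is illustrative only: the checkpoint path and label order are placeholders, and the real pipeline additionally applies the QATC model and the two-step decision rule.

```python
# Illustrative single-example inference: retrieved evidence + verdict classifier.
# Checkpoint path and label order are placeholders/assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "path/to/fine-tuned-verdict-classifier"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint).eval()

claim = "Hà Nội là thủ đô của Việt Nam."
evidence = "Hà Nội là thủ đô của nước Việt Nam."  # e.g. the top-ranked sentence

inputs = tokenizer(claim, evidence, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
labels = ["SUPPORTED", "REFUTED", "NEI"]  # assumed order
print(labels[logits.argmax(dim=-1).item()])
```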
Our development is based on our previous works:
SemViQA is the final version we have developed for Vietnamese fact verification, achieving state-of-the-art (SOTA) performance among existing systems for Vietnamese.
If you use SemViQA in your research, please cite our work:
@misc{nguyen2025semviqasemanticquestionanswering,
title={SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking},
author={Nam V. Nguyen and Dien X. Tran and Thanh T. Tran and Anh T. Hoang and Tai V. Duong and Di T. Le and Phuc-Lu Le},
year={2025},
eprint={2503.00955},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2503.00955},
}