SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking

Authors:

Nam V. Nguyen, Dien X. Tran, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le

📌 About • 🔍 Checkpoints • 🚀 Quick Start • 🏋️‍♂️ Training • 🧪 Pipeline • 📖 Citation

📌 About

The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation.

SemViQA addresses these challenges with two components:

- Semantic-based Evidence Retrieval (SER): Combines TF-IDF with a Question Answering Token Classifier (QATC) to enhance retrieval precision while reducing inference time.
- Two-step Verdict Classification (TVC): Uses hierarchical classification optimized with Cross-Entropy and Focal Loss, improving claim verification across three categories:
  - Supported ✅
  - Refuted ❌
  - Not Enough Information (NEI) 🤷‍♂️
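
The two components above can be pictured with a small, dependency-free sketch. It is an illustration only and not the repository's implementation: the whitespace tokenizer, the `threshold` value, the `qa_fallback` callable (standing in for a QATC model), and the rule "three-class model decides NEI, binary model decides Supported vs. Refuted" are all assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_rank(claim, sentences):
    """Rank candidate sentences by TF-IDF cosine similarity to the claim."""
    docs = [s.lower().split() for s in sentences]  # naive whitespace tokens (assumption)
    n = len(docs)
    # Document frequency of each term across the candidate sentences.
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    q = vectorize(claim.lower().split())
    scored = [(cosine(q, vectorize(d)), s) for d, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def retrieve_evidence(claim, sentences, qa_fallback, threshold=0.5):
    """Keep the TF-IDF top hit if it scores high enough, else defer to a QA model.

    `qa_fallback` and `threshold` are hypothetical; they stand in for QATC.
    """
    ranked = tfidf_rank(claim, sentences)
    if ranked and ranked[0][0] >= threshold:
        return ranked[0][1]
    return qa_fallback(claim, sentences)


def two_step_verdict(tc_probs, bc_probs):
    """Assumed two-step rule: three-class step filters NEI, binary step decides the rest."""
    if max(tc_probs, key=tc_probs.get) == "NEI":
        return "NEI"
    return max(bc_probs, key=bc_probs.get)
```

In this sketch the cheap lexical score handles the easy cases and the slower QA model is called only when lexical overlap is weak, which is one way a retrieval stage can trade off precision against inference time.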

🏆 Achievements

- 1st place in the UIT Data Science Challenge 🏅
- State-of-the-art performance on:
  - ISE-DSC01 → 78.97% strict accuracy
  - ViWikiFC → 80.82% strict accuracy
- SemViQA Faster: 7x speed improvement over the standard model 🚀

These results establish SemViQA as a benchmark for Vietnamese fact verification, advancing efforts to combat misinformation and ensure information integrity.
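
This README does not define strict accuracy. The sketch below assumes the usual reading for fact-checking benchmarks: a sample counts as correct only when both the predicted verdict and the retrieved evidence match the reference, and NEI samples are judged on the verdict alone. The dictionary keys and the text normalization are hypothetical choices for illustration, not the official scorer.

```python
def _normalize(text):
    """Lowercase and collapse whitespace so trivial formatting differences do not count."""
    return " ".join((text or "").lower().split())


def strict_accuracy(predictions, references):
    """Fraction of samples whose verdict AND evidence both match the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, gold in zip(predictions, references):
        verdict_ok = pred["verdict"] == gold["verdict"]
        # Assumption: NEI samples carry no gold evidence, so only the verdict is compared.
        evidence_ok = (
            gold["verdict"] == "NEI"
            or _normalize(pred["evidence"]) == _normalize(gold["evidence"])
        )
        if verdict_ok and evidence_ok:
            correct += 1
    return correct / len(references)
```

Under this definition a correct verdict paired with the wrong evidence sentence scores zero, which is why strict accuracy is typically lower than verdict accuracy alone.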

🔍 Checkpoints

We are making our SemViQA experiment checkpoints publicly available to support the Vietnamese fact-checking research community. By sharing these models, we aim to:

- Facilitate reproducibility: Allow researchers and developers to validate and build upon our results.
- Save computational resources: Enable fine-tuning or transfer learning on top of pre-trained and fine-tuned models instead of training from scratch.
- Encourage further improvements: Provide a strong baseline for future advancements in Vietnamese misinformation detection.

TC = three-class classification, BC = binary classification, QATC = Question Answering Token Classifier.

Method	Model	ViWikiFC	ISE-DSC01
TC	InfoXLM_large	Link	Link
TC	XLM-R_large	Link	Link
TC	Ernie-M_large	Link	Link
BC	InfoXLM_large	Link	Link
BC	XLM-R_large	Link	Link
BC	Ernie-M_large	Link	Link
QATC	InfoXLM_large	Link	Link
QATC	ViMRC_large	Link	Link
QA origin	InfoXLM_large	Link	Link
QA origin	ViMRC_large	Link	Link

🚀 Quick Start

📥 Installation

1️⃣ Clone this repository

git clone https://github.com/DAVID-NGUYEN-S16/SemViQA.git
cd SemViQA

2️⃣ Set up Python environment

We recommend using Python 3.11 in a virtual environment (venv) or Anaconda.

Using venv:

python -m venv semviqa_env
source semviqa_env/bin/activate  # macOS/Linux only
semviqa_env\Scripts\activate      # Windows only (run instead of the line above)

Using Anaconda:

conda create -n semviqa_env python=3.11 -y
conda activate semviqa_env

3️⃣ Install dependencies

pip install --upgrade pip
pip install transformers==4.42.3
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
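
After installing, it can help to confirm that the pinned versions were actually picked up. The following is a small optional check, not part of the repository; the `check_versions` helper is hypothetical, and it assumes that local build tags such as `+cu118` should be ignored when comparing versions.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the install commands above.
EXPECTED = {
    "transformers": "4.42.3",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "torchaudio": "2.3.0",
}


def check_versions(expected, get_version=version):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Strip a local build tag like "+cu118" before comparing (assumption).
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={installed} expected={EXPECTED[name]} {status}")
```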

🏋️‍♂️ Training

Train different components of SemViQA using the provided scripts:

1️⃣ Three-Class Classification Training

bash scripts/tc.sh

2️⃣ Binary Classification Training

bash scripts/bc.sh

3️⃣ QATC Model Training

bash scripts/qatc.sh
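
The About section mentions that the verdict classifiers are optimized with Cross-Entropy and Focal Loss. As a rough illustration of how the two losses differ on a single example, here is a pure-Python sketch. The `gamma` and `alpha` values are common illustrative defaults and are assumptions, not the settings used by the training scripts; which classifier uses which loss is not stated here either.

```python
import math

_EPS = 1e-12  # avoids log(0) for degenerate probabilities


def cross_entropy(probs, target):
    """Standard cross-entropy for one example: -log(p_t)."""
    p_t = min(max(probs[target], _EPS), 1.0)
    return -math.log(p_t)


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example: -alpha * (1 - p_t)**gamma * log(p_t).

    The (1 - p_t)**gamma factor shrinks the loss of well-classified examples,
    so training focuses on the hard ones. With gamma=0 and alpha=1 this
    reduces to cross-entropy.
    """
    p_t = min(max(probs[target], _EPS), 1.0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

For a confident correct prediction (for example `p_t = 0.9`) the focal term scales the loss by `0.1 ** 2 = 0.01`, while a hard example with `p_t = 0.1` keeps `0.9 ** 2 = 0.81` of its cross-entropy loss.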

🧪 Pipeline

Use the trained models to predict test data:

bash scripts/pipeline.sh

Acknowledgment

Our development is based on our previous works:

SemViQA is the final version of our Vietnamese fact-checking system and achieves state-of-the-art (SOTA) performance among existing systems for Vietnamese.

📖 Citation

If you use SemViQA in your research, please cite our work:

@misc{nguyen2025semviqasemanticquestionanswering,
      title={SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking}, 
      author={Nam V. Nguyen and Dien X. Tran and Thanh T. Tran and Anh T. Hoang and Tai V. Duong and Di T. Le and Phuc-Lu Le},
      year={2025},
      eprint={2503.00955},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00955}, 
}
