SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking

HTAnh2003/SemViQA

 
 


Authors:

Nam V. Nguyen, Dien X. Tran, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le

📌 About | 🔍 Checkpoints | 🚀 Quick Start | 🏋️‍♂️ Training | 🧪 Pipeline | 📖 Citation


📌 About

The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency.

To address these challenges, we introduce SemViQA, a novel Vietnamese fact-checking framework integrating:

  • Semantic-based Evidence Retrieval (SER): Combines TF-IDF with a Question Answering Token Classifier (QATC) to enhance retrieval precision while reducing inference time.
  • Two-step Verdict Classification (TVC): Uses hierarchical classification optimized with Cross-Entropy and Focal Loss, improving claim verification across three categories:
    • Supported
    • Refuted
    • Not Enough Information (NEI) 🤷‍♂️
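
As a rough illustration of the TF-IDF half of SER, the sketch below ranks the sentences of a context by TF-IDF cosine similarity to the claim. It is a minimal, dependency-free sketch and not the repository's implementation: the function names `tokenize` and `rank_sentences` are hypothetical, the whitespace-level tokenization is an assumption, and the QATC model that SER combines with TF-IDF is not reproduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and extract runs of word characters. Python's \w is
    # Unicode-aware, so precomposed Vietnamese letters stay inside tokens.
    return re.findall(r"\w+", text.lower())


def rank_sentences(claim, sentences):
    """Return (score, sentence) pairs, best match first (illustrative only)."""
    docs = [tokenize(s) for s in sentences]
    n_docs = len(docs)
    # Document frequency: in how many sentences each term occurs.
    df = Counter(term for doc in docs for term in set(doc))

    def idf(term):
        # Smoothed IDF so terms unseen in the context do not divide by zero.
        return math.log((1 + n_docs) / (1 + df.get(term, 0))) + 1.0

    def vectorize(tokens):
        return {t: count * idf(t) for t, count in Counter(tokens).items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
            sum(w * w for w in b.values())
        )
        return dot / norm if norm else 0.0

    query = vectorize(tokenize(claim))
    scored = [(cosine(query, vectorize(doc)), s) for doc, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    context = [
        "Hà Nội là thủ đô của nước Việt Nam.",
        "Sông Hồng chảy qua nhiều tỉnh miền Bắc.",
        "Phở là một món ăn nổi tiếng.",
    ]
    for score, sentence in rank_sentences("Hà Nội là thủ đô của Việt Nam", context):
        print(f"{score:.3f}  {sentence}")
```

The top-ranked sentence would be the evidence candidate passed to the verdict classifier; a purely lexical score like this is exactly where semantic ambiguity can mislead, which is the gap QATC is described as covering.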

🏆 Achievements

  • 1st place in the UIT Data Science Challenge 🏅
  • State-of-the-art performance on:
    • ISE-DSC01: 78.97% strict accuracy
    • ViWikiFC: 80.82% strict accuracy
  • SemViQA Faster: 7x speed improvement over the standard model 🚀
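
For readers unfamiliar with the metric, the sketch below shows one way strict accuracy could be computed. It assumes that a prediction counts as correct only when both the verdict and the retrieved evidence match the gold annotation, and that NEI examples carry no gold evidence; the dictionary keys `verdict` and `evidence` are hypothetical and are not taken from the repository's data format.

```python
def strict_accuracy(predictions, references):
    """Fraction of examples whose verdict AND evidence both match the gold label.

    Assumption: for gold NEI examples there is no evidence to compare,
    so only the verdict is checked.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    correct = 0
    for pred, ref in zip(predictions, references):
        verdict_ok = pred["verdict"] == ref["verdict"]
        if ref["verdict"] == "NEI":
            evidence_ok = True
        else:
            # Treat a missing evidence field as an empty string.
            pred_ev = (pred.get("evidence") or "").strip()
            ref_ev = (ref.get("evidence") or "").strip()
            evidence_ok = pred_ev == ref_ev
        correct += int(verdict_ok and evidence_ok)
    return correct / len(references)
```

Under this definition a system that picks the right verdict from the wrong sentence gets no credit, which is why evidence retrieval quality matters as much as classification quality.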

These results establish SemViQA as a benchmark for Vietnamese fact verification, advancing efforts to combat misinformation and ensure information integrity.


🔍 Checkpoints

We are making our SemViQA experiment checkpoints publicly available to support the Vietnamese fact-checking research community. By sharing these models, we aim to:

  • Facilitate reproducibility: Allow researchers and developers to validate and build upon our results.
  • Save computational resources: Enable fine-tuning or transfer learning on top of pre-trained and fine-tuned models instead of training from scratch.
  • Encourage further improvements: Provide a strong baseline for future advancements in Vietnamese misinformation detection.
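
| Method    | Model         | ViWikiFC | ISE-DSC01 |
|-----------|---------------|----------|-----------|
| TC        | InfoXLM-large | Link     | Link      |
| TC        | XLM-R-large   | Link     | Link      |
| TC        | Ernie-M-large | Link     | Link      |
| BC        | InfoXLM-large | Link     | Link      |
| BC        | XLM-R-large   | Link     | Link      |
| BC        | Ernie-M-large | Link     | Link      |
| QATC      | InfoXLM-large | Link     | Link      |
| QATC      | ViMRC-large   | Link     | Link      |
| QA origin | InfoXLM-large | Link     | Link      |
| QA origin | ViMRC-large   | Link     | Link      |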

🚀 Quick Start

📥 Installation

1️⃣ Clone this repository

git clone https://github.com/DAVID-NGUYEN-S16/SemViQA.git
cd SemViQA

2️⃣ Set up Python environment

We recommend using Python 3.11 in a virtual environment (venv) or Anaconda.

Using venv:

python -m venv semviqa_env
source semviqa_env/bin/activate   # On macOS/Linux
semviqa_env\Scripts\activate      # On Windows (run this instead of the line above)

Using Anaconda:

conda create -n semviqa_env python=3.11 -y
conda activate semviqa_env

3️⃣ Install dependencies

pip install --upgrade pip
pip install transformers==4.42.3
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
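
After installing, it can help to confirm that the pinned versions actually ended up in the environment. The following is a small optional check, not part of the repository; the helper name `check_versions` is hypothetical, and the pins are copied from the install commands above.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip commands in this README.
EXPECTED = {
    "transformers": "4.42.3",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "torchaudio": "2.3.0",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, found)} for every pin that is not satisfied."""
    problems = {}
    for name, wanted in expected.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None
        # CUDA wheels report versions such as "2.3.0+cu118",
        # so only the part before "+" is compared.
        if found is None or found.split("+")[0] != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    issues = check_versions()
    if issues:
        for name, (wanted, found) in issues.items():
            print(f"{name}: expected {wanted}, found {found}")
    else:
        print("All pinned packages match.")
```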

🏋️‍♂️ Training

Train different components of SemViQA using the provided scripts:

1️⃣ Three-Class Classification Training

bash scripts/tc.sh

2️⃣ Binary Classification Training

bash scripts/bc.sh

3️⃣ QATC Model Training

bash scripts/qatc.sh
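
The About section states that the TVC classifiers are optimized with Cross-Entropy and Focal Loss. As background, here is a minimal pure-Python sketch of the standard focal loss for a single example, shown next to cross-entropy for comparison. It is not the training code from the scripts above, and `gamma=2.0` is a commonly used default rather than a value confirmed by this repository.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Standard cross-entropy: -log(p_t) for the true class index `target`."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0):
    """Focal loss: -(1 - p_t)^gamma * log(p_t).

    The (1 - p_t)^gamma factor shrinks the loss of examples the model
    already classifies confidently, so hard examples dominate training.
    """
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


if __name__ == "__main__":
    easy = [4.0, 0.0, 0.0]   # confident and correct for class 0
    hard = [0.2, 0.0, 0.1]   # nearly uniform prediction
    for name, logits in (("easy", easy), ("hard", hard)):
        print(name, cross_entropy(logits, 0), focal_loss(logits, 0))
```

With `gamma=0` the focal loss reduces to plain cross-entropy, which makes it easy to treat the two as points on one family of objectives.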

🧪 Pipeline

Use the trained models to predict test data:

bash scripts/pipeline.sh
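
To make the two-step idea concrete, the sketch below shows one plausible way a three-class model and a binary model could be chained at inference time. This is an assumption-laden illustration, not a description of what `scripts/pipeline.sh` does: the function `two_step_verdict`, the classifier callables, and the rule "return NEI from step one, otherwise defer to the binary model" are all hypothetical, and the actual combination rule may differ.

```python
def two_step_verdict(claim, evidence, three_class, binary):
    """Chain a three-class and a binary classifier (illustrative only).

    `three_class(claim, evidence)` returns probabilities over
    {"Supported", "Refuted", "NEI"}; `binary(claim, evidence)` returns
    probabilities over {"Supported", "Refuted"}. Both are hypothetical
    stand-ins for the trained TC and BC models.
    """
    # Step 1: decide whether the evidence is sufficient at all.
    probs3 = three_class(claim, evidence)
    first = max(probs3, key=probs3.get)
    if first == "NEI":
        return "NEI"

    # Step 2: let the specialised binary model pick Supported vs Refuted.
    probs2 = binary(claim, evidence)
    return max(probs2, key=probs2.get)


if __name__ == "__main__":
    # Stub classifiers with fixed outputs, standing in for real models.
    tc = lambda c, e: {"Supported": 0.5, "Refuted": 0.4, "NEI": 0.1}
    bc = lambda c, e: {"Supported": 0.3, "Refuted": 0.7}
    print(two_step_verdict("claim", "evidence", tc, bc))
```

Splitting the decision this way lets the binary model focus on the Supported/Refuted boundary without having to model the NEI class.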

Acknowledgment

SemViQA builds on our previous work. It is the final version of our Vietnamese fact-verification system and achieves state-of-the-art (SOTA) performance among Vietnamese fact-checking systems.

📖 Citation

If you use SemViQA in your research, please cite our work:

@misc{nguyen2025semviqasemanticquestionanswering,
      title={SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking}, 
      author={Nam V. Nguyen and Dien X. Tran and Thanh T. Tran and Anh T. Hoang and Tai V. Duong and Di T. Le and Phuc-Lu Le},
      year={2025},
      eprint={2503.00955},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00955}, 
}
