SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking

Authors:

Nam V. Nguyen, Dien X. Tran, Thanh T. Tran, Anh T. Hoang, Tai V. Duong, Di T. Le, Phuc-Lu Le

📌 About • 🔍 Checkpoints • 🚀 Quick Start • 🏋️‍♂️ Training • 🧪 Pipeline • 📖 Citation


📌 About

The rise of misinformation, exacerbated by Large Language Models (LLMs) like GPT and Gemini, demands robust fact-checking solutions, especially for low-resource languages like Vietnamese. Existing methods struggle with semantic ambiguity, homonyms, and complex linguistic structures, often trading accuracy for efficiency. We introduce SemViQA, a novel Vietnamese fact-checking framework integrating Semantic-based Evidence Retrieval (SER) and Two-step Verdict Classification (TVC). Our approach balances precision and speed, achieving state-of-the-art results with 78.97% strict accuracy on ISE-DSC01 and 80.82% on ViWikiFC, securing 1st place in the UIT Data Science Challenge. Additionally, SemViQA Faster improves inference speed 7x while maintaining competitive accuracy. SemViQA sets a new benchmark for Vietnamese fact verification, advancing the fight against misinformation.

SemViQA integrates two components:

  • Semantic-based Evidence Retrieval (SER): Combines TF-IDF with a Question Answering Token Classifier (QATC) to enhance retrieval precision while reducing inference time.
  • Two-step Verdict Classification (TVC): Uses hierarchical classification optimized with Cross-Entropy and Focal Loss, improving claim verification across three categories:
    • Supported ✅
    • Refuted ❌
    • Not Enough Information (NEI) 🤷‍♂️
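
The TF-IDF stage of SER can be illustrated with a minimal, dependency-free sketch. It only ranks candidate evidence sentences against a claim; how SemViQA combines these scores with the QATC model is not described here, so that step is left out. The function names `tokenize` and `tfidf_rank`, the smoothed IDF formula, and the simple regex tokenizer are all assumptions made for illustration, not the repository's implementation.

```python
import math
import re
import unicodedata
from collections import Counter


def tokenize(text):
    # NFC-normalize so accented Vietnamese letters are single code points,
    # then split into lowercase word tokens. A real system would likely use
    # a Vietnamese word segmenter instead (assumption).
    return re.findall(r"\w+", unicodedata.normalize("NFC", text).lower())


def tfidf_rank(claim, sentences):
    """Return (score, sentence) pairs sorted by TF-IDF cosine similarity."""
    docs = [tokenize(s) for s in sentences]
    n = len(docs)
    # Document frequency: in how many sentences each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF (an illustrative choice).
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

    def vectorize(tokens):
        tf = Counter(tokens)
        # Terms unseen in the candidate sentences are dropped.
        return {t: tf[t] * idf[t] for t in tf if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        na = math.sqrt(sum(w * w for w in a.values()))
        nb = math.sqrt(sum(w * w for w in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    query = vectorize(tokenize(claim))
    scored = [(cosine(query, vectorize(d)), s) for d, s in zip(docs, sentences)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    context = [
        "Hà Nội là thủ đô của Việt Nam từ năm 1976.",
        "Phở là một món ăn nổi tiếng.",
        "Sông Hồng chảy qua miền Bắc.",
    ]
    for score, sentence in tfidf_rank("Hà Nội là thủ đô của Việt Nam", context):
        print(f"{score:.3f}  {sentence}")
```

In this sketch the highest-scoring sentence would be passed on as candidate evidence; a learned model such as QATC could then refine the choice when lexical overlap alone is ambiguous.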

🏆 Achievements

  • 1st place in the UIT Data Science Challenge 🏅
  • State-of-the-art performance on:
    • ISE-DSC01 → 78.97% strict accuracy
    • ViWikiFC → 80.82% strict accuracy
  • SemViQA Faster: 7x speed improvement over the standard model 🚀

These results establish SemViQA as a benchmark for Vietnamese fact verification, advancing efforts to combat misinformation and ensure information integrity.
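
The strict accuracy figures above can be made concrete with a small sketch. It assumes that a prediction counts as correct only when both the verdict and the selected evidence match the reference, and that NEI claims are judged on the verdict alone. That definition, the dictionary keys `verdict` and `evidence`, and the function name `strict_accuracy` are assumptions for illustration; the benchmark's official scorer may differ.

```python
def strict_accuracy(predictions, references):
    """Fraction of claims with both verdict and evidence correct (assumed definition)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        verdict_ok = pred["verdict"] == ref["verdict"]
        # Assumption: NEI claims carry no gold evidence, so only the verdict is checked.
        evidence_ok = ref["verdict"] == "NEI" or pred["evidence"] == ref["evidence"]
        if verdict_ok and evidence_ok:
            correct += 1
    return correct / len(references)


if __name__ == "__main__":
    refs = [
        {"verdict": "Supported", "evidence": "s1"},
        {"verdict": "Refuted", "evidence": "s2"},
        {"verdict": "NEI", "evidence": None},
    ]
    preds = [
        {"verdict": "Supported", "evidence": "s1"},   # verdict and evidence match
        {"verdict": "Refuted", "evidence": "other"},  # right verdict, wrong evidence
        {"verdict": "NEI", "evidence": "s9"},         # NEI: evidence ignored
    ]
    print(strict_accuracy(preds, refs))
```

Under this reading, a system that picks the right label from the wrong sentence gets no credit, which is why evidence retrieval quality matters as much as classification quality.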


🔍 Checkpoints

We are making our SemViQA experiment checkpoints publicly available to support the Vietnamese fact-checking research community. By sharing these models, we aim to:

  • Facilitate reproducibility: Allow researchers and developers to validate and build upon our results.
  • Save computational resources: Enable fine-tuning or transfer learning on top of pre-trained and fine-tuned models instead of training from scratch.
  • Encourage further improvements: Provide a strong baseline for future advancements in Vietnamese misinformation detection.
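
| Method | Model | ViWikiFC | ISE-DSC01 |
|---|---|---|---|
| TC | InfoXLM-large | Link | Link |
| TC | XLM-R-large | Link | Link |
| TC | Ernie-M-large | Link | Link |
| BC | InfoXLM-large | Link | Link |
| BC | XLM-R-large | Link | Link |
| BC | Ernie-M-large | Link | Link |
| QATC | InfoXLM-large | Link | Link |
| QATC | ViMRC-large | Link | Link |
| QA origin | InfoXLM-large | Link | Link |
| QA origin | ViMRC-large | Link | Link |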

🚀 Quick Start

📥 Installation

1️⃣ Clone this repository

git clone https://github.com/DAVID-NGUYEN-S16/SemViQA.git
cd SemViQA

2️⃣ Set up Python environment

We recommend using Python 3.11 in a virtual environment (venv) or Anaconda.

Using venv:

python -m venv semviqa_env
source semviqa_env/bin/activate  # On MacOS/Linux
semviqa_env\Scripts\activate      # On Windows

Using Anaconda:

conda create -n semviqa_env python=3.11 -y
conda activate semviqa_env

3️⃣ Install dependencies

pip install --upgrade pip
pip install transformers==4.42.3
pip install torch==2.3.0 torchvision==0.18.0 torchaudio==2.3.0 --index-url https://download.pytorch.org/whl/cu118
pip install -r requirements.txt
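
After installation, it can help to confirm that the pinned versions are the ones actually active in the environment. The following is a minimal sketch using only the standard library; the helper name `check_versions` and the `EXPECTED` mapping are illustrative and not part of the repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the install commands above.
EXPECTED = {
    "transformers": "4.42.3",
    "torch": "2.3.0",
    "torchvision": "0.18.0",
    "torchaudio": "2.3.0",
}


def check_versions(expected=EXPECTED):
    """Return a {package: status} report comparing installed and pinned versions."""
    report = {}
    for package, wanted in expected.items():
        try:
            found = version(package)
        except PackageNotFoundError:
            report[package] = "missing"
            continue
        # Wheels from the PyTorch index carry a local suffix such as "+cu118".
        base = found.split("+")[0]
        report[package] = "ok" if base == wanted else f"mismatch ({found})"
    return report


if __name__ == "__main__":
    for package, status in check_versions().items():
        print(f"{package}: {status}")
```

Any package reported as "missing" or "mismatch" suggests the install commands were run in a different environment than the active one.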

🏋️‍♂️ Training

Train different components of SemViQA using the provided scripts:

1️⃣ Three-Class Classification Training

bash scripts/tc.sh

2️⃣ Binary Classification Training

bash scripts/bc.sh

3️⃣ QATC Model Training

bash scripts/qatc.sh
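
The About section notes that the verdict classifiers are optimized with Cross-Entropy and Focal Loss. As a reference for what those two objectives compute, here is a minimal pure-Python sketch for a single example. The values `gamma=2.0` and `alpha=1.0` are common defaults from the focal loss literature, not settings confirmed for SemViQA, and the training scripts themselves presumably use a PyTorch implementation.

```python
import math


def softmax(logits):
    # Subtract the maximum for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


def focal_loss(logits, target, gamma=2.0, alpha=1.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which down-weights easy examples."""
    p_t = softmax(logits)[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    logits = [2.0, 0.5, -1.0]  # scores for three classes
    for target in range(3):
        ce = cross_entropy(logits, target)
        fl = focal_loss(logits, target)
        print(f"target={target}  CE={ce:.4f}  focal={fl:.4f}")
```

With `gamma=0` the focal loss reduces to plain cross-entropy; larger values of `gamma` shift the training signal toward examples the model currently gets wrong, which can help with imbalanced classes.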

🧪 Pipeline

Use the trained models to run inference on the test data:

bash scripts/pipeline.sh
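
To give a feel for how a two-step verdict decision might be wired together, here is a hedged sketch. It assumes one plausible reading of "two-step": a three-class model runs first, and if it does not predict NEI, a binary Supported/Refuted model is consulted and the more confident of the two predictions wins. This decision rule, the function name `two_step_verdict`, and the classifier call signature are all hypothetical; `scripts/pipeline.sh` may implement the logic differently.

```python
def two_step_verdict(claim, evidence, three_class, binary):
    """Combine a three-class and a binary classifier (hypothetical decision rule).

    Both classifiers are callables taking (claim, evidence) and returning
    a (label, confidence) pair.
    """
    label3, conf3 = three_class(claim, evidence)
    if label3 == "NEI":
        # Step 1 already decided that the evidence is insufficient.
        return "NEI"
    # Step 2: ask the binary model and keep the more confident prediction.
    label2, conf2 = binary(claim, evidence)
    return label2 if conf2 > conf3 else label3


if __name__ == "__main__":
    # Stub classifiers standing in for trained models.
    def three_class(claim, evidence):
        return ("Supported", 0.55)

    def binary(claim, evidence):
        return ("Refuted", 0.90)

    print(two_step_verdict("claim text", "evidence text", three_class, binary))
```

Passing the classifiers in as callables keeps the decision rule independent of any particular model, so trained checkpoints can be swapped in without changing it.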

Acknowledgment

Our development builds on our previous works. SemViQA is the final version of our Vietnamese fact-checking system and achieves state-of-the-art (SOTA) performance compared to any other system for Vietnamese.

📖 Citation

If you use SemViQA in your research, please cite our work:

@misc{nguyen2025semviqasemanticquestionanswering,
      title={SemViQA: A Semantic Question Answering System for Vietnamese Information Fact-Checking}, 
      author={Nam V. Nguyen and Dien X. Tran and Thanh T. Tran and Anh T. Hoang and Tai V. Duong and Di T. Le and Phuc-Lu Le},
      year={2025},
      eprint={2503.00955},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.00955}, 
}
