
ViDeBERTa: A powerful pre-trained language model for Vietnamese, EACL 2023

Paper: https://aclanthology.org/2023.findings-eacl.79.pdf


Contributors

  • Tran Cong Dao
  • Pham Nhut Huy
  • Nguyen Tuan Anh
  • Hy Truong Son (Corresponding author / PI)

Main components

  1. Pre-training
  2. Model
  3. Fine-tuning

Pre-training

Code architecture

  1. bash: bash scripts to run the pipeline
  2. config: model configuration files (JSON)
  3. dataset: datasets folder, holding both the original .txt corpora and the on-disk copies read back with datasets.load_from_disk
  4. source: main Python files for the pre-training pipeline (pre-tokenizing, tokenizer training, and model pre-training)
  5. tokenizer: folder to store the trained tokenizers

Pre-tokenizer

  • Split the original .txt datasets into train, validation, and test sets (90% / 5% / 5%).
  • Segment Vietnamese words with the PyVi library (see the sketch after this list).
  • Save the processed datasets to disk.
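
A minimal sketch of this step, assuming a single-file corpus at dataset/corpus.txt; the actual file names and paths used by the scripts in bash/ and source/ may differ:

```python
from datasets import load_dataset
from pyvi import ViTokenizer

# Load the raw text corpus (one example per line).
raw = load_dataset("text", data_files={"train": "dataset/corpus.txt"})["train"]

# 90% train / 5% validation / 5% test.
split = raw.train_test_split(test_size=0.10, seed=42)
val_test = split["test"].train_test_split(test_size=0.50, seed=42)

def segment(example):
    # PyVi joins the syllables of each Vietnamese word with underscores.
    return {"text": ViTokenizer.tokenize(example["text"])}

train = split["train"].map(segment)
validation = val_test["train"].map(segment)
test = val_test["test"].map(segment)

for name, ds in [("train", train), ("validation", validation), ("test", test)]:
    ds.save_to_disk(f"dataset/{name}")  # reloadable via datasets.load_from_disk
```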

Pre-train_tokenizer

  • Load the segmented datasets
  • Train the tokenizers as SentencePiece models (see the sketch below)
  • Save the tokenizers
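
A minimal sketch, assuming the segmented train split saved by the previous step. The 128K vocabulary and unigram model type follow DeBERTa-v3's tokenizer; the exact hyperparameters used in source/ may differ:

```python
import sentencepiece as spm
from datasets import load_from_disk

train = load_from_disk("dataset/train")

# SentencePiece trains from a plain-text file, so dump the segmented text.
with open("tokenizer/corpus_segmented.txt", "w", encoding="utf-8") as f:
    for example in train:
        f.write(example["text"] + "\n")

spm.SentencePieceTrainer.train(
    input="tokenizer/corpus_segmented.txt",
    model_prefix="tokenizer/videberta_spm",  # writes .model and .vocab
    vocab_size=128000,                       # DeBERTa-v3-style 128K vocabulary
    model_type="unigram",
    character_coverage=0.9995,               # keep rare Vietnamese characters
)
```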

Pre-train_model

  • Load the datasets
  • Load the tokenizers
  • Pre-train DeBERTa-v3 (see the sketch below)
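
DeBERTa-v3 is actually pre-trained with ELECTRA-style replaced-token detection; for brevity, the sketch below substitutes a plain masked-language-modeling objective on the same architecture. Config values, paths, and hyperparameters are illustrative assumptions:

```python
from datasets import load_from_disk
from transformers import (
    DataCollatorForLanguageModeling,
    DebertaV2Config,
    DebertaV2ForMaskedLM,
    DebertaV2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Assumes tokenizer/ holds the trained SentencePiece model from the previous step.
tokenizer = DebertaV2TokenizerFast.from_pretrained("tokenizer/")
config = DebertaV2Config.from_json_file("config/model_config.json")
model = DebertaV2ForMaskedLM(config)  # randomly initialized, trained from scratch

train = load_from_disk("dataset/train")
tokenized = train.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="checkpoints/", per_device_train_batch_size=32),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```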

Model
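
The paper releases ViDeBERTa in three sizes (xsmall, base, and large). A minimal loading sketch, assuming the checkpoints are published on the HuggingFace Hub; the Fsoft-AIC/videberta-base identifier is an assumption and may differ from the released names:

```python
from pyvi import ViTokenizer
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Fsoft-AIC/videberta-base")
model = AutoModel.from_pretrained("Fsoft-AIC/videberta-base")

# Word-segment the input with PyVi first, matching the pre-training pipeline.
text = ViTokenizer.tokenize("Hà Nội là thủ đô của Việt Nam.")
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)
```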

Fine-tuning

Code architecture

  1. POS tagging and NER (POS_NER) (see the fine-tuning sketch after this list)
  2. Question Answering (QA and QA2)
  3. Open-domain Question Answering (OPQA)
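
A minimal sketch of the token-classification fine-tuning in POS_NER. The dataset path and schema are placeholders (the paper evaluates on Vietnamese POS/NER benchmarks such as PhoNER_COVID19), and the label-alignment logic is the standard recipe for subword tokenizers, not necessarily the exact code in the repo:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    DataCollatorForTokenClassification,
    Trainer,
    TrainingArguments,
)

model_id = "Fsoft-AIC/videberta-base"             # assumed Hub identifier
dataset = load_dataset("path/to/phoner_covid19")  # placeholder; assumes token/ner_tags columns
labels = dataset["train"].features["ner_tags"].feature.names

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id, num_labels=len(labels))

def tokenize_and_align(batch):
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = []
    for i, tags in enumerate(batch["ner_tags"]):
        word_ids = enc.word_ids(batch_index=i)
        # Label the first subword of each word; ignore the rest (-100).
        prev, row = None, []
        for wid in word_ids:
            row.append(-100 if wid is None or wid == prev else tags[wid])
            prev = wid
        enc["labels"].append(row)
    return enc

tokenized = dataset.map(
    tokenize_and_align, batched=True, remove_columns=dataset["train"].column_names
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner_out/", num_train_epochs=3),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```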