This repository contains pre-trained BERT models for Brazilian Portuguese. BERT-Base and BERT-Large Cased variants were trained on BrWaC (Brazilian Web as Corpus), a large Portuguese corpus, for 1,000,000 steps using whole-word masking. Model artifacts for both TensorFlow and PyTorch can be found below.
The models are the result of an ongoing Master's program. The Qualifying Exam text is also included in the repository as a PDF; it contains more details about the pre-training procedure, vocabulary generation, and downstream usage in the Named Entity Recognition task.
The base and large models are available on Hugging Face.
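As a quick sanity check of the masked-language-model head, here is a minimal sketch using the 🤗 Transformers `fill-mask` pipeline (the example sentence is illustrative):

```python
from transformers import pipeline

# Fill-mask pipeline with the base checkpoint; weights are downloaded
# from the Hugging Face hub on first use.
fill_mask = pipeline('fill-mask', model='neuralmind/bert-base-portuguese-cased')

# Top predictions for the masked token in a Portuguese sentence.
for prediction in fill_mask('Tinha uma [MASK] no meio do caminho.'):
    print(prediction['token_str'], prediction['score'])
```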
The models were benchmarked on three tasks (Semantic Textual Similarity, Recognizing Textual Entailment, and Named Entity Recognition) and compared to previously published results and Multilingual BERT (mBERT). Metrics are Pearson's correlation for STS and F1-score for RTE and NER.
| Task | Test Dataset | BERTimbau-Large | BERTimbau-Base | mBERT | Previous SOTA |
|------|--------------|-----------------|----------------|-------|---------------|
| STS  | ASSIN2 | 0.852 | 0.836 | 0.809 | 0.83 [1] |
| RTE  | ASSIN2 | 90.0 | 89.2 | 86.8 | 88.3 [1] |
| NER  | MiniHAREM (5 classes) | 83.7 | 83.1 | 79.2 | 82.3 [2] |
| NER  | MiniHAREM (10 classes) | 78.5 | 77.6 | 73.1 | 74.6 [2] |
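Note that the reported STS numbers come from fine-tuned models (see the Qualifying Exam text), not from raw embeddings. Still, as a rough illustration of scoring sentence similarity with the pre-trained checkpoint, here is a hedged sketch using mean-pooled embeddings and cosine similarity (the sentences are made up; this is not the evaluation protocol behind the table):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')
model.eval()

def embed(sentence):
    # Mean-pool the last hidden states into one sentence vector.
    inputs = tokenizer(sentence, return_tensors='pt')
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

a = embed('O gato dorme no sofá.')
b = embed('Um gato está dormindo no sofá.')
print(torch.cosine_similarity(a, b, dim=0).item())
```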
Code and instructions to reproduce the Named Entity Recognition experiments are in the `ner_evaluation/` directory; a general illustration of the approach is sketched below.
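The scripts in `ner_evaluation/` define the actual training and evaluation setup (the accompanying paper uses a BERT-CRF architecture); the sketch below only illustrates the general idea of a token-classification head on top of BERTimbau, with an assumed label count:

```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

# Illustrative only: num_labels=11 assumes 5 entity classes in a BIO
# scheme plus the 'O' tag; the repository scripts define the real labels.
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModelForTokenClassification.from_pretrained(
    'neuralmind/bert-base-portuguese-cased', num_labels=11)

inputs = tokenizer('José mora em Lisboa.', return_tensors='pt')
logits = model(**inputs).logits    # shape: (batch, seq_len, num_labels)
pred_ids = logits.argmax(dim=-1)   # per-token label ids (head is untrained)
```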
Our PyTorch artifacts are compatible with the 🤗 Hugging Face Transformers library and are also available as community models:
```python
from transformers import AutoModel, AutoTokenizer

# Using the community models

# BERT-Base
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-base-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-base-portuguese-cased')

# BERT-Large
tokenizer = AutoTokenizer.from_pretrained('neuralmind/bert-large-portuguese-cased')
model = AutoModel.from_pretrained('neuralmind/bert-large-portuguese-cased')

# Or, using BertModel and BertTokenizer directly with local files
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('path/to/vocab.txt', do_lower_case=False)
model = BertModel.from_pretrained('path/to/bert_dir')  # or another BERT model class
```
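With a tokenizer and model loaded as above, here is a minimal usage sketch for extracting contextual embeddings (the sample sentence is illustrative):

```python
import torch

input_ids = tokenizer.encode('Tinha uma pedra no meio do caminho.',
                             return_tensors='pt')
with torch.no_grad():
    outputs = model(input_ids)

# One vector per (sub)token: shape (1, seq_len, hidden_size).
last_hidden = outputs.last_hidden_state
# The [CLS] vector is commonly used as a pooled sentence representation.
cls_embedding = last_hidden[:, 0, :]
```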
We would like to thank Google for the Cloud credits, awarded under a research grant, that allowed us to train these models.
[1] Multilingual Transformer Ensembles for Portuguese Natural Language Tasks
[2] Assessing the Impact of Contextual Embeddings for Portuguese Named Entity Recognition
If you use our work, please cite:

```bibtex
@InProceedings{souza2020bertimbau,
  author    = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  editor    = {Cerri, Ricardo and Prati, Ronaldo C.},
  title     = {BERTimbau: Pretrained BERT Models for Brazilian Portuguese},
  booktitle = {Intelligent Systems},
  year      = {2020},
  publisher = {Springer International Publishing},
  address   = {Cham},
  pages     = {403--417},
  isbn      = {978-3-030-61377-8}
}

@article{souza2019portuguese,
  title   = {Portuguese Named Entity Recognition using BERT-CRF},
  author  = {Souza, F{\'a}bio and Nogueira, Rodrigo and Lotufo, Roberto},
  journal = {arXiv preprint arXiv:1909.10649},
  url     = {http://arxiv.org/abs/1909.10649},
  year    = {2019}
}
```