Portuguese-NLP

A list of resources and tools developed with a focus on Portuguese.

Datasets

#PraCegoVer - multi-modal dataset with Portuguese captions based on posts from Instagram.
18th-century Portuguese medical texts
AG_news pt - automatic translation of the AG's corpus of news articles.
Alpaca data pt-br - Stanford Alpaca dataset translated into Brazilian Portuguese using the Helsinki-NLP/opus-mt-tc-big-en-pt model.
AspectBR - Aspect-based annotated dataset of web consumer reviews.
ASSIN - a dataset with semantic similarity score and entailment annotations. (HuggingFace)
ASSIN 2 - follow-up to ASSIN. (HuggingFace)
Automated Essay Score (AES) ENEM Dataset - Benchmark for automatic essay scoring in Portuguese (HuggingFace)
Aya Dataset PT - CohereForAI Aya Dataset filtered for Portuguese (PT).
BlogSet-BR - a collection of posts gathered from the Blogspot platform, written by Brazilian users.
BLUEX - A benchmark based on Brazilian Leading Universities Entrance eXams.
BoolQ - automatic translation of BoolQ.
br-quad-2.0 - Stanford Question Answering Dataset (SQuAD) 2.0 translated into Brazilian Portuguese (PT-BR).
Brands.Br - a Portuguese Reviews Corpus
Brazilian Court Decisions - collection of 4043 Ementa (summary) court decisions and their metadata from the Tribunal de Justiça de Alagoas (TJAL), the State Supreme Court of Alagoas (Brazil).
Brazilian E-Commerce - Brazilian E-Commerce Public Dataset by Olist store.
Brazilian Headlines Sentiments - Dataset containing sentiment analysis of Brazilian news agencies headlines.
Brazilian Portuguese Literature Corpus - 3.7 million word corpus of Brazilian literature published between 1840-1908.
Brazilian Portuguese Narrative Essays Dataset - Dataset for Automatic Essay Scoring of Brazilian Portuguese Narrative Essays.
Brazilian Portuguese Sentiment Analysis Datasets.
Brazilian TCU's judgments - Judgments of Federal Court of Accounts - Brazil (TCU).
BrWaC - Brazilian Portuguese Web as Corpus.
BrWac2Wiki - a dataset for multi-document summarization in Portuguese.
B2W-Reviews01 - product reviews.
Canarim - A Large-Scale Dataset of Web Pages in the Portuguese Language (huggingface)
Carolina - Corpus Geral do Português Brasileiro Contemporâneo (huggingface).
Capes - parallel corpus of theses and dissertations abstracts in English and Portuguese.
CC100-Portuguese - Created by Conneau & Wenzek et al. in 2020. This dataset is one of the 100 monolingual corpora processed from the January-December 2018 Commoncrawl snapshots from the CC-Net repository.
CETENFolha - news from the newspaper Folha de S. Paulo.
CHAVE - collection for Information Retrieval and Question Answering.
CINTIL Corpus - a linguistically interpreted corpus of Portuguese.
ClinicalNER - Clinical Named Entity Recognition in Portuguese.
Complexidade Textual para Estágios Escolares do Sistema Educacional Brasileiro.
CORAA - dataset for Automatic Speech Recognition.
CORAA SER - Emotion Recognition from Brazilian Portuguese Informal Spontaneous Speech.
CrawlPT_dedup - CrawlPT (deduplicated) is composed of three corpora: brWaC, CC100-PT, and OSCAR-2301.
CSTNews - a corpus with 50 clusters of news texts with their multi-document summaries, as well as several discourse and semantic annotations.
C-ORAL-BRASIL - This project is dedicated to the study of Brazilian Portuguese spontaneous speech and, more broadly, to the compilation of spoken corpora.
DANTEStocks - Corpus of stock market tweets written in Brazilian Portuguese and annotated with named entities according to HAREM's taxonomy.
DEEPAGÉ - Answering Questions in Portuguese about the Brazilian Environment.
DNLT-BP - Datasets of Neuropsychological Language Tests in Brazilian Portuguese.
ENEM Challenge - Consists of the writing of an essay and an objective part containing 180 multiple choice questions.
ENEM-2022 and ENEM-2023 - These projects encompass all multiple-choice questions from the last two editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.
Essay-BR - a corpus of essays for the Brazilian Portuguese language.
Extended Essay-BR - Extended version of the Essay-BR corpus.
FACTCK.BR - A dataset to study Fake News in Portuguese.
FactNews - dataset to predict sentence-level factuality of news reporting.
fake voices - deepfakes in Brazilian Portuguese created with the XTTS model.
Fake.Br - aligned true and fake news written in Brazilian Portuguese (HuggingFace).
Central_de_fatos - (Huggingface).
FakeNewsSet - (HuggingFace).
Fakepedia-Corpus - fake news dataset.
FakeRecogna - dataset comprised of real and fake news (Huggingface).
FakeWhatsApp.Br - An annotated Corpus of WhatsApp messages in PT-BR for automatic detection of textual misinformation.
FKTC - FaKe news Text Collections.
Floresta Sintá(c)tica - treebank for Portuguese.
HAREM first - evaluation contest for named entity recognizers in Portuguese.
HAREM second - evaluation contest for named entity recognizers in Portuguese.
HateBR - large-scale expert annotated corpus of Brazilian Instagram comments for hate speech and offensive language detection on the web and social media.
Historical Portuguese Corpora - tools and resources for manipulation of historical corpora and management of historical dictionaries.
IMDB pt - automatic translation of IMDB.
Iudicium Textum Dataset - contains legal documents created by Brazilian Federal Supreme Court in its integral composition (paper).
LeNER-Br - a Dataset for Named Entity Recognition in Brazilian Legal Text.
LegalPT_dedup - LegalPT (deduplicated) aggregates the maximum amount of publicly available legal data in Portuguese.
Lex2Kids - lexicon in Portuguese most heard by children.
Mac-Morpho - Brazilian Portuguese texts annotated with part-of-speech tags.
MilkQA - a dataset of dense questions for the task of answer selection.
Minutes of Central Bank of Brazil - Minutes of the Monetary Policy Committee of the Central Bank of Brazil.
NER in Brazilian Portuguese tweets - Twitter messages in pt-br annotated for the entities PER, LOC and ORG.
NERDE - Documents from CADE's jurisprudence annotated for the entities ORG, PER, TEMPO, LOC, LEG (legislation), DOCS (documents), VALOR.
News-Crawl-PT - Monolingual News Crawl used for WMT.
News of the site Folha de São Paulo - news of the Brazilian Newspaper Folha de São Paulo.
News published in Brazil - news compilation of the Globo group.
OAB exams - Brazilian version of the BAR exam (USA) (HuggingFace).
Parallel Corpora from Revista Pesquisa FAPESP - Portuguese-English and Portuguese-Spanish bilingual collections of the online issues of the scientific news Brazilian magazine Revista Pesquisa FAPESP.
NURC-SP
Pirá - A Bilingual Portuguese-English Dataset for Question-Answering about the Ocean.
PL-corpus - part of the UlyssesNER-Br, a corpus of Brazilian Legislative Documents for NER with quality baselines.
PLUE - Portuguese translation of the GLUE benchmark and Scitail dataset.
POeTiSA - POrtuguese processing - Towards Syntactic Analysis and parsing.
politiquices - Datasets related with the politiquices.pt project.
PorSimplesSent - a corpus of aligned sentence pairs for investigating sentence readability assessment.
PortiLexicon-UD - a lexicon for Brazilian Portuguese according to Universal Dependencies.
Portuguese-Hate-Speech-Dataset - Portuguese dataset for hate speech detection composed of 5,668 tweets with binary annotations (i.e. 'hate' vs. 'no-hate') (HuggingFace)
Portuguese Legal Sentences - Collection of Legal Sentences from the Portuguese Supreme Court of Justice.
Portuguese Presidential Elections - This dataset contains tweets and users mostly from the Portuguese Twittersphere.
PraCegoVer - multi-modal dataset containing images associated to Portuguese captions based on posts from Instagram.
Priberam Fine-Grained Opinion Corpus - a Portuguese fine-grained dependency opinion mining corpus.
Propbank - Contains instances annotated with semantic role labels (SRL).
Projeto ACDC - Internet Access to Corpora.
Puntuguese - A Corpus of Puns in Portuguese with Micro-editions (HuggingFace)
QA-Portuguese - Adaptation from MQA dataset Portuguese split (QA entailment pairs).
Quati - This dataset aims to support the development of Brazilian Portuguese (pt-br) Information Retrieval (IR) systems, providing document passages originally created in pt-br, as well as queries (topics) created by native speakers.
REBEL-Portuguese - Relation datasets built from Wikipedia.
ReLi - REsenha de LIvros (book reviews).
RePro - A Benchmark Dataset for Opinion Mining for Brazilian Portuguese. (HuggingFace)
Rhetalho - corpus annotated with Daniel Marcu's RSTTool.
SemClinBr - multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks.
SESAME - corpus for NER in Portuguese.
SIGARRA News Corpus - SIGARRA information system at the University of Porto.
SIMPLEX-PB - A Lexical Simplification Database and Benchmark for Portuguese.
SIMPLEX-PB-2.0 - improved version of SIMPLEX-PB.
SIMPLEX-PB-3.0 - new version of SIMPLEX-PB.
Spotify Subset - classifying language variations in Brazilian Portuguese
SQUAD-PT v1.1 - Portuguese translation of the SQuAD dataset.
SQUAD-PT v1.1-pt-br - Brazilian Portuguese translation of the SQuAD dataset, translated by Deep Learning Brasil.
SQUAD-PT v2.0 - Portuguese translation of SQuAD 2.0 dataset.
SST-2 pt - Automatic translation of the Stanford Sentiment Treebank.
TeMário - news texts and the corresponding human summaries for summarization purposes.
Textual Complexity Corpus - a textual complexity corpus for school stages in the Brazilian educational system.
ToLD-Br - Toxic Language Detection in Social Media for Brazilian Portuguese (github).
TTS-Portuguese Corpus - Text To Speech Portuguese.
TweetSentBR - Tweets in Brazilian Portuguese.
Tweets for Sentiment Analysis.
UD_Portuguese-Bosque - Universal Dependencies (UD) Portuguese treebank.
UD_Portuguese-CINTIL - Universal Dependencies (UD) Portuguese treebank.
UD_Portuguese-GSD - Universal Dependencies (UD) Portuguese treebank.
UD_Portuguese-PetroGold - Universal Dependencies (UD) Portuguese treebank.
UD_Portuguese-PUD - Universal Dependencies (UD) Portuguese treebank.
UlyssesNER-Br - Corpus of Brazilian Legislative Documents for Named Entity Recognition
UTLCorpus - a corpus of online reviews in Brazilian Portuguese annotated with helpfulness classification.
Winograd Schema Challenge - Solver for the Portuguese-based Winograd Schema Challenge.
WizardVicuna-PTBR-Instruct-Clean - Wizard Vicuna PT-Br Instruct Clean dataset.

Multilingual datasets

A Multilingual Dataset for Investigating Stereotypes and Negative Attitudes Towards Migrant Groups in Large Language Models
askD - ELI5 dataset adapted on Medical Questions (AskDocs) subreddit.
English-Portuguese Sentences - English-Portuguese Sentences from the Tatoeba Project.
EUR-Lex - multilingual corpus in all the official languages of the European Union.
Europarl - European Parliament Proceedings Parallel Corpus 1996-2011.
Europarl-ST - Multilingual Speech Translation Corpus that contains paired audio-text samples for Speech Translation, constructed from the debates held in the European Parliament between 2008 and 2012.
mc4 - a multilingual, colossal, cleaned version of Common Crawl's web crawl corpus.
mfaq - multilingual corpus of Frequently Asked Questions parsed from the Common Crawl.
MKQA - Multilingual Knowledge Questions & Answers (github).
MQA - multilingual corpus of Questions and Answers (MQA) parsed from the Common Crawl.
MMARCO - Multilingual version of the MS MARCO passage ranking dataset.
mRobust - Multilingual version of the TREC 2004 Robust passage ranking dataset
MultiCoNER - a large multilingual dataset for Named Entity Recognition.
MuST-C - multilingual speech translation corpus.
OpenSubtitles - collection of translated movie subtitles.
OSCAR - Open Super-large Crawled Aggregated coRpus.
Tatoeba - a large database of sentences and translations.
TED2020 - contains a crawl of nearly 4000 TED and TED-X transcripts from July 2020.
TSAR-2022-Shared-Task - TSAR2022 Shared Task on Lexical Simplification.
WikiANN - multilingual named entity recognition dataset consisting of Wikipedia articles annotated with LOC (location), PER (person), and ORG (organisation) tags in the IOB2 format.
WikiLingua - Multilingual abstractive summarization dataset extracted from WikiHow.
WikiMatrix - Parallel Sentences in 1620 Language Pairs from Wikipedia.
Wikiner - Learning multilingual named entity recognition from Wikipedia.
WikiNEuRal - Combined Neural and Knowledge-based Silver Data Creation for Multilingual NER (EMNLP 2021).
Wikipedia - Wikipedia dataset containing cleaned articles of all languages.
XFORMAL - A Benchmark for Multilingual Formality Style Transfer.
XLSUM - 1.35 million professionally annotated article-summary pairs from BBC.

Lexicon

BATS-PT - manual translation of the lexicographic portion of the Bigger Analogy Test Set (BATS) to Portuguese
br.ispell - Ispell dictionary for Brazilian Portuguese (github).
Conceptnet - an open, multilingual knowledge graph.
DicSin - Dictionary of synonyms and antonyms.
lexiconPT - R package that provides lexicons for Portuguese Text Analysis.
lexicons - Dictionaries of names, surnames, acronyms and their extensions, stop-words, etc.
LIWC - Linguistic Inquiry and Word Count (dictionary)
Onto.PT - Lexical Ontology for Portuguese.
OpenWordnet-PT - an open access wordnet for Portuguese (site).
OpLexicon - a sentiment lexicon for the Portuguese language.
palavras - Word list of Brazilian Portuguese.
PAPEL.
pt-br - Wordlist, verbs, conjugations, term frequencies.
PT-LKB - Large Portuguese Lexical-Semantic Knowledge Base
PULO - Portuguese Unified Lexical Ontology.
SentiLex-PT - a sentiment lexicon for Portuguese.
Stopwords - Portuguese stopwords collection.
Tep2.
Unitex-PB - lexical resources.
VaLexPB - a lexicon of Brazilian Portuguese verb valences.
VerbNet.Br 1.0 - verbal lexicon of Brazilian Portuguese.
wikidict-dsl-pt - Wikidata Bilingual DSL Dictionaries.
Wordnetaffectbr - vocabulary of emotion words.
Wordnet.Br - Portuguese WordNet.

Models

Albertina PT-BR - It is an encoder of the BERT family for the Portuguese language - the American variant from Brazil.
Albertina PT-PT - It is an encoder of the BERT family for the Portuguese language - the European variant from Portugal.
Alpaca-LoRA-PTBR - Low-Rank LLaMA Instruct-Tuning.
BART - BART pre-trained on Portuguese.
BERTimbau - BERTimbau Base is a pretrained BERT model for Brazilian Portuguese that achieves state-of-the-art performances on three downstream NLP tasks: Named Entity Recognition, Sentence Textual Similarity and Recognizing Textual Entailment (Github).
BioBERTpt - fine-tuned BERT models trained on the clinical domain for Portuguese language (Github).
Cabrita - A Portuguese instruction-finetuned LLaMA (Github).
DeBERTinha - A DeBERTa V3 XSmall adapted to the Brazilian Portuguese language (Github).
Electra - Electra model trained on BRWAC.
Gervasio-PT-BR - It is a decoder of the GPT family for the Portuguese language - the American variant from Brazil.
Gervasio-PT-PT - It is a decoder of the GPT family for the Portuguese language - the European variant from Portugal.
GlórIA 1.3B - A European Portuguese-focused Large Language Model (HuggingFace)
GPT2 small - GPorTuguese-2 (Portuguese GPT-2 small) is a state-of-the-art language model for Portuguese based on the GPT-2 small model.
GPT-Neo small - a version of GPT-Neo 125M by EleutherAI finetuned for the Portuguese language.
GPT2-Bio-PT - a biomedical finetuned version of GPorTuguese-2 (Github).
NERDE-base - BERTimbau finetuned for NER on Judicial Documents.
roberta-pt-br
RoBERTaCrawlPT-base - a generic Portuguese Masked Language Model pretrained from scratch on the CrawlPT corpora.
RoBERTaLexPT-base - Portuguese Masked Language Model pretrained from scratch from the LegalPT and CrawlPT corpora
Sabiá - Sabiá-7B is a Portuguese language model developed by Maritaca AI.
Sabiá 2 - Language model trained on Portuguese text, especially in the Brazilian domain.
T5 - T5 model on Brazilian Portuguese data.
tgf-xlm-roberta-base-pt-br (Github)
Wav2vec - Fine-tuned facebook/wav2vec2-large-xlsr-53 on Portuguese using the train and validation splits of Common Voice 6.1.

Multilingual Models

Bloom - BigScience Large Open-science Open-access Multilingual Language Model.
mBert - Pretrained model on the top 104 languages with the largest Wikipedia using a masked language modeling (MLM) objective.
mDeBERTa
mGPT - Multilingual GPT model. An autoregressive GPT-like model.
mMiniLM - mMiniLM-L6-v2 Reranker finetuned on mMARCO
mT5 - Multilingual T5. A massively multilingual pre-trained text-to-text transformer.
XLM-RoBERTa - XLM-RoBERTa model pre-trained on 2.5TB of filtered CommonCrawl data containing 100 languages.
LaBSE - Language-agnostic BERT Sentence Encoder (LaBSE) is a BERT-based model trained for sentence embedding for 109 languages.

Word Embeddings

fastText - Multi-lingual word vectors.
LASER - Language-Agnostic SEntence Representations.
NILC-Embeddings - Word embeddings trained in Portuguese by USP.
MUSE - Multilingual Unsupervised and Supervised Embeddings.
word vectors - Pre-trained word vectors of 30+ languages.

Metrics

Coh-Metrix-Port - an adaptation of the Coh-Metrix text analysis tool to the Brazilian Portuguese language.
NILC-Metrix - gathers the metrics developed over more than a decade at the NILC Lab.

Leaderboards

Open PT LLM Leaderboard - aims to provide a benchmark for the evaluation of Large Language Models (LLMs) in the Portuguese language across a variety of tasks and datasets.

Frameworks

Institutions

Brasileiras em PLN.
HAILab-PUCPR - A pioneering research group aiming to develop solutions for health care using Natural Language Processing and Machine Learning.
Linguateca.
NILC.
NLPortuguês - Devoted to creating NLP courses in Brazilian Portuguese.
NLX-Group.
PLN PUCRS.

Tools

Apertium-por - Apertium linguistic data for Portuguese.
Autocorrect - Spelling corrector in Python.
BrGram - Computational grammar fragment of Brazilian Portuguese in the LFG formalism implemented in XLE.
Dicio API - Portuguese dictionary API.
dict-pt-br - dictionary for Brazilian Portuguese.
Languagetool - Style and Grammar Checker for 25+ Languages.
LegalNLP - Natural Language Processing Methods for the Brazilian Legal Language.
LexML Parser - parser for legal documents.
LX parser - statistical constituency parser for Portuguese.
metaphone-ptbr - Metaphone algorithm for the Portuguese language.
mlconjug3 - a Python library to conjugate verbs in Portuguese and other languages.
MorphoBr - Resources for morphological analysis of Portuguese.
OpCluster - Automatic extraction and clustering of fine-grained opinions.
Phonemizer - Simple text to phones converter for multiple languages.
PorGram - Open source computational grammar for Portuguese in the HPSG formalism.
pymetaphone-br - Metaphone algorithm package for the Portuguese language.
pysentimiento - Multilingual toolkit for Sentiment Analysis and Social NLP tasks.
pyspellchecker - Multilingual Spell Checking.
RBAMR - A Rule-Based AMR Parser for Portuguese.
Verbecc - Complete Conjugation of any Verb using Machine Learning for French, Spanish, Portuguese, Italian and Romanian.

Other lists

Annotated Semantic Relationships Datasets
Linguistic datasets - Linguistic Datasets for Portuguese.
NER-datasets for Portuguese
NILC
NILC 2
NILC 3
Opinando - Opinion Mining for Portuguese.
Portuguese dataset List

ajdavidl/Portuguese-NLP
