A sentiment analysis corpus for Jopara (Guarani-dominant text mixed with Spanish).
We created two corpora: one balanced and one unbalanced. The balanced corpus (1,526 tweets) was derived from the unbalanced one (3,941 tweets).
We release tweet IDs only; the content can be rehydrated (i.e., the full tweets requested) using the Twitter APIs. See Redistribution of Twitter content. Consider hydrating the IDs back into full datasets with a tool such as DocNow's Hydrator.
Go to the Harvard Dataverse repository.
If you are curious about how we downloaded these particular tweets, go to this GitHub repository (see also the tweets-downloader repository).
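For reference, hydration from the released IDs might look like the following Python sketch using DocNow's twarc library. The file names and the placeholder credentials are assumptions; any hydration tool works equally well.

```python
import json
from twarc import Twarc  # DocNow's hydration library: pip install twarc

# Placeholder Twitter API credentials; replace with your own keys.
t = Twarc("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")

# tweet_ids.txt (assumed name): one tweet ID per line, as distributed with the corpus.
with open("tweet_ids.txt") as ids, open("tweets.jsonl", "w") as out:
    for tweet in t.hydrate(ids):  # fetches full tweet objects in batches
        out.write(json.dumps(tweet) + "\n")
```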
Traditional machine learning baselines: (Complement) Naïve Bayes and SVMs for unbalanced datasets.
Go to the GitHub repository.
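As a rough illustration of these baselines (not the repository's exact code), a scikit-learn pipeline could look like the sketch below; the CSV path and column names are assumptions about how you stored the hydrated tweets.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import ComplementNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical CSV with hydrated tweets: columns "text" and "label".
df = pd.read_csv("jopara_unbalanced.csv")

# Word n-gram TF-IDF features; character n-grams are another common
# choice for code-switched text.
features = TfidfVectorizer(ngram_range=(1, 2), min_df=2)

for clf in (ComplementNB(), LinearSVC()):
    pipe = make_pipeline(features, clf)
    scores = cross_val_score(pipe, df["text"], df["label"],
                             cv=5, scoring="f1_macro")
    print(type(clf).__name__, scores.mean())
```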
Neural models: BiLSTM-CNN and CNN-BiLSTM with character and word embeddings, as well as pre-trained non-contextualized representations (Spanish, Guarani, and multilingual): FastText word vectors and BPEmb subword vectors.
Go to the Google Colab notebook.
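For orientation, a minimal Keras sketch of the BiLSTM-CNN variant is shown below; all hyperparameters are illustrative rather than the paper's settings, and the CNN-BiLSTM variant simply swaps the order of the convolutional and recurrent stages.

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # illustrative values, not the paper's configuration
EMB_DIM = 300        # matches FastText/BPEmb dimensionality if you load them
NUM_CLASSES = 3      # e.g., positive / negative / neutral

# BiLSTM-CNN: recurrent encoding first, then convolution over the hidden
# states and max-pooling. Pre-trained FastText or BPEmb vectors would be
# injected via the Embedding layer's embeddings_initializer.
model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMB_DIM),
    layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
    layers.Conv1D(128, kernel_size=3, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```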
Fine-tuning: Spanish BERT (BETO), Multilingual BERT (102 languages), and XLM (15 languages); none of these language models considered Guarani during pre-training.
Go to the Google Colab notebook.
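A fine-tuning run with Hugging Face transformers might look like the following sketch. The public checkpoint names are the usual ones for these models, but the CSV files, column names, and training hyperparameters are assumptions, not the notebook's exact setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Public checkpoints; swap in the one you want to fine-tune:
#   BETO:  "dccuchile/bert-base-spanish-wwm-cased"
#   mBERT: "bert-base-multilingual-cased"
#   XLM:   "xlm-mlm-xnli15-1024"
checkpoint = "dccuchile/bert-base-spanish-wwm-cased"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint,
                                                           num_labels=3)

# Hypothetical CSVs with "text" and "label" columns built from the corpus.
ds = load_dataset("csv", data_files={"train": "train.csv", "dev": "dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

ds = ds.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=3,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args,
        train_dataset=ds["train"], eval_dataset=ds["dev"]).train()
```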
Please cite the paper On the logistical difficulties and findings of Jopara Sentiment Analysis:
Marvin Agüero-Torales, David Vilares, and Antonio López-Herrera. 2021. On the logistical difficulties and findings of Jopara Sentiment Analysis. In Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching (CALCS 2021, co-located with NAACL 2021), pages 95–102, Online. Association for Computational Linguistics.
@inproceedings{aguero-torales-etal-2021-logistical,
title = "On the logistical difficulties and findings of Jopara Sentiment Analysis",
author = {Ag{\"u}ero-Torales, Marvin and
Vilares, David and
L{\'o}pez-Herrera, Antonio},
booktitle = "Proceedings of the Fifth Workshop on Computational Approaches to Linguistic Code-Switching",
month = jun,
year = "2021",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.calcs-1.12",
doi = "10.18653/v1/2021.calcs-1.12",
pages = "95--102",
abstract = "This paper addresses the problem of sentiment analysis for Jopara, a code-switching language between Guarani and Spanish. We first collect a corpus of Guarani-dominant tweets and discuss on the difficulties of finding quality data for even relatively easy-to-annotate tasks, such as sentiment analysis. Then, we train a set of neural models, including pre-trained language models, and explore whether they perform better than traditional machine learning ones in this low-resource setup. Transformer architectures obtain the best results, despite not considering Guarani during pre-training, but traditional machine learning models perform close due to the low-resource nature of the problem.",
}