Sense-specific word embeddings for Portuguese

Implementation of Sense-specific word embeddings for Portuguese

This repository contains the preprocessing and evaluation scripts used in the paper Sense-specific word embeddings for Portuguese. The preprocessing script cleans a corpus, tokenizes it, and splits it into sentences. The evaluation scripts can be used to measure the representativeness of a sense embedding model.

About the paper

Paper can be read:

Trained embeddings models

Abstract

Word embeddings are numerical vectors which can represent words or concepts in a low-dimensional continuous space. These vectors are able to capture useful syntactic and semantic information, such as regularities in natural language. Although very useful in many applications, the traditional approaches for generating word embeddings like Word2Vec, GloVe, Wang2Vec and FastText have a strict drawback: they produce a single vector representation for a given word ignoring the fact that ambiguous words can assume different meanings for which different vectors should be generated. This mixture of meanings can be a problem for several applications. For example, in a language understanding task, by using the embedding of an ambiguous word like the Portuguese word banco (bank), all possible meanings of it -- such as financial institution, blood bank, or an item of furniture -- will be mixed in a single numerical vector, causing a wrong semantic interpretation of the sentence in which it occurs. In this paper we present the first experiments carried out for generating sense-specific word embeddings for Portuguese, in which, instead of word occurrences, word senses are represented in sense vectors. Our experiments show that sense vectors outperform traditional word vectors in syntactic and semantic tasks, proving that the language resource generated here can improve the performance of NLP tasks in Portuguese.

Installation

virtualenv venv -p python3
source venv/bin/activate
pip install -r requirements.txt
python -m spacy download pt

Usage

Trained embeddings models

Download the pre-trained sense vectors and add them to the models folder.

Preprocessing text file (in order to train embedding models)

Script used for cleaning corpus

All emails are mapped to an EMAIL token. All numbers are mapped to a 0 token. All URLs are mapped to a URL token. Different quote characters are standardized. Different hyphen characters are standardized. HTML strings are removed. All text between brackets is removed. All sentences shorter than 5 tokens are removed.

python preprocessing.py <input_file.txt> <output_file.txt>
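The cleaning rules listed above can be illustrated with a small sketch. This is not the code in preprocessing.py: the regular expressions, the function names `clean_line` and `keep_sentence`, the choice to treat "brackets" as square brackets only, and whitespace tokenization for the length filter are all assumptions made for illustration.

```python
import re

# Illustrative patterns only; the real rules live in preprocessing.py.
HTML_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
BRACKET_RE = re.compile(r"\[[^\]]*\]")  # assumption: square brackets only
QUOTES_RE = re.compile("[\u201c\u201d\u201e\u00ab\u00bb]")
HYPHEN_RE = re.compile("[\u2010\u2011\u2012\u2013\u2014]")
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")


def clean_line(line):
    """Apply the cleaning rules to one line of raw text."""
    line = HTML_RE.sub(" ", line)       # drop HTML tags
    line = URL_RE.sub("URL", line)      # URLs -> URL token
    line = EMAIL_RE.sub("EMAIL", line)  # emails -> EMAIL token
    line = BRACKET_RE.sub(" ", line)    # drop bracketed text
    line = QUOTES_RE.sub('"', line)     # standardize quotes
    line = HYPHEN_RE.sub("-", line)     # standardize hyphens
    line = NUMBER_RE.sub("0", line)     # numbers -> 0 token
    return " ".join(line.split())       # collapse whitespace


def keep_sentence(sentence, min_tokens=5):
    """Keep only sentences with at least min_tokens whitespace tokens."""
    return len(sentence.split()) >= min_tokens
```

In this sketch URLs are replaced before numbers, so digits inside a URL do not turn into stray 0 tokens. The order used by the actual script may differ.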

Annotate the corpus with PoS tags using the nlpnet tool

python postagging.py <input_folder.txt> <output_folder.txt>

Syntactic and Semantic analogies evaluation

This method is similar to the one developed by nlx-group.

python analogies.py -m <embedding_model.txt> -t <testset.txt> -r
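For readers unfamiliar with analogy evaluation, the sketch below shows one common way to score four-word analogies (a : b :: c : d) with vector arithmetic, often called 3CosAdd. It is an assumption that analogies.py works along these lines; the function names `solve_analogy` and `analogy_accuracy`, and the in-memory dictionary of vectors, are hypothetical and used only for illustration.

```python
import numpy as np


def solve_analogy(vectors, a, b, c):
    """Return the word d that best completes a : b :: c : d (3CosAdd)."""
    words = list(vectors)
    matrix = np.array([vectors[w] for w in words], dtype=float)
    # Unit-normalize rows so dot products rank candidates by cosine.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}
    target = matrix[index[b]] - matrix[index[a]] + matrix[index[c]]
    scores = matrix @ target
    # The three question words are excluded from the candidates.
    for w in (a, b, c):
        scores[index[w]] = -np.inf
    return words[int(np.argmax(scores))]


def analogy_accuracy(vectors, questions):
    """Accuracy over questions whose four words are all in the vocabulary."""
    answered = [q for q in questions if all(w in vectors for w in q)]
    if not answered:
        return 0.0
    correct = sum(solve_analogy(vectors, a, b, c) == d
                  for a, b, c, d in answered)
    return correct / len(answered)


# Toy vectors, invented for this example.
toy = {
    "homem": [1, 0, 0], "mulher": [0, 1, 0],
    "rei": [1, 0, 1], "rainha": [0, 1, 1], "casa": [0, 0, 1],
}
print(solve_analogy(toy, "homem", "mulher", "rei"))  # rainha
```

Questions containing out-of-vocabulary words are skipped here. Whether the actual script skips them or counts them as errors is not stated in this README.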

Brazilian Portuguese testsets

Only syntactic analogies

python analogies.py -m <embedding_model.txt> -t datasets/analogies/testset/LX-4WAnalogiesBr_syntactic.txt -r

Only semantic analogies

python analogies.py -m <embedding_model.txt> -t datasets/analogies/testset/LX-4WAnalogiesBr_semantic.txt -r

All analogies

python analogies.py -m <embedding_model.txt> -t datasets/analogies/testset/LX-4WAnalogiesBr.txt -r

European Portuguese testsets

Only syntactic analogies

python analogies.py -m <embedding_model.txt> -t datasets/analogies/testset/LX-4WAnalogies_syntactic.txt -r

Only semantic analogies

python analogies.py -m <embedding_model.txt> -t datasets/analogies/testset/LX-4WAnalogies_semantic.txt -r

All analogies

python analogies.py -m <embedding_model.txt> -t datasets/analogies/testset/LX-4WAnalogies.txt -r

Semantic Similarity evaluation

Sentence Similarity

python evaluate.py <embedding_model.txt> --lang

Set the --lang parameter according to the Portuguese variant being evaluated.

Brazilian Portuguese

br

European Portuguese

pt
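As background, a simple baseline for sentence similarity with embeddings is to average the word vectors of each sentence, compare the two averages with cosine similarity, and correlate the predicted scores with gold scores. The sketch below illustrates that baseline only. It is an assumption that evaluate.py does something comparable; the helper names and the whitespace tokenization are hypothetical.

```python
import numpy as np


def sentence_vector(vectors, sentence, dim):
    """Average the vectors of the in-vocabulary tokens of a sentence."""
    tokens = [t for t in sentence.lower().split() if t in vectors]
    if not tokens:
        return np.zeros(dim)
    return np.mean([vectors[t] for t in tokens], axis=0)


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def pearson(predicted, gold):
    """Pearson correlation between predicted and gold similarity scores."""
    return float(np.corrcoef(predicted, gold)[0, 1])
```

With a sense embedding model, each token would first have to be mapped to one of its sense vectors. That step is left out of this sketch.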

Word Sense Disambiguation evaluation

python lexical_sample.py

