Multilingual-Unsupervised-Embeddings

The goal of this project is to share insights and feedback on aligning two different word embeddings and then translating words from English to French with MUSE (unsupervised approach).

First, we build a word embedding from the Reuters data set using fastText. We then choose the French fastText Wikipedia embedding as the target.

In the data directory, you will find the Reuters data set and its word embeddings. Note that we used a dimension of 300. After aligning these two word embeddings, we found that the performance was very poor: the accuracy is 0.13, based on 734 words using the 10 nearest neighbors.
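To make the "accuracy using the k nearest neighbors" figure concrete, here is a minimal sketch of a precision@k computation. It is not the evaluation code used by MUSE; the function names, the toy vectors, and the toy gold dictionary are all assumptions made for illustration, and it presumes both embeddings already live in the same space.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def precision_at_k(src_vecs, tgt_vecs, gold, k=10):
    """Fraction of source words whose k nearest target words contain a gold translation.

    src_vecs, tgt_vecs: dict mapping word -> vector (assumed already aligned)
    gold: dict mapping source word -> set of accepted target words (hypothetical format)
    """
    hits = 0
    evaluated = 0
    for src_word, accepted in gold.items():
        if src_word not in src_vecs:
            continue  # words missing from the embedding are not evaluated
        evaluated += 1
        query = src_vecs[src_word]
        ranked = sorted(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]), reverse=True)
        if accepted & set(ranked[:k]):
            hits += 1
    return hits / evaluated if evaluated else 0.0

# Toy data (made up): "dog" is only counted as correct once k is large enough.
src = {"cat": [1.0, 0.0], "dog": [0.0, 1.0]}
tgt = {"chat": [0.9, 0.1], "chien": [0.1, 0.9], "maison": [0.7, 0.7]}
gold = {"cat": {"chat"}, "dog": {"maison"}}

p_at_1 = precision_at_k(src, tgt, gold, k=1)
p_at_2 = precision_at_k(src, tgt, gold, k=2)
print(p_at_1, p_at_2)
```

A larger k is a more forgiving criterion, which is why a score obtained with 10 neighbors is not directly comparable to one obtained with 1 neighbor.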

This was expected, since the Reuters data set has only 33,995 words and this method relies on the co-occurrence of words. You can find all the details in the log file.

So we decided to replace the first embedding with wiki.en.vec and kept the same target. The result was good: the accuracy was 0.786, based on 1500 words using the 1-nearest neighbor. Using the CSLS* metric raises the accuracy to 0.822 (see the log file).

CSLS*: Cross-domain Similarity Local Scaling is a similarity measure developed to get a better mapping. It lowers the score of "hub" words that are close to many other words.
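As a rough illustration of the idea, the sketch below scores a mapped source vector x against each target vector y as 2·cos(x, y) − r_T(x) − r_S(y), where r_T(x) is the mean cosine similarity of x to its k nearest target vectors and r_S(y) is the mean similarity of y to its k nearest source vectors. The helper names and the toy vectors are assumptions for this example, and the code is a plain-Python sketch rather than the MUSE implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_top_k(query, candidates, k):
    """Mean cosine similarity between `query` and its k most similar candidates."""
    sims = sorted((cosine(query, c) for c in candidates), reverse=True)
    top = sims[:k]
    return sum(top) / len(top)

def csls_scores(x, src_vectors, tgt_vectors, k=10):
    """CSLS score between one mapped source vector x and every target vector."""
    r_t = mean_top_k(x, tgt_vectors, k)  # density of x's target neighborhood
    scores = []
    for y in tgt_vectors:
        r_s = mean_top_k(y, src_vectors, k)  # how "hub-like" y is for the source side
        scores.append(2 * cosine(x, y) - r_t - r_s)
    return scores

# Toy data (made up): the first target is a hub that sits close to every source vector.
sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [[1.0, 1.0], [0.64, -0.77]]
x = sources[0]

cos_scores = [cosine(x, y) for y in targets]
csls = csls_scores(x, sources, targets, k=2)

best_by_cosine = cos_scores.index(max(cos_scores))
best_by_csls = csls.index(max(csls))
print(best_by_cosine, best_by_csls)
```

In this toy setup plain cosine similarity selects the hub, while CSLS penalizes it and selects the other target. This is the effect that can improve retrieval accuracy on real embeddings.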

After obtaining the mapping, we can translate words using translate_wiki.py, and it is also possible to extract the best dictionary. Both approaches rely on the 1-nearest neighbor.
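The 1-nearest-neighbor lookup can be sketched as follows. This is not the code of translate_wiki.py. It assumes the two embeddings have already been aligned and saved in the fastText text format (a header line with the word count and dimension, then one word and its vector per line). The function names and the in-memory toy files are hypothetical.

```python
import io
import math

def read_vec(handle, max_words=None):
    """Parse the fastText text format: header 'count dim', then 'word v1 ... vd' per line."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) != dim:
            continue  # skip malformed lines
        vectors[word] = [float(v) for v in values]
        if max_words and len(vectors) >= max_words:
            break
    return vectors

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def translate(word, src_vecs, tgt_vecs):
    """Return the target word whose vector is closest to the source word's vector."""
    query = src_vecs[word]
    return max(tgt_vecs, key=lambda w: cosine(query, tgt_vecs[w]))

# Toy aligned embeddings held in memory; real files would be opened with open(path, encoding="utf-8").
src_file = io.StringIO("2 2\ncat 1.0 0.0\ndog 0.0 1.0\n")
tgt_file = io.StringIO("2 2\nchat 0.9 0.1\nchien 0.1 0.9\n")

en = read_vec(src_file)
fr = read_vec(tgt_file)
translations = {word: translate(word, en, fr) for word in en}
print(translations)
```

Applying `translate` to every source word, as in the last lines, yields a simple induced dictionary. For full Wikipedia embeddings a vectorized implementation would be needed, since this loop compares each query against every target word.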

You can download embeddings of this experiment from here

