Align two embeddings (EN - FR) using MUSE (Unsupervised)

Multilingual-Unsupervised-Embeddings

The main goal of this project is to share some insights and feedback on aligning two different word embeddings and then translating words from English to French using MUSE (an unsupervised approach).

First, we build a word embedding from the Reuters dataset using fastText. Then we choose the French fastText Wikipedia embedding as the target.
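For context, fastText ships its embeddings in the word2vec text format (a header line with vocabulary size and dimension, then one word per line followed by its vector components). A minimal loader sketch (`load_vec` is a hypothetical helper, not part of this repo):

```python
import numpy as np

def load_vec(path, max_words=None):
    """Load a fastText .vec file (word2vec text format).

    The first line holds "<vocab_size> <dim>"; each following line
    holds a word and its `dim` space-separated float components.
    """
    words, vecs = [], []
    with open(path, encoding="utf-8") as f:
        _, dim = map(int, f.readline().split())
        for i, line in enumerate(f):
            if max_words is not None and i >= max_words:
                break
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vecs.append(np.array(parts[1:], dtype=np.float32))
    emb = np.vstack(vecs)
    assert emb.shape[1] == dim  # every row must match the declared dimension
    return words, emb
```

`max_words` is useful in practice, since wiki.en.vec holds millions of vectors and experiments here only need the most frequent ones.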

In the data directory, you will find the Reuters dataset and its word embeddings. Note that we used embeddings of dimension 300. After aligning these two embeddings, we found that the performance is very poor: the accuracy is 0.13 over 734 words using the 10-nearest-neighbor criterion.
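This kind of evaluation can be sketched as follows (a simplified illustration with hypothetical names, not MUSE's own evaluation code): a source word counts as correct if any of its k nearest target neighbors, by cosine similarity, is a gold translation.

```python
import numpy as np

def precision_at_k(src, tgt, gold, k=10):
    """Fraction of source words whose gold translation appears among
    the k nearest target neighbors by cosine similarity.

    src:  (n, d) aligned source vectors; row i queries gold[i]
    tgt:  (m, d) target vectors
    gold: list of n sets of correct target row indices
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sims = src @ tgt.T                       # (n, m) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # k best target rows per query
    hits = sum(bool(set(row) & g) for row, g in zip(topk, gold))
    return hits / len(gold)
```

With k=10 this matches the looser criterion reported above; k=1 gives the stricter 1-nearest-neighbor accuracy.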

This was expected, since the Reuters dataset contains only 33,995 words and this method relies on word co-occurrence. You can find all the details in the log file.

So we decided to replace the source embedding with wiki.en.vec, keeping the same target. The result was good: the accuracy was 0.786 over 1500 words using the 1-nearest neighbor, and using the CSLS* metric raises it to 0.822. (log file)

CSLS*: Cross-Domain Similarity Local Scaling is a metric designed to mitigate the hubness problem of nearest-neighbor retrieval and obtain a better mapping, read more
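A minimal sketch of the CSLS score (as defined in the MUSE paper): CSLS(x, y) = 2·cos(x, y) − r_T(x) − r_S(y), where r_T(x) is the mean cosine of x to its k nearest target neighbors and r_S(y) the mean cosine of y to its k nearest source neighbors. Penalizing words that sit in dense neighborhoods ("hubs") discourages them from being returned for every query.

```python
import numpy as np

def csls_scores(src, tgt, k=10):
    """CSLS(x, y) = 2*cos(x, y) - r_T(x) - r_S(y).

    r_T(x): mean cosine of x to its k nearest target neighbors.
    r_S(y): mean cosine of y to its k nearest source neighbors.
    Returns an (n, m) score matrix for n source and m target vectors.
    """
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    cos = src @ tgt.T                                        # (n, m)
    k_t = min(k, tgt.shape[0])
    k_s = min(k, src.shape[0])
    r_src = np.mean(np.sort(cos, axis=1)[:, -k_t:], axis=1)  # r_T(x), (n,)
    r_tgt = np.mean(np.sort(cos, axis=0)[-k_s:, :], axis=0)  # r_S(y), (m,)
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```

Translation with CSLS then just takes the argmax over each row of this score matrix instead of the raw cosine row.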

After obtaining the mapping, we can translate words using translate_wiki.py, and it is also possible to extract the best dictionary. Both approaches rely on the 1-nearest neighbor.
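The 1-nearest-neighbor translation step can be sketched like this (a hypothetical `translate` helper, not the actual translate_wiki.py code): look up the query's aligned vector and return the target word with the highest cosine similarity to it.

```python
import numpy as np

def translate(words_src, emb_src, words_tgt, emb_tgt, query):
    """Translate `query` by returning the target word whose vector is
    the 1-nearest neighbor (by cosine) of the query's aligned vector."""
    x = emb_src[words_src.index(query)]
    x = x / np.linalg.norm(x)
    t = emb_tgt / np.linalg.norm(emb_tgt, axis=1, keepdims=True)
    return words_tgt[int(np.argmax(t @ x))]
```

Building a dictionary is the same operation applied to every source word in the vocabulary.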

You can download the embeddings used in this experiment from here
