fasttextB_embeddings_300d.npy file #2

cairomo opened this issue Jun 16, 2021 · 9 comments

cairomo commented Jun 16, 2021

Hi, I am running the basic learner `./run_main.sh 0 DeepXML EURLex-4k 0 108` and everything is going fine, except I don't have the fastText embeddings file.

The error output is `Embedding File not found. Check path or set 'init' to null`. Where/how was the .npy embeddings file created? Is it from the pretrained word vectors on fastText's website?

Would appreciate any info to illuminate this issue! Thanks.

kunaldahiya (Collaborator) commented:

Hi,

Thanks for trying out DeepXML. In general, the embedding files are created using the pre-trained model available on fastText's website.

You can use the following link to download the embedding file for EURLex-4K: https://owncloud.iitd.ac.in/nextcloud/index.php/s/5XsZAKLbHfbpfZA

Please let me know if you need anything else.


cairomo commented Jun 16, 2021

Thanks for your insight! Does that mean there are different embeddings for every dataset?

The way I tried to generate the embedding files was, for example, reading wiki.en.vec in as a 0-dimensional NumPy array and then saving that to a .npy file. It didn't give the same results as the embedding file that you shared; what did you do differently?

kunaldahiya (Collaborator) commented:

The embedding file in our case contains a V x D matrix, where V is the vocabulary size and D is the embedding dimensionality. In other words, there is a vector for each token in the dataset. So, the embedding file would be different for each dataset, as the vocabulary will be different.

We use the fastText model to compute an embedding for each token in the vocabulary, which is then passed to our model.
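A minimal sketch of that process, assuming the `fasttext` Python package, a pre-trained `.bin` model from fastText's website, and a plain-text vocabulary file with one token per line (the file names here are placeholders, not the repo's actual layout):

```python
import numpy as np
import fasttext

# Load a pre-trained fastText model; "wiki.en.bin" is a placeholder path.
model = fasttext.load_model("wiki.en.bin")

# Dataset vocabulary, one token per line; "vocabulary.txt" is a placeholder.
with open("vocabulary.txt", encoding="utf-8") as f:
    vocab = [line.strip() for line in f]

# Build the V x D matrix with one row per token. get_word_vector() also
# covers out-of-vocabulary tokens via fastText's subword n-grams.
embeddings = np.stack([model.get_word_vector(w) for w in vocab])
assert embeddings.shape == (len(vocab), model.get_dimension())

np.save("fasttextB_embeddings_300d.npy", embeddings)
```

Since V depends on the dataset's vocabulary, this also shows why each dataset gets its own embedding file.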


cairomo commented Jun 23, 2021

Thanks for the clarification. So what I ended up doing was something like:

```python
model = fasttext.train_unsupervised(corpus_file, dim=dim)
vocab = model.words
# V x D array: one row per word in the vocabulary
embeddings = np.array([model.get_word_vector(word) for word in vocab])
```

Am I understanding your process correctly?

kunaldahiya (Collaborator) commented:

Hi,

I have added an example here which computes embeddings from a pre-trained fastText model. You are free to train your own model, provided your corpus is (i) large enough, and (ii) general English or relevant to the task.
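If you go the pre-trained route, the `fasttext.util` helper in the official Python package can download the published vectors; a sketch (note this fetches the Common Crawl model, cc.en.300.bin, rather than the wiki.en files mentioned earlier):

```python
import fasttext
import fasttext.util

# Download the pre-trained English model into the working directory,
# skipping the download if the file is already present.
fasttext.util.download_model("en", if_exists="ignore")

model = fasttext.load_model("cc.en.300.bin")
print(model.get_dimension())  # 300
```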

cairomo closed this as completed Jul 6, 2021
cairomo reopened this Jul 6, 2021

kunaldahiya commented Jul 6, 2021

Hi,

Please re-install pyxclib. The latest version contains the required files. See this link.

khatrimann commented:

Hey,
By any chance, is the .npy file still available with either of you? @kunaldahiya @cairomo
The file at the link above is missing.

kunaldahiya (Collaborator) commented:

> Hey, by any chance, is the .npy file still available with either of you? @kunaldahiya @cairomo The file at the link above is missing.

Hi,

You can follow this example to get embeddings for a given vocabulary: https://github.com/kunaldahiya/pyxclib/blob/master/xclib/examples/get_ftx_embeddings.py

kunaldahiya reopened this Nov 28, 2024
khatrimann commented:

> Hi,
>
> You can follow this example to get embeddings for a given vocabulary: https://github.com/kunaldahiya/pyxclib/blob/master/xclib/examples/get_ftx_embeddings.py

I tried to do this, but the EURLex release only had BoW feature files and not the text corpus.
