Simple way to do word2vec arithmetic #5302

amueller · 2020-04-13T22:06:53Z

amueller
Apr 13, 2020

It would be cool to have a simple interface for similarity queries or arithmetic with word2vec. Related to #276.

gensim allows something like
w.most_similar(positive=['woman', 'king'], negative=['man'], topn=3)
which is not super easy with spacy.

The best I could come up with based on #276 is

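import spacy
from sklearn.metrics.pairwise import cosine_similarity

# Assumes a pipeline with word vectors is installed; the model name is only an example.
nlp = spacy.load("en_core_web_md")

# Candidate words: lowercase vocab entries that are reasonably frequent.
queries = [w for w in nlp.vocab if w.is_lower and w.prob >= -15]

def cos_sim(a, b):
    # cosine_similarity expects 2-D arrays and returns a 1x1 matrix; unwrap it to a float.
    return cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0, 0]

def most_similar_vec(vec, count=10):
    # Rank every candidate by similarity to vec and keep the top `count` words.
    by_similarity = sorted(queries, key=lambda w: cos_sim(w.vector, vec), reverse=True)
    return [w.orth_ for w in by_similarity[:count]]

vec = nlp('woman').vector + nlp('king').vector - nlp('man').vector
most_similar_vec(vec)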

Though I guess it's plausible to say this is out of scope for spacy.
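
Editor's sketch: the same query can be written as a single matrix product instead of one scikit-learn call per word, which matters once the vocabulary is large. This is a minimal example under stated assumptions: the four-dimensional toy vectors and the names WORDS and VECTORS are invented for illustration, and with spaCy the matrix would have to be filled from nlp.vocab as in the snippet above.

```python
import numpy as np

# Toy vocabulary; the vectors are made up for illustration only.
WORDS = ["king", "queen", "man", "woman", "apple"]
VECTORS = np.array([
    [1.0, 1.0, 0.0, 0.0],  # king
    [1.0, 0.0, 1.0, 0.0],  # queen
    [0.0, 1.0, 0.0, 0.0],  # man
    [0.0, 0.0, 1.0, 0.0],  # woman
    [0.0, 0.0, 0.0, 1.0],  # apple
])

def most_similar_vec(vec, words=WORDS, vectors=VECTORS, count=3):
    """Return the `count` words with the highest cosine similarity to `vec`."""
    # Normalise every row and the query, so a dot product equals cosine similarity.
    unit_rows = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_vec = vec / np.linalg.norm(vec)
    sims = unit_rows @ unit_vec
    # argsort on the negated scores gives indices from most to least similar.
    best = np.argsort(-sims)[:count]
    return [words[i] for i in best]

index = {word: i for i, word in enumerate(WORDS)}
query = VECTORS[index["woman"]] + VECTORS[index["king"]] - VECTORS[index["man"]]
result = most_similar_vec(query)
print(result)
```

With these toy vectors the query lands exactly on "queen"; real embeddings are far less tidy.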

svlandeg · 2020-04-14T09:07:58Z

svlandeg
Apr 14, 2020
Maintainer

Hi @amueller, impeccable timing! @koaning has just today released their whatlies package that offers interactive visualisations and support for mathematical operations & transformations of word embeddings: https://spacy.io/universe/project/whatlies. There's a tutorial here: https://www.youtube.com/watch?v=FwkwC7IJWO0&list=PL75e0qA87dlG-za8eLI6t0_Pbxafk-cxb&index=9&t=0s

It also supports sense2vec, if you hadn't seen that package yet :-)

Hope this covers your use-case, if not, perhaps let Vincent know ;-)

koaning · 2020-04-14T10:31:10Z

koaning
Apr 14, 2020

Feel free to open an issue on GitHub if you encounter a bug: https://rasahq.github.io/whatlies/

koaning · 2020-04-14T10:33:16Z

koaning
Apr 14, 2020

~~@andreasgrv~~ @amueller One thing about this: whatlies does not yet support 'most similar' in all language models. The sense2vec stuff does support it though.

This is a feature that I would love to have there, but I haven't yet given serious thought to how to make it performant. My initial idea was very similar to yours, but I wonder if I might be able to use tools like annoy to keep things lightweight.

An annoying (get it?) thing here is that technically, I also support utterances of multiple tokens. But for your use-case I could also just ignore them.

Added an issue on GitHub for whatlies if you're interested in a discussion: koaning/whatlies#24

andreasgrv · 2020-04-14T11:41:52Z

andreasgrv
Apr 14, 2020

@koaning You probably meant to @ amueller :)

koaning · 2020-04-14T11:43:32Z

koaning
Apr 14, 2020

d0h.

koaning · 2020-04-15T06:58:42Z

koaning
Apr 15, 2020

I ended up working on it a bit this evening.

I'm using pairwise_distances from sklearn so you can pass a compatible metric for it to sort on. Will add some tests this week and push to pypi. Feedback is welcome.
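
Editor's sketch: a rough idea of what sorting on a pluggable metric can look like with scikit-learn's pairwise_distances. This is not the whatlies implementation; the function name closest, the toy vectors and the names WORDS and VECTORS are assumptions made up for this example.

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Toy vocabulary; the vectors are invented for illustration only.
WORDS = ["king", "queen", "man", "woman", "apple"]
VECTORS = np.array([
    [1.0, 1.0, 0.0, 0.0],  # king
    [1.0, 0.0, 1.0, 0.0],  # queen
    [0.0, 1.0, 0.0, 0.0],  # man
    [0.0, 0.0, 1.0, 0.0],  # woman
    [0.0, 0.0, 0.0, 1.0],  # apple
])

def closest(vec, count=3, metric="cosine"):
    """Return (word, distance) pairs for the `count` nearest vocabulary entries."""
    # pairwise_distances wants 2-D input, so the query becomes a 1 x d matrix.
    dists = pairwise_distances(vec.reshape(1, -1), VECTORS, metric=metric)[0]
    # Smaller distance means more similar, so an ascending sort is what we want.
    order = np.argsort(dists)[:count]
    return [(WORDS[i], float(dists[i])) for i in order]

king = VECTORS[WORDS.index("king")]
by_cosine = closest(king, metric="cosine")
by_euclidean = closest(king, metric="euclidean")
print(by_cosine)
print(by_euclidean)
```

Any metric name that pairwise_distances accepts can be passed through unchanged. Note that the query word itself comes back first, at distance zero.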

amueller · 2020-04-16T16:26:58Z

amueller
Apr 16, 2020
Author

Cool! If the query is one of the vectors, it might make sense to exclude it, i.e. not to have "king" be the vector most similar to "king". Though that might require checking for zero distance which is awkward. Not sure how gensim does that. I was a bit surprised that "king" was the answer to the second query when I ran it but I guess that's just a property of this particular embedding?
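
Editor's sketch: one way to avoid the zero-distance check is to exclude the input words by name rather than by distance. As far as I know gensim's most_similar filters out its input keys in a similar way, but that is worth verifying against its source. The toy vectors below are contrived so that "king" really is the nearest neighbour of king - man + woman; VOCAB, most_similar and the exclude_inputs flag are hypothetical names for this example.

```python
import numpy as np

# Toy vectors, contrived so that the analogy query stays closest to "king".
VOCAB = {
    "king":  np.array([3.0, 1.0, 0.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0, 0.0]),
    "man":   np.array([0.0, 1.0, 0.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0, 0.0]),
    "apple": np.array([0.0, 0.0, 0.0, 1.0]),
}

def most_similar(positive, negative=(), topn=3, exclude_inputs=True):
    """Rank vocabulary words by cosine similarity to sum(positive) - sum(negative)."""
    query = sum(VOCAB[w] for w in positive) - sum(VOCAB[w] for w in negative)
    query = query / np.linalg.norm(query)
    # Skip the query words by name, so no floating-point zero-distance test is needed.
    skip = (set(positive) | set(negative)) if exclude_inputs else set()
    scored = [
        (word, float(vec @ query / np.linalg.norm(vec)))
        for word, vec in VOCAB.items()
        if word not in skip
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [word for word, _ in scored[:topn]]

with_inputs = most_similar(["woman", "king"], ["man"], topn=2, exclude_inputs=False)
without_inputs = most_similar(["woman", "king"], ["man"], topn=1)
print(with_inputs, without_inputs)
```

Keeping the exclusion behind a flag leaves both behaviours available.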

koaning · 2020-04-16T21:21:44Z

koaning
Apr 16, 2020

@amueller I pushed those changes live yesterday, so you should be able to play with it. Documentation here.

> If the query is one of the vectors, it might make sense to exclude it, i.e. not to have "king" be the vector most similar to "king".

That should become a setting I think, but aye. Deserves to be added.

> I was a bit surprised that "king" was the answer to the second query when I ran it but I guess that's just a property of this particular embedding?

You're correct that this depends on the dataset that it was trained on as well as the algorithm that generated the embeddings ... but ... in my experience it's pretty common. But I have to admit that I made the whatlies package (not a joke) to make it easy for me to properly confirm it.

Am working on this now.

Also ... since this thread is getting specific ... let's move future talks on this topic to the repo here.
