spaCy fine-tuning of embeddings #2762
Replies: 3 comments
-
If you get something working you can always make it a custom component or plugin, sure. But I think you might have some of the concepts a bit crossed. The tokenizer couldn't really be used to supervise the embeddings in any way I could see. You could use the lemmatizer to key the vectors table differently. You might find our work on sense2vec interesting in this respect: https://github.com/explosion/sense2vec
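For anyone wondering what "keying the vectors table by lemma" could look like in practice, here is a minimal sketch. It assumes the `en_core_web_md` pipeline with pretrained vectors is installed, and averaging the vectors of the surface forms that share a lemma is just one illustrative choice, not a spaCy-endorsed method:

```python
import numpy as np
import spacy

# Assumes the medium English pipeline with pretrained vectors:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

# Collect the vectors of every surface form observed for each lemma.
lemma_vectors = {}
for doc in nlp.pipe(["The cats were running.", "A cat runs fast."]):
    for token in doc:
        if token.has_vector:
            lemma_vectors.setdefault(token.lemma_, []).append(token.vector)

# Store an averaged vector back into the vocab under the lemma key, so
# inflected variants like "cat"/"cats" share one entry. Averaging is an
# illustrative aggregation strategy, not the only sensible one.
for lemma, vecs in lemma_vectors.items():
    nlp.vocab.set_vector(lemma, np.mean(vecs, axis=0))

print(nlp.vocab["cat"].vector[:5])
```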
-
Shouldn't the vocabulary of an embedding model ideally be the set of tokens that appear in the training corpus? Given the rarity of some words/tokens, those could of course be mapped to the same token, but other than that?
-
Is there any way to fine-tune a sense2vec model?
-
Feature description
spaCy has great tokenization and lemmatization. It would be great to use these, together with the current word vectors, to fine-tune the embeddings (given by spaCy's vocabulary) on a domain-specific dataset. For now, the docs recommend retraining word embeddings from scratch with gensim and reloading the vectors (see the sketch below), but from my understanding this does not leverage spaCy's functionality.
Could the feature be a custom component or spaCy plugin?
Could be
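For reference, the gensim-then-reload workflow mentioned above could look roughly like the following sketch. The toy corpus and hyperparameters are placeholders, and the snippet assumes the gensim 4.x API (older releases used `size`/`iter` instead of `vector_size`/`epochs`):

```python
import spacy
from gensim.models import Word2Vec

# Placeholder domain corpus; a real one would be pre-tokenized
# domain-specific text.
corpus = [
    ["the", "patient", "received", "metformin"],
    ["metformin", "lowers", "blood", "glucose"],
]

# Train word2vec vectors from scratch on the domain corpus.
# Hyperparameters here are illustrative only.
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=20)

# Reload the trained vectors into a fresh spaCy vocab.
nlp = spacy.blank("en")
for word in model.wv.index_to_key:
    nlp.vocab.set_vector(word, model.wv[word])

print(nlp.vocab["metformin"].vector[:5])
```

In spaCy v3, the reload step can also be done with the `spacy init vectors` CLI on vectors exported in word2vec text format, instead of calling `set_vector` in a loop.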