spaCy fine-tuning of embeddings #2762
Replies: 3 comments
-
If you get something working you can always make it a custom component or plugin, sure. But I think you might have some of the concepts a bit crossed. The tokenizer couldn't really be used to supervise the embeddings in any way I could see. You could use the lemmatizer to key the vectors table differently. You might find our work on sense2vec interesting in this respect: https://github.com/explosion/sense2vec
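For anyone wondering what "keying the vectors table by lemma" could look like in practice, here is a minimal sketch. It assumes the `en_core_web_md` pipeline with pretrained vectors is installed, and averaging the vectors of the surface forms that share a lemma is just one illustrative choice, not a spaCy-endorsed method:

```python
import numpy as np
import spacy

# Assumes the medium English pipeline with pretrained vectors:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

# Collect the vectors of every surface form observed for each lemma.
lemma_vectors = {}
for doc in nlp.pipe(["The cats were running.", "A cat runs fast."]):
    for token in doc:
        if token.has_vector:
            lemma_vectors.setdefault(token.lemma_, []).append(token.vector)

# Store an averaged vector back into the vocab under the lemma key, so
# inflected variants like "cat"/"cats" share one entry. Averaging is an
# illustrative aggregation strategy, not the only sensible one.
for lemma, vecs in lemma_vectors.items():
    nlp.vocab.set_vector(lemma, np.mean(vecs, axis=0))

print(nlp.vocab["cat"].vector[:5])
```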
-
Shouldn't the vocabulary of an embedding model ideally be the set of tokens that appear in the training corpus? Given the rarity of some words/tokens, those could of course be mapped to the same token, but other than that?
-
Is there any way to fine-tune a sense2vec model?
-
Feature description
spaCy has great tokenization and lemmatization. It would be great to use these, together with the current word vectors, to fine-tune the embeddings (given by spaCy's vocabulary) on a domain-specific dataset. For now, the docs recommend retraining word embeddings from scratch with gensim and reloading the vectors (see the sketch below), but from my understanding this does not leverage spaCy's functionality.
Could the feature be a custom component or spaCy plugin?
Could be
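For reference, the gensim-then-reload workflow mentioned above could look roughly like the following sketch. The toy corpus and hyperparameters are placeholders, and the snippet assumes the gensim 4.x API (older releases used `size`/`iter` instead of `vector_size`/`epochs`):

```python
import spacy
from gensim.models import Word2Vec

# Placeholder domain corpus; a real one would be pre-tokenized
# domain-specific text.
corpus = [
    ["the", "patient", "received", "metformin"],
    ["metformin", "lowers", "blood", "glucose"],
]

# Train word2vec vectors from scratch on the domain corpus.
# Hyperparameters here are illustrative only.
model = Word2Vec(sentences=corpus, vector_size=50, min_count=1, epochs=20)

# Reload the trained vectors into a fresh spaCy vocab.
nlp = spacy.blank("en")
for word in model.wv.index_to_key:
    nlp.vocab.set_vector(word, model.wv[word])

print(nlp.vocab["metformin"].vector[:5])
```

In spaCy v3, the reload step can also be done with the `spacy init vectors` CLI on vectors exported in word2vec text format, instead of calling `set_vector` in a loop.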