Common words like I'm, we're, and don't turn out to be OOV in the popular GloVe pretrained models, while words like o'clock are in-vocabulary, so you can't just split on apostrophes/single quotes. None of this is documented: the only hints are vague references that the Stanford parser, with undocumented switches, MIGHT have been used to generate the common pretrained GloVe models, and Google has published nothing about how the text behind Word2Vec's GoogleNews pretrained model was preprocessed. Given that, it seems to me Gensim would do people a lot of good by providing tokenizers matching each of its most popular included pretrained models, so that users write NLP programs that speak the same language as their models rather than comparing apples to oranges.
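For reference, a quick way to check this yourself against one of the GloVe sets redistributed via gensim's downloader (using the `glove-wiki-gigaword-100` model; other GloVe releases may behave differently):

```python
# Check which surface forms are present in a pretrained GloVe vocabulary.
# Uses the "glove-wiki-gigaword-100" download; other GloVe variants
# (Common Crawl, Twitter, ...) may differ.
import gensim.downloader as api

kv = api.load("glove-wiki-gigaword-100")  # returns a KeyedVectors instance

for token in ["i'm", "we're", "don't", "o'clock", "'m", "'re", "n't"]:
    status = "in vocab" if token in kv.key_to_index else "OOV"
    print(f"{token!r}: {status}")
```

If the contraction pieces like `'m` and `n't` show up in-vocabulary while the joined forms don't, that's consistent with a PTB-style tokenizer that splits contractions, which is part of what a matching Gensim tokenizer would need to reproduce.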
A desire for help here has come up a lot – & at times I've shared my observations about what can be deduced from the limited statements, & observable contents, of pre-trained vector sets like the 'GoogleNews' release.
However, without disclosures (or better yet code) from the original researchers who prepared such pretrained vectors, all such efforts will only ever be gradually-approximating their practices, with lingering exceptions & caveats generating more questions.
Also: it often seems to be beginner & small-data projects that are most-eager to re-use pretrained vectors from elsewhere, under the assumption those must be the "right" thing, or better than what they'd achieve. But: many times that's not the case.
For example, GoogleNews was trained on an internal Google corpus of news articles 11+ years ago. It used a statistical model for creating multiword-tokens whose exact parameters/word-frequencies/multigram-frequencies have never been disclosed. For many current projects, word-vectors trained on more-recent domain-specific data via understood & consciously-chosen preprocessing – even much less data! – will likely generate better vocabulary & relevant-word-sense coverage than Google's old work.
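(For what it's worth, the closest in-gensim analogue to that multiword-token step is the `Phrases` model, but any parameters chosen there are guesses rather than a reconstruction of Google's actual frequency statistics or thresholds. A rough sketch:)

```python
# Rough sketch: approximating GoogleNews-style multiword tokens with gensim's
# Phrases model. The min_count/threshold values below are placeholders; the
# statistics Google actually used were never published.
from gensim.models.phrases import Phrases, ENGLISH_CONNECTOR_WORDS

sentences = [
    ["new", "york", "mayor", "holds", "press", "conference"],
    ["the", "press", "conference", "was", "held", "in", "new", "york"],
    # ... a real corpus of tokenized sentences would go here
]

bigrams = Phrases(
    sentences,
    min_count=5,        # placeholder, not Google's value
    threshold=10.0,     # placeholder, not Google's value
    connector_words=ENGLISH_CONNECTOR_WORDS,
)

# On a large corpus, frequent pairs such as ("new", "york") get rewritten as a
# single "new_york" token; this toy corpus is far too small for that to trigger.
print(bigrams[["new", "york", "press", "conference"]])
```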
So while I'd see some value in a "best guess" function to mimic the tokenizing choices of those commonly-used pretrained sets – as a research effort, or contribution – I'd also prefer it prominently-disclaimered as non-official, & not-necessarily-an-endorsement of preferring those vectors, and that tokenization, for anyone's particular purpose.
At this point, devising such helpers would be a sort of software-archeology/mystery project, and I'd not see it as any sort of urgent priority. But, it might make a good new-contributor, student, or hackathon project – especially if eventual integration includes good surrounding docs/discussion/demos of the limits/considerations involved in reusing another project's vectors/preprocessing choices.
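If someone does take it on, a rough, non-official starting point for the GloVe-side contraction handling might look like the sketch below; the split patterns are inferred only from which tokens appear in the released vocabularies, not from any published preprocessing code, and the helper name is just illustrative.

```python
# Non-official guess at PTB/GloVe-style contraction handling: split "don't"
# into "do" + "n't" and "I'm" into "i" + "'m", while leaving "o'clock" intact.
# Inferred from released vocabularies, not from any published preprocessing code.
import re

_CONTRACTION_SUFFIX = re.compile(r"(.+?)(n't|'m|'re|'s|'ve|'ll|'d)$")

def guess_glove_tokenize(text):
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,!?\";:()")              # crude punctuation stripping
        match = _CONTRACTION_SUFFIX.match(word)
        if match:
            tokens.extend([match.group(1), match.group(2)])
        elif word:
            tokens.append(word)
    return tokens

print(guess_glove_tokenize("I'm sure we don't meet until five o'clock."))
# -> ['i', "'m", 'sure', 'we', 'do', "n't", 'meet', 'until', 'five', "o'clock"]
```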
gojomo changed the title from "add functions to reproduct preprocessing behind GoogleNews, GLoVe, etc pretrained word-vectors" to "add functions to reproduce preprocessing matching GoogleNews, GLoVe, etc pretrained word-vectors" on Jul 21, 2023.
Suggested on the project discussion list: https://groups.google.com/g/gensim/c/CsER2XBs8P4/m/f2EntuXRAgAJ