Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Missing first letters in precompiled word vectors file: GoogleNews-vectors-negative300.bin #25

Open
GoogleCodeExporter opened this issue Jul 21, 2015 · 2 comments

Comments

@GoogleCodeExporter
Copy link

What steps will reproduce the problem?
--Using the python word2vec module in ipython. Loaded the model from 
GoogleNews-vectors-negative300.bin using the command:
model = word2vec.load("~/Downloads/GoogleNews-vectors-negative300.bin")

What is the expected output? What do you see instead?
The vocab of the model looks like it is made of of english words that have been 
stripped of their first character. As a result, many common words are missing. 
Correctly spelled words which are found in vocab actually represent collisions 
created when removing the first character.

For instance:

model.cosine('out')  

returns:

{'out': [('outs', 0.8092596703376076),
  ('eavyweight_bout', 0.65542583911176289),
  ('ightweight_bout', 0.64856198153561295),
  ('ndercard_bout', 0.62005739361720136),
  ('iddleweight_bout', 0.61811559624397572),
  ('assily_Jirov', 0.61172633394627596),
  ('atchweight_bout', 0.60739346001729411),
  ('uper_middleweight_bout', 0.60237084554945242),
  ('eatherweight_bout', 0.60183827323165029),
  ("KO'd", 0.60002383627451883)]}

The string 'out' actually represents the english word 'bout' which has been 
correctly grouped with other boxing terms. Note the similar terms are also 
missing their first characters.

Another example:

model.cosine('aul')

returns

{'aul': [('ohn', 0.82979825790046702),
  ('eter', 0.750119256790031),
  ('ark', 0.71162490811744983),
  ('ndrew', 0.66359523924163855),
  ('hris', 0.66228796043431837),
  ('ichard', 0.66142257169136376),
  ('hilip', 0.6576444040097873),
  ('ichael', 0.64312885937086905),
  ('on', 0.64042190735670823),
  ('avid', 0.63592487085268301)]}  


This group of words is gibberish but represents a cluster of common male names. 
The full names like 'john' however are not present in the vocabularly.

It looks like the vector representation is doing a very good job of capturing 
the linguistic structure. However, the inconvenient absence of first characters 
creates many unfortunate collisions. 

Original issue reported on code.google.com by [email protected] on 15 Dec 2014 at 7:54

@GoogleCodeExporter
Copy link
Author

I noticed this problem myself. In the file wordvectors.py on line 171 they read 
an extra character after each vector. This just sends the first letter to 
nowhere. If you comment this line out then it works i.e. 

171: #fin.read(1)  # newline

Original comment by [email protected] on 31 Mar 2015 at 3:00

@GoogleCodeExporter
Copy link
Author

I noticed this problem myself. In the file wordvectors.py on line 171 they read 
an extra character after each vector. This just sends the first letter to 
nowhere. If you comment this line out then it works i.e. 

171: #fin.read(1)  # newline


The suggestion above worked for me. Note that you have to undo this if you were 
to read binary files other than the google news one. If you continue with 
commenting out the new line you will get corrupt vocabs like this :

vocab
-->
array([u'\nthe', u'\nof', u'\nand', u'\nto', u'\nin', u'\nor', u'\na',
       u'\nfor', u'\nany', u'\nby', u'\nas', u'\nThe', u'\nbe', u'\nsuch',
       u'\nshall', u'\nCompany', u'\nis', u'\non', u'\n._.'], 
      dtype='<U78')

Original comment by [email protected] on 4 Jun 2015 at 2:40

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant