
ConceptNetNumberbatch word embeddings support #14

Open
wants to merge 5 commits into master

Conversation

zgornel

@zgornel zgornel commented Sep 17, 2018

This pull request adds support for ConceptNetNumberbatch. Three distinct file formats are available and supported:

  • multilingual, gzipped .txt file, with a word and its embedding on each line
  • English-only, gzipped .txt file, with a word and its embedding on each line
  • multilingual, HDF5, where each embedding is a Vector{Int8}

ConceptNet word keys for the multilingual datasets are of the form /c/<language>/word, which makes direct access a bit unwieldy: searching for, say, word fails. Misspellings, e.g. word. or wordd, fail as well. A more heuristic method of retrieving the best match would be advisable at this point :)
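A minimal sketch of the lookup problem described above (the Dict, values, and `conceptnet_key` helper are hypothetical; only the /c/<language>/word key scheme comes from the datasets themselves):

```julia
# Hypothetical illustration: multilingual ConceptNet keys carry a
# "/c/<language>/" prefix, so a bare-word lookup misses.
embeddings = Dict("/c/en/word" => [0.1, 0.2],
                  "/c/fr/mot"  => [0.3, 0.4])

haskey(embeddings, "word")   # false: the bare word is not a key

# The full key has to be constructed first, e.g. for English:
conceptnet_key(lang, word) = "/c/$lang/$word"
haskey(embeddings, conceptnet_key("en", "word"))   # true
```

This is also why a misspelling such as wordd finds nothing: exact key lookup leaves no room for approximate matching, hence the suggestion of a heuristic best-match retrieval.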

@oxinabox
Member

Thanks for this.
It might be a little while before I can properly review this.
Feel free to ping me if you think I have forgotten.

I think we might want to add more smarts to the return type.
I think we can't get away with just returning a struct.
We need some methods.
We'd also need this for interpolation of OOV words in #1.

I do not like the use of :compressed as a language.
I think that should also use a multilingual marker.

It is also clear to me that as we add more embedding types,
the need to parallelize them into separate testing environments grows.

@zgornel
Author

zgornel commented Sep 18, 2018

Thanks for the feedback. A few remarks:

  • No problem on the time issue; this branch is available anyway for quick use of ConceptNet. If more refinements are required, it can be merged later ;) Some bits have to be improved as well...
  • More methods are needed indeed; OOV interpolation is a must, will look into that and backport anything worthwhile
  • Also, out-of-language support would be a nice feature, i.e. getting the embedding of an equivalent word from another language; no idea how feasible this is; fasttext has language detection so at least OOV words can be correctly interpolated
  • :compressed is a quick hack
  • Parallelization is also nice to have; however, I find it more problematic that the tests download many GBs into a temporary folder; for me it was quite difficult and I had to remove existing tests. It would be good if mini-datasets, e.g. 15 embeddings, were used; for ConceptNet I have crafted such datasets specifically for testing...

@oxinabox
Member

oxinabox commented Sep 18, 2018

Parallelization is also nice to have; however, I find it more problematic that the tests download many GBs into a temporary folder;

I am not sure what you mean, the tests delete their downloads automatically.
If they are not please raise a separate issue.
(I think there might be a windows related bug, but I am not sure as I don't use windows.)

for me it was quite difficult and I had to remove existing tests. It would be good if mini-datasets, e.g. 15 embeddings, were used; for ConceptNet I have crafted such datasets specifically for testing...

Yes, ideally, we would just test on mini-datasets.
Though that does have the downside of them not being real; it is likely worth it for the faster tests.
We'll find out pretty quickly if they are indeed not matching up to the real formats.
Feel free to make such a PR.

@zgornel
Author

zgornel commented Sep 18, 2018

I am not sure what you mean, the tests delete their downloads automatically.
If they are not please raise a separate issue.
(I think there might be a windows related bug, but I am not sure as I don't use windows.)

On many Unix-like systems /tmp is mounted via tmpfs in RAM. On lower-memory machines this can be an issue, as RAM fills up with the tests' data. So, even though it is an elegant approach, it has its issues, and imho tests should be able to run on low-memory machines (this can cause problems in CI as well, since those machines have few resources at their disposal, at least the free ones).

@dellison
Collaborator

@zgornel Thanks! I'll start taking a look at this too.

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.

I think we might want to add more smarts to the return type.
I think we can't get away with just returning a struct.
We need some methods.
We'd also need this for interpolation of OOV words in #1.

I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.

@oxinabox oxinabox mentioned this pull request Sep 19, 2018
@oxinabox
Member

oxinabox commented Sep 19, 2018

@zgornel Thanks! I'll start taking a look at this too.

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), ...

That is my understanding too.
Languages.jl already has (its own) language detection. (Probably not state of the art, but very workable)
And once #6 is done, then that will be easy to use together with it.

I think we might want to add more smarts to the return type.
I think we can't get away with just returning a struct.
We need some methods.
We'd also need this for interpolation of OOV words in #1.

I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.

Yes, let's. #14 #16

@dellison
Collaborator

@oxinabox I think that issue link just goes to this PR. Issue is #16

cnt = 0
indices = Int[]
for (index, row) in enumerate(data)
word, embedding = _parseline(row)
Collaborator

@dellison dellison Sep 19, 2018


I think you can probably get a small efficiency gain here if you wait to actually parse the rest of the line as floats until you know that you are looking at a "keep word" (inside the if).

Author


Yes, thx!
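The suggestion above could look roughly like this (a sketch with hypothetical names; the actual code splits lines via `_parseline`):

```julia
# Sketch: split each row on the first space only, and defer parsing the
# floats until we know the word is one we keep (the cheap check first).
function load_kept(data, keep_words)
    embeddings = Dict{String, Vector{Float64}}()
    for row in data
        word, rest = split(row, ' '; limit=2)   # no float parsing yet
        if word in keep_words
            # Parse the embedding only for a "keep word":
            embeddings[word] = parse.(Float64, split(rest))
        end
    end
    return embeddings
end
```

For rows that are skipped, this avoids parsing a few hundred floats per line, which is where the small efficiency gain comes from.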

open(file, "r") do fid
vocab_size, vector_size = map(x->parse(Int,x), split(readline(fid)))
max_stored_vocab_size = _get_vocab_size(vocab_size, max_vocab_size)
data = readlines(fid)
Collaborator


readlines loads the whole file into memory. I think it would be better to remove this line and iterate through the file with enumerate(eachline(fid)) instead.

Copy link
Author


Indeed, will do ;)
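A sketch of the streaming variant suggested above, assuming the same header line and space-separated rows as in the snippet under review (function name and Dict layout are hypothetical):

```julia
# Sketch: iterate the file with eachline instead of materializing it
# with readlines, so memory use stays flat regardless of file size.
function load_streaming(file::AbstractString, max_stored_vocab_size::Int)
    open(file, "r") do fid
        vocab_size, vector_size = parse.(Int, split(readline(fid)))
        embeddings = Dict{String, Vector{Float64}}()
        for (index, row) in enumerate(eachline(fid))  # streams lazily
            index > max_stored_vocab_size && break
            word, rest = split(row, ' '; limit=2)
            embeddings[word] = parse.(Float64, split(rest))
        end
        return embeddings
    end
end
```

Since `eachline` yields one line at a time, the loop can also stop early at `max_stored_vocab_size` without ever having read the rest of the file.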

@zgornel
Author

zgornel commented Sep 19, 2018

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.

That's it; I was referring to the pretrained model, which can be downloaded here. Since the multilingual ConceptNet file uses keys of the form /c/<language>/<matchable expression>, detecting the language first can pre-filter the embeddings before the OOV search. In cases where the same word appears in different languages, this helps greatly with both correctness and speed.

@oxinabox I was not aware that Languages.jl has language identification, that's great.
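The pre-filtering idea above can be sketched in a few lines (a hypothetical helper; in practice the language code would come from a detector such as the one in Languages.jl mentioned earlier):

```julia
# Sketch: restrict multilingual ConceptNet keys to one language before
# any fuzzy OOV matching, by filtering on the "/c/<lang>/" prefix.
keys_for_language(ks, lang) = filter(k -> startswith(k, "/c/$lang/"), ks)

ks = ["/c/en/word", "/c/fr/mot", "/c/en/house"]
keys_for_language(ks, "en")   # ["/c/en/word", "/c/en/house"]
```

Shrinking the candidate set this way both avoids cross-language false matches and makes any subsequent best-match heuristic cheaper to run.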

zgornel added a commit to zgornel/Embeddings.jl that referenced this pull request Sep 19, 2018