
ConceptNetNumberbatch word embeddings support #14

Open
wants to merge 5 commits into master

Conversation

zgornel

@zgornel zgornel commented Sep 17, 2018

This pull request adds support for ConceptNetNumberbatch. Three distinct file formats are available and supported:

  • multilingual, gzipped .txt file, with a word and its embedding on each line
  • English-only, gzipped .txt file, with a word and its embedding on each line
  • multilingual, HDF5, where each embedding is a Vector{Int8}

ConceptNet word keys for the multilingual datasets are of the form /c/<language>/word, which makes direct access a bit unwieldy: searching for, say, word fails. Misspellings, e.g. word. or wordd, fail as well. A more heuristic method of retrieving the best match would be advisable at this point :)
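A minimal sketch of the lookup problem described above (the Dict, values, and `conceptnet_key` helper are hypothetical; only the /c/<language>/word key scheme comes from the datasets themselves):

```julia
# Hypothetical illustration: multilingual ConceptNet keys carry a
# "/c/<language>/" prefix, so a bare-word lookup misses.
embeddings = Dict("/c/en/word" => [0.1, 0.2],
                  "/c/fr/mot"  => [0.3, 0.4])

haskey(embeddings, "word")   # false: the bare word is not a key

# The full key has to be constructed first, e.g. for English:
conceptnet_key(lang, word) = "/c/$lang/$word"
haskey(embeddings, conceptnet_key("en", "word"))   # true
```

This is also why a misspelling such as wordd finds nothing: exact key lookup leaves no room for approximate matching, hence the suggestion of a heuristic best-match retrieval.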

@oxinabox
Member

Thanks for this.
It might be a little while before I can properly review this.
Feel free to ping me if you think I have forgotten.

I think we might want to add more smarts to the return type.
I think we can't get away with just returning a struct.
We need some methods.
We'd also need this for interpolation of OOV words in #1.

I do not like the use of :compressed as a language.
I think that should also use a multilingual marker.

It is also clear to me that as we add more embedding types,
the need to parallelize them into separate testing environments grows.

@zgornel
Author

zgornel commented Sep 18, 2018

Thanks for the feedback. A few remarks:

  • No problem on the time issue; this branch is available anyway for quick use of ConceptNet. If more refinements are required, it can be merged later ;) Some bits have to be improved as well...
  • More methods are needed indeed; OOV interpolation is a must, will look into that and backport anything worthwhile
  • Also, out-of-language support would be a nice feature, i.e. getting the embedding of an equivalent word from another language; no idea how feasible this is; fasttext has language detection so at least OOV words can be correctly interpolated
  • :compressed is a quick hack
  • Parallelization is also nice to have; however, I find it more problematic that the tests download many GBs into a temporary folder; for me it was quite difficult and I had to remove existing tests. It would be good if mini-datasets, e.g. 15 embeddings, were used; for ConceptNet I have crafted such datasets specifically for testing...

@oxinabox
Member

oxinabox commented Sep 18, 2018

Parallelization is also nice to have; however, I find it more problematic that the tests download many GBs into a temporary folder;

I am not sure what you mean, the tests delete their downloads automatically.
If they are not please raise a separate issue.
(I think there might be a windows related bug, but I am not sure as I don't use windows.)

for me it was quite difficult and I had to remove existing tests. It would be good if mini-datasets, e.g. 15 embeddings, were used; for ConceptNet I have crafted such datasets specifically for testing...

Yes, ideally, we would just test on mini-datasets.
Though that does have the downside of them not being real; it is likely worth it for the faster tests.
We'll find out pretty quickly if they are indeed not matching up to the real formats.
Feel free to make such a PR.

@zgornel
Author

zgornel commented Sep 18, 2018

I am not sure what you mean, the tests delete their downloads automatically.
If they are not please raise a separate issue.
(I think there might be a windows related bug, but I am not sure as I don't use windows.)

On many Unix-like systems /tmp is mounted via tmpfs in RAM. On lower-memory machines this can be an issue, as RAM fills up with the tests' data. So, even though it is an elegant approach, it has its issues, and imho tests should be able to run on low-memory machines (this can cause problems in CI as well, since those machines have few resources at their disposal, at least the free ones).

@dellison
Collaborator

@zgornel Thanks! I'll start taking a look at this too.

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.

I think we might want to add more smarts to the return type.
I think we can't get away with just returning a struct.
We need some methods.
We'd also need this for interpolation of OOV words in #1.

I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.

@oxinabox oxinabox mentioned this pull request Sep 19, 2018
@oxinabox
Member

oxinabox commented Sep 19, 2018

@zgornel Thanks! I'll start taking a look at this too.

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), ...

That is my understanding too.
Languages.jl already has (its own) language detection. (Probably not state of the art, but very workable)
And once #6 is done, then that will be easy to use together with it.

I think we might want to add more smarts to the return type.
I think we can't get away with just returning a struct.
We need some methods.
We'd also need this for interpolation of OOV words in #1.

I agree. But to me, it seems like this is a separate feature that this PR doesn't (necessarily) depend on. @oxinabox do you have anything specific in mind already? Otherwise, maybe we should open another issue to discuss what a generic API might look like.

Yes, let's. #14 #16

@dellison
Collaborator

@oxinabox I think that issue link just goes to this PR. Issue is #16

cnt = 0
indices = Int[]
for (index, row) in enumerate(data)
word, embedding = _parseline(row)
Collaborator

@dellison dellison Sep 19, 2018


I think you can probably get a small efficiency gain here if you wait to actually parse the rest of the line as floats until you know that you are looking at a "keep word" (inside the if).

Author


Yes, thx!
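The suggestion above could look roughly like this (a sketch with hypothetical names; the actual code splits lines via `_parseline`):

```julia
# Sketch: split each row on the first space only, and defer parsing the
# floats until we know the word is one we keep (the cheap check first).
function load_kept(data, keep_words)
    embeddings = Dict{String, Vector{Float64}}()
    for row in data
        word, rest = split(row, ' '; limit=2)   # no float parsing yet
        if word in keep_words
            # Parse the embedding only for a "keep word":
            embeddings[word] = parse.(Float64, split(rest))
        end
    end
    return embeddings
end
```

For rows that are skipped, this avoids parsing a few hundred floats per line, which is where the small efficiency gain comes from.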

open(file, "r") do fid
vocab_size, vector_size = map(x->parse(Int,x), split(readline(fid)))
max_stored_vocab_size = _get_vocab_size(vocab_size, max_vocab_size)
data = readlines(fid)
Collaborator


readlines loads the whole file into memory. I think it would be better to remove this line and iterate through the file with enumerate(eachline(fid)) instead.

Copy link
Author


Indeed, will do ;)
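A sketch of the streaming variant suggested above, assuming the same header line and space-separated rows as in the snippet under review (function name and Dict layout are hypothetical):

```julia
# Sketch: iterate the file with eachline instead of materializing it
# with readlines, so memory use stays flat regardless of file size.
function load_streaming(file::AbstractString, max_stored_vocab_size::Int)
    open(file, "r") do fid
        vocab_size, vector_size = parse.(Int, split(readline(fid)))
        embeddings = Dict{String, Vector{Float64}}()
        for (index, row) in enumerate(eachline(fid))  # streams lazily
            index > max_stored_vocab_size && break
            word, rest = split(row, ' '; limit=2)
            embeddings[word] = parse.(Float64, split(rest))
        end
        return embeddings
    end
end
```

Since `eachline` yields one line at a time, the loop can also stop early at `max_stored_vocab_size` without ever having read the rest of the file.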

@zgornel
Author

zgornel commented Sep 19, 2018

fasttext has language detection so at least OOV words can be correctly interpolated

Can you clarify what you mean by this? My understanding is that while it's quite possible to use the fasttext library to train a classifier for a language identification task (like they show here), the pretrained fasttext embeddings themselves are all monolingual, i.e. each language is trained separately and the embedding space is not shared among languages, with any OOV interpolation also being language-specific, as it is computed from subword character n-grams. Maybe I'm missing your point, though.

That's it; I was referring to the pretrained model, which can be downloaded here. Since the multilingual ConceptNet file uses keys of the form /c/<language>/<matchable expression>, detecting the language first can pre-filter the embeddings before the OOV search. In cases where the same word appears in different languages, this helps greatly with both correctness and speed.

@oxinabox I was not aware that Languages.jl has language identification, that's great.
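The pre-filtering idea above can be sketched in a few lines (a hypothetical helper; in practice the language code would come from a detector such as the one in Languages.jl mentioned earlier):

```julia
# Sketch: restrict multilingual ConceptNet keys to one language before
# any fuzzy OOV matching, by filtering on the "/c/<lang>/" prefix.
keys_for_language(ks, lang) = filter(k -> startswith(k, "/c/$lang/"), ks)

ks = ["/c/en/word", "/c/fr/mot", "/c/en/house"]
keys_for_language(ks, "en")   # ["/c/en/word", "/c/en/house"]
```

Shrinking the candidate set this way both avoids cross-language false matches and makes any subsequent best-match heuristic cheaper to run.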

zgornel added a commit to zgornel/Embeddings.jl that referenced this pull request Sep 19, 2018