Sentence and paragraph (etc) distances? #36

robertfeldt · 2021-03-16T08:22:46Z

Thanks for this package; very useful.

Would it make sense to include simple multi-word distance metrics like MOWE (mean/median of word embeddings) etc. in this package, or is that already available in other JuliaText packages? I didn't find it, but it seems like quite a common use case for people who download Embeddings.jl. An alternative might be to make these part of StringDistances.jl instead.


oxinabox · 2021-03-16T18:30:54Z

I agree that is a common use. I mean, my PhD thesis was on the fact that such simple linear combinations of word embeddings often outperform more sophisticated methods.

But I am not sure it is worth including in the package.
The package is intentionally the bare minimum, just handling data loading.
It doesn't even handle looking up the index for a word.
The user is left to do that by writing something like

const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))
get_embedding(word) = embtable.embeddings[:, get_word_index[word]]

This allows them to do something fancier if they have, for example, loaded their words into a PooledArray etc.

Similarly, things like sums of embeddings are also one-liners.

using Statistics  # provides `mean`
sowe(words) = sum(get_embedding, words)   # sum of word embeddings
mowe(words) = mean(get_embedding, words)  # mean of word embeddings

And if they want to do something fancier to handle out-of-vocabulary words etc., then they are free to do so.
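As a rough illustration of how the one-liners above could be turned into a sentence distance with simple out-of-vocabulary handling, here is a minimal sketch. It is not part of Embeddings.jl: the toy `embtable` is a hypothetical stand-in with the same `vocab` and `embeddings` fields, and `maybe_embedding`, `mowe_skip_oov` and `cosine_distance` are names made up for this example. Silently skipping unknown words is only one possible policy.

```julia
using LinearAlgebra   # dot, norm

# Toy stand-in for a loaded embedding table (hypothetical data).
# It only mimics the `vocab` and `embeddings` fields used in this thread.
embtable = (vocab = ["cat", "dog", "car"],
            embeddings = [1.0 0.9 0.0;
                          0.0 0.1 1.0])   # one column per word

word_index = Dict(word => ii for (ii, word) in enumerate(embtable.vocab))

# Look up a word; return `nothing` for out-of-vocabulary words.
function maybe_embedding(word)
    ii = get(word_index, word, nothing)
    return ii === nothing ? nothing : embtable.embeddings[:, ii]
end

# Mean of word embeddings over the in-vocabulary words only.
function mowe_skip_oov(words)
    vecs = [v for v in (maybe_embedding(w) for w in words) if v !== nothing]
    isempty(vecs) && throw(ArgumentError("no in-vocabulary words"))
    return sum(vecs) / length(vecs)
end

# Cosine distance between two embedding vectors.
cosine_distance(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# "unicorn" is not in the toy vocabulary and is ignored.
d_close = cosine_distance(mowe_skip_oov(["cat", "unicorn"]), mowe_skip_oov(["dog"]))
d_far   = cosine_distance(mowe_skip_oov(["cat"]), mowe_skip_oov(["car"]))
```

With these toy vectors, `d_close` comes out smaller than `d_far`, since "cat" and "dog" point in nearly the same direction while "cat" and "car" are orthogonal.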

robertfeldt · 2021-03-17T08:14:40Z

Yes, I saw your thesis (but haven't read it all).

Sure, it's simple enough to keep it out. I figured not everyone who needs sentence/paragraph distances would know about sowe/mowe, so having it in a package might make it easier, but maybe many do. Anyway, no problem.

BTW, would you recommend straight mowe/sowe on all the words (well, potentially excluding stop words etc.) of a paragraph, or rather doing it pairwise on sentences and then aggregating in some way based on sentence similarities? I haven't explored it much for larger batches of text, and my intuition tells me that just taking the mean would lose "resolution" at some point. Do you know of any papers investigating this empirically?

oxinabox · 2021-03-24T13:23:22Z

> BTW, would you recommend straight mowe/sowe on all the words (well, potentially excluding stop words etc.) of a paragraph, or rather doing it pairwise on sentences and then aggregating in some way based on sentence similarities? I haven't explored it much for larger batches of text, and my intuition tells me that just taking the mean would lose "resolution" at some point. Do you know of any papers investigating this empirically?

Straight mowe/sowe is so simple to implement that it should be the first thing you try (possibly after plain BoW).
I am not sure that any kind of sentence-wise processing would give much gain. It might, but it might not.
It seems like it would be annoying, since you need to deal with different sentences in different orders, and with different numbers of sentences.
At that point, though, one can maybe just go straight to a fancier model.
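To make the two options discussed here concrete, the following is a minimal sketch that compares them on toy data. Everything in it is an assumption for illustration: the 2-d vectors in `emb` are invented, and the best-match averaging in `sentencewise_distance` is just one possible way to aggregate sentence similarities that tolerates different sentence orders and counts. It is not a method recommended in this thread.

```julia
using Statistics      # mean
using LinearAlgebra   # dot, norm

# Toy 2-d embeddings (hypothetical values), one vector per word.
emb = Dict("cat" => [1.0, 0.0], "dog" => [0.9, 0.1],
           "car" => [0.0, 1.0], "bus" => [0.1, 0.9])

mowe(words) = mean(w -> emb[w], words)
cosine_distance(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

# Paragraphs as vectors of sentences, each sentence a vector of words.
para_a = [["cat", "dog"], ["car"]]
para_b = [["bus", "car"], ["dog"], ["cat"]]   # different order and sentence count

# Option 1: straight mowe over all the words of each paragraph.
paragraph_distance(a, b) =
    cosine_distance(mowe(reduce(vcat, a)), mowe(reduce(vcat, b)))

# Option 2: embed each sentence, match it to its closest sentence in the
# other paragraph, and average those best-match distances in both directions.
function sentencewise_distance(a, b)
    sa, sb = map(mowe, a), map(mowe, b)
    a_to_b = mean(minimum(cosine_distance(x, y) for y in sb) for x in sa)
    b_to_a = mean(minimum(cosine_distance(y, x) for x in sa) for y in sb)
    return (a_to_b + b_to_a) / 2
end

d_paragraph = paragraph_distance(para_a, para_b)
d_sentence  = sentencewise_distance(para_a, para_b)
```

Option 2 is symmetric and does not care about sentence order, but it costs one distance computation per sentence pair, which is part of why trying the plain paragraph-level mean first is reasonable.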
