Sentence and paragraph (etc) distances? #36

Open
robertfeldt opened this issue Mar 16, 2021 · 3 comments
Comments

@robertfeldt

Thanks for this package; very useful.

Would it make sense to include simple multi-word distance metrics, such as MOWE (mean/median of word embeddings), in this package, or is that already available in other JuliaText packages? I didn't find it, but it seems a quite common use case for people who download Embeddings.jl. An alternative might be to make these part of StringDistances.jl instead.

@oxinabox
Member

I agree that is a common use. I mean, my PhD thesis was on the fact that such simple linear combinations of word embeddings often outperform more sophisticated methods.

But I am not sure it is worth including in the package.
The package is intentionally the bare minimum just handling data loading.
It doesn't even handle looking up the index for a word.
The user is left to do that by writing something like

const get_word_index = Dict(word=>ii for (ii,word) in enumerate(embtable.vocab))
get_embedding(word) = embtable.embeddings[:, get_word_index[word]]

This allows them to do something fancier if, for example, they have loaded their words into a PooledArray etc.

Similarly, things like sums of embeddings are also one-liners.

using Statistics: mean  # `mean` is not exported by Base
sowe(words) = sum(get_embedding, words)
mowe(words) = mean(get_embedding, words)

And if they want to do something fancier, e.g. to handle out-of-vocabulary words, they are free to do so.
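For illustration only, here is a minimal sketch of what "something fancier" could look like: a mean of word embeddings that skips out-of-vocabulary words, plus a cosine distance between two word lists. The toy `vocab` and `embeddings` stand in for the fields of a table loaded with Embeddings.jl, and the names `mowe_skip_oov` and `cosine_dist` are hypothetical helpers, not part of any package.

```julia
using Statistics: mean
using LinearAlgebra: dot, norm

# Toy stand-in for a loaded embedding table (hypothetical data).
vocab = ["cat", "dog", "sat", "mat"]
embeddings = [1.0 0.9 0.0 0.1;
              0.0 0.1 1.0 0.9]   # one column per word

const word_index = Dict(word => ii for (ii, word) in enumerate(vocab))
get_embedding(word) = embeddings[:, word_index[word]]

# Mean of word embeddings, skipping out-of-vocabulary words
# (one possible policy among many; this is an assumption, not a recommendation).
function mowe_skip_oov(words)
    known = [w for w in words if haskey(word_index, w)]
    isempty(known) && throw(ArgumentError("no in-vocabulary words"))
    return mean(get_embedding, known)
end

# Cosine distance between two embedding vectors.
cosine_dist(a, b) = 1 - dot(a, b) / (norm(a) * norm(b))

d = cosine_dist(mowe_skip_oov(["cat", "sat", "unknownword"]),
                mowe_skip_oov(["dog", "mat"]))
```

Skipping unknown words is only one choice; mapping them to a shared fallback vector would be another, and which works better likely depends on the data.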

@robertfeldt
Author

robertfeldt commented Mar 17, 2021

Yes, I saw your thesis (but haven't read it all).

Sure, it's simple enough to keep it out. I figured not everyone who needs sentence/paragraph distances would know about sowe/mowe, so having it in a package might make it easier, but maybe many do. Anyway, no problem.

BTW, would you recommend straight mowe/sowe on all the words of a paragraph (well, potentially excluding stop words etc.), or rather doing it pairwise on sentences and then aggregating in some way based on sentence similarities? I haven't explored it much for larger batches of text, and my intuition tells me that just taking the mean would lose "resolution" at some point. Do you know of some papers investigating this empirically?

@oxinabox
Member

> BTW, would you recommend straight mowe/sowe on all the words of a paragraph (well, potentially excluding stop words etc.), or rather doing it pairwise on sentences and then aggregating in some way based on sentence similarities? I haven't explored it much for larger batches of text, and my intuition tells me that just taking the mean would lose "resolution" at some point. Do you know of some papers investigating this empirically?

Straight mowe/sowe is so simple to implement that it should be the first thing you try (possibly after plain BoW).
I am not sure that any kind of sentence-wise processing would give much gain. It might.
But it might not.
It seems like it would be annoying, since you need to deal with different sentences in different orders
and different numbers of sentences.
Maybe, though, at that point one can just go straight to a fancier model.
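To make the ordering and sentence-count concern concrete, here is one hypothetical way a sentence-wise aggregation could be sketched: match each sentence vector to its most similar counterpart in the other paragraph and average those best matches in both directions. This is an illustrative assumption, not a method proposed in this thread, and `paragraph_sim` is a made-up name.

```julia
using Statistics: mean
using LinearAlgebra: dot, norm

cosine_sim(a, b) = dot(a, b) / (norm(a) * norm(b))

# For each sentence vector in one paragraph, take its best cosine similarity
# against any sentence vector in the other, average those, and symmetrise.
# The result does not depend on sentence order or on equal sentence counts.
function paragraph_sim(xs, ys)
    best(as, bs) = mean(maximum(cosine_sim(a, b) for b in bs) for a in as)
    return (best(xs, ys) + best(ys, xs)) / 2
end

# Hypothetical sentence vectors (e.g. one mowe vector per sentence).
p1 = [[1.0, 0.0], [0.0, 1.0]]
p2 = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
s = paragraph_sim(p1, p2)
```

Whether this beats plain mowe over the whole paragraph is an empirical question; the sketch only shows that the bookkeeping is manageable.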
