Sentiment Analysis - Error on composed words #3

Open
alex-lairan opened this issue Oct 7, 2019 · 7 comments

Comments

@alex-lairan

Hi,

I'm using sentiment analysis for testing purposes, and I found a problem with contractions (composed words) such as "don't".

I have this code:

require "cadmium"

sentiment = Cadmium.sentiment
pp sentiment.analyze "I realy don't like mosquitoes"
pp "I realy don't like mosquitoes".is_negative?

The result is:

{score: 2,
 comparative: 0,
 tokens: ["I", "realy", "do", "n't", "like", "moskitoes"],
 words: ["like"],
 positive: ["like"],
 negative: []}
false

Here, the negation in "don't" is not taken into account: the sentence is scored as positive.
I know it's bad English, but it's something you can find on Twitter.

I don't know if I'm using it the wrong way.

@watzon
Member

watzon commented Oct 8, 2019

Seems like a problem with the tokenizer. I'll look into it.

@hugoabonizio

hugoabonizio commented Oct 18, 2019

Using the pragmatic tokenizer, "don't" is recognized as a single token, but I think there's also a problem with negation identification, which I addressed in cadmiumcr/cadmium#27.

sentiment.tokenizer = Cadmium.pragmatic_tokenizer.new

{score: 2,
 comparative: 0.4,
 tokens: ["i", "realy", "don't", "like", "mosquitoes"],
 words: ["like"],
 positive: ["like"],
 negative: []}
false
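
To make the negation problem concrete, here is a minimal, self-contained sketch of one common way to handle it: flip the score of a word when the token just before it is a negator. This is only an illustration under stated assumptions, not Cadmium's implementation and not necessarily what cadmiumcr/cadmium#27 does. The `SCORES` and `NEGATORS` tables and the `score_with_negation` method are hypothetical names made up for this example.

```crystal
# Hypothetical word scores and negators, for illustration only.
# These are NOT Cadmium's actual lexicon or negation list.
SCORES   = {"like" => 2, "love" => 3, "hate" => -3}
NEGATORS = ["not", "no", "never", "don't", "n't"]

# Sum the score of each known word, flipping the sign when the
# token immediately before it is a negator.
def score_with_negation(tokens)
  total = 0
  previous = ""
  tokens.each do |token|
    word = token.downcase
    value = SCORES.fetch(word, 0) # unknown words score 0
    value = -value if NEGATORS.any? { |negator| negator == previous }
    total += value
    previous = word
  end
  total
end

# "don't" kept as one token (as with the pragmatic tokenizer output above)
puts score_with_negation(["i", "realy", "don't", "like", "mosquitoes"]) # prints -2
# "don't" split into "do" + "n't" (as in the original report)
puts score_with_negation(["I", "realy", "do", "n't", "like", "mosquitoes"]) # prints -2
# no negation
puts score_with_negation(["I", "like", "mosquitoes"]) # prints 2
```

With a rule like this, both tokenizations give a negative score as long as the negator list contains the form the tokenizer actually emits ("don't" or "n't"), which is why the result depends on the tokenizer in use.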

@watzon
Member

watzon commented Oct 19, 2019

The problem with the Pragmatic Tokenizer is that it's much, much slower than the other ones. I don't recommend using it internally for anything.

@hugoabonizio

@watzon it also works with aggressive_tokenizer, but the behavior varies a lot depending on the tokenizer.

@watzon
Member

watzon commented Oct 22, 2019

Yeah, the aggressive_tokenizer would probably be the one to use.

@rmarronnier
Member

@watzon: Can we move this issue to the cadmiumcr/sentiment repo? It makes more sense there :-)

@watzon
Member

watzon commented Nov 3, 2019

Yes, it should definitely be moved

@watzon watzon transferred this issue from cadmiumcr/cadmium Nov 3, 2019

4 participants