Semester project on data mining from the web.
CSFD is a Czech IMDb-like movie database.
The goal of this project is to compare how well the sentiment of a comment matches its star rating.
- The crawler gets comments from the first 300 most inconsistently rated movies (to cover all types of ratings).
- We run AFINN sentiment analysis on the comment text, normalize the result to a value from 0 to 5, and compare it with the number of stars assigned to the comment.
- For sentiment analysis I'm using the AFINN dictionary from https://github.com/VilemR/affin.cz.
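The scoring and normalization step could look roughly like the sketch below. It is not the project's actual code: the mini-lexicon, the function names `rawSentiment` and `toStars`, the averaging of word valences, and the linear mapping from the AFINN range of -5 to 5 onto 0 to 5 stars are all assumptions made for illustration.

```javascript
// Hypothetical mini-lexicon in AFINN style: word -> valence in [-5, 5].
// The real project uses the affin.cz dictionary; these entries are made up.
const lexicon = new Map([
  ["super", 3],
  ["nuda", -2],
  ["odpad", -3],
]);

// Average valence of the words found in the lexicon; 0 when none match.
// (Assumption: the project may sum or weight valences differently.)
function rawSentiment(text) {
  const words = text.toLowerCase().split(/[^\p{L}]+/u).filter(Boolean);
  const hits = words.filter((w) => lexicon.has(w)).map((w) => lexicon.get(w));
  if (hits.length === 0) return 0;
  return hits.reduce((a, b) => a + b, 0) / hits.length;
}

// Linearly map [-5, 5] onto [0, 5] and round to whole stars.
function toStars(raw) {
  return Math.round((raw + 5) / 2);
}
```

With this mapping a comment without any known word lands in the middle of the scale, which is one possible design choice; treating such comments as "no sentiment" and skipping them would be another.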
The star rating matched the sentiment in 1688 cases, which is 14.88% of comments.
Average rating difference: 1.65
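The two figures above could be computed along these lines. This is a minimal sketch under assumptions: the function name `compare`, the shape of the input records, and the use of the absolute difference for the average are hypothetical and may differ from what the project does.

```javascript
// Each pair holds the stars given by the user and the stars derived
// from the normalized sentiment (both assumed to be integers 0..5).
function compare(pairs) {
  if (pairs.length === 0) {
    return { matches: 0, matchPercent: 0, averageDiff: 0 };
  }
  let matches = 0;
  let totalDiff = 0;
  for (const { stars, sentiment } of pairs) {
    if (stars === sentiment) matches += 1;
    totalDiff += Math.abs(stars - sentiment);
  }
  return {
    matches,
    matchPercent: (matches / pairs.length) * 100,
    averageDiff: totalDiff / pairs.length,
  };
}
```

For example, `compare([{ stars: 5, sentiment: 5 }, { stars: 0, sentiment: 5 }])` counts one match out of two comments and averages the differences 0 and 5.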
Interesting results: the comments with the largest difference between the star rating and the sentiment turned out to be homonyms, irony, or user errors. For example:
Slušnej nářez.
(colloquial Czech, roughly "A pretty intense ride.") has the maximal sentiment value, but the user rated it as total trash.
- Install dependencies
yarn
- Crawl comment data
yarn start:crawl
- Run sentiment analysis on the data
yarn start:sentiment