Skip to content

大众点评爬虫,爬取评论数据、评论者信息

Notifications You must be signed in to change notification settings

46319943/DianpingScraper

Repository files navigation

Dianping Scraper version 1

This code was first written in 2020. The original purpose is to scrape the comment from Dianping.com. However, as more tasks were planned to be done, the functionality of this repo became really a mess. These individual tasks should be implemented in separated repos.

Moreover, the webpage of dianping.com has changed and the text is no longer encrypted, so the OCR module is no longer needed.

Therefore, to adapt for the new version of dianping.com and make the repo clean, a new implementation of the DianpingScraper is done in the version 2.

Description

DianpingScraper is used to scrape the comments from dianping.com. It use the browser automation and OCR technique to get the text encrypted using custom css font.

Quick Start

  • The entry point is scrape_dianping.js written in nodejs.
  • The OCR is implemented in ocr_server.py using python.

Extension

Some analysis is conducted using the scraped data. The analysis includes:

  • Image label tagging using Google Vision API
  • Text sentiment analysis.
  • SHAP analysis for the sentiment analysis model.

About

大众点评爬虫,爬取评论数据、评论者信息

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published