GitHub - lichunhong2010/Douban-Comments-Spider: A spider for Douban comments that collects short reviews of movies, music, and books and outputs them as a word cloud.

GetID_Douban.py

Gets a Douban id from the movie, music, or book name that you provide.

Douban_id():

Called from the main function. You need to create the object yourself and pass the parameters in.

def __init__(self,name,sort='movie'):

param name: the movie, music, or book name.
param sort: the category, one of movie (movie), book (book), or music (music).
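To make the constructor parameters concrete, here is a minimal sketch of a stand-in class with the same signature. The class name `DoubanIdSketch` and the validation of `sort` are assumptions for illustration; the real `Douban_id` class may store or check its arguments differently.

```python
class DoubanIdSketch:
    """Hypothetical stand-in that mirrors the documented constructor signature."""

    # Categories listed in the README: movie, book, music.
    VALID_SORTS = ("movie", "book", "music")

    def __init__(self, name, sort="movie"):
        # Assumption: reject unknown categories early instead of failing later.
        if sort not in self.VALID_SORTS:
            raise ValueError(f"sort must be one of {self.VALID_SORTS}, got {sort!r}")
        self.name = name
        self.sort = sort


# The category defaults to 'movie' when it is omitted.
lookup = DoubanIdSketch("Farewell My Concubine")
book_lookup = DoubanIdSketch("To Live", sort="book")
```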

def getID(self):

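Must be called manually on the object. It searches by the name and category the user provided, then retrieves and returns the matching id.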

Mainly uses xml and regular expressions.
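The regular-expression step might look roughly like the sketch below. It assumes that search results link to pages whose URLs contain `/subject/<digits>/`; that pattern, the function name `extract_subject_id`, and the sample markup are illustrative assumptions, not code taken from GetID_Douban.py.

```python
import re

# Assumed pattern: subject links contain ".../subject/<digits>/".
SUBJECT_ID_PATTERN = re.compile(r"/subject/(\d+)/")


def extract_subject_id(html):
    """Return the first subject id found in the markup, or None if absent."""
    match = SUBJECT_ID_PATTERN.search(html)
    return match.group(1) if match else None


# Made-up search-result fragment, for illustration only.
sample = '<a href="https://movie.douban.com/subject/1291546/">Result</a>'
print(extract_subject_id(sample))
```

Returning `None` when nothing matches lets the caller distinguish "no result" from a real id without catching an exception.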

getComments.py

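Combines the id obtained by Douban_id() with suburl to build the full short-comment URL, fetches the data, and saves it locally. The return value is the path of the saved file.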
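The URL assembly and the save-then-return-path behaviour can be sketched as follows. The URL template, the function names, and the one-comment-per-line file format are all assumptions; the actual suburl and file layout used by getComments.py are not shown in this README, and the network fetch is left out here.

```python
import os
import tempfile

# Assumed URL layout; the real suburl used by getComments.py is not shown.
COMMENTS_URL_TEMPLATE = "https://{sort}.douban.com/subject/{subject_id}/comments"


def build_comments_url(subject_id, sort="movie"):
    """Join the category, the id, and the comments suffix into one URL."""
    return COMMENTS_URL_TEMPLATE.format(sort=sort, subject_id=subject_id)


def save_comments(comments, directory, filename):
    """Write one comment per line and return the path of the saved file."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as handle:
        for comment in comments:
            handle.write(comment.strip() + "\n")
    return path


url = build_comments_url("1291546")
# A temporary directory stands in for the project's comments folder.
saved_path = save_comments(
    ["Great film.", "  Worth rewatching. "], tempfile.mkdtemp(), "1291546.txt"
)
```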

Keywords.py

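Cleans the comment text saved in the file and generates a word cloud from the extracted keywords. It uses ChineseStopWords.txt in the same folder to strip out all Chinese function words; you can build this file yourself or download one from the web. simhei.ttf is the font used for the word cloud.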
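The stop-word cleaning step can be illustrated with the standard library alone. This sketch assumes the stop-word file holds one word per line and that the comments have already been split into tokens; Chinese word segmentation and the word-cloud rendering itself are omitted, and the function names are hypothetical.

```python
from collections import Counter


def load_stop_words(lines):
    """Build a stop-word set from an iterable of lines (one word per line)."""
    return {line.strip() for line in lines if line.strip()}


def keyword_frequencies(tokens, stop_words):
    """Count the tokens that remain after dropping blanks and stop words."""
    cleaned = [token.strip() for token in tokens]
    return Counter(t for t in cleaned if t and t not in stop_words)


# In practice the lines would come from open("ChineseStopWords.txt", encoding="utf-8").
stop_words = load_stop_words(["的\n", "了\n", "\n", "是\n"])
tokens = ["电影", "的", "画面", "很", "美", "了", "电影"]
frequencies = keyword_frequencies(tokens, stop_words)
print(frequencies.most_common(3))
```

The resulting frequency table is the kind of input a word-cloud renderer would take, together with a font such as simhei.ttf that can display Chinese characters.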

Folders and files (1 commit):

__pycache__
comments_infor
screenshorts
ChineseStopWords.txt
GetID_Douban.py
Keywords.py
README.md
__init__.py
getComments.py
main.py
simhei.ttf

comments_infor

screenshorts

Category directories:

Films, movies, books:

Saved comment files:

Word cloud display:
