Replies: 5 comments 1 reply
-
This change is simple: just modify the batch_get_note_comments method.

    async def batch_get_note_comments(self, note_list: List[str]):
        """Batch get note comments"""
        # Just add a parameter check here
        utils.logger.info(
            f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_list:
            task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)
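A minimal sketch of what that check could look like, assuming the flag is passed in as a function argument; the parameter name crawl_comments is hypothetical, and the config-based variant shown in a later reply achieves the same result:

    async def batch_get_note_comments(self, note_list: List[str], crawl_comments: bool = True):
        """Batch get note comments, skipping all work when crawl_comments is False"""
        if not crawl_comments:
            # Caller only wants notes, so do not schedule any comment tasks
            return
        utils.logger.info(
            f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_list:
            task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)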
-
Hello, to turn comments off and crawl only notes, I made the following change with ChatGPT's help:
The crawl then produces two files: one with the note content and no comments, and another that still contains comments. After multiple attempts the first file only ever contains 20 notes. Where do I need to modify the code to crawl more?
-
    # config.py
    # Toggle for whether comment crawling is enabled
    ENABLE_CRALER_COMMENTS = False

    # MediaCrawler/media_platform/xhs/core.py
    async def batch_get_note_comments(self, note_list: List[str]):
        """Batch get note comments"""
        # Just read a flag from the config file here
        if not config.ENABLE_CRALER_COMMENTS:
            return
        utils.logger.info(
            f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_list:
            task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)
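To sanity-check the gating behaviour without touching the crawler, here is a small self-contained sketch of the same pattern: a config flag short-circuits the batch function, otherwise the per-note tasks run under a bounded semaphore. The config object, the stub get_comments, and the sample note IDs are hypothetical stand-ins, not MediaCrawler code:

    import asyncio
    from types import SimpleNamespace
    from typing import List

    # Hypothetical stand-in for the project's config module
    config = SimpleNamespace(ENABLE_CRALER_COMMENTS=False, MAX_CONCURRENCY_NUM=4)

    async def get_comments(note_id: str, semaphore: asyncio.Semaphore) -> None:
        # Stub: pretend to fetch comments for one note while holding a semaphore slot
        async with semaphore:
            await asyncio.sleep(0.1)
            print(f"fetched comments for note {note_id}")

    async def batch_get_note_comments(note_list: List[str]) -> None:
        """Batch get note comments, skipped entirely when the config flag is off"""
        if not config.ENABLE_CRALER_COMMENTS:
            print("comment crawling disabled, skipping")
            return
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        tasks = [
            asyncio.create_task(get_comments(note_id, semaphore), name=note_id)
            for note_id in note_list
        ]
        await asyncio.gather(*tasks)

    if __name__ == "__main__":
        # With ENABLE_CRALER_COMMENTS = False this only prints the skip message
        asyncio.run(batch_get_note_comments(["note1", "note2", "note3"]))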
-
Add an option to the config file to toggle comment crawling, and add an if check inside batch_get_note_comments; that is all that is needed.
-
The repository now includes a built-in option for enabling or disabling comment crawling by default.
-
Request: add a parameter that controls whether comment content is crawled.
Scenario: only the XiaoHongShu note (image and text) content is needed; there is no need to analyse each note's comments in detail.
Pain point: the number of comments is huge and crawling them is time-consuming; most of the time the comments are not very useful, or the analysis simply does not go that deep.