Replies: 5 comments 1 reply
-
This change is simple: just modify the batch_get_note_comments method.

    async def batch_get_note_comments(self, note_list: List[str]):
        """Batch get note comments"""
        # Just add a parameter check here
        utils.logger.info(
            f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_list:
            task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)
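A minimal sketch of what that check could look like, assuming the flag is passed in as a function argument; the parameter name crawl_comments is hypothetical, and the config-based variant shown in a later reply achieves the same result:

    async def batch_get_note_comments(self, note_list: List[str], crawl_comments: bool = True):
        """Batch get note comments, skipping all work when crawl_comments is False"""
        if not crawl_comments:
            # Caller only wants notes, so do not schedule any comment tasks
            return
        utils.logger.info(
            f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_list:
            task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)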
-
Hello, to turn comments off and crawl only notes, I made the following change with ChatGPT's help:
The crawl then produces two files: one with the note content and no comments, and another that still contains comments. After multiple attempts the first file only ever contains 20 notes. Where do I need to modify the code to crawl more?
-
    # config.py
    # Toggle for whether comment crawling is enabled
    ENABLE_CRALER_COMMENTS = False

    # MediaCrawler/media_platform/xhs/core.py
    async def batch_get_note_comments(self, note_list: List[str]):
        """Batch get note comments"""
        # Just read a flag from the config file here
        if not config.ENABLE_CRALER_COMMENTS:
            return
        utils.logger.info(
            f"[XiaoHongShuCrawler.batch_get_note_comments] Begin batch get note comments, note list: {note_list}")
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        task_list: List[Task] = []
        for note_id in note_list:
            task = asyncio.create_task(self.get_comments(note_id, semaphore), name=note_id)
            task_list.append(task)
        await asyncio.gather(*task_list)
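To sanity-check the gating behaviour without touching the crawler, here is a small self-contained sketch of the same pattern: a config flag short-circuits the batch function, otherwise the per-note tasks run under a bounded semaphore. The config object, the stub get_comments, and the sample note IDs are hypothetical stand-ins, not MediaCrawler code:

    import asyncio
    from types import SimpleNamespace
    from typing import List

    # Hypothetical stand-in for the project's config module
    config = SimpleNamespace(ENABLE_CRALER_COMMENTS=False, MAX_CONCURRENCY_NUM=4)

    async def get_comments(note_id: str, semaphore: asyncio.Semaphore) -> None:
        # Stub: pretend to fetch comments for one note while holding a semaphore slot
        async with semaphore:
            await asyncio.sleep(0.1)
            print(f"fetched comments for note {note_id}")

    async def batch_get_note_comments(note_list: List[str]) -> None:
        """Batch get note comments, skipped entirely when the config flag is off"""
        if not config.ENABLE_CRALER_COMMENTS:
            print("comment crawling disabled, skipping")
            return
        semaphore = asyncio.Semaphore(config.MAX_CONCURRENCY_NUM)
        tasks = [
            asyncio.create_task(get_comments(note_id, semaphore), name=note_id)
            for note_id in note_list
        ]
        await asyncio.gather(*tasks)

    if __name__ == "__main__":
        # With ENABLE_CRALER_COMMENTS = False this only prints the skip message
        asyncio.run(batch_get_note_comments(["note1", "note2", "note3"]))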
-
Add an option to the config file to toggle comment crawling, and add an if check inside batch_get_note_comments; that is all that is needed.
-
The repository now includes a built-in option for enabling or disabling comment crawling by default.
-
Request: add a parameter that controls whether comment content is crawled.
Scenario: only the XiaoHongShu note (image and text) content is needed; there is no need to analyse each note's comments in detail.
Pain point: the number of comments is huge and crawling them is time-consuming; most of the time the comments are not very useful, or the analysis simply does not go that deep.