Blacklist by Keyword #1243
Replies: 4 comments 1 reply
-
In this example, bbot is crawling 7k+ links in this format: https://www.bluehost.com/cdn-cgi/challenge-platform/h/b/jsd/r/879e95894fc60a61 If there were a feature that let us stop crawling during a scan, same as |
-
Agreed, this would be a good feature to have. Converting to issue.
-
Raw idea: If bbot had a module that was responsible for blacklisting and could also prevent HTTPX from crawling similar links, I think it would cut crawl duration a lot. For example, something like urless could check each URL against some configuration options before HTTPX crawls it, decide whether HTTPX has already crawled a similar link, and then either allow the crawl or skip it. It could have features such as: defining keywords to blacklist during crawling, for example /cdn-cgi/challenge-platform/; allowing only one language to be crawled and skipping the others; and skipping similar links for posts, articles, and products.
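To make the idea concrete, here is a minimal sketch of such a pre-crawl check covering the keyword and language rules. It is plain Python, not bbot or urless code: `should_crawl`, `BLACKLIST_KEYWORDS`, `ALLOWED_LANGUAGE`, and `KNOWN_LANGUAGES` are hypothetical names, and it assumes the language shows up as the first path segment (for example /de/...).

```python
from urllib.parse import urlsplit

# Hypothetical configuration; none of these names come from bbot or urless.
BLACKLIST_KEYWORDS = ("/cdn-cgi/challenge-platform/",)
ALLOWED_LANGUAGE = "en"
KNOWN_LANGUAGES = {"en", "de", "fr", "es"}


def should_crawl(url: str) -> bool:
    """Return False for blacklisted keywords or a non-allowed language prefix."""
    lowered = url.lower()

    # Rule 1: skip any URL that contains a blacklisted keyword.
    if any(keyword in lowered for keyword in BLACKLIST_KEYWORDS):
        return False

    # Rule 2: if the first path segment is a known language code,
    # crawl it only when it is the allowed language.
    segments = [s for s in urlsplit(lowered).path.split("/") if s]
    if segments and segments[0] in KNOWN_LANGUAGES:
        if segments[0] != ALLOWED_LANGUAGE:
            return False

    return True
```

A crawler would call `should_crawl(url)` right before requesting each URL. Substring matching is the simplest option; a regex per keyword would give more control at the cost of more configuration.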
-
Closing this issue as we have this now, via the |
-
In some programs, we need to blacklist a specific path, such as https://www.example.com/blog/
However, this does not seem to be possible with bbot, so I wanted to suggest adding a blacklist based on keywords.
So, if I add blog, then it won't scan or crawl any links that have blog in them.
Thanks 🙏
Update: I was also thinking about a way to limit crawling of similar links. For example, a site can have 100k products. I want to crawl only one of them, because the others are similar to it. Or a site can have 50k posts, but I want to crawl only one of them. It would be great if this could be implemented.
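One possible way to do this, sketched in plain Python: reduce each URL to a "shape" by replacing variable-looking path segments with a wildcard, then allow only a fixed number of URLs per shape. This is not how bbot works; `url_shape` and `SimilarLinkLimiter` are hypothetical names, and the rule for what counts as a variable segment (contains a digit, or has two or more hyphens like a slug) is an assumption that would need tuning per site.

```python
from collections import Counter
from urllib.parse import urlsplit


def url_shape(url: str) -> str:
    """Collapse variable-looking path segments into '*' (hypothetical heuristic)."""
    parts = urlsplit(url)
    segments = []
    for seg in parts.path.split("/"):
        if not seg:
            continue
        # Assumption: IDs contain digits, and slugs contain several hyphens.
        if any(ch.isdigit() for ch in seg) or seg.count("-") >= 2:
            segments.append("*")
        else:
            segments.append(seg)
    return parts.netloc.lower() + "/" + "/".join(segments)


class SimilarLinkLimiter:
    """Allow at most max_per_shape URLs for each URL shape."""

    def __init__(self, max_per_shape: int = 1):
        self.max_per_shape = max_per_shape
        self.seen = Counter()

    def allow(self, url: str) -> bool:
        shape = url_shape(url)
        if self.seen[shape] >= self.max_per_shape:
            return False
        self.seen[shape] += 1
        return True
```

With the default of one URL per shape, the first /products/1001 page is crawled and /products/1002 is skipped, while unrelated pages such as /about are unaffected. Raising `max_per_shape` trades crawl time for coverage.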