Crawler should honor the Crawl-Delay if obeyRobotsTxt:true #194

panthony · 2018-04-03T14:59:18Z

What is the current behavior?

The Crawl-Delay is ignored.

What is the expected behavior?

The Crawl-Delay should be honored. It can be retrieved using getCrawlDelay() on the robots parser.
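
To make the expected behavior concrete, here is a minimal sketch of what reading the directive involves. `findCrawlDelaySeconds` is a hypothetical helper written for illustration. It is not the crawler's code and not the robots parser's `getCrawlDelay()`. It assumes the value is a number of seconds and that a group naming the bot takes precedence over the `*` group.

```javascript
// Hypothetical helper, not part of the crawler or of any robots parser library.
// Returns the Crawl-delay (assumed to be in seconds) that applies to userAgent,
// or null when robots.txt does not declare a usable one.
function findCrawlDelaySeconds(robotsTxt, userAgent) {
  const ua = userAgent.toLowerCase();
  let groupAgents = [];      // user-agents named by the current group
  let inAgentLines = false;  // true while reading consecutive User-agent lines
  let specific = null;       // delay from a group that names this agent
  let wildcard = null;       // delay from the "*" group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line
    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      if (!inAgentLines) groupAgents = []; // a new group starts here
      groupAgents.push(value.toLowerCase());
      inAgentLines = true;
      continue;
    }
    inAgentLines = false;

    if (field === "crawl-delay") {
      const seconds = Number(value);
      // Ignore empty, non-numeric, or negative values.
      if (value === "" || !Number.isFinite(seconds) || seconds < 0) continue;
      const namesThisAgent = groupAgents.some(
        (a) => a !== "*" && a !== "" && ua.includes(a)
      );
      if (namesThisAgent) specific = seconds;
      else if (groupAgents.includes("*")) wildcard = seconds;
    }
  }
  return specific !== null ? specific : wildcard;
}
```

With a robots.txt that sets `Crawl-delay: 10` for `*` and `Crawl-delay: 2` for `mybot`, the helper returns 2 for `MyBot/1.0` and 10 for any other agent.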

What is the motivation / use case for changing the behavior?

A bot is bound to respect all the directives of the robots.txt.

yujiosaka · 2018-04-03T15:47:21Z

@panthony
Crawl-Delay is not part of the standard, so there is no way to tell whether the number means seconds, minutes, hours, or days.
Providing a robots.txt is probably the most direct solution to your use case: #192

panthony · 2018-04-04T07:25:55Z

@yujiosaka You are right, this is not part of the standard.

But it looks like everyone agrees that it is expected to be a number of seconds, and if the crawler does not obey it out of the box, we should have some way to enforce it.

It would be sad to be banned from accessing a site because we did not obey their rules :)

I do not quite see how providing a robots.txt could be a solution?

Or did you mean that I could configure the crawler's delay according to the robots.txt I provide?
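
If that is the idea, a rough sketch of the workaround could look like the following. `optionsFromCrawlDelay` is a hypothetical helper. It assumes the crawler accepts a `delay` option in milliseconds and that `maxConcurrency: 1` is needed for the delay to take effect. Both assumptions should be checked against the crawler's documentation.

```javascript
// Hypothetical helper: turn a Crawl-delay into crawler launch options.
// ASSUMPTION: the crawler takes a `delay` option in milliseconds and needs
// `maxConcurrency: 1` for the delay to apply; verify against the docs.
// ASSUMPTION: the Crawl-delay value is a number of seconds.
function optionsFromCrawlDelay(crawlDelaySeconds, baseOptions = {}) {
  if (typeof crawlDelaySeconds !== "number" || !(crawlDelaySeconds > 0)) {
    return { ...baseOptions }; // no usable delay: leave the options unchanged
  }
  return {
    ...baseOptions,
    delay: Math.round(crawlDelaySeconds * 1000), // seconds to milliseconds
    maxConcurrency: 1,
  };
}
```

For example, `optionsFromCrawlDelay(2.5, { obeyRobotsTxt: true })` yields `{ obeyRobotsTxt: true, delay: 2500, maxConcurrency: 1 }`. This applies one delay to the whole crawl, so it only fits a crawl that targets a single site.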

yujiosaka added the feature label Apr 3, 2018
