For some reason, every time I run our crawl controller, some random pages fail to be crawled. According to the logs, it can't fetch the content of these pages, but if I open them manually in a browser, everything seems fine (e.g. https://www.sloans.com/inventory/john-deere-1790-173523).
Notes: I tried setting politenessDelay to null, 3 seconds, and 30 seconds. It seems that the longer the delay, the more random pages fail to be crawled. With politenessDelay set to null, usually only a few pages fail.
Please let me know whether this is something we can fix on our end or whether it is related to crawler4j. Thanks!
Search for "Can't fetch content of" in WebCrawler.
Override that method and dump more information: why wasn't the page fetched?
What was the HTTP response code?
Any other information?
That will help you understand the problem.
Also try fetching only that page. Did it succeed?
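To try the "fetch only that page" step outside the crawler, something like the sketch below might help. It is not crawler4j code: `FetchProbe`, `classify`, `describe`, and `probe` are hypothetical names made up for this example, the `User-Agent` value and the timeouts are arbitrary assumptions, and it only uses the JDK's `java.net.http` client (Java 11+). It prints the status code, content type, and body size for a single URL, which may hint at whether the server is redirecting, blocking, or returning an empty body.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical stand-alone helper; it does not use or mimic crawler4j internals.
class FetchProbe {

    // Rough, assumed interpretation of an HTTP status code for debugging purposes.
    static String classify(int statusCode) {
        if (statusCode >= 200 && statusCode < 300) return "ok";
        if (statusCode >= 300 && statusCode < 400) return "redirect";
        if (statusCode == 403 || statusCode == 429) return "possibly blocked or rate limited";
        if (statusCode >= 400 && statusCode < 500) return "client error";
        if (statusCode >= 500 && statusCode < 600) return "server error";
        return "unexpected";
    }

    // Formats the facts worth logging about one fetch attempt.
    static String describe(String url, int statusCode, String contentType, int bodyLength) {
        return url + " -> HTTP " + statusCode + " (" + classify(statusCode) + ")"
                + ", content-type=" + contentType + ", body bytes=" + bodyLength;
    }

    // Fetches a single URL without following redirects, so a 3xx stays visible.
    static String probe(String url) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(20))
                .followRedirects(HttpClient.Redirect.NEVER)
                .build();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", "fetch-probe-example") // assumed value
                .GET()
                .build();
        HttpResponse<byte[]> response =
                client.send(request, HttpResponse.BodyHandlers.ofByteArray());
        String contentType =
                response.headers().firstValue("Content-Type").orElse("unknown");
        return describe(url, response.statusCode(), contentType, response.body().length);
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java FetchProbe.java <url>");
            return;
        }
        System.out.println(probe(args[0]));
    }
}
```

Running `java FetchProbe.java https://www.sloans.com/inventory/john-deere-1790-173523` a few times, with and without pauses between runs, would show whether the same URL returns different status codes over time. The crawler sends its own user agent and headers, so a result here is only a hint, not proof of what the crawler received.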
The relevant log entry is below.
2022-07-25 01:20:05.802 WARN 29552 --- [Crawler 2] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.sloans.com/inventory/john-deere-1790-173523