Can't fetch content of random pages #471

Open
NingMorris opened this issue Jul 25, 2022 · 1 comment

Comments

NingMorris commented Jul 25, 2022

Every time I run our crawl controller, a seemingly random subset of pages fails to be crawled. The logs say crawler4j can't fetch the content of these pages, but when I open them manually in a browser they load fine, e.g. https://www.sloans.com/inventory/john-deere-1790-173523.
Notes: I tried setting politenessDelay to null, 3 seconds, and 30 seconds. The longer the delay, the more pages fail to be crawled. With politenessDelay set to null, usually only a few pages fail.
Please let me know whether this is something we can fix on our end or whether it is related to crawler4j. Thanks!

The relevant log line is below.
2022-07-25 01:20:05.802 WARN 29552 --- [Crawler 2] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.sloans.com/inventory/john-deere-1790-173523
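
One way to check whether failures like this are intermittent on the server side, rather than specific to crawler4j, is to fetch an affected URL outside the crawler and retry it a few times. The helper below is only a rough sketch of that idea using the Java standard library; `FetchRetry`, `withRetries`, and its parameters are hypothetical names chosen for illustration and are not part of crawler4j.

```java
import java.util.concurrent.Callable;

// Hypothetical diagnostic helper, not part of crawler4j.
final class FetchRetry {
    private FetchRetry() {}

    // Calls `fetch` up to maxAttempts times, sleeping delayMillis between
    // failed attempts. Returns the first successful result, or rethrows the
    // last exception if every attempt fails.
    static <T> T withRetries(Callable<T> fetch, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return fetch.call();
            } catch (InterruptedException e) {
                // Preserve the interrupt flag and stop retrying.
                Thread.currentThread().interrupt();
                throw e;
            } catch (Exception e) {
                last = e;
                System.err.printf("attempt %d/%d failed: %s%n", attempt, maxAttempts, e);
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        throw last;
    }
}
```

The `fetch` argument could wrap any plain HTTP request to the failing URL, for example one made with `java.net.http.HttpClient`. If the request only succeeds after several attempts, or fails more often as the delay grows, that would suggest the problem lies in how the server responds rather than in the crawler itself.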

Repository owner deleted a comment from liukuan1 Jul 25, 2022
Contributor

Chaiavi commented Oct 11, 2022 via email
