Cannot fetch content of some website but python can. #469

ryan701212 · 2022-05-26T03:15:10Z

I want to fetch the page "https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt" with crawler4j, but the fetch fails. I spent a day trying to solve this and it still does not work. Can somebody help? Thanks. My code is as follows:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public void Run() throws Exception {
        // Folder where crawler4j stores intermediate crawl data (the frontier).
        String crawlStorageFolder = "h:";
        int numberOfCrawlers = 3;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        // Send a desktop Chrome user agent instead of the crawler4j default.
        config.setUserAgentString("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36");

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // For each crawl, you need to add some seed URLs. These are the first
        // URLs that are fetched; the crawler then follows the links
        // found in these pages.
        controller.addSeed("https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt");
        //controller.addSeed("https://github.com/hemin1003/java-spider");

        // The factory which creates instances of crawlers.
        CrawlController.WebCrawlerFactory<ArrowWebCrawler> factory = ArrowWebCrawler::new;

        // Start the crawl. This is a blocking operation, meaning that your code
        // will reach the line after this only when crawling is finished.
        controller.start(factory, numberOfCrawlers);
    }

}

The log output is as follows:
2022-05-26 10:05:50.293 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Starting Webcrawler1Application using Java 13.0.2 on DESKTOP-TJDKVUQ with PID 73168 (H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1\bin\main started by Ryan in H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1)
2022-05-26 10:05:50.295 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : No active profile set, falling back to 1 default profile: "default"
2022-05-26 10:05:50.654 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Started Webcrawler1Application in 0.587 seconds (JVM running for 1.316)
2022-05-26 10:05:50.811 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Deleted contents of: h:\frontier ( as you have configured resumable crawling to false )
2022-05-26 10:05:51.492 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : File not found: tld-names.txt
2022-05-26 10:05:51.501 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : Obtained 8433 TLD from packaged file tld-names.txt
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 1 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 2 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 3 started
2022-05-26 10:06:31.966 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:31.967 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:41.824 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : It looks like no thread is working, waiting for 10 seconds to make sure...
2022-05-26 10:06:51.827 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : All of the crawlers are stopped. Finishing the process...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : Waiting for 10 seconds before final clean up...
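
Since the same URL can be fetched from Python, a standalone check outside crawler4j may help show whether the server answers a plain Java HTTP client at all. The following is only a minimal sketch, not part of the original code: the class name FetchCheck, the helper buildRequest, the extra Accept headers, and the 30-second timeouts are all assumptions chosen for illustration. It uses the JDK's built-in java.net.http client (Java 11 or later) and the same user agent string as the crawler configuration above.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical diagnostic class; not part of crawler4j or the original project.
class FetchCheck {

    // Same user agent string as in the crawler configuration.
    static final String USER_AGENT =
            "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
            + "(KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36";

    // Builds a GET request with browser-like headers (header values are assumptions).
    static HttpRequest buildRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(30))
                .header("User-Agent", USER_AGENT)
                .header("Accept", "text/html,application/xhtml+xml")
                .header("Accept-Language", "en-US,en;q=0.9")
                .GET()
                .build();
    }

    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .connectTimeout(Duration.ofSeconds(30))
                .build();

        HttpRequest request = buildRequest(
                "https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt");

        // Send the request and report what came back; a timeout here would
        // surface as an exception instead of a status line.
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println("Status: " + response.statusCode());
        System.out.println("Body length: " + response.body().length());
    }
}
```

If this check also times out or returns an error status, the problem is likely not specific to crawler4j; if it succeeds, comparing its request headers with those sent by the crawler would be a reasonable next step.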


liukuan1 · 2022-05-26T03:15:38Z

I have received your email and will read it promptly. Thank you!
