Cannot fetch content of some website but python can. #469

Open
ryan701212 opened this issue May 26, 2022 · 1 comment

ryan701212 commented May 26, 2022

I want to fetch the page "https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt" with crawler4j, but the fetch fails. I have spent a day trying to solve this and it still does not work. Can somebody help? Thanks. My code is as follows:

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {

    public void Run() throws Exception {
        String crawlStorageFolder = "h:";
        int numberOfCrawlers = 3;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setUserAgentString("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36");

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // For each crawl, you need to add some seed URLs. These are the first
        // URLs that are fetched; the crawler then follows the links
        // found in these pages.
        controller.addSeed("https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt");
        //controller.addSeed("https://github.com/hemin1003/java-spider");

        // The factory which creates instances of crawlers.
        CrawlController.WebCrawlerFactory<ArrowWebCrawler> factory = ArrowWebCrawler::new;

        // Start the crawl. This is a blocking operation, meaning that your code
        // will reach the line after this only when crawling is finished.
        controller.start(factory, numberOfCrawlers);
    }
}

The log output is as follows:
2022-05-26 10:05:50.293 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Starting Webcrawler1Application using Java 13.0.2 on DESKTOP-TJDKVUQ with PID 73168 (H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1\bin\main started by Ryan in H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1)
2022-05-26 10:05:50.295 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : No active profile set, falling back to 1 default profile: "default"
2022-05-26 10:05:50.654 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Started Webcrawler1Application in 0.587 seconds (JVM running for 1.316)
2022-05-26 10:05:50.811 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Deleted contents of: h:\frontier ( as you have configured resumable crawling to false )
2022-05-26 10:05:51.492 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : File not found: tld-names.txt
2022-05-26 10:05:51.501 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : Obtained 8433 TLD from packaged file tld-names.txt
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 1 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 2 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 3 started
2022-05-26 10:06:31.966 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:31.967 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:41.824 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : It looks like no thread is working, waiting for 10 seconds to make sure...
2022-05-26 10:06:51.827 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : All of the crawlers are stopped. Finishing the process...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : Waiting for 10 seconds before final clean up...
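
One way to narrow this down, offered as a sketch and not a confirmed fix: in the log above, the "Can't fetch content" warning appears roughly 20 seconds after the crawlers start, which may point to a timeout and not an immediate refusal. The standalone probe below sends the same URL and User-Agent through the JDK's built-in `java.net.http.HttpClient` (Java 11+) and prints either the status code or the timeout. The class name `FetchProbe`, the helper `buildRequest`, and the timeout values are hypothetical choices for illustration. Because this is not the HTTP stack that crawler4j uses, the result is only a hint about how the server treats a Java client.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.net.http.HttpTimeoutException;
import java.time.Duration;

// Hypothetical diagnostic class; not part of crawler4j.
public class FetchProbe {

    // Builds a GET request carrying the given User-Agent and a per-request timeout.
    public static HttpRequest buildRequest(String url, String userAgent, Duration timeout) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .timeout(timeout)
                .GET()
                .build();
    }

    public static void main(String[] args) throws InterruptedException {
        String url = "https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt";
        String userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
                + "(KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36";

        // Timeout values are assumptions chosen for the probe, not crawler4j settings.
        HttpClient client = HttpClient.newBuilder()
                .connectTimeout(Duration.ofSeconds(10))
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();

        try {
            HttpResponse<String> response = client.send(
                    buildRequest(url, userAgent, Duration.ofSeconds(30)),
                    HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + response.statusCode());
            System.out.println("Body length: " + response.body().length());
        } catch (HttpTimeoutException e) {
            // The server accepted the connection attempt but did not answer in time.
            System.out.println("Timed out: " + e.getMessage());
        } catch (IOException e) {
            // Connection reset, TLS failure, DNS failure, and similar I/O errors.
            System.out.println("I/O error: " + e);
        }
    }
}
```

If the probe times out as well, the server is probably stalling requests from this kind of client; if it returns a 4xx status, the request is being rejected outright. Either result helps decide whether to look at timeouts or at request headers next.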

liukuan1 commented May 26, 2022 via email
