import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class Controller {
    public void Run() throws Exception {
        // Folder where crawler4j stores its intermediate crawl data.
        String crawlStorageFolder = "h:";
        int numberOfCrawlers = 3;

        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder(crawlStorageFolder);
        config.setUserAgentString("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36");

        // Instantiate the controller for this crawl.
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // Each crawl needs some seed URLs. These are fetched first, and the
        // crawler then follows the links found in those pages.
        controller.addSeed("https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt");
        //controller.addSeed("https://github.com/hemin1003/java-spider");

        // The factory that creates crawler instances.
        CrawlController.WebCrawlerFactory<ArrowWebCrawler> factory = ArrowWebCrawler::new;

        // Start the crawl. This call blocks, so the line after it is reached
        // only once crawling has finished.
        controller.start(factory, numberOfCrawlers);
    }
}
The log output is as follows:
2022-05-26 10:05:50.293 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Starting Webcrawler1Application using Java 13.0.2 on DESKTOP-TJDKVUQ with PID 73168 (H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1\bin\main started by Ryan in H:\workspace-spring-tool-suite-4-4.6.1.RELEASE\webcrawler-1)
2022-05-26 10:05:50.295 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : No active profile set, falling back to 1 default profile: "default"
2022-05-26 10:05:50.654 INFO 73168 --- [ main] c.e.webcrawler.Webcrawler1Application : Started Webcrawler1Application in 0.587 seconds (JVM running for 1.316)
2022-05-26 10:05:50.811 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Deleted contents of: h:\frontier ( as you have configured resumable crawling to false )
2022-05-26 10:05:51.492 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : File not found: tld-names.txt
2022-05-26 10:05:51.501 INFO 73168 --- [ main] edu.uci.ics.crawler4j.url.TLDList : Obtained 8433 TLD from packaged file tld-names.txt
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 1 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 2 started
2022-05-26 10:06:11.822 INFO 73168 --- [ main] e.u.i.crawler4j.crawler.CrawlController : Crawler 3 started
2022-05-26 10:06:31.966 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:31.967 WARN 73168 --- [ Crawler 1] e.uci.ics.crawler4j.crawler.WebCrawler : Can't fetch content of: https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt
2022-05-26 10:06:41.824 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : It looks like no thread is working, waiting for 10 seconds to make sure...
2022-05-26 10:06:51.827 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : No thread is working and no more URLs are in queue waiting for another 10 seconds to make sure...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : All of the crawlers are stopped. Finishing the process...
2022-05-26 10:07:01.828 INFO 73168 --- [ Thread-1] e.u.i.crawler4j.crawler.CrawlController : Waiting for 10 seconds before final clean up...
I want to fetch the page "https://www.arrow.com/en/categories/diodes-transistors-and-thyristors/bipolar-transistors/rf-bjt", but the crawl fails. I spent a day trying to solve this, and it still does not work. Can somebody help? Thanks. My code (the Controller class) and the log output are shown above.