TeunK/javaCrawler

Web Crawler

Technologies & Requirements

Building the project

Building and Executing

Navigate to the javacrawler folder (the one containing the src folder) and run:

$ mvn package

Then run the packaged JAR with the following command:

$ java -jar target/javacrawler-1.0-SNAPSHOT.jar

Flags

When run without arguments, the crawler uses its defaults. The following optional flags override them:

  • --url=<siteUrl> : Default is https://monzo.com
  • --crawlers=<crawlerCount> : Default is 25
  • --txt_output=<textfilename> : Default is sitemap.txt
  • --html_output=<htmlfilename> : Default is visualised.html
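The --name=value flags and their defaults can be sketched as a small parser. This is a minimal illustration, not the project's actual code: the class name FlagSketch and its methods are hypothetical, and rejecting unknown flags is an assumption about how bad input might be handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not taken from the javaCrawler source.
final class FlagSketch {

    // Defaults as documented in the flag list.
    static Map<String, String> defaults() {
        Map<String, String> flags = new LinkedHashMap<>();
        flags.put("url", "https://monzo.com");
        flags.put("crawlers", "25");
        flags.put("txt_output", "sitemap.txt");
        flags.put("html_output", "visualised.html");
        return flags;
    }

    // Overlays every --name=value argument onto the defaults.
    // Assumption: malformed or unknown flags are rejected with an exception.
    static Map<String, String> parse(String[] args) {
        Map<String, String> flags = defaults();
        for (String arg : args) {
            int eq = arg.indexOf('=');
            if (!arg.startsWith("--") || eq < 3) {
                throw new IllegalArgumentException("Expected --name=value, got: " + arg);
            }
            String name = arg.substring(2, eq);
            if (!flags.containsKey(name)) {
                throw new IllegalArgumentException("Unknown flag: --" + name);
            }
            // Split on the first '=' only, so values such as URLs may contain '='.
            flags.put(name, arg.substring(eq + 1));
        }
        return flags;
    }
}
```

Any flag that is not passed keeps its documented default, which matches the way the flag list describes them.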

Example

$ java -jar target/javacrawler-1.0-SNAPSHOT.jar --url=https://monzo.com --crawlers=10 --txt_output=result.txt --html_output=webgraph.html

  • Crawls https://monzo.com
  • Uses a pool of up to 10 concurrent crawler threads
  • Stores the site map as text in result.txt (inside the project's root folder; see the console output for the exact location)
  • Stores the visual graph in webgraph.html (inside the project's root folder; see the console output for the exact location)
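To illustrate what "up to 10 concurrent threads working in the pool" can mean in practice, here is a minimal sketch built on a fixed-size java.util.concurrent thread pool. It is not the project's actual implementation: PoolSketch is a hypothetical name, and fetchLinks is a hypothetical stand-in for downloading a page and extracting its links. The sketch crawls level by level, which is one possible strategy among several.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch; fetchLinks stands in for "download the page and extract its links".
final class PoolSketch {

    static Set<String> crawl(String start, int crawlers,
                             Function<String, List<String>> fetchLinks) {
        // At most `crawlers` pages are fetched at the same time.
        ExecutorService pool = Executors.newFixedThreadPool(crawlers);
        try {
            Set<String> visited = new LinkedHashSet<>();
            visited.add(start);
            List<String> frontier = Collections.singletonList(start);
            while (!frontier.isEmpty()) {
                // One task per page in the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> fetchLinks.apply(url));
                }
                // invokeAll blocks until every task in this level has finished.
                List<String> next = new ArrayList<>();
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    for (String link : result.get()) {
                        if (visited.add(link)) { // only unseen URLs go to the next level
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
            return visited;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Crawl interrupted", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("Fetching a page failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The visited set is only touched by the calling thread, so it needs no synchronisation; the worker threads do nothing but fetch.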

Considerations

  • Some feedback is provided to the user, e.g. on bad input. This feedback could be more explicit about what exactly went wrong.
  • A more sophisticated logging system could be set up by routing different log levels to different streams. E.g. low-importance logs could be written to a verbose log file, whereas high-importance logs (such as exceptions) could be written to a separate file or pushed onto a message queue for a logging service to pick up.
  • External pages are not crawled, nor are they recorded as child nodes of any URL.
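The rule about external pages can be sketched as a host comparison using java.net.URI. The README does not define "external", so this sketch assumes it means "served from a different host than the start URL"; under that assumption a subdomain also counts as external. ScopeSketch and isInternal are hypothetical names, not part of the project.

```java
import java.net.URI;

// Hypothetical sketch; assumes "external" means "different host than the start URL".
final class ScopeSketch {

    static boolean isInternal(String rootUrl, String candidateUrl) {
        try {
            URI root = URI.create(rootUrl);
            // resolve() turns relative links such as "/about" into absolute ones
            // and returns absolute links unchanged.
            URI candidate = root.resolve(candidateUrl);
            String rootHost = root.getHost();
            // getHost() is null for links like mailto:, which are then treated as external.
            return rootHost != null && rootHost.equalsIgnoreCase(candidate.getHost());
        } catch (IllegalArgumentException e) {
            return false; // unparsable links are skipped
        }
    }
}
```

A crawler following this rule would call isInternal for every extracted link and drop the ones that return false before adding them as child nodes.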

Screenshots and Visuals

Site Map Result

Test Result

