The project is build with Scala 2.10.2 and sbt 0.13.0, both can be installed using this install script
To try the example navigate to the project folder and run sbt "project scraper-demo" run
which will start the example scraper
If you use SBT, you just have to edit build.sbt
and add the following:
libraryDependencies += "nl.razko" %% "scraper" % "0.4.1"
If you want to use bleeding edge versions using snapshots then add the Sonatype snapshots to the resolvers:
resolvers += "Sonatype Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"
libraryDependencies += "nl.razko" %% "scraper" % "0.4.1-SNAPSHOT"
The webscraper provides a simple DSL to write scrape rules
import org.rovak.scraper.ScrapeManager._
import org.jsoup.nodes.Element
object Google {
val results = "#res li.g h3.r a"
def search(term: String) = {
"http://www.google.com/search?q=" + term.replace(" ", "+")
}
}
// Open the search results page for the query "php elephant"
scrape from Google.search("php elephant") open { implicit page =>
// Iterate through every result link
Google.results each { x: Element =>
val link = x.select("a[href]").attr("abs:href").substring(28)
if (link.isValidURL) {
// Iterate through every found link in the found page
scrape from link each (x => println("found: " + x))
}
}
}
A spider is a scraper which recursively loads a page and opens every link it finds. It will keep scraping until all pages within the allowed domains are visited once.
The following snippet demonstrates a basic spider which crawls a website and provides hooks to do something with the data
new Spider {
startUrls ::= "http://events.stanford.edu/"
allowedDomains ::= "events.stanford.edu"
onReceivedPage ::= { page: WebPage =>
// Page received
}
onLinkFound ::= { link: Href =>
println(s"Found link ${link.url} with name ${link.name}")
}
}.start()
The spider can be extended by providing traits, if you want to scrape emails then
add the EmailSpider trait which offers a new onEmailFound
hook in which emails can be collected.
new Spider with EmailSpider {
startUrls ::= "http://events.stanford.edu/"
allowedDomains ::= "events.stanford.edu"
onEmailFound ::= { email: String =>
// Email found
}
onReceivedPage ::= { page: WebPage =>
// Page received
}
onLinkFound ::= { link: Href =>
println(s"Found link ${link.url} with name ${link.name}")
}
}.start()
Multiple spiders can be mixed together
new Spider with EmailSpider with SitemapSpider {
startUrls ::= "http://events.stanford.edu/"
allowedDomains ::= "events.stanford.edu"
sitemapUrls ::= "http://events.stanford.edu/sitemap.xml"
onEmailFound ::= { email: String =>
println("Found email: " + email)
}
onReceivedPage ::= { page: WebPage =>
// Page received
}
onLinkFound ::= { link: Href =>
println(s"Found link ${link.url} with name ${link.name}")
}
}.start()