Scala Webscraper 0.4.1

Getting started

The project is built with Scala 2.10.2 and sbt 0.13.0. Both can be installed using this install script.

To try the example, navigate to the project folder and run sbt "project scraper-demo" run, which starts the example scraper.

Installation

If you use sbt, add the following to build.sbt:

libraryDependencies += "nl.razko" %% "scraper" % "0.4.1"

To use bleeding-edge snapshot versions, add the Sonatype snapshots repository to the resolvers:

resolvers += "Sonatype Snapshots" at "http://oss.sonatype.org/content/repositories/snapshots/"

libraryDependencies += "nl.razko" %% "scraper" % "0.4.1-SNAPSHOT"

DSL

The webscraper provides a simple DSL for writing scrape rules:

import org.rovak.scraper.ScrapeManager._
import org.jsoup.nodes.Element

object Google {
  val results = "#res li.g h3.r a"
  def search(term: String) = {
    "http://www.google.com/search?q=" + term.replace(" ", "+")
  }
}

// Open the search results page for the query "php elephant"
scrape from Google.search("php elephant") open { implicit page =>

  // Iterate through every result link
  Google.results each { x: Element =>
  
    // Drop the first 28 characters, the length of the "http://www.google.com/url?q=" redirect prefix
    val link = x.select("a[href]").attr("abs:href").substring(28)
    if (link.isValidURL) {

      // Iterate through every found link in the found page
      scrape from link each (x => println("found: " + x))
    }
  }
}
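
The example above builds the search URL by replacing spaces only and strips the redirect prefix with a hard-coded substring(28). As a minimal sketch of a more defensive variant, the helpers below percent-encode the whole search term and strip the prefix only when it is present. GoogleUrls, RedirectPrefix, search and stripRedirect are hypothetical names for illustration and are not part of the scraper library.

```scala
import java.net.URLEncoder

// Hypothetical helpers for illustration; not part of the scraper library.
object GoogleUrls {
  // The 28-character prefix that the DSL example skips with substring(28).
  val RedirectPrefix = "http://www.google.com/url?q="

  // Build a search URL, encoding the whole term instead of only replacing spaces.
  def search(term: String): String =
    "http://www.google.com/search?q=" + URLEncoder.encode(term, "UTF-8")

  // Strip the redirect prefix when it is present; otherwise return the link unchanged.
  def stripRedirect(link: String): String =
    if (link.startsWith(RedirectPrefix)) link.substring(RedirectPrefix.length) else link
}
```

With these helpers, GoogleUrls.search("php elephant") yields the same URL as Google.search in the example, while terms containing characters such as + or & are also encoded.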

Spiders

A spider is a scraper that recursively loads a page and opens every link it finds. It keeps scraping until every page within the allowed domains has been visited once.

The following snippet demonstrates a basic spider that crawls a website and provides hooks for processing the data:

new Spider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()
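
The crawl behaviour described above (follow every link, stay within the allowed domains, visit each page once) can be sketched without any network access. The following is an illustration only: CrawlSketch, its in-memory pages map and the breadth-first order are assumptions, and the library's actual traversal strategy may differ.

```scala
import scala.collection.mutable

// Hypothetical sketch; not the library's implementation.
object CrawlSketch {
  // In-memory stand-in for the web: each URL maps to the links found on that page.
  val pages: Map[String, List[String]] = Map(
    "http://a.example/"      -> List("http://a.example/about", "http://b.example/"),
    "http://a.example/about" -> List("http://a.example/"),
    "http://b.example/"      -> List("http://a.example/")
  )

  def host(url: String): String = new java.net.URI(url).getHost

  // Breadth-first crawl: visit every reachable page on an allowed domain exactly once.
  def crawl(startUrls: List[String], allowedDomains: Set[String]): List[String] = {
    val visited = mutable.LinkedHashSet.empty[String]
    val queue   = mutable.Queue(startUrls: _*)
    while (queue.nonEmpty) {
      val url = queue.dequeue()
      if (allowedDomains.contains(host(url)) && !visited.contains(url)) {
        visited += url
        // Queue every link found on this page; duplicates are filtered when dequeued.
        pages.getOrElse(url, Nil).foreach(link => queue.enqueue(link))
      }
    }
    visited.toList
  }
}
```

Calling CrawlSketch.crawl(List("http://a.example/"), Set("a.example")) visits the two pages on a.example and skips b.example, because its host is not in the allowed set.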

The spider can be extended by mixing in traits. To scrape emails, for example, add the EmailSpider trait, which offers a new onEmailFound hook in which emails can be collected.

new Spider with EmailSpider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"

  onEmailFound ::= { email: String =>
    // Email found
  }

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()
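
The hook syntax used above relies on a Scala language feature: for a var holding a List, hook ::= f is rewritten to hook = hook.::(f), which prepends f to the list. A trait can use the same mechanism to register its own callback and expose a new hook. The sketch below shows one way this could work. MiniSpider, MiniEmailSpider, receive and the email pattern are hypothetical, and the real Spider and EmailSpider internals may differ.

```scala
// Hypothetical miniature of the hook pattern; not the library's code.
class MiniSpider {
  // A hook is a mutable list of callbacks; `hook ::= f` prepends f to it.
  var onReceivedPage: List[String => Unit] = Nil

  // Deliver a page body to every registered callback.
  def receive(body: String): Unit = onReceivedPage.foreach(f => f(body))
}

// A trait adds its own hook and feeds it from the page hook.
trait MiniEmailSpider extends MiniSpider {
  var onEmailFound: List[String => Unit] = Nil

  // Deliberately simple pattern (an assumption); real address matching is more involved.
  private val Email = """[\w.+-]+@[\w-]+(\.[\w-]+)+""".r

  // Registered when the trait is initialised: scan each page and notify email callbacks.
  onReceivedPage ::= { body: String =>
    Email.findAllIn(body).foreach(e => onEmailFound.foreach(f => f(e)))
  }
}
```

A spider built as new MiniSpider with MiniEmailSpider { onEmailFound ::= { email: String => println(email) } } would then print every address found in the text passed to receive.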

Multiple spider traits can be mixed in together:

new Spider with EmailSpider with SitemapSpider {
  startUrls ::= "http://events.stanford.edu/"
  allowedDomains ::= "events.stanford.edu"
  sitemapUrls ::= "http://events.stanford.edu/sitemap.xml"

  onEmailFound ::= { email: String =>
    println("Found email: " + email)
  }

  onReceivedPage ::= { page: WebPage =>
    // Page received
  }

  onLinkFound ::= { link: Href =>
    println(s"Found link ${link.url} with name ${link.name}")
  }
}.start()
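
When several traits are mixed in, Scala initialises them from left to right and runs the body of the anonymous class last. If each trait registers callbacks by prepending to a shared list, the most recently registered entry ends up at the head. The sketch below illustrates that ordering with hypothetical stand-ins (HookBase, EmailLike, SitemapLike); it makes no claim about what EmailSpider or SitemapSpider register internally.

```scala
// Hypothetical stand-ins; names and behaviour are assumptions, not the library's code.
class HookBase {
  var log: List[String] = Nil
}

trait EmailLike extends HookBase { log ::= "email" }

trait SitemapLike extends HookBase { log ::= "sitemap" }

object MixinOrder {
  // Initialisation runs HookBase, EmailLike, SitemapLike, then the anonymous body.
  val spider = new HookBase with EmailLike with SitemapLike {
    log ::= "user"
  }
}
```

Here MixinOrder.spider.log is List("user", "sitemap", "email"): each ::= prepends, so the entries appear in reverse order of initialisation.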

Documentation

API
