Faucet is a plugin-driven web scraping library written in PHP.
Add cristiangraz/faucet to your composer.json file:
{
"require": {
"cristiangraz/faucet": "*"
}
}
<?php
use Faucet\Client;
use Faucet\Plugins\Schema;
require_once(__DIR__ . '/../vendor/autoload.php');
$client = new Client;
$response = $client->get('http://www.example.org/');
if (false === $response->isSuccessful()) {
echo 'Request Failed. Response: ' . $response->getStatusCode();
exit;
}
// Scraping code here
<?php
// Using xpath: find() or findOne()
$links = $client->find('//div[@id="wrapper"]/a');
// Using css selectors: select() or selectOne()
$links = $client->select('div#wrapper a');
// The length attribute on Faucet\Dom\Elements always has the number of nodes returned
echo 'Links Found: ' . $links->length . '<br /><br />';
// Links is a Faucet\Dom\Elements object, but it implements the Iterator interface
foreach ($links as $link) {
// $link is a Faucet\Dom\Element object
echo '<a href="' . $link->getAttribute('href') . '">' . $link->getText() . '</a>';
// You can also access attributes as object properties
echo '<a href="' . $link->href . '">' . $link->getText() . '</a>';
}
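To see what the XPath query above matches without installing anything, here is a minimal sketch using PHP's built-in DOM extension. It is plain PHP, not Faucet code, and the sample HTML is made up for illustration; DOMXPath only understands XPath, so the CSS selector form is not shown.

```php
<?php
// Plain-PHP sketch with the DOM extension; the HTML string is an invented sample.
$html = '<div id="wrapper"><a href="/one">One</a><a href="/two">Two</a></div>';

$doc = new DOMDocument();
$doc->loadHTML($html);

$xpath = new DOMXPath($doc);
// Same XPath expression as in the find() example above.
$links = $xpath->query('//div[@id="wrapper"]/a');

echo 'Links Found: ' . $links->length . "\n";

$hrefs = array();
foreach ($links as $link) {
    // $link is a DOMElement here, not a Faucet\Dom\Element
    $hrefs[] = $link->getAttribute('href');
    echo $link->getAttribute('href') . ' => ' . $link->textContent . "\n";
}
```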
<?php
$divs = $client->select('div.items');
foreach ($divs as $div) {
// $div - the Faucet\Dom\Element object - supports all of the find/select methods within the context of itself
$title = $div->findOne('h4')->getText();
// ... more code here
}
Validators let you validate node attributes using Zend\Validator:
<?php
use Zend\Validator\Regex;
// Instantiate client ...
// Faucet\Dom\Elements has a validate() method that takes the attribute and an array of validators
$links = $client->select('div#wrapper a')
->validate('href', array(
new Regex('#^https?://#')
));
// Can also validate text using _text
$links = $client->select('div#wrapper a')
->validate('_text', array(
new Regex('#^https?://#')
));
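The idea behind attribute validation can be sketched in plain PHP. This assumes that validate() keeps only the nodes whose attribute passes every validator, which the text above implies but does not state outright. The filterByPatterns() helper is hypothetical and uses preg_match() in place of Zend\Validator\Regex.

```php
<?php
// Hypothetical helper, not part of Faucet: keep only the values that match every pattern.
function filterByPatterns(array $values, array $patterns)
{
    return array_values(array_filter($values, function ($value) use ($patterns) {
        foreach ($patterns as $pattern) {
            if (preg_match($pattern, $value) !== 1) {
                return false; // one failed pattern rejects the value
            }
        }
        return true;
    }));
}

// Invented sample of href values as they might come off a page.
$hrefs = array(
    'http://example.org/',
    '/relative/path',
    'https://example.org/a',
    'mailto:me@example.org',
);

$absolute = filterByPatterns($hrefs, array('#^https?://#'));
print_r($absolute);
```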
Faucet comes with a plugin architecture that makes scraping faster and easier while keeping the core library simple. The initial plugins are:
HTML
: Convenience plugin for accessing links, page title, canonical URL, etc.
Schema
: Parses schema.org markup
SEO
: Provides information on follow/nofollow links, whether or not a page is indexable, etc.
OpenGraph
: Parses Facebook Open Graph tags
Plugins must expose an alias through a getAlias() method. You can then retrieve a plugin by its alias:
<?php
$opengraph = $client->getPlugin('facebook.opengraph');
echo $opengraph->getTag('og.video.width');
Each plugin has access to the Faucet\Dom\Parser object, so all of the logic for parsing common types of pages or elements can live inside your plugin. To use a plugin, register it with the Faucet Client using registerPlugins():
<?php
$client->registerPlugins(array(
new Schema
));
$client->get('http://example.com/some/recipe.html');
$schema = $client->getPlugin('schema');
print_r($schema->getSchemas());
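The alias mechanism described above amounts to a lookup table keyed by getAlias(). The following is a stand-alone sketch of that pattern and not Faucet's source: AliasedPlugin, TitlePlugin, and PluginRegistry are hypothetical names, and only getAlias(), registerPlugins(), and getPlugin() mirror method names from this README.

```php
<?php
// Hypothetical interface; Faucet's real plugin contract may differ.
interface AliasedPlugin
{
    public function getAlias();
}

// Hypothetical plugin used only to exercise the registry.
class TitlePlugin implements AliasedPlugin
{
    public function getAlias()
    {
        return 'example.title';
    }
}

// Hypothetical registry standing in for the client's plugin handling.
class PluginRegistry
{
    private $plugins = array();

    public function registerPlugins(array $plugins)
    {
        foreach ($plugins as $plugin) {
            // Index each plugin by the alias it reports.
            $this->plugins[$plugin->getAlias()] = $plugin;
        }
    }

    public function getPlugin($alias)
    {
        if (!isset($this->plugins[$alias])) {
            throw new InvalidArgumentException('No plugin registered as "' . $alias . '"');
        }
        return $this->plugins[$alias];
    }
}

$registry = new PluginRegistry();
$registry->registerPlugins(array(new TitlePlugin()));

$plugin = $registry->getPlugin('example.title');
echo get_class($plugin) . "\n";
```

Keying on the alias rather than the class name is what lets a call like getPlugin('schema') stay short and independent of namespaces.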
Sites are plugins that target one specific website rather than a general page feature. Here's an example of scraping Craigslist with the Craigslist site plugin:
<?php
use Faucet\Client;
use Faucet\Plugins\Sites\Craigslist;
$client = new Client;
// Register the plugin
$client->registerPlugins(array(
new Craigslist
));
$client->get('http://phoenix.craigslist.org/cpg/');
$c = $client->getPlugin('craigslist');
$posts = $c->getPosts();
// Or for only yesterday's posts
$posts = $c->getPostings(new \DateTime('yesterday'));
foreach ($posts as $post) {
$client->get($post['url']);
// This works because the plugin always has access to the Faucet\Dom\Parser object
$details = $c->getPost();
print_r($details);
}
Sometimes you need to grab section headings along with the elements under them, and associate each element with the heading it follows. This is how the Craigslist site scraper works. Here's an example:
<?php
// Creates this Xpath: //h4[@class="ban"] | //p[@class="row"]
// Will select BOTH h4.ban nodes and p.row nodes
$elements = $this->select(array('h4.ban', 'p.row'));
$results = array();
// Rows that appear before the first heading are grouped under an empty key
$key = '';
foreach ($elements as $element) {
if ($element->getTag() === 'h4') {
// It's a heading.
$key = $element->getText();
continue;
}
// Key now groups your elements by the h4 header
$results[$key][] = array(
'text' => $element->getText(),
'href' => $client->getAbsoluteUrl($element->href)
);
}
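The grouping technique relies on the union query returning headings and rows interleaved in document order. Here is a minimal plain-PHP sketch of the same idea using DOMXPath instead of Faucet; the sample HTML is invented, and the class names simply reuse those from the example above.

```php
<?php
// Invented sample: two headings, each followed by one or more rows.
$html = '<h4 class="ban">Mon</h4>'
      . '<p class="row"><a href="/a">A</a></p>'
      . '<p class="row"><a href="/b">B</a></p>'
      . '<h4 class="ban">Tue</h4>'
      . '<p class="row"><a href="/c">C</a></p>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Union of both node types; the result set is in document order.
$nodes = $xpath->query('//h4[@class="ban"] | //p[@class="row"]');

$results = array();
$key = null;
foreach ($nodes as $node) {
    if ($node->nodeName === 'h4') {
        // A heading starts a new group.
        $key = trim($node->textContent);
        continue;
    }
    if ($key === null) {
        continue; // skip rows that appear before any heading
    }
    $results[$key][] = trim($node->textContent);
}

print_r($results);
```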
If you want to filter your data as you grab it, you can use Zend\Filter:
<?php
$elements = $this->select('p.title');
foreach ($elements as $element) {
echo $element->getAttribute('_text', array(
new Zend\Filter\StringToLower,
new Zend\Filter\StripNewlines
));
}
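A filter chain like the one above applies each filter to the output of the previous one. The sketch below shows that idea in plain PHP; applyFilters() is a hypothetical helper, and the two callables are rough stand-ins for StringToLower and StripNewlines rather than the Zend classes themselves.

```php
<?php
// Hypothetical helper: run the value through each callable in order.
function applyFilters($value, array $filters)
{
    foreach ($filters as $filter) {
        $value = call_user_func($filter, $value);
    }
    return $value;
}

$clean = applyFilters("Fresh\nListing Title", array(
    'strtolower', // lower-case the text
    function ($text) {
        // Drop carriage returns and line feeds.
        return str_replace(array("\r", "\n"), '', $text);
    },
));

echo $clean . "\n";
```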
If you are scraping a site and want its HTML, but first need to strip out certain elements (all meta tags, for example), remove them from the DOM before reading the contents:
<?php
$client->get('http://www.example.com');
// find() returns Faucet\Dom\Elements object, with access to the parser
// remove() removes the element(s) from the parser object
$client->find('/html/head/meta')->remove();
// Grabs the DOM after any changes
echo $client->getContents();
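For comparison, the same remove-then-serialize flow can be sketched with PHP's DOM extension alone. This is not how Faucet is necessarily implemented; it only illustrates the effect of removing matched nodes before dumping the document. The sample HTML is invented.

```php
<?php
// Invented sample page with two meta tags in the head.
$html = '<html><head>'
      . '<meta name="robots" content="noindex">'
      . '<meta name="description" content="Demo">'
      . '<title>Example</title>'
      . '</head><body><p>Hello</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

// Copy the matches into an array, then detach each one from its parent.
$metas = iterator_to_array($xpath->query('/html/head/meta'));
$removed = 0;
foreach ($metas as $meta) {
    $meta->parentNode->removeChild($meta);
    $removed++;
}

// Serialize the DOM after the changes.
$output = $doc->saveHTML();
echo $output;
```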