PuppetScraper is an opinionated wrapper library for utilizing Puppeteer to scrape pages easily, bootstrapped using Jared Palmer's tsdx.
Most people start a new scraping project by `require`-ing Puppeteer and writing their own logic to scrape pages, and that logic gets more complicated when trying to use multiple pages.
PuppetScraper lets you just pass the URLs to scrape, the function to evaluate (the scraping logic), and how many pages (or tabs) to open at a time. In short, PuppetScraper abstracts away creating multiple page instances and retrying the evaluation logic.
Version 0.1.0 note: PuppetScraper was initially made as a project template rather than a wrapper library, but the core logic is still the same, with various improvements and without extra libraries, so you can include PuppetScraper in your project easily using npm or yarn.
Here's a basic example of scraping the entries on the first page of Hacker News:
// examples/hn.js
const { PuppetScraper } = require('puppet-scraper');

(async () => {
  const ps = await PuppetScraper.launch();

  const data = await ps.scrapeFromUrl({
    url: 'https://news.ycombinator.com',
    evaluateFn: () => {
      const items = [];
      document.querySelectorAll('.storylink').forEach((node) => {
        items.push({
          title: node.innerText,
          url: node.href,
        });
      });
      return items;
    },
  });

  console.log({ data });
  await ps.close();
})();
View more examples in the examples directory.
Install `puppet-scraper` via npm or yarn:
$ npm install puppet-scraper
--- or ---
$ yarn add puppet-scraper
Install the peer dependency `puppeteer` (or a Puppeteer equivalent such as `chrome-aws-lambda`, untested):
$ npm install puppeteer
--- or ---
$ yarn add puppeteer
Create the PuppetScraper instance, either by launching a new browser instance, connecting to an existing browser instance, or using an existing Puppeteer browser object:
const { PuppetScraper } = require('puppet-scraper');
const Puppeteer = require('puppeteer');
// launches a new browser instance
const instance = await PuppetScraper.launch();
// connect to an existing browser instance
const external = await PuppetScraper.connect({
  browserWSEndpoint: 'ws://127.0.0.1:9222/devtools/browser/...',
});
// use an existing browser instance
const browser = await Puppeteer.launch();
const existing = await PuppetScraper.use({ browser });
`launch` and `connect` accept the same options as `Puppeteer.launch` and `Puppeteer.connect`, but with two extra properties, `concurrentPages` and `maxEvaluationRetries`:
const { PuppetScraper } = require('puppet-scraper');

const instance = await PuppetScraper.launch({
  concurrentPages: 3,
  maxEvaluationRetries: 10,
  headless: false,
});
`concurrentPages` sets how many pages/tabs are opened and used for scraping. `maxEvaluationRetries` sets how many times a page will try to evaluate the function given in `evaluateFn` (see below); if an evaluation throws an error, the page reloads and tries to evaluate again.
If `concurrentPages` and `maxEvaluationRetries` are not specified, the default values are used:
export const DEFAULT_CONCURRENT_PAGES = 3;
export const DEFAULT_EVALUATION_RETRIES = 10;
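The reload-and-retry behavior described above can be sketched as follows. This is a simplified illustration of the idea, not PuppetScraper's actual source; `evaluateWithRetries`, `page`, and `maxRetries` are hypothetical names:

```javascript
// Simplified sketch: evaluate the function on the page, and if it throws,
// reload the page and try again, up to maxRetries times.
async function evaluateWithRetries(page, evaluateFn, maxRetries) {
  let lastError;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await page.evaluate(evaluateFn);
    } catch (error) {
      lastError = error;
      await page.reload(); // re-evaluate against a freshly loaded page
    }
  }
  throw lastError; // give up after maxRetries failed attempts
}
```

With the defaults above, each page would attempt the evaluation up to 10 times before failing.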
As shown in the example above, use `.scrapeFromUrl` and pass an object with the following properties:

- `url: string`, page URL to be opened
- `evaluateFn: function`, function to evaluate (scraper method)
- `pageOptions: object`, `Puppeteer.DirectNavigationOptions` props to override page behaviors
const data = await instance.scrapeFromUrl({
  url: 'https://news.ycombinator.com',
  evaluateFn: () => {
    const items = [];
    document.querySelectorAll('.storylink').forEach((node) => {
      items.push({
        title: node.innerText,
        url: node.href,
      });
    });
    return items;
  },
});
`pageOptions` defaults the `waitUntil` property to `networkidle0`; you can read more about the available options in the Puppeteer API documentation.
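One way to picture the override: properties you pass in `pageOptions` win over that default. The helper below is a hypothetical sketch of the merge, not the library's code:

```javascript
// Hypothetical sketch: merge caller-supplied pageOptions over the
// documented default (waitUntil: 'networkidle0').
const DEFAULT_PAGE_OPTIONS = { waitUntil: 'networkidle0' };

function resolvePageOptions(pageOptions = {}) {
  // Spread order means caller properties replace the defaults.
  return { ...DEFAULT_PAGE_OPTIONS, ...pageOptions };
}
```

So passing `{ waitUntil: 'domcontentloaded' }` replaces the default, while unrelated navigation options such as `timeout` are simply added alongside it.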
Same as `.scrapeFromUrl`, but takes a `urls` property containing an array of URL strings:

- `urls: string[]`, page URLs to be opened
- `evaluateFn: function`, function to evaluate (scraper method)
- `pageOptions: object`, `Puppeteer.DirectNavigationOptions` props to override page behaviors
const urls = Array.from({ length: 5 }).map(
  (_, i) => `https://news.ycombinator.com/news?p=${i + 1}`,
);

const data = await ps.scrapeFromUrls({
  urls,
  evaluateFn: () => {
    const items = [];
    document.querySelectorAll('.storylink').forEach((node) => {
      items.push({
        title: node.innerText,
        url: node.href,
      });
    });
    return items;
  },
});
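Conceptually, scraping multiple URLs works like a fixed-size worker pool: `concurrentPages` workers each pull the next unclaimed URL until the queue is empty. The self-contained sketch below illustrates that idea with hypothetical names (`runConcurrent`, `scrapeOne`); it is not PuppetScraper's implementation:

```javascript
// Sketch of a fixed-size worker pool: `concurrency` workers each claim the
// next unprocessed URL and store its result at the matching index.
async function runConcurrent(urls, concurrency, scrapeOne) {
  const results = new Array(urls.length);
  let next = 0; // index of the next URL to claim
  const workers = Array.from({ length: concurrency }, async () => {
    while (next < urls.length) {
      const i = next++; // claiming is safe: JS runs this synchronously
      results[i] = await scrapeOne(urls[i]);
    }
  });
  await Promise.all(workers);
  return results;
}
```

With `concurrency` set to 3, at most three URLs are being scraped at any moment, matching the role `concurrentPages` plays for pages/tabs.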
When there's nothing left to do, don't forget to close the instance, which also closes the browser:
await instance.close();
PuppetScraper also exposes the browser instance if you want to do things manually:
const browser = instance.___internal.browser;
Thanks goes to these wonderful people (emoji key):
Griko Nibras 💻 🚧
This project follows the all-contributors specification. Contributions of any kind welcome!