This repository has been archived by the owner on Jan 13, 2023. It is now read-only.

Commit: Update readme
Griko Nibras committed Oct 8, 2019
1 parent 68eb21a commit d56cfde
Showing 1 changed file with 30 additions and 9 deletions.
39 changes: 30 additions & 9 deletions README.md
<!-- markdownlint-disable MD033 MD041 MD036 -->

<div align='center'>

![scrappeteer](scrappeteer.svg)

Puppeteer project template made for web scraping

> _This is a work-in-progress but usable template. <br /> Future versions of scrappeteer will be published as an npm package. Probably._
<br />

</div>

---

- [What's this](#whats-this)
- [Why does this exist](#why-does-this-exist)
- [How to use it](#how-to-use-it)
- [Why not use a proper crawler](#why-not-use-a-proper-crawler)
- [How do I contribute](#how-do-i-contribute)
- [License](#license)

---

## What's this

- [GitHub project template](https://github.com/grikomsn/scrappeteer/generate) for your next scraping project
- [Puppeteer-based](https://github.com/GoogleChrome/puppeteer) scraper, crawler, or whatever you call it
- Uses multiple pages or 'workers' to scrape in bulk
- Saves results to `.csv` using [papaparse](https://www.papaparse.com/)
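The "multiple pages or 'workers'" point above can be sketched as a round-robin split of the URL list across a fixed number of Puppeteer pages. This is an illustration only; `chunkUrls` and the worker count are hypothetical names, not part of the template's actual source.

```typescript
// Hypothetical helper: split a URL list across N Puppeteer pages ("workers").
// Round-robin keeps each worker's share within one URL of the others.
function chunkUrls(urls: string[], workers: number): string[][] {
  const buckets: string[][] = Array.from({ length: workers }, () => []);
  urls.forEach((url, i) => buckets[i % workers].push(url));
  return buckets;
}

// Example: 5 URLs over 2 workers → [["u0","u2","u4"], ["u1","u3"]]
const shares = chunkUrls(["u0", "u1", "u2", "u3", "u4"], 2);
console.log(shares);
```

Each bucket would then be handed to its own `browser.newPage()` loop, so slow pages on one worker don't block the others.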

## Why does this exist

- Browser-based scraping, because some websites are just plain picky
- Dead simple scraper, because too many features are too much
- Learning project, because I like making things from scratch

## How to use it

- Modify [`src/config.ts`](src/config.ts) with your own values and additional Puppeteer configuration
- Modify [`src/urls.ts`](src/urls.ts) with your own URL collection
- Modify [`src/extractor.ts`](src/extractor.ts) with your own page evaluation to extract elements
- Or just modify [`src/app.ts`](src/app.ts) with your own needs
- Run `ts-node src/app.ts` or `yarn start` to start scraping
- Results will be saved to `data/results.csv`
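As a sketch of what the page evaluation in `src/extractor.ts` might look like: the callback handed to Puppeteer's `page.evaluate` runs inside the browser, so it should only touch DOM APIs. The selector, the `Row` shape, and the `extractLinks` name below are assumptions for illustration, not the template's actual code; typing the function against a minimal interface lets it be unit-tested outside a browser too.

```typescript
// Minimal DOM surface the extractor needs, so it can run outside a browser.
interface AnchorLike {
  textContent: string | null;
  getAttribute(name: string): string | null;
}
interface RootLike {
  querySelectorAll(selector: string): ArrayLike<AnchorLike>;
}

type Row = { title: string; url: string };

// Hypothetical extractor: collect link text and href for a given selector.
// In Puppeteer, this logic would live inside a page.evaluate callback,
// with the in-page `document` acting as the root.
function extractLinks(root: RootLike, selector = "a"): Row[] {
  return Array.from(root.querySelectorAll(selector)).map((a) => ({
    title: (a.textContent ?? "").trim(),
    url: a.getAttribute("href") ?? "",
  }));
}
```

The returned array of plain objects is exactly the shape papaparse can serialize straight to CSV rows.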

The published template contains an extractor for scraping the first 100 pages of [Hacker News](https://news.ycombinator.com/).
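For the Hacker News example, the URL collection in `src/urls.ts` could be generated rather than listed by hand, since Hacker News paginates via the `p` query parameter. The helper name below is an assumption, not the template's actual code.

```typescript
// Hypothetical generator for the first N Hacker News listing pages.
// Page 1 is https://news.ycombinator.com/news?p=1, page 2 is ?p=2, and so on.
function hackerNewsUrls(pages: number): string[] {
  return Array.from(
    { length: pages },
    (_, i) => `https://news.ycombinator.com/news?p=${i + 1}`
  );
}

const urls = hackerNewsUrls(100);
console.log(urls[0]); // → https://news.ycombinator.com/news?p=1
console.log(urls.length); // → 100
```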

## Why not use [a proper crawler](https://github.com/yujiosaka/headless-chrome-crawler)

Because I am too lazy to Google existing tools, yet too eager to build my own.

## How do I contribute

- The original plan was to publish scrappeteer as an npm package, so help with that would be nice
- Any kind of improvements are welcome, do send a pull request if you'd like

## License

MIT
