Crawler

Scenario-based crawler written in Node.js

About

Crawler is a standalone Node.js application built on top of Express.js, Crawlee, Puppeteer and BullMQ. It lets you crawl data from web pages by defining scenarios, and everything is controlled through a REST API.

Development setup

Prerequisites

Docker Compose
Make

Installation

$ git clone https://github.com/68publishers/crawler.git crawler
$ cd crawler
$ make init

Creating a user

HTTP Basic authentication is required for API access and administration, so create a user before using the application:

$ docker exec -it crawler-app npm run user:create

Production setup

Prerequisites

Docker
Postgres >=14.6
Redis >=7

For production use, Redis must be configured as follows:

Enable persistence with the append-only file (AOF) strategy - https://redis.io/docs/management/persistence/#aof-advantages
Set the max memory policy to noeviction - https://redis.io/docs/reference/eviction/#eviction-policies
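
The two settings above can be sketched as a `redis.conf` fragment. This is a minimal example, not the project's official configuration: the file name `redis.conf` and its location in the current directory are assumptions, and the directives should be merged into whatever configuration your Redis instance already uses.

```shell
# Write a minimal redis.conf fragment with the two required settings.
# The file name and location are assumptions for illustration only.
cat > redis.conf <<'EOF'
# Persist writes to an append-only file (AOF)
appendonly yes
# Never evict keys when memory is full; writes fail instead of queue data being dropped
maxmemory-policy noeviction
EOF

# Print the active (non-comment) directives
grep -v '^#' redis.conf
```

The file can then be passed to the server as `redis-server ./redis.conf`. On an instance that is already running, the same settings can be applied with `redis-cli CONFIG SET appendonly yes` and `redis-cli CONFIG SET maxmemory-policy noeviction`.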

Installation

First, run the database migrations with the following command:

$ docker run \
    --network <NETWORK> \
    -e DB_URL=postgres://<USER>:<PASSWORD>@<HOSTNAME>:<PORT>/<DB_NAME> \
    --entrypoint '/bin/sh' \
    -it \
    --rm \
    68publishers/crawler:latest \
    -c 'npm run migrations:up'

Then download the seccomp profile, which is required to run Chrome:

$ curl -C - -O https://raw.githubusercontent.com/68publishers/crawler/main/.docker/chrome/chrome.json

And run the application:

$ docker run \
    --init \
    --network <NETWORK> \
    -e APP_URL=<APPLICATION_URL> \
    -e DB_URL=postgres://<USER>:<PASSWORD>@<HOSTNAME>:<PORT>/<DB_NAME> \
    -e REDIS_HOST=<HOSTNAME> \
    -e REDIS_PORT=<PORT> \
    -e REDIS_AUTH=<PASSWORD> \
    -p 3000:3000 \
    --security-opt seccomp=$(pwd)/chrome.json \
    -d \
    --name 68publishers_crawler \
    68publishers/crawler:latest

Creating a user

HTTP Basic authentication is required for API access and administration, so create a user before using the application:

$ docker exec -it 68publishers_crawler npm run user:create

Environment variables

Name	Required	Default	Description
APP_URL	yes	-	Full origin of the application, e.g. `https://www.example.com`. Used to create links to screenshots etc.
APP_PORT	no	`3000`	Port on which the application listens
DB_URL	yes	-	Connection string for the Postgres database, e.g. `postgres://root:root@localhost:5432/crawler`
REDIS_HOST	yes	-	Redis hostname
REDIS_PORT	yes	-	Redis port
REDIS_AUTH	no	-	Optional Redis password
REDIS_DB	no	`0`	Redis database number
WORKER_PROCESSES	no	`5`	Number of workers that process the queue of running scenarios
CRAWLEE_STORAGE_DIR	no	`./var/crawlee`	Directory where the crawler stores runtime data
CHROME_PATH	no	`/usr/bin/chromium-browser`	Path to the Chromium executable
SENTRY_DSN	no	-	Logging to Sentry is enabled if this variable is set
SENTRY_SERVER_NAME	no	`crawler`	Server name passed to the Sentry logger
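
As an alternative to repeating `-e` flags, the variables can be collected in a file and passed to Docker with `--env-file`. The sketch below is illustrative only: the file name `crawler.env`, the hostnames `postgres` and `redis`, the port `6379` and the credentials are all placeholder assumptions to be replaced with your own values.

```shell
# Write the required variables (plus one optional one) to an env file.
# All values are placeholders; "postgres" and "redis" are hypothetical hostnames.
cat > crawler.env <<'EOF'
APP_URL=https://www.example.com
DB_URL=postgres://root:root@postgres:5432/crawler
REDIS_HOST=redis
REDIS_PORT=6379
WORKER_PROCESSES=5
EOF

# List the variable names that were written
cut -d= -f1 crawler.env
```

The file can then replace the individual `-e` options in the `docker run` command, for example `docker run --env-file crawler.env ... 68publishers/crawler:latest`. Variables that are omitted fall back to the defaults listed in the table.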

REST API and Queues board

The REST API specification (Swagger UI) is available at the /api-docs endpoint, which is usually http://localhost:3000/api-docs in a development setup. You can try out all endpoints there.

Alternatively, the specification can be viewed online.

BullBoard is located at /admin/queues. Here you can see all the scenarios that are currently running or have already run.
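
Both locations can also be requested from the command line. The following is a minimal sketch that assumes the application runs at http://localhost:3000 and that the paths accept the HTTP Basic credentials of the user created earlier. The variable names, the helper functions `crawler_url` and `crawler_get`, and the default credentials `admin` / `secret` are hypothetical and not part of the project.

```shell
# Connection settings; the defaults are placeholders, not real credentials.
CRAWLER_URL="${CRAWLER_URL:-http://localhost:3000}"
CRAWLER_USER="${CRAWLER_USER:-admin}"
CRAWLER_PASSWORD="${CRAWLER_PASSWORD:-secret}"

# Build an absolute URL from a path, avoiding a doubled slash.
crawler_url() {
    printf '%s/%s\n' "${CRAWLER_URL%/}" "${1#/}"
}

# Send a GET request with HTTP Basic authentication (hypothetical helper).
crawler_get() {
    curl --fail --silent --show-error \
        --user "${CRAWLER_USER}:${CRAWLER_PASSWORD}" \
        "$(crawler_url "$1")"
}

# Usage (requires a running instance):
#   crawler_get /api-docs
#   crawler_get /admin/queues
```

With `--fail`, curl exits with a non-zero status on HTTP errors such as 401, which makes wrong credentials easy to spot in scripts.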

Working with scenarios

@todo

Working with scenario schedulers

@todo

Tutorial: Creating the first scenario

@todo

Integrations

PHP Client for Crawler's API - 68publishers/crawler-client-php

License

The package is distributed under the MIT License. See LICENSE for more information.
