This is just a bunch of notes I take in order not to forget things in the future. :)
- Building
# with docker
sudo docker build -t website .
# with compose
sudo docker-compose build
Or, if behind a proxy:
# with docker
sudo docker build -t website --build-arg PROXY="http://user:password@proxyserver:port" .
# with compose
sudo docker-compose build --build-arg PROXY="http://user:password@proxyserver:port"
- Running
sudo docker-compose up
- Debugging
sudo docker exec -it <container> /bin/bash
Remember that installing Scrapy on Ubuntu requires a few other packages:
sudo apt-get install python-dev python-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
And if running Python3:
sudo apt-get install python3 python3-dev
To get data from a certain search text in Ali, use:
scrapy crawl search -a searchtext=<search text>
Example:
scrapy crawl search -a searchtext=mp3
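For reference, a minimal sketch of how a spider named search could receive that searchtext argument. This is an assumption, not the project's actual spider; the search URL and extracted field are placeholders for illustration only.
import scrapy


class SearchSpider(scrapy.Spider):
    # Hypothetical sketch; the real "search" spider in this project may differ.
    name = "search"

    def __init__(self, searchtext=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Arguments passed with -a arrive as constructor keyword arguments.
        # The URL pattern below is assumed for illustration.
        self.start_urls = [
            f"https://www.aliexpress.com/wholesale?SearchText={searchtext}"
        ]

    def parse(self, response):
        # Placeholder extraction; the real XPaths depend on the target page.
        for title in response.xpath("//a/@title").extract():
            yield {"title": title}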
Scrapy's shell is really useful for testing the XPaths you want to extract from a website.
To enter a shell:
scrapy shell
Then, to fetch a webpage:
fetch("<url>")
From there, we can use:
- The view function, to see the fetched page in a web browser: view(response)
- The response object to run XPath tests
- Examples:
response.xpath('//title/text()').extract_first()
response.xpath('//title/text()').extract()
from pprint import pprint
pprint(response.headers)
- Scrapy command examples:
- Create a project
scrapy startproject <project name>
- Running spiders in the terminal
scrapy crawl <spider name>
- Scrapy shell
scrapy shell
- Storing a spider run's data in a file
scrapy crawl <spider name> -o <file name>.json
- Using arguments when invoking spiders from the terminal
scrapy crawl <spider name> -a <tag>=<value>
- Example:
scrapy crawl search -a searchtext=mp3
I will manage the DBs through Django.
To run the crawlers from my dev environment, I need to override the DB_HOST parameter to point it to the localhost where I am running Postgres. Like this:
scrapy crawl -s DB_HOST=localhost search -a searchtext=watches
When running in the containerized environment, on the other hand, the DB runs in a different container, so we need to plug our Scrapy container into that container's network. We can use the following command:
sudo docker run -it --network <network name> collector bash
E.g.:
sudo docker run -it --network website_website_db_network collector bash
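For context, a sketch of how an item pipeline could consume the DB_HOST setting that -s overrides above, assuming psycopg2 is used to talk to Postgres. The pipeline name, the other setting names, and the defaults are assumptions, not the project's actual code.
import psycopg2


class PostgresPipeline:
    # Hypothetical pipeline: only DB_HOST is taken from the notes above;
    # the remaining setting names and defaults are assumed.
    def open_spider(self, spider):
        settings = spider.crawler.settings
        self.connection = psycopg2.connect(
            host=settings.get("DB_HOST", "db"),  # "db" assumed to be the compose service name
            dbname=settings.get("DB_NAME", "website"),
            user=settings.get("DB_USER", "postgres"),
            password=settings.get("DB_PASSWORD", ""),
        )

    def close_spider(self, spider):
        self.connection.close()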
Scrapyd is a server that runs spiders in response to HTTP requests sent to it. The easiest way to add projects and spiders to the server is to use scrapyd-client.
- To start the service, run:
scrapyd
- And to schedule a spider run, use the following endpoint:
curl <server address>/schedule.json -d project=<project name> -d spider=<spider name>
- Example:
curl http://localhost:6800/schedule.json -d project=myproject -d spider=spider2
- To add arguments, just pass them with additional -d flags:
curl <server address>/schedule.json -d project=<project name> -d spider=<spider name> -d <argument name>=<argument value>
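The same scheduling call can also be made from Python. A small sketch using requests, reusing the placeholder project/spider names from the example above; extra form fields become spider arguments:
import requests

# Equivalent to the curl call above.
response = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "spider2", "searchtext": "mp3"},
)
print(response.json())  # e.g. {"status": "ok", "jobid": "..."}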
Deploying a project to the Scrapyd server involves two steps:
- Eggifying your project. You'll need to install setuptools for this. See Egg Caveats in the scrapyd-client docs.
- Uploading the egg to the Scrapyd server through the addversion.json endpoint.
- Steps:
- First, cd to the project root and deploy the project with the following command:
scrapyd-deploy <target> -p <project>
- P.S.: To avoid having to type the target every time, defaults can be saved in the scrapy.cfg file. Example:
[deploy]
url = http://scrapyd.example.com/api/scrapyd
username = scrapy
password = secret
project = yourproject
One idea is to set up a cron job that periodically runs a custom manage.py command, as mentioned here.
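A sketch of what such a custom command could look like. The file path, command name, and crawler working directory are assumptions, not the project's actual code; it simply shells out to scrapy crawl.
# <app>/management/commands/run_crawlers.py (hypothetical location)
import subprocess

from django.core.management.base import BaseCommand


class Command(BaseCommand):
    help = "Run the search spider for a given search text"

    def add_arguments(self, parser):
        parser.add_argument("searchtext", type=str)

    def handle(self, *args, **options):
        # Shelling out keeps the command simple; the crawler directory is assumed.
        subprocess.run(
            ["scrapy", "crawl", "search", "-a", f"searchtext={options['searchtext']}"],
            cwd="collector/ali",
            check=True,
        )
A crontab entry could then call it periodically, e.g.: 0 */6 * * * cd /path/to/project && python3 manage.py run_crawlers watches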
To add verbosity, use -v.
From collector/ali run: python3 -m unittest tests.ali.utils.reader
- python -m unittest test_module1 test_module2
- python -m unittest test_module.TestClass
- python -m unittest test_module.TestClass.test_method
To let unittest search the project for tests, just run python3 -m unittest discover.
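For reference, a minimal sketch of a test module laid out the way discover expects by default. The file name and the tested behavior are placeholders, not the project's actual tests.
# tests/test_reader.py (hypothetical file; discover matches test*.py by default, use -p to change the pattern)
import unittest


class ReaderTestCase(unittest.TestCase):
    def test_uppercases_search_text(self):
        self.assertEqual("mp3".upper(), "MP3")


if __name__ == "__main__":
    unittest.main()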