What is this?

You are not a real programmer until you have written a crawler/scraper for your favorite website at least once in your life. This is my second one for Filmaffinity.

I wrote a scraper for it some years ago in PHP, and it kind of worked, but I wanted to try the power of Node.js.

This crawler takes advantage of Node.js's asynchronous model to send many requests concurrently, which cuts the time needed to crawl the whole website by orders of magnitude compared with fetching pages one at a time.
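As a rough illustration of the idea (not the code this crawler actually uses), the sketch below runs a list of jobs with a fixed number of requests in flight at once. The names `mapWithConcurrency` and `fakeFetch` are made up for this example, and `fakeFetch` only simulates a request with a timer.

```javascript
// Run `worker` over every item with at most `limit` calls in flight at once.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner keeps claiming the next unprocessed index until none are left.
  async function runner() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i++) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}

// Hypothetical stand-in for an HTTP request: resolves after a short delay.
function fakeFetch(id) {
  return new Promise((resolve) => setTimeout(() => resolve(`film ${id}`), 10));
}

// Fetch five "films", two at a time, and print them in input order.
mapWithConcurrency([1, 2, 3, 4, 5], 2, fakeFetch).then((films) => console.log(films));
```

Capping the number of simultaneous requests keeps the speed-up while avoiding a burst of thousands of open connections.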

What do you need

Filmaffinity.com will block your IP if you make too many requests, so if you want to get all their film information (more than 170 000 films at this moment), a workaround is needed.

At first I thought about using several public proxies, but I didn't want to look after them (checking whether they are up or not, etc.), so I'm taking advantage of Tor and Polipo instead.

You will need to have them installed and running, either as a service or standalone.
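To give an idea of what routing a request through a local proxy looks like, here is a minimal sketch using Node's built-in `http` module. It assumes Polipo is listening on its default address, 127.0.0.1:8123, and is already configured to forward to Tor; the helper names `viaProxy` and `fetchViaProxy` are hypothetical and not taken from this repository. The sketch only covers plain `http://` targets; an `https://` target would need a CONNECT tunnel, which is left out here.

```javascript
const http = require('http');

// Assumption: Polipo runs locally on its default port.
const POLIPO = { host: '127.0.0.1', port: 8123 };

// Build http.request() options that send `targetUrl` through an HTTP proxy.
// A plain-HTTP proxy expects the absolute URL as the request path.
function viaProxy(targetUrl, proxy = POLIPO) {
  const target = new URL(targetUrl);
  return {
    host: proxy.host,
    port: proxy.port,
    method: 'GET',
    path: target.href,
    headers: { Host: target.host },
  };
}

// Fetch a page through the proxy and resolve with its body as a string.
function fetchViaProxy(targetUrl, proxy = POLIPO) {
  return new Promise((resolve, reject) => {
    const req = http.request(viaProxy(targetUrl, proxy), (res) => {
      const chunks = [];
      res.on('data', (chunk) => chunks.push(chunk));
      res.on('end', () => resolve(Buffer.concat(chunks).toString('utf8')));
    });
    req.on('error', reject);
    req.end();
  });
}
```

With Tor and Polipo running, `fetchViaProxy('http://example.com/').then(console.log)` would print the page as seen from a Tor exit node rather than from your own IP.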

If you want to store the scraped data, you will also need access to a MySQL server.

Finally, to run this piece of code you have to use Node.js.

Steps to run

Create a new database:
```
CREATE DATABASE filmaffinity;
```

Add a new user for that database and grant permissions:

```
CREATE USER 'filmaffinity'@'localhost' IDENTIFIED BY 'filmaffinity';
GRANT ALL PRIVILEGES ON filmaffinity.* TO 'filmaffinity'@'localhost';
```

Import database structure:

```
mysql -ufilmaffinity -pfilmaffinity filmaffinity < sql/db_structure.sql
```

Install node modules:
```
npm install
```

Run Tor and Polipo (or make sure that they are running as services).

Double-check that the database name, user and password you added in the previous steps match the ones in config/parameters.ini.
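If you would rather check this from a script, the sketch below parses INI-style text and lists the settings that differ from what you expect. The `[database]` section and the `name`, `user` and `password` keys are hypothetical; look at config/parameters.ini for the real layout before relying on this.

```javascript
// Parse minimal INI text: [section] headers, "key = value" lines,
// and comment lines starting with ";" or "#".
function parseIni(text) {
  const result = {};
  let section = result;
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === '' || line.startsWith(';') || line.startsWith('#')) continue;
    const header = line.match(/^\[(.+)\]$/);
    if (header) {
      section = result[header[1]] = {};
      continue;
    }
    const eq = line.indexOf('=');
    if (eq === -1) continue; // not a key/value line, skip it
    const key = line.slice(0, eq).trim();
    // Strip one pair of surrounding double quotes, if present.
    section[key] = line.slice(eq + 1).trim().replace(/^"(.*)"$/, '$1');
  }
  return result;
}

// List the keys whose value in `actual` differs from `expected`.
function findMismatches(actual, expected) {
  return Object.keys(expected).filter((key) => actual[key] !== expected[key]);
}

// Hypothetical file contents; the real file may use other sections and keys.
const sample = [
  '; database settings',
  '[database]',
  'name = filmaffinity',
  'user = filmaffinity',
  'password = filmaffinity',
].join('\n');

const expected = { name: 'filmaffinity', user: 'filmaffinity', password: 'filmaffinity' };
console.log(findMismatches(parseIni(sample).database, expected));
```

An empty list means every expected setting matches.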

Run it!

```
node crawl.js action
```

You must specify one valid action:

- all: Crawls all the movies
- new: Crawls recently added movies
- popular: Crawls the most popular movies from last week
- theatres: Crawls films currently in theatres
- failed: Crawls films that previously failed to be crawled
- user_friends: Crawls the friends of a user id (Filmaffinity id)
- user_friends_ratings: Crawls the latest ratings from a user's friends
- user_friends_films: Crawls the latest films rated by friends (so those films are up to date)
- id: Crawls a specific film by id and outputs the film info (option used mostly for debugging purposes)
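For illustration, a command-line entry point could validate the action along these lines. This is only a sketch and not the actual contents of crawl.js: the action names come from the list above, while `parseAction` and the handling of extra arguments (such as a film or user id) are assumptions.

```javascript
// Action names taken from the list above.
const VALID_ACTIONS = [
  'all',
  'new',
  'popular',
  'theatres',
  'failed',
  'user_friends',
  'user_friends_ratings',
  'user_friends_films',
  'id',
];

// Pull the action and any extra arguments out of an argv-style array
// (the first two entries are the node binary and the script path).
function parseAction(argv) {
  const [action, ...args] = argv.slice(2);
  if (!VALID_ACTIONS.includes(action)) {
    throw new Error(
      `Unknown action "${action}". Valid actions: ${VALID_ACTIONS.join(', ')}`
    );
  }
  return { action, args };
}
```

A script would typically call `parseAction(process.argv)` at startup and exit with the error message if it throws.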

If you run it with the "all" action, it will start crawling the most popular films. All data will be stored in the database and the poster images will be downloaded to the "img" folder.

You can take a look at crawler.log to see what is happening behind the scenes.

Worker

There's one worker (worker.js) that consumes jobs from RabbitMQ. Those messages are published through the Filmaffin API.

At the moment the worker handles the UserAddedEvent and UserUpdatedEvent events from the queue. It fetches and imports the user's friends from Filmaffinity, imports the latest films rated by them, and sends a notification through Firebase when everything is done, so the user is notified in the app.
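The flow described above can be pictured with the following sketch. It leaves out RabbitMQ and Firebase entirely and only shows the dispatch logic. The two event names come from the text; the message shape (`event`, `payload.userId`), the `steps` object and the function names are assumptions made for this example, not worker.js itself.

```javascript
// Build a dispatcher from a Map of event name -> async handler.
function createDispatcher(handlers) {
  return async function dispatch(rawMessage) {
    const message = JSON.parse(rawMessage);
    const handler = handlers.get(message.event);
    if (!handler) return false; // event type this worker does not handle
    await handler(message.payload);
    return true;
  };
}

// Placeholder steps; real ones would talk to Filmaffinity, MySQL and Firebase.
const steps = {
  importFriends: async (userId) => console.log(`import friends of user ${userId}`),
  importFriendsRatings: async (userId) => console.log(`import ratings from friends of user ${userId}`),
  notifyUser: async (userId) => console.log(`notify user ${userId}`),
};

// Run the three steps in order for one user.
async function syncUser(payload) {
  await steps.importFriends(payload.userId);
  await steps.importFriendsRatings(payload.userId);
  await steps.notifyUser(payload.userId);
}

// Both events trigger the same synchronisation.
const dispatch = createDispatcher(new Map([
  ['UserAddedEvent', syncUser],
  ['UserUpdatedEvent', syncUser],
]));
```

A queue consumer would pass each message body to `dispatch` and acknowledge the message once the returned promise resolves.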


franjid/filmaffinity-crawler
