This presentation demonstrates how simple it is to scrape web data and structure it in a Django application.
It scrapes the Scrapy GitHub pull requests data from https://github.com/scrapy/pulls,
taking advantage of the Django admin and a few simple views to represent and visualize the data.
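As a rough idea of how the scraped data is structured, the Django models used to store authors and pull requests might look like the sketch below; the field names are illustrative assumptions, not necessarily those of the repository.

# models.py - illustrative sketch; field names are assumptions
from django.db import models


class Author(models.Model):
    username = models.CharField(max_length=255, unique=True)
    profile_url = models.URLField(blank=True)

    def __str__(self):
        return self.username


class PullRequest(models.Model):
    title = models.CharField(max_length=255)
    url = models.URLField(unique=True)
    author = models.ForeignKey(Author, related_name='pull_requests', on_delete=models.CASCADE)

    def __str__(self):
        return self.title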
Just clone the repository by typing in:
git clone https://github.com/nam4dev/web_scraping_presentation.git
Windows users only: download and install PostgreSQL, open pgAdmin, and change the Django settings accordingly.
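For reference, the PostgreSQL part of the Django settings could look like the snippet below; the database name, user and password are placeholders to adapt to your local setup.

# settings.py - example DATABASES entry; the values are placeholders
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'web_scraping',
        'USER': 'postgres',
        'PASSWORD': 'change-me',
        'HOST': '127.0.0.1',
        'PORT': '5432',
    }
}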
Install Python (3.7+)
Create a Python virtualenv in which you will install the requirements:
cd ./web_scraping_presentation
pip install -r requirements.txt
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver
Open a web browser at http://127.0.0.1:8000/admin,
then log in using the credentials defined by the createsuperuser command above.
Go to the Authors and Pull Requests sections in the Django admin to view the tables (still empty at this point).
Also open the root page http://127.0.0.1:8000 to see the custom views, still without data.
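The Authors and Pull Requests sections appear in the admin because the models are registered there; a minimal registration, assuming the Author and PullRequest models sketched earlier, could look like this:

# admin.py - minimal sketch, assuming the Author and PullRequest models above
from django.contrib import admin

from .models import Author, PullRequest


@admin.register(Author)
class AuthorAdmin(admin.ModelAdmin):
    search_fields = ('username',)


@admin.register(PullRequest)
class PullRequestAdmin(admin.ModelAdmin):
    list_display = ('title', 'author')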
It is now time to fill the database from the spider(s).
One can trigger the GitHub spider
by simply typing in a shell:
scrapy runspider web_crawlers/spiders/github.py
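For reference, the core of such a spider could look like the sketch below; the actual implementation lives in web_crawlers/spiders/github.py, and the CSS selectors and yielded fields here are assumptions.

# github.py - simplified sketch; selectors and field names are assumptions
import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github'
    start_urls = ['https://github.com/scrapy/pulls']

    def parse(self, response):
        # Each pull request row is assumed to carry a link to the PR and its author
        for row in response.css('div.js-issue-row'):
            href = row.css('a.js-navigation-open::attr(href)').get()
            if not href:
                continue
            yield {
                'title': row.css('a.js-navigation-open::text').get(default='').strip(),
                'url': response.urljoin(href),
                'author': row.css('a.author::text').get(),
            }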
One can run the scrapyd server
by simply typing in a shell:
cd web_scraping_presentation
scrapyd > scrapyd.server.log
Go to http://127.0.0.1:6800
The scrapyd server must be running as a prerequisite for the following steps.
One can deploy the web_crawlers project
by simply typing in a shell:
cd web_scraping_presentation
scrapyd-deploy
cd web_scraping_presentation
python setup.py bdist_egg
Go to the root page http://127.0.0.1:8000 and click the appropriate link: Add the Github Spider Project to Scrapyd Server
Go to the root page http://127.0.0.1:8000 and click the appropriate link: Trigger the Github Spider
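Under the hood, these links most likely talk to scrapyd's JSON API; for instance, triggering the spider boils down to a POST against the schedule.json endpoint, as in the sketch below (the project and spider names are assumptions).

# trigger_spider.py - illustrative call to scrapyd's schedule.json endpoint;
# the 'web_crawlers' project and 'github' spider names are assumptions
import requests

response = requests.post(
    'http://127.0.0.1:6800/schedule.json',
    data={'project': 'web_crawlers', 'spider': 'github'},
)
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}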
As a hint, one can easily automate this by periodically scheduling a task that triggers the spider(s), for example with the Celery distributed task queue and the django-celery plugin, as sketched below.
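A minimal sketch of such a task, assuming a configured Celery application and reusing the scrapyd call above:

# tasks.py - minimal Celery sketch, assuming a configured Celery app;
# the scrapyd URL, project and spider names are assumptions
import requests
from celery import shared_task


@shared_task
def trigger_github_spider():
    # Ask the scrapyd server to schedule a run of the github spider
    response = requests.post(
        'http://127.0.0.1:6800/schedule.json',
        data={'project': 'web_crawlers', 'spider': 'github'},
    )
    return response.json()

The task can then be registered with Celery beat (or django-celery's periodic tasks) to run at whatever interval suits the use case.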
Go back to the Authors and Pull Requests sections in the Django admin to visualize the scraped data.
Go back to the root page http://127.0.0.1:8000 as well to visualize the data through the custom views.