This presentation demonstrates how simple it is to scrape web data and structure it in a Django application.
It scrapes the Scrapy GitHub pull requests data from https://github.com/scrapy/pulls,
taking advantage of the Django admin and a few simple views to represent and visualize the data.
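As a rough idea of how the scraped data is structured, the Django models used to store authors and pull requests might look like the sketch below; the field names are illustrative assumptions, not necessarily those of the repository.

# models.py - illustrative sketch; field names are assumptions
from django.db import models


class Author(models.Model):
    username = models.CharField(max_length=255, unique=True)
    profile_url = models.URLField(blank=True)

    def __str__(self):
        return self.username


class PullRequest(models.Model):
    title = models.CharField(max_length=255)
    url = models.URLField(unique=True)
    author = models.ForeignKey(Author, related_name='pull_requests', on_delete=models.CASCADE)

    def __str__(self):
        return self.title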
Just clone the repository by typing in:
git clone https://github.com/nam4dev/web_scraping_presentation.git
Windows users only: download and install PostgreSQL, open pgAdmin, and change the Django settings accordingly.
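For reference, the PostgreSQL part of the Django settings could look like the snippet below; the database name, user and password are placeholders to adapt to your local setup.

# settings.py - example DATABASES entry; the values are placeholders
DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.postgresql',
        'NAME': 'web_scraping',
        'USER': 'postgres',
        'PASSWORD': 'change-me',
        'HOST': '127.0.0.1',
        'PORT': '5432',
    }
}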
Install Python (3.7+)
Create a Python virtualenv in which you will install the requirements:
cd ./web_scraping_presentation
pip install -r requirements.txt
python manage.py migrate
python manage.py createsuperuser
python manage.py runserver
Open a web browser at http://127.0.0.1:8000/admin,
then log in using the credentials defined by the createsuperuser command above.
Go to the Authors and Pull Requests sections in the Django admin to view the tables (still empty at this point).
Also open the root page http://127.0.0.1:8000 to see the custom views, still without data.
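The Authors and Pull Requests sections appear in the admin because the models are registered there; a minimal registration, assuming the Author and PullRequest models sketched earlier, could look like this:

# admin.py - minimal sketch, assuming the Author and PullRequest models above
from django.contrib import admin

from .models import Author, PullRequest


@admin.register(Author)
class AuthorAdmin(admin.ModelAdmin):
    search_fields = ('username',)


@admin.register(PullRequest)
class PullRequestAdmin(admin.ModelAdmin):
    list_display = ('title', 'author')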
It is now time to fill the database from the spider(s).
One can trigger the GitHub spider
by simply typing in a shell:
scrapy runspider web_crawlers/spiders/github.py
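For reference, the core of such a spider could look like the sketch below; the actual implementation lives in web_crawlers/spiders/github.py, and the CSS selectors and yielded fields here are assumptions.

# github.py - simplified sketch; selectors and field names are assumptions
import scrapy


class GithubSpider(scrapy.Spider):
    name = 'github'
    start_urls = ['https://github.com/scrapy/pulls']

    def parse(self, response):
        # Each pull request row is assumed to carry a link to the PR and its author
        for row in response.css('div.js-issue-row'):
            href = row.css('a.js-navigation-open::attr(href)').get()
            if not href:
                continue
            yield {
                'title': row.css('a.js-navigation-open::text').get(default='').strip(),
                'url': response.urljoin(href),
                'author': row.css('a.author::text').get(),
            }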
One can run the scrapyd server
by simply typing in a shell:
cd web_scraping_presentation
scrapyd > scrapyd.server.log
Go to http://127.0.0.1:6800
The scrapyd server must be running as a prerequisite for the following steps.
One can deploy the web_crawlers project
by simply typing in a shell:
cd web_scraping_presentation
scrapyd-deploy
cd web_scraping_presentation
python setup.py bdist_egg
Go to the root page http://127.0.0.1:8000 and click the appropriate link: Add the Github Spider Project to Scrapyd Server
Go to the root page http://127.0.0.1:8000 and click the appropriate link: Trigger the Github Spider
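Under the hood, these links most likely talk to scrapyd's JSON API; for instance, triggering the spider boils down to a POST against the schedule.json endpoint, as in the sketch below (the project and spider names are assumptions).

# trigger_spider.py - illustrative call to scrapyd's schedule.json endpoint;
# the 'web_crawlers' project and 'github' spider names are assumptions
import requests

response = requests.post(
    'http://127.0.0.1:6800/schedule.json',
    data={'project': 'web_crawlers', 'spider': 'github'},
)
print(response.json())  # e.g. {'status': 'ok', 'jobid': '...'}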
As a hint, one can easily automate this by periodically scheduling a task that triggers the spider(s), for example with the Celery distributed task queue and the django-celery plugin, as sketched below.
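A minimal sketch of such a task, assuming a configured Celery application and reusing the scrapyd call above:

# tasks.py - minimal Celery sketch, assuming a configured Celery app;
# the scrapyd URL, project and spider names are assumptions
import requests
from celery import shared_task


@shared_task
def trigger_github_spider():
    # Ask the scrapyd server to schedule a run of the github spider
    response = requests.post(
        'http://127.0.0.1:6800/schedule.json',
        data={'project': 'web_crawlers', 'spider': 'github'},
    )
    return response.json()

The task can then be registered with Celery beat (or django-celery's periodic tasks) to run at whatever interval suits the use case.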
Go back to the Authors and Pull Requests sections in the Django admin to visualize the scraped data.
Go back to the root page http://127.0.0.1:8000 as well to visualize the data through the custom views.