scraping101_py

An example of how to scrape a newspaper website using Python, requests and bs4.

Description

This small project is meant to exemplify how one can easily scrape a newspaper website to retrieve articles and their content, writing the results to a CSV file with one article per line. The target is a free-access website; should log-in actions be required, you would use selenium instead.

The approach is well suited to feeding downstream Natural Language Processing applications, especially if combined with the Mediacloud API endpoint for identifying article URLs of interest.

Installation and set up

Git and Python are assumed to be installed (you can change the Python version in the Pipfile). To set up, run:

git clone git@github.com:marquesafonso/scraping101_py.git
pip install pipenv
pipenv install

Usage

Here we grab 5 articles from https://24.sapo.pt/, place their URLs in a list, and inspect the HTML using the F12 key in the browser.

This allows us to identify the elements we wish to scrape (a minimal extraction sketch follows the list below):

  • Label: The category of the article.
  • Title: The title of the article.
  • Lead: The lead of the article.
  • Author: The author of the article.
  • Date: The date the article was published on.
  • Body: The article text itself.
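
As a rough illustration, the snippet below sketches how these elements might be pulled out of a single article page with bs4. The tag and class names are hypothetical placeholders, not the selectors actually used by scraper.py - take the real ones from the HTML you inspected above.

from bs4 import BeautifulSoup

def parse_article(html: str) -> dict:
    # Hypothetical selectors: replace the tag/class names with the ones
    # found when inspecting the article pages with F12.
    soup = BeautifulSoup(html, "html.parser")
    return {
        "label": soup.find("span", class_="label").get_text(strip=True),
        "title": soup.find("h1", class_="article-title").get_text(strip=True),
        "lead": soup.find("p", class_="lead").get_text(strip=True),
        "author": soup.find("span", class_="author").get_text(strip=True),
        "date": soup.find("time").get_text(strip=True),
        "body": " ".join(p.get_text(strip=True) for p in
                         soup.find("div", class_="article-body").find_all("p")),
    }

Each find() call will raise an AttributeError if its selector does not match, which is a useful early warning that the page layout has changed.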

The requirements of the project are:

  • requests: Allows us to make HTTP requests to the URLs we wish to scrape, returning the HTML as the response.
  • bs4: Allows us to convert the responses into soup objects, which come with methods such as find() that let us efficiently parse the HTML and retrieve the text we are looking for (see the sketch after this list).
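
Putting the two together, a minimal end-to-end flow might look like the following. It assumes the hypothetical parse_article() function sketched above is in scope, and the URL list is a placeholder - the real scraper.py may be organised differently.

import csv
import os
import requests

# Placeholder URLs - in practice these would be the article links gathered
# from https://24.sapo.pt/ (or from the Mediacloud API).
ARTICLE_URLS = [
    "https://24.sapo.pt/atualidade/artigos/example-article-1",
    "https://24.sapo.pt/atualidade/artigos/example-article-2",
]

def scrape_to_csv(urls, outfile):
    fields = ["label", "title", "lead", "author", "date", "body"]
    os.makedirs(os.path.dirname(outfile), exist_ok=True)
    with open(outfile, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        for url in urls:
            response = requests.get(url, timeout=30)  # fetch the raw HTML
            response.raise_for_status()
            # parse_article() is the hypothetical extraction function above
            writer.writerow(parse_article(response.text))  # one article per line

scrape_to_csv(ARTICLE_URLS, "output/ex_scrape.csv")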

Two additional functions are used to conveniently convert the date strings into a general date/time pattern (long time) - see https://docs.microsoft.com/en-us/dotnet/standard/base-types/standard-date-and-time-format-strings for more info. This makes the date string ready for further processing and for loading into a database if needed.
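
The exact helpers live in the repository, but the idea can be sketched with the standard library alone: parse the scraped date string and re-emit it in a long date/time style. The input format below is an assumption - adjust it to whatever the site actually prints.

from datetime import datetime

def normalize_date(raw: str) -> str:
    # "%d-%m-%Y %H:%M" is an assumed input format; the output approximates
    # the general date/time (long time) pattern referenced above.
    parsed = datetime.strptime(raw, "%d-%m-%Y %H:%M")
    return parsed.strftime("%m/%d/%Y %I:%M:%S %p")

print(normalize_date("05-03-2021 14:30"))  # -> 03/05/2021 02:30:00 PM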

Feel free to play around with the code and adapt it to your needs. To test it, simply run:

pipenv run python scraper.py --outfile 'output/ex_scrape.csv'

And check out the output folder to see the results!
