
wiki-table-scrape

Scrape all the tables from a Wikipedia article into a folder of CSV files.

You can read more about it in the accompanying blog post.

Installation

This is a Python 3.5 module that depends on the Beautiful Soup and requests packages.

  1. Clone and cd into this repo.
  2. Install Python 3.5.
  3. Install requirements from pip with pip install -r requirements.txt.
  4. If on Windows, download the .whl for the lxml parser and install it locally.
  5. Test the program by running python test_wikitablescrape.py.
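
On macOS or Linux, the steps above boil down to something like the following (the clone URL is a guess based on the repository name; adjust the Python 3.5 install for your platform):

$ git clone https://github.com/yasinsb/wiki-table-scrape.git
$ cd wiki-table-scrape
$ pip install -r requirements.txt
$ python test_wikitablescrape.py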

Usage

Import the module and call the scrape function. Pass it the full URL of a Wikipedia article and a simple string (no special characters or file extensions) for the output name. All output is written to a folder named output_name, with files named output_name.csv, output_name_1.csv, and so on.

import wikitablescrape

wikitablescrape.scrape(
    url="https://en.wikipedia.org/wiki/List_of_highest-grossing_films",
    output_name="films"
)

Inspecting the output with Bash gives the following results:

$ ls films/
films.csv  films_1.csv  films_2.csv  films_3.csv

$ cat films/films_1.csv
"Rank","Title","Worldwide gross (2014 $)","Year"
"1","Gone with the Wind","$3,440,000,000","1939"
"2","Avatar","$3,020,000,000","2009"
"3","Star Wars","$2,825,000,000","1977"
"4","Titanic","$2,516,000,000","1997"
"5","The Sound of Music","$2,366,000,000","1965"
"6","E.T. the Extra-Terrestrial","$2,310,000,000","1982"
"7","The Ten Commandments","$2,187,000,000","1956"
"8","Doctor Zhivago","$2,073,000,000","1965"
"9","Jaws","$2,027,000,000","1975"
"10","Snow White and the Seven Dwarfs","$1,819,000,000","1937"

Disclaimers

The script won't give you 100% clean data for every page on Wikipedia, but it will get you most of the way there. You can see the output from the pages for mountain height, volcano height, NBA scores, and the highest-grossing films in the output folder of this repo.
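
For example, numeric columns come out as formatted strings such as "$3,440,000,000", so a little post-processing is usually needed. A rough sketch of one way to clean the gross column from the films output above (the column names come from the sample CSV and are not part of the module's API):

import csv

def parse_dollars(text):
    # "$3,440,000,000" -> 3440000000
    return int(text.replace("$", "").replace(",", ""))

with open("films/films_1.csv", newline="", encoding="utf-8") as f:
    films = list(csv.DictReader(f))

for film in films:
    film["Worldwide gross (2014 $)"] = parse_dollars(film["Worldwide gross (2014 $)"])

# With real integers you can sort or aggregate the table.
print(max(films, key=lambda row: row["Worldwide gross (2014 $)"])["Title"])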

I only plan to add features to this module as I need them, but if you would like to contribute, please open an issue or pull request.

If you'd like to read more about this module, please check out my blog post.
