Parlamentsspiegel Scraper

The Parlamentsspiegel collects the parliamentary documentation of Germany's federal states.

Unfortunately, the documents on Parlamentsspiegel lack consistent metadata. For example, they use unusual abbreviations for the German federal states, like SACA for Sachsen-Anhalt, which are far from common standards like NUTS or ISO 3166. The mapping between Parlamentsspiegel's state codes and more widely used codes can be found in input/lookup_laender_ps.csv.
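Applying the lookup could look roughly like the sketch below. The column names ps_code and iso_code are assumptions made for illustration, as is the in-memory sample; check the real header of input/lookup_laender_ps.csv before using it. DE-ST is the ISO 3166-2 code for Sachsen-Anhalt.

```python
import csv
import io


def load_lookup(handle, source_col="ps_code", target_col="iso_code"):
    """Build a dict from Parlamentsspiegel codes to standard codes.

    The default column names are assumptions; adjust them to the real
    header of input/lookup_laender_ps.csv.
    """
    reader = csv.DictReader(handle)
    return {row[source_col].strip(): row[target_col].strip() for row in reader}


# Illustrative in-memory sample; the real file may be laid out differently.
sample = io.StringIO("ps_code,iso_code\nSACA,DE-ST\n")
lookup = load_lookup(sample)

# Unknown codes fall back to a marker instead of raising a KeyError.
print(lookup.get("SACA", "unknown"))
```

For the real file, pass a handle opened with open("input/lookup_laender_ps.csv", newline="", encoding="utf-8") instead of the sample.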

Also, the HTML structure has no well-defined CSS classes, which makes the parsing a bit annoying.

The crawler is written in Python, the parsing in R.

Setup

create a virtual environment: python3 -m venv env
activate it: source env/bin/activate
install requirements: pip3 install -r requirements.txt

Use it

Define your search interest

Write the search terms for the search input field to input/searchwords.csv
and the official tags ("Schlagworte") you're interested in to input/keywords.csv.
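The layout of these two files is not documented here. As a minimal sketch, assuming one term per row in the first column and no header row, they could be read like this (read_terms and the sample terms are hypothetical):

```python
import csv
import io


def read_terms(handle):
    """Return the non-empty first-column values of a CSV, in file order."""
    return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


# In-memory stand-in for input/searchwords.csv; blank rows are skipped.
sample = io.StringIO("Klimaschutz\n\nEnergiewende\n")
terms = read_terms(sample)
print(terms)
```

If the real files do have a header row, drop the first element of the returned list.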

Run scripts

01_get_overview.py: fetches all overview tables, which are stored as HTML files in input/html/overview/, and writes the relevant links to input/data/links_beratungsstand.csv
02_get_detailpages.py: fetches the detail page for every single document; the resulting HTML files are stored in input/html/beratungsstand/
03_parsing.R: extracts the metadata and writes the parsed information to input/data/df.csv
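The numbering suggests that the scripts run in this order. A small wrapper along the following lines could chain them; it is only a sketch, not part of the repository. It assumes python3 and Rscript are on the PATH and that it is started from the repository root. The dry_run flag is an addition for illustration and only lists the commands.

```python
import subprocess

# Script names come from the list above; the interpreter names are assumptions.
STEPS = [
    ["python3", "01_get_overview.py"],
    ["python3", "02_get_detailpages.py"],
    ["Rscript", "03_parsing.R"],
]


def run_pipeline(steps=STEPS, dry_run=False):
    """Run each step in order and stop at the first failure."""
    executed = []
    for command in steps:
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(command, check=True)
        executed.append(" ".join(command))
    return executed


# Dry run: nothing is executed, the planned commands are just printed.
planned = run_pipeline(dry_run=True)
print("\n".join(planned))
```

Calling run_pipeline() without dry_run executes the three steps for real.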

To Do

clear distinction between input and output
automate folder creation
improve metadata parsing
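For the folder creation item, one possible approach is sketched below. The folder list is taken from the paths mentioned under "Run scripts"; ensure_folders is a hypothetical helper, and the demonstration runs in a temporary directory so it does not touch the working tree.

```python
import tempfile
from pathlib import Path

# Folders the scripts write to, as named in this README.
FOLDERS = [
    "input/html/overview",
    "input/html/beratungsstand",
    "input/data",
]


def ensure_folders(root="."):
    """Create all expected folders below root; existing ones are left alone."""
    created = []
    for folder in FOLDERS:
        path = Path(root) / folder
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    paths = ensure_folders(tmp)
    all_exist = all(p.is_dir() for p in paths)
```

Calling ensure_folders() at the start of 01_get_overview.py would create the folders relative to the current directory.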
