Allris-Scraper

A scraper for ratsinfo.leipzig.de.

Requirements

Runtime

Development

Usage

Using docker

Build the docker image

docker build -t codeforleipzig/allris-scraper:latest .

Run the docker container

docker run -v $(pwd)/data:/app/data --rm codeforleipzig/allris-scraper

Using python

It is recommended to use a virtual environment to isolate the libraries used in this project from your operating system's environment. To do so, run the following in the project directory:

# create the virtual environment in the project directory; do this once
python3 -m venv venv

# activate the environment; do this before working with the scraper
source venv/bin/activate

# install the required libraries
pip3 install -r requirements.txt

To run the scraper using python:

python3 ./1_read_paper_json.py --page_from 1 --page_to 1000 --modified_to 2023-04-27 --modified_from 2023-04-19
python3 ./2_download_pdfs.py
python3 ./3_txt_extraction.py
python3 ./4_srm_import.py

Scraper Output

The scraper writes its output to the data directory. One file is written per scraping session; the filename convention is <OParl object type>_<current timestamp>.jl. For example, when scraping papers: paper_2020-06-19T10-19-16.jl.
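For illustration, a timestamp in that format can be produced with Python's datetime module (a minimal sketch, not the scraper's actual code):

from datetime import datetime

# yields e.g. "paper_2020-06-19T10-19-16.jl"
filename = f"paper_{datetime.now().strftime('%Y-%m-%dT%H-%M-%S')}.jl"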

The output is a feed in JSONLines format, which means one scraped JSON document per line. For inspecting the data, jq is useful and can be used like this:

# all documents in the file
cat path/to/file | jq .

# only the first document
head -n1 path/to/file | jq .
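The feed can also be consumed directly from Python; a minimal sketch (the file path is an example):

import json

# iterate over the scraped documents, one JSON object per line
with open("data/paper_2020-06-19T10-19-16.jl") as feed:
    for line in feed:
        document = json.loads(line)
        print(sorted(document.keys()))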

Extraction of PDF and TXT files

The method download_pdfs() in the leipzig.py file downloads all PDFs linked in the JSONLines files and saves them in data/pdfs. Files that are already present in the folder will not be downloaded again.
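Conceptually, the skip-existing behaviour boils down to checking the target path before downloading. A simplified sketch, not the actual download_pdfs() implementation (URL handling and file naming are assumptions):

import os
import requests

def download_if_missing(url, target_dir="data/pdfs"):
    # derive the local filename from the last URL segment
    path = os.path.join(target_dir, url.rsplit("/", 1)[-1])
    if os.path.exists(path):
        return path  # already downloaded, skip
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with open(path, "wb") as pdf_file:
        pdf_file.write(response.content)
    return path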

From the PDF files, TXT files can be generated with the extract_text_from_pdfs_recursively() method in txt_extraction.py, which uses Tika. The TXT files are saved to data/txts. Files that already exist in that folder will not be extracted again.
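With the tika Python package, extracting a single file looks roughly like this (a sketch; extract_text_from_pdfs_recursively() additionally walks the data/pdfs tree and skips existing TXT files, and the paths here are examples):

from tika import parser

# parse the PDF with Tika; "content" holds the extracted plain text
parsed = parser.from_file("data/pdfs/example.pdf")
text = parsed["content"] or ""

with open("data/txts/example.txt", "w", encoding="utf-8") as txt_file:
    txt_file.write(text)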

Configuration

Scrapy allows for configuration on various levels. General configuration can be found in allris/settings.py. For the purposes of this project, relevant values are overridden in leipzig.py. By default, it is configured towards development needs. Specifically, aggressive caching is enabled (HTTPCACHE_ENABLED) and the number of scraped pages is limited (CLOSESPIDER_PAGECOUNT).
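In Scrapy, such per-spider overrides are typically declared in the spider's custom_settings dictionary; a minimal sketch of what the development defaults could look like (class name and values are illustrative, the actual overrides live in leipzig.py):

import scrapy

class LeipzigSpider(scrapy.Spider):
    name = "leipzig"
    custom_settings = {
        "HTTPCACHE_ENABLED": True,       # aggressive caching during development
        "CLOSESPIDER_PAGECOUNT": 100,    # stop after a limited number of pages
    }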

PDF text extraction

Prerequisite: the leipzig.py scraper has been run and has downloaded files to data/pdfs.

Run

python3 ./txt_extraction.py

to extract the texts from the PDFs. Files will be created under data/txts.

CSV

Prerequisite: txt_extraction.py has been run.

Run

python3 ./nlp.py

to join the text files as rows into a CSV file, which is created as data/data.csv. This file can be used for further NLP processing.

NLP

Data Preparation

nlp.py provides a method read_txts_into_dataframe() to read all TXT files in data/txts into a pandas dataframe, and a method write_df_to_csv() to save this dataframe in CSV format as data.csv in the data folder.
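A simplified sketch of those two steps, assuming plain text files in data/txts (the column names are illustrative, not necessarily those used by nlp.py):

import glob
import os
import pandas as pd

def read_txts_into_dataframe(txt_dir="data/txts"):
    # collect every TXT file as one row with its filename and content
    rows = []
    for path in sorted(glob.glob(os.path.join(txt_dir, "*.txt"))):
        with open(path, encoding="utf-8") as txt_file:
            rows.append({"file": os.path.basename(path), "text": txt_file.read()})
    return pd.DataFrame(rows)

def write_df_to_csv(df, csv_path="data/data.csv"):
    df.to_csv(csv_path, index=False)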

Topic Modeling

To make the obtained documents more accessible for users interested in certain topics, topic modeling has been run on the extracted documents with the R software tidyToPān. The resulting model will later be used, e.g., for a search function.

Download the German spaCy language model:

python -m spacy download de_core_news_sm
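Once downloaded, the model can be loaded and applied like this (a minimal sketch; the sentence is just an example):

import spacy

nlp = spacy.load("de_core_news_sm")
doc = nlp("Der Stadtrat hat die Vorlage beschlossen.")
print([(token.text, token.pos_) for token in doc])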
