This crawler will collect all responses to the Internet Consultatie specified by `IC_PATH` in `settings.py` and will store them in DocumentCloud, in a project with the title specified by `DOCUMENTCLOUD_PROJECT_TITLE` in `settings.py`.
- Clone the project
- Copy `local_settings.py.example` to `local_settings.py` and set values
- Change `IC_PATH`, `NUMBER_RESPONSE_PAGES` and `DOCUMENTCLOUD_PROJECT_TITLE` in `settings.py` (see the settings sketch after this list)
- If desired, change the names of `ARCHIVED_FILENAME` and `ERROR_LOG_NAME`
- Build using `sudo docker compose -f docker-compose.yml up --build -d`
- Run using `sudo docker exec -it ic_responses-crawler-1 python run_crawler.py` (preferably use `screen`)
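For orientation, the configuration amounts to plain module-level constants. A minimal sketch of what these `settings.py` values could look like follows; all values shown are hypothetical examples, not defaults of this project:

```python
# settings.py -- illustrative values only; keep the keys, adjust the values.

# Path of the consultation on internetconsultatie.nl to crawl (hypothetical example)
IC_PATH = "voorbeeldconsultatie"

# Number of paginated response pages the crawler walks through (hypothetical example)
NUMBER_RESPONSE_PAGES = 25

# Title of the DocumentCloud project the responses are stored in (hypothetical example)
DOCUMENTCLOUD_PROJECT_TITLE = "IC responses - voorbeeldconsultatie"

# Bookkeeping files; rename if desired (hypothetical defaults)
ARCHIVED_FILENAME = "archived_responses.txt"
ERROR_LOG_NAME = "error_log.txt"
```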
If you must reprocess all responses, make sure to delete the file denoted by `ARCHIVED_FILENAME` first.
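Such a reset can be as small as removing that file before the next run; the snippet below is a sketch, and the `import settings` path is an assumption about this project's layout:

```python
# reset_archive.py -- illustrative only; adjust the import to the project's settings module.
from pathlib import Path

import settings

# Remove the archive of already-processed responses so the next run starts from scratch.
Path(settings.ARCHIVED_FILENAME).unlink(missing_ok=True)
```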
The log level for DocumentCloud is set to `INFO` and can be changed in `document_storage.py`.
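If that storage layer uses the standard `logging` module (an assumption; check `document_storage.py` for the actual setup), changing the level is a one-liner. The logger name `"documentcloud"` below is also an assumption:

```python
import logging

# Raise the threshold for DocumentCloud-related messages from INFO to WARNING, for example.
logging.getLogger("documentcloud").setLevel(logging.WARNING)
```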
Scrapy automatically logs the items that have been scraped. We suppress that using `QuietLogFormatter`. If you do want to log all items, uncomment the line for `LOG_SCRAPED_ITEMS` in `ICResponsesSpider`.
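For reference, such a formatter in Scrapy typically subclasses `scrapy.logformatter.LogFormatter` and returns `None` from `scraped()`, which (since Scrapy 2.0) skips the per-item log entry. The sketch below illustrates that pattern; it is not necessarily the exact code in this repository, and the `log_scraped_items` spider attribute is a hypothetical stand-in for `LOG_SCRAPED_ITEMS`:

```python
from scrapy import logformatter


class QuietLogFormatter(logformatter.LogFormatter):
    """Suppress the per-item 'Scraped from <response>' log lines."""

    def scraped(self, item, response, spider):
        # Returning None tells Scrapy not to emit a log entry for this item.
        if getattr(spider, "log_scraped_items", False):
            return super().scraped(item, response, spider)
        return None
```

Scrapy picks up a custom formatter like this through the `LOG_FORMATTER` setting.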
You can use `manage.py` for the following:

- If not all responses were uploaded to DocumentCloud, use command `get-missing` to find the `response_number`s which are missing (a sketch of this logic follows below). Using these numbers it is possible to upload them individually, by manually tweaking `start_urls` and the initialization of `last_response_number` in `ICResponsesSpider.py`.
- Create an `overview.csv` of all responses in DocumentCloud using command `generate-overview`. Each line will contain a response in the form `'"url","name","place","timestamp"'`.
Usage: `sudo docker exec -it ic_responses-crawler-1 python manage <command>`
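The idea behind `get-missing` can be sketched without touching the DocumentCloud API: given the set of response numbers already uploaded, report every number in the expected range that is absent. The function below is illustrative only, not the project's actual implementation:

```python
def find_missing_response_numbers(uploaded, total):
    """Return sorted response numbers in 1..total that are not in `uploaded`."""
    return sorted(set(range(1, total + 1)) - set(uploaded))


# Example: responses 1-10 expected, 3 and 7 never made it to DocumentCloud.
print(find_missing_response_numbers({1, 2, 4, 5, 6, 8, 9, 10}, 10))  # [3, 7]
```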