
Responses crawler for Internet Consultatie

This crawler collects all responses to the Internet Consultatie specified by IC_PATH in settings.py and stores them in DocumentCloud, in a project whose title is given by DOCUMENTCLOUD_PROJECT_TITLE in settings.py.

Important links

Installation and usage

  • Clone the project
  • Copy local_settings.py.example to local_settings.py and set the values
  • Change IC_PATH, NUMBER_RESPONSE_PAGES and DOCUMENTCLOUD_PROJECT_TITLE in settings.py (see the sketch after this list)
  • If desired, change the names of ARCHIVED_FILENAME and ERROR_LOG_NAME
  • Build and start using sudo docker compose -f docker-compose.yml up --build -d
  • Run using sudo docker exec -it ic_responses-crawler-1 python run_crawler.py (preferably inside screen)
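
A minimal sketch of what these settings might look like; the setting names come from this README, but all values below are hypothetical examples, not the repository's defaults:

```python
# settings.py (excerpt) -- all values are hypothetical examples
IC_PATH = 'voorbeeldconsultatie'          # path of the Internet Consultatie to crawl
NUMBER_RESPONSE_PAGES = 42                # number of response listing pages
DOCUMENTCLOUD_PROJECT_TITLE = 'Voorbeeldconsultatie responses'
ARCHIVED_FILENAME = 'archived.txt'        # tracks responses that were already processed
ERROR_LOG_NAME = 'errors.log'             # file that crawl errors are written to
```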

If you must reprocess all responses, make sure to delete the file denoted by ARCHIVED_FILENAME first.

Debugging

The loglevel for DocumentCloud is set to INFO and can be changed in document_storage.py.
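
For example, a minimal sketch of changing the level through Python's standard logging module (the logger name is an assumption; check document_storage.py for the actual setup):

```python
import logging

# Hypothetical sketch: make the DocumentCloud client more verbose.
# The exact logger name used in document_storage.py may differ.
logging.getLogger('documentcloud').setLevel(logging.DEBUG)
```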

Scrapy automatically logs the items that have been scraped. We suppress that using QuietLogFormatter. If you do want to log all items, uncomment the line for LOG_SCRAPED_ITEMS in ICResponsesSpider.
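
As a rough sketch of the mechanism (not the repository's exact code), a Scrapy log formatter can suppress the per-item messages by returning None from scraped():

```python
# Sketch of a quiet log formatter; the repository's QuietLogFormatter may differ.
from scrapy import logformatter

class QuietLogFormatter(logformatter.LogFormatter):
    def scraped(self, item, response, spider):
        # Returning None tells Scrapy (>= 2.0) to skip the
        # "Scraped from <response>" message for each item.
        return None
```

Such a formatter is typically enabled via Scrapy's LOG_FORMATTER setting.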

Miscellaneous

You can use manage.py for the following:

  • If not all responses were uploaded to DocumentCloud, use the get-missing command to find the response_numbers that are missing. These responses can then be uploaded individually by manually tweaking start_urls and the initialization of last_response_number in ICResponsesSpider.py.
  • Create an overview.csv of all responses in DocumentCloud using the generate-overview command. Each line contains one response in the form '"url","name","place","timestamp"' (see the sketch after this list).
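
A minimal sketch of reading such a file back, assuming the four-column layout described above (the file name and column order come from the description; everything else is illustrative):

```python
import csv

# Read overview.csv rows as (url, name, place, timestamp).
with open('overview.csv', newline='') as f:
    for url, name, place, timestamp in csv.reader(f):
        print(url, name, place, timestamp)
```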

Usage: sudo docker exec -it ic_responses-crawler-1 python manage.py <command>
