This crawler will collect all responses to the Internet Consultatie specified by `IC_PATH` in `settings.py` and will store them in DocumentCloud, in a project with the title specified by `DOCUMENTCLOUD_PROJECT_TITLE` in `settings.py`.
- Clone the project
- Copy `local_settings.py.example` to `local_settings.py` and set values
- Change `IC_PATH`, `NUMBER_RESPONSE_PAGES` and `DOCUMENTCLOUD_PROJECT_TITLE` in `settings.py` (see the settings sketch after this list)
- If desired, change the names of `ARCHIVED_FILENAME` and `ERROR_LOG_NAME`
- Build using `sudo docker compose -f docker-compose.yml up --build -d`
- Run using `sudo docker exec -it ic_responses-crawler-1 python run_crawler.py` (preferably use `screen`)
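For orientation, the configuration amounts to plain module-level constants. A minimal sketch of what these `settings.py` values could look like follows; all values shown are hypothetical examples, not defaults of this project:

```python
# settings.py -- illustrative values only; keep the keys, adjust the values.

# Path of the consultation on internetconsultatie.nl to crawl (hypothetical example)
IC_PATH = "voorbeeldconsultatie"

# Number of paginated response pages the crawler walks through (hypothetical example)
NUMBER_RESPONSE_PAGES = 25

# Title of the DocumentCloud project the responses are stored in (hypothetical example)
DOCUMENTCLOUD_PROJECT_TITLE = "IC responses - voorbeeldconsultatie"

# Bookkeeping files; rename if desired (hypothetical defaults)
ARCHIVED_FILENAME = "archived_responses.txt"
ERROR_LOG_NAME = "error_log.txt"
```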
If you must reprocess all responses, make sure to delete the file denoted by `ARCHIVED_FILENAME` first.
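Such a reset can be as small as removing that file before the next run; the snippet below is a sketch, and the `import settings` path is an assumption about this project's layout:

```python
# reset_archive.py -- illustrative only; adjust the import to the project's settings module.
from pathlib import Path

import settings

# Remove the archive of already-processed responses so the next run starts from scratch.
Path(settings.ARCHIVED_FILENAME).unlink(missing_ok=True)
```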
The log level for DocumentCloud is set to `INFO` and can be changed in `document_storage.py`.
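If that storage layer uses the standard `logging` module (an assumption; check `document_storage.py` for the actual setup), changing the level is a one-liner. The logger name `"documentcloud"` below is also an assumption:

```python
import logging

# Raise the threshold for DocumentCloud-related messages from INFO to WARNING, for example.
logging.getLogger("documentcloud").setLevel(logging.WARNING)
```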
Scrapy automatically logs the items that have been scraped. We suppress that using `QuietLogFormatter`. If you do want to log all items, uncomment the line for `LOG_SCRAPED_ITEMS` in `ICResponsesSpider`.
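For reference, such a formatter in Scrapy typically subclasses `scrapy.logformatter.LogFormatter` and returns `None` from `scraped()`, which (since Scrapy 2.0) skips the per-item log entry. The sketch below illustrates that pattern; it is not necessarily the exact code in this repository, and the `log_scraped_items` spider attribute is a hypothetical stand-in for `LOG_SCRAPED_ITEMS`:

```python
from scrapy import logformatter


class QuietLogFormatter(logformatter.LogFormatter):
    """Suppress the per-item 'Scraped from <response>' log lines."""

    def scraped(self, item, response, spider):
        # Returning None tells Scrapy not to emit a log entry for this item.
        if getattr(spider, "log_scraped_items", False):
            return super().scraped(item, response, spider)
        return None
```

Scrapy picks up a custom formatter like this through the `LOG_FORMATTER` setting.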
You can use `manage.py` for the following:

- If not all responses were uploaded to DocumentCloud, use command `get-missing` to find the `response_number`s which are missing (a sketch of this logic follows below). Using these numbers it is possible to upload them individually, by manually tweaking `start_urls` and the initialization of `last_response_number` in `ICResponsesSpider.py`.
- Create an `overview.csv` of all responses in DocumentCloud using command `generate-overview`. Each line will contain a response in the form `'"url","name","place","timestamp"'`.
Usage: `sudo docker exec -it ic_responses-crawler-1 python manage <command>`
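The idea behind `get-missing` can be sketched without touching the DocumentCloud API: given the set of response numbers already uploaded, report every number in the expected range that is absent. The function below is illustrative only, not the project's actual implementation:

```python
def find_missing_response_numbers(uploaded, total):
    """Return sorted response numbers in 1..total that are not in `uploaded`."""
    return sorted(set(range(1, total + 1)) - set(uploaded))


# Example: responses 1-10 expected, 3 and 7 never made it to DocumentCloud.
print(find_missing_response_numbers({1, 2, 4, 5, 6, 8, 9, 10}, 10))  # [3, 7]
```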