This is a very ad-hoc tool designed to test the BTW site and its siblings.
You would typically invoke it like this from the top of the directory that contains this README:
$ scrapy crawl btw [-a url=something...]
Please read the documentation in btw_smoketest/spiders/btw.py
to
see what parameters can be passed to the spider with -a
.
The spider crawls the site given in the url
parameter and checks
that:
- Loading pages return a 200 status code. (Redirections are ignored: the spider wants the final load to resolve to an actual page.)
- The HTML returned is valid.
- Some headers are properly set.
Each time it is run, it creates a new subdirectory in out/
. The
subdirectory is created with the UTC date and time at the start of the
crawl in ISO 8601 format. The results of the run are stored in the
subdirectory:
- If there are no errors, a file named
CLEAN
will be created with the text "yes". - If there are errors, a file named
ERRORS
will contain the items that have errors, in JSON format. There will also be a file namedREPORT
that contains a human-readable error report. This is also what is sent by email, if the spider was invoked with an email address passed to it. (Again, readbtw_smoketest/spiders/btw.py
to know how to pass such address.) - Whether there are errors or not, the content of each page visited is stored in the output directory in a file named after the URL of the page, but slugified.
- Whether there are errors or not, each page visited gets a validation
report which has the same file name as the file that saves the
content of the page (see the previous item in this list) but has
.report
appended to it.
Note that the spider will only visit those pages that are readily available to the general public. Any page that requires speciall permissions for access will not be visited. Moreover, the spider is not able to interpret JavaScript. Therefore, URLs that get added by JavaScript won't be seen by the spider.
This spider requires that the settings for Scrapy contain a
VNU_JAR_PATH
setting which should be set to the location of the
VNU jar on your system. (https://github.com/validator/validator/) The VNU
jar is used to validate the HTML of the pages. This is the only
setting you could be messing with.
Running npm install
as suggested below will install the VNU jar in a default
location that will be found with the stock settings.py
. So normally you
don't need to worry about installing it.
We recommend you set a virtualenv for this spider. For instance,
assuming you start in the top directory of btw_smoketest
:
$ cd .. $ virtualenv btw_smoketest_env $ cd btw_smoketest $ . ../btw_smoketest_env/bin/activate $ pip install -r requirements.txt $ npm install