Skip to content

Tooling to check the health and content of some Mozilla sites

License

Notifications You must be signed in to change notification settings

mozmeao/www-site-checker

Repository files navigation

www-site-checker

CURRENT STATUS: Beta

This project contains tooling designed to check on the state of the mozilla.org website in various ways.

Supported checks:

  • Verify all outbound URLs found on www.mozilla.org. Verification is done by domain and or an allow-list and exceptions are reported loudly.
  • Confirm robots.txt points to the correct sitemap, on the public domain, never an internal domain
  • Verify the geo code in the majority of HTML pages served via the CDN is stable, confirming appripriate CDN Vary-header configuration
  • Verify that RSS and Atom feeds are error-free

Roadmap for future checks and behaviour

  • Site-wide smoke test: confirmation of 200 OK from every single page (in dev/stage/prod)
  • HTML validation of all CMS-authored pages
  • Support throttling, if necessary in real use

All of these checks are run by Github Actions - see .github/workflows

Project Development

Linting, etc

Install pre-commit, and then run pre-commit install and you'll be setup to auto format your code according to our style and check for errors for every commit.

Tests

TO COME


Outbound link-checker - specific instructions

Automatic usage

Currently, checks are set to run when updated code in this repo is pushed or on a twice-a-day schedule. If any issues are spotted, alerts are sent to Sentry and Slack (if either is configured) and report files are uploaded as artifacts as a result of the Github Action (more on this below)

Manual usage and local development

You can also clone this repo to your local machine and run it there. This is the recommended approach for developing new checks:

git clone [email protected]:mozmeao/www-site-checker.git
cd www-site-checker
# make a virtualenv, with whatever you prefer, and activate it
pip install -r requirements.txt
export ALLOWLIST_FILEPATH=data/allowlist-mozorg.yaml
export EXTRA_URLS_FILEPATH=data/extra-urls-mozorg.yaml
python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml

The above will start to work through the entire sitemap (and child sitemaps) at that URL

If you only want to check a smaller batch of URLs (handy in development), add the --batch param:

# Only inspect batch three of all URLs to check, after slicing the site into 40 batches
python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml --batch=3:40

And if you only want to check a specific page, you use the --specific-url param. The following, for example, checks the homepage and a Fx mobile downbload page

python bin/scan_site.py --specific-url=https://www.mozilla.org/en-US/firefox/browsers/mobile/

There is a default allowlist in use (data/allowlist-mozorg.yaml - set via env vars) but an alernative can be passed via the --allowlist param

python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml --allowlist=/path/to/custom/allowlist.yaml

If you want or need to check a site whose sitemap points to a different domain (eg you want to check an origin server whose sitemap is hard-coded to refer to the CDN domain, or a localhost setup) you should ensure the server is listed as an option in the allowlist and also pass the --maintain-hostname parameter.

For example:

python bin/scan_site.py --sitemap-url=http://origin-server.example.com/sitemap.xml --maintain-hostname

or, for localhost

python bin/scan_site.py --sitemap-url=http://localhost:8000/sitemap.xml --maintain-hostname

If you want to test the Sentry integration locally, you can pass a Sentry DSN as an environment variable. Here, we're passing a URL to Kent - a local 'fake Sentry'

SENTRY_DSN=http://[email protected]:8011/1 python bin/scan_site.py --sitemap-url=https://www.mozilla.org/sitemap.xml

The output files

If unexpected URLs are detected, they are output in pairs:

  • files starting flat_ contain a flat list of the unexpected URLs
  • files starting nested_ show each unexpected URL followed by a tab-indented list of the pages/URLs that feature the unexpected URL.
  • files starting structured_ show each page in the site with 1..n unexpected URLs, and what those URLs are. These are also used as input for the automatic creation of PRs or Issues.

If the checks were carried out in batches, there may be multiple pairs of output files, with the batch number included in the filename.

Running the checks locally will put files in the output/ directory. Checks run via Github Actions will gave a scan-results archive in the artifacts section for the relevant run, which can be downloaded and inspected. You must be authenticated to access the artifact.

Adding to the default allowlist

If you come across an alert saying there was an unexpected URL detected and you're happy to allow it here's how you make an unexpected URL into an expected one:

Automatically via Github PR

  • This tool can now open PRs against its own allowlists, so in the ideal case all you need to do is review the generated PR and merge it to main.

Manually via Github editing

  • Browse to and edit the data/allowlist-mozorg.yaml file
  • Add either a new entry to allowed_outbound_url_literals or a new tested regex to allowed_outbound_url_regexes
  • Raise a new PR against the main branch. Github Actions will run the site checks. If your new rule change is valid, the checks will no longer consider that URL to be unexpected

Locally

  • Make a new branch off main
  • Edit the allowlist - eg data/allowlist-mozorg.yaml
  • Run the checks locally (see above)
  • Push the branch up and raise a PR.

About

Tooling to check the health and content of some Mozilla sites

Resources

License

Code of conduct

Stars

Watchers

Forks

Contributors 4

  •  
  •  
  •  
  •