This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves them to the `httparchive` dataset in BigQuery.
The pipelines run in the Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the `main` branch is used on each triggered pipeline run.
Tag: `crawl_complete`
- `httparchive.crawl.pages`
- `httparchive.crawl.parsed_css`
- `httparchive.crawl.requests`

Tag: `crux_ready`
- `httparchive.core_web_vitals.technologies`
Consumers:
- HTTP Archive Tech Report (httparchive.org)

Tag: `crawl_complete`
- `httparchive.blink_features.features`
- `httparchive.blink_features.usage`

Consumers:
- chromestatus.com (example)
- crawl-complete PubSub subscription (Tags: `["crawl_complete"]`)
- bq-poller-crux-ready Scheduler (Tags: `["crux_ready"]`)
In order to unify the workflow triggering mechanism, we use a Cloud Run function that can be invoked in a number of ways (e.g. by listening to PubSub messages), performs intermediate checks, and triggers the particular Dataform workflow execution configuration.
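As a minimal sketch of this pattern (not the production code: the project, location, repository, and workflow config names below are placeholders), such a function could receive a Pub/Sub CloudEvent and start a workflow invocation via the Node.js Dataform client:

```js
// Minimal sketch: a Pub/Sub-triggered function that starts a Dataform
// workflow invocation. All resource names below are placeholders.
const functions = require('@google-cloud/functions-framework');
const {DataformClient} = require('@google-cloud/dataform');

const dataform = new DataformClient();

functions.cloudEvent('triggerDataform', async (cloudEvent) => {
  // Pub/Sub delivers the message payload base64-encoded inside the CloudEvent.
  const message = Buffer.from(cloudEvent.data.message.data, 'base64').toString();

  // Intermediate checks (e.g. validating the event) would go here.

  // Kick off the workflow config that matches this event's tag.
  const [invocation] = await dataform.createWorkflowInvocation({
    parent: 'projects/my-project/locations/us-central1/repositories/my-repo',
    workflowInvocation: {
      workflowConfig:
        'projects/my-project/locations/us-central1/repositories/my-repo/workflowConfigs/crawl_complete',
    },
  });
  console.log(`Started ${invocation.name} for event: ${message}`);
});
```

Because every trigger funnels through one function, a new event source only needs to invoke it rather than integrate with Dataform directly.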
- Create a new dev workspace in Dataform.
- Make adjustments to the Dataform configuration files and manually run a workflow to verify them.
- Push your changes to a dev branch and open a PR with a link to the BigQuery artifacts generated in the test workflow.
- In `workflow_settings.yaml` set `env_name: dev` to process sampled data.
- In `includes/constants.js` set `today` or other variables to a custom value (see the sketch below).
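A minimal sketch of such an override in `includes/constants.js`; only `today` is named above, so the date value and export shape shown here are assumptions:

```js
// includes/constants.js (sketch): pin `today` to a fixed crawl date for a
// dev run instead of deriving it at execution time.
const today = '2024-06-01'; // hypothetical crawl date under test

module.exports = {today};
```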