HTTP Archive datasets pipeline

This repository handles the HTTP Archive data pipeline, which takes the results of the monthly HTTP Archive run and saves this to the httparchive dataset in BigQuery.

Pipelines

The pipelines are run in Dataform service in Google Cloud Platform (GCP) and are kicked off automatically on crawl completion and other events. The code in the main branch is used on each triggered pipeline run.

HTTP Archive Crawl

Tag: crawl_complete

Crawl dataset httparchive.crawl.*

Consumers:
- public dataset and BQ Sharing Listing
Blink Features Report httparchive.blink_features.usage

Consumers:
- chromestatus.com

HTTP Archive Technology Report

Tag: crux_ready

httparchive.reports.cwv_tech_* and httparchive.reports.tech_*

Consumers:
- HTTP Archive Tech Report

Schedules

crawl-complete PubSub subscription

Tags: ["crawl_complete"]
bq-poller-crux-ready Scheduler

Tags: ["crux_ready"]

Triggering workflows

In order to unify the workflow triggering mechanism, we use a Cloud Run function that can be invoked in a number of ways (e.g. listen to PubSub messages), do intermediate checks and trigger the particular Dataform workflow execution configuration.

Cloud resources overview

graph TB;
    subgraph Cloud Run
        dataform-service[dataform-service service]
        bigquery-export[bigquery-export job]
    end

    subgraph PubSub
        crawl-complete[crawl-complete topic]
        dataform-service-crawl-complete[dataform-service-crawl-complete subscription]
        crawl-complete --> dataform-service-crawl-complete
    end

    dataform-service-crawl-complete --> dataform-service

    subgraph Cloud_Scheduler
        bq-poller-crux-ready[bq-poller-crux-ready Poller Scheduler Job]
        bq-poller-crux-ready --> dataform-service
    end

    subgraph Dataform
        dataform[Dataform Repository]
        dataform_release_config[dataform Release Configuration]
        dataform_workflow[dataform Workflow Execution]
    end

    dataform-service --> dataform[Dataform Repository]
    dataform --> dataform_release_config
    dataform_release_config --> dataform_workflow

    subgraph BigQuery
        bq_jobs[BigQuery jobs]
        bq_datasets[BigQuery table updates]
        bq_jobs --> bq_datasets
    end

    dataform_workflow --> bq_jobs

    bq_jobs --> bigquery-export

    subgraph Monitoring
        cloud_run_logs[Cloud Run logs]
        dataform_logs[Dataform logs]
        bq_logs[BigQuery logs]
        alerting_policies[Alerting Policies]
        slack_notifications[Slack notifications]

        cloud_run_logs --> alerting_policies
        dataform_logs --> alerting_policies
        bq_logs --> alerting_policies
        alerting_policies --> slack_notifications
    end

    dataform-service --> cloud_run_logs
    dataform_workflow --> dataform_logs
    bq_jobs --> bq_logs
    bigquery-export --> cloud_run_logs

Development Setup

Install dependencies:
```
npm install
```
Available Scripts:
- npm run format - Format code using Standard.js, fix Markdown issues, and format Terraform files
- npm run lint - Run linting checks on JavaScript, Markdown files, and compile Dataform configs
- make tf_apply - Apply Terraform configurations

Code Quality

This repository uses:

Standard.js for JavaScript code style
Markdownlint for Markdown file formatting
Dataform's built-in compiler for SQL validation

Name		Name	Last commit message	Last commit date
Latest commit History 217 Commits
.github		.github
definitions		definitions
includes		includes
infra		infra
.editorconfig		.editorconfig
.gitignore		.gitignore
Makefile		Makefile
README.md		README.md
dataform.md		dataform.md
package-lock.json		package-lock.json
package.json		package.json
workflow_settings.yaml		workflow_settings.yaml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

HTTP Archive datasets pipeline

Pipelines

HTTP Archive Crawl

HTTP Archive Technology Report

Schedules

Triggering workflows

Cloud resources overview

Development Setup

Code Quality

About

Uh oh!

Uh oh!

Contributors 4

Uh oh!

Languages

HTTPArchive/dataform

Folders and files

Latest commit

History

Repository files navigation

HTTP Archive datasets pipeline

Pipelines

HTTP Archive Crawl

HTTP Archive Technology Report

Schedules

Triggering workflows

Cloud resources overview

Development Setup

Code Quality

About

Resources

Uh oh!

Stars

Watchers

Forks

Uh oh!

Contributors 4

Uh oh!

Languages