Upgrade report data pipelines #30
includes/reports_config.js (outdated):

    bytesTotal: {
      name: 'Total Kilobytes',
I found a reports config file - it seems a good idea to keep all the configs in one place (more transparent for future contributors).
I copied it over here (to experiment with) and added the queries.
I wouldn't have been able to add the queries unless the format supported multiline strings - so I saved it as JS.
Actually, it also needs to be readable from Python - YAML?
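To illustrate the multiline-string point above, here is a hedged sketch of what a JS-based reports config could look like (the metric name and SQL are illustrative, not the actual file contents; template literals make the embedded queries trivial, whereas YAML would need block scalars):

```javascript
// Hypothetical sketch of a JS reports config with an embedded multiline
// SQL query. Field names and the query itself are illustrative only.
const reportsConfig = {
  bytesTotal: {
    name: 'Total Kilobytes',
    type: 'timeseries',
    query: `
      SELECT
        date,
        SUM(bytes) / 1024 AS kilobytes
      FROM crawl_summary
      GROUP BY date
    `
  }
}

module.exports = reportsConfig
```

A Python consumer could not `require()` this file directly, which is the trade-off the comment raises; YAML (or JSON with escaped newlines) would be readable from both sides.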
I couldn't make it work with a remote config (e.g. reading from the httparchive repo).
The file needs to be stored in this repo at runtime.
    publish(sql.type, {
      type: 'table',
      schema: 'reports',
      tags: ['crawl_reports']
    }).query(ctx => constants.fillTemplate(sql.query, params))
In reports_* datasets we could store intermediate aggregated data - it's easier to check for data issues in BigQuery than in GCS.
A Cloud Function then picks up fresh row batches and saves them to GCS.
Currently it's configured to have one table per metric per chart type, e.g. httparchive.reports_timeseries.totalBytes.
We could instead store all the metrics for one chart type in a single table (clustered by metric), but that seems a bit harder to maintain and query.
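The table-per-metric layout described above maps each (chart type, metric) pair to its own BigQuery table. A minimal sketch of that naming scheme (the helper name and dataset prefix are assumptions, inferred from the `httparchive.reports_timeseries.totalBytes` example):

```javascript
// Hypothetical helper: derive the fully qualified BigQuery table id for a
// report metric, one table per metric per chart type.
function reportTableId (project, chartType, metric) {
  return `${project}.reports_${chartType}.${metric}`
}

console.log(reportTableId('httparchive', 'timeseries', 'totalBytes'))
// → httparchive.reports_timeseries.totalBytes
```

The alternative design in the comment would collapse the last path segment into a `metric` column of a single per-chart-type table, clustered by that column.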
    const params = {
      date: constants.currentMonth,
      rankFilter: constants.devRankFilter
    }
Query parameters. I found only `date`.
We need to list all the required parameters and add the queries to test them with.
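The `constants.fillTemplate(sql.query, params)` call above substitutes these parameters into the SQL text. The real helper lives in `includes/constants.js`; a minimal re-implementation sketch, assuming simple `${name}`-style placeholders (the actual placeholder syntax is an assumption), could look like:

```javascript
// Hypothetical sketch of fillTemplate: replaces ${name} placeholders in a
// SQL string with values from params, leaving unknown placeholders intact.
// The real helper in includes/constants.js may differ.
function fillTemplate (query, params) {
  return query.replace(/\$\{(\w+)\}/g, (match, name) =>
    name in params ? String(params[name]) : match
  )
}

const sql = 'SELECT * FROM pages WHERE date = "${date}" AND ${rankFilter}'
const params = { date: '2024-06-01', rankFilter: 'rank <= 1000' }
console.log(fillTemplate(sql, params))
// → SELECT * FROM pages WHERE date = "2024-06-01" AND rank <= 1000
```

Leaving unknown placeholders untouched (rather than throwing) makes missing parameters visible in the rendered SQL, which would help with the "list all the required parameters" task.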
@tunetheweb here is a demo version that needs to be discussed. And I have no idea what to do with lenses and the 2 other requests (see the description).
Copilot reviewed 21 out of 36 changed files in this pull request and generated no suggestions.
Files not reviewed (15)
- Makefile: Language not supported
- infra/dataform-export/package.json: Language not supported
- .github/workflows/linter.yaml: Evaluated as low risk
- definitions/output/core_web_vitals/technologies.js: Evaluated as low risk
- README.md: Evaluated as low risk
- definitions/sources/httparchive.js: Evaluated as low risk
- includes/constants.js: Evaluated as low risk
- infra/README.md: Evaluated as low risk
- definitions/output/reports/cwv_tech_core_web_vitals.js: Evaluated as low risk
- infra/dataform-export/bigquery.js: Evaluated as low risk
- definitions/output/reports/cwv_tech_technologies.js: Evaluated as low risk
- includes/reports.js: Evaluated as low risk
- definitions/output/reports/cwv_tech_lighthouse.js: Evaluated as low risk
- infra/dataform-export/firestore.js: Evaluated as low risk
- definitions/output/reports/cwv_tech_categories.js: Evaluated as low risk
We want to replace the legacy reports script: https://github.com/HTTPArchive/bigquery/tree/master/sql

This is the implementation that follows the previous discussion. It replaces the script implementation (source) and breaks the pipeline down into 2 steps.

Expected outcome:
- transparent data preparation with SQL queries - the main DAG expanded to go from big to small data
- event-driven exports of small data to the reports data storage
- a comfortable Dataform console to manage any actions manually (rerun, backfill, etc.)
- ~3x performance when requesting GZIPped CSVs
- performance improvement for Firestore imports: 6 min vs 13 h
For standard reports: the `reports` tag on `crawl_complete`; for tech reports: the `cwv_tech_report` tag after the CrUX update.

Supports features:
- Run monthly histogram SQLs when the crawl is finished. Runs as part of the Dataform invocation on the `crawl_complete` + `crawl_reports` tags.
- Run longer-term time series SQLs when the crawl is finished, and be able to run the time series in an incremental fashion. In the Dataform console, change the date range to iterate over and run separate actions or a tag.
- [?] Handle different lenses (Top X, WordPress, Drupal, Magento) - to clarify.
- [?] Handle CrUX reports (monthly histograms and time series) having to run later - needs clarification.
- Be able to upload to cloud storage in GCP to allow it to be hosted on our CDN. Runs on each successful SQL report data query in the DAG.
- Be able to run, and only run, missing reports (histograms) or missing dates (time series). Failed queries are logged in the DAG execution in the Dataform console; to rerun, select the failed actions and run them there.
- Be able to force a rerun (to override any existing reports). Be able to run a subset of reports. To run any actions, select them in the Dataform console (`crawl-data` repo).

Resolves:
- `crawl` dataset httparchive.org#938
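The incremental backfill described above iterates over a date range one crawl month at a time. A sketch of generating those monthly dates (the helper name and date format are assumptions; HTTP Archive crawls are monthly):

```javascript
// Hypothetical helper: list the first day of each month between two dates,
// e.g. for backfilling monthly time series reports one crawl at a time.
function monthlyDates (startDate, endDate) {
  const dates = []
  const current = new Date(startDate + 'T00:00:00Z')
  const end = new Date(endDate + 'T00:00:00Z')
  while (current <= end) {
    dates.push(current.toISOString().slice(0, 10)) // YYYY-MM-DD
    current.setUTCMonth(current.getUTCMonth() + 1)
  }
  return dates
}

console.log(monthlyDates('2024-01-01', '2024-04-01'))
// → [ '2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01' ]
```

Each generated date would then be passed as the `date` query parameter for one incremental Dataform run.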