Upgrade report data pipelines #30

Merged
merged 81 commits into from
Dec 9, 2024

Conversation

max-ostapenko
Contributor

@max-ostapenko max-ostapenko commented Nov 16, 2024

We want to replace the legacy reports scripts: https://github.com/HTTPArchive/bigquery/tree/master/sql
This implementation follows the previous discussion.

It replaces the script implementation (source) and breaks the pipeline down into 2 steps:

  • data preparation in BQ with Dataform
  • export to GCS or Firestore with a Cloud Function

Expected outcome:

For standard reports:

  • reports configuration file created with a timeseries and a histogram
  • BQ table updates are running with fresh reports data (dataset reports) on crawl_complete
  • GCS upload is triggered whenever data is updated in BQ

For tech reports:

  • report configurations created as Dataform actions
  • aggregated BQ tables updated on cwv_tech_report tag after CrUX update
  • Firestore collections updated whenever data is updated in BQ

Supports features:

  • Run monthly histograms SQLs when crawl is finished
    Runs as part of Dataform invocation on crawl_complete+crawl_reports tag.

  • Run longer term time series SQLs when crawl is finished. Be able to run the time series in an incremental fashion.
    In the Dataform console, change the date range to iterate over, then run the separate actions or the tag.

  • [?] Handle different lenses (Top X, WordPress, Drupal, Magento)
    To be clarified.

  • [?] Handle CrUX reports (monthly histograms and time series) having to run later.
    To be clarified.

  • Be able to upload to cloud storage in GCP to allow it to be hosted on our CDN
    Runs after each successful report data query in the DAG.

  • Be able to run and only run reports missing (histograms) or missing dates (time series)
    Failed queries are logged in the DAG execution in the Dataform console.
    To rerun, select the failed actions there and run them.

  • Be able to force rerun (to override any existing reports). Be able to run a subset of reports.
    To run any subset of actions, select them in the Dataform console (crawl-data repo).
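
The "run only missing dates" requirement above can be illustrated with a small sketch. In this PR the rerun is done by hand in the Dataform console, so the helpers below (`monthlyDates`, `missingDates`) are hypothetical and only show the set difference such a check would compute, assuming one report date per month in `YYYY-MM-DD` form.

```javascript
// Hypothetical helper: first day of every month in [start, end], as YYYY-MM-DD.
function monthlyDates(start, end) {
  const dates = [];
  let [year, month] = start.split('-').map(Number);
  const [endYear, endMonth] = end.split('-').map(Number);
  while (year < endYear || (year === endYear && month <= endMonth)) {
    dates.push(`${year}-${String(month).padStart(2, '0')}-01`);
    month += 1;
    if (month > 12) {
      month = 1;
      year += 1;
    }
  }
  return dates;
}

// Hypothetical helper: dates expected in a time series but not yet stored.
function missingDates(expected, existing) {
  const present = new Set(existing);
  return expected.filter(date => !present.has(date));
}

const expected = monthlyDates('2024-09-01', '2024-12-01');
const missing = missingDates(expected, ['2024-09-01', '2024-11-01']);
console.log(missing); // → [ '2024-10-01', '2024-12-01' ]
```

The resulting list could then drive the date range chosen for a manual run; how the existing dates would be read from BQ is left open here.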

Resolves:

@max-ostapenko changed the title from "Preparing data for reports" to "Prepare for migration of report queries" on Nov 16, 2024
Comment on lines 110 to 111
bytesTotal: {
name: 'Total Kilobytes',
Contributor Author

I found the reports config file. It seems a good idea to keep all the configs in one place (more transparent for future contributors).

I copied it over here (to experiment with) and added the queries.
I could only add the queries in a format that supports multiline strings, so for now it is saved as JS.
Actually, it needs to be readable from Python, so maybe YAML?
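
As a rough illustration of the trade-off in this comment, here is a minimal sketch of a JS config entry holding a multiline query. Only `bytesTotal` and its `name` come from the diff excerpt; the `histogram.query` field, the table name, and the SQL are assumptions made up for the example. The last lines show that the same object survives a JSON round trip, which is one way a Python reader could consume it without switching to YAML.

```javascript
// Hypothetical config entry: only `bytesTotal` and its name come from the PR;
// the `histogram.query` field and the SQL text are illustrative assumptions.
const reportsConfig = {
  bytesTotal: {
    name: 'Total Kilobytes',
    histogram: {
      // A template literal keeps the multiline SQL readable in JS.
      query: `
        SELECT client, COUNT(0) AS pages
        FROM example_dataset.example_table
        GROUP BY client
      `
    }
  }
};

// Serialising to JSON escapes the newlines, so a non-JS consumer
// (for example Python's json module) could load the same config.
const asJson = JSON.stringify(reportsConfig, null, 2);
const roundTripped = JSON.parse(asJson);
console.log(
  roundTripped.bytesTotal.histogram.query === reportsConfig.bytesTotal.histogram.query
); // → true
```

The JSON form is harder to edit by hand than YAML block scalars, so this only addresses readability from Python, not authoring.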

Contributor Author

I couldn't make it work with a remote config (e.g. reading from the httparchive repo).
The file needs to be stored in this repo at runtime.

Comment on lines 10 to 14
publish(sql.type, {
  type: 'table',
  schema: 'reports',
  tags: ['crawl_reports']
}).query(ctx => constants.fillTemplate(sql.query, params))
Contributor Author

@max-ostapenko max-ostapenko Nov 17, 2024

In the reports_* datasets we could store intermediate aggregated data: it's easier to check for data issues in BQ than in GCS.
The Cloud Function will then pick up fresh row batches and save them to GCS.

Currently it's configured to have one table per metric per chart type, e.g. httparchive.reports_timeseries.totalBytes
We could instead store all the metrics for one chart type in a single table (clustered by metric), but that seems a bit more complicated to maintain and query.
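
The two table layouts discussed here can be compared side by side in a small sketch. The per-metric name follows the example given in the comment; the shared-table variant is an assumption, with a hypothetical table name `all_metrics` and a hypothetical `metric` column. The `bigquery.clusterBy` key mirrors Dataform's table options as I understand them and should be checked against the Dataform docs before use.

```javascript
// Layout A (current): one table per metric per chart type.
function perMetricTable(chartType, metric) {
  return `httparchive.reports_${chartType}.${metric}`;
}

// Layout B (alternative): one table per chart type, clustered by metric.
// 'all_metrics' and the 'metric' column are hypothetical names.
function sharedTableConfig(chartType) {
  return {
    type: 'table',
    schema: `reports_${chartType}`,
    name: 'all_metrics',
    bigquery: { clusterBy: ['metric'] }
  };
}

console.log(perMetricTable('timeseries', 'totalBytes'));
// → httparchive.reports_timeseries.totalBytes
console.log(sharedTableConfig('timeseries'));
```

With layout B every consumer query would need a `WHERE metric = ...` filter, which is the extra maintenance and querying cost mentioned above.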

Comment on lines 2 to 5
const params = {
  date: constants.currentMonth,
  rankFilter: constants.devRankFilter
}
Contributor Author

@max-ostapenko max-ostapenko Nov 17, 2024

Query parameters.
So far I found only date.
We need to list all the required parameters and add the queries to test them with.
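
One way to list the required parameters is to extract them from the query templates themselves. The sketch below is a hypothetical stand-in for `constants.fillTemplate`, whose real implementation is not shown in this conversation: the `{{name}}` placeholder syntax, the `requiredParams` helper, and the sample query and values are all assumptions.

```javascript
// Hypothetical stand-in for constants.fillTemplate.
// The {{name}} placeholder syntax is an assumption, not the repo's format.
const PLACEHOLDER = /\{\{(\w+)\}\}/g;

// Unique placeholder names found in a query template, in order of appearance.
function requiredParams(template) {
  return [...new Set([...template.matchAll(PLACEHOLDER)].map(match => match[1]))];
}

// Substitute every placeholder, failing loudly if a parameter is missing.
function fillTemplate(template, params) {
  const missing = requiredParams(template).filter(name => !(name in params));
  if (missing.length > 0) {
    throw new Error(`Missing query parameters: ${missing.join(', ')}`);
  }
  return template.replace(PLACEHOLDER, (_, name) => String(params[name]));
}

// Illustrative query and values only.
const query = "SELECT * FROM example WHERE date = '{{date}}' {{rankFilter}}";
console.log(requiredParams(query)); // → [ 'date', 'rankFilter' ]
console.log(fillTemplate(query, { date: '2024-11-01', rankFilter: 'AND rank <= 10000' }));
// → SELECT * FROM example WHERE date = '2024-11-01' AND rank <= 10000
```

Running `requiredParams` over every query in the config would give the full parameter list to test against.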

@max-ostapenko
Contributor Author

@tunetheweb here is a demo version that needs to be discussed.
Once we see that it covers all the requirements and agree on the feasibility of the 3 topics in the comments above, I'll finalise the part that uploads to GCS.

And I have no idea what to do with lenses and the 2 other requests (see the description).

@max-ostapenko max-ostapenko requested a review from Copilot December 9, 2024 10:38

Copilot reviewed 21 out of 36 changed files in this pull request and generated no suggestions.

Files not reviewed (15)
  • Makefile: Language not supported
  • infra/dataform-export/package.json: Language not supported
  • .github/workflows/linter.yaml: Evaluated as low risk
  • definitions/output/core_web_vitals/technologies.js: Evaluated as low risk
  • README.md: Evaluated as low risk
  • definitions/sources/httparchive.js: Evaluated as low risk
  • includes/constants.js: Evaluated as low risk
  • infra/README.md: Evaluated as low risk
  • definitions/output/reports/cwv_tech_core_web_vitals.js: Evaluated as low risk
  • infra/dataform-export/bigquery.js: Evaluated as low risk
  • definitions/output/reports/cwv_tech_technologies.js: Evaluated as low risk
  • includes/reports.js: Evaluated as low risk
  • definitions/output/reports/cwv_tech_lighthouse.js: Evaluated as low risk
  • infra/dataform-export/firestore.js: Evaluated as low risk
  • definitions/output/reports/cwv_tech_categories.js: Evaluated as low risk
@max-ostapenko max-ostapenko merged commit ef54451 into main Dec 9, 2024
19 checks passed
@max-ostapenko max-ostapenko deleted the reports branch December 9, 2024 23:41