Upgrade report data pipelines #30
includes/reports_config.js (outdated):

    bytesTotal: {
      name: 'Total Kilobytes',
I found a reports config file - it seems a good idea to keep all the configs in one place (more transparent for future contributors).
I copied it over here (to experiment with) and added the queries.
I wouldn't have been able to add the queries unless the format supported multiline strings - so I saved it as JS.
Actually, it also needs to be readable from Python - YAML?
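To illustrate the multiline-string point above, here is a hedged sketch of what a JS-based reports config could look like (the metric name and SQL are illustrative, not the actual file contents; template literals make the embedded queries trivial, whereas YAML would need block scalars):

```javascript
// Hypothetical sketch of a JS reports config with an embedded multiline
// SQL query. Field names and the query itself are illustrative only.
const reportsConfig = {
  bytesTotal: {
    name: 'Total Kilobytes',
    type: 'timeseries',
    query: `
      SELECT
        date,
        SUM(bytes) / 1024 AS kilobytes
      FROM crawl_summary
      GROUP BY date
    `
  }
}

module.exports = reportsConfig
```

A Python consumer could not `require()` this file directly, which is the trade-off the comment raises; YAML (or JSON with escaped newlines) would be readable from both sides.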
I couldn't make it work with a remote config (e.g. reading from the httparchive repo).
The file needs to be stored in this repo at runtime.
    publish(sql.type, {
      type: 'table',
      schema: 'reports',
      tags: ['crawl_reports']
    }).query(ctx => constants.fillTemplate(sql.query, params))
In reports_* datasets we could store intermediate aggregated data - it's easier to check for data issues in BigQuery than in GCS.
A Cloud Function then picks up fresh row batches and saves them to GCS.
Currently it's configured to have one table per metric per chart type, e.g. httparchive.reports_timeseries.totalBytes.
We could instead store all the metrics for one chart type in a single table (clustered by metric), but that seems a bit harder to maintain and query.
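The table-per-metric layout described above maps each (chart type, metric) pair to its own BigQuery table. A minimal sketch of that naming scheme (the helper name and dataset prefix are assumptions, inferred from the `httparchive.reports_timeseries.totalBytes` example):

```javascript
// Hypothetical helper: derive the fully qualified BigQuery table id for a
// report metric, one table per metric per chart type.
function reportTableId (project, chartType, metric) {
  return `${project}.reports_${chartType}.${metric}`
}

console.log(reportTableId('httparchive', 'timeseries', 'totalBytes'))
// → httparchive.reports_timeseries.totalBytes
```

The alternative design in the comment would collapse the last path segment into a `metric` column of a single per-chart-type table, clustered by that column.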
    const params = {
      date: constants.currentMonth,
      rankFilter: constants.devRankFilter
    }
Query parameters. I found only `date`.
We need to list all the required parameters and add the queries to test them with.
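The `constants.fillTemplate(sql.query, params)` call above substitutes these parameters into the SQL text. The real helper lives in `includes/constants.js`; a minimal re-implementation sketch, assuming simple `${name}`-style placeholders (the actual placeholder syntax is an assumption), could look like:

```javascript
// Hypothetical sketch of fillTemplate: replaces ${name} placeholders in a
// SQL string with values from params, leaving unknown placeholders intact.
// The real helper in includes/constants.js may differ.
function fillTemplate (query, params) {
  return query.replace(/\$\{(\w+)\}/g, (match, name) =>
    name in params ? String(params[name]) : match
  )
}

const sql = 'SELECT * FROM pages WHERE date = "${date}" AND ${rankFilter}'
const params = { date: '2024-06-01', rankFilter: 'rank <= 1000' }
console.log(fillTemplate(sql, params))
// → SELECT * FROM pages WHERE date = "2024-06-01" AND rank <= 1000
```

Leaving unknown placeholders untouched (rather than throwing) makes missing parameters visible in the rendered SQL, which would help with the "list all the required parameters" task.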
@tunetheweb here is a demo version that needs to be discussed. And I have no idea what to do with lenses and the 2 other requests (see the description).
Copilot reviewed 21 out of 36 changed files in this pull request and generated no suggestions.
Files not reviewed (15)
- Makefile: Language not supported
- infra/dataform-export/package.json: Language not supported
- .github/workflows/linter.yaml: Evaluated as low risk
- definitions/output/core_web_vitals/technologies.js: Evaluated as low risk
- README.md: Evaluated as low risk
- definitions/sources/httparchive.js: Evaluated as low risk
- includes/constants.js: Evaluated as low risk
- infra/README.md: Evaluated as low risk
- definitions/output/reports/cwv_tech_core_web_vitals.js: Evaluated as low risk
- infra/dataform-export/bigquery.js: Evaluated as low risk
- definitions/output/reports/cwv_tech_technologies.js: Evaluated as low risk
- includes/reports.js: Evaluated as low risk
- definitions/output/reports/cwv_tech_lighthouse.js: Evaluated as low risk
- infra/dataform-export/firestore.js: Evaluated as low risk
- definitions/output/reports/cwv_tech_categories.js: Evaluated as low risk
We want to replace the legacy reports script: https://github.com/HTTPArchive/bigquery/tree/master/sql

This is the implementation that follows the previous discussion. It replaces the script implementation (source) and breaks the pipeline down into 2 steps.

Expected outcome:
- transparent data preparation with SQL queries - the main DAG expanded to go from big to small data
- event-driven exports of small data to the reports data storage
- a comfortable Dataform console to manage any actions manually (rerun, backfill, etc.)
- ~3x performance when requesting GZIPped CSVs
- performance improvement for Firestore imports: 6 min vs 13 h
For standard reports: the `reports` tag on `crawl_complete`; for tech reports: the `cwv_tech_report` tag after the CrUX update.

Supports features:
- Run monthly histogram SQLs when the crawl is finished. Runs as part of the Dataform invocation on the `crawl_complete` + `crawl_reports` tags.
- Run longer-term time series SQLs when the crawl is finished, and be able to run the time series in an incremental fashion. In the Dataform console, change the date range to iterate over and run separate actions or a tag.
- [?] Handle different lenses (Top X, WordPress, Drupal, Magento) - to clarify.
- [?] Handle CrUX reports (monthly histograms and time series) having to run later - needs clarification.
- Be able to upload to cloud storage in GCP to allow it to be hosted on our CDN. Runs on each successful SQL report data query in the DAG.
- Be able to run, and only run, missing reports (histograms) or missing dates (time series). Failed queries are logged in the DAG execution in the Dataform console; to rerun, select the failed actions and run them there.
- Be able to force a rerun (to override any existing reports). Be able to run a subset of reports. To run any actions, select them in the Dataform console (`crawl-data` repo).

Resolves:
- `crawl` dataset httparchive.org#938
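The incremental backfill described above iterates over a date range one crawl month at a time. A sketch of generating those monthly dates (the helper name and date format are assumptions; HTTP Archive crawls are monthly):

```javascript
// Hypothetical helper: list the first day of each month between two dates,
// e.g. for backfilling monthly time series reports one crawl at a time.
function monthlyDates (startDate, endDate) {
  const dates = []
  const current = new Date(startDate + 'T00:00:00Z')
  const end = new Date(endDate + 'T00:00:00Z')
  while (current <= end) {
    dates.push(current.toISOString().slice(0, 10)) // YYYY-MM-DD
    current.setUTCMonth(current.getUTCMonth() + 1)
  }
  return dates
}

console.log(monthlyDates('2024-01-01', '2024-04-01'))
// → [ '2024-01-01', '2024-02-01', '2024-03-01', '2024-04-01' ]
```

Each generated date would then be passed as the `date` query parameter for one incremental Dataform run.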