Update report JSON exports to crawl dataset #938
They've been hacked and hacked and hacked and are now a mess. Mostly my hacking. Open to better ideas here (Dataform?). Basically they need to be able to do this:
Thoughts on how best to tackle this WITHOUT hacky bash scripts?
Considering the requirement list it's no surprise it's a hacky one :) We could move it to Dataform, but I'm not sure it would give any advantages. Here's how it could look in Dataform (a rough sketch follows after the list):
- Assign a tag to histogram report table jobs, and include it to run on
- Assign a tag to time series report table jobs, and include it to run on
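A minimal sketch, assuming Dataform's JS API, of what that tagging could look like. All names here (tables, columns, tags) are hypothetical placeholders, not the actual schema:

```js
// definitions/histogram_report_example.js (hypothetical names throughout)
publish("histogram_report_example", {
  type: "table",
  schema: "reports",
  tags: ["histogram_reports"], // picked up by the histogram run
}).query((ctx) => `
  SELECT date, client, bin, COUNT(0) AS volume
  FROM ${ctx.ref("crawl", "pages")}
  GROUP BY date, client, bin`);

// definitions/timeseries_report_example.js (hypothetical names throughout)
publish("timeseries_report_example", {
  type: "table",
  schema: "reports",
  tags: ["timeseries_reports"], // picked up by the time series run
}).query((ctx) => `
  SELECT date, client, AVG(metric_value) AS value
  FROM ${ctx.ref("crawl", "pages")}
  GROUP BY date, client`);
```

Each group could then be triggered independently by filtering a run to its tag (e.g. something like `dataform run --tags histogram_reports` with the CLI, or a tag filter on a scheduled workflow invocation).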
I think we could be better off moving this logic to SQL completely: a table per report, and exporting the whole table to GCS after each table change.
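For the "export the whole table to GCS after each change" part, one way (just a sketch; the bucket path and names are placeholders) is a separate `operate()` action that runs BigQuery's `EXPORT DATA` once the report table has been rebuilt:

```js
// definitions/export_histogram_report_example.js
// Hypothetical follow-up action: export the finished report table to GCS as JSON.
operate("export_histogram_report_example")
  .dependencies(["histogram_report_example"])
  .tags(["histogram_reports"])
  .queries((ctx) => `
    EXPORT DATA OPTIONS (
      uri = 'gs://example-bucket/reports/histogram_report_example/*.json',
      format = 'JSON',
      overwrite = true
    ) AS
    SELECT * FROM ${ctx.ref("histogram_report_example")}`);
```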
Seems like a templating scenario; it can be implemented with Node.js in Dataform.
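The templating itself could then live in an `includes/` helper that renders the same SQL for every metric, so each report definition stays tiny. Again a hypothetical sketch (the function name, metric expression and columns are made up):

```js
// includes/reports.js (hypothetical helper)
// Renders a generic timeseries query for whatever metric expression is passed in.
function timeseriesQuery(ctx, metricExpr) {
  return `
    SELECT date, client, ${metricExpr} AS value
    FROM ${ctx.ref("crawl", "pages")}
    GROUP BY date, client`;
}
module.exports = { timeseriesQuery };

// definitions/timeseries_a11y_button_name.js (would normally live in its own file)
publish("timeseries_a11y_button_name", { tags: ["timeseries_reports"] }).query(
  (ctx) => reports.timeseriesQuery(ctx, "COUNTIF(a11y_button_name_pass) / COUNT(0)")
);
```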
Different tag, different trigger
Related to the incrementality implementation note above, this needs to be separated out (a Cloud Function?).
Missing why?
Select a report in the dev workspace, adjust the template parameters and hit
Can be done using additional tags.
Yeah, seems Dataform matches the requirements well (simple and native). So I suppose it's a manual step today?
Deployed on a VM in the cloud. We do a "git pull" to release a new report, and it's kicked off from cron every morning: it bails early if the tables aren't ready, runs each report when the data is available, and then checks (but does not run) each report every other day of that month. A bit wasteful to be honest! It takes a full day to run all reports for all lenses.
Ideally no missing data. But sometimes it fails and I don't get round to fixing it that month. The way it works currently is it checks the last date and then runs the report with
Oh, so it's the queries that I see running today (with the VM service account). And looking at the first histogram report's bytes estimate, we are doing 35x more processing than would be required with an incremental table. I would focus on rewriting the queries to the new
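A rough illustration of the kind of rewrite being suggested, assuming the consolidated `crawl.pages` table and an incremental report table that only processes crawls not yet in the output (column names are guesses, not the actual schema):

```js
// definitions/histogram_report_incremental_example.js
// Hypothetical incremental variant: each run only scans rows for new crawl dates
// instead of reprocessing all history.
publish("histogram_report_incremental_example", {
  type: "incremental",
  schema: "reports",
  tags: ["histogram_reports"],
}).query((ctx) => `
  SELECT date, client, bin, COUNT(0) AS volume
  FROM ${ctx.ref("crawl", "pages")}
  WHERE TRUE
    ${ctx.when(ctx.incremental(), `AND date > (SELECT MAX(date) FROM ${ctx.self()})`)}
  GROUP BY date, client, bin`);
```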
BTW, if I understood correctly, the current solution is constantly checking and trying to process missing reports without much manual interference today; that's not the case with Dataform. When jobs fail in Dataform we'll need to fix and re-run them manually (a correct DAG being the real solution). Queuing reports and retrying failed ones later doesn't work with this tool.
Yup.
Oh, that's a good point! That no longer exists, so it will never complete. But it's running every day trying. I have deleted that file so it won't be run anymore. Will delete it from GitHub too.
As well as odd examples like the above, the CrUX ones will fail until that data is ready. And some don't work with lenses (but we don't have a good way of excluding them). As those will run every day (without success), that probably inflates the failure rates. But yeah, it would be nice to be able to schedule these better so we only have real failures.
This is because the Lighthouse report will now be JSON, so we don't need to process the whole report to get a very small subset of that data? That will indeed be a very nice improvement!
Yeah I started on that. Will try to finish it out.
Yes. But that also has downsides as per above! It was very low tech. We can do better here so we don't need to do that.
That shouldn't be an issue if we make this more robust.
No, I meant we don't need to process all historical data every month, only the last crawl. We may even deduplicate some processing and atomically create a wide metrics table:
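The original snippet isn't reproduced here, but a hedged sketch of what such a wide, single-pass metrics table could look like (every metric expression and column name below is invented for illustration): one row per date/client with many metric columns, computed once over the latest crawl, so each JSON report becomes a cheap projection of this table.

```js
// definitions/wide_crawl_metrics_example.js
// Hypothetical wide table: many metrics computed in one pass over the new crawl.
publish("wide_crawl_metrics_example", {
  type: "incremental",
  schema: "reports",
  tags: ["crawl_reports"],
}).query((ctx) => `
  SELECT
    date,
    client,
    COUNT(0) AS total_pages,
    COUNTIF(is_https) / COUNT(0) AS pct_https,
    APPROX_QUANTILES(page_weight_bytes, 1000)[OFFSET(500)] AS median_page_weight_bytes
  FROM ${ctx.ref("crawl", "pages")}
  WHERE TRUE
    ${ctx.when(ctx.incremental(), `AND date > (SELECT MAX(date) FROM ${ctx.self()})`)}
  GROUP BY date, client`);
```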
The reports in the
Oh, true. I was talking about time series: https://github.com/HTTPArchive/bigquery/blob/master/sql/timeseries/a11yButtonName.sql
Yeah, but then I add a
Oh, I expected something like this. OK, let me create a branch with some drafts, so that you can get a feel for the workflow and templating.
Yeah, the script started simple, and then grew and grew and grew, well beyond what we should use a bash script for! But hey, it works!
@tunetheweb I expect to merge HTTPArchive/dataform#30 so that we can deliver the migration and improvements in tomorrow's CWV Tech prod pipeline run. GCS JSON reports are currently configured as a demo, with 1 timeseries and 1 histogram published. The JSON reports will be ready for prod once:
For the script located in https://github.com/HTTPArchive/bigquery/tree/master/sql:
I can help with the SQL rewrite, but I'm not familiar with the nuances of these bash scripts.
Related issues: