Update inaccurate sizes query for 2024 #108

Merged on Apr 16, 2024 (12 commits)

Collaborator

@joemcgill joemcgill commented Mar 29, 2024

This is an update to the previous query created in 2022 to evaluate the impact of inaccurate image sizes attributes on WordPress sites using HTTPArchive data.

The main changes from the original query are:

  • Updates the query to use the new httparchive.all.pages table.
  • Reports percentages at every 10th percentile rather than only 10, 25, 50, 75, and 90.

See: https://github.com/GoogleChromeLabs/wpp-research/blob/main/sql/2022/12/inaccurate-sizes-attribute-impact.sql

Query results

| percentile | client | sizesAbsoluteError | sizesRelativeError | idealSizesSelectedResourceEstimatedPixels | actualSizesEstimatedWastedLoadedPixels | relativeSizesEstimatedWastedLoadedPixels | idealSizesSelectedResourceEstimatedBytes | actualSizesEstimatedWastedLoadedBytes | relativeSizesEstimatedWastedLoadedBytes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 10 | desktop | 0 | 0.00% | 22500 | 0 | 0.00% | 3217.294704 | 0 | 0.00% |
| 20 | desktop | 0 | 0.00% | 38700 | 0 | 0.00% | 6348.266602 | 0 | 0.00% |
| 30 | desktop | 30 | 2.56% | 60000 | 0 | 0.00% | 10174 | 0 | 0.00% |
| 40 | desktop | 92 | 21.15% | 81510 | 0 | 0.00% | 15106.58125 | 0 | 0.00% |
| 50 | desktop | 168 | 46.75% | 90000 | 0 | 0.00% | 22073.49726 | 0 | 0.00% |
| 60 | desktop | 287 | 76.47% | 150000 | 959 | 0.83% | 33262 | 283.5713028 | 0.83% |
| 70 | desktop | 403 | 117.39% | 240000 | 76464 | 77.78% | 51765 | 11312.8272 | 77.78% |
| 80 | desktop | 600 | 190.91% | 360000 | 258560 | 221.92% | 84253 | 37755.31432 | 221.92% |
| 90 | desktop | 934 | 344.44% | 559872 | 633600 | 611.11% | 166977.4965 | 114393.7302 | 611.11% |
| 10 | mobile | 0 | 0.00% | 32400 | 0 | 0.00% | 5159.87395 | 0 | 0.00% |
| 20 | mobile | 13 | 0.00% | 62500 | 0 | 0.00% | 10203.16195 | 0 | 0.00% |
| 30 | mobile | 31 | 8.43% | 90000 | 0 | 0.00% | 16965.07937 | 0 | 0.00% |
| 40 | mobile | 50 | 12.50% | 147456 | 0 | 0.00% | 26480.16 | 0 | 0.00% |
| 50 | mobile | 72 | 19.21% | 230400 | 0 | 0.00% | 40172.8267 | 0 | 0.00% |
| 60 | mobile | 136 | 29.50% | 320000 | 0 | 0.00% | 59071.69489 | 0 | 0.00% |
| 70 | mobile | 192 | 66.67% | 421888 | 27900 | 13.78% | 85978.125 | 3595.061728 | 13.78% |
| 80 | mobile | 244 | 115.22% | 589824 | 227500 | 124.84% | 130197.1576 | 26660.31658 | 124.84% |
| 90 | mobile | 360 | 168.66% | 786432 | 504000 | 440.83% | 242668 | 77222.4375 | 440.83% |

@joemcgill joemcgill requested a review from felixarntz March 29, 2024 19:57
@joemcgill
Collaborator Author

joemcgill commented Mar 29, 2024

@felixarntz I've taken a first pass at trying to update the previous query from #19 here. I assume that creating a new query in the 2024 folder is preferred to directly editing the previous query. I've not run this query directly, but have validated that it will run. I'm looking forward to your feedback.

Also, the metrics that are being processed from the payload are coming from this custom-metric definition for responsive images: https://github.com/HTTPArchive/custom-metrics/blob/main/dist/responsive_images.js

Collaborator

@felixarntz felixarntz left a comment

@joemcgill I left some suggestions regarding the query. While I realize much of it is based on the original query from 2022, I think it's worth simplifying and optimizing the query to focus on the data we currently care about (which FWIW also makes the query cheaper and faster to execute).

At the moment it includes some data points that aren't really important for the optimization of the sizes attribute.

@joemcgill
Collaborator Author

@felixarntz I've updated the query in 932f5f1 to make use of the custom_metrics column (I have no way of verifying that the structure is right due to quota limits) and return only data columns that I think we'll find useful:

  • percentile
  • client
  • sizesAbsoluteError
  • sizesRelativeError
  • idealSizesSelectedResourceEstimatedPixels
  • actualSizesEstimatedWastedLoadedPixels
  • idealSizesSelectedResourceEstimatedBytes
  • actualSizesEstimatedWastedLoadedBytes

It may also be useful to get a count of pages contained in each group. What do you think?

@joemcgill joemcgill requested a review from felixarntz April 2, 2024 01:40
Collaborator

@felixarntz felixarntz left a comment

A bit more technical feedback on the query.

Comment on lines 36 to 39
imgData.idealSizesSelectedResourceEstimatedPixels
imgData.actualSizesEstimatedWastedLoadedPixels,
imgData.idealSizesSelectedResourceEstimatedBytes
imgData.actualSizesEstimatedWastedLoadedBytes,
Collaborator

Replying to your feedback in #108 (comment), I do think we should also calculate the relative (%) values here, since aggregating those 4 fields individually per percentile doesn't carry a lot of meaningful information. The pixel and byte numbers are entirely dependent on the size of the respective images, and because we aggregate them individually per percentile at the end of the query, we don't get to see any relationship between them. Aggregating individually only tells us how many pixels/bytes are wasted between the smallest and largest pictures. I think relative values would be more helpful for this, because of course larger images lead to larger waste.

So I would suggest returning the following here:

  • sizesRelativeWastedLoadedPixels: actualSizesEstimatedWastedLoadedPixels / idealSizesSelectedResourceEstimatedPixels
  • sizesRelativeWastedLoadedBytes: actualSizesEstimatedWastedLoadedBytes / idealSizesSelectedResourceEstimatedBytes

And then get percentiles for those two.

FWIW, this is similar to sizesAbsoluteError and sizesRelativeError. The latter is probably more helpful in measuring eventual success, as absolute numbers are skewed by larger images and larger viewports. We may want to return all of the data (both absolute and relative), but the two would serve different purposes. To me the relative data appears more useful.

Collaborator Author

@joemcgill joemcgill Apr 10, 2024

I guess I was assuming the quantiles would be relative to each other, so we could easily calculate the % based on these numbers if we decided we wanted to. Perhaps I'm misunderstanding how each of these APPROX_QUANTILES values relates to the others.

I'm happy to add back relative values to the query, but am unsure how to best do so without being able to do some trial and error on the query itself. Would it be something like this?

CREATE TEMPORARY FUNCTION GET_IMG_SIZES_ACCURACY(custom_metrics STRING) RETURNS
  ARRAY<STRUCT<hasSrcset BOOL,
  hasSizes BOOL,
  sizesAbsoluteError INT64,
  sizesRelativeError FLOAT64,
  idealSizesSelectedResourceEstimatedPixels INT64,
  actualSizesEstimatedWastedLoadedPixels INT64,
  relativeSizesEstimatedWastedLoadedPixels FLOAT64,
  idealSizesSelectedResourceEstimatedBytes FLOAT64,
  actualSizesEstimatedWastedLoadedBytes FLOAT64,
  relativeSizesEstimatedWastedLoadedBytes FLOAT64>>
AS (
  ARRAY(
    SELECT AS STRUCT
      CAST(JSON_EXTRACT_SCALAR(image, '$.hasSrcset') AS BOOL) AS hasSrcset,
      CAST(JSON_EXTRACT_SCALAR(image, '$.hasSizes') AS BOOL) AS hasSizes,
      CAST(JSON_EXTRACT_SCALAR(image, '$.sizesAbsoluteError') AS INT64) AS sizesAbsoluteError,
      CAST(JSON_EXTRACT_SCALAR(image, '$.sizesRelativeError') AS FLOAT64) AS sizesRelativeError,
      CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64) AS idealSizesSelectedResourceEstimatedPixels,
      CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64) AS actualSizesEstimatedWastedLoadedPixels,
      SAFE_DIVIDE(
        CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64),
        CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64)
      ) AS relativeSizesEstimatedWastedLoadedPixels,
      CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64) AS idealSizesSelectedResourceEstimatedBytes,
      CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64) AS actualSizesEstimatedWastedLoadedBytes,
      SAFE_DIVIDE(
        CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64),
        CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64)
      ) AS relativeSizesEstimatedWastedLoadedBytes,
    FROM
      UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.responsive_images.responsive-images')) AS image
  )
);

Collaborator Author

To make ☝🏻 easier to test, I went ahead and pushed 1054716 as a first attempt at this.

Collaborator

This looks exactly right. Let me try to run the query to verify.

Collaborator

I guess I was assuming the quantiles would be relative to each other, so we could easily calculate the % based on these numbers if we decided we wanted to. Perhaps I'm misunderstanding how each of these APPROX_QUANTILES values relates to the others.

A quantile only reports the value found at a given position in the sorted data; it says nothing about how the values relate to each other. So if you have 10 values, with 8 values being 1 and 2 values being 10, almost all quantiles would show the value 1. Of the percentiles displayed here, only the 90th percentile would show 10 instead of 1.
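To make the ten-value example concrete, here is a minimal JavaScript sketch. It assumes a nearest-rank percentile, which is an exact stand-in chosen for illustration only; `APPROX_QUANTILES` is approximate and may behave slightly differently on real data. The helper name `percentileNearestRank` is hypothetical.

```javascript
// Nearest-rank percentile: the value at position ceil(p% of n) in the sorted
// list. Assumption: this exact method stands in for APPROX_QUANTILES, which
// is approximate in BigQuery.
function percentileNearestRank(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  // Integer arithmetic first, to avoid floating-point surprises in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Eight values of 1 and two values of 10, as in the example above.
const tenValues = [1, 1, 1, 1, 1, 1, 1, 1, 10, 10];

for (const p of [10, 20, 30, 40, 50, 60, 70, 80, 90]) {
  console.log(`p${p}: ${percentileNearestRank(tenValues, p)}`);
}
```

Under this assumption, every percentile from 10 through 80 reports 1, and only the 90th reports 10.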

So my feedback to include relative values here is not related to the quantiles, but to the fact that the absolute sizes error depends heavily on the size of the images. From an error perspective though, we shouldn't give larger images more impact in our data assessment.

For example, loading a 4000px (2MB) image where a 2000px (1MB) image would be ideal is, in absolute terms, a lot worse than loading a 400px (200kB) image where a 200px (100kB) image would be ideal. In relative terms, though, the two cases are the same: the loaded image is 100% larger than (or 200% as large as) the ideal image.
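The comparison can be sketched in a few lines of JavaScript. The two objects reuse the numbers from the example; the property names `idealSize` and `loadedSize` are hypothetical and only serve the illustration.

```javascript
// The two cases from the example above (sizes in px).
const exampleImages = [
  { label: 'large image', idealSize: 2000, loadedSize: 4000 },
  { label: 'small image', idealSize: 200, loadedSize: 400 },
];

const wasteSummary = exampleImages.map(({ label, idealSize, loadedSize }) => {
  // Absolute waste grows with the size of the image.
  const absoluteWaste = loadedSize - idealSize;
  // Relative waste expresses the same waste as a percentage of the ideal size.
  const relativeWastePercent = (absoluteWaste / idealSize) * 100;
  return { label, absoluteWaste, relativeWastePercent };
});

console.log(wasteSummary);
```

The absolute waste differs by a factor of ten (2000 versus 200), while the relative waste is 100% in both cases.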

Collaborator Author

I see what you're saying. Showing relative waste gives us a different view of the effect that incorrect sizes attributes have. I'd anticipate that any additional constraint on max sizes values will likely have a greater impact on larger images, since those are the ones that are more likely to be shown when a smaller source is selected, but having both values will make it easier to confirm.

Collaborator

@felixarntz felixarntz Apr 10, 2024

After further review, I have one point of feedback regarding:

      SAFE_DIVIDE(
        CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64),
        CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64)
      ) AS relativeSizesEstimatedWastedLoadedPixels,
      ...
      SAFE_DIVIDE(
        CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64),
        CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64)
      ) AS relativeSizesEstimatedWastedLoadedBytes,

I'm unsure whether the order of this division is right. I don't necessarily have the answer myself, but I'm still trying to decipher what the two values respectively are. To look at some real data, I ran the "sub query" part of this (i.e. without aggregating into percentiles) for a specific URL, in this case https://wordpress.org/. I've put the query results into this spreadsheet for us to look at. This basically has the data for every single image on that page.

What I see is that the actualSizes... value is 0 in many cases. I guess that means the sizes attribute leads to the ideal srcset being selected? Maybe we should verify on the actual page whether that makes sense based on our understanding.

In my understanding so far, actualSizes... is the amount of pixels/bytes that is wasted, and idealSizes... is the amount of pixels/bytes based on the ideal available srcset. If that's the case, the sum of the two values is the amount of pixels/bytes actually loaded. Based on that, to get the relative wasted loaded pixels/bytes, we need to make the division the other way around (actualSizes... / idealSizes...). For example, if the loaded image is only a little too large, this would give us a small percentage, while an image twice as large would give us 100%, and one three times as large would give us 200%.
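A small JavaScript sketch of the two division orders, assuming the reading above (loaded = ideal + wasted). The field names come from the query; the sample values are made up, and `safeDivide` is a hypothetical helper that imitates `SAFE_DIVIDE` by returning `null` when the denominator is zero.

```javascript
// Hypothetical helper imitating BigQuery's SAFE_DIVIDE: null instead of a
// division-by-zero result, and null when either input is missing.
function safeDivide(numerator, denominator) {
  if (numerator == null || denominator == null || denominator === 0) {
    return null;
  }
  return numerator / denominator;
}

// Suggested order: wasted pixels relative to the ideal resource.
function wastedOverIdeal(image) {
  return safeDivide(
    image.actualSizesEstimatedWastedLoadedPixels,
    image.idealSizesSelectedResourceEstimatedPixels
  );
}

// Order from the first attempt, shown only for comparison.
function idealOverWasted(image) {
  return safeDivide(
    image.idealSizesSelectedResourceEstimatedPixels,
    image.actualSizesEstimatedWastedLoadedPixels
  );
}

// Made-up sample values, not taken from HTTP Archive data.
const sampleImages = [
  // Ideal source loaded, nothing wasted.
  { idealSizesSelectedResourceEstimatedPixels: 90000, actualSizesEstimatedWastedLoadedPixels: 0 },
  // Loaded image twice as large as the ideal one.
  { idealSizesSelectedResourceEstimatedPixels: 90000, actualSizesEstimatedWastedLoadedPixels: 90000 },
  // Loaded image three times as large as the ideal one.
  { idealSizesSelectedResourceEstimatedPixels: 90000, actualSizesEstimatedWastedLoadedPixels: 180000 },
];

for (const image of sampleImages) {
  console.log(wastedOverIdeal(image), idealOverWasted(image));
}
```

With wasted / ideal, the three samples give 0, 1 and 2 (0%, 100% and 200%). The reversed order gives `null` for the image with no waste, because its denominator is zero, and 0.5 for the image that is three times too large.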

Collaborator Author

My understanding is that in both the cases of estimated wasted pixels and wasted bytes, idealSizesSelectedResource... represents the resource (from srcset) that would have been selected if the sizes value was accurate. Similarly, actualSizesEstimatedWasted... represents the difference between the ideal size and the size of the selected source. So the places where you're seeing 0 reported for actualSizesEstimatedWasted... means that even though the sizes attribute was inaccurate, there wasn't a better source available in the srcset.

That said, I think you're probably right that we would want to divide the estimated wasted bytes by the ideal selected to get a relative value of the wasted value.

@felixarntz
Collaborator

@joemcgill

It may also be useful to get a count of pages contained in each group.

Can you clarify what you mean by that? Do you mean the number of pages that fall into each percentile? IMO this would be difficult to intertwine with this query. FWIW we usually need multiple queries to answer all the questions we have. When it comes to measuring the opportunity in number of pages or WordPress sites, I think a separate query would be more helpful, maybe in a separate PR. We could write a query that groups the images by page and then gets e.g. the median wasted pixels or bytes, absolute or relative, per page. The overall distribution of those medians would then give us data like "x% of WordPress sites have a median sizes error of y% or worse".
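The aggregation described above can be sketched in JavaScript before committing to a query shape. This is only an illustration of the logic, not the proposed query: the row shape (`page`, `sizesRelativeError`), the helper names and the sample rows are all hypothetical.

```javascript
// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Share of pages whose per-page median error is at or above the threshold.
function shareOfPagesAtOrAbove(rows, threshold) {
  // Step 1: group the per-image values by page.
  const byPage = new Map();
  for (const { page, sizesRelativeError } of rows) {
    if (sizesRelativeError == null) continue;
    if (!byPage.has(page)) byPage.set(page, []);
    byPage.get(page).push(sizesRelativeError);
  }
  if (byPage.size === 0) return null;

  // Step 2: take the median per page and count pages at or above threshold.
  let matching = 0;
  for (const values of byPage.values()) {
    if (median(values) >= threshold) matching += 1;
  }
  return matching / byPage.size;
}

// Made-up rows: one entry per image, with the page it belongs to.
const sampleRows = [
  { page: 'https://a.example/', sizesRelativeError: 0 },
  { page: 'https://a.example/', sizesRelativeError: 0.5 },
  { page: 'https://a.example/', sizesRelativeError: 2 },
  { page: 'https://b.example/', sizesRelativeError: 0 },
  { page: 'https://b.example/', sizesRelativeError: 0 },
  { page: 'https://c.example/', sizesRelativeError: 1 },
  { page: 'https://c.example/', sizesRelativeError: 3 },
];

console.log(shareOfPagesAtOrAbove(sampleRows, 0.5));
```

In the sample, the per-page medians are 0.5, 0 and 2, so two of the three pages have a median error of 50% or worse.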

Collaborator

@felixarntz felixarntz left a comment

@joemcgill One additional technical recommendation at a higher level: While we probably know JS better than BigQuery SQL, using JS in BigQuery is problematic for two reasons:

  • The queries are a lot slower as the BigQuery parsers and runners cannot optimize JS. It's like a black box to them.
  • For the same reason, error reporting is poor, which makes errors in the JS hard to deal with. For instance, BigQuery doesn't know whether the data types are correct, so failures are only found upon running the query, and you don't get a better message than that the JS failed, without any details.

We can replace the entire getImgSizesAccuracy(custom_metrics) function with native BigQuery SQL code to make the queries faster and improve DX, as any mistake you make would be reported in the BigQuery console (e.g. in the Google Cloud Project) as you're writing the query, not only when you run it.

I have rewritten the function, in the way it currently is, here. You can replace this 1:1 (only update the function name elsewhere, as I've made it follow the SQL best practice of the capitalized name):

CREATE TEMPORARY FUNCTION GET_IMG_SIZES_ACCURACY(custom_metrics STRING) RETURNS
  ARRAY<STRUCT<sizesAbsoluteError INT64,
  sizesRelativeError FLOAT64,
  idealSizesSelectedResourceEstimatedPixels INT64,
  actualSizesEstimatedWastedLoadedPixels INT64,
  idealSizesSelectedResourceEstimatedBytes FLOAT64,
  actualSizesEstimatedWastedLoadedBytes FLOAT64>>
AS (
  ARRAY(
    SELECT AS STRUCT
      CAST(JSON_EXTRACT_SCALAR(image, '$.sizesAbsoluteError') AS INT64) AS sizesAbsoluteError,
      CAST(JSON_EXTRACT_SCALAR(image, '$.sizesRelativeError') AS FLOAT64) AS sizesRelativeError,
      CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64) AS idealSizesSelectedResourceEstimatedPixels,
      CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64) AS actualSizesEstimatedWastedLoadedPixels,
      CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64) AS idealSizesSelectedResourceEstimatedBytes,
      CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64) AS actualSizesEstimatedWastedLoadedBytes
    FROM
      UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.responsive_images.responsive-images')) AS image
  )
);

I'd recommend that you do that, and afterwards incorporate any applicable suggestions from my previous review, which should be straightforward with this "template".

To explain the code a little:

  • JSON_EXTRACT_ARRAY() extracts JSON data into an array. Every single array entry is a JSON string itself, which is needed because BigQuery can't just "guess" the type of data in it.
  • We UNNEST() so that you effectively query the images almost like a table of image objects.
  • Since in our case we know the inner JSON of each item contains an object, we then call JSON_EXTRACT_SCALAR() on every field of the inner item that we need.
  • Last but not least we have to then cast them to the correct data type, as JSON_EXTRACT_SCALAR() returns everything as a string.
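For readers more at home in JS, the four steps above map roughly onto the following sketch. It is not a replacement for the SQL function: `jsonExtractArray`, `jsonExtractScalar` and `castNumber` are hypothetical stand-ins that only approximate the BigQuery functions, and the sample payload is made up.

```javascript
// Rough stand-in for JSON_EXTRACT_ARRAY: walk to the array and return every
// entry as its own JSON string. Simplification: returns [] where BigQuery
// would return NULL.
function jsonExtractArray(json, keys) {
  let node = JSON.parse(json);
  for (const key of keys) {
    if (node === null || typeof node !== 'object') return [];
    node = node[key];
  }
  return Array.isArray(node) ? node.map((entry) => JSON.stringify(entry)) : [];
}

// Rough stand-in for JSON_EXTRACT_SCALAR: the result is always a string, or
// null when the field is missing or is not a scalar.
function jsonExtractScalar(json, key) {
  const parsed = JSON.parse(json);
  if (parsed === null || typeof parsed !== 'object') return null;
  const value = parsed[key];
  if (value === null || value === undefined || typeof value === 'object') {
    return null;
  }
  return String(value);
}

// Stand-in for CAST(... AS INT64 / FLOAT64): null stays null.
const castNumber = (text) => (text === null ? null : Number(text));

function getImgSizesAccuracy(customMetrics) {
  // Mapping over the extracted entries plays the role of UNNEST().
  return jsonExtractArray(customMetrics, ['responsive_images', 'responsive-images'])
    .map((image) => ({
      sizesAbsoluteError: castNumber(jsonExtractScalar(image, 'sizesAbsoluteError')),
      sizesRelativeError: castNumber(jsonExtractScalar(image, 'sizesRelativeError')),
      idealSizesSelectedResourceEstimatedPixels: castNumber(
        jsonExtractScalar(image, 'idealSizesSelectedResourceEstimatedPixels')
      ),
    }));
}

// Made-up payload in the shape the JSON path in the query implies.
const sampleCustomMetrics = JSON.stringify({
  responsive_images: {
    'responsive-images': [
      {
        sizesAbsoluteError: 168,
        sizesRelativeError: 0.4675,
        idealSizesSelectedResourceEstimatedPixels: 90000,
      },
    ],
  },
});

console.log(getImgSizesAccuracy(sampleCustomMetrics));
```

The intermediate string step is the part that differs most from plain JS: every scalar comes back as text and has to be cast explicitly, which is what the `CAST(...)` calls in the SQL do.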

@joemcgill
Collaborator Author

Thanks @felixarntz. I think I've addressed most of the feedback except for adding back relative values, which I've asked about here. It's definitely challenging to get this right without being able to run these queries to see how the data is returned. I appreciate your help with this.

@joemcgill
Collaborator Author

One other question I had about the updated SQL: JSON_EXTRACT_SCALAR and JSON_EXTRACT_ARRAY are both listed in the docs as legacy functions that are no longer recommended. Should we try to use JSON_VALUE and JSON_VALUE_ARRAY instead?

@felixarntz
Collaborator

One other question I had about the updated SQL: JSON_EXTRACT_SCALAR and JSON_EXTRACT_ARRAY are both listed in the docs as legacy functions that are no longer recommended. Should we try to use JSON_VALUE and JSON_VALUE_ARRAY instead?

Potentially, though I don't personally feel strongly about that. I have to admit I haven't used the recommendations myself as I only found out a few days ago myself that the JSON_EXTRACT functions are no longer recommended. I'm not sure how straightforward the change would be (is it just a 1:1 replacement or does it work differently?), so maybe we stick with what I'm more familiar with for now? Once the query looks good to go in this way, we could try to update to the recommended functions and re-run to verify we get the same results.

Collaborator Author

@joemcgill joemcgill left a comment

Thanks @felixarntz. I've replied to both of your comments inline.

Collaborator

@felixarntz felixarntz left a comment

@joemcgill This looks great, just one tiny problem. I ran the query and will put the results into the PR description now.

Comment on lines 99 to 100
percentile,
client
Collaborator

Potentially a good idea to reverse this. I'm not sure whether in this particular situation it's a good idea to have the mobile and desktop results always next to each other, but usually we look at them as two independent lenses on the same data, so client is typically best placed first in ORDER BY.

joemcgill and others added 2 commits April 11, 2024 16:02
Values can occasionally be a decimal

Co-authored-by: Felix Arntz <[email protected]>
Co-authored-by: Felix Arntz <[email protected]>
Collaborator Author

@joemcgill joemcgill left a comment

I've made the changes you've recommended. Thanks for testing this out.

@joemcgill joemcgill requested a review from westonruter April 15, 2024 15:11
Collaborator

@adamsilverstein adamsilverstein left a comment

Nice!

@westonruter
Collaborator

For reference, the diff between the old query and the new one:

--- old.sql	2024-04-16 10:41:31.794750809 -0700
+++ new.sql	2024-04-16 10:41:24.940750600 -0700
@@ -1,6 +1,6 @@
 # HTTP Archive query to measure impact of inaccurate sizes attributes per <img> for WordPress sites.
 #
-# WPP Research, Copyright 2022 Google LLC
+# WPP Research, Copyright 2024 Google LLC
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
@@ -13,81 +13,90 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
+#
+# See https://github.com/GoogleChromeLabs/wpp-research/pull/108
+
+DECLARE DATE_TO_QUERY DATE DEFAULT '2024-03-01';
 
-# See query results here: https://github.com/GoogleChromeLabs/wpp-research/pull/19
-CREATE TEMPORARY FUNCTION
-  getSrcsetSizesAccuracy(payload STRING)
-  RETURNS ARRAY<STRUCT<sizesAbsoluteError INT64,
+CREATE TEMPORARY FUNCTION GET_IMG_SIZES_ACCURACY(custom_metrics STRING) RETURNS
+  ARRAY<STRUCT<hasSrcset BOOL,
+  hasSizes BOOL,
+  sizesAbsoluteError FLOAT64,
   sizesRelativeError FLOAT64,
-  wDescriptorAbsoluteError INT64,
-  wDescriptorRelativeError FLOAT64,
+  idealSizesSelectedResourceEstimatedPixels INT64,
   actualSizesEstimatedWastedLoadedPixels INT64,
+  relativeSizesEstimatedWastedLoadedPixels FLOAT64,
+  idealSizesSelectedResourceEstimatedBytes FLOAT64,
   actualSizesEstimatedWastedLoadedBytes FLOAT64,
-  wastedLoadedPercent FLOAT64>>
-  LANGUAGE js AS '''
-try {
-  var $ = JSON.parse(payload);
-  var responsiveImages = JSON.parse($._responsive_images);
-  responsiveImages = responsiveImages['responsive-images'];
-  return responsiveImages.map(({
-    sizesAbsoluteError,
-    sizesRelativeError,
-    wDescriptorAbsoluteError,
-    wDescriptorRelativeError,
-    idealSizesSelectedResourceEstimatedPixels,
-    actualSizesEstimatedWastedLoadedPixels,
-    actualSizesEstimatedWastedLoadedBytes
-  }) => {
-    let wastedLoadedPercent;
-    if ( idealSizesSelectedResourceEstimatedPixels > 0 ) {
-      wastedLoadedPercent = actualSizesEstimatedWastedLoadedPixels / idealSizesSelectedResourceEstimatedPixels;
-    } else {
-      wastedLoadedPercent = null;
-    }
-    return {
-      sizesAbsoluteError,
-      sizesRelativeError,
-      wDescriptorAbsoluteError,
-      wDescriptorRelativeError,
-      actualSizesEstimatedWastedLoadedPixels,
-      actualSizesEstimatedWastedLoadedBytes,
-      wastedLoadedPercent
-    };
-  }
+  relativeSizesEstimatedWastedLoadedBytes FLOAT64>>
+AS (
+  ARRAY(
+    SELECT AS STRUCT
+      CAST(JSON_EXTRACT_SCALAR(image, '$.hasSrcset') AS BOOL) AS hasSrcset,
+      CAST(JSON_EXTRACT_SCALAR(image, '$.hasSizes') AS BOOL) AS hasSizes,
+      CAST(JSON_EXTRACT_SCALAR(image, '$.sizesAbsoluteError') AS FLOAT64) AS sizesAbsoluteError,
+      CAST(JSON_EXTRACT_SCALAR(image, '$.sizesRelativeError') AS FLOAT64) AS sizesRelativeError,
+      CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64) AS idealSizesSelectedResourceEstimatedPixels,
+      CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64) AS actualSizesEstimatedWastedLoadedPixels,
+      SAFE_DIVIDE(
+        CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64),
+        CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64)
+      ) AS relativeSizesEstimatedWastedLoadedPixels,
+      CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64) AS idealSizesSelectedResourceEstimatedBytes,
+      CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64) AS actualSizesEstimatedWastedLoadedBytes,
+      SAFE_DIVIDE(
+        CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64),
+        CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64)
+      ) AS relativeSizesEstimatedWastedLoadedBytes,
+    FROM
+      UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.responsive_images.responsive-images')) AS image
+  )
+);
+
+CREATE TEMPORARY FUNCTION IS_CMS(technologies ARRAY<STRUCT<technology STRING, categories ARRAY<STRING>, info ARRAY<STRING>>>, cms STRING, version STRING) RETURNS BOOL AS (
+  EXISTS(
+    SELECT * FROM UNNEST(technologies) AS technology, UNNEST(technology.info) AS info
+    WHERE technology.technology = cms
+    AND (
+      version = ""
+      OR ENDS_WITH(version, ".x") AND (STARTS_WITH(info, RTRIM(version, "x")) OR info = RTRIM(version, ".x"))
+      OR info = version
+    )
+  )
 );
-} catch (e) {
-  return [];
-}
-''';
+
+WITH wordpressSizesData AS (
 SELECT
-  percentile,
   client,
-  APPROX_QUANTILES(image.sizesAbsoluteError, 1000)[OFFSET(percentile * 10)] AS sizesAbsoluteError,
-  APPROX_QUANTILES(image.sizesRelativeError, 1000)[OFFSET(percentile * 10)] AS sizesRelativeError,
-  APPROX_QUANTILES(image.wDescriptorAbsoluteError, 1000)[OFFSET(percentile * 10)] AS wDescriptorAbsoluteError,
-  APPROX_QUANTILES(image.wDescriptorRelativeError, 1000)[OFFSET(percentile * 10)] AS wDescriptorRelativeError,
-  APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedPixels, 1000)[OFFSET(percentile * 10)] AS actualSizesEstimatedWastedLoadedPixels,
-  APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedBytes, 1000)[OFFSET(percentile * 10)] AS actualSizesEstimatedWastedLoadedBytes,
-  APPROX_QUANTILES(image.wastedLoadedPercent, 1000)[OFFSET(percentile * 10)] AS wastedLoadedPercent
-FROM (
-  SELECT
-    tpages._TABLE_SUFFIX AS client,
     image
   FROM
-    `httparchive.pages.2022_10_01_*` AS tpages,
-    UNNEST(getSrcsetSizesAccuracy(payload)) AS image
-  JOIN
-    `httparchive.technologies.2022_10_01_*` AS tech
-  ON
-    tech.url = tpages.url
+    `httparchive.all.pages`,
+    UNNEST(GET_IMG_SIZES_ACCURACY(custom_metrics)) AS image
   WHERE
-    tpages._TABLE_SUFFIX = tech._TABLE_SUFFIX
-    AND app = 'WordPress'
-    AND category = 'CMS' ),
-  UNNEST([10, 25, 50, 75, 90]) AS percentile
+    date = DATE_TO_QUERY
+    AND IS_CMS(technologies, 'WordPress', '')
+    AND is_root_page = TRUE
+    AND image.hasSrcset = TRUE
+    AND image.hasSizes = TRUE
+)
+
+SELECT
+  percentile,
+  client,
+  APPROX_QUANTILES(image.sizesAbsoluteError, 100)[OFFSET(percentile)] AS sizesAbsoluteError,
+  APPROX_QUANTILES(image.sizesRelativeError, 100)[OFFSET(percentile)] AS sizesRelativeError,
+  APPROX_QUANTILES(image.idealSizesSelectedResourceEstimatedPixels, 100)[OFFSET(percentile)] AS idealSizesSelectedResourceEstimatedPixels,
+  APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedPixels, 100)[OFFSET(percentile)] AS actualSizesEstimatedWastedLoadedPixels,
+  APPROX_QUANTILES(image.relativeSizesEstimatedWastedLoadedPixels, 100)[OFFSET(percentile)] AS relativeSizesEstimatedWastedLoadedPixels,
+  APPROX_QUANTILES(image.idealSizesSelectedResourceEstimatedBytes, 100)[OFFSET(percentile)] AS idealSizesSelectedResourceEstimatedBytes,
+  APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedBytes, 100)[OFFSET(percentile)] AS actualSizesEstimatedWastedLoadedBytes,
+  APPROX_QUANTILES(image.relativeSizesEstimatedWastedLoadedBytes, 100)[OFFSET(percentile)] AS relativeSizesEstimatedWastedLoadedBytes,
+FROM
+  wordpressSizesData,
+  UNNEST([10, 20, 30, 40, 50, 60, 70, 80, 90]) AS percentile
 GROUP BY
   percentile,
   client
 ORDER BY
-  percentile,
-  client
+  client,
+  percentile
