Update inaccurate sizes query for 2024 #108
This is an update to the previous query created in 2022 to evaluate the impact of inaccurate image sizes attributes on WordPress sites using HTTPArchive data.
The main changes from the original query are:
- Updates the query to use the new `httparchive.all.pages` table.
- Reports percentages at every 10th percentile rather than only 10, 25, 50, 75, and 90.
See: https://github.com/GoogleChromeLabs/wpp-research/blob/main/sql/2022/12/inaccurate-sizes-attribute-impact.sql
@felixarntz I've taken a first pass at trying to update the previous query from #19 here. I assume that creating a new query in the 2024 folder is preferred to directly editing the previous query. I've not run this query directly, but have validated that it will run. I'm looking forward to your feedback. Also, the metrics that are being processed from the payload are coming from this custom-metric definition for responsive images: https://github.com/HTTPArchive/custom-metrics/blob/main/dist/responsive_images.js
@joemcgill I left some suggestions regarding the query. While I realize much of it is based on the original query from 2022, I think it's worth simplifying and optimizing the query to focus on the data we currently care about (which FWIW also makes the query cheaper and faster to execute).
At the moment it includes some data points that aren't really important for the optimization of the `sizes` attribute.
@felixarntz I've updated the query in 932f5f1 to make use of the
It may also be useful to get a count of pages contained in each group. What do you think?
A bit more technical feedback on the query.
imgData.idealSizesSelectedResourceEstimatedPixels
imgData.actualSizesEstimatedWastedLoadedPixels,
imgData.idealSizesSelectedResourceEstimatedBytes
imgData.actualSizesEstimatedWastedLoadedBytes,
Replying to your feedback in #108 (comment), I do think we should also calculate the relative (%) values here, since aggregating those 4 fields individually per percentile doesn't carry a lot of meaningful information. The pixel and byte numbers are entirely dependent on the size of the respective images, and because we aggregate them individually per percentile at the end of the query, we don't get to see any relationship between them. Aggregating individually only tells us how many pixels/bytes are wasted between the smallest and largest pictures. I think relative values would be more helpful for this, because of course larger images lead to larger waste.
So I would suggest returning here:
- `sizesRelativeWastedLoadedPixels`: `actualSizesEstimatedWastedLoadedPixels / idealSizesSelectedResourceEstimatedPixels`
- `sizesRelativeWastedLoadedBytes`: `actualSizesEstimatedWastedLoadedBytes / idealSizesSelectedResourceEstimatedBytes`
And then get percentiles for those two.
FWIW, this is similar to `sizesAbsoluteError` and `sizesRelativeError`. The latter is probably more helpful in measuring eventual success, as absolute numbers are skewed by larger images and larger viewports. We may want to return all of the data (both absolute and relative), but the two would serve different purposes. To me the relative data appears more useful.
I guess I was assuming the quantiles would be relative to each other, so we could easily calculate the % based on these numbers if we decided we wanted to. Perhaps I'm misunderstanding how each of these `APPROX_QUANTILES` values relate to each other.
I'm happy to add back relative values to the query, but am unsure how to best do so without being able to do some trial and error on the query itself. Would it be something like this?
CREATE TEMPORARY FUNCTION GET_IMG_SIZES_ACCURACY(custom_metrics STRING) RETURNS
ARRAY<STRUCT<hasSrcset BOOL,
hasSizes BOOL,
sizesAbsoluteError INT64,
sizesRelativeError FLOAT64,
idealSizesSelectedResourceEstimatedPixels INT64,
actualSizesEstimatedWastedLoadedPixels INT64,
relativeSizesEstimatedWastedLoadedPixels FLOAT64,
idealSizesSelectedResourceEstimatedBytes FLOAT64,
actualSizesEstimatedWastedLoadedBytes FLOAT64,
relativeSizesEstimatedWastedLoadedBytes FLOAT64>>
AS (
ARRAY(
SELECT AS STRUCT
CAST(JSON_EXTRACT_SCALAR(image, '$.hasSrcset') AS BOOL) AS hasSrcset,
CAST(JSON_EXTRACT_SCALAR(image, '$.hasSizes') AS BOOL) AS hasSizes,
CAST(JSON_EXTRACT_SCALAR(image, '$.sizesAbsoluteError') AS INT64) AS sizesAbsoluteError,
CAST(JSON_EXTRACT_SCALAR(image, '$.sizesRelativeError') AS FLOAT64) AS sizesRelativeError,
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64) AS idealSizesSelectedResourceEstimatedPixels,
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64) AS actualSizesEstimatedWastedLoadedPixels,
SAFE_DIVIDE(
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64),
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64)
) AS relativeSizesEstimatedWastedLoadedPixels,
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64) AS idealSizesSelectedResourceEstimatedBytes,
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64) AS actualSizesEstimatedWastedLoadedBytes,
SAFE_DIVIDE(
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64),
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64)
) AS relativeSizesEstimatedWastedLoadedBytes,
FROM
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.responsive_images.responsive-images')) AS image
)
);
To make ☝🏻 easier to test, I went ahead and pushed 1054716 as a first attempt at this.
This looks exactly right. Let me try to run the query to verify.
> I guess I was assuming each quantiles would be relative to each other, so we could easily calculate the % based on these numbers if we decided we wanted to. Perhaps I'm misunderstanding how each of these `APPROX_QUANTILES` values relate to each other.

The quantiles are only based on the values, but not relatively. So if you have 10 values, with 8 values being `1` and 2 values being `10`, almost all quantiles would show the value `1`. Of the percentiles displayed here only the 90th percentile would not show `1` but `10`.
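To make the example with eight `1`s and two `10`s concrete, here is a minimal Python sketch. It assumes a plain nearest-rank percentile, which is not the algorithm behind BigQuery's `APPROX_QUANTILES`; `nearest_rank_percentile` is a hypothetical helper that only illustrates why most of the reported percentiles show `1`.

```python
def nearest_rank_percentile(values, percentile):
    """Return the nearest-rank percentile of values (percentile in 1..100).

    Illustration only: this is an exact method, not the approximate
    algorithm BigQuery uses for APPROX_QUANTILES.
    """
    ordered = sorted(values)
    # Nearest rank is ceil(percentile / 100 * n), computed with integers.
    rank = (percentile * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Ten values: eight are 1 and two are 10, as in the example above.
values = [1] * 8 + [10] * 2

percentiles = [10, 20, 30, 40, 50, 60, 70, 80, 90]
results = {p: nearest_rank_percentile(values, p) for p in percentiles}

for p, value in results.items():
    print(f"p{p}: {value}")
```

Each percentile is just one value picked from the sorted list, so the numbers in different columns of the query output come from different images and cannot be divided by each other afterwards.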
So my feedback to include relative values here is not related to the quantiles. It is about the fact that the absolute sizes error depends heavily on the size of the images. From an error perspective though, we shouldn't give larger images more impact in our data assessment.
For example, if for an ideal image of 2000px or 1MB an image of 4000px or 2MB is loaded, absolutely speaking, this is a lot worse than for an ideal image of 200px or 100kB loading an image of 400px or 200kB. But relatively, it's the same: The actually loaded image is 100% larger (or 200% as large) compared to the ideal image.
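The same comparison as a small Python sketch. It assumes the wasted amount is simply the loaded size minus the ideal size, and `relative_waste` is a hypothetical helper, not part of the query.

```python
def relative_waste(wasted, ideal):
    """Return wasted / ideal, or None when ideal is zero or missing."""
    if not ideal:
        return None
    return wasted / ideal


# Ideal and actually loaded widths in pixels, taken from the example above.
large = {"ideal": 2000, "loaded": 4000}
small = {"ideal": 200, "loaded": 400}

# Assumption: wasted = loaded - ideal.
large_wasted = large["loaded"] - large["ideal"]
small_wasted = small["loaded"] - small["ideal"]

large_relative = relative_waste(large_wasted, large["ideal"])
small_relative = relative_waste(small_wasted, small["ideal"])

print(f"absolute waste: {large_wasted} vs {small_wasted}")
print(f"relative waste: {large_relative:.0%} vs {small_relative:.0%}")
```

The absolute numbers differ by a factor of ten while the relative waste is identical, which is why the relative value is the fairer measure of how far off the `sizes` attribute is.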
I see what you're saying. Showing relative waste gives us a different view of the effect that incorrect sizes attributes have. I'd anticipate that any additional constraint on max sizes values will likely have a greater impact on larger images, since those are the ones that are more likely to be shown when a smaller source is selected, but having both values will make it easier to confirm.
After further review, I have one point of feedback regarding:
SAFE_DIVIDE(
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64),
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64)
) AS relativeSizesEstimatedWastedLoadedPixels,
...
SAFE_DIVIDE(
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64),
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64)
) AS relativeSizesEstimatedWastedLoadedBytes,
I'm unsure the order of this division is right. I don't necessarily have the answer myself, but I'm still trying to decipher what the two values respectively are. To look at some real data, I ran the "sub query" part of this (i.e. without aggregating into percentiles) for a specific URL, in this case https://wordpress.org/. I've put the query results into this spreadsheet for us to look at. This basically has the data for every single image on that page.
What I see is that the `actualSizes...` value is `0` in many cases. I guess that means the `sizes` attribute leads to the ideal srcset being selected? Maybe we should verify on the actual page whether that makes sense based on our understanding.
In my understanding so far, the `actualSizes...` value is the amount of pixels/bytes that is wasted, and the `idealSizes...` value is the amount of pixels/bytes based on the ideal available srcset. If that's the case, it would mean the sum of the two values is the amount of pixels/bytes actually loaded. Based on that, to get the relative wasted loaded pixels/bytes, we need to make the division the other way around (`actualSizes... / idealSizes...`). If the loaded image is a little too large, it would give us small percentages for example, while if the image is e.g. twice as large, it would give us 100%, or three times as large would give us 200%.
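A quick Python sketch of the two division orders, under the reading above that ideal plus wasted equals what was actually loaded. The `safe_divide` helper mimics BigQuery's `SAFE_DIVIDE` (null instead of an error on division by zero), and the pixel pairs are made up for illustration, not real HTTP Archive data.

```python
def safe_divide(numerator, denominator):
    """Mimic BigQuery's SAFE_DIVIDE: return None instead of failing on zero."""
    if numerator is None or denominator is None or denominator == 0:
        return None
    return numerator / denominator


# Hypothetical (ideal, wasted) pixel pairs.
images = [
    (1000, 0),     # the ideal source was loaded, nothing wasted
    (1000, 100),   # loaded image is slightly too large
    (1000, 1000),  # loaded image is twice as large as the ideal one
    (1000, 2000),  # loaded image is three times as large as the ideal one
]

# Order used in the first attempt: ideal / wasted.
ideal_over_wasted = [safe_divide(ideal, wasted) for ideal, wasted in images]

# Order proposed here: wasted / ideal.
wasted_over_ideal = [safe_divide(wasted, ideal) for ideal, wasted in images]

print(ideal_over_wasted)
print(wasted_over_ideal)
```

With `ideal / wasted`, images with zero waste become null and larger waste produces smaller numbers. With `wasted / ideal`, zero waste stays `0` and the value grows with the waste, matching the 100% and 200% cases described above.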
My understanding is that in both the cases of estimated wasted pixels and wasted bytes, `idealSizesSelectedResource...` represents the resource (from `srcset`) that would have been selected if the `sizes` value was accurate. Similarly, `actualSizesEstimatedWasted...` represents the difference between the ideal size and the size of the selected source. So a 0 reported for `actualSizesEstimatedWasted...` means that even though the `sizes` attribute was inaccurate, there wasn't a better source available in the `srcset`.
That said, I think you're probably right that we would want to divide the estimated wasted bytes by the ideal selected to get a relative value of the wasted value.
Can you clarify what you mean by that? Do you mean the number of pages that fall into each percentile? IMO this would be difficult to intertwine with this query. FWIW we usually need multiple queries to answer all the questions we have. When it comes to measuring the opportunity in number of pages or WordPress sites, I think a separate query would be more helpful, maybe in a separate PR. We could write a query that groups the images by page and then gets e.g. the median wasted pixels or bytes, absolute or relative. And then overall get a distribution of that data, which would give us data like "x% of WordPress sites have a median
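A rough Python sketch of that per-page idea, to show the shape of the aggregation rather than the eventual SQL. The rows, the example URLs, and the `threshold` value are all hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows, one per image: (page URL, relative wasted bytes).
rows = [
    ("https://a.example/", 0.0),
    ("https://a.example/", 0.5),
    ("https://a.example/", 1.0),
    ("https://b.example/", 0.0),
    ("https://b.example/", 0.0),
    ("https://c.example/", 2.0),
]

# Step 1: group the images by page.
by_page = defaultdict(list)
for page, waste in rows:
    by_page[page].append(waste)

# Step 2: reduce each page to its median relative waste.
page_medians = {page: median(values) for page, values in by_page.items()}

# Step 3: look at the distribution across pages, here as the share of
# pages whose median is above a hypothetical threshold.
threshold = 0.25
share = sum(m > threshold for m in page_medians.values()) / len(page_medians)

print(page_medians)
print(f"{share:.0%} of pages have a median relative waste above {threshold}")
```

Because each page contributes exactly one number, pages with many images no longer dominate the result, which is what a per-site opportunity estimate needs.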
@joemcgill One additional technical recommendation at a higher level: While we probably know JS better than BigQuery SQL, using JS in BigQuery is problematic for two reasons:
- The queries are a lot slower as the BigQuery parsers and runners cannot optimize JS. It's like a black box to them.
- For the same reason, error reporting is poor, which makes errors in the JS hard to deal with. For instance, BigQuery doesn't know whether the data types are correct, so failures are only found upon running the query, and you don't get a better message than that the JS failed, without any details.
We can replace the entire `getImgSizesAccuracy(custom_metrics)` function with native BigQuery SQL code to make the queries faster and improve DX, as any mistake you make would be reported in the BigQuery console (e.g. in the Google Cloud Project) as you're writing the query, not only when you run it.
I have rewritten the function, in the way it currently is, here. You can replace this 1:1 (only update the function name elsewhere, as I've made it follow the SQL best practice of the capitalized name):
CREATE TEMPORARY FUNCTION GET_IMG_SIZES_ACCURACY(custom_metrics STRING) RETURNS
ARRAY<STRUCT<sizesAbsoluteError INT64,
sizesRelativeError FLOAT64,
idealSizesSelectedResourceEstimatedPixels INT64,
actualSizesEstimatedWastedLoadedPixels INT64,
idealSizesSelectedResourceEstimatedBytes FLOAT64,
actualSizesEstimatedWastedLoadedBytes FLOAT64>>
AS (
ARRAY(
SELECT AS STRUCT
CAST(JSON_EXTRACT_SCALAR(image, '$.sizesAbsoluteError') AS INT64) AS sizesAbsoluteError,
CAST(JSON_EXTRACT_SCALAR(image, '$.sizesRelativeError') AS FLOAT64) AS sizesRelativeError,
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64) AS idealSizesSelectedResourceEstimatedPixels,
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64) AS actualSizesEstimatedWastedLoadedPixels,
CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64) AS idealSizesSelectedResourceEstimatedBytes,
CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64) AS actualSizesEstimatedWastedLoadedBytes
FROM
UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.responsive_images.responsive-images')) AS image
)
);
I'd recommend that you do that, and afterwards incorporate any applicable suggestions from my previous review, which should be straightforward with this "template".
To explain the code a little:
- `JSON_EXTRACT_ARRAY()` extracts JSON data into an array. Every single array entry is a JSON string itself, which is needed because BigQuery can't just "guess" the type of data in it.
- We `UNNEST()` so that you effectively query the images almost like a table of image objects.
- Since in our case we know the inner JSON of each item contains an object, we then call `JSON_EXTRACT_SCALAR()` on every field of the inner item that we need.
- Last but not least we have to then cast them to the correct data type, as `JSON_EXTRACT_SCALAR()` returns everything as a string.
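For anyone more at home outside SQL, the four steps can be mirrored in a short Python sketch. This is only an analogy under assumptions: the sample payload is invented, `extract_scalar` and `get_img_sizes_accuracy` are hypothetical helpers, and only a few of the fields are included.

```python
import json

# Field name -> cast to apply, mirroring the CAST(... AS type) calls.
FIELDS = {
    "sizesAbsoluteError": float,
    "sizesRelativeError": float,
    "idealSizesSelectedResourceEstimatedPixels": int,
    "actualSizesEstimatedWastedLoadedPixels": int,
}


def extract_scalar(image_json, field):
    """Like JSON_EXTRACT_SCALAR: return the field as a string, or None."""
    value = json.loads(image_json).get(field)
    return None if value is None else str(value)


def get_img_sizes_accuracy(custom_metrics):
    """Return one dict of typed fields per image in the payload."""
    parsed = json.loads(custom_metrics)
    items = parsed.get("responsive_images", {}).get("responsive-images", [])
    # Step 1 (JSON_EXTRACT_ARRAY): every array entry stays a JSON string.
    image_jsons = [json.dumps(item) for item in items]
    rows = []
    # Step 2 (UNNEST): iterate over the entries like rows of a table.
    for image_json in image_jsons:
        row = {}
        for field, cast in FIELDS.items():
            # Step 3 (JSON_EXTRACT_SCALAR): pull one field out as a string.
            raw = extract_scalar(image_json, field)
            # Step 4 (CAST): convert the string to the wanted type.
            row[field] = None if raw is None else cast(raw)
        rows.append(row)
    return rows


# Invented payload with a single image, for illustration only.
sample = json.dumps({
    "responsive_images": {
        "responsive-images": [{
            "sizesAbsoluteError": 120,
            "sizesRelativeError": 0.25,
            "idealSizesSelectedResourceEstimatedPixels": 40000,
            "actualSizesEstimatedWastedLoadedPixels": 10000,
        }]
    }
})

rows = get_img_sizes_accuracy(sample)
print(rows)
```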
Thanks @felixarntz. I think I've addressed most of the feedback except for adding back relative values, which I've asked about here. It's definitely challenging to get this right without being able to run these queries to see how the data is returned. I appreciate your help with this.
One other question that I had with the updated SQL, is that the use of JSON_EXTRACT_SCALAR
Potentially, though I don't personally feel strongly about that. I have to admit I haven't used the recommendations myself as I only found out a few days ago myself that the
Co-authored-by: Felix Arntz <[email protected]>
Thanks @felixarntz. I've replied to both of your comments inline.
@joemcgill This looks great, just one tiny problem. I ran the query and will put the results into the PR description now.
percentile,
client
Potentially a good idea to reverse this. I'm not sure whether in this particular situation it's a good idea to have the `mobile` and `desktop` results always next to each other, but usually we're looking at them as two independent lenses on the same data, so `client` is typically best to use first in `ORDER BY`.
Values can occasionally be a decimal Co-authored-by: Felix Arntz <[email protected]>
Co-authored-by: Felix Arntz <[email protected]>
I've made the changes you've recommended. Thanks for testing this out.
Nice!
For reference, the diff between the old query and the new one:
--- old.sql 2024-04-16 10:41:31.794750809 -0700
+++ new.sql 2024-04-16 10:41:24.940750600 -0700
@@ -1,6 +1,6 @@
# HTTP Archive query to measure impact of inaccurate sizes attributes per <img> for WordPress sites.
#
-# WPP Research, Copyright 2022 Google LLC
+# WPP Research, Copyright 2024 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -13,81 +13,90 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
+#
+# See https://github.com/GoogleChromeLabs/wpp-research/pull/108
+
+DECLARE DATE_TO_QUERY DATE DEFAULT '2024-03-01';
-# See query results here: https://github.com/GoogleChromeLabs/wpp-research/pull/19
-CREATE TEMPORARY FUNCTION
- getSrcsetSizesAccuracy(payload STRING)
- RETURNS ARRAY<STRUCT<sizesAbsoluteError INT64,
+CREATE TEMPORARY FUNCTION GET_IMG_SIZES_ACCURACY(custom_metrics STRING) RETURNS
+ ARRAY<STRUCT<hasSrcset BOOL,
+ hasSizes BOOL,
+ sizesAbsoluteError FLOAT64,
sizesRelativeError FLOAT64,
- wDescriptorAbsoluteError INT64,
- wDescriptorRelativeError FLOAT64,
+ idealSizesSelectedResourceEstimatedPixels INT64,
actualSizesEstimatedWastedLoadedPixels INT64,
+ relativeSizesEstimatedWastedLoadedPixels FLOAT64,
+ idealSizesSelectedResourceEstimatedBytes FLOAT64,
actualSizesEstimatedWastedLoadedBytes FLOAT64,
- wastedLoadedPercent FLOAT64>>
- LANGUAGE js AS '''
-try {
- var $ = JSON.parse(payload);
- var responsiveImages = JSON.parse($._responsive_images);
- responsiveImages = responsiveImages['responsive-images'];
- return responsiveImages.map(({
- sizesAbsoluteError,
- sizesRelativeError,
- wDescriptorAbsoluteError,
- wDescriptorRelativeError,
- idealSizesSelectedResourceEstimatedPixels,
- actualSizesEstimatedWastedLoadedPixels,
- actualSizesEstimatedWastedLoadedBytes
- }) => {
- let wastedLoadedPercent;
- if ( idealSizesSelectedResourceEstimatedPixels > 0 ) {
- wastedLoadedPercent = actualSizesEstimatedWastedLoadedPixels / idealSizesSelectedResourceEstimatedPixels;
- } else {
- wastedLoadedPercent = null;
- }
- return {
- sizesAbsoluteError,
- sizesRelativeError,
- wDescriptorAbsoluteError,
- wDescriptorRelativeError,
- actualSizesEstimatedWastedLoadedPixels,
- actualSizesEstimatedWastedLoadedBytes,
- wastedLoadedPercent
- };
- }
+ relativeSizesEstimatedWastedLoadedBytes FLOAT64>>
+AS (
+ ARRAY(
+ SELECT AS STRUCT
+ CAST(JSON_EXTRACT_SCALAR(image, '$.hasSrcset') AS BOOL) AS hasSrcset,
+ CAST(JSON_EXTRACT_SCALAR(image, '$.hasSizes') AS BOOL) AS hasSizes,
+ CAST(JSON_EXTRACT_SCALAR(image, '$.sizesAbsoluteError') AS FLOAT64) AS sizesAbsoluteError,
+ CAST(JSON_EXTRACT_SCALAR(image, '$.sizesRelativeError') AS FLOAT64) AS sizesRelativeError,
+ CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64) AS idealSizesSelectedResourceEstimatedPixels,
+ CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64) AS actualSizesEstimatedWastedLoadedPixels,
+ SAFE_DIVIDE(
+ CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedPixels') AS INT64),
+ CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedPixels') AS INT64)
+ ) AS relativeSizesEstimatedWastedLoadedPixels,
+ CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64) AS idealSizesSelectedResourceEstimatedBytes,
+ CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64) AS actualSizesEstimatedWastedLoadedBytes,
+ SAFE_DIVIDE(
+ CAST(JSON_EXTRACT_SCALAR(image, '$.actualSizesEstimatedWastedLoadedBytes') AS FLOAT64),
+ CAST(JSON_EXTRACT_SCALAR(image, '$.idealSizesSelectedResourceEstimatedBytes') AS FLOAT64)
+ ) AS relativeSizesEstimatedWastedLoadedBytes,
+ FROM
+ UNNEST(JSON_EXTRACT_ARRAY(custom_metrics, '$.responsive_images.responsive-images')) AS image
+ )
+);
+
+CREATE TEMPORARY FUNCTION IS_CMS(technologies ARRAY<STRUCT<technology STRING, categories ARRAY<STRING>, info ARRAY<STRING>>>, cms STRING, version STRING) RETURNS BOOL AS (
+ EXISTS(
+ SELECT * FROM UNNEST(technologies) AS technology, UNNEST(technology.info) AS info
+ WHERE technology.technology = cms
+ AND (
+ version = ""
+ OR ENDS_WITH(version, ".x") AND (STARTS_WITH(info, RTRIM(version, "x")) OR info = RTRIM(version, ".x"))
+ OR info = version
+ )
+ )
);
-} catch (e) {
- return [];
-}
-''';
+
+WITH wordpressSizesData AS (
SELECT
- percentile,
client,
- APPROX_QUANTILES(image.sizesAbsoluteError, 1000)[OFFSET(percentile * 10)] AS sizesAbsoluteError,
- APPROX_QUANTILES(image.sizesRelativeError, 1000)[OFFSET(percentile * 10)] AS sizesRelativeError,
- APPROX_QUANTILES(image.wDescriptorAbsoluteError, 1000)[OFFSET(percentile * 10)] AS wDescriptorAbsoluteError,
- APPROX_QUANTILES(image.wDescriptorRelativeError, 1000)[OFFSET(percentile * 10)] AS wDescriptorRelativeError,
- APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedPixels, 1000)[OFFSET(percentile * 10)] AS actualSizesEstimatedWastedLoadedPixels,
- APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedBytes, 1000)[OFFSET(percentile * 10)] AS actualSizesEstimatedWastedLoadedBytes,
- APPROX_QUANTILES(image.wastedLoadedPercent, 1000)[OFFSET(percentile * 10)] AS wastedLoadedPercent
-FROM (
- SELECT
- tpages._TABLE_SUFFIX AS client,
image
FROM
- `httparchive.pages.2022_10_01_*` AS tpages,
- UNNEST(getSrcsetSizesAccuracy(payload)) AS image
- JOIN
- `httparchive.technologies.2022_10_01_*` AS tech
- ON
- tech.url = tpages.url
+ `httparchive.all.pages`,
+ UNNEST(GET_IMG_SIZES_ACCURACY(custom_metrics)) AS image
WHERE
- tpages._TABLE_SUFFIX = tech._TABLE_SUFFIX
- AND app = 'WordPress'
- AND category = 'CMS' ),
- UNNEST([10, 25, 50, 75, 90]) AS percentile
+ date = DATE_TO_QUERY
+ AND IS_CMS(technologies, 'WordPress', '')
+ AND is_root_page = TRUE
+ AND image.hasSrcset = TRUE
+ AND image.hasSizes = TRUE
+)
+
+SELECT
+ percentile,
+ client,
+ APPROX_QUANTILES(image.sizesAbsoluteError, 100)[OFFSET(percentile)] AS sizesAbsoluteError,
+ APPROX_QUANTILES(image.sizesRelativeError, 100)[OFFSET(percentile)] AS sizesRelativeError,
+ APPROX_QUANTILES(image.idealSizesSelectedResourceEstimatedPixels, 100)[OFFSET(percentile)] AS idealSizesSelectedResourceEstimatedPixels,
+ APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedPixels, 100)[OFFSET(percentile)] AS actualSizesEstimatedWastedLoadedPixels,
+ APPROX_QUANTILES(image.relativeSizesEstimatedWastedLoadedPixels, 100)[OFFSET(percentile)] AS relativeSizesEstimatedWastedLoadedPixels,
+ APPROX_QUANTILES(image.idealSizesSelectedResourceEstimatedBytes, 100)[OFFSET(percentile)] AS idealSizesSelectedResourceEstimatedBytes,
+ APPROX_QUANTILES(image.actualSizesEstimatedWastedLoadedBytes, 100)[OFFSET(percentile)] AS actualSizesEstimatedWastedLoadedBytes,
+ APPROX_QUANTILES(image.relativeSizesEstimatedWastedLoadedBytes, 100)[OFFSET(percentile)] AS relativeSizesEstimatedWastedLoadedBytes,
+FROM
+ wordpressSizesData,
+ UNNEST([10, 20, 30, 40, 50, 60, 70, 80, 90]) AS percentile
GROUP BY
percentile,
client
ORDER BY
- percentile,
- client
+ client,
+ percentile
Query results