Metric selection based on benchmark #247
Since I'm already writing: there was also the idea of combining multiple metrics into a single score.
And to simply show what I mean, the table I'm currently working on looks like this. A few things were getting on my nerves: papers use PSNRY, so PSNR on the Y channel, but I don't know why in the paper it is written as 'psnr' and not 'psnry'. Also, Urban100 and other official sets have all sorts of odd image dimensions, which actually leads to a difference depending on whether they are bicubic-downscaled with MATLAB or with Pillow. The more I tried to recreate reported paper metrics (which I could not), the more I got frustrated with the current situation, so I made my own test set with normalized image dimensions that actually make sense, like using 480x480px for HR so it can be downscaled to x2, x3 and x4 without anything funky going on. For example, HR image 004 from Urban100 is 1024x681px, which is divisible by neither 2 nor 4, so it then depends on the downsampling code how that is handled, hence the differences between MATLAB-bicubic and Pillow-bicubic downscales. I will run the released models on it and calculate metric scores so it's all fair and square and can actually be compared. (PS: I'm using psnry and ssim here even though I did not list them in my first comment, because SISR papers still always use them, so I kept them for legacy reasons so to say, even though they are older and, according to your fr_benchmark_results, long overtaken/deprecated by other metrics.)
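For illustration, a minimal Pillow sketch (file names are hypothetical) of the normalization idea: center-crop the HR image so both sides are divisible by 12 (the lcm of 2, 3 and 4), then bicubic-downscale, so the LR size is unambiguous regardless of the resize backend:

```python
from PIL import Image

def crop_and_downscale(hr_path, scale, multiple=12):
    """Center-crop the HR image so both sides are divisible by lcm(2, 3, 4) = 12,
    then downscale by `scale` with Pillow's bicubic resampling."""
    hr = Image.open(hr_path).convert("RGB")
    w, h = hr.size
    new_w, new_h = (w // multiple) * multiple, (h // multiple) * multiple
    left, top = (w - new_w) // 2, (h - new_h) // 2
    hr = hr.crop((left, top, left + new_w, top + new_h))
    lr = hr.resize((new_w // scale, new_h // scale), Image.BICUBIC)
    return hr, lr

# hypothetical usage
# hr, lr = crop_and_downscale("urban100/img_004.png", scale=4)
```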
If I am allowed to enter the discussion, because I am encountering a similar problem. The problem is to select a model as the entry point for fine-tuning on SVHS-simulated images. So I upscaled example target images using some fine-tuned checkpoints (actually every intermediate checkpoint, to learn more about the evolution of a model) as well as well-known upscale models. The dataset is already the one that will be used for the SVHS simulation (= degradation). But which formula should be used to select the best model? Visual inspection can be done at the end, but we are talking about ~72 reference images and >750 checkpoints (4 fine-tuned models, but a lot of intermediate checkpoints), which results in >54k upscaled example images. So visual inspection is not the start, but the end. I need a pre-selection based on a clear formula for how to combine scores. I thought about using a weighted linear sum, starting with the BHI filtering from @Phhofm, inverting the image-selection process and applying it to model selection:
which is of course not an empirically based formula, but a start. No need to stick to that; a better formula for this purpose is welcome. However, I think it is better to use a few selected, weighted scores instead of just summing them all up (even if one could point out that error theory would claim "all in one" rules out errors in the long run, that is very time-expensive and inefficient). If one takes a simple look at how scores relate to each other (Pearson r correlation), an empirical example:
which is something that should be done for more scores across standardized datasets/models. The example above is across models on one dataset. It is distorted because the intermediate checkpoints (which are similar to each other) have too much influence compared to the other, quite different models. I will redo that later with the final checkpoints from each fine-tuned model, which then creates a non-distorted dataset for analysis. But the idea should be clear. Much better would be to have certain application-based score bundles that can be applied in a standardized way for selected and clearly defined purposes. A purpose can be - goal: model selection -
etc. Of course papers list certain scores, but this is not necessarily comparable across papers, and it is still not clear enough regarding purposes like the above, especially across different datasets. The scale factor could be investigated as well: is the formula (= combination of scores) stable over 2x, 4x, 4x-downscaled-to-2x, etc.? And a formula should at least be driven by theoretical assumptions and tested empirically; even in times of AI/ML my own preference is to have some meaning behind the selection of scores. AI/ML models themselves have certain difficulties with theory-driven approaches. In sum, with the sheer number of image quality scores it is really difficult to select the ones best suited to one's purpose. With time and GPUs one could create an experimental study to investigate this.
Just listing scores in papers is definitely not enough to understand the context clearly. I want to know whether scores measure something similar or not, and this must be based on empirical findings, not on single case reports. Correlations can nowadays easily be calculated via Bayesian statistics, so one can avoid the p-value/significance discussion and instead report uncertainty estimates, i.e. HDIs/CIs, which makes their meaning more robust. Further analyses can follow anyway... Regarding the initial thread question of @Phhofm, this means that IF one knows the relationships between IQA scores, it becomes much easier to select those that achieve the purpose one has in mind. This sounds better than relying on the "old habit" of using legacy scores, UNLESS they still prove to be valid enough (for a specific purpose, etc.). A good procedure using standardized elements would ensure that new scores can easily be added by just repeating the statistical analyses, which doesn't take much time and can be scripted anyway. IF the relationships between scores are unclear, it becomes difficult to combine them.
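As a concrete (non-Bayesian) sketch of reporting correlations with uncertainty estimates instead of p-values, assuming the per-image scores sit in a CSV with one column per metric (file and column names are placeholders):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# hypothetical CSV: one row per upscaled image, one column per IQA metric
df = pd.read_csv("scores.csv")

def bootstrap_spearman(x, y, n_boot=5000, ci=0.95, seed=0):
    """Spearman correlation with a percentile bootstrap interval
    (a simple frequentist stand-in for a Bayesian HDI)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    r_obs = spearmanr(x, y)[0]
    boots = []
    for _ in range(n_boot):
        i = rng.choice(len(x), size=len(x), replace=True)  # resample image indices
        boots.append(spearmanr(x[i], y[i])[0])
    lo, hi = np.percentile(boots, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return r_obs, (lo, hi)

# column names are placeholders
r, (lo, hi) = bootstrap_spearman(df["lpips"], df["dists"])
print(f"Spearman r = {r:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```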
Thank you for your in-depth discussions and ideas! They are really valuable for exploring the complex landscape of IQA metrics. @Phhofm, I encountered similar metric-related challenges during my studies on SISR tasks. These issues motivated me to delve into IQA research, though I quickly realized the field is far more intricate than I initially anticipated. The primary goal of this repository is to provide a platform that fosters the progression of IQA research while offering user-friendly tools for downstream tasks.

**TL;DR**

@Phhofm, here are my short suggestions:
Below, I’ve shared some detailed thoughts based on my experience that might be helpful.

**Selection of FR Metrics**

FR metrics are generally more reliable due to the availability of pristine reference images. From my observations:
Therefore, I recommend using:

**NR Metrics**

IQA without reference images is significantly more complex and lacks straightforward solutions. Combining multiple metrics is likely a good solution, and @abcnorio has provided thoughtful suggestions in this regard. To make progress, we need to:
The main challenge lies in creating golden benchmark datasets that align with various application purposes. Such datasets would help analyze the characteristics of different metrics and their suitability for specific tasks. However, building these datasets is time-consuming and complicated, especially given the diversity of existing IQA datasets, which include images captured with different devices across various time periods. This issue is less pronounced in FR IQA, as the reference images themselves provide a clear standard for comparison.
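For reference, scoring SR outputs with this repository looks roughly like the sketch below; the metric names, paths and device handling are illustrative only, and the exact registered names should be checked against `pyiqa.list_models()` (some metrics download large weights on first use):

```python
import torch
import pyiqa

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# FR metrics need the super-resolved output plus the pristine reference,
# NR metrics only need the output
fr_metrics = {name: pyiqa.create_metric(name, device=device)
              for name in ["lpips", "dists", "topiq_fr"]}
nr_metrics = {name: pyiqa.create_metric(name, device=device)
              for name in ["musiq", "qalign"]}

sr_path, ref_path = "results/img_001_x4.png", "gt/img_001.png"  # placeholder paths

for name, metric in fr_metrics.items():
    score = float(metric(sr_path, ref_path))
    print(name, score, "(lower is better)" if metric.lower_better else "(higher is better)")
for name, metric in nr_metrics.items():
    print(name, float(metric(sr_path)))
```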
Thank you for your valuable insight :) (and for all your effort). I think I've made my selection for my table :) (with psnry, ssim, lpips and qalign_4bit being part of it).
If I can be of help with any (statistical) analyses, please let me know. What is also interesting is not just correlations but using EDA techniques sensu J. W. Tukey and plotting the data in 2D or 3D, which gives a certain impression of the distances between models or whatever one wants to analyze. If one wants to do clustering, the problem of choosing a distance metric + agglomeration algorithm comes into play, which is quite similar to our problem here. So simple plotting based on MDS is sometimes simple, but effective. One can also analyse for prototypes. The prototype here (according to an unpublished paper by H. Oldenbürger) is the representative of a class with the minimal distances to all other members of the class. Etc...
Just FYI, this is how it looks if one compares based on simple correlations:
One can see that there are certain patterns across models. No summary statistics here; for Pearson r one has to do a Fisher z-transformation first because correlations are skewed. But one can do that to get an impression of the ranges. What can be done for a few scores can be done for many more scores; the procedure is identical. And datasets can be huge of course (but this would require upscaling each image... which takes a lot of time, so maybe using selected images is a more efficient approach to limit the GPU time). The few tables below are already based on a total of >54k images. Using more reference images results in a huge amount of images to upscale, to score, and to store.
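For completeness, the Fisher z averaging mentioned above is only a few lines (the correlation values here are made up):

```python
import numpy as np

# made-up pairwise Pearson correlations between checkpoints/models
r = np.array([0.91, 0.85, 0.96, 0.78])

# Fisher z-transform, average in z-space, transform back
r_mean = np.tanh(np.mean(np.arctanh(r)))
print(round(r_mean, 3))
```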
Hm, thought I could show this real quick as well. I started working on a table where I downloaded officially released 4x SISR pretrains, ran inference with chaiNNer on my own dataset, and then scored the outputs with pyiqa. I put things in this repo https://github.com/Phhofm/bhi100-sisr-iqa-metrics and made an interactive table with GitHub Pages that runs on https://phhofm.github.io/bhi100-sisr-iqa-metrics/ But of course one can also create charts out of that data instead of a table, something like this (textual values were rounded for visual clarity, the bars use the full data).
Thank you very much for your efforts and for sharing the data! I conducted a simple analysis using the provided scores and calculated the Spearman rank correlation between all FR metrics. Notably, ... Attached are the data sheet and the code used to generate the figure, created with the assistance of Kimi AI.
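For anyone who wants to redo this on their own scores, a sketch along these lines (the CSV name is a placeholder for the attached data sheet) produces such a correlation heatmap:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# placeholder: one row per image/model, one numeric column per FR metric
df = pd.read_csv("bhi100_fr_scores.csv")
corr = df.select_dtypes("number").corr(method="spearman")

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Spearman rank correlation between FR metrics")
plt.tight_layout()
plt.savefig("fr_spearman_heatmap.png", dpi=150)
```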
Interesting! Some notes:
Doesn't this just mean that those scores share so much overlapping order information that they tend to "measure" very similar/identical characteristics? Based on a lot of images it means they are reliable, but not necessarily representative. Initially, on the level of the Spearman rank correlation, it means they share a lot of their order information with each other, and on the level of the score it means they tend to measure similar characteristics of the images. Whether those characteristics are good indices for measuring image quality is a completely different story and not covered by any correlation value.

With very high inter-correlations there are methods to handle that (not necessary here). E.g. one could perform (on an interval scale) some kind of PCA and use the resulting values for further processing, reducing (here) three scores to one score. Over all scores one could calculate (assumption: interval scale level of the scores) an (orthogonally, i.e. 90-degree, rotated) FA/PCA and see whether there are some factors that explain large parts of the variance. This has dis-/advantages: the advantage is a clear grouping of the information used to determine image quality; the disadvantage is the question of how much information gets lost through such analyses, especially if you want to make a statement about single images (on the level of groups of images it would probably be OK). What is the data quality of all those IQA scores anyway? Aren't they located on the level of an interval scale? Then one does not necessarily need the rank-correlation approach but can work with other analyses as well, i.e. analyses that make more use of the embedded information that gets lost a little when using only order information and neglecting the original scale. When there is some time I will reproduce your analyses with interval-scaled methods. The case of high inter-correlations is surely interesting.
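To illustrate the PCA idea on the three highly correlated FR metrics, a minimal sketch (assuming the scores are already collected in a CSV; note that raw dists is lower-better, so its loading will carry the opposite sign):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("scores.csv")           # placeholder file
cols = ["ahiq", "dists", "topiq_fr"]     # the three highly inter-correlated FR metrics

# standardize, then keep the first principal component as one combined score
X = StandardScaler().fit_transform(df[cols])
pca = PCA(n_components=1)
df["fr_pc1"] = pca.fit_transform(X)[:, 0]

print("explained variance ratio:", pca.explained_variance_ratio_[0])
print("loadings:", dict(zip(cols, pca.components_[0].round(3))))
```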
**1. High Correlations May Reflect Shared Validity, Not Just Redundancy**

The claim that high inter-correlations (>0.9) only indicate shared "order information" overlooks their practical significance. Metrics like ahiq, dists, and topiq_fr are designed to model human perception, so their high inter-correlation likely stems from overlapping design principles (e.g., mimicking visual cortex processing). This redundancy is not accidental; it reflects intentional alignment with perceptual goals. If they also correlate strongly with human judgments, their convergence suggests reliability and validity. If multiple metrics agree strongly and align with human judgments, their redundancy might be a strength, not a weakness. Unique metrics with weaker correlations are the ones needing scrutiny.

**2. PCA/FA Risks Oversimplification**

While PCA/factor analysis can reduce redundancy, it may discard unique variance that matters for specific distortions. For example, ahiq might excel at quantifying aliasing, while dists handles contrast shifts (nuances lost in a unified score). Retaining individual metrics preserves actionable insights, especially in research or diagnostic contexts.

**3. Rank-Based Methods Are Better Suited to IQA**

IQA metrics and human ratings are inherently ordinal, making Spearman correlation more appropriate than Pearson. Treating scores as interval-scaled imposes assumptions that IQA data rarely satisfies. Rank-based methods are not "inferior"; they align better with the domain's nature.
Thank you for all the thoughts and inputs and figures and this discussion. Just a question in this regard: if ahiq, dists and topiq_fr are a good mix, can we unify them into a single score? The simplest option would be to just take the average. But maybe more interesting would be a linear penalty that penalizes a single low-scoring metric (so if one of these metrics gives a low score, the combined score is penalized a bit more, since all three of them are important), while not penalizing as hard as using the min function as the penalty, because penalizing with the min function seems a bit extreme. The value range of this unified score would still be 0 to 1 this way (one could call it ahdito or something, the first two letters of each). For example, ahiq 0.8, dists 0.2 and topiq_fr 0.9 with the linear penalty would result in a 0.75 score; with only the average it would be 0.8333, and with the min-function penalty 0.6667. These metric scores are fictional; I simply wanted to show that it penalizes a bit, but not too much. It's just some thoughts and ideas.
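The formulas themselves were not preserved here, but one consistent reading that reproduces the quoted example numbers exactly is to flip dists (which is lower-better) to 1 - dists and then scale the mean by a penalty derived from the minimum. The sketch below is a reconstruction under that assumption, not necessarily the original definition:

```python
def ahdito(ahiq, dists, topiq_fr, mode="linear"):
    """Combine ahiq, dists and topiq_fr into one 0..1 score ("ahdito").
    dists is lower-better, so it is flipped to 1 - dists first.
    The penalty forms below are one reading that reproduces the example
    numbers quoted above; the original formulas were not preserved here."""
    scores = [ahiq, 1.0 - dists, topiq_fr]      # all higher-better, in 0..1
    mean, low = sum(scores) / len(scores), min(scores)
    if mode == "average":
        return mean                              # plain average
    if mode == "min":
        return mean * low                        # harsher: the weakest metric scales the mean
    return mean * (1.0 + low) / 2.0              # "linear" penalty: milder pull toward the min

print(round(ahdito(0.8, 0.2, 0.9, "average"), 4))  # 0.8333
print(round(ahdito(0.8, 0.2, 0.9, "linear"), 4))   # 0.75
print(round(ahdito(0.8, 0.2, 0.9, "min"), 4))      # 0.6667
```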
What's the aim of the penalty here? To create more distance, like some anti-log (but linear), i.e. to substantially lower those overall scores for which at least one score is very low? Why not use a weighted sum based on the "average" inter-correlations to calculate the overall score? This would be a more balanced, less penalty-driven approach. It would emphasize high correlations between single scores. However, if more scores are combined into a single score it can be that
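A sketch of that weighted-sum idea (assuming all metric columns are numeric and already flipped so higher is better; file name is a placeholder): weight each metric by its average correlation with the others, then normalize the weights:

```python
import pandas as pd

# placeholder file: one row per image/model, one column per metric (higher = better)
scores = pd.read_csv("scores.csv").select_dtypes("number")

corr = scores.corr(method="spearman").to_numpy()
n = corr.shape[0]

# average correlation of each metric with all the others (diagonal excluded)
avg_corr = (corr.sum(axis=1) - 1.0) / (n - 1)
weights = avg_corr / avg_corr.sum()          # normalize weights to sum to 1

combined = scores.to_numpy() @ weights       # weighted sum per image/model
print(dict(zip(scores.columns, weights.round(3))))
```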
Hey :) I have the following question:
If I'd like to make a smaller selection of IQA models for testing SISR model outputs, then looking at your benchmarks it seems to me that
for FR the selection, according to your results/rankings there, could be:
And for NR the selection could be
(I might still include the psnry and ssim metrics, simply for legacy reasons, since that's what's often used in SISR papers)
My question is whether that would be a good selection of metrics.
Also, there are more metrics that are not on that benchmark list, like qualiclip (+), fid_mmd, fid_dinov2, compare2score, deepdc, arniqa. My question is kind of how they would fare on that benchmark/ranking, and whether it would be good to include one of these as well in my test.
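As a side note, whether metrics like qualiclip, arniqa, deepdc or compare2score are already implemented can be checked against the installed package; the exact registered names below are guesses and may differ by version:

```python
import pyiqa

available = set(pyiqa.list_models())
for name in ["qualiclip", "qualiclip+", "arniqa", "deepdc", "compare2score"]:
    status = "available" if name in available else "not found under this name"
    print(f"{name:15s} {status}")
```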
Thank you for your input :)