Metric selection based on benchmark #247
Since I'm already writing: there was also the idea of combining multiple metrics into a single score.
And to simply show what I mean, the table I'm currently working on looks like this. A few things were getting on my nerves: papers use PSNRY, so PSNR on the Y channel, but I don't know why in the paper it is written as 'psnr' and not 'psnry'. Also, Urban100 and other official sets have all sorts of odd image dimensions, which actually leads to a difference depending on whether they are bicubic-downscaled with MATLAB or with Pillow. The more I tried to recreate reported paper metrics (which I could not), the more I got frustrated with the current situation, so I made my own test set with normalized image dimensions that actually make sense, like using 480x480px for HR so it can be downscaled to x2, x3 and x4 without anything funky going on. For example, HR image 004 from Urban100 is 1024x681px, which is divisible by neither 2 nor 4, so it then depends on the downsampling code how that is handled, hence the differences between MATLAB-bicubic and Pillow-bicubic downscales. I will run the released models on it and calculate metric scores so it's all fair and square and can actually be compared. (PS: I'm using psnry and ssim here even though I did not list them in my first comment, because SISR papers still always use them, so I kept them for legacy reasons so to say, even though they are older and, according to your fr_benchmark_results, long overtaken/deprecated by other metrics.)
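For illustration, a minimal Pillow sketch (file names are hypothetical) of the normalization idea: center-crop the HR image so both sides are divisible by 12 (the lcm of 2, 3 and 4), then bicubic-downscale, so the LR size is unambiguous regardless of the resize backend:

```python
from PIL import Image

def crop_and_downscale(hr_path, scale, multiple=12):
    """Center-crop the HR image so both sides are divisible by lcm(2, 3, 4) = 12,
    then downscale by `scale` with Pillow's bicubic resampling."""
    hr = Image.open(hr_path).convert("RGB")
    w, h = hr.size
    new_w, new_h = (w // multiple) * multiple, (h // multiple) * multiple
    left, top = (w - new_w) // 2, (h - new_h) // 2
    hr = hr.crop((left, top, left + new_w, top + new_h))
    lr = hr.resize((new_w // scale, new_h // scale), Image.BICUBIC)
    return hr, lr

# hypothetical usage
# hr, lr = crop_and_downscale("urban100/img_004.png", scale=4)
```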
If I am allowed to enter the discussion, because I am encountering a similar problem. The problem is to select a model as the entry point for fine-tuning on SVHS-simulated images. So I upscaled example target images using some fine-tuned checkpoints (actually every intermediate checkpoint, to learn more about the evolution of a model) as well as well-known upscale models. The dataset is already the one that will be used for the SVHS simulation (= degradation). But which formula should be used to select the best model? Visual inspection can be done at the end, but we are talking about ~72 reference images and >750 checkpoints (4 fine-tuned models, but a lot of intermediate checkpoints), which results in >54k upscaled example images. So visual inspection is not the start, but the end. I need a pre-selection based on a clear formula for how to combine scores. I thought about using a weighted linear sum, starting with the BHI filtering from @Phhofm, inverting the image-selection process and applying it to model selection:
which is of course not an empirically based formula, but a start. No need to stick to that; a better formula for this purpose is welcome. However, I think it is better to use a few selected, weighted scores instead of just summing them all up (even if one could point out that error theory would claim "all in one" rules out errors in the long run, that is very time-expensive and inefficient). If one takes a simple look at how scores relate to each other (Pearson r correlation), an empirical example:
which is something that should be done for more scores across standardized datasets/models. The example above is across models on one dataset. It is distorted because the intermediate checkpoints (which are similar to each other) have too much influence compared to the other, quite different models. I will redo that later with the final checkpoints from each fine-tuned model, which then creates a non-distorted dataset for analysis. But the idea should be clear. Much better would be to have certain application-based score bundles that can be applied in a standardized way for selected and clearly defined purposes. A purpose can be - goal: model selection -
etc. Of course papers list certain scores, but this is not necessarily comparable across papers, and it is still not clear enough regarding purposes like the above, especially across different datasets. The scale factor could be investigated as well: is the formula (= combination of scores) stable over 2x, 4x, 4x-downscaled-to-2x, etc.? And a formula should at least be driven by theoretical assumptions and tested empirically; even in times of AI/ML my own preference is to have some meaning behind the selection of scores. AI/ML models themselves have certain difficulties with theory-driven approaches. In sum, with the sheer number of image quality scores it is really difficult to select the ones best suited to one's purpose. With time and GPUs one could create an experimental study to investigate this.
Just listing scores in papers is definitely not enough to understand the context clearly. I want to know whether scores measure something similar or not, and this must be based on empirical findings, not on single case reports. Correlations can nowadays easily be calculated via Bayesian statistics, so one can avoid the p-value/significance discussion and instead report uncertainty estimates, i.e. HDIs/CIs, which makes their meaning more robust. Further analyses can follow anyway... Regarding the initial thread question of @Phhofm, this means that IF one knows the relationships between IQA scores, it becomes much easier to select those that achieve the purpose one has in mind. This sounds better than relying on the "old habit" of using legacy scores, UNLESS they still prove to be valid enough (for a specific purpose, etc.). A good procedure using standardized elements would ensure that new scores can easily be added by just repeating the statistical analyses, which doesn't take much time and can be scripted anyway. IF the relationships between scores are unclear, it becomes difficult to combine them.
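As a concrete (non-Bayesian) sketch of reporting correlations with uncertainty estimates instead of p-values, assuming the per-image scores sit in a CSV with one column per metric (file and column names are placeholders):

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# hypothetical CSV: one row per upscaled image, one column per IQA metric
df = pd.read_csv("scores.csv")

def bootstrap_spearman(x, y, n_boot=5000, ci=0.95, seed=0):
    """Spearman correlation with a percentile bootstrap interval
    (a simple frequentist stand-in for a Bayesian HDI)."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    r_obs = spearmanr(x, y)[0]
    boots = []
    for _ in range(n_boot):
        i = rng.choice(len(x), size=len(x), replace=True)  # resample image indices
        boots.append(spearmanr(x[i], y[i])[0])
    lo, hi = np.percentile(boots, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return r_obs, (lo, hi)

# column names are placeholders
r, (lo, hi) = bootstrap_spearman(df["lpips"], df["dists"])
print(f"Spearman r = {r:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```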
Thank you for your in-depth discussions and ideas! They are really valuable for exploring the complex landscape of IQA metrics. @Phhofm, I encountered similar metric-related challenges during my studies on SISR tasks. These issues motivated me to delve into IQA research, though I quickly realized the field is far more intricate than I initially anticipated. The primary goal of this repository is to provide a platform that fosters the progression of IQA research while offering user-friendly tools for downstream tasks.

**TL;DR**

@Phhofm, here are my short suggestions:
Below, I’ve shared some detailed thoughts based on my experience that might be helpful.

**Selection of FR Metrics**

FR metrics are generally more reliable due to the availability of pristine reference images. From my observations:
Therefore, I recommend using:

**NR Metrics**

IQA without reference images is significantly more complex and lacks straightforward solutions. Combining multiple metrics is likely a good solution, and @abcnorio has provided thoughtful suggestions in this regard. To make progress, we need to:
The main challenge lies in creating golden benchmark datasets that align with various application purposes. Such datasets would help analyze the characteristics of different metrics and their suitability for specific tasks. However, building these datasets is time-consuming and complicated, especially given the diversity of existing IQA datasets, which include images captured with different devices across various time periods. This issue is less pronounced in FR IQA, as the reference images themselves provide a clear standard for comparison.
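For reference, scoring SR outputs with this repository looks roughly like the sketch below; the metric names, paths and device handling are illustrative only, and the exact registered names should be checked against `pyiqa.list_models()` (some metrics download large weights on first use):

```python
import torch
import pyiqa

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# FR metrics need the super-resolved output plus the pristine reference,
# NR metrics only need the output
fr_metrics = {name: pyiqa.create_metric(name, device=device)
              for name in ["lpips", "dists", "topiq_fr"]}
nr_metrics = {name: pyiqa.create_metric(name, device=device)
              for name in ["musiq", "qalign"]}

sr_path, ref_path = "results/img_001_x4.png", "gt/img_001.png"  # placeholder paths

for name, metric in fr_metrics.items():
    score = float(metric(sr_path, ref_path))
    print(name, score, "(lower is better)" if metric.lower_better else "(higher is better)")
for name, metric in nr_metrics.items():
    print(name, float(metric(sr_path)))
```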
Thank you for your valuable insight :) (and for all your effort). I think I've made my selection for my table :) (with psnry, ssim, lpips and qalign_4bit being part of it).
If I can be of help with any (statistical) analyses, please let me know. What is also interesting is not just correlations but using EDA techniques sensu J. W. Tukey and plotting the data in 2D or 3D, which gives a certain impression of the distances between models or whatever one wants to analyze. If one wants to do clustering, the problem of choosing a distance metric + agglomeration algorithm comes into play, which is quite similar to our problem here. So simple plotting based on MDS is sometimes simple, but effective. One can also analyse for prototypes. The prototype here (according to an unpublished paper by H. Oldenbürger) is the representative of a class with the minimal distances to all other members of the class. Etc...
Just FYI, this is how it looks if one compares based on simple correlations:
One can see that there are certain patterns across models. No summary statistics here; for Pearson r one has to do a Fisher z-transformation first because correlations are skewed. But one can do that to get an impression of the ranges. What can be done for a few scores can be done for many more scores; the procedure is identical. And datasets can be huge of course (but this would require upscaling each image... which takes a lot of time, so maybe using selected images is a more efficient approach to limit the GPU time). The few tables below are already based on a total of >54k images. Using more reference images results in a huge amount of images to upscale, to score, and to store.
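For completeness, the Fisher z averaging mentioned above is only a few lines (the correlation values here are made up):

```python
import numpy as np

# made-up pairwise Pearson correlations between checkpoints/models
r = np.array([0.91, 0.85, 0.96, 0.78])

# Fisher z-transform, average in z-space, transform back
r_mean = np.tanh(np.mean(np.arctanh(r)))
print(round(r_mean, 3))
```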
Hm, thought I could show this real quick as well. I started working on a table where I downloaded officially released 4x SISR pretrains, ran inference with chaiNNer on my own dataset, and then scored the outputs with pyiqa. I put things in this repo https://github.com/Phhofm/bhi100-sisr-iqa-metrics and made an interactive table with GitHub Pages that runs on https://phhofm.github.io/bhi100-sisr-iqa-metrics/ But of course one can also create charts out of that data instead of a table, something like this (textual values were rounded for visual clarity, the bars use the full data).
Thank you very much for your efforts and for sharing the data! I conducted a simple analysis using the provided scores and calculated the Spearman rank correlation between all FR metrics. Notably, ... Attached are the data sheet and the code used to generate the figure, created with the assistance of Kimi AI.
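For anyone who wants to redo this on their own scores, a sketch along these lines (the CSV name is a placeholder for the attached data sheet) produces such a correlation heatmap:

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# placeholder: one row per image/model, one numeric column per FR metric
df = pd.read_csv("bhi100_fr_scores.csv")
corr = df.select_dtypes("number").corr(method="spearman")

plt.figure(figsize=(8, 6))
sns.heatmap(corr, annot=True, fmt=".2f", vmin=-1, vmax=1, cmap="coolwarm")
plt.title("Spearman rank correlation between FR metrics")
plt.tight_layout()
plt.savefig("fr_spearman_heatmap.png", dpi=150)
```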
Interesting! Some notes:
Doesn't this just mean that those scores share so much overlapping order information that they tend to "measure" very similar/identical characteristics? Based on a lot of images it means they are reliable, but not necessarily representative. Initially, on the level of the Spearman rank correlation, it means they share a lot of their order information with each other, and on the level of the score it means they tend to measure similar characteristics of the images. Whether those characteristics are good indices for measuring image quality is a completely different story and not covered by any correlation value.

With very high inter-correlations there are methods to handle that (not necessary here). E.g. one could perform (on an interval scale) some kind of PCA and use the resulting values for further processing, reducing (here) three scores to one score. Over all scores one could calculate (assumption: interval scale level of the scores) an (orthogonally, i.e. 90-degree, rotated) FA/PCA and see whether there are some factors that explain large parts of the variance. This has dis-/advantages: the advantage is a clear grouping of the information used to determine image quality; the disadvantage is the question of how much information gets lost through such analyses, especially if you want to make a statement about single images (on the level of groups of images it would probably be OK). What is the data quality of all those IQA scores anyway? Aren't they located on the level of an interval scale? Then one does not necessarily need the rank-correlation approach but can work with other analyses as well, i.e. analyses that make more use of the embedded information that gets lost a little when using only order information and neglecting the original scale. When there is some time I will reproduce your analyses with interval-scaled methods. The case of high inter-correlations is surely interesting.
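To illustrate the PCA idea on the three highly correlated FR metrics, a minimal sketch (assuming the scores are already collected in a CSV; note that raw dists is lower-better, so its loading will carry the opposite sign):

```python
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("scores.csv")           # placeholder file
cols = ["ahiq", "dists", "topiq_fr"]     # the three highly inter-correlated FR metrics

# standardize, then keep the first principal component as one combined score
X = StandardScaler().fit_transform(df[cols])
pca = PCA(n_components=1)
df["fr_pc1"] = pca.fit_transform(X)[:, 0]

print("explained variance ratio:", pca.explained_variance_ratio_[0])
print("loadings:", dict(zip(cols, pca.components_[0].round(3))))
```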
**1. High Correlations May Reflect Shared Validity, Not Just Redundancy**

The claim that high inter-correlations (>0.9) only indicate shared "order information" overlooks their practical significance. Metrics like ahiq, dists, and topiq_fr are designed to model human perception, so their high inter-correlation likely stems from overlapping design principles (e.g., mimicking visual cortex processing). This redundancy is not accidental; it reflects intentional alignment with perceptual goals. If they also correlate strongly with human judgments, their convergence suggests reliability and validity. If multiple metrics agree strongly and align with human judgments, their redundancy might be a strength, not a weakness. Unique metrics with weaker correlations are the ones needing scrutiny.

**2. PCA/FA Risks Oversimplification**

While PCA/factor analysis can reduce redundancy, it may discard unique variance that matters for specific distortions. For example, ahiq might excel at quantifying aliasing, while dists handles contrast shifts (nuances lost in a unified score). Retaining individual metrics preserves actionable insights, especially in research or diagnostic contexts.

**3. Rank-Based Methods Are Better Suited to IQA**

IQA metrics and human ratings are inherently ordinal, making Spearman correlation more appropriate than Pearson. Treating scores as interval-scaled imposes assumptions that IQA data rarely satisfies. Rank-based methods are not "inferior"; they align better with the domain's nature.
Thank you for all the thoughts and inputs and figures and this discussion. Just a question in this regard: if ahiq, dists and topiq_fr are a good mix, can we unify them into a single score? The simplest option would be to just take the average. But maybe more interesting would be a linear penalty that penalizes a single low-scoring metric (so if one of these metrics gives a low score, the combined score is penalized a bit more, since all three of them are important), while not penalizing as hard as using the min function as the penalty, because penalizing with the min function seems a bit extreme. The value range of this unified score would still be 0 to 1 this way (one could call it ahdito or something, the first two letters of each). For example, ahiq 0.8, dists 0.2 and topiq_fr 0.9 with the linear penalty would result in a 0.75 score; with only the average it would be 0.8333, and with the min-function penalty 0.6667. These metric scores are fictional; I simply wanted to show that it penalizes a bit, but not too much. It's just some thoughts and ideas.
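The formulas themselves were not preserved here, but one consistent reading that reproduces the quoted example numbers exactly is to flip dists (which is lower-better) to 1 - dists and then scale the mean by a penalty derived from the minimum. The sketch below is a reconstruction under that assumption, not necessarily the original definition:

```python
def ahdito(ahiq, dists, topiq_fr, mode="linear"):
    """Combine ahiq, dists and topiq_fr into one 0..1 score ("ahdito").
    dists is lower-better, so it is flipped to 1 - dists first.
    The penalty forms below are one reading that reproduces the example
    numbers quoted above; the original formulas were not preserved here."""
    scores = [ahiq, 1.0 - dists, topiq_fr]      # all higher-better, in 0..1
    mean, low = sum(scores) / len(scores), min(scores)
    if mode == "average":
        return mean                              # plain average
    if mode == "min":
        return mean * low                        # harsher: the weakest metric scales the mean
    return mean * (1.0 + low) / 2.0              # "linear" penalty: milder pull toward the min

print(round(ahdito(0.8, 0.2, 0.9, "average"), 4))  # 0.8333
print(round(ahdito(0.8, 0.2, 0.9, "linear"), 4))   # 0.75
print(round(ahdito(0.8, 0.2, 0.9, "min"), 4))      # 0.6667
```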
What's the aim of the penalty here? To create more distance, like some anti-log (but linear), i.e. to substantially lower those overall scores for which at least one score is very low? Why not use a weighted sum based on the "average" inter-correlations to calculate the overall score? This would be a more balanced, less penalty-driven approach. It would emphasize high correlations between single scores. However, if more scores are combined into a single score it can be that
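A sketch of that weighted-sum idea (assuming all metric columns are numeric and already flipped so higher is better; file name is a placeholder): weight each metric by its average correlation with the others, then normalize the weights:

```python
import pandas as pd

# placeholder file: one row per image/model, one column per metric (higher = better)
scores = pd.read_csv("scores.csv").select_dtypes("number")

corr = scores.corr(method="spearman").to_numpy()
n = corr.shape[0]

# average correlation of each metric with all the others (diagonal excluded)
avg_corr = (corr.sum(axis=1) - 1.0) / (n - 1)
weights = avg_corr / avg_corr.sum()          # normalize weights to sum to 1

combined = scores.to_numpy() @ weights       # weighted sum per image/model
print(dict(zip(scores.columns, weights.round(3))))
```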
Hey :) I have the following question:
If I'd like to make a smaller selection of IQA models for testing SISR model outputs, then looking at your benchmarks it seems to me that
for FR the selection, according to your results/rankings there, could be:
And for NR the selection could be
(I might still include the psnry and ssim metrics, simply for legacy reasons, since that's what's often used in SISR papers)
My question is whether that would be a good selection of metrics.
Also, there are more metrics that are not on that benchmark list, like qualiclip (+), fid_mmd, fid_dinov2, compare2score, deepdc, arniqa. My question is kind of how they would fare on that benchmark/ranking, and whether it would be good to include one of these as well in my test.
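As a side note, whether metrics like qualiclip, arniqa, deepdc or compare2score are already implemented can be checked against the installed package; the exact registered names below are guesses and may differ by version:

```python
import pyiqa

available = set(pyiqa.list_models())
for name in ["qualiclip", "qualiclip+", "arniqa", "deepdc", "compare2score"]:
    status = "available" if name in available else "not found under this name"
    print(f"{name:15s} {status}")
```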
Thank you for your input :)