Add normalized game pair Elo to stats #2134

vondele · 2024-09-06T06:54:26Z

We're observing some nice results with the normalized game pair Elo (ngpElo) which seemingly is a good way to derive an Elo number that is largely book independent, working with game pairs. See some data (and the formula) here:

https://github.com/official-stockfish/Stockfish/wiki/Useful-data#equivalent-time-odds-and-normalized-game-pair-elo

Would be nice to put it in the stats, and possibly even see if it makes sense to replace nElo with it in fishtest, even though that's a larger undertaking.

The ngpElo concept is from @Naphthalin; he might be able to explain the properties a bit better.

Naphthalin · 2024-09-06T10:44:19Z

I'd obviously be happy if the excruciating math stuff I had to deal with in order to make the WDL Contempt work for Leela and align it to Elo were put to some wider use, especially in regard to the challenges stemming from mixed data sources due to testing with different opening biases and different TC. I will think about whether it makes sense to try replacing nElo in fishtest, as using UHO openings for small expected rating differences already deals with the two main issues of regular Elo, with the only remaining major issue that nElo has wrong assumptions about the draw distribution.

Key properties

The formula 100 * log10( (2*WW+WD+DW) / (2*LL + LD + DL)) has the following nice properties under the model assumptions:

like regular Elo, it can be calculated from game pair stats [hence game pair]
unlike regular Elo, it is invariant from the book exit [hence normalized]
unlike regular Elo, it is additive at top engine level
it aligns with regular Elo from +1.0 exits if opponents are close [hence Elo]
it reproduces the game pair Elo from UHO openings (up to the artificial factor 2 because of half number of games) [hence game pair Elo]
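As a minimal sketch of the formula above, assuming the six game pair counts are available separately: the function names and the pentanomial bucket order `[LL, LD+DL, DD+WL+LW, WD+DW, WW]` are assumptions made for illustration, not something defined in this thread. Note that DD and WL/LW pairs do not enter the formula at all.

```python
import math


def ngp_elo(ww, wd, dw, ld, dl, ll):
    """Normalized game pair Elo: 100 * log10((2*WW+WD+DW) / (2*LL+LD+DL)).

    Arguments are game pair counts seen from the first engine's side.
    """
    wins = 2 * ww + wd + dw
    losses = 2 * ll + ld + dl
    if wins <= 0 or losses <= 0:
        # The log ratio is undefined without at least one decisive pair each way.
        raise ValueError("need decisive game pairs on both sides")
    return 100.0 * math.log10(wins / losses)


def ngp_elo_from_pentanomial(penta):
    """Same quantity from pentanomial counts.

    Assumed (hypothetical) bucket order: [LL, LD+DL, DD+WL+LW, WD+DW, WW].
    The middle bucket is ignored because it does not appear in the formula.
    """
    ll, ld_dl, _, wd_dw, ww = penta
    return ngp_elo(ww=ww, wd=wd_dw, dw=0, ld=ld_dl, dl=0, ll=ll)


if __name__ == "__main__":
    print(ngp_elo(ww=10, wd=100, dw=80, ld=50, dl=40, ll=5))
    print(ngp_elo_from_pentanomial([5, 90, 300, 180, 10]))
```

Both calls describe the same data: the weighted win count is 200 against a weighted loss count of 100, so the result is 100 * log10(2), roughly +30.1.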

Background

The need to redefine Elo for the high levels of modern chess comes from the combination of these issues:

the 3 result nature of chess and the wide draw margin make the expected score over rating difference graph agree with a logistic curve less and less, causing some sort of "superadditivity" (A beats B by +20 Elo, B beats C by +20 Elo might show A beating C by +60 or more) or equivalently "top end compression" (A beats C by +80 Elo, B beats C by +50 Elo might show A beating B by only +10 Elo)
using highly unbalanced openings helps with that, but causes some sort of "long range subadditivity" due to the 75% or +191 Elo performance wall (A beating B with 70% score and B beating C with 70% score from UHO openings might show A beating C with 74% score)
using unbalanced openings is basically a necessity to detect a strength difference signal (see fishtest, TCEC etc), but the book bias directly affects the Elo spread, which means tests with different books can't be compared directly, and reaching an agreement on some sort of "standardization of opening books" for testing is neither realistic nor desirable.
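The "75% or +191 Elo" figure and the subadditivity example can be checked with the standard logistic Elo formula. This is a small sketch; the function name `elo_from_score` is chosen here for illustration.

```python
import math


def elo_from_score(score):
    """Regular (logistic) Elo difference implied by an expected score in (0, 1)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return -400.0 * math.log10(1.0 / score - 1.0)


if __name__ == "__main__":
    # The performance wall: a 75% score is 400 * log10(3), about +190.85 Elo.
    print(elo_from_score(0.75))

    # Subadditivity example from the text: two 70% steps versus one 74% result.
    a_vs_b = elo_from_score(0.70)
    b_vs_c = elo_from_score(0.70)
    a_vs_c = elo_from_score(0.74)
    print(a_vs_b + b_vs_c, a_vs_c)
```

Each 70% step is worth about +147 Elo, so additivity would predict about +294 for A against C, while the 74% score in the example corresponds to only about +182.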

Origin of the formula

The full derivations behind Leela's WDL Contempt etc are beautiful but this margin isn't wide enough ;) The relevant parts however can be summarized as

characterizing playing strength likely needs more than 1 parameter (my model uses 2, basically mean and variance of the expected inaccuracy distribution) to reproduce observed behavior at higher Elo differences, though they're related
performance and Elo curve between opponents of similar strength directly depend on mean/stddev
applying the definition of regular Elo also introduces a term that depends on the book bias x, of the shape (1-x)/stddev, which can be read as "how likely is it to reach a +1.00 position between equal opponents from initial eval x"
using x=1 (the definition of UHO) is the only possible value eliminating this term and thus the indirect effect of overall strength
and finally, calculating the strength difference related quantity mean/stddev while trying to eliminate the book bias x without explicitly knowing it spits out an expression which can be translated into the log game pair ratio with double counting WW and LL. Aligning it with the regular Elo definition from +1.00 openings in the limit of vanishing WW and LL probability yields the 100 * log10 part of the formula.

Relationship with regular Elo

ngpElo is designed for the upper range of playing strength, where WW and LL results are much less frequent than WD and LD. This is approximately the case once the expected draw rate from the regular startpos or from balanced openings exceeds 80%. In the human range this doesn't hold, leading to a factor 1.5-2x discrepancy between Elo differences and ngpElo differences (at 2000 level, +100 regular Elo is equivalent to +50 ngpElo); an approximate conversion can be found in LeelaChessZero/lc0#1941 (comment), which is used for converting regular Elo into ngpElo internally in the Lc0 Contempt implementation.

vdbergh · 2024-09-18T14:50:04Z

I am just seeing this.

"with the only remaining major issue that nElo has wrong assumptions about the draw distribution."

This is incorrect. nElo does not make any assumptions (not about the draw distribution and not about anything else). nElo is simply inversely proportional to the square root of the number of games required to prove that one engine is stronger than another with a given level of significance.

The motivation for expressing bounds in nElo in Fishtest is that in this way the resources consumed by a test are independent of the book.

"unlike regular Elo, it is invariant from the book exit [hence normalized]"

Do I understand correctly that you claim you can prove this for some reasonable model? Is there some write up of this?

vondele · 2024-09-18T17:38:42Z

glad you joined the discussion :-)

I think the neat thing about this proposal is that it is based on game pairs, contrary to nElo. I think using game pairs from the beginning is very important nowadays.

vdbergh · 2024-09-18T17:52:03Z

I don't understand. nElo also uses game pairs (it is computed from the pentanomial frequencies).
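For reference, a rough sketch of how nElo can be obtained from pentanomial frequencies, assuming the usual definition of normalized Elo as (mean score - 0.5) divided by the per-game standard deviation, scaled by 800/ln(10). The function name, the bucket order `[LL, LD+DL, DD+WL+LW, WD+DW, WW]`, and the absence of any regularization of the counts are assumptions of this sketch; it is not a copy of the Fishtest code.

```python
import math


def nelo_from_pentanomial(penta):
    """Normalized Elo from game pair counts.

    Assumed (hypothetical) bucket order: [LL, LD+DL, DD+WL+LW, WD+DW, WW],
    i.e. 0, 0.5, 1, 1.5 and 2 points scored in a game pair.
    """
    n = sum(penta)
    if n == 0:
        raise ValueError("no game pairs")
    # Score of a pair expressed per game, so it lies in [0, 1].
    pair_scores = [0.0, 0.25, 0.5, 0.75, 1.0]
    mu = sum(c * s for c, s in zip(penta, pair_scores)) / n
    var = sum(c * (s - mu) ** 2 for c, s in zip(penta, pair_scores)) / n
    if var == 0.0:
        raise ValueError("variance is zero, nElo is undefined")
    # A pair score averages two games, so the per-game variance is 2 * var.
    sigma_per_game = math.sqrt(2.0 * var)
    return (mu - 0.5) / sigma_per_game * 800.0 / math.log(10.0)


if __name__ == "__main__":
    print(nelo_from_pentanomial([5, 90, 300, 180, 10]))
```

Because the result is a mean divided by a standard deviation, the number of game pairs needed to reach a given significance scales with 1/nElo^2, which matches the description of nElo given earlier in the thread.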

vondele · 2024-09-18T19:49:25Z

ah. So now I don't understand.
