
Conversation

@JulienBeg (Contributor) commented Oct 21, 2025

Here is a preliminary benchmark for non-parametric MI estimation using k-NN.

It is based on the article "Mutual Information between Discrete and Continuous Data Sets" by Brian C. Ross, which adapts KSG to the setting where X is discrete and Y is continuous.

I used cKDTree from scipy.spatial, but we can probably find a better library (e.g. hnswlib, faiss, mlpack, flann).
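For reference, here is a minimal sketch of how the Ross estimator can be implemented with cKDTree. The function name mi_ross, the argument names, and the boundary convention for counting neighbours are illustrative choices, not the exact code of this PR; it assumes every class has at least k+1 samples.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def mi_ross(x, y, k=10):
    """k-NN MI estimate for discrete x and continuous y (Ross 2014), in bits."""
    x = np.asarray(x)
    y = np.asarray(y, dtype=np.float64)
    if y.ndim == 1:
        y = y[:, None]
    n = len(x)
    full_tree = cKDTree(y)
    radius = np.empty(n)
    n_x = np.empty(n)
    for label in np.unique(x):
        idx = np.flatnonzero(x == label)
        n_x[idx] = len(idx)
        # Distance to the k-th neighbour among samples of the same class
        # (k + 1 because the query point itself is returned at distance 0);
        # assumes every class has at least k + 1 samples.
        dist, _ = cKDTree(y[idx]).query(y[idx], k=k + 1)
        radius[idx] = dist[:, -1]
    # m_i: number of samples of any class strictly inside that radius,
    # excluding the point itself (the exact boundary convention may differ
    # slightly from the paper).
    m = np.array([
        len(full_tree.query_ball_point(y[i], radius[i] * (1 - 1e-12))) - 1
        for i in range(n)
    ])
    mi_nats = (digamma(n) - np.mean(digamma(n_x))
               + digamma(k) - np.mean(digamma(np.maximum(m, 1))))
    return max(mi_nats, 0.0) / np.log(2)  # report in bits, clamp at 0
```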

I suggest using an ensemble estimator with multiple values of k and retaining the median to make the estimator more robust.
I am not sure this is a good idea though; maybe we even want to take the max? At least on the benchmark the ensemble method seems better. A sketch of the ensemble idea is below.
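The ensemble would roughly look like this, reusing the hypothetical mi_ross helper sketched above (the list of k values is just an example):

```python
import numpy as np

def mi_ross_ensemble(x, y, ks=(3, 5, 10, 20)):
    # Run the single-k estimator for several k and keep the median.
    estimates = [mi_ross(x, y, k=k) for k in ks]
    return float(np.median(estimates))
```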

We can see that it works pretty well for MI larger than 10^-2 bits but struggles when the MI becomes weaker.

I may have some other ideas to investigate.

One thing to look at is how to choose k, or a list of k values, automatically for the user depending on the number of samples and classes.

[benchmark figure]

cla-bot added the cla-signed label Oct 21, 2025
@JulienBeg (Contributor Author)

[benchmark figure with 5 dummy dimensions]

@JulienBeg (Contributor Author)

Same experiments, but with 5 dummy dimensions of pure Gaussian noise added. The estimator still works but is considerably slower.

@rishubn (Collaborator) commented Oct 22, 2025 via email

@JulienBeg (Contributor Author) commented Oct 22, 2025

Hello @rishubn !

Both estimators are related and can be seen as variants of the original KSG estimator family (https://arxiv.org/pdf/cond-mat/0305641, Alexander Kraskov, Harald Stögbauer, and Peter Grassberger).

The KSG estimator was tailored to compute the mutual information I(X;Y) where X and Y are continuous random variables admitting densities in R^{d_x} and R^{d_y} with respect to the Lebesgue measure.
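For reference, the first KSG estimator reads roughly (in nats):

I(X;Y) ≈ ψ(k) + ψ(N) - < ψ(n_x + 1) + ψ(n_y + 1) >

where ψ is the digamma function, N the number of samples, and n_x(i), n_y(i) count the samples whose x (resp. y) coordinate lies within the distance from sample i to its k-th nearest neighbour in the joint space (max-norm).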

The article I looked at considers the setup where X is discrete and Y is continuous with a density in R^{d_y}, which is the setup in side-channel analysis.
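For comparison, Ross's adaptation reads roughly (in nats):

I(X;Y) ≈ ψ(N) - < ψ(N_x) > + ψ(k) - < ψ(m) >

where N_x(i) is the number of samples sharing the discrete value x_i, and m(i) counts the samples of any class lying within the distance from sample i to its k-th nearest neighbour among the samples with the same discrete value.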

The estimator suggested in the crypto paper (https://eprint.iacr.org/2022/1201) is based on the article https://arxiv.org/pdf/1709.06212, referred to as GKOV, which considers a more generic setting where X and Y are mixtures of continuous and discrete random variables. For instance, X is with probability 1/2 a binomial distribution and with probability 1/2 a normal distribution. It does not mean that one of X or Y is discrete and the other continuous: both of them are a mix of the two cases. Using a distance on both the X and the Y space, they manage to apply locally either the vanilla KSG estimator or the plug-in estimator for discrete random variables.

I would say that GKOV is overkill in the side-channel setup.

In particular, we have to choose distances on the X and Y spaces which are comparable. In our setting it is not clear which metric would make sense for X (for the Y space any p-norm seems reasonable).

In our setting, where X is purely discrete and Y is purely continuous, when there are enough samples there are more than k+1 collisions for each value of X and the distance on X does not matter anymore. In this case GKOV falls back to the estimator I implemented.

Another way to recover the estimator I implemented is to choose a "distance" on X defined by d_X(x,x') = 0 if x=x' and +\infty otherwise.

When there are not enough samples, we can have fewer than k+1 collisions for a given value of X, and the metric on X then comes into play. Maybe we can use GKOV if we have very few samples and know a reasonable metric for the X space. Typically the Hamming distance could make sense (up to a normalization to make it comparable to the distance on the Y space). But if the leakage depends not on the bits of X but on the bits of SBox(X), it is hard to justify why the Hamming distance would make sense here. To the best of my knowledge this has not been discussed yet and could be interesting. Furthermore, I tend to think that in this small-sample regime the estimator will not be precise enough anyway.

So overall both amount to the same thing in our setting. GKOV is more generic but it is overkill here, so I used Ross's estimator, whose presentation is simpler and which is tailored exactly to the scenario where X is discrete and Y is continuous.

@JulienBeg (Contributor Author) commented Oct 22, 2025

My explanation is not so brief but I hope it is clear =)

@rishubn (Collaborator) commented Oct 22, 2025

Ok, makes sense, thanks for the clear explanation!

@rishubn (Collaborator) commented Oct 22, 2025

And another question, have you experimented with larger word sizes? I saw in your benchmark you tested 8-bits

@JulienBeg (Contributor Author)

> And another question, have you experimented with larger word sizes? I saw in your benchmark you tested 8-bits

I can test that, but a limitation of the algorithm is that we need at least k+1 samples per class. Hence, to estimate the MI between X and Y where X is an n-bit variable, we need at least (k+1) 2^n samples. With n = 16 and k = 10 that is already 10^{5.81...} samples; with n = 32 and k = 10 it is prohibitively large (10^{10.63...} samples).
