
Is doing pairwise SRM tests ok? #57

Closed
mysticaltech opened this issue Nov 4, 2021 · 2 comments

mysticaltech commented Nov 4, 2021

Hello @lukasvermeer, first of all, thank you so much for your work on this! I have a quick question about calculating SRM: would it be valid to look at control-variant pairs on top of the standard 'global' SRM test?

For example, with 4 variants including the control, we would run the standard global test plus 3 pairwise tests:

  • Global SRM test: control + variant 1 + variant 2 + variant 3 (standard way)
  • Pairwise SRM test: control vs. variant 1
  • Pairwise SRM test: control vs. variant 2
  • Pairwise SRM test: control vs. variant 3

Of course, if the control itself has a problem, all secondary tests would trigger; but if any of variants 1, 2, or 3 has a problem, this would tell us exactly which one.

From a naive analysis of the chi-square goodness-of-fit test, the math seems to work out. But would it really be valid statistically speaking? 🙏
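For illustration, here is a minimal sketch of this scheme using scipy's chi-square goodness-of-fit test; the group names and counts below are made-up placeholders, and equal expected allocation is assumed.

```python
# Sketch of the proposed scheme: one global SRM test over all groups,
# plus a secondary pairwise test for each control-variant pair.
# The counts are made-up placeholders; expected allocation is assumed equal.
from scipy.stats import chisquare

counts = {
    "control":   10120,
    "variant 1":  9950,
    "variant 2": 10030,
    "variant 3":  9210,
}

# Global SRM test (the standard way): all groups at once.
observed = list(counts.values())
total = sum(observed)
expected = [total / len(observed)] * len(observed)
stat, p = chisquare(observed, f_exp=expected)
print(f"global SRM test: p = {p:.5f}")

# Secondary pairwise tests: control vs. each variant, expecting a 50/50 split.
control = counts["control"]
for name in ("variant 1", "variant 2", "variant 3"):
    pair = [control, counts[name]]
    pair_expected = [sum(pair) / 2] * 2
    stat, p = chisquare(pair, f_exp=pair_expected)
    print(f"control vs. {name}: p = {p:.5f}")
```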

@lukasvermeer (Owner)

Hello @mysticaltech. Thanks for your kind words! Happy to hear this tool is useful for you.

The approach you describe is what we were doing before we improved our approach in #16. We briefly discussed how we could help solve your use case (i.e. figuring out what variation might be causing the SRM), but we were unsure how to approach this correctly, so I created a placeholder issue #17 to acknowledge that this is currently an unsolved user need.

I don't think there is necessarily anything wrong with your approach as an avenue to explore what might be the root cause. But as @geoprofi pointed out in our discussion, the cause of the SRM might be such that testing each variation against the control could actually lead us astray. (Emphasis below mine.)

Since we do not know the expected [total] number of samples for sure (potential data loss being one reason, biased allocation or tracking - another), can we really do any better? Since the p-value is computed for the entire table and not any cell in particular and if there is one cell sticking out from the rest, it is easy to see it as the culprit, e.g. :

Observed  Expected proportion
50050     0.25
49950     0.25
50100     0.25
49000     0.25

results in p=0.000973. The last row is at first glance an obvious suspect for data loss. We can say it should have had 50,000. However, it could also be that users were allocated with bias towards the other three, so in fact all should have had about 49,775 users each under unbiased allocation. I don't think there is a way around this in a simple SRM check. Untangling the issue should be a task for a deeper investigation.
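For reference, a quick check of the quoted numbers (my own sketch, not part of the quote), assuming the 0.25 expected proportions as given, reproduces the quoted p-value:

```python
# Reproducing the quoted example: four cells, each expected to get 25% of traffic.
from scipy.stats import chisquare

observed = [50050, 49950, 50100, 49000]
total = sum(observed)                 # 199100
expected = [total * 0.25] * 4         # 49775 per cell under unbiased allocation
stat, p = chisquare(observed, f_exp=expected)
print(stat, p)                        # chi-square ~ 16.32, p ~ 0.000973
```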

Does that answer your question?

@mysticaltech (Author)

It definitely does, thank you for that! I conclude that the problem is more subtle than it looks at first sight, but that there is no great harm in performing secondary pairwise tests on top of the main one, as they only add a small risk of leading the experimenter astray.
