
Is doing pairwise SRM tests ok? #57

Closed
mysticaltech opened this issue Nov 4, 2021 · 2 comments

mysticaltech commented Nov 4, 2021

Hello @lukasvermeer, first of all, thank you so much for your work on this! I have a quick question about calculating SRM: would it be valid to look at control-variant pairs on top of the standard 'global' SRM test?

For example, with 4 variants including the control, we would run the standard global test plus 3 pairwise tests:

  • Global SRM test: control + variant 1 + variant 2 + variant 3 (standard way)
  • Pairwise SRM test: control vs. variant 1
  • Pairwise SRM test: control vs. variant 2
  • Pairwise SRM test: control vs. variant 3

Of course, if the control itself has a problem, all secondary tests would trigger; but if any of variants 1, 2, or 3 has a problem, this would tell us exactly which one.

From a naive analysis of the chi-square goodness-of-fit test, the math seems to work out. But would it really be valid statistically speaking? 🙏
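For illustration, here is a minimal sketch of this scheme using scipy's chi-square goodness-of-fit test; the group names and counts below are made-up placeholders, and equal expected allocation is assumed.

```python
# Sketch of the proposed scheme: one global SRM test over all groups,
# plus a secondary pairwise test for each control-variant pair.
# The counts are made-up placeholders; expected allocation is assumed equal.
from scipy.stats import chisquare

counts = {
    "control":   10120,
    "variant 1":  9950,
    "variant 2": 10030,
    "variant 3":  9210,
}

# Global SRM test (the standard way): all groups at once.
observed = list(counts.values())
total = sum(observed)
expected = [total / len(observed)] * len(observed)
stat, p = chisquare(observed, f_exp=expected)
print(f"global SRM test: p = {p:.5f}")

# Secondary pairwise tests: control vs. each variant, expecting a 50/50 split.
control = counts["control"]
for name in ("variant 1", "variant 2", "variant 3"):
    pair = [control, counts[name]]
    pair_expected = [sum(pair) / 2] * 2
    stat, p = chisquare(pair, f_exp=pair_expected)
    print(f"control vs. {name}: p = {p:.5f}")
```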

@lukasvermeer (Owner)

Hello @mysticaltech. Thanks for your kind words! Happy to hear this tool is useful for you.

The approach you describe is what we were doing before we improved our approach in #16. We briefly discussed how we could help solve your use case (i.e. figuring out what variation might be causing the SRM), but we were unsure how to approach this correctly, so I created a placeholder issue #17 to acknowledge that this is currently an unsolved user need.

I don't think there is necessarily anything wrong with your approach as an avenue to explore what might be the root cause. But as @geoprofi pointed out in our discussion, the cause of the SRM might be such that testing each variation against the control could actually lead us astray. (Emphasis below mine.)

Since we do not know the expected [total] number of samples for sure (potential data loss being one reason, biased allocation or tracking - another), can we really do any better? Since the p-value is computed for the entire table and not any cell in particular and if there is one cell sticking out from the rest, it is easy to see it as the culprit, e.g. :

Observed  Expected proportion
50050     0.25
49950     0.25
50100     0.25
49000     0.25

results in p=0.000973. The last row is at first glance an obvious suspect for data loss. We can say it should have had 50,000. However, it could also be that users were allocated with bias towards the other three, so in fact all should have had about 49,775 users each under unbiased allocation. I don't think there is a way around this in a simple SRM check. Untangling the issue should be a task for a deeper investigation.
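For reference, a quick check of the quoted numbers (my own sketch, not part of the quote), assuming the 0.25 expected proportions as given, reproduces the quoted p-value:

```python
# Reproducing the quoted example: four cells, each expected to get 25% of traffic.
from scipy.stats import chisquare

observed = [50050, 49950, 50100, 49000]
total = sum(observed)                 # 199100
expected = [total * 0.25] * 4         # 49775 per cell under unbiased allocation
stat, p = chisquare(observed, f_exp=expected)
print(stat, p)                        # chi-square ~ 16.32, p ~ 0.000973
```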

Does that answer your question?

@mysticaltech (Author)

It definitely does, thank you for that! I conclude that the problem is more subtle than it looks at first sight, but that there is no great harm in performing secondary pairwise tests on top of the main one, as they only add a small risk of leading the experimenter astray.
