Is doing pairwise SRM tests ok? #57
Hello @mysticaltech. Thanks for your kind words! Happy to hear this tool is useful for you. The approach you describe is what we were doing before we improved our approach in #16. We briefly discussed how we could help solve your use case (i.e. figuring out what variation might be causing the SRM), but we were unsure how to approach this correctly, so I created a placeholder issue #17 to acknowledge that this is currently an unsolved user need. I don't think there is necessarily anything wrong with your approach as an avenue to explore what might be the root cause. But as @geoprofi pointed out in our discussion, the cause of the SRM might be such that testing each variation against the control could actually lead us astray. (Emphasis below mine.)
Does that answer your question?
It definitely does, thank you for that! I conclude that the problem is more subtle than it looks at first sight, but there is no great harm in performing secondary pairwise tests on top of the main one, as they only add a small potential to lead the experimenter astray.
Hello @lukasvermeer, first of all, thank you so much for your work on this! I have a quick question that pertains to calculating SRM. Would it be valid to look at control-variant pairs on top of the standard 'global' SRM test?
For example, if we had 4 variants including control, we would have 3 pairs:
Of course in the case that the control itself has a problem, all secondary tests would trigger, but if any of variants 1, 2, or 3 have a problem, this would enable us to know exactly which one.
From a naive analysis of the chi-square goodness of fit test, it would work out on the math level. But would it really be valid statistically speaking? 🙏
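To make the question concrete, here is a minimal sketch of the global chi-square goodness-of-fit test followed by the control-versus-variant tests described above. It is not the tool's implementation: the function names (`chi2_sf`, `srm_pvalue`, `pairwise_srm`), the example counts, and the Bonferroni-adjusted threshold are all assumptions made for illustration. The p-value uses the closed-form chi-square tail for integer degrees of freedom so that the sketch needs only the standard library.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= half / j
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series in half-integer powers of x.
    total = 0.0
    term = math.sqrt(x)
    for j in range(1, (df - 1) // 2 + 1):
        if j > 1:
            term *= x / (2 * j - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2 / math.pi) * math.exp(-half) * total


def srm_pvalue(observed, ratios):
    """Chi-square goodness-of-fit p-value for observed counts vs. intended ratios."""
    n = sum(observed)
    ratio_sum = sum(ratios)
    stat = 0.0
    for obs, ratio in zip(observed, ratios):
        expected = n * ratio / ratio_sum
        stat += (obs - expected) ** 2 / expected
    return chi2_sf(stat, len(observed) - 1)


def pairwise_srm(observed, ratios, alpha=0.01):
    """Test control (index 0) against each variant separately.

    The threshold is Bonferroni-adjusted (an assumption of this sketch)
    because several secondary tests are run on the same data.
    """
    n_pairs = len(observed) - 1
    results = {}
    for i in range(1, len(observed)):
        p = srm_pvalue([observed[0], observed[i]], [ratios[0], ratios[i]])
        results[i] = (p, p < alpha / n_pairs)
    return results


# Hypothetical counts: control plus three variants, intended split 1:1:1:1.
counts = [10000, 10000, 10000, 10600]
split = [1, 1, 1, 1]
global_p = srm_pvalue(counts, split)
pairs = pairwise_srm(counts, split)
```

With these made-up counts, both the global test and the control-versus-variant-3 test fall below their thresholds, while the other two pairs do not. If the control count were the one that was off, all three pairwise tests would trigger together, so the pairwise results are best read as a hint about where to look rather than as a diagnosis.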