-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add parameter for assuming minor allele is the effect allele #193
Comments
It would be good to get other's opinions on this - @mykmal @sjfandrews @jonathangriffiths @rmgpanw |
Hi there - this does sound like a sensible thing to add. However, the number of ways that people can mess up sumstats is uncountable, and many of these ways are tricky to diagnose (often requiring downstream analyses like LDSC regression, or by looking at patterns of LD and effect sizes to identify flips). So, while I think this is a sensible addition, I'm not sure how easily or reasonably a user would be able to make the assumption you describe about all minor alleles being the effect allele. Keep up the good work on this package! |
Cheers for the input @jonathangriffiths! I guess the thing is that the assumption isn't All minor alleles being effect alleles only the majority. The check would just say if the designated A2 has more matches to the reference than the designated A1, swap the column naming. Then any cases that have the major allele as the effect allele would be flipped (along with their effect columns) to match the direction of the rest of the SNPs (as MSS does already). Again, this would be the default setting but I think it is a sensible choice for most cases. |
Hi thanks for tagging me - this sounds like an important issue and could be a really helpful new feature to implement. Could you please point me/us to where your 'on reference genome' and 'infer reference genome build' are located? Thanks |
Sure @rmgpanw, updated in the original but also below. To note, I have added some checks already to help with ambiguous naming of the allele columns which are worth baring in mind, namely the 'on reference genome' and called in 'infer reference genome build' and called in Can discuss further if you like? |
Thanks for tagging me! I agree that this is an important issue, but I don't think that adding a new check/parameter is the best solution. Last fall you added the infer_effect_column() function to MSS, which is enabled by setting the
From what I understand, the solution you are proposing in this issue is identical to the third check done by The reason that MSS is failing to correctly infer reference and effect alleles in your example GWAS is because of how we defined what counts as an ambiguous column name. In particular, MSS only considers column names with the numbers 1 and 2 to be ambiguous. Moreover, MSS only attempts to infer reference and effect alleles from other column names or from the reference genome if both of the original allele column names are ambiguous. In your example, however, the allele column names contain the numbers 0 and 1 instead. Only one of them is treated as ambiguous, so MSS does not try to match them to the reference genome. Rather than adding an entirely new check that would duplicate existing functionality, I suggest modifying the definition of ambiguous column names in P.S. Those just joining this conversation might want to refer to the discussion in #168 where @Al-Murphy and I came up with the rationale for the three-step checks described above. |
Hey @mykmal, Thanks for this - I didn't even consider the third check of This check is run by default - |
Yes that sounds great to me! |
Hi @Al-Murphy, If a user suspects that the columns in their GWAS summary statistics might be mislabeled despite having unambiguous names (e.g., a column named EFFECT_ALLELE is actually the non-effect allele), then they should carefully investigate that and rename the columns accordingly before running MSS. I don't think it's our job to try to catch such cases. Having said that, if you think that it would be useful to add a new option to manually enable the third check in Thanks again for bringing this up and for all your awesome work on MungeSumstats! |
Yep, definitely off by default! I'll work on these additions today. Thanks for your opinions everyone! |
Okay I believe I have captured all this in v1.13.7: Bug fix
New features
To give an example using the same sumstats from above:
Now run as normal with default settings:
This will infer the effect direction using the A1FREQ column (check 2). See the messages outputted by MSS:
And the output sumstats:
So here, A1FREQ tells MSS this is the expected direction (A1 when A0/A1 are the alleles) so there is no need to rename the columns. So then the effect alleles get flipped since the newly named A1 (A0 to begin with) doesn't match the reference genome. Now if we rerun and remove A1FREQ so MSS can't use this for inference, this should invoke check 3 where the reference genome is used to infer effect direction:
Now we can see the reference genome check forced the renaming of A0/A1 and thereafter no flipping was needed. The effect is the effect alleles aren't flipped whereas for the first case they were - MSS inferred that the columns were mislabeled. Finally, what happens using the new
This gives the same result as the second example. So overall, we think this sumstats was mislabelled which MSS can catch with the ambiguous allele check when check 2 doesn't catch it or when Give it a try and let me know what you think! |
@Al-Murphy I agree with your approach above.
@mykmal When working with one or a couple of GWAS I kind of agree with this. But for large-scale analyses (eg my project involves looking at thousands of GWAS datasets) manual inspection becomes impractical. Thus trying to automate as much as possible can be really helpful in these scenarios. |
Thanks everyone for your input, closing this now but feel free to reopen if you want to discuss more. |
The effect allele in a sumstats is not always the minor allele in a reference genome. As such, MSS does not assume that the effect allele is the minor allele. However this means that if the authors of sumstats mix up the naming of the A1 and A2 allele, MSS will not catch this mistake and will assume all effect columns relate to the major allele and will thus flip them to get the effect values relating to the minor allele. However if the columns were misnamed, we would not want the effect columns flipped.
I think an example of this is PMID:30846698, gwas data website: http://ftp.ebi.ac.uk/pub/databases/gwas/summary_statistics/GCST007001-GCST008000/GCST007560. Here they use ALLELE0 and ALLELE1 to be the non-effect and effect alleles respectively (taken from the README). However, the non-effect allele column are all minor alleles so we believe they misnamed these columns. However, MSS will not catch this as it does not assume the effect allele is the minor allele. An example SNP is rs10910006 which in the raw data had the reference allele as a G and the effect as an A but MSS flipped which is correct. So this means the effect values were also flipped by MSS.
To catch cases where the columns were mislabelled we would need to assume the effect column is the minor allele. I think it would be worthwhile to have as a parameter: effect_is_mionor_allele (default to FALSE as I'm not sure if this is a sensible assumption) for MSS.
It should be easy enough to implement by just checking which of the two effect columns match the reference genome more and assigning that an unambiguous name like non-effect column and the other as effect column (see my 'on reference genome' and called in
format_sumstats()
and 'infer reference genome build' and called informat_sumstats()
, checks in the package, it would be nearly identical to these). I think this check should be added on line 567 of format sumstats here.The text was updated successfully, but these errors were encountered: