-
Notifications
You must be signed in to change notification settings - Fork 16
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update header mappings #168
Conversation
Hey, Thanks for going through these and the vast majority seem sensible and useful. However, some I wouldn't change and I'll try explain why:
Keeping these is intentional to since MSS can infer the direction of the SNPs so then if the effect of frequency columns relate to A1 it means A1 is the effect column. When MSS checks the A1 and A2 SNP nucleotides, it will pick up on A1 being the effect column and switch these along with flipping the effect and performing 1-FRQ. You can test this out with the formatted example sumstats:
Though let me know if you think I'm missing something here? Cheers! |
Hey @mykmal great to hear your thoughts on the above, the deadline for changes to the dev branch is fast approaching so would like to get this in for you beforehand (not the end of the world if not though, you could just commit a new pull request to the new dev branch with all the same changes if we miss it). |
Hey @Al-Murphy, sorry for the delay! I wish that MSS was smart enough to recognize when A1 is the effect column, but it actually does not have the behavior you described. The reason is because each column header mapping is performed independently of the rest. For example, the presence of a column named A1BETA in the original summary statistics file will not affect how MSS maps A1 and A2. In this case, A1BETA will be mapped to BETA, A1 will be treated as the reference allele, and A2 will be treated as the effect allele. Consequently, effect sizes in the formatted summary statistics file will have the wrong directions. Analogous reasoning applies when the frequency column in the original summary statistics file represents the frequency of A1 rather than A2. The example you provided actually supports my point. Notice how effect directions and allele frequencies in library(data.table)
a <- MungeSumstats::formatted_example()
b <- copy(a)
#make A1 the effect col not A2
#change FRQ and BETA to reflect this change
setnames(b,"BETA","BETA1")
setnames(b,"A2","A3")
setnames(b,"A1","A2")
setnames(b,"A3","A1")
setnames(b,"FRQ","A1FRQ")
# Notice that after formatting, BETA and FRQ are flipped relative to their true values
b_for <- MungeSumstats::format_sumstats(b, return_data = TRUE)
# But if the allele columns are named unambiguously, formatting will be performed correctly
b_renamed <- copy(b)
setnames(b_renamed, "A1", "effect_allele")
setnames(b_renamed, "A2", "non_effect_allele")
b_renamed_for <- MungeSumstats::format_sumstats(b_renamed, return_data = TRUE)
# Here's an arbitrarily chosen SNP for illustration
# In the original dataset, the C allele has a negative effect and FRQ < 0.5
a[SNP == "rs1008078", ]
#SNP CHR BP A1 A2 FRQ BETA SE P
#1: rs1008078 1 91189731 T C 0.3731 -0.016 0.003 6.005e-10
# In the naively processed dataset, the T allele has a negative effect and FRQ < 0.5 -- this is wrong
b_for[SNP == "rs1008078", ]
#SNP CHR BP A1 A2 FRQ BETA SE P
#1: rs1008078 1 91189731 C T 0.3731 -0.016 0.003 6.005e-10
# Here the T allele has a positive effect and FRQ > 0.5, which is consistent with dataset a
b_renamed_for[SNP == "rs1008078", ]
#SNP CHR BP A1 A2 FRQ BETA SE P
#1: rs1008078 1 91189731 C T 0.6269 0.016 0.003 6.005e-10 I think this is a serious issue, but I have not been able to come up with a fool-proof programmatic solution. Whenever I use MSS, I first manually check the original GWAS summary statistics file. If the allele columns are named unambiguously (e.g. effect_allele and non_effect_allele) then I feel confident passing that file into MSS as-is. But when the allele columns are named A1 and A2, I search the documentation for that GWAS to determine which column represents the effect allele. If the definitions of A1/A2 are reversed from what MSS expects, I rename those columns before running Removing mappings for columns such as A1FRQ and BETA1, as I did in this commit, is a partial solution because it will cause MSS to error out if such columns exist, forcing users to look into the original file and rename the columns appropriately. Now that I think of it, an even better solution would be to add a check to What do you think? Is my understanding of MSS correct? |
When is the deadline for dev branch changes? If you agree with my reasoning and the proposed checks that I described in the last paragraph above, I would be happy to take a shot at implementing them. This would also entail adding the A1-specific effect and frequency header mappings back in. However, since that would take me some time to do, perhaps it makes sense to merge the current PR as a temporary partial solution to the A1/A2 mapping issue. Or, if you have any better ideas about how to automatically infer reference and effect alleles, please let me know! |
I see the issue now thanks for clarifying. I think an automated solution is absolutely necessary and we should be able to make a decent attempt at it. I agree that we should have a check that if A1 & A2 naming conventions are used, we should have a function to infer the effect allele. There are a few steps we can do for this for which I think the order is important:
I've made these changes to my version of v1.9.18 and have also included your updated header mappings (including the ones I originally said should be kept - see get_eff_frq_allele_combns.R for how eff/frq columns that infer the allele are captured in the mapping file) Let me know your thoughts and please run it on a few sumstats that gave trouble before to see how it performs - let's improve on it where we can but I think it's a good place. See below for the new results on the same examples:
|
Thank you for implementing such a robust solution to the A1/A2 inference issue! This is a major improvement to MSS. I agree with your reasoning and the order of the three checks you proposed. Although the assumption that non-effect alleles will match the reference genome may not always hold, it is a reasonable guess and the best we can do if the other two checks fail. Two comments:
|
Great I have fixed both of these and will push today. I'll close this PR now - thanks again for bringing this to my attention. |
This update only modifies the column header mappings file.
I added mappings for several new column headers that I encountered while processing real GWAS data. Additionally, I went through all of the existing header mappings and fixed or removed those that seemed incorrect to me. The full list of changes is described in NEWS.md. Let me know if you disagree with anything.
Note that I already sorted the updated mappings file by following all the steps listed in data.R.