Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update header mappings #168

Closed
wants to merge 1 commit into from
Closed

Update header mappings #168

wants to merge 1 commit into from

Conversation

mykmal
Copy link
Contributor

@mykmal mykmal commented Sep 25, 2023

This update only modifies the column header mappings file.

I added mappings for several new column headers that I encountered while processing real GWAS data. Additionally, I went through all of the existing header mappings and fixed or removed those that seemed incorrect to me. The full list of changes is described in NEWS.md. Let me know if you disagree with anything.

Note that I already sorted the updated mappings file by following all the steps listed in data.R.

@Al-Murphy
Copy link
Owner

Hey,

Thanks for going through these and the vast majority seem sensible and useful. However, some I wouldn't change and I'll try explain why:

  • Removed the mappings for "OR-A1", "OR.A1", "OR_A1", and "BETA1" because MSS
    assumes that A2 is the effect allele
  • Removed mappings for "A1FREQ", "A1FRQ", "AF1", "FREQ.A1.1000G.EUR",
    "FREQ.A1.ESP.EUR", "FREQ.ALLELE1.HAPMAPCEU", "FREQ1", "FREQ1.HAPMAP", and
    "FRQ_A1" because MSS defines "FRQ" to be the allele frequency of A2

Keeping these is intentional to since MSS can infer the direction of the SNPs so then if the effect of frequency columns relate to A1 it means A1 is the effect column. When MSS checks the A1 and A2 SNP nucleotides, it will pick up on A1 being the effect column and switch these along with flipping the effect and performing 1-FRQ. You can test this out with the formatted example sumstats:

library(data.table)
a <- MungeSumstats::formatted_example()
b <- copy(a)
#make A1 the effect col not A2
#change FRQ and BETA to reflect this change
setnames(b,"BETA","BETA1")
setnames(b,"A2","A3")
setnames(b,"A1","A2")
setnames(b,"A3","A1")
setnames(b,"FRQ","A1FRQ")

b_for <- MungeSumstats::format_sumstats(b)

Though let me know if you think I'm missing something here?

Cheers!

@Al-Murphy
Copy link
Owner

Hey @mykmal great to hear your thoughts on the above, the deadline for changes to the dev branch is fast approaching so would like to get this in for you beforehand (not the end of the world if not though, you could just commit a new pull request to the new dev branch with all the same changes if we miss it).

@mykmal
Copy link
Contributor Author

mykmal commented Oct 12, 2023

Hey @Al-Murphy, sorry for the delay!

I wish that MSS was smart enough to recognize when A1 is the effect column, but it actually does not have the behavior you described. The reason is because each column header mapping is performed independently of the rest. For example, the presence of a column named A1BETA in the original summary statistics file will not affect how MSS maps A1 and A2. In this case, A1BETA will be mapped to BETA, A1 will be treated as the reference allele, and A2 will be treated as the effect allele. Consequently, effect sizes in the formatted summary statistics file will have the wrong directions. Analogous reasoning applies when the frequency column in the original summary statistics file represents the frequency of A1 rather than A2.

The example you provided actually supports my point. Notice how effect directions and allele frequencies in b_for are flipped relative to their true values from a. But if we specify that A1 is now the effect allele column and A2 is the reference allele column by renaming them to something unambiguous, then MSS will produce correct output.

library(data.table)
a <- MungeSumstats::formatted_example()
b <- copy(a)

#make A1 the effect col not A2
#change FRQ and BETA to reflect this change
setnames(b,"BETA","BETA1")
setnames(b,"A2","A3")
setnames(b,"A1","A2")
setnames(b,"A3","A1")
setnames(b,"FRQ","A1FRQ")

# Notice that after formatting, BETA and FRQ are flipped relative to their true values
b_for <- MungeSumstats::format_sumstats(b, return_data = TRUE)

# But if the allele columns are named unambiguously, formatting will be performed correctly
b_renamed <- copy(b)
setnames(b_renamed, "A1", "effect_allele")
setnames(b_renamed, "A2", "non_effect_allele")
b_renamed_for <- MungeSumstats::format_sumstats(b_renamed, return_data = TRUE)

# Here's an arbitrarily chosen SNP for illustration
# In the original dataset, the C allele has a negative effect and FRQ < 0.5
a[SNP == "rs1008078", ]
#SNP CHR       BP A1 A2    FRQ   BETA    SE         P
#1: rs1008078   1 91189731  T  C 0.3731 -0.016 0.003 6.005e-10

# In the naively processed dataset, the T allele has a negative effect and FRQ < 0.5 -- this is wrong
b_for[SNP == "rs1008078", ]
#SNP CHR       BP A1 A2    FRQ   BETA    SE         P
#1: rs1008078   1 91189731  C  T 0.3731 -0.016 0.003 6.005e-10

# Here the T allele has a positive effect and FRQ > 0.5, which is consistent with dataset a
b_renamed_for[SNP == "rs1008078", ]
#SNP CHR       BP A1 A2    FRQ  BETA    SE         P
#1: rs1008078   1 91189731  C  T 0.6269 0.016 0.003 6.005e-10

I think this is a serious issue, but I have not been able to come up with a fool-proof programmatic solution. Whenever I use MSS, I first manually check the original GWAS summary statistics file. If the allele columns are named unambiguously (e.g. effect_allele and non_effect_allele) then I feel confident passing that file into MSS as-is. But when the allele columns are named A1 and A2, I search the documentation for that GWAS to determine which column represents the effect allele. If the definitions of A1/A2 are reversed from what MSS expects, I rename those columns before running format_sumstats().

Removing mappings for columns such as A1FRQ and BETA1, as I did in this commit, is a partial solution because it will cause MSS to error out if such columns exist, forcing users to look into the original file and rename the columns appropriately. Now that I think of it, an even better solution would be to add a check to format_sumstats() that determines whether any column names signify that A1 is the effect allele (e.g. if a column named BETA1 is present) and then rename the allele columns accordingly. This check would have to be performed before calling the header mapping function. However, neither solution will address scenarios where A1 is the effect allele but this is not reflected in the other column names (e.g. the effect column is simply BETA). In such cases, I think it is impossible to automatically infer which column contains the effect alleles. Perhaps we should warn users whenever their GWAS file contains ambiguously named allele columns whose definitions cannot be inferred by considering other column names.

What do you think? Is my understanding of MSS correct?

@mykmal
Copy link
Contributor Author

mykmal commented Oct 12, 2023

Hey @mykmal great to hear your thoughts on the above, the deadline for changes to the dev branch is fast approaching so would like to get this in for you beforehand (not the end of the world if not though, you could just commit a new pull request to the new dev branch with all the same changes if we miss it).

When is the deadline for dev branch changes? If you agree with my reasoning and the proposed checks that I described in the last paragraph above, I would be happy to take a shot at implementing them. This would also entail adding the A1-specific effect and frequency header mappings back in. However, since that would take me some time to do, perhaps it makes sense to merge the current PR as a temporary partial solution to the A1/A2 mapping issue.

Or, if you have any better ideas about how to automatically infer reference and effect alleles, please let me know!

@Al-Murphy
Copy link
Owner

I see the issue now thanks for clarifying. I think an automated solution is absolutely necessary and we should be able to make a decent attempt at it.

I agree that we should have a check that if A1 & A2 naming conventions are used, we should have a function to infer the effect allele. There are a few steps we can do for this for which I think the order is important:

  1. Check if ambiguous naming conventions are used (i.e. allele 1 and 2 or equivalent). If not exit, otherwise continue to next checks. This can be checked by using the mapping file and splitting A1/A2 mappings by those that contain 1 or 2 (ambiguous) or doesn't contain 1 or 2 e.g. effect, tested (unambiguous so fine for MSS to handle as is).
  2. Look for effect column/frequency column where the A1/A2 explicitly mentioned, if found then we know the direction and should update A1/A2 naming so A2 is the effect column. We can look for such columns by getting every combination of A1/A2 naming and effect/frq naming.
  3. If note found in 2, a final check should be against the reference genome, whichever of A1 and A2 has more of a match with the reference genome should be taken as not the effect allele. There is an assumption in this but I think it is the best bet given the ambiguous allele naming but let me know if you see issues with this?

I've made these changes to my version of v1.9.18 and have also included your updated header mappings (including the ones I originally said should be kept - see get_eff_frq_allele_combns.R for how eff/frq columns that infer the allele are captured in the mapping file)

Let me know your thoughts and please run it on a few sumstats that gave trouble before to see how it performs - let's improve on it where we can but I think it's a good place. See below for the new results on the same examples:

library(data.table)
a <- MungeSumstats::formatted_example()
b <- copy(a)

#check if inferring effect direction is working

#make A1 the effect col not A2
#change FRQ and BETA to reflect this change
data.table::setnames(b,"A2","A3")
data.table::setnames(b,"A1","A2")
data.table::setnames(b,"A3","A1")
c <- copy(b)
data.table::setnames(b,"BETA","BETA1")
data.table::setnames(c,"FRQ","A1FRQ")

# if the allele columns are named unambiguously, formatting will be performed 
# correctly - use this to test against as eff will be flipped compared to org
b_renamed <- copy(b)
data.table::setnames(b_renamed, "A1", "effect_allele")
data.table::setnames(b_renamed, "A2", "non_effect_allele")
b_renamed_for <- MungeSumstats::format_sumstats(b_renamed, return_data = TRUE,
                                                #all just make MSS run faster
                                                ref_genome = 'GRCh37',
                                                on_ref_genome = FALSE,
                                                strand_ambig_filter = FALSE,
                                                bi_allelic_filter = FALSE,
                                                allele_flip_check = FALSE,
                                                dbSNP=144)

# what if BETA gives the direction
b_for <- MungeSumstats::format_sumstats(b, return_data = TRUE, 
                                        #all just make MSS run faster
                                        ref_genome = 'GRCh37',
                                        on_ref_genome = FALSE,
                                        strand_ambig_filter = FALSE,
                                        bi_allelic_filter = FALSE,
                                        allele_flip_check = FALSE,
                                        dbSNP=144)

# what if FRQ gives the direction
c_for <- MungeSumstats::format_sumstats(b, return_data = TRUE, 
                                        #all just make MSS run faster
                                        ref_genome = 'GRCh37',
                                        on_ref_genome = FALSE,
                                        strand_ambig_filter = FALSE,
                                        bi_allelic_filter = FALSE,
                                        allele_flip_check = FALSE,
                                        dbSNP=144)

testthat::expect_equal(
  all.equal(b_renamed_for, b_for,ignore.col.order = TRUE),
  TRUE
)
#> TRUE
testthat::expect_equal(
  all.equal(b_renamed_for, c_for,ignore.col.order = TRUE),
  TRUE
)
#> TRUE


#finally check if the ref genome can be used to infer rather than being told in 
#eff/frq cols - just subset for speed
snps <- c("rs11210860","rs34305371","rs1008078","rs11588857","rs1777827",
          "rs76076331","rs2457660","rs10496091","rs4851251","rs12987662",
          "rs10930008","rs301800","rs2568955","rs61787263","rs2992632",
          "rs11689269","rs11690172")
d <- copy(b)
data.table::setnames(d,"BETA1","BETA")
d_for <- MungeSumstats::format_sumstats(d[SNP %in% snps,], return_data = TRUE, 
                                        on_ref_genome = TRUE,
                                        #all just make MSS run faster
                                        ref_genome = 'GRCh37',
                                        strand_ambig_filter = FALSE,
                                        bi_allelic_filter = FALSE,
                                        allele_flip_check = FALSE,
                                        dbSNP=144)
setkey(b_renamed_for,"SNP")
setkey(d_for,"SNP")
testthat::expect_equal(
  all.equal(b_renamed_for[SNP %in% snps], d_for,
            ignore.col.order = TRUE),
  TRUE
)
#> TRUE

@mykmal
Copy link
Contributor Author

mykmal commented Oct 16, 2023

Thank you for implementing such a robust solution to the A1/A2 inference issue! This is a major improvement to MSS.

I agree with your reasoning and the order of the three checks you proposed. Although the assumption that non-effect alleles will match the reference genome may not always hold, it is a reasonable guess and the best we can do if the other two checks fail.

Two comments:

  1. I think it would be informative to have MSS print a message when the third check is utilized, just to make users aware of the assumption being made. The function prints "Found direction from effect/frq column naming" when direction is inferred from other column names, but nothing is currently printed when direction is inferred by matching with the reference genome.
  2. You have a typo in the tests (both in the testthat R file and in your comment above). When creating the c_for datatable, you are calling format_sumstats() on the b datatable instead of on the c datatable. I verified that the function will work correctly on c as well, but the typo should be fixed so that you have a proper test.

@Al-Murphy
Copy link
Owner

Great I have fixed both of these and will push today. I'll close this PR now - thanks again for bringing this to my attention.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants