Update LDSC output format
Al-Murphy committed Jan 15, 2024
1 parent c55aaa7 commit cccf77b
Showing 11 changed files with 98 additions and 20 deletions.
2 changes: 1 addition & 1 deletion DESCRIPTION
@@ -1,7 +1,7 @@
Package: MungeSumstats
Type: Package
Title: Standardise summary statistics from GWAS
Version: 1.11.2
Version: 1.11.3
Authors@R:
c(person(given = "Alan",
family = "Murphy",
8 changes: 8 additions & 0 deletions NEWS.md
@@ -1,3 +1,11 @@
## CHANGES IN VERSION 1.11.3

### Bug fix
* For LDSC format, swap the A1 and A2 column names, as LDSC expects A1 to be the
effect allele rather than A2 (the opposite of MSS's default) - see more [here](https://groups.google.com/g/ldsc_users/c/S7FZK743w68).
Note that this didn't seem to make any difference to results in tests, see more
[here](https://github.com/neurogenomics/MungeSumstats/issues/160#issuecomment-1891899253).

## CHANGES IN VERSION 1.11.2

### Bug fix
25 changes: 22 additions & 3 deletions R/format_sumstats.R
@@ -187,7 +187,12 @@
#' @param ldsc_format DEPRECATED, do not use. Use save_format="LDSC" instead.
#' @param save_format Output format of sumstats. Options are NULL - standardised
#' output format from MungeSumstats, LDSC - output format compatible with LDSC
#' and openGWAS - output compatible with openGWAS VCFs. Default is NULL.
#' and openGWAS - output compatible with openGWAS VCFs. Default is NULL.
#' **NOTE** - If LDSC format is used, the naming convention of A1 as the
#' reference (genome build) allele and A2 as the effect allele will be reversed
#' to match LDSC (A1 will now be the effect allele). See more info on this
#' [here](https://groups.google.com/g/ldsc_users/c/S7FZK743w68). Note that any
#' effect columns (e.g. Z) will be in relation to A1 now instead of A2.
#' @param log_folder_ind Binary Should log files be stored containing all
#' filtered out SNPs (separate file per filter). The data is outputted in the
#' same format specified for the resulting sumstats file. The only exception to
@@ -285,8 +290,7 @@ format_sumstats <- function(path,
#### Setup multi-threading ####
data.table::setDTthreads(threads = nThread)
#### Setup empty variables ####
rsids <- NULL
orig_dims <- NULL
rsids <- orig_dims <- A1_n <- A2 <- A1 <- NULL
log_files <- vector(mode = "list")
t1 <- Sys.time()

@@ -1036,6 +1040,21 @@ format_sumstats <- function(path,
### Check 39: Ensure CHR follows the requested style ###
CHR <- NULL
sumstats_return$sumstats_dt[, CHR := GenomeInfoDb::mapSeqlevels(CHR, style = chr_style)]

### IF LDSC, rename A1 and A2, effect columns are fine
if (!is.null(save_format) &&
tolower(save_format)=="ldsc") {
message("Renaming A1,A2 to match LDSC format.")
#For LDSC format, rename A1 and A2 as LDSC expects A1 to be the effect
#allele rather than A2 (the opposite to MSS's default) - see more at
#https://groups.google.com/g/ldsc_users/c/S7FZK743w68. However, this
#didn't seem to make any difference to results in tests, see more at
#https://github.com/neurogenomics/MungeSumstats/issues/160#issuecomment-1891899253
sumstats_return$sumstats_dt[,A1_n:=A2]
sumstats_return$sumstats_dt[,A2:=A1]
sumstats_return$sumstats_dt[,A1:=A1_n]
sumstats_return$sumstats_dt[,A1_n:=NULL]
}

#### WRITE data.table TO PATH ####
check_save_out$save_path <- write_sumstats(
21 changes: 12 additions & 9 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,19 +1,22 @@
`MungeSumstats`: Standardise the format of GWAS summary statistics
================
<h5> ¶ <i>Authors</i>: Alan Murphy, Brian Schilder and Nathan Skene ¶
<h5>
<i>Authors</i>: Alan Murphy, Brian Schilder and Nathan Skene
</h5>
<h5>
<i>Updated</i>: Jan-15-2024
</h5>
<h5> ¶ <i>Updated</i>: Jul-13-2023 ¶ </h5>

<!-- Readme.md is generated from Readme.Rmd. Please edit that file -->
<!-- badges: start -->

[![](https://img.shields.io/badge/release%20version-1.8.0-black.svg)](https://www.bioconductor.org/packages/MungeSumstats)
[![](https://img.shields.io/badge/devel%20version-1.9.11-black.svg)](https://github.com/neurogenomics/MungeSumstats)
[![](https://img.shields.io/badge/release%20version-1.10.1-black.svg)](https://www.bioconductor.org/packages/MungeSumstats)
[![](https://img.shields.io/badge/devel%20version-1.11.3-black.svg)](https://github.com/neurogenomics/MungeSumstats)
[![R build
status](https://github.com/neurogenomics/MungeSumstats/workflows/rworkflows/badge.svg)](https://github.com/neurogenomics/MungeSumstats/actions)
[![](https://img.shields.io/github/last-commit/neurogenomics/MungeSumstats.svg)](https://github.com/neurogenomics/MungeSumstats/commits/master)
[![](https://codecov.io/gh/neurogenomics/MungeSumstats/branch/master/graph/badge.svg)](https://codecov.io/gh/neurogenomics/MungeSumstats)
[![](https://img.shields.io/badge/download-5460/total-blue.svg)](https://bioconductor.org/packages/stats/bioc/MungeSumstats)
[![](https://img.shields.io/badge/download-11379/total-blue.svg)](https://bioconductor.org/packages/stats/bioc/MungeSumstats)
[![License:
Artistic-2.0](https://img.shields.io/badge/license-Artistic--2.0-blue.svg)](https://cran.r-project.org/web/licenses/Artistic-2.0)
[![](https://img.shields.io/badge/doi-https://doi.org/10.1093/bioinformatics/btab665-blue.svg)](https://doi.org/https://doi.org/10.1093/bioinformatics/btab665)
@@ -154,10 +157,10 @@ We would like to acknowledge all those who have contributed to

<div id="ref-Skene2018" class="csl-entry">

<span class="csl-left-margin">1. </span><span
class="csl-right-inline">Nathan G. Skene, T. E. B., Julien Bryois.
Genetic identification of brain cell types underlying schizophrenia.
*Nature Genetics* (2018).
<span class="csl-left-margin">1.
</span><span class="csl-right-inline">Nathan G. Skene, T. E. B., Julien
Bryois. Genetic identification of brain cell types underlying
schizophrenia. *Nature Genetics* (2018).
doi:[10.1038/s41588-018-0129-5](https://doi.org/10.1038/s41588-018-0129-5)</span>

</div>
7 changes: 6 additions & 1 deletion man/check_ldsc_format.Rd

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

7 changes: 6 additions & 1 deletion man/format_sumstats.Rd

7 changes: 6 additions & 1 deletion man/import_sumstats.Rd

7 changes: 6 additions & 1 deletion man/validate_parameters.Rd

7 changes: 6 additions & 1 deletion man/write_sumstats.Rd

14 changes: 14 additions & 0 deletions tests/testthat/test-vcf_formatting.R
@@ -146,12 +146,26 @@ test_that("VCF is correctly formatted", {
ldsc_cols <- c("SNP", "N", "A1", "A2", "Z")
testthat::expect_true(all(ldsc_cols %in% names(res)))

#also ensure A1 and A2 have been renamed
#For LDSC format, rename A1 and A2 as LDSC expects A1 to be the effect
#allele rather than A2 (the opposite to MSS's default) - see more at
#https://groups.google.com/g/ldsc_users/c/S7FZK743w68. However, this
#didn't seem to make any difference to results in tests, see more at
#https://github.com/neurogenomics/MungeSumstats/issues/160#issuecomment-1891899253
data.table::setnames(res,c("A1","A2"),c("A2","A1"))
res[,CHR:=as.character(CHR)]
testthat::expect_true(all.equal(res[,c("SNP","CHR","BP","A1","A2","END",
"FILTER","FRQ","BETA","LP","SE",
"P")],rtrn_dt))


testthat::expect_equal(reformatted_lines[1:5], corr_res)
} else {
testthat::expect_true((is_32bit_windows||!Sys.info()["sysname"]=="Linux"))
testthat::expect_true((is_32bit_windows||!Sys.info()["sysname"]=="Linux"))
testthat::expect_true((is_32bit_windows||!Sys.info()["sysname"]=="Linux"))
testthat::expect_true((is_32bit_windows||!Sys.info()["sysname"]=="Linux"))
testthat::expect_true((is_32bit_windows||!Sys.info()["sysname"]=="Linux"))
testthat::expect_true((is_32bit_windows||!Sys.info()["sysname"]=="Linux"))
}
})
13 changes: 11 additions & 2 deletions vignettes/MungeSumstats.Rmd
@@ -165,7 +165,11 @@ flexibility to export the reformatted file as tab-delimited, VCF or R
native objects such as data.table, GRanges or VRanges objects. The
output can also be written in an **LDSC ready** format which means the
file can be fed directly into LDSC without the need for additional
munging.
munging. **NOTE** - If LDSC format is used, the naming convention of A1 as the
reference (genome build) allele and A2 as the effect allele will be reversed
to match LDSC (A1 will now be the effect allele). See more info on this
[here](https://groups.google.com/g/ldsc_users/c/S7FZK743w68). Note that any
effect columns (e.g. Z) will be in relation to A1 now instead of A2.
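
The A1/A2 reversal described in the note above can be illustrated with a minimal base R sketch. This is an illustration under assumptions, not the package's code: the toy table and the object names `sumstats` and `ldsc` are hypothetical, and `format_sumstats()` itself performs the swap on a data.table. The sketch only shows that the allele column names are exchanged while the effect column (here `Z`) is left unchanged.

```r
# Hypothetical toy summary statistics in the MungeSumstats default convention:
# A1 = reference (genome build) allele, A2 = effect allele.
sumstats <- data.frame(
  SNP = c("rs1", "rs2"),
  A1  = c("A", "C"),
  A2  = c("G", "T"),
  Z   = c(1.2, -0.7),
  stringsAsFactors = FALSE
)

# LDSC convention: A1 is the effect allele, so exchange the two columns.
ldsc <- sumstats
ldsc$A1 <- sumstats$A2
ldsc$A2 <- sumstats$A1

# The effect column is not modified; only the allele labels change.
print(ldsc)
```

After the swap, `ldsc$A1` holds the alleles that were in `sumstats$A2`, and `Z` has the same values as before, which is why it should now be read in relation to A1.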

# Data

@@ -419,7 +423,12 @@ conducted by *MungeSumstats* are:
("data.table","vranges","granges").
- **save_format** Ensure that output format meets all requirements to
be passed directly into LDSC ("ldsc") without the need for additional
munging or for IEU OpenGWAS format ("opengwas") before saving as a VCF
munging or for IEU OpenGWAS format ("opengwas") before saving as a VCF.
**NOTE** - If LDSC format is used, the naming convention of A1 as the
reference (genome build) allele and A2 as the effect allele will be reversed
to match LDSC (A1 will now be the effect allele). See more info on this
[here](https://groups.google.com/g/ldsc_users/c/S7FZK743w68). Note that any
effect columns (e.g. Z) will be in relation to A1 now instead of A2.
- **log_folder_ind** Should log files be stored containing all
filtered out SNPs (separate file per filter). The data is outputted
in the same format specified for the resulting sumstats file.