LDSC munge for UKB sumstats #26

Open
yk-tanigawa opened this issue Jul 5, 2020 · 11 comments
@yk-tanigawa
Contributor

We convert the UKB sumstats into LDSC munge format.

This will enable us to perform

@yk-tanigawa
Contributor Author

Focusing on the finalized summary statistic files, we started LDSC munge.

There are 20,940 such files across 7 populations, and we have submitted the computation.

As of now,

  • 15,213 files are converted to LDSC munge
  • 5,727 files: still running.

Please see the analysis scripts for more info.

@yk-tanigawa
Contributor Author

It turned out that there was an issue in the filtering condition, so we are computing LDSC munge for all sumstats in the gwas/current directory.

We now have 19,163+ munged sumstats (3,669 for WB).

Once GWAS is finalized, we can identify the updated sum stats (~1,100 in total; ~880 will be overwritten and ~230 will be added) and re-apply LDSC munge.

@yk-tanigawa
Contributor Author

yk-tanigawa commented Jul 6, 2020

We considered applying LDSC munge to the meta-analyzed summary statistics (to get a phenotype mapping for #25), but we decided to use the WB sumstats for the mapping between FinnGen and UKB.

@yk-tanigawa
Contributor Author

Files are in /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc

@yk-tanigawa
Contributor Author

With progress on #21, we should refresh this and update the #27 analysis

@yk-tanigawa
Contributor Author

In 1_remove-incomplete-20200713.sh, we fixed the previous error in the filtering condition.

In the original version of 1_generate_input_list.sh, we incorrectly specified `NR>1 || $NF == 1080969`, but it should have been `NR>1 && $NF == 1080969`. This resulted in 909 extra munged files.
Those were NOT used in the heritability analysis. In this script, we remove those 909 files.
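
For anyone reading this later, here is a minimal sketch of why the `||` version let extra files through. The toy table, its column names, and the file names are made up for illustration; only the two awk conditions come from the comment above. With `||`, `NR>1` is already true for every data row, so the comparison on the last column never filters anything.

```shell
# Toy input: a header plus three rows; the last column stands in for
# whatever 1_generate_input_list.sh compares against 1080969 (hypothetical layout).
toy=$(printf 'path\tcount\na.gz\t1080969\nb.gz\t500\nc.gz\t1080969\n')

# Original condition: true for every row after the header.
buggy=$(printf '%s\n' "$toy" | awk 'NR>1 || $NF == 1080969 {print $1}')

# Corrected condition: skips the header AND requires the expected value.
fixed=$(printf '%s\n' "$toy" | awk 'NR>1 && $NF == 1080969 {print $1}')

n_buggy=$(printf '%s\n' "$buggy" | wc -l | tr -d ' ')
n_fixed=$(printf '%s\n' "$fixed" | wc -l | tr -d ' ')
echo "buggy keeps $n_buggy rows, fixed keeps $n_fixed rows"
```

In this toy case the buggy condition keeps `b.gz` even though its last column does not match.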

@yk-tanigawa
Contributor Author

yk-tanigawa commented Jul 18, 2020

With the finalized GWAS results (#21), we apply LDSC munge again.

1_LDSC_munge.20200717-210250.job.lst has 2714 files, i.e. 905 array tasks of up to 3 files each (905 * 3 = 2715).

bash 1_generate_input_list.sh | tee 1_LDSC_munge.$(date +%Y%m%d-%H%M%S).job.lst | tee /dev/stderr | wc -l

ml load resbatch
ml R/3.6 gcc

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-905 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-210250.job.lst 3

Submitted batch job 4255901
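
A small sketch of how the `--array` upper bound relates to the job-list length. It assumes the trailing argument to `$parallel_sbatch_sh` (3 here) is the number of job-list lines handled per array task; that reading is inferred from the numbers in this thread, not confirmed. The helper name `n_tasks` is made up.

```shell
# n_tasks N_FILES BATCH: number of array tasks needed when each task
# handles up to BATCH lines of the job list (integer ceiling division).
n_tasks() {
  echo $(( ($1 + $2 - 1) / $2 ))
}

n_tasks 2714 3   # prints 905, matching --array=1-905 above
```

Under that assumption the last task gets only 2 files, which is why 905 * 3 is one more than the 2714 lines in the list.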

@yk-tanigawa
Contributor Author

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc -type f -name "*.gz" | wc -l
20295
ml load resbatch R/3.6 gcc

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-1000 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.lst 1

# Submitted batch job 4260541

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-877 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.part2.lst 1

# Submitted batch job 4260621
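
The two submissions above cover 1000 + 877 = 1877 single-file tasks. A sketch of how a `part2` list like the one above could be produced. This assumes the split at 1000 reflects a cap on array size, which is not stated in this thread, and it uses a stand-in job list rather than the real one.

```shell
workdir=$(mktemp -d)

# Stand-in job list with 1877 lines (the real list holds sumstats paths).
seq 1 1877 | sed 's/^/sumstats_/' > "$workdir/all.job.lst"

max_array=1000   # assumed cap on tasks per submission

# Everything after the first $max_array lines goes to the second submission.
tail -n "+$((max_array + 1))" "$workdir/all.job.lst" > "$workdir/part2.job.lst"

n_part2=$(wc -l < "$workdir/part2.job.lst" | tr -d ' ')
first_part2=$(head -n 1 "$workdir/part2.job.lst")
echo "part2 has $n_part2 lines, starting at $first_part2"

rm -rf "$workdir"
```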

@yk-tanigawa
Contributor Author

We also apply LDSC munge on the meta-analyzed sumstats.


ml load R/3.6 gcc resbatch

sbatch -p mrivas,normal,owners --time=1:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge_meta --output=logs/munge_meta.%A_%a.out --error=logs/munge_meta.%A_%a.err --array=1-949 $parallel_sbatch_sh 1b_LDSC_munge.sh 1_LDSC_munge.20200718-134522.metal.job.lst 4
Submitted batch job 4279977

@yk-tanigawa
Contributor Author

There are some failed files...

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/metal/ -type f -name "*.gz" | wc
   3417    3417  399695

[ytanigaw@sh02-09n54 ~/repos/rivas-lab/ukbb-tools/07_LDSC/jobs/202007_LDSC]$ wc 1_LDSC_munge.20200718-134522.metal.job.lst
  3794   3794 340971 1_LDSC_munge.20200718-134522.metal.job.lst
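
The counts above differ by 3794 - 3417 = 377. A sketch of one way to list which job-list entries have no munged output. The naming rule (output is `<prefix>.sumstats.gz`, where the prefix is the input basename up to its first dot), the helper name `list_missing`, and the toy paths are all hypothetical; the real mapping lives in 1b_LDSC_munge.sh.

```shell
# Demo data in a temporary directory so the sketch runs anywhere.
workdir=$(mktemp -d)
mkdir -p "$workdir/out"
printf '%s\n' /data/a.metal.tsv.gz /data/b.metal.tsv.gz /data/c.metal.tsv.gz \
  > "$workdir/job.lst"
touch "$workdir/out/a.sumstats.gz" "$workdir/out/c.sumstats.gz"

# list_missing JOB_LIST OUT_DIR: print inputs that have no output file.
list_missing() {
  while IFS= read -r input; do
    prefix=$(basename "$input")
    prefix=${prefix%%.*}   # hypothetical rule: keep text before the first dot
    [ -e "$2/$prefix.sumstats.gz" ] || printf '%s\n' "$input"
  done < "$1"
}

missing=$(list_missing "$workdir/job.lst" "$workdir/out")
printf '%s\n' "$missing"

rm -rf "$workdir"
```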

@guhanrv
Collaborator

guhanrv commented Nov 19, 2020

An update on this - as a result of needing to run the pairwise rg calculations across all traits, I needed to convert all of the summary statistics to the munged format. I've tabulated the phenotypes for which the sumstats munge failed, with an error similar to the following:

Traceback (most recent call last):
  File "/opt/ldsc/munge_sumstats.py", line 701, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "/opt/ldsc/munge_sumstats.py", line 373, in check_median
    raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTATS is 0.11 (should be close to 0.0). This column may be mislabeled.

These are at https://github.com/rivas-lab/ukbb-tools/blob/master/07_LDSC/helpers/affected_metal_traits.txt.

A quick check on the gwas.qc.tsv file for the array-combined dataset indicates that these summary statistics come from traits with low N overall.
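
For a quick pre-check outside of LDSC, here is a rough sketch that computes the median of a signed statistic column and flags it when it is more than 0.1 away from 0, the tolerance and expected value visible in the traceback above. This is not LDSC's own code; the helper name `median_of_column`, the column layout, and the toy values are made up, and it assumes a `sort` that supports `-g`.

```shell
# median_of_column FILE COL: median of a numeric column, skipping the header.
median_of_column() {
  awk -v c="$2" 'NR > 1 { print $c }' "$1" | sort -g | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      if (NR % 2) print v[(NR + 1) / 2]
      else print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Toy sumstats file with a hypothetical SNP / BETA layout.
toy=$(mktemp)
printf 'SNP\tBETA\nrs1\t0.11\nrs2\t0.2\nrs3\t-0.05\nrs4\t0.3\nrs5\t0.11\n' > "$toy"

med=$(median_of_column "$toy" 2)

# Flag the file when |median - 0| exceeds the 0.1 tolerance.
if awk -v m="$med" 'BEGIN { d = m < 0 ? -m : m; exit !(d > 0.1) }'; then
  status=flagged
else
  status=ok
fi
echo "median=$med status=$status"

rm -f "$toy"
```

With few variants or low N, a shifted median like this is plausible, which would be consistent with the low-N observation above.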
