LDSC rg between FinnGen and UKB #27

Closed
yk-tanigawa opened this issue Jul 5, 2020 · 22 comments

@yk-tanigawa
Contributor

To generate the GBE_ID mapping between FinnGen and UKB, we compute the LDSC genetic correlation (rg) between UKB and FinnGen traits.

We prepared FinnGen in LDSC munge format here.

We are also preparing UKB in LDSC munge format in issue #26.

We use the UKB white British (WB) summary statistics for this rg analysis.

yk-tanigawa added this to the GBE Global Meta-Analysis milestone Jul 5, 2020
yk-tanigawa self-assigned this Jul 5, 2020
@yk-tanigawa
Contributor Author

The computation has started.

find /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg/ -name "*.log" | wc -l
415

While we can run this preliminary analysis, this approach is not scalable.

There are 1801 FinnGen summary statistics. Restricting UKB to HC phenotypes with N_WB >= 1000 leaves 580 traits.

This means > 1M (= 1801 * 580) comparisons, each taking ~5 min...
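To put the scaling concern in numbers, here is a rough back-of-the-envelope sketch. It only uses the figures quoted above; the 5-minute cost per comparison is the approximate estimate from this comment, not a measured average.

```shell
# Figures quoted in the comment above; minutes_per_rg is a rough estimate.
n_finngen=1801
n_ukb=580
minutes_per_rg=5

# Every FinnGen trait is paired with every UKB trait.
n_pairs=$((n_finngen * n_ukb))

# Integer division is fine for an order-of-magnitude estimate.
cpu_hours=$((n_pairs * minutes_per_rg / 60))

echo "pairs: $n_pairs"
echo "approximate CPU-hours: $cpu_hours"
```

Under these assumptions the full grid costs on the order of 87,000 single-core hours, which is why pruning traits before running rg matters.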

@yk-tanigawa
Contributor Author

A pilot analysis of 5000 comparisons has finished. Some comparisons returned NA, apparently because of low N or low h2 in an input trait.

To reduce the number of comparisons, we can run LDSC h2 first for both UKB and FinnGen, drop phenotypes whose h2 estimate is NA, and impose a minimum h2 threshold.
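The filtering step described above could look roughly like the following. This is a minimal sketch, assuming a tab-separated table with a header line, the phenotype ID in column 1, and the h2 estimate in column 2; the function name `h2_filter` is hypothetical. The explicit NA test is there because awk falls back to string comparison when a field such as NA is compared against a number, so a bare `$2 > 0` may let NA rows through.

```shell
# h2_filter FILE MIN_H2
# Print phenotype IDs whose h2 estimate is numeric and strictly above MIN_H2.
# Assumed layout (hypothetical): header line, column 1 = ID, column 2 = h2.
h2_filter() {
    awk -F'\t' -v min_h2="$2" '
        NR > 1 && $2 != "NA" && ($2 + 0) > (min_h2 + 0) { print $1 }
    ' "$1"
}

# Example with a small made-up table.
example_h2=$(mktemp)
printf 'phenotype\th2\nA\t0.12\nB\tNA\nC\t-0.01\nD\t0.003\n' > "$example_h2"
h2_filter "$example_h2" 0
```

Raising the second argument imposes a stricter minimum h2; the threshold actually used in this analysis is h2 > 0.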

@yk-tanigawa
Contributor Author

The heritability filter reduced the combinations.

FinnGen: 1801 --> 1482
UKB: 580 --> 523

cat /oak/stanford/groups/mrivas/public_data/finngen_r3/ldsc_h2.tsv | awk 'NR>1 && $2 > 0' | wc

cat /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/05_gbe/phenotype_info.tsv | awk -v ukb_WB_min_N=1000 -v FS='\t' '(NR>1 && $8 >= ukb_WB_min_N){print $1}' | egrep '^HC' | sort | comm -12 /dev/stdin <(cat /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/h2.white_british.tsv | egrep '^HC' |  awk '($2 > 0){print $1}' | sort) | wc
  • 1801 * 580 = 1044580
  • 1482 * 523 = 775086

--> ~26% reduction (775,086 / 1,044,580 ≈ 0.74).
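A quick check of the reduction implied by the two products above, using only the trait counts listed in this comment:

```shell
# Trait counts before and after the h2 filter, as listed above.
before=$((1801 * 580))
after=$((1482 * 523))

# Percentage of comparisons removed by the filter.
reduction_pct=$(awk -v b="$before" -v a="$after" \
    'BEGIN { printf "%.1f", 100 * (b - a) / b }')

echo "$before -> $after comparisons (${reduction_pct}% fewer)"
```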

@yk-tanigawa
Contributor Author

171k comparisons are done so far.

find /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg -name "*.log" | wc -l
171894

@yk-tanigawa
Contributor Author

295,616 / 775,086 by now. ~980 jobs are running using Sherlock's owners queue.

Hopefully, we will have initial results by the weekend.

@yk-tanigawa
Contributor Author

580,450/775,086

@yk-tanigawa
Contributor Author

753,824/775,086 almost there

@yk-tanigawa
Contributor Author

This initial batch is now finished.

Here are the commands I used to submit the jobs and check the progress of the computation.

bash 7_ldsc_rg.generate_input.sh > 7_ldsc_rg.input.$(date +%Y%m%d-%H%M%S).tsv

seq 2 5000 > job.1.5000.lst

sbatch -p mrivas,normal,owners --nodes=1 --mem=8000 --cores=1 --time=3:00:00 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-1000 /oak/stanford/groups/mrivas/users/ytanigaw/repos/yk-tanigawa/resbatch/parallel-sbatch.sh 7_ldsc_rg.sh job.1.5000.lst 5
Submitted batch job 3634866

523 * 1482 = 775086

1-100000

seq 1 100000 > job.1.100000.lst
sbatch -p mrivas,normal,owners --nodes=1 --mem=8000 --cores=1 --time=12:00:00 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-1000 /oak/stanford/groups/mrivas/users/ytanigaw/repos/yk-tanigawa/resbatch/parallel-sbatch.sh 7_ldsc_rg.sh job.1.100000.lst 100

seq 775001 775086 > job.775001.775086.lst
sbatch -p mrivas,normal,owners --nodes=1 --mem=6000 --cores=1 --time=0:45:00 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-848 /oak/stanford/groups/mrivas/users/ytanigaw/repos/yk-tanigawa/resbatch/parallel-sbatch.sh 7_ldsc_rg.sh job.775001.775086.lst 6

seq 1 775086 > job.1.775086.lst
sbatch -p mrivas,normal,owners --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-999 $parallel_sbatch_sh 7_ldsc_rg.bugfixed.sh job.1.775086.lst 776
Submitted batch job 3776642
find /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg -name "*.log" | wc

@yk-tanigawa
Contributor Author

Tabulating the results into a table:

/oak/stanford/groups/mrivas/public_data/summary_stats/finngen_r3/UKB_WB_rg.20200711-165157.tsv

This would take some time...

@yk-tanigawa
Contributor Author

773,617 comparisons in

/oak/stanford/groups/mrivas/public_data/summary_stats/finngen_r3/UKB_WB_rg.20200711-171504.tsv.gz

It turned out that there are 775,265 LDSC rg log files, but some of them are empty (presumably because of interrupted jobs on the owners queue).

@yk-tanigawa
Contributor Author

We computed the LDSC rg between FinnGen sumstats (estimated heritability > 0) and UKB WB sumstats (HC phenotypes with estimated heritability > 0).

We investigated the distribution of p-values. Because there are ~580 UKB traits, we set a p-value threshold of 1e-4 and focused on the significant associations.

We also checked the distribution of rg.

After applying the p < 1e-4 filter, there are 3,597 rg estimates across 282 FinnGen phenotypes and 158 UKB HC phenotypes. We sorted the table by FinnGen phenocode and rg p-value and uploaded it to a Google Spreadsheet.

https://docs.google.com/spreadsheets/d/1ul4hr00KKZy0JRUW2ZW5-LORWEyeBCt7B3pAiNKRj5g/edit?usp=sharing

Analysis repo: https://github.com/rivas-lab/ukbb-tools/tree/master/04_gwas/extras/finngen_r3#ldsc-rg
[Figure: 11_LDSC_rg_dist]
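The filter-and-sort step described above can be sketched as follows. The column layout here is an assumption for illustration (column 1 = FinnGen phenocode, column 2 = UKB GBE_ID, column 3 = rg, column 4 = p); the aggregated table in this analysis may be laid out differently, and `filter_rg` is a hypothetical helper name. The sketch relies on `sort -g`, which is available in GNU and BSD sort, to order p-values written in scientific notation.

```shell
# filter_rg FILE P_THRESHOLD
# Keep rows with a numeric p-value below the threshold, then sort by
# FinnGen phenocode and by p-value. The header line is dropped.
# Assumed columns (hypothetical): 1 = FinnGen phenocode, 2 = UKB GBE_ID,
# 3 = rg, 4 = p.
filter_rg() {
    tab=$(printf '\t')
    awk -F'\t' -v thr="$2" '
        NR > 1 && $4 != "NA" && ($4 + 0) < (thr + 0)
    ' "$1" | LC_ALL=C sort -t "$tab" -k1,1 -k4,4g
}

# Example with a small made-up table.
example_rg=$(mktemp)
printf 'finngen\tukb\trg\tp\n' > "$example_rg"
printf 'F2\tHC1\t0.5\t1e-6\nF1\tHC2\t0.9\t3e-10\n' >> "$example_rg"
printf 'F1\tHC3\t0.2\t0.5\nF1\tHC4\tNA\tNA\nF1\tHC5\t0.7\t2e-5\n' >> "$example_rg"
filter_rg "$example_rg" 1e-4
```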

@yk-tanigawa
Contributor Author

To finish up the missing files (files of size 0), we resubmitted the jobs.

cd /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/finngen_r3

sbatch -p mrivas --qos=high_p --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-999 $parallel_sbatch_sh 7_ldsc_rg.sh job.1.775086.lst 776

Submitted batch job 3922105

@yk-tanigawa
Contributor Author

Commands to check the empty files:

cd /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc
find $(pwd)/UKB_WB_rg -type f -size 0 > UKB_WB_rg.size0.$(date +%Y%m%d-%H%M%S).lst
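Alongside the list of empty files, a one-line summary of complete versus empty logs can be handy when deciding whether to resubmit. The following is a minimal sketch; `count_logs` is a hypothetical helper, and it treats any non-empty `*.log` file as complete, which is an assumption (a truncated log would still count as complete).

```shell
# count_logs DIR
# Summarize how many *.log files under DIR are non-empty versus empty.
count_logs() {
    dir=$1
    total=$(find "$dir" -type f -name '*.log' | wc -l)
    empty=$(find "$dir" -type f -name '*.log' -size 0 | wc -l)
    echo "$((total - empty)) complete, $((empty)) empty, $((total)) total"
}

# Example on a scratch directory with one finished and one empty log.
example_dir=$(mktemp -d)
printf 'finished\n' > "$example_dir/a.log"
: > "$example_dir/b.log"
count_logs "$example_dir"
```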

@yk-tanigawa
Contributor Author

Empty files were recomputed.

@yk-tanigawa
Contributor Author

However, we do not yet have comparisons for cancer phenotypes, simply because we used only HC phenotypes from UKB.

Let's compute rg for UKB cancer phenotypes.

@yk-tanigawa
Contributor Author

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/finngen_r3
bash 7_ldsc_rg.generate_input.cancer.sh  7_ldsc_rg.cancer.$(date +%Y%m%d-%H%M%S)
7_ldsc_rg.cancer.20200712-135520
7_ldsc_rg.cancer.20200712-135520.cancer.ukb.tsv
7_ldsc_rg.cancer.20200712-135520.cancer.finngen.tsv
[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/07_LDSC/jobs/202007_LDSC]$ find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/h2 -name "*log" | sed -e 's%/oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/h2/white_british.%%g' | sed -e 's/.log//g'| sed -e 's/[0-9]//g' | sort | uniq -c
    218 BIN
    494 BIN_FC
     10 FH
   1246 HC
   1487 INI
     49 QT_FC

We don't have heritability estimates for cancer phenotypes.

@yk-tanigawa
Contributor Author

Let's make some progress on #21 first...

@yk-tanigawa
Contributor Author

Update the LDSC rg analysis with cancer and BIN_FC phenotypes (and a few updated HC phenotypes).

The UKB GWAS is now updated.

Let's generate the list of UKB (and FinnGen) traits and see how many phenotypes we have in HC, BIN_FC, and cancer.

$ bash 7_ldsc_rg.generate_input.sh 7_ldsc_rg.$(date +%Y%m%d-%H%M%S)

$ wc 7_ldsc_rg.20200718-152556*
  1483   2966 224301 7_ldsc_rg.20200718-152556.finngen.tsv
   991   1982 135005 7_ldsc_rg.20200718-152556.ukb.tsv
  2474   4948 359306 total

This is a good increase from the previous run.

We have rg computed for a subset of traits from the previous iteration -- let's skip those and push the computation for the rest of the traits.

rm 7_ldsc_rg.20200718-152556.finngen.tsv # this is the same as 7_ldsc_rg.20200706-144408.finngen.tsv

comm -23 <(cat 7_ldsc_rg.20200718-152556.ukb.tsv | tr '\t' ':' | sort) <(cat 7_ldsc_rg.20200706-144408.ukb.tsv | tr '\t' ':' | sort) | tr ':' '\t' | cat <(cat 7_ldsc_rg.20200718-152556.ukb.tsv | head -n1) /dev/stdin > 7_ldsc_rg.20200718-152556.diff.ukb.tsv

ln -s 7_ldsc_rg.20200706-144408.finngen.tsv 7_ldsc_rg.20200718-152556.diff.finngen.tsv

With this, we have the input files for the rg computation.

$ wc *7_ldsc_rg.20200718-152556.diff*
  1483   2966 224301 7_ldsc_rg.20200718-152556.diff.finngen.tsv
   468    936  67731 7_ldsc_rg.20200718-152556.diff.ukb.tsv
  1951   3902 292032 total
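The set-difference step above (rows in the new UKB list that were not in the previous one) can be illustrated on a toy input. This is a sketch only: `new_rows` is a hypothetical helper, it uses temporary files instead of process substitution so that it also runs in plain sh, and it assumes both files share the same header line.

```shell
# new_rows NEW_TSV OLD_TSV
# Print the header of NEW_TSV followed by the data rows that appear in
# NEW_TSV but not in OLD_TSV. Rows are compared as whole lines.
new_rows() {
    new_sorted=$(mktemp)
    old_sorted=$(mktemp)
    tail -n +2 "$1" | LC_ALL=C sort > "$new_sorted"
    tail -n +2 "$2" | LC_ALL=C sort > "$old_sorted"
    head -n 1 "$1"
    # comm -23 keeps lines unique to the first (new) file.
    LC_ALL=C comm -23 "$new_sorted" "$old_sorted"
    rm -f "$new_sorted" "$old_sorted"
}

# Example with made-up trait lists.
example_new=$(mktemp)
example_old=$(mktemp)
printf 'GBE_ID\tpath\nHC1\ta\nHC2\tb\nHC3\tc\n' > "$example_new"
printf 'GBE_ID\tpath\nHC1\ta\n' > "$example_old"
new_rows "$example_new" "$example_old"
```

Because whole lines are compared, a trait whose path changed between runs is also reported as new, which matches the intent of recomputing rg for updated phenotypes.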

We updated the script and tested its behavior:

bash 7_ldsc_rg.sh

There are 1483 * 468 = 694,044 comparisons. They can be covered by 999 array jobs running 695 rg computations each.

sbatch -p mrivas,normal,owners --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-999 $parallel_sbatch_sh 7_ldsc_rg.sh job.1.775086.lst 695

Submitted batch job 4284086
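The per-job batch size used in these submissions is a ceiling division of the number of comparisons by the number of array tasks. A minimal sketch follows; `per_task` is a hypothetical helper, and it assumes that the last argument passed to the wrapper script is the number of list entries handled by each array task, as the description above suggests.

```shell
# per_task N_ITEMS N_TASKS
# Smallest batch size such that N_TASKS batches cover N_ITEMS items.
per_task() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

per_task 775086 999   # batch size for the initial 523 * 1482 grid
per_task 694044 999   # batch size for the 1483 * 468 update
```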

@yk-tanigawa
Contributor Author

Let's check the jobs again.

1483 * 991 < 1470 * 1000

sbatch -p mrivas,normal,owners --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-1000 $parallel_sbatch_sh 7_ldsc_rg.sh job.1.775086.lst 1470

@yk-tanigawa
Contributor Author

LDSC rg table (full)

We used our scripts to aggregate the LDSC rg results.

    10_ldsc_rg_view_batch_step1.sh
    10_ldsc_rg_view_batch_step2.sh
    10_ldsc_rg_view_batch_step3.sh

This resulted in:

/scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg.20200722-093650/LDSC.rg.tsv.gz

@yk-tanigawa
Contributor Author

We computed the LDSC rg between FinnGen sumstats (estimated heritability > 0) and UKB WB sumstats (HC phenotypes with estimated heritability > 0).

We investigated the distribution of p-values. Because there are ~990 UKB traits, we set a p-value threshold of 5e-5 and focused on the significant associations.

[Figure: LDSC rg]

After applying the p < 5e-5 filter, there are 6,511 rg estimates across 289 FinnGen phenotypes and 292 UKB HC phenotypes. We sorted the table by FinnGen phenocode and rg p-value and uploaded it to a Google Spreadsheet.
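The thread does not spell out how the two thresholds were derived. One plausible reading, offered here as an assumption, is a Bonferroni-style correction of 0.05 divided by the number of UKB traits tested per FinnGen phenotype. The sketch below computes that value for both rounds; `bonferroni` is a hypothetical helper name.

```shell
# bonferroni ALPHA N_TESTS
# Per-test threshold under a Bonferroni correction (assumed rationale).
bonferroni() {
    awk -v alpha="$1" -v n="$2" 'BEGIN { printf "%.2e\n", alpha / n }'
}

bonferroni 0.05 580   # first round, ~580 UKB traits
bonferroni 0.05 990   # second round, ~990 UKB traits
```

Under this reading the values come out at roughly 8.6e-5 and 5.1e-5, close to the rounded thresholds of 1e-4 and 5e-5 used in the two rounds.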

@yk-tanigawa
Contributor Author

The corresponding directory in the codebase: https://github.com/rivas-lab/ukbb-tools/blob/master/04_gwas/extras/finngen_r3/README.md
