LDSC rg between FinnGen and UKB #27

yk-tanigawa · 2020-07-05T21:48:01Z

To generate a GBE_ID mapping between FinnGen and UKB phenotypes, we compute the LDSC genetic correlation (rg) between UKB and FinnGen summary statistics.

We prepared FinnGen in LDSC munge format here.

We are also preparing UKB in LDSC munge format in issue #26.

We use the UKB white British (WB) summary statistics for this rg analysis.


yk-tanigawa · 2020-07-05T22:55:48Z

The computation has started.

find /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg/ -name "*.log" | wc -l
415

While we can run this preliminary analysis, this approach is not scalable.

There are 1801 FinnGen summary statistics. When focusing on HC phenotypes with N_WB >= 1000, there are 580 traits.

This means > 1M (= 1801 * 580) comparisons, and each of them takes ~5 min...
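
As a rough sanity check of that estimate, the arithmetic can be scripted. This is a minimal sketch that only uses the trait counts and the ~5 min per comparison quoted above; the variable names are illustrative.

```shell
# Back-of-the-envelope cost of the all-vs-all rg run.
n_finngen=1801      # FinnGen summary statistics
n_ukb=580           # UKB HC phenotypes with N_WB >= 1000
min_per_rg=5        # approximate minutes per LDSC rg run

n_pairs=$(( n_finngen * n_ukb ))
cpu_hours=$(( n_pairs * min_per_rg / 60 ))   # integer division, rounds down

echo "pairs: ${n_pairs}"
echo "approx. CPU hours: ${cpu_hours}"
```

Even spread over ~1000 concurrent jobs, that is several days of wall-clock time, which is why a pre-filter is attractive.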

yk-tanigawa · 2020-07-06T00:48:17Z

A pilot analysis of 5000 comparisons is finished. Some comparisons returned NA, apparently due to low N or low h2 in the input trait.

To reduce the number of comparisons, we can run LDSC h2 first for both UKB and FinnGen, drop phenotypes with NA as the h2 estimates, and impose a min h2 threshold.
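
A minimal sketch of such a filter on a toy table. The two-column layout (phenotype, h2), the trait names, and the `min_h2` variable are assumptions for illustration, not the real h2 files. NA is excluded explicitly because awk compares a non-numeric field such as `NA` against a number as a string, so a bare `$2 > 0` test would let it through.

```shell
# Toy h2 table; the column layout (phenotype, h2) and the values are hypothetical.
tmp=$(mktemp -d)
{
  printf 'pheno\th2\n'
  printf 'TRAIT_A\t0.12\n'
  printf 'TRAIT_B\tNA\n'
  printf 'TRAIT_C\t-0.01\n'
  printf 'TRAIT_D\t0.03\n'
} > "$tmp/h2.tsv"

min_h2=0   # hypothetical minimum h2 threshold

# Skip the header, drop NA explicitly, then compare numerically
# ($2 + 0 forces a numeric comparison).
awk -F'\t' -v min_h2="$min_h2" \
  'NR > 1 && $2 != "NA" && ($2 + 0) > (min_h2 + 0) {print $1}' \
  "$tmp/h2.tsv" > "$tmp/keep.lst"

cat "$tmp/keep.lst"
```

Only TRAIT_A and TRAIT_D survive here: TRAIT_B has no estimate and TRAIT_C has a negative one.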

yk-tanigawa · 2020-07-06T21:29:35Z

The heritability filter reduced the combinations.

FinnGen: 1801 --> 1482
UKB: 580 --> 523

cat /oak/stanford/groups/mrivas/public_data/finngen_r3/ldsc_h2.tsv | awk 'NR>1 && $2 > 0' | wc

cat /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/05_gbe/phenotype_info.tsv | awk -v ukb_WB_min_N=1000 -v FS='\t' '(NR>1 && $8 >= ukb_WB_min_N){print $1}' | egrep '^HC' | sort | comm -12 /dev/stdin <(cat /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/h2.white_british.tsv | egrep '^HC' |  awk '($2 > 0){print $1}' | sort) | wc

1801 * 580 = 1044580
1482 * 523 = 775086

--> ~26% reduction.
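
The size of the reduction can be checked directly from the two products above. A minimal sketch using only those numbers:

```shell
before=$(( 1801 * 580 ))   # all FinnGen x UKB HC pairs
after=$(( 1482 * 523 ))    # pairs left after the h2 filter

# Percentage of comparisons removed by the filter.
pct=$(awk -v b="$before" -v a="$after" 'BEGIN { printf "%.1f", 100 * (b - a) / b }')
echo "${before} -> ${after} (${pct}% fewer comparisons)"
```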

yk-tanigawa · 2020-07-09T14:48:20Z

171k comparisons are done by now.

find /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg -name "*.log" | wc -l
171894

yk-tanigawa · 2020-07-09T21:56:03Z

295,616 / 775,086 by now. ~980 jobs are running using Sherlock's owners queue.
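
A progress figure like this can be produced by comparing the number of log files with the expected total. The sketch below runs on a throwaway directory; `out_dir`, the file names, and the toy total are hypothetical stand-ins for the real output directory and the 775,086 comparisons.

```shell
# Toy directory standing in for the rg output directory (hypothetical layout).
out_dir=$(mktemp -d)
for i in 1 2 3; do
  echo "done" > "$out_dir/pair_${i}.log"
done

n_total=12   # hypothetical total; 775086 in the run above
n_done=$(find "$out_dir" -type f -name "*.log" | wc -l | tr -d ' ')
pct=$(( 100 * n_done / n_total ))   # integer percent, rounds down

echo "${n_done} / ${n_total} (${pct}%)"
```

Note that this counts every log file, including empty ones left by interrupted jobs.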

Hopefully, we will have initial results by the weekend.

yk-tanigawa · 2020-07-10T15:23:47Z

580,450/775,086

yk-tanigawa · 2020-07-11T05:56:53Z

753,824/775,086 almost there

yk-tanigawa · 2020-07-11T23:59:02Z

This initial batch is now finished.

Here are the commands I used to submit the jobs and check the progress of the computation.

bash 7_ldsc_rg.generate_input.sh > 7_ldsc_rg.input.$(date +%Y%m%d-%H%M%S).tsv

seq 2 5000 > job.1.5000.lst

sbatch -p mrivas,normal,owners --nodes=1 --mem=8000 --cores=1 --time=3:00:00 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-1000 /oak/stanford/groups/mrivas/users/ytanigaw/repos/yk-tanigawa/resbatch/parallel-sbatch.sh 7_ldsc_rg.sh job.1.5000.lst 5
Submitted batch job 3634866

523 * 1482 = 775086

1-100000

seq 1 100000 > job.1.100000.lst
sbatch -p mrivas,normal,owners --nodes=1 --mem=8000 --cores=1 --time=12:00:00 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-1000 /oak/stanford/groups/mrivas/users/ytanigaw/repos/yk-tanigawa/resbatch/parallel-sbatch.sh 7_ldsc_rg.sh job.1.100000.lst 100

seq 775001 775086 > job.775001.775086.lst
sbatch -p mrivas,normal,owners --nodes=1 --mem=6000 --cores=1 --time=0:45:00 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-848 /oak/stanford/groups/mrivas/users/ytanigaw/repos/yk-tanigawa/resbatch/parallel-sbatch.sh 7_ldsc_rg.sh job.775001.775086.lst 6

seq 1 775086 > job.1.775086.lst
sbatch -p mrivas,normal,owners --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-999 $parallel_sbatch_sh 7_ldsc_rg.bugfixed.sh job.1.775086.lst 776
Submitted batch job 3776642

find /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg -name "*.log" | wc

yk-tanigawa · 2020-07-12T00:01:30Z

Tabulate the results into a table

/oak/stanford/groups/mrivas/public_data/summary_stats/finngen_r3/UKB_WB_rg.20200711-165157.tsv

This would take some time...

yk-tanigawa · 2020-07-12T01:14:44Z

773,617 comparisons in

/oak/stanford/groups/mrivas/public_data/summary_stats/finngen_r3/UKB_WB_rg.20200711-171504.tsv.gz

It turned out that there are 775,265 LDSC rg log files, but some of them are empty (presumably because jobs on the owners queue were interrupted).

yk-tanigawa · 2020-07-12T02:08:42Z

We computed the LDSC rg between FinnGen sumstats (estimated heritability > 0) and UKB WB sumstats (HC phenotypes with estimated heritability > 0).

We investigated the distribution of p-values. Because there are ~580 UKB traits, we set a p-value threshold of 1e-4 and focused on the associations passing it.
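
One way to read this threshold (an assumption, since the thread does not name the correction) is as a Bonferroni-style cutoff at a family-wise level of 0.05 across the UKB traits, rounded to a convenient value. A minimal sketch of that arithmetic, with `alpha` as the assumed level:

```shell
alpha=0.05     # assumed family-wise error rate (not stated in the thread)
n_traits=580   # approximate number of UKB traits

# Per-test threshold under a Bonferroni-style correction.
thr=$(awk -v a="$alpha" -v n="$n_traits" 'BEGIN { printf "%.1e", a / n }')
echo "per-test threshold: ${thr}"
```

The result is slightly below 1e-4, so the threshold used here is of the same order.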

We also checked the distribution of rg.

After imposing the p < 1e-4 filter, there are 3,597 rg estimates across 282 FinnGen phenotypes and 158 UKB HC phenotypes. We sorted the table by FinnGen phenocode and p-value of rg and uploaded it to a Google Spreadsheet.

https://docs.google.com/spreadsheets/d/1ul4hr00KKZy0JRUW2ZW5-LORWEyeBCt7B3pAiNKRj5g/edit?usp=sharing

Analysis repo: https://github.com/rivas-lab/ukbb-tools/tree/master/04_gwas/extras/finngen_r3#ldsc-rg
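
The filter-and-sort step described above can be sketched on a toy table. The column names (`finngen`, `ukb`, `rg`, `p`) and the rows are hypothetical; the real table layout is in the analysis repo. `sort -g` (general numeric) is used so that p-values in scientific notation order correctly; it is available in GNU and BSD sort.

```shell
tmp=$(mktemp -d)
{
  printf 'finngen\tukb\trg\tp\n'
  printf 'FG_B\tHC1\t0.80\t2e-10\n'
  printf 'FG_A\tHC2\t0.40\t3e-05\n'
  printf 'FG_A\tHC1\t0.90\t1e-20\n'
  printf 'FG_A\tHC3\t0.10\t0.2\n'
  printf 'FG_C\tHC1\tNA\tNA\n'
} > "$tmp/rg.tsv"

p_thr=1e-4
tab=$(printf '\t')

{
  head -n 1 "$tmp/rg.tsv"   # keep the header on top
  # Drop NA rows, keep p < threshold, then sort by phenocode and p-value.
  awk -F'\t' -v thr="$p_thr" 'NR > 1 && $4 != "NA" && ($4 + 0) < (thr + 0)' "$tmp/rg.tsv" \
    | sort -t "$tab" -k1,1 -k4,4g
} > "$tmp/rg.filtered.tsv"

cat "$tmp/rg.filtered.tsv"
```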

yk-tanigawa · 2020-07-12T05:18:05Z

To finish up the missing files (files of size 0), we resubmitted the jobs.

cd /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/finngen_r3

sbatch -p mrivas --qos=high_p --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-999 $parallel_sbatch_sh 7_ldsc_rg.sh job.1.775086.lst 776

Submitted batch job 3922105

yk-tanigawa · 2020-07-12T05:24:27Z

Commands to check the empty files:

cd /scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc
find $(pwd)/UKB_WB_rg -type f -size 0 > UKB_WB_rg.size0.$(date +%Y%m%d-%H%M%S).lst
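
To see how `-size 0` separates interrupted outputs from usable ones, here is a minimal sketch on a throwaway directory (the file names are made up):

```shell
log_dir=$(mktemp -d)
echo "rg results" > "$log_dir/a.log"
echo "rg results" > "$log_dir/b.log"
: > "$log_dir/c.log"     # empty file, as an interrupted job might leave behind

# Count empty and non-empty files separately.
n_empty=$(find "$log_dir" -type f -size 0 | wc -l | tr -d ' ')
n_ok=$(find "$log_dir" -type f ! -size 0 | wc -l | tr -d ' ')

echo "empty: ${n_empty}, non-empty: ${n_ok}"
```

The empty count is what has to drop to zero before the results are tabulated again.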

yk-tanigawa · 2020-07-12T20:36:02Z

Empty files were recomputed.

yk-tanigawa · 2020-07-12T20:36:30Z

We don't have comparisons for cancer phenotypes yet,
simply because we used only HC phenotypes from UKB.

Let's compute rg for UKB cancer phenotypes.

yk-tanigawa · 2020-07-12T21:02:15Z

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/finngen_r3
bash 7_ldsc_rg.generate_input.cancer.sh  7_ldsc_rg.cancer.$(date +%Y%m%d-%H%M%S)
7_ldsc_rg.cancer.20200712-135520
7_ldsc_rg.cancer.20200712-135520.cancer.ukb.tsv
7_ldsc_rg.cancer.20200712-135520.cancer.finngen.tsv

[ytanigaw@sh02-09n53 ~/repos/rivas-lab/ukbb-tools/07_LDSC/jobs/202007_LDSC]$ find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/h2 -name "*log" | sed -e 's%/oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/h2/white_british.%%g' | sed -e 's/.log//g'| sed -e 's/[0-9]//g' | sort | uniq -c
    218 BIN
    494 BIN_FC
     10 FH
   1246 HC
   1487 INI
     49 QT_FC

We don't have heritability estimates for cancer phenotypes.

yk-tanigawa · 2020-07-12T21:14:40Z

Let's make some progress on #21 first...

yk-tanigawa · 2020-07-19T00:02:02Z

Update the LDSC rg analysis with cancer and BIN_FC phenotypes (and a few updated HC phenotypes)

UKB GWAS is now updated.

Let's generate the list of UKB (and FinnGen) traits and see how many phenotypes we have in HC, BIN_FC, and cancer.

$ bash 7_ldsc_rg.generate_input.sh 7_ldsc_rg.$(date +%Y%m%d-%H%M%S)

$ wc 7_ldsc_rg.20200718-152556*
  1483   2966 224301 7_ldsc_rg.20200718-152556.finngen.tsv
   991   1982 135005 7_ldsc_rg.20200718-152556.ukb.tsv
  2474   4948 359306 total

This is a good increase from the previous run.

We have rg computed for a subset of traits from the previous iteration -- let's skip those and push the computation for the rest of the traits.

rm 7_ldsc_rg.20200718-152556.finngen.tsv # this is the same as 7_ldsc_rg.20200706-144408.finngen.tsv

comm -23 <(cat 7_ldsc_rg.20200718-152556.ukb.tsv | tr '\t' ':' | sort) <(cat 7_ldsc_rg.20200706-144408.ukb.tsv | tr '\t' ':' | sort) | tr ':' '\t' | cat <(cat 7_ldsc_rg.20200718-152556.ukb.tsv | head -n1) /dev/stdin > 7_ldsc_rg.20200718-152556.diff.ukb.tsv

ln -s 7_ldsc_rg.20200706-144408.finngen.tsv 7_ldsc_rg.20200718-152556.diff.finngen.tsv

With this, we have the input files for the rg computation.
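
The `comm -23` step above keeps only the rows that are new in the latest UKB list while preserving the header. A minimal sketch of the same idea on toy files, using temporary sorted copies instead of process substitution; the column names and rows are hypothetical.

```shell
tmp=$(mktemp -d)
printf 'GBE_ID\tpath\nHC1\t/a\nHC2\t/b\n' > "$tmp/old.tsv"
printf 'GBE_ID\tpath\nHC1\t/a\nHC2\t/b\ncancer1\t/c\nBIN_FC1\t/d\n' > "$tmp/new.tsv"

# comm needs sorted input; sort both files in the same (C) locale.
LC_ALL=C sort "$tmp/old.tsv" > "$tmp/old.sorted"
LC_ALL=C sort "$tmp/new.tsv" > "$tmp/new.sorted"

{
  head -n 1 "$tmp/new.tsv"                               # re-attach the header
  LC_ALL=C comm -23 "$tmp/new.sorted" "$tmp/old.sorted"  # lines only in the new list
} > "$tmp/diff.tsv"

cat "$tmp/diff.tsv"
```

The header is identical in both files, so `comm -23` drops it along with the already-computed traits; that is why it is added back explicitly.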

$ wc *7_ldsc_rg.20200718-152556.diff*
  1483   2966 224301 7_ldsc_rg.20200718-152556.diff.finngen.tsv
   468    936  67731 7_ldsc_rg.20200718-152556.diff.ukb.tsv
  1951   3902 292032 total

We updated the script and tested its behavior

bash 7_ldsc_rg.sh

There are 1483 * 468 = 694,044 comparisons. They can be covered with 695 rg computations per job * 999 jobs.

sbatch -p mrivas,normal,owners --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-999 $parallel_sbatch_sh 7_ldsc_rg.sh job.1.775086.lst 695

Submitted batch job 4284086
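
The per-job count is a ceiling division of the number of comparisons by the number of array tasks. The sketch below also shows how one task could map to a contiguous block of lines in the job list; that mapping is an assumption for illustration, since the internals of the `parallel-sbatch.sh` wrapper are not shown in this thread.

```shell
n_pairs=$(( 1483 * 468 ))
n_tasks=999
per_task=$(( (n_pairs + n_tasks - 1) / n_tasks ))   # ceiling division

# Hypothetical mapping of one array task to a block of job-list lines.
task_id=999                      # would be $SLURM_ARRAY_TASK_ID inside a job
start=$(( (task_id - 1) * per_task + 1 ))
end=$(( task_id * per_task ))
if [ "$end" -gt "$n_pairs" ]; then
  end=$n_pairs                   # the last task gets a shorter block
fi

echo "per task: ${per_task}; task ${task_id} -> lines ${start}-${end}"
```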

yk-tanigawa · 2020-07-21T18:14:37Z

Let's check the jobs again.

1483 * 991 < 1470 * 1000

sbatch -p mrivas,normal,owners --time=2-0:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=FGrg --output=logs/FGrg.%A_%a.out --error=logs/FGrg.%A_%a.err --array=1-1000 $parallel_sbatch_sh 7_ldsc_rg.sh job.1.775086.lst 1470

yk-tanigawa · 2020-07-22T18:19:39Z

LDSC rg table (full)

We used our scripts to aggregate the LDSC rg results.

    10_ldsc_rg_view_batch_step1.sh
    10_ldsc_rg_view_batch_step2.sh
    10_ldsc_rg_view_batch_step3.sh

This resulted in:

/scratch/groups/mrivas/public_data/summary_stats/finngen_r3/ldsc/UKB_WB_rg.20200722-093650/LDSC.rg.tsv.gz

yk-tanigawa · 2020-07-22T18:27:27Z

We computed the LDSC rg between FinnGen sumstats (estimated heritability > 0) and UKB WB sumstats (HC phenotypes with estimated heritability > 0).

We investigated the distribution of p-values. Because there are ~990 UKB traits, we set a p-value threshold of 5e-5 and focused on the associations passing it.

After imposing the p < 5e-5 filter, there are 6,511 rg estimates across 289 FinnGen phenotypes and 292 UKB HC phenotypes. We sorted the table by FinnGen phenocode and p-value of rg and uploaded it to a Google Spreadsheet.

Google Spreadsheet
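
Summary counts of this kind (number of rg estimates, distinct FinnGen phenotypes, distinct UKB phenotypes) can be derived from the filtered table. A minimal sketch on a toy table with hypothetical column names and rows:

```shell
tmp=$(mktemp -d)
{
  printf 'finngen\tukb\trg\tp\n'
  printf 'FG_A\tHC1\t0.9\t1e-20\n'
  printf 'FG_A\tHC2\t0.4\t3e-05\n'
  printf 'FG_B\tHC1\t0.8\t2e-10\n'
} > "$tmp/rg.filtered.tsv"

# Rows excluding the header, then distinct values in each ID column.
n_rg=$(awk 'NR > 1' "$tmp/rg.filtered.tsv" | wc -l | tr -d ' ')
n_fg=$(awk -F'\t' 'NR > 1 {print $1}' "$tmp/rg.filtered.tsv" | sort -u | wc -l | tr -d ' ')
n_ukb=$(awk -F'\t' 'NR > 1 {print $2}' "$tmp/rg.filtered.tsv" | sort -u | wc -l | tr -d ' ')

echo "${n_rg} rg estimates across ${n_fg} FinnGen and ${n_ukb} UKB phenotypes"
```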

yk-tanigawa · 2020-07-22T18:28:19Z

The corresponding directory in the codebase: https://github.com/rivas-lab/ukbb-tools/blob/master/04_gwas/extras/finngen_r3/README.md

yk-tanigawa added this to the GBE Global Meta-Analysis milestone Jul 5, 2020

yk-tanigawa self-assigned this Jul 5, 2020

yk-tanigawa mentioned this issue Jul 6, 2020

LDSC h2 for FinnGen #28

Closed

yk-tanigawa mentioned this issue Jul 11, 2020

GWAS finishing effort - re-run for 236 traits #24

Closed

yk-tanigawa mentioned this issue Jul 12, 2020

LDSC munge for UKB sumstats #26

Open

yk-tanigawa closed this as completed Jul 22, 2020
