
GWAS finishing effort - Simple line counts check #21

Closed
yk-tanigawa opened this issue Jul 4, 2020 · 12 comments

@yk-tanigawa

As a QC of the GWAS sum stats freeze, we perform line counts.

We identify the list of (pop, GBE_ID) pairs that satisfy the minimum sample size criterion (N >= 100). We then check whether the results for each pair are present in the array-combined/gwas/current directory.

For the files linked from the array-combined/gwas/current directory, we apply wc -l to check whether the sum stats are complete.
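
A minimal sketch of this check, assuming the files are gzip-compressed text, as the `.gz` links in this thread suggest. The helper names `count_lines_gz` and `is_complete` are hypothetical and are not the scripts used in this repository; 1080969 is the line count of the finalized files reported further down in this issue.

```shell
# Hypothetical helper: print "<path><TAB><line count>" for one gzipped file.
count_lines_gz() {
    local f="$1"
    local n
    # gzip -dc decompresses to stdout; tr strips the padding some wc builds add.
    n=$(gzip -dc "$f" | wc -l | tr -d '[:space:]')
    printf '%s\t%s\n' "$f" "$n"
}

# Hypothetical helper: succeed only if a count equals the expected complete count.
# The default of 1080969 is the count of the finalized files in this thread.
is_complete() {
    local n="$1"
    local expected="${2:-1080969}"
    [ "$n" -eq "$expected" ]
}
```

For example, `count_lines_gz some_file.gz` prints the path and its line count separated by a tab, and `is_complete 1080278` returns a non-zero status.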

Summary

missing sum stats

As of 2020/6/27, the following numbers of traits are missing from the gwas/current directory:

Screenshot 2020-07-04 13 22 55

The corresponding analysis notebook.

For the others and related populations, the jobs have been submitted.

incomplete sum stats

As of 2020/6/29, here is the summary of wc -l across populations.

Screenshot 2020-07-04 13 15 23

The corresponding analysis notebook.

@yk-tanigawa yk-tanigawa self-assigned this Jul 4, 2020
@yk-tanigawa

Update on counts

1. Finalized summary statistics files

wc_l population n
1080969 african 2696
1080969 e_asian 1917
1080969 non_british_white 3226
1080969 others 3294
1080969 related 3357
1080969 s_asian 2863
1080969 white_british 3587
1080600 others 1
1080600 related 3
1080278 others 144
1080278 related 83

2. Files that will be fixed with the on-going computation

wc_l population n
1059397 african 216
1059397 e_asian 252
1059397 non_british_white 148
1059397 s_asian 132
1059397 white_british 132

We are computing the sum stats for those in #19

3. File(s) that need attention

wc_l population n
1080567 white_british 1

I thought #17 had fixed this file, but that was not the case.

Also, #20 includes some fixes.

4. Other incomplete or missing files

There are 302 files that need to be generated and/or refreshed.

For more information, please check here

$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | wc
    302    2114   43103

$ cat gwas-current-gz-wc.20200704-155715.combined.tsv | awk '$5 < 1059397' | cut -f2 | sort | uniq -c
     20 african
     96 e_asian
     41 non_british_white
     12 others
     11 related
     14 s_asian
    108 white_british
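
The per-population tally above can also be produced in a single awk pass. This is a sketch that assumes, as the commands above imply, that the combined TSV is tab-delimited with the population in column 2 and the line count in column 5; `count_below_by_pop` is a hypothetical name, not a script from this repository.

```shell
# Hypothetical helper: read the combined TSV on stdin and print, per population,
# how many files have fewer lines than the given threshold.
count_below_by_pop() {
    local threshold="$1"
    awk -F'\t' -v t="$threshold" '
        # Only rows whose 5th column is a plain integer are counted (skips a header).
        $5 ~ /^[0-9]+$/ && ($5 + 0) < t { n[$2]++ }
        END { for (p in n) printf "%s\t%d\n", p, n[p] }
    ' | sort
}
```

It would be called as `count_below_by_pop 1059397 < gwas-current-gz-wc.20200704-155715.combined.tsv`.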

@yk-tanigawa

#20 is now finished.

@yk-tanigawa

So, in terms of the remaining jobs, we have

@yk-tanigawa

The patch (#19) generated 824 files with 1080278 lines and one file with 1080969 lines. The 824 files are 691 lines short of the full 1080969 because the chrY variants were skipped:

Skipping chrY in --glm regression on phenotype 'PHENO1'
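
The size of the gap can be checked with shell arithmetic. The function name `line_deficit` is hypothetical; the two counts are the ones reported in this thread.

```shell
# Hypothetical helper: how many lines a file is short of the expected count.
line_deficit() {
    local expected="$1"
    local observed="$2"
    echo $(( expected - observed ))
}

line_deficit 1080969 1080278   # prints 691
```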

@yk-tanigawa

Re-computing wc -l

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt

# gwas-current-gz-list.20200711-232847.txt
# 22040 lines

ml load resbatch
ml load R/3.6 gcc

sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-959 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200711-232847.txt 23

# Submitted batch job 3923943

bash check.missing_pop_GBE.sh
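
The array size in the sbatch call is consistent with a ceiling division of the list length by the trailing argument. This is only a sketch: reading the final `23` as the number of files handled per array task is an assumption inferred from the numbers in this thread, and `n_array_tasks` is a hypothetical name.

```shell
# Hypothetical helper: number of array tasks needed to cover n_files
# when each task handles per_task files (ceiling division).
n_array_tasks() {
    local n_files="$1"
    local per_task="$2"
    echo $(( (n_files + per_task - 1) / per_task ))
}

n_array_tasks 22040 23   # prints 959
```

The same calculation for a 22172-line list gives 964.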

@yk-tanigawa

missing_pop_GBE.minN100.20200711-233434.tsv

To do: aggregate the wc -l results following the instructions here

@yk-tanigawa

Aggregate the wc -l results

find logs/ -name "wc.392*err" | parallel 'tail {}' | grep array-end | wc -l
959

bash gwas-current-gz-wc-cat.sh
gwas-current-gz-wc.20200712-071851.tsv

rm gwas-current-gz-list.20200711-232847.txt
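
The completion check above counts the logs that contain the `array-end` marker. When that count falls short of the array size, a loop like the following could list the task IDs that did not finish. This is a sketch: it assumes the log naming from the sbatch `--error=logs/wc.%A_%a.err` pattern, it searches the whole log rather than only its tail, and `missing_tasks` is a hypothetical name.

```shell
# Hypothetical helper: print the array task IDs (1..n_tasks) whose .err log
# is absent or does not contain the "array-end" marker.
missing_tasks() {
    local log_dir="$1"
    local job_id="$2"
    local n_tasks="$3"
    local i=1
    while [ "$i" -le "$n_tasks" ]; do
        # grep -q exits non-zero when the marker is missing or the file is unreadable.
        if ! grep -q 'array-end' "$log_dir/wc.${job_id}_${i}.err" 2>/dev/null; then
            echo "$i"
        fi
        i=$(( i + 1 ))
    done
}
```

For the job above it would be called as `missing_tasks logs 3923943 959`; empty output means every task reported `array-end`.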

@yk-tanigawa

Update on counts

$ cat gwas-current-gz-wc.20200712-071851.tsv | awk '(NR>1){print $3}' | sort -nr | uniq -c
  20952 1080969
      4 1080600
      1 1080567
   1051 1080278
      1 1024521
      1 1001756
      1 891019
      1 890538
      1 837659
      1 836299
      1 779198
      1 727614
      1 726430
      1 725231
      1 672894
      1 671362
      1 454904
      1 412155
      1 402771
      1 400782
      1 399908
      1 347741
      1 345881
      1 314970
      1 238354
      1 238278
      1 184890
      1 183541
      1 163630
      1 163593
      1 130073
      1 129516
      1 106772
      1 75190
      1 75167
      1 21171
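
As a sanity check, the counts in the tally above can be summed and compared with the length of the input list. A minimal sketch, with `sum_tally` as a hypothetical helper name:

```shell
# Hypothetical helper: sum the count column (first field) of `uniq -c` output
# read from stdin.
sum_tally() {
    awk '{ s += $1 } END { print s + 0 }'
}
```

Piping the output of the command above into `sum_tally` should give 22040 (20952 + 4 + 1 + 1051 + 32 single-file rows), which matches the number of lines in gwas-current-gz-list.20200711-232847.txt.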

We've already investigated the following:

Unknown error (?)

on-going effort

@yk-tanigawa

wc -l refresh

cd ~/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/gwas/current -name "*.gz" -type l | sort > /oak/stanford/groups/mrivas/users/ytanigaw/repos/rivas-lab/ukbb-tools/04_gwas/extras/202006-GWAS-finish/gwas-current-gz-list.$(date +%Y%m%d-%H%M%S).txt

# gwas-current-gz-list.20200717-000322.txt
# 22172 lines

ml load resbatch
ml load R/3.6 gcc

sbatch -p mrivas --qos=high_p --time=1:0:00 --mem=6000 --nodes=1 --cores=1 --job-name=wc --output=logs/wc.%A_%a.out --error=logs/wc.%A_%a.err --array=1-964 $parallel_sbatch_sh gwas-current-gz-wc.sh gwas-current-gz-list.20200717-000322.txt 23

# Submitted batch job 4220848

bash check.missing_pop_GBE.sh
# missing_pop_GBE.minN100.20200717-001023.tsv

@yk-tanigawa

Screenshot 2020-07-17 10 02 13

@yk-tanigawa

Screenshot 2020-07-17 17 40 23

@yk-tanigawa

The line count looks good.

Based on these results, we started the following computation:

We can now jump on QC: #32
