Clumping and pairwise LD for specified SNP lists #24

Open
explodecomputer opened this issue Sep 24, 2018 · 2 comments

Comments

@explodecomputer

I was alerted to your package recently and it looks extremely valuable, congratulations!

I have a couple of feature requests; apologies if these are already implemented and I missed them in the documentation.

  1. Clumping: SNPs are ordered by their GWAS p-value and filtered iteratively, removing any SNP in LD with the SNP that has the lowest p-value
  2. Creating an LD matrix for a specified list of SNPs (rather than for a region)
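
To make the second request concrete, here is a minimal sketch of a pairwise LD matrix for an arbitrary SNP list. It is not Tomahawk's method or API: it assumes unphased genotype dosages (0/1/2) held in a plain dictionary and estimates r² as the squared Pearson correlation between dosages. The names `pearson_r2`, `ld_matrix`, and the toy genotypes are hypothetical.

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A monomorphic SNP has no variance, so r^2 is undefined.
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


def ld_matrix(genotypes, snp_ids):
    """Return the r^2 matrix for snp_ids, in the order given.

    genotypes: dict mapping SNP id -> list of dosages (0/1/2), one per individual.
    """
    return [
        [pearson_r2(genotypes[a], genotypes[b]) for b in snp_ids]
        for a in snp_ids
    ]


# Toy data (made up for illustration): 3 SNPs genotyped in 6 individuals.
toy_genotypes = {
    "rsA": [0, 1, 2, 1, 0, 2],
    "rsB": [0, 1, 2, 1, 0, 2],
    "rsC": [0, 0, 1, 1, 2, 2],
}
matrix = ld_matrix(toy_genotypes, ["rsA", "rsB", "rsC"])
for snp_id, row in zip(["rsA", "rsB", "rsC"], matrix):
    print(snp_id, [round(v, 4) for v in row])
```

The point is that the SNPs in the list need not be contiguous or even on the same chromosome; the caller just names the ones they want.
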
@mklarqvist
Owner

Thanks for these suggestions, @explodecomputer. These features are not yet implemented in Tomahawk.

There is a big update coming to Tomahawk in the next few weeks and I will be sure to implement your suggestions.

@explodecomputer
Author

Hi @mklarqvist, I just wanted to follow up on this. We have a service that performs LD calculations on the fly, currently using plink 1.9. This is the service: https://gwas-api.mrcieu.ac.uk/
The scale of the problem is typically pretty small. There are, say, 5000 SNPs that reach genome-wide significance, and we need to clump them, meaning:

  1. Rank the p-values from lowest to highest
  2. Remove any SNP that is in LD with the top hit above some threshold and within a physical distance window of it
  3. Return to (2) with the new top hit among the remaining SNPs
  4. Once no SNPs are left, return each independent top hit
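
The steps above can be sketched in a few lines of Python. This is only an illustration of the greedy procedure, not plink's or Tomahawk's implementation: the `Snp` record, the `clump` function, the `r2` callback that looks up LD between two SNPs, and the default threshold and window values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snp:
    rsid: str
    chrom: str
    pos: int      # base-pair position
    pval: float   # GWAS p-value


def clump(snps, r2, r2_threshold=0.1, window_bp=250_000):
    """Greedy clumping: keep the top hit, drop its LD partners, repeat.

    r2 is any callable taking two Snp objects and returning their LD (r^2).
    The threshold and window defaults are illustrative values only.
    """
    # Step 1: rank by p-value, lowest first.
    remaining = sorted(snps, key=lambda s: s.pval)
    index_snps = []
    while remaining:
        # The current top hit is retained as an independent signal.
        top = remaining.pop(0)
        index_snps.append(top)
        # Step 2: remove SNPs inside the window that are in LD with the top hit.
        remaining = [
            s for s in remaining
            if not (
                s.chrom == top.chrom
                and abs(s.pos - top.pos) <= window_bp
                and r2(top, s) >= r2_threshold
            )
        ]
        # Step 3: the loop continues with the new top hit.
    # Step 4: nothing left, so return the independent top hits.
    return index_snps


# Toy example with a made-up LD lookup table.
toy_snps = [
    Snp("rs1", "1", 1_000, 1e-10),
    Snp("rs2", "1", 2_000, 1e-8),
    Snp("rs3", "1", 900_000, 1e-9),
    Snp("rs4", "2", 1_500, 1e-6),
]
toy_ld = {frozenset({"rs1", "rs2"}): 0.8}


def toy_r2(a, b):
    return toy_ld.get(frozenset({a.rsid, b.rsid}), 0.0)


kept = [s.rsid for s in clump(toy_snps, toy_r2)]
print(kept)  # rs2 is dropped because it is in LD with rs1
```

Passing LD in as a callback keeps the clumping logic independent of how r² is obtained, so the same loop would work whether the values come from plink, Tomahawk, or a precomputed table.
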

It's quite a simple algorithm, and plink 1.9 performs well on the LD reference panel that we're using, which is ~500 European individuals from the 1000 Genomes data, retaining only SNPs with MAF > 0.01.

Running clumping on, say, 2000 SNPs in this reference dataset in plink takes around 5 seconds.

The next thing that we want to do is increase the sample size of this reference dataset so that more precise estimates of LD can be obtained. Tomahawk looks like a potentially good choice, but I just wanted to get your advice on this before I explore further.

If clumping isn't implemented I'm happy to try implementing it in a fork and create a pull request.
Also - do you have plans to allow indels to be included?
Thanks!
