Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve chromosome standardization #157

Closed
wants to merge 3 commits into from
Closed

Improve chromosome standardization #157

wants to merge 3 commits into from

Conversation

mykmal
Copy link
Contributor

@mykmal mykmal commented Jul 11, 2023

This update makes the following three changes, as also listed in the NEWS.md file:

  • check_chr() now automatically removes all SNPs with nonstandard CHR entries (anything other than 1-22, X, Y, and MT)
  • check_chr() now ensures that the "chr" prefix is lowercase if kept
  • The rmv_chrPrefix parameter is no longer ignored when rmv_chr is NULL

Initially my plan was to only fix the bug in the third bullet point, but I ended up rewriting the entire check_chr() function to improve how the CHR column is standardized. I think that the new function is more efficient and handles some additional edge cases that could trip up the current version of MSS.

I tested the updated package on my GWAS data and it appears to work as expected, but I'm happy to explain my code in more detail and/or make any additional changes if needed.

@Al-Murphy
Copy link
Owner

Thanks for your input! Your changes make a lot of sense and do help clean up that function. Really not sure why I never used the rmv_chrPrefix parameter. I have run checks locally and it all seems fine to me.- Just two small changes before I merge:

  • You set the acceptable chromosomes when keeping the chr prefix to be:
    standard_chrs <- c(paste0("chr", 1:22), "X", "Y", "MT")
    In your opinion, is that standard to have X rather than chrX - can you point to somewhere this is the standard? If this is the case (which I believe so), I would want to add a line to convert any chrX, chrY and chrMT to X, Y, MT. Could you add this to your else condition on line 42 of your check_chr() updated function
  • I'm not sure what file did this but on your version I'm getting the following bioc check error (I don't get this on my master branch):
  • Checking individual file sizes...
    • WARNING: Package files exceed the 5MB size limit.

It appears one file you added to the git (and then likely removed after but is still in history? Is too large. Could you have a look at removing the file causing the issue from git history? I have info on doing that here You may find it cleaner to just clone the repo and put all your changes and resubmit a pull request? Up to you!

  • Secondly, this is optional but I think you should add your name to the list of contributors in the README with a link to your github page.

Alan.

@mykmal
Copy link
Contributor Author

mykmal commented Jul 11, 2023

Thank you for carefully reviewing my (very first) pull request! Below are responses to each of your points.

CHR standards

In your opinion, is that standard to have X rather than chrX - can you point to somewhere this is the standard?

Good catch! Here are the commonly used standards:

> GenomeInfoDb::genomeStyles("Homo sapiens")
   circular  auto   sex NCBI  UCSC dbSNP Ensembl
1     FALSE  TRUE FALSE    1  chr1   ch1       1
2     FALSE  TRUE FALSE    2  chr2   ch2       2
3     FALSE  TRUE FALSE    3  chr3   ch3       3
4     FALSE  TRUE FALSE    4  chr4   ch4       4
5     FALSE  TRUE FALSE    5  chr5   ch5       5
6     FALSE  TRUE FALSE    6  chr6   ch6       6
7     FALSE  TRUE FALSE    7  chr7   ch7       7
8     FALSE  TRUE FALSE    8  chr8   ch8       8
9     FALSE  TRUE FALSE    9  chr9   ch9       9
10    FALSE  TRUE FALSE   10 chr10  ch10      10
11    FALSE  TRUE FALSE   11 chr11  ch11      11
12    FALSE  TRUE FALSE   12 chr12  ch12      12
13    FALSE  TRUE FALSE   13 chr13  ch13      13
14    FALSE  TRUE FALSE   14 chr14  ch14      14
15    FALSE  TRUE FALSE   15 chr15  ch15      15
16    FALSE  TRUE FALSE   16 chr16  ch16      16
17    FALSE  TRUE FALSE   17 chr17  ch17      17
18    FALSE  TRUE FALSE   18 chr18  ch18      18
19    FALSE  TRUE FALSE   19 chr19  ch19      19
20    FALSE  TRUE FALSE   20 chr20  ch20      20
21    FALSE  TRUE FALSE   21 chr21  ch21      21
22    FALSE  TRUE FALSE   22 chr22  ch22      22
23    FALSE FALSE  TRUE    X  chrX   chX       X
24    FALSE FALSE  TRUE    Y  chrY   chY       Y
25     TRUE FALSE FALSE   MT  chrM  chMT      MT

Notice that the NCBI and Ensembl styles are identical, and match the format that MSS enforces when rmv_chrPrefix is TRUE. On the other hand, the UCSC style includes a "chr" prefix and codes mitochondrial DNA as "chrM" (without the T). In light of this, do you think it would make more sense to replace the rmv_chrPrefix parameter with a chr_style parameter that lets users choose between the NCBI/Ensembl and UCSC styles? (I've never encountered the dbSNP style in GWAS, but I could also provide that option if you want.)

File size checks

It appears one file you added to the git (and then likely removed after but is still in history? Is too large.

Weird, I'm not sure what is causing that. Once we decide on how to handle chromosome styles, I'll go ahead and re-commit my changes in a clean branch, as you suggested.

Contributor list

Secondly, this is optional but I think you should add your name to the list of contributors in the README with a link to your github page.

Thanks for inviting me to join the list of contributors! I started actively using MSS this summer and find it very convenient, so this will likely not be my last contribution.

@Al-Murphy
Copy link
Owner

Thank you for carefully reviewing my (very first) pull request!

Thank you very much for contributing! People doing so really takes (at least a little of) the burden of upkeep of the package off me and ensures it remains useful going forward.

In light of this, do you think it would make more sense to replace the rmv_chrPrefix parameter with a chr_style parameter that lets users choose between the NCBI/Ensembl and UCSC styles? (I've never encountered the dbSNP style in GWAS, but I could also provide that option if you want.)

I think adding a chr_style parameter instead is a great idea - can you set the default to Ensembl/NCBI and add an explanation and example of each style in the documentation? Probably best to also mention that this used to be rmv_chrPrefix in that parameter documentation too. Then just enforce that the user picks one of these 3 in the validate_parameters function? rmv_chrPrefix was actually never passed to this function but it should have been too.

Weird, I'm not sure what is causing that. Once we decide on how to handle chromosome styles, I'll go ahead and re-commit my changes in a clean branch, as you suggested.

Great, thanks!!

Thanks for inviting me to join the list of contributors! I started actively using MSS this summer and find it very convenient, so this will likely not be my last contribution.

Good to hear you find it useful, always happy to add in new features!

Cheers,
Alan.

@mykmal
Copy link
Contributor Author

mykmal commented Jul 11, 2023

Closing this in order to open a clean PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants