Regression, Set based clumping and Refactoring

@choishingwan choishingwan released this 08 Oct 20:27
· 506 commits to master since this release

Update Log

  • Refactored almost the entire code base to make the code cleaner and easier to read, which should hopefully reduce the number of bugs (the code related to permutation has not been refactored yet)
  • Fixed a bug in set-based clumping. When more than 62 sets (or 30 on a 32-bit machine) were provided, only the last few sets were properly clumped, which could leave some correlated SNPs in the earlier sets
  • The new GLM algorithm in PRSice was sensitive to collinearity and could give results very different from those calculated in R. This problem is now fixed
  • Fixed a problem with --target-list in the Rscript
  • Made some changes to the log to make things a bit clearer
  • Added some more unit tests
  • Fixed a problem when bgen files are used for --ld, where the sample size could be wrong when no external sample is provided and a phenotype file is provided for the target
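
The note on set-based clumping does not describe the cause of the bug. The thresholds (62 and 30) sit close to the 64-bit and 32-bit word sizes, so one plausible reading is that set membership is tracked as bit flags. Under that assumption only, the sketch below shows how membership for an arbitrary number of sets can be spread over several fixed-width words so that every set is considered during clumping. All names here (WORD_BITS, make_flags, remove_clumped and so on) are hypothetical and are not taken from the PRSice code base.

```python
WORD_BITS = 64  # assumed word width; a 32-bit build would use 32


def make_flags(num_sets):
    """Allocate enough words to hold one membership bit per set."""
    num_words = (num_sets + WORD_BITS - 1) // WORD_BITS
    return [0] * num_words


def set_flag(flags, set_idx):
    """Mark a SNP as a member of set `set_idx`."""
    flags[set_idx // WORD_BITS] |= 1 << (set_idx % WORD_BITS)


def has_flag(flags, set_idx):
    """Return True if the SNP is a member of set `set_idx`."""
    return bool((flags[set_idx // WORD_BITS] >> (set_idx % WORD_BITS)) & 1)


def shared_sets(a, b):
    """List the sets that contain both SNPs, checking every word."""
    total_bits = len(a) * WORD_BITS
    return [i for i in range(total_bits) if has_flag(a, i) and has_flag(b, i)]


def remove_clumped(index_flags, target_flags):
    """Clear from the target SNP every set it shares with the index SNP.

    Returns True when the target SNP is left with no set membership,
    i.e. it has been clumped out of every set it belonged to.
    """
    for w in range(len(target_flags)):
        target_flags[w] &= ~index_flags[w]
    return not any(target_flags)


# Example with 100 sets, which needs two 64-bit words.
index_snp = make_flags(100)
target_snp = make_flags(100)
set_flag(index_snp, 3)
set_flag(index_snp, 70)
set_flag(target_snp, 70)
set_flag(target_snp, 99)
fully_clumped = remove_clumped(index_snp, target_snp)
```

The point of the sketch is that the set index is split into a word index and a bit offset, so sets in early words are handled the same way as sets in the last word.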

Manually tested features

PRSice has a lot of functionality and I have not been able to write unit tests for most of the features (unit test coverage is currently less than 20%). The following features were tested manually using some toy data:

  • Binary PLINK input
    1. Clumping should generate results identical to PLINK 1.9
    2. PRS calculation should be identical to PLINK 1.9 (after accounting for flipping and when using the same input)
    3. MAF filtering should be identical to that calculated in PLINK (when there is no founder)
    4. Genotype missingness should be identical to that calculated in PLINK
    5. Clumping with a reference panel should generate results identical to PLINK
    6. Filtering on the LD reference panel works as expected
  • Binary GEN input
    1. Clumping should generate results identical to PLINK 1.9 (regardless of whether we use --hard or --allow-inter)
    2. Automatic hard coding (--hard) should generate PRS identical to those calculated using PLINK
    3. PRS calculated using dosage scoring (without --hard) are highly correlated with those generated with --hard
    4. Geno filtering and MAF filtering on the target sample work as expected
    5. Geno filtering and MAF filtering on the reference sample work as expected
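
To illustrate why hard-coded and dosage-based scores are expected to be highly correlated rather than identical, here is a minimal sketch. It assumes each SNP comes with three genotype probabilities for the effect allele and that the score is the per-allele average of effect size times allele count. The threshold rule in hard_call and the averaging in average_score are illustrative assumptions, not the exact rules PRSice applies for --hard or --score.

```python
def expected_dosage(probs):
    """probs = (P(0 copies), P(1 copy), P(2 copies)) of the effect allele."""
    return probs[1] + 2.0 * probs[2]


def hard_call(probs, threshold=0.9):
    """Best-guess genotype, or None (missing) if no probability reaches
    the threshold. The threshold value is a hypothetical choice."""
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def average_score(betas, genotypes):
    """Average of beta * allele count over non-missing SNPs, per allele
    (diploid, so the denominator is 2 * number of SNPs used)."""
    used = [(b, g) for b, g in zip(betas, genotypes) if g is not None]
    if not used:
        return 0.0
    return sum(b * g for b, g in used) / (2 * len(used))


# Toy data for one sample: two SNPs with well-imputed genotypes.
betas = [0.5, -0.2]
probs = [(0.0, 0.05, 0.95), (0.98, 0.02, 0.0)]

dosage_score = average_score(betas, [expected_dosage(p) for p in probs])
hard_score = average_score(betas, [hard_call(p) for p in probs])
```

With confident genotype probabilities the two scores are close but not equal, because the dosage keeps the residual uncertainty that hard coding rounds away.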

Things that we have not tested

  1. We have not tested data with founder samples
  2. We have only used the default --missing and --score parameters for our testing
  3. All permutation algorithms (--perm and --set-perm)
  4. We have only tested the default genetic model (additive) of the --model parameter
  5. The window compilation has not been tested
  6. --no-regress and --all-score were also not tested, but should in theory be OK

Features that might be problematic (use with caution)

  1. The INFO score calculated for --info filtering differs from that calculated by qctool (the two are correlated, but in some situations differ quite a lot). We have contacted the author of qctool to find out whether this is an algorithmic difference or a bug in PRSice. We have tried to follow the algorithm in the MaCH paper, and in our manual testing the numbers calculated by PRSice and those calculated manually are identical.
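
For readers who want to repeat the manual check, the sketch below computes the commonly cited MaCH-style r-squared measure: the observed variance of the dosages divided by the variance expected under Hardy-Weinberg equilibrium, 2p(1-p). This is an assumption about which quantity is meant; the function name mach_rsq and the handling of monomorphic SNPs are hypothetical and may differ from what PRSice or qctool do.

```python
def mach_rsq(dosages):
    """MaCH-style imputation quality from expected allele dosages (0..2).

    Ratio of the observed dosage variance to the variance expected
    under Hardy-Weinberg equilibrium, 2 * p * (1 - p).
    """
    n = len(dosages)
    mean = sum(dosages) / n
    p = mean / 2.0  # estimated allele frequency
    if p <= 0.0 or p >= 1.0:
        # Monomorphic SNP: the ratio is undefined. Returning 1.0 is an
        # arbitrary convention chosen for this sketch.
        return 1.0
    observed_var = sum(d * d for d in dosages) / n - mean * mean
    expected_var = 2.0 * p * (1.0 - p)
    return observed_var / expected_var


# Dosages that are exact genotypes versus dosages shrunk towards the mean.
certain = mach_rsq([0.0, 1.0, 2.0, 1.0])
uncertain = mach_rsq([0.5, 1.0, 1.5, 1.0])
```

Dosages that are shrunk towards the mean have less variance than the allele frequency predicts, so the measure drops below 1.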

Note

I try my best to test as many features as possible and am trying to implement as many unit tests as possible. However, the lack of manpower means that there will always be features that I have missed or things that are not thoroughly tested.

Please let us know if there are any problems or if PRSice didn't generate the expected results.