- Cope with new PLINK2 URLs in `download_plink2()`.
- Now properly errors when `ncol(G) != length(infos.chr)` in `snp_clumping()`. Also when `nrow(gwas) != length(infos.chr)` in `snp_manhattan()`.
- Add function `snp_projectSelfPCA()` (only `bed_projectSelfPCA()` existed).
- Add a new `min.maf = 0.02` parameter to `snp_autoSVD()` and `bed_autoSVD()`. Variants are now discarded when they have either a small MAC or a small MAF.
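
To illustrate what the two thresholds measure, here is a minimal base R sketch that computes the minor allele count (MAC) and minor allele frequency (MAF) of each variant in a small genotype matrix. The toy matrix, the `min.mac = 2` cutoff and the `>=` comparison are assumptions made for this example; this is not the package's internal code.

```r
# Toy genotype matrix (individuals x variants), coded 0/1/2; illustrative data only
G <- cbind(
  c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0),  # rare variant: a single minor allele
  c(0, 1, 2, 1, 0, 1, 2, 0, 1, 1),  # common variant
  c(2, 2, 2, 1, 2, 2, 2, 2, 2, 2)   # here the minor allele is the non-coded one
)

n_alleles <- 2 * nrow(G)           # total number of alleles per variant
ac  <- colSums(G)                  # count of the coded allele
mac <- pmin(ac, n_alleles - ac)    # minor allele count
maf <- mac / n_alleles             # minor allele frequency

min.mac <- 2      # hypothetical threshold chosen for this example
min.maf <- 0.02   # default value mentioned in the entry above

# Keep a variant only if it passes both thresholds
keep <- mac >= min.mac & maf >= min.maf
which(keep)
```

In this toy example the first and third variants pass the MAF threshold but not the MAC one, which is why using both criteria can matter in small samples.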
- Can now use matrix accessors for class `bed_light` as well.
- Minor improvements to `snp_autoSVD()` and `bed_autoSVD()`:
  - error when `min.mac = 0`,
  - return a better `attr(, "lrldr")`.
- In functions `snp_autoSVD()` and `bed_autoSVD()`, now perform the MAC thresholding before the clumping step. This reordering should not change results, but it should be faster.
- In function `snp_ancestry_summary()`, add parameter `sum_to_one` to optionally allow ancestry coefficients to have a sum lower than 1 (when `FALSE`; default is `TRUE`).
- In function `snp_modifyBuild()`, you can now provide `local_chain` as a vector of two, for when using `check_reverse`. You can now also modify the `base_url` from where to download the chain files.
- In function `snp_ancestry_summary()`, now also report correlations between the input frequencies and each set of reference frequencies, as well as the predicted frequencies. Also add a new parameter `min_cor` to error when the latter correlation is too small.
- In function `snp_modifyBuild()`, fix a broken FTP link, and add the possibility to use a local chain file specified by the new parameter `local_chain`.
- Fix issue with `snp_subset()` when either `$fam` or `$map` is missing.
- Add function `snp_asGeneticPos2()` (and `download_genetic_map()`) where you can provide any reference genetic map as a data frame. This function uses linear interpolation to transform physical positions (in bp) to genetic positions (in cM).
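
The interpolation idea can be sketched in base R with `approx()`. The map values, the column names `pos` and `cM`, and the choice of `rule = 2` for positions outside the map are assumptions for illustration only; they do not describe the actual arguments or behavior of `snp_asGeneticPos2()`.

```r
# Hypothetical reference genetic map for one chromosome (illustrative values only)
map <- data.frame(
  pos = c(1e6, 2e6, 4e6),   # physical positions, in bp
  cM  = c(0.0, 1.5, 2.5)    # genetic positions, in cM
)

# Physical positions of the variants to convert
pos <- c(1.5e6, 2e6, 3e6)

# Linear interpolation between the two flanking map points;
# rule = 2 reuses the closest map value for positions outside the map range
gen_pos <- approx(x = map$pos, y = map$cM, xout = pos, rule = 2)$y
gen_pos
```

A variant halfway between two map points receives the average of their genetic positions, and a variant located exactly on a map point receives that point's value.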
- Add parameter `p_bounds` in LDpred2-auto to provide bounds for the estimate of the polygenicity p.
- Fix sampling issue of `snp_simuPheno()` when `length(ind.possible)` is 1.
- Implement matrix accessors `[,]` for bed objects.
- Use a safer detection of strong divergences in LDpred2 and lassosum2.
- Add new parameter `ind.corr` to `snp_lassosum2()`, `snp_ldpred2_grid()` and `snp_ldpred2_auto()` to be able to use a subset of `corr` without making a copy of it.
- Add new parameter `ind.beta` to `snp_ldsc2()` to use a subset of the full LD scores corresponding to `df_beta`.
- Add new parameter `pos_scaled` to `snp_ldsplit()`.
- Fix C++ code that used integers to store the positions for clumping.
- Add other architectures (AMD / ARM) as options for PLINK2.
- Add option `use_MLE` in LDpred2-auto to allow, when using `FALSE`, for running LDpred2-auto as in previous versions (e.g. v1.10.8), which did not include alpha in the model. Default is `TRUE`.
- Detect strong divergence in LDpred2-auto, and return missing values in that case.
- Fix font rendering issue of `>=` in subtitle of `snp_manhattan()`.
- Autocomplete PLINK builds to be downloaded (fix #383).
- Extend and improve LDpred2-auto to allow for estimating $h^2$, $p$, and $\alpha$, a new third parameter that controls how expected effect sizes relate to minor allele frequencies.
- Can now run `snp_ldsc2()` with `corr` as an SFBM.
- Now use a sparse format for sampling betas returned in LDpred2-auto, instead of a dense matrix that could require quite some memory to store.
- Add two new parameters to `snp_simuPheno()`: `alpha` and `prob`.
- Fix a liftOver error in `snp_modifyBuild()`.
- Better `snp_ldsplit()`:
  - also return `$cost2`, the sum of squared sizes of the blocks,
  - for equivalent splits (with the same cost), now return the one that also minimizes cost2,
  - now return unique splits only (e.g. could get equivalent splits with different `max_size`).
- Slightly change the default parameters of lassosum2: `delta` from `c(0.001, 0.005, 0.02, 0.1, 0.6, 3)` to `c(0.001, 0.01, 0.1, 1)`, `nlambda` from 20 to 30, `maxiter` from 500 to 1000.
- Add a penalty multiplicative factor for delta and lambda to regularize variants with smaller GWAS sample sizes more (when they are different, as in meta-analyses with different sets of variants).
- Now use the same updating strategy for residuals in LDpred2 as in lassosum2. This can make LDpred2-grid and LDpred2-auto an order of magnitude faster, especially for small p.
- Better `snp_modifyBuild()`: more variants should be mapped, and some QC is now applied to the mapping (a position is not mapped to more than one, the chromosome is the same, and possibly check whether we can go back to the initial position; cf. https://doi.org/10.1093/nargab/lqaa054).
- Add two new parameters to `snp_ldsplit()`: `max_r2`, the maximum squared correlation allowed outside blocks, and `max_cost`, the maximum cost of reported solutions (i.e. the sum of all squared correlations outside blocks). Using `max_r2` offers an extra guarantee that the splitting is very good, and makes the function much faster by discarding lots of possible splits.
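
The quantities named in this entry can be illustrated on a toy correlation matrix in base R. The matrix, the candidate split, and the choice to count each pair of variants once (upper triangle) are assumptions made for this sketch; they are not taken from the implementation of `snp_ldsplit()`.

```r
# Toy correlation matrix for 5 variants (illustrative values only)
corr <- diag(5)
corr[1, 2] <- corr[2, 1] <- 0.8
corr[2, 3] <- corr[3, 2] <- 0.3
corr[3, 4] <- corr[4, 3] <- 0.1
corr[4, 5] <- corr[5, 4] <- 0.6

# A candidate split: block id of each variant (blocks {1,2}, {3} and {4,5})
block <- c(1, 1, 2, 3, 3)

# Pairs of variants that fall in different blocks, counted once (upper triangle)
outside <- outer(block, block, "!=") & upper.tri(corr)

cost   <- sum(corr[outside]^2)   # sum of squared correlations outside blocks
max_r2 <- max(corr[outside]^2)   # largest squared correlation outside blocks
cost2  <- sum(table(block)^2)    # sum of squared block sizes

c(cost = cost, max_r2 = max_r2, cost2 = cost2)
```

A split whose largest outside squared correlation exceeds the chosen `max_r2` can be discarded without computing its full cost, which is consistent with the speed-up described above.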
- LDpred2-grid does not use OpenMP for parallelism anymore, it now simply uses multiple R processes.
- LDpred2-grid and LDpred2-auto can now make use of `set.seed()` to get reproducible results. Note that LDpred2-inf and lassosum2 do not use any sampling.
- Enforce `scipen = 50` when writing files to turn off scientific format (e.g. for physical positions stored as `double`).
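
The effect of this option can be checked in base R with `format()`. This is only a small illustration of `scipen`; how the package sets and restores the option around its file writers is not shown here.

```r
old <- options(scipen = 0)     # R's default penalty
default_txt <- format(1e8)     # scientific notation is chosen because it is shorter

options(scipen = 50)           # strongly penalize scientific notation
fixed_txt <- format(1e8)       # fixed notation is now chosen

options(old)                   # restore the previous setting

c(default = default_txt, with_scipen = fixed_txt)
```

A physical position such as 100000000 stored as a double would therefore be written in full rather than in exponent form.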
- Use a better strategy for appending to an SFBM (`$add_columns()`).
- Fix an issue in `snp_readBGI()` when using an outdated version of package {bit64}.
- `snp_cor()` and `bed_cor()` now use less memory.
- Remove parameter `info` from `snp_cor()` and `bed_cor()` because this correction is not useful after all.
- `snp_cor()` and `bed_cor()` now return NaNs when e.g. the standard deviation is 0 (and warn about it). Before, these values were not reported (i.e. treated as 0).
- You can now return information on all variants with `snp_readBGI()`.
- Fix `snp_manhattan()` when non-ordered (chr, pos) are provided.
- Enhance function `snp_ancestry_summary()` by allowing to estimate ancestry proportions after PCA projection (instead of directly using the allele frequencies).
- Add function `bed_cor()` (similar to `snp_cor()` but with bed files/objects directly).
- Add functions `snp_ld_scores()` and `bed_ld_scores()`.
- Add function `snp_ancestry_summary()` to estimate ancestry proportions from a cohort using only its summary allele frequencies.
- Add function `snp_scaleAlpha()`, which is similar to `snp_scaleBinom()`, but has a parameter `alpha` that controls the relation between the scaling and the allele frequencies.
- Function `snp_cor()` now also uses the upper triangle (`@uplo = "U"`) when the sparse correlation matrix is diagonal, so that it is easier to use with e.g. `as_SFBM()`.
- Add parameter `type` in `snp_asGeneticPos()` to also be able to use interpolated genetic maps from here.
- Add parameter `return_flip_and_rev` to `snp_match()` for whether to return internal boolean variables `"_FLIP_"` and `"_REV_"`.
- Add `$perc_kept` in the output of `snp_ldsplit()`, the percentage of initial non-zero values kept within the blocks defined.
- Faster `snp_prodBGEN()`.
- Add function `snp_prodBGEN()` to compute a matrix product between BGEN files and a matrix (or a vector). This removes the need to read an intermediate FBM object with `snp_readBGEN()` to compute the product. Moreover, when using dosages, they are not rounded to two decimal places anymore.
- Trade new parameter `num_iter_change` for a simpler `allow_jump_sign`.
- Change defaults in LDpred2-auto to use 500 burn-in iterations (was 1000 before) followed by 200 iterations (500 before). Such a large number of iterations is usually not really needed.
- New compact format for SFBMs which should be really useful for LDpred2 (should require about half of memory and be twice as fast). The only thing that you need to change is `as_SFBM(corr0, compact = TRUE)`. Make sure to reinstall {bigsnpr} after updating to {bigsparser} v0.5.
- Prepare for incoming paper on (among other things) improved robustness of LDpred2-auto:
  - add parameter `shrink_corr` to shrink off-diagonal elements of the LD matrix,
  - add parameter `num_iter_change` to control when starting to shrink the variants that change sign too much,
  - also return `corr_est`, the "imputed" correlations between variants and phenotypes, which can be used for post-QCing variants by comparing those to `beta / sqrt(n_eff * beta_se^2 + beta^2)`.
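
The comparison suggested for `corr_est` can be sketched in base R. All numbers below are made up, the vector named `corr_est` merely stands in for the value returned by LDpred2-auto, and the tolerance of 0.05 is an arbitrary assumption, not a recommended QC threshold.

```r
# Hypothetical GWAS summary statistics for 4 variants (illustrative values only)
beta    <- c(0.05, -0.02, 0.10, 0.00)
beta_se <- c(0.01,  0.01, 0.02, 0.01)
n_eff   <- c(10000, 10000, 2500, 10000)

# Variant-phenotype correlations implied by the summary statistics
corr_gwas <- beta / sqrt(n_eff * beta_se^2 + beta^2)

# Made-up "imputed" correlations, standing in for `corr_est`
corr_est <- c(0.045, -0.019, 0.02, 0.001)

# Flag variants for which the two disagree by more than an arbitrary tolerance
flagged <- which(abs(corr_gwas - corr_est) > 0.05)
flagged
```

Here only the third variant is flagged, because its GWAS-implied correlation (about 0.1) is far from the imputed one.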
- Replace parameter `s` by `delta` in `snp_lassosum2()`. This new parameter `delta` better reflects that the lassosum model also uses L2-regularization (therefore, elastic-net regularization).
- Now detect strong divergence in lassosum2 and LDpred2-grid, and return missing values for the corresponding effect sizes.
- Now use a better formula for computing standard errors in `snp_ldsc()` when using blocks with different sizes.
- Add parameter `info` to `snp_cor()` to correct correlations when they are computed from imputed dosage data.
- Function `snp_readBGEN()` now also returns frequencies and imputation INFO scores.
- Add parameter `rsid` to `snp_asGeneticPos()` to also allow matching with rsIDs.
- Add function `snp_lassosum2()` to train the lassosum models using the exact same input data as LDpred2.
- Add parameter `report_step` in `snp_ldpred2_auto()` to report some of the internal sampling betas.
- Fix crash in `snp_readBGEN()` when using BGEN files containing `~`.
- Add parameter `thr_r2` in `snp_cor()`.
- Remove penalization in `snp_ldsplit()`. Instead, report the best splits for a range of numbers of blocks desired.
- Penalization in `snp_ldsplit()` now makes more sense. Also fix a small bug that prevented splitting the last block in some cases.
- Add function `snp_ldsplit()` for optimally splitting variants in nearly independent blocks of LD.
- Add option `file.type = "--gzvcf"` for using gzipped VCF in `snp_plinkQC()`.
- Finally remove function `snp_assocBGEN()`; prefer reading small parts with `snp_readBGEN()` as a temporary `bigSNP` object and do the association test with e.g. `big_univLinReg()`.
- Add function `snp_thr_correct()` for correcting for winner's curse in summary statistics when using p-value thresholding.
- Use a better formula for the scale in LDpred2, useful when there are some variants with very large effects (e.g. explaining more than 10% phenotypic variance).
- Simplify LDpred2; there was not really any need for initialization and ordering of the Gibbs sampler.
- Add option `return_sampling_betas` in `snp_ldpred2_grid()` to return all sampling betas (after burn-in), which is useful for assessing the uncertainty of the PRS at the individual level (see https://doi.org/10.1101/2020.11.30.403188).
- Faster cross-product with an SFBM, which should make all LDpred2 models faster.
- Also return `$postp_est`, `$h2_init` and `$p_init` in LDpred2-auto.
- Add multiple checks in `snp_readBGEN()` to make sure of the expected format.
- Add function `snp_fst()` for computing Fst.
- Workaround for error `could not find function "ldpred2_gibbs_auto"`.
- Can now directly do `as_SFBM(corr0)` instead of `bigsparser::as_SFBM(as(corr0, "dgCMatrix"))`. This should also use less memory and be faster.
- Add option `sparse` to enable getting also a sparse solution in LDpred2-auto.
- Faster `as_SFBM()`.
- Allow for format `01` or `1` for chromosomes in BGI files.
- Speed up `snp_match()`. Also now remove duplicates by default.
- Fix a bug when using very large correlation matrices in LDpred2 (although we do not recommend doing so).
- All 3 LDpred2 functions now use an SFBM as input format for the correlation matrix.
- Allow for multiple initial values for p in `snp_ldpred2_auto()`.
- Add function `coef_to_liab()` for e.g. converting heritability to the liability scale.
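
As background, the usual conversion of heritability from the observed scale to the liability scale multiplies the observed-scale estimate by a coefficient that depends on the population prevalence and on the proportion of cases in the study. The sketch below writes that standard formula in base R; the helper name `liab_coef()` and its argument names are hypothetical and chosen for this example, and it is not claimed to reproduce `coef_to_liab()` exactly.

```r
# Hypothetical helper written for this example only (not the package function)
liab_coef <- function(K_pop, K_gwas = 0.5) {
  # K_pop:  prevalence of the disease in the population
  # K_gwas: proportion of cases in the GWAS sample
  z <- dnorm(qnorm(K_pop))   # normal density at the liability threshold
  (K_pop * (1 - K_pop) / z)^2 / (K_gwas * (1 - K_gwas))
}

h2_obs  <- 0.2                                         # made-up observed-scale estimate
h2_liab <- h2_obs * liab_coef(K_pop = 0.05, K_gwas = 0.5)
h2_liab
```

The coefficient depends only on the two prevalences, so it can be computed once and applied to any estimate expressed on the observed scale.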
- Change default of parameter `alpha` of function `snp_cor()` to `1`.
- Add functions `snp_ldpred2_inf()`, `snp_ldpred2_grid()` and `snp_ldpred2_auto()` for running the new LDpred2-inf, LDpred2-grid and LDpred2-auto.
- Add functions `snp_ldsc()` and `snp_ldsc2()` for performing LD score regression.
- Add function `snp_asGeneticPos()` for transforming physical positions to genetic positions.
- Add function `snp_simuPheno()` for simulating phenotypes.
- Also use OpenMP for the parallelization of `snp_pcadapt()`, `bed_pcadapt()`, `snp_readBGEN()` and `snp_fastImputeSimple()`.
- Parallelization of clumping algorithms has been modified. Before, chromosomes were processed in parallel. Now, chromosomes are processed sequentially, but computations within each chromosome are performed in parallel thanks to OpenMP. This should prevent major slowdowns for very large sample sizes (due to swapping).
- Use OpenMP to parallelize other functions as well (possibly only sequential until now).
- Can now run `snp_cor()` in parallel.
- Parallelization of `snp_fastImpute()` has been modified. Before this version, chromosomes were imputed in parallel. Now, chromosomes are processed sequentially, but the computation of correlations between variants and the fitting of XGBoost models are parallelized.
- Add function `snp_subset()` as alias of method `subset()` for subsetting `bigSNP` objects.
- Use new class `bed_light` internally to make parallel algorithms faster because they have to transfer less data to clusters. Also define differently functions used in `big_parallelize()` for the same reason.
- Use the new implementation of robust OGK Mahalanobis distance in {bigutilsr}.
- Fix error `object 'obj.bed' not found` in `snp_readBed2()`.
- Cope with new read-only option in {bigstatsr} version >= 1.1.
- Add option `backingfile` to `subset.bigSNP()`.
- Add option `byrow` to `bed_counts()`.
- Add memory-mapping on PLINK (.bed) files with missing values + new functions:
  - `bed()`
  - `bed_MAF()`
  - `bed_autoSVD()`
  - `bed_clumping()`
  - `bed_counts()`
  - `bed_cprodVec()`
  - `bed_pcadapt()`
  - `bed_prodVec()`
  - `bed_projectPCA()`
  - `bed_projectSelfPCA()`
  - `bed_randomSVD()`
  - `bed_scaleBinom()`
  - `bed_tcrossprodSelf()`
  - `download_1000G()`
  - `snp_modifyBuild()`
  - `snp_plinkKINGQC()`
  - `snp_readBed2()`
  - `sub_bed()`
- Add 3 parameters to `autoSVD()`: `alpha.tukey`, `min.mac` and `max.iter`.
- Remove option for changing ploidy (that was only partially supported).
- Automatically apply `snp_gc()` to `pcadapt`.
- Add `snp_fastImputeSimple()`: fast imputation via mode, mean or sampling according to allele frequencies.
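
The three strategies can be sketched for a single variant in base R. The helper `impute_simple()` and its `method` values are hypothetical names for this illustration; the argument names and rounding behavior of `snp_fastImputeSimple()` may differ.

```r
# Genotypes of one variant (0/1/2), with missing values; illustrative data only
g <- c(0, 1, NA, 2, 1, 1, NA, 0, 1, 2)

# Hypothetical helper written for this example only
impute_simple <- function(g, method = c("mode", "mean", "random")) {
  method <- match.arg(method)
  miss <- is.na(g)
  obs  <- g[!miss]
  g[miss] <- switch(method,
    # most frequent observed genotype
    mode   = as.numeric(names(which.max(table(obs)))),
    # average observed genotype (may be non-integer)
    mean   = mean(obs),
    # draw genotypes from Binomial(2, f), with f the observed allele frequency
    random = rbinom(sum(miss), size = 2, prob = mean(obs) / 2)
  )
  g
}

set.seed(1)  # only the "random" method uses sampling
g_mode   <- impute_simple(g, "mode")
g_mean   <- impute_simple(g, "mean")
g_random <- impute_simple(g, "random")
```

Mode and mean imputation are deterministic, whereas sampling preserves more of the genotype variance at the price of run-to-run variability unless a seed is set.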
- Fix a bug in `snp_readBGEN()` that could not handle duplicated variants or individuals.
- When using `snp_grid_PRS()`, it now stores not only the FBM, but also the input parameters as attributes (the whole result basically).
- Add 3 SCT functions `snp_grid_*()` to improve from Clumping and Thresholding (preprint coming soon).
- Add `snp_match()` function to match between summary statistics and some SNP information.
- Parameter `is.size.in.bp` is deprecated.
- Add parameter `read_as` for `snp_readBGEN()`. It is now possible to sample BGEN probabilities as random hard calls using `read_as = "random"`. Default remains reading probabilities as dosages.
- For memory-mapping, now use mio instead of boost.
- `snp_clumping()` (and `snp_autoSVD()`) now has a `size` that is inversely proportional to `thr.r2`.
- `snp_pruning()` is deprecated (and will be removed someday); now always use `snp_clumping()`.
- When reading bed files, switch reading of 0s and 2s to be consistent with other software.
- Add function `snp_assocBGEN()` for computing quick association tests from BGEN files. Could be useful for quick screening of useful SNPs to read in bigSNP format. This function might be improved in the future.
- Change URL to download PLINK 1.9.
- Add function `snp_readBGEN()` to read UK Biobank BGEN files in `bigSNP` format.
- Add parameter `is.size.in.bp` to `snp_autoSVD()` for the clumping part.
- Change the threshold of outlier detection in `snp_autoSVD()` (it now detects fewer outliers). See the documentation details if you don't have any information about SNPs.
- Keep up with {bigstatsr}.
- Provide function `snp_gene` (as a gist) to get genes corresponding to 'rs' SNP IDs thanks to package {rsnps} from rOpenSci. See README.
- Package {bigsnpr} is published in Bioinformatics.
- Faster defaults + possibility to estimate correlations based on a subset of individuals for `snp_fastImpute`. Also store information in an FBM (instead of a data frame) so that imputation can be done by parts (you can stop the imputation by killing the R processes and come back to it later). Note that the defaults used in the Bioinformatics paper were `alpha = 0.02` and `size = 500` (instead of `1e-4` and `200` now, respectively). These new defaults are more stringent on the SNPs that are used, which makes the imputation faster (30 min instead of 42-48 min), without impacting accuracy (still 4.7-4.8% of errors).
- This package won't be on CRAN. (Okay, it has been back on CRAN since; I was just pissed at BR :D)
- No longer download PLINK automatically (because it is a CRAN policy violation).