Deploying to gh-pages from @ 8097704 🚀

martinjzhang · Aug 12, 2024 · 060c342 · 060c342
commit 060c342

Showing 107 changed files with 23,433 additions and 0 deletions.
diff --git a/.buildinfo b/.buildinfo
@@ -0,0 +1,4 @@
+# Sphinx build info version 1
+# This file hashes the configuration used when building these files. When it is not found, a full rebuild will be done.
+config: 917d04b4ef0d16e096ff918a3f2bf883
+tags: 645f666f9bcd5a90fca523b33c5a78b7
diff --git a/.nojekyll b/.nojekyll
@@ -0,0 +1 @@
+
diff --git a/_images/notebooks_quickstart_15_0.png b/_images/notebooks_quickstart_15_0.png
diff --git a/_images/notebooks_quickstart_18_0.png b/_images/notebooks_quickstart_18_0.png
diff --git a/_images/notebooks_quickstart_20_0.png b/_images/notebooks_quickstart_20_0.png
diff --git a/_images/notebooks_quickstart_9_0.png b/_images/notebooks_quickstart_9_0.png
diff --git a/_images/notebooks_quickstart_9_1.png b/_images/notebooks_quickstart_9_1.png
diff --git a/_sources/downloads.rst.txt b/_sources/downloads.rst.txt
@@ -0,0 +1,16 @@
+Downloads
+=========
+
+Code and data to reproduce results of the paper
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+* Data in `figshare <https://figshare.com/projects/Single-cell_Disease_Relevance_Score_scDRS_/118902>`_ and details in `scDRS_paper <https://github.com/martinjzhang/scDRS_paper>`_. 
+* `Weighted GWAS gene sets <https://figshare.com/articles/dataset/scDRS_data_release_030122/19312583?file=34300898>`_ (**.gs** files) for 74 diseases and complex traits.
+* `scDRS results using weighted gene sets <https://figshare.com/articles/dataset/scDRS_data_release_030122_score_file_tmsfacs/19312607>`_ (**.score.gz** and **.full_score.gz** files) for TMS FACS + 74 diseases/traits (latest in revision 1).
+
+
+Other versions
+~~~~~~~~~~~~~~
+
+* `Binary GWAS gene sets <https://figshare.com/articles/dataset/scDRS_data_release_092121/16664080?file=30853708>`_ (**.gs** files) for 74 diseases and complex traits (used in initial submission).
+* `scDRS results using binary gene sets <https://figshare.com/articles/dataset/scDRS_data_release_092121_score_file_tmsfacs/16664077>`_ (**.score.gz** and **.full_score.gz** files) for TMS FACS + 74 diseases/traits (initial submission).
diff --git a/_sources/faq.rst.txt b/_sources/faq.rst.txt
@@ -0,0 +1,93 @@
+FAQ
+################################
+
+Here are some frequently asked questions about scDRS.
+
+
+Typical scDRS workflow
+======================
+
+1. Create scDRS gene set file (:code:`.gs`) from GWAS summary statistics.
+
+    a. Generate gene-level p-values/z-scores for a given trait (pval_file/zscore_file) using `MAGMA <https://ctg.cncr.nl/software/magma>`_ from GWAS summary statistics.
+    b. Create :code:`.gs` file from pval_file/zscore_file using scDRS CLI :code:`scdrs munge-gs`.
+
+2. Compute scDRS individual cell-level results.
+
+    a. Use scDRS CLI :code:`scdrs compute-score`
+    b. Inputs: scDRS gene set file (:code:`.gs`), scRNA-seq data (:code:`.h5ad`), covariates (:code:`.cov`). 
+    c. Outputs: scDRS score file (:code:`<trait>.score.gz`), scDRS full score file (:code:`<trait>.full_score.gz`).
+
+3. Perform scDRS downstream cell-group analyses.
+
+    a. Use scDRS CLI :code:`scdrs perform-downstream` for (1) cell group-trait association, (2) within-cell group association heterogeneity, (3) correlating the scDRS disease score with a cell variable, and (4) correlating the scDRS disease score with gene expression.
+    b. Use customized code for other analyses:
+
+        i. Compute a test statistic using the normalized disease score across a given set of cells.
+        ii. Compute the same test statistic using each of the :code:`n_ctrl` sets of normalized control scores across the same given set of cells.
+        iii. MC-pvalue = :code:`(1 + # of control statistics exceeding disease statistic) / (n_ctrl+1)`
+
+    c. Input: scDRS full score file (:code:`<trait>.full_score.gz`), cell annotations stored in :code:`adata.obs` of scRNA-seq :code:`.h5ad` file.
+    d. Output: test statistics/p-values based on the scDRS MC tests.
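
The customized MC test in step 3 can be sketched as follows. This is a minimal illustration on synthetic scores, not the scDRS implementation: the cell group, the mean-score test statistic, and the helper names ``mc_pvalue`` and ``mean_score`` are assumptions made for the example. It uses the ``(1 + count) / (n_ctrl + 1)`` form of the MC p-value, which matches the example ``assoc_mcp`` value of 0.0476 = 1/21 for ``n_ctrl = 20`` in the file-format tables.

```python
import random


def mc_pvalue(disease_stat, ctrl_stats):
    """One-sided MC p-value: (1 + #{control >= disease}) / (n_ctrl + 1)."""
    n_exceed = sum(1 for s in ctrl_stats if s >= disease_stat)
    return (1 + n_exceed) / (len(ctrl_stats) + 1)


def mean_score(scores, cell_idx):
    """Hypothetical test statistic: mean score over the selected cells."""
    return sum(scores[i] for i in cell_idx) / len(cell_idx)


random.seed(0)
n_cell, n_ctrl = 200, 20
group = list(range(50))  # hypothetical cell group: the first 50 cells

# Synthetic stand-ins for the norm_score and ctrl_norm_score_<i_ctrl> columns.
norm_score = [random.gauss(2.0 if i < 50 else 0.0, 1.0) for i in range(n_cell)]
ctrl_norm_score = [
    [random.gauss(0.0, 1.0) for _ in range(n_cell)] for _ in range(n_ctrl)
]

# Same statistic on the disease score and on each control score set.
disease_stat = mean_score(norm_score, group)
ctrl_stats = [mean_score(ctrl, group) for ctrl in ctrl_norm_score]
pval = mc_pvalue(disease_stat, ctrl_stats)
```

With ``n_ctrl`` control sets the smallest attainable p-value is ``1 / (n_ctrl + 1)``, which is the "MC limit" discussed further down.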
+
+
+Which GWAS and scRNA-seq data to use?
+======================================================
+
+To ensure a reasonable number of scDRS discoveries, we recommend using GWAS data with a heritability z-score greater than 5 or a sample size greater than 100K. We also recommend using scRNA-seq data with a diverse set of cells potentially relevant to disease, although a smaller number of cells should not affect the scDRS power.
+
+
+How to create MAGMA gene sets?
+==============================
+
+Please see the `instructions <https://github.com/martinjzhang/scDRS/issues/2>`_.
+
+
+Use scDRS for other gene sets?
+=====================================
+
+Yes, you can use other gene sets instead of GWAS gene sets with scDRS to identify cells in scRNA-seq with excess expression of genes in the gene set.
+
+
+Requirement of gene set size?
+========================================
+
+The gene set should have a moderate size (e.g., >50 genes and <20% of all genes) for the scDRS results to be statistically valid. In practice, we observe a reasonable performance as long as the gene set size is >=10. Please see details in Methods in Zhang & Hou et al. Nat Genet 2022. 
+
+
+Computational complexity?
+====================================
+
+scDRS scales linearly with the number of cells and the number of control gene sets for both computation and memory use. It takes around 3 hours and 60GB of memory for a single-cell data set with a million cells. Please see details in Methods in Zhang & Hou et al. Nat Genet 2022.
+
+
+scDRS detected few significant cells (FDR<0.2)?
+==================================================
+
+scDRS may be underpowered for certain GWAS/scRNA-seq data sets. In these cases, the ensuing scDRS group analyses may still have sufficient power, because they aggregate results of individual cells and hence have higher power than the scDRS individual cell-level analyses. To assess if scDRS has sufficient power, we suggest performing the `scDRS group analyses <https://martinjzhang.github.io/scDRS/reference_cli.html#perform-downstream>`_ to assess significance at an aggregated level. In addition, it is helpful to visually inspect the scDRS normalized disease score on the UMAP plot. Localized enrichments of high scDRS disease scores on the UMAP usually indicate that scDRS has detected interesting biological signals.
+
+
+MC z-scores are much more significant than MC p-values in group analysis due to the MC limit? 
+===========================================================================================
+
+Increasing :code:`--n-ctrl` in :code:`compute-score` produces more control scores, which the group analysis then uses as additional MC samples for the MC tests. Alternatively, you can compute a p-value from :code:`assoc_mcz` when :code:`assoc_mcp` is reasonably small. As mentioned in the Methods section of our manuscript: "We recommend using MC P values to determine statistical significance and using MC z-scores to further prioritize associations whose MC P values have reached the MC limit."
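
One way to compute a p-value from an MC z-score is sketched below. It assumes the MC z-score is approximately standard normal under the null and that a one-sided (upper-tail) test is wanted. Both are assumptions of this example, and the helper name ``mcz_to_pvalue`` is hypothetical.

```python
import math


def mcz_to_pvalue(mcz):
    """Upper-tail p-value of a standard normal evaluated at z = mcz."""
    return 0.5 * math.erfc(mcz / math.sqrt(2.0))


# assoc_mcz value taken from the example <trait>.scdrs_group.<annot> table.
p_from_z = mcz_to_pvalue(12.297529)
```

Unlike the MC p-value, this quantity is not bounded below by ``1 / (n_ctrl + 1)``, so it can rank associations that have all reached the MC limit.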
+
+
+Use scDRS for other types of single-cell data?
+====================================================
+
+scDRS is tailored for scRNA-seq. Best practices for using scDRS on other data types and systematic comparisons with alternative methods remain interesting future directions.
+
+We empirically observed that scDRS works for other types of RNA-seq data like spatial transcriptomics. 
+
+We empirically observed that scDRS works for single-cell DNA methylation data. 
+
+scDRS should work for single-cell ATAC-seq in principle, although you may need some custom Python code. To do this:
+
+1. Right after calling :code:`scdrs.preprocess`, create a categorical :code:`adata.obs['atac_match']` for your control gene matching criteria by dividing features (genes/peaks) into discrete bins. We recommend >20 features per bin. We recommend matching for mean accessibility and GC content, as done in `gchromVAR <https://github.com/caleblareau/gchromVAR>`_.
+2. When calling :code:`scdrs.score_cell`, tell scDRS to use these matching criteria by setting :code:`ctrl_match_key='atac_match'`.
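
The binning in step 1 could look roughly like the sketch below. It uses synthetic per-feature values and plain Python, and only builds the bin labels; attaching them to the data object follows the steps above and is not shown. The 5 x 5 grid, the helper ``quantile_bins``, and the label format are illustrative assumptions.

```python
import random
from collections import Counter


def quantile_bins(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


random.seed(0)
n_features = 2000
# Synthetic stand-ins for per-feature mean accessibility and GC content.
mean_acc = [random.random() for _ in range(n_features)]
gc_content = [random.uniform(0.3, 0.7) for _ in range(n_features)]

acc_bin = quantile_bins(mean_acc, 5)
gc_bin = quantile_bins(gc_content, 5)

# One categorical label per feature, combining both matching criteria.
atac_match = [f"acc{a}_gc{g}" for a, g in zip(acc_bin, gc_bin)]

# Flag bins that fall short of the recommended >20 features per bin.
bin_sizes = Counter(atac_match)
small_bins = {label: n for label, n in bin_sizes.items() if n <= 20}
```

If ``small_bins`` is non-empty, using fewer bins per criterion is a simple way to bring every bin above the recommended size.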
+
+Relevant works for individual cell-level associations for scATAC-seq: `Ulirsch et al. Nat Genet 2019 <https://www.nature.com/articles/s41588-019-0362-6>`_, `Chiou et al. Nat Genet 2021 <https://www.nature.com/articles/s41588-021-00823-0>`_, `Yu et al. Nat Biotechnol 2022 <https://www.nature.com/articles/s41587-022-01341-y>`_.
+
+
+
+
diff --git a/_sources/file_format.rst.txt b/_sources/file_format.rst.txt
@@ -0,0 +1,165 @@
+File formats
+============
+
+.sumstats
+~~~~~~~~~
+GWAS summary statistics following the `LDSC format <https://github.com/bulik/ldsc/wiki/Summary-Statistics-File-Format>`_.
+
+.. csv-table:: Example .sumstats file
+   :header: "SNP", "A1", "A2", "N", "CHISQ", "Z"
+   :delim: space
+
+   rs7899632 A G 59957 3.4299 -1.852
+   rs3750595 A C 59957 3.3124 1.82
+
+.h5ad
+~~~~~
+
+Single-cell data :code:`.h5ad` file as defined in `AnnData <https://anndata.readthedocs.io/en/latest/>`_ and `Scanpy <https://scanpy.readthedocs.io/en/stable/>`_.
+
+
+pval_file,zscore_file
+~~~~~~~~~~~~~~~~~~~~~
+GWAS gene-level p-values / z-scores for different traits. A :code:`.tsv` file with first column corresponding to genes and other columns corresponding to p-values / z-scores of traits (one trait per column).
+
+.. csv-table:: Example pval_file
+   :header: "GENE", "BMI", "HEIGHT"
+   :delim: space
+
+   OR4F5   0.001  0.01
+   DAZ3    0.01   0.001
+
+.gs
+~~~~
+
+scDRS gene set file. A :code:`.tsv` file with two columns :code:`["TRAIT", "GENESET"]` and one line per trait. Can be generated using customized code or from p-value or z-score files using scDRS CLI :code:`scdrs munge-gs`.
+
+TRAIT
+    Trait (gene set) identifier.
+GENESET
+    Comma-separated list of gene-weight pairs with the form "gene1\:weight1,gene2\:weight2,..." 
+    or "gene1,gene2,..." (meaning weights are 1). 
+
+
+.. csv-table:: Example weighted .gs file
+   :header: "TRAIT", "GENESET"
+   :delim: space
+   :align: center
+   :width: 50%
+
+   PASS_HbA1C FN3KRP:1.2,FN3K:2.3,HK1:4.7,GCK:5.2
+   PASS_MedicationUse_Wu2019 FTO:3,SEC16B:0.6,ADCY3:1.5,DNAJC27:1.3
+
+.. csv-table:: Example unweighted .gs file
+   :header: "TRAIT", "GENESET"
+   :delim: space
+   :align: center
+   :width: 50%
+
+   PASS_HbA1C FN3KRP,FN3K,HK1,GCK
+   PASS_MedicationUse_Wu2019 FTO,SEC16B,ADCY3,DNAJC27
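
For custom code that consumes :code:`.gs` files, a small parser along the following lines may be useful. It is a sketch based only on the format described above (tab-separated, ``TRAIT``/``GENESET`` header, optional ``gene:weight`` pairs); the function names ``parse_geneset`` and ``read_gs`` are hypothetical and are not part of the scDRS API.

```python
def parse_geneset(geneset):
    """Parse 'g1:w1,g2:w2' or 'g1,g2' into {gene: weight}; missing weights are 1."""
    weights = {}
    for item in geneset.split(","):
        item = item.strip()
        if not item:
            continue
        gene, sep, weight = item.partition(":")
        weights[gene] = float(weight) if sep else 1.0
    return weights


def read_gs(lines):
    """Read tab-separated .gs lines into {trait: {gene: weight}}."""
    rows = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    if not rows or rows[0] != ["TRAIT", "GENESET"]:
        raise ValueError("expected a TRAIT<tab>GENESET header line")
    return {trait: parse_geneset(geneset) for trait, geneset in rows[1:]}


# The two example traits above, one weighted and one unweighted.
example_lines = [
    "TRAIT\tGENESET\n",
    "PASS_HbA1C\tFN3KRP:1.2,FN3K:2.3,HK1:4.7,GCK:5.2\n",
    "PASS_MedicationUse_Wu2019\tFTO,SEC16B,ADCY3,DNAJC27\n",
]
gene_sets = read_gs(example_lines)
```

With a real file, the same function can be called on an open file handle, for example ``read_gs(open(path))``.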
+
+
+.cov
+~~~~
+
+scDRS covariate file for the :code:`.h5ad` single-cell data. :code:`.tsv` file.
+
+- First column: cell names, consistent with :code:`adata.obs_names`.
+- Other columns: covariates with numerical values.
+
+.. csv-table:: Example .cov file
+   :header: "index", "const", "n_genes", "sex_male", "age"
+   :align: center
+   :width: 50%
+
+   A10_B000497_B009023_S10, 1, 2706, 1, 18 
+   A10_B000756_B007446_S10, 1, 2501, 0, 24
+
+
+<trait>.score.gz
+~~~~~~~~~~~~~~~~
+
+scDRS score file for a given trait. :code:`.tsv.gz` file.
+
+- First column: cell names, should be the same as :code:`adata.obs_names`.
+- raw_score: raw disease score.
+- norm_score: normalized disease score.
+- mc_pval: cell-level MC p-value. Raw p-value without multiple testing adjustment.
+- pval: cell-level scDRS p-value. Raw p-value without multiple testing adjustment.
+- nlog10_pval: -log10(pval).
+- zscore: z-score converted from pval.
+
+.. csv-table:: Example <trait>.score.gz file
+   :header: "index", "raw_score", "norm_score", "mc_pval", "pval", "nlog10_pval", "zscore"
+
+   A10_B000497_B009023_S10, 0.730, 7.04, 0.0476, 0.00166, 2.78, 2.94
+   A10_B000756_B007446_S10, 0.725, 7.30, 0.0476, 0.00166, 2.78, 2.94
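
The derived columns in the example rows can be reproduced approximately from ``pval``, as sketched below. The one-sided normal conversion for ``zscore`` is an assumption that happens to match the example values. The value ``n_ctrl = 20`` is also assumed, chosen because it gives the ``mc_pval`` of 0.0476 shown above as the smallest attainable MC p-value.

```python
import math
from statistics import NormalDist

pval = 0.00166  # cell-level scDRS p-value from the example rows

# nlog10_pval column: -log10(pval).
nlog10_pval = -math.log10(pval)

# zscore column, assuming a one-sided (upper-tail) normal conversion.
zscore = NormalDist().inv_cdf(1.0 - pval)

# Smallest attainable MC p-value with an assumed n_ctrl = 20 control sets.
n_ctrl = 20
mc_pval_min = 1 / (n_ctrl + 1)
```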
+
+
+<trait>.full_score.gz
+~~~~~~~~~~~~~~~~~~~~~
+
+scDRS full score file for a given trait. :code:`.tsv.gz` file.
+
+- All columns of :code:`<trait>.score.gz` file.
+- ctrl_raw_score_<i_ctrl> : raw control scores, specified by :code:`--flag_return_ctrl_raw_score True`.
+- ctrl_norm_score_<i_ctrl> : normalized control scores, specified by :code:`--flag_return_ctrl_norm_score True`.
+
+
+<trait>.scdrs_group.<annot>
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Results for scDRS group-level analysis for a given trait and a given cell-group annotation (e.g., cell type). :code:`.tsv` file.
+
+- <trait> : trait name consistent with :code:`<trait>.full_score.gz` file.
+- <annot> : cell-annotation in :code:`adata.obs.columns`, specified by :code:`group_analysis` in CLI.
+- First column: different cell groups in :code:`adata.obs[<annot>]`.
+- n_cell: number of cells from the cell group.
+- n_ctrl: number of control gene sets.
+- assoc_mcp: MC p-value for cell group-disease association. Raw p-value without multiple testing adjustment.
+- assoc_mcz: MC z-score for cell group-disease association.
+- hetero_mcp: MC p-value for within-cell group heterogeneity in association with disease. Raw p-value without multiple testing adjustment.
+- hetero_mcz: MC z-score for within-cell group heterogeneity in association with disease.
+
+.. csv-table:: Example <trait>.scdrs_group.<annot> file
+   :header: "", "n_cell", "n_ctrl", "assoc_mcp", "assoc_mcz", "hetero_mcp", "hetero_mcz"
+
+   causal_cell    , 10.0,   20.0, 0.04761905, 12.297529 , 1.0, 1.0
+   non_causal_cell, 20.0,   20.0, 0.9047619 , -1.1364214, 1.0, 1.0
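
Because ``assoc_mcp`` and ``hetero_mcp`` are raw p-values, a multiple-testing adjustment across cell groups is typically applied downstream. The sketch below uses the Benjamini-Hochberg procedure as one common choice. It is not stated here to be what scDRS itself applies, and the helper name ``bh_adjust`` is hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# assoc_mcp values from the example table above.
assoc_mcp = {"causal_cell": 0.04761905, "non_causal_cell": 0.9047619}
assoc_fdr = dict(zip(assoc_mcp, bh_adjust(list(assoc_mcp.values()))))
```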
+
+
+<trait>.scdrs_cell_corr
+~~~~~~~~~~~~~~~~~~~~~~~
+
+Results for scDRS cell-level correlation analysis for a given trait. :code:`.tsv` file.
+
+- <trait> : trait name consistent with :code:`<trait>.full_score.gz` file.
+- First column: all cell-level variables, specified by :code:`corr_analysis` in CLI.
+- n_ctrl: number of control gene sets.
+- corr_mcp: MC p-value for cell-level variable association with disease. Raw p-value without multiple testing adjustment.
+- corr_mcz: MC z-score for cell-level variable association with disease.
+
+.. csv-table:: Example <trait>.scdrs_cell_corr file
+   :header: "", "n_ctrl", "corr_mcp", "corr_mcz"
+
+   causal_variable    , 20.0, 0.04761905, 3.4574268
+   non_causal_variable, 20.0, 0.23809524, 0.8974108
+   covariate          , 20.0, 0.1904762 , 1.1220891
+
+<trait>.scdrs_gene
+~~~~~~~~~~~~~~~~~~
+
+Results for scDRS gene-level correlation analysis for a given trait. :code:`.tsv` file.
+
+- <trait> : trait name consistent with :code:`<trait>.full_score.gz` file.
+- First column: genes in :code:`adata.var_names`.
+- CORR: correlation with scDRS disease score across all cells in :code:`adata`.
+- RANK: rank of correlation across genes (starting from 0).
+
+.. csv-table:: Example <trait>.scdrs_gene file
+   :header: "index", "CORR", "RANK"
+
+   Serping1, 0.314, 0
+   Lmna    , 0.278, 1