experiment_key_column
: Column in cell metadata used to generate pseudobulk datasets and to calculate cell type proportions (combined with proportion_covariate_column)
anndata_cell_label
: Column in cell metadata representing the cell types on which to perform differential expression and gene set enrichment
Parameters are applied within each cell type as denoted by anndata_cell_label
mean_cp10k_filter
: Remove genes with mean counts per 10,000 (CP10K) expression <= mean_cp10k_filter
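As a rough illustration of the CP10K filter, the sketch below scales each cell's counts to 10,000 total counts, averages per gene, and keeps genes whose mean is strictly above the threshold. The function name and the toy matrix are hypothetical, and the sketch assumes the mean is taken across cells of a single cell type; the pipeline's own implementation may differ.

```python
def genes_passing_mean_cp10k(counts, mean_cp10k_filter):
    """Return one boolean per gene: True if the gene is kept.

    counts: list of per-cell lists of raw counts (cells x genes),
    assumed to contain cells from a single cell type.
    """
    n_genes = len(counts[0])
    cp10k_sums = [0.0] * n_genes
    for cell in counts:
        total = sum(cell)
        if total == 0:
            continue  # an empty cell contributes zero to every gene
        for j, count in enumerate(cell):
            cp10k_sums[j] += count * 1e4 / total  # counts per 10,000
    means = [s / len(counts) for s in cp10k_sums]
    # Genes with mean CP10K <= threshold are removed.
    return [m > mean_cp10k_filter for m in means]


# Toy example: two cells, three genes (hypothetical values).
toy_counts = [[90, 10, 0], [80, 0, 20]]
keep = genes_passing_mean_cp10k(toy_counts, mean_cp10k_filter=1000)
```

A gene whose mean equals the threshold exactly is removed, matching the `<=` in the description above.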
models
: List of configurations for differential expression methods to run
    filter_options
    : Options for the optional pre-filter.
        filter
        : Value to set as minimum for filter.
        modality
        : [cp10k|counts]
        metric
        : Default is mean. TODO -- alternate metrics implemented?
        by_comparison
        : TODO -- what is this?
    method
    : String in the format package::resolution::model. Possible values for each:
        package
        : "mast", "edger", "deseq"
        resolution
        : "singlecell" or "pseudobulk"
        model
        : Options:
            - For MAST, "bayesglm", "glmer" (needed for random effect models), or "glm"
            - For edgeR, "glmQLFit" or "glmLRT"
            - For DESeq2, "glmGamPoi" (recommended), "parametric", "local", or "mean"
    formula
    : Formula to model the gene expression. Terms should be columns in cell metadata (e.g., "~ sex + age + disease_status"). The pipeline also supports the following operations:
        - R formula functions, such as "I(age^2)", and interaction effects, denoted with ":" (e.g., "time_point:disease_status"). See: https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/formula
        - Random effects, denoted by "(1|participant_id)"
    variable_target
    : Term in formula to test (e.g., disease_status)
    variable_continuous
    : Terms in formula that should be cast as continuous covariates
    variable_discrete
    : Terms in formula that should be cast as discrete (i.e., categorical) covariates
    variable_discrete_level
    : Reference information for discrete covariates, formatted as "cov_1::ref;;cov_2::ref"
    pre_filter_genes
    : Logical (e.g., true or false) controlling whether mean_cp10k_filter is applied before or after performing differential expression
    proportion_covariate_column
    : Column in cell metadata used to calculate the proportion of cells from each experiment (defined by experiment_key_column) representing each value. For instance, if it is the same as anndata_cell_label, the pipeline will calculate the proportion of each cell type for each experiment key.
    include_proportion_covariates
    : Logical (e.g., true or false) to include proportions from proportion_covariate_column in formula
    ruvseq
    : Logical (e.g., true or false) to run RUVSeq
    ruvseq_n_empirical_genes
    : Number of empirical genes to use as input into RUVSeq. If value < 1, the pipeline takes that proportion of the total genes (value * total number of genes). If value > 1, the value is used as the number of genes.
    ruvseq_min_pvalue
    : Minimum p-value threshold for empirical genes. Only genes with p-value > value will be used as empirical genes.
    ruvseq_k
    : Number of RUVSeq factors to adjust for
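The string formats used by the model options above can be checked before launching a run. The following is a minimal sketch, not the pipeline's own parser: the helper names are hypothetical, the allowed values are copied from the descriptions above, and the handling of a ruvseq_n_empirical_genes value of exactly 1 (treated here as a count) and the rounding of the proportion are assumptions.

```python
# Allowed values, copied from the parameter descriptions above.
VALID_MODELS = {
    "mast": {"bayesglm", "glmer", "glm"},
    "edger": {"glmQLFit", "glmLRT"},
    "deseq": {"glmGamPoi", "parametric", "local", "mean"},
}
VALID_RESOLUTIONS = {"singlecell", "pseudobulk"}


def parse_method(method):
    """Split 'package::resolution::model' and validate each part."""
    parts = method.split("::")
    if len(parts) != 3:
        raise ValueError(f"expected package::resolution::model, got {method!r}")
    package, resolution, model = parts
    if package not in VALID_MODELS:
        raise ValueError(f"unknown package: {package!r}")
    if resolution not in VALID_RESOLUTIONS:
        raise ValueError(f"unknown resolution: {resolution!r}")
    if model not in VALID_MODELS[package]:
        raise ValueError(f"unknown model for {package}: {model!r}")
    return package, resolution, model


def parse_discrete_levels(spec):
    """Parse 'cov_1::ref;;cov_2::ref' into {covariate: reference}."""
    levels = {}
    for entry in spec.split(";;"):
        if not entry:
            continue  # tolerate an empty string or a trailing ';;'
        covariate, sep, reference = entry.partition("::")
        if not sep or not covariate or not reference:
            raise ValueError(f"expected covariate::reference, got {entry!r}")
        levels[covariate] = reference
    return levels


def resolve_n_empirical_genes(value, n_genes_total):
    """Turn ruvseq_n_empirical_genes into an absolute gene count."""
    if value <= 0:
        raise ValueError("value must be positive")
    if value < 1:
        # Proportion of all genes; rounding is an assumption.
        return round(value * n_genes_total)
    # Assumption: a value of exactly 1 is treated as a count of one gene.
    return int(value)


method_parts = parse_method("deseq::pseudobulk::glmGamPoi")
reference_levels = parse_discrete_levels("cov_1::ref;;cov_2::ref")
```

Validating these strings up front gives an early, readable error instead of a failure partway through a differential expression job.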
de_merge_config
: Configuration for merge settings
    ihw_correction
    : Configuration for IHW correction
        covariates
        : Comma-separated list of covariates to include in IHW correction (e.g., "cell_label,disease_status")
        alpha
        : See IHW documentation
de_plot_config
: Parameters for plotting differential expression results
    mean_expression_filter
    : List of mean expression thresholds below which genes are dropped from plots, applied to each group in anndata_cell_label. For example, if gene A expression is 0 counts in cluster 1 and 10 in cluster 2, it will be dropped from cluster 1 but not cluster 2.
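The per-group behaviour of mean_expression_filter can be sketched as follows. The row layout, field names, and function name are hypothetical, and the strict comparison (a gene is kept only if its mean is above the threshold) is an assumption chosen so that the 0-count example above is dropped at a threshold of 0.

```python
def filter_results_for_plot(results, thresholds):
    """Return {threshold: rows kept for plotting}.

    results: one row per (gene, cell label) pair, each carrying the
    mean expression of that gene within that cell label.
    """
    return {
        threshold: [r for r in results if r["mean_expression"] > threshold]
        for threshold in thresholds
    }


# Gene A from the example above: 0 counts in cluster 1, 10 in cluster 2.
rows = [
    {"gene": "A", "cell_label": "cluster_1", "mean_expression": 0.0},
    {"gene": "A", "cell_label": "cluster_2", "mean_expression": 10.0},
]
kept = filter_results_for_plot(rows, thresholds=[0, 20])
```

Because the filter is evaluated per row, the same gene can appear in the plot for one cluster and be absent from another.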
goenrich_config
:
    go_terms
    : Ontology terms: MF (Molecular Function), CC (Cellular Component), BP (Biological Process). Multiple terms can be specified, separated by commas (e.g., 'BP,MF,CC').
    clustering_method
    : Method to cluster terms. Options: "binary_cut", "louvain", "mclust"
gsea_config
: Parameters for running gene set analyses
    fgsea_parameters
    : List of alternate configurations for fgsea
        sample_size
        : See fGSEA documentation
        score_type
        : See fGSEA documentation
        min_set_size
        : See fGSEA documentation
        max_set_size
        : See fGSEA documentation
        eps
        : See fGSEA documentation
        database
        : Comma-separated list of databases to test for enrichments. Detailed descriptions of databases can be found here. Options:
            c2.cgp
            : Chemical and genetic perturbations
            c2.cp.biocarta
            : BioCarta
            c2.cp.kegg
            : KEGG
            c2.cp.reactome
            : Reactome
            c2.cp
            : PID
            c5.bp
            : GO biological process
            c5.cc
            : GO cellular component
            c5.mf
            : GO molecular function
            c6.all
            : Oncogenic signatures
            c7.all
            : Immunologic signatures
            all
            : All gene sets (c2.cp.reactome, c2.cp.kegg, c5.bp, c5.cc, c5.mf)
gsea_summarize_parameters
: Parameters to summarize GSEA data
    distance_metric
    : Metric to calculate distance between terms. Options: "kappa", "jaccard", "dice", "overlap"
    clustering_method
    : Method to cluster terms. Options: "binary_cut", "louvain", "mclust"
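To give a feel for how the distance_metric options differ, the sketch below computes the textbook set-similarity definitions that share these names, treating each term as a set of genes. This is an illustration only: the function name and gene sets are hypothetical, the kappa variant needs a gene universe (an assumption here), and how the pipeline converts similarity to distance is not stated above.

```python
def term_similarity(genes_a, genes_b, metric, universe=None):
    """Similarity between two terms represented as sets of genes."""
    a, b = set(genes_a), set(genes_b)
    shared = len(a & b)
    if metric == "jaccard":
        return shared / len(a | b)
    if metric == "dice":
        return 2 * shared / (len(a) + len(b))
    if metric == "overlap":
        return shared / min(len(a), len(b))
    if metric == "kappa":
        # Cohen's kappa on gene membership; requires the full gene universe.
        if universe is None:
            raise ValueError("kappa needs the gene universe")
        n = len(set(universe))
        neither = n - len(a | b)
        observed = (shared + neither) / n
        expected = (len(a) * len(b) + (n - len(a)) * (n - len(b))) / (n * n)
        if expected == 1:
            return 0.0  # degenerate case: agreement is fully expected
        return (observed - expected) / (1 - expected)
    raise ValueError(f"unknown metric: {metric!r}")


# Hypothetical terms sharing two of their three genes.
term_1 = {"g1", "g2", "g3"}
term_2 = {"g2", "g3", "g4"}
all_genes = {f"g{i}" for i in range(1, 11)}
jaccard = term_similarity(term_1, term_2, "jaccard")
```

Overlap is the most permissive of the four for nested gene sets, since a small term fully contained in a large one scores 1.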