Establishes some general functions relating to single-cell DGE by dtm2451 · Pull Request #18 · UCSF-DSCOLAB/essential_scripts

dtm2451 · 2025-12-29T22:55:12Z

In a recent DS Working Group meeting, we discussed the utility of adding some standardized functions for performing DGE with various tools. We also laid out a few helper functions -- pseudobulking, gene filtering -- that felt required across tools.

This PR will directly include the helper functions, and I'd propose that we use this sc-dge-functions branch as the base branch that all of our tool-specific DGE function PRs target!

Planned functions:

pseudobulking
- Initial implementation
  - Uses Seurat::AggregateExpression, grouping by a cell.by metadata and any number of sample.by metadata.
  - Adds a cell count to the metadata & allows trimming pseudobulks based on too few cells (10 by default)
  - Also pulls in additional metadata, as Seurat only grabs the 'group.by' metadata automatically.
  - Allows targeting only certain features or cell identities
  - Allows outputting as either Seurat or a list ("counts" matrix + "metadata" data.frame)
- add dreamlet and scuttle implementations?
- add normalization options?
python pseudobulking
- Initial implementation
  - Uses scanpy.get.aggregate
  - Allows MuData (Muon) or AnnData (scanpy) objects.
  - Allows outputting as either AnnData or a dict ("counts" array + "feature_names" list + "metadata" DataFrame)
  - Very similar functionality to the R version, but definitely faster!
gene filtering
DGE_comparison_assessment: for parsing groups and checking whether there are sufficient samples to trust results.
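
To make the pseudobulking idea concrete, here is a minimal, dependency-free sketch of the core aggregation: sum raw counts per (sample, cell identity) group and record how many cells went into each group. It is not the dsco_pseudobulk implementation (which wraps Seurat::AggregateExpression or scanpy.get.aggregate); the function name `pseudobulk_sketch`, the plain-dict inputs, and the "n_cells" key are assumptions made for illustration, and only a single sample_by column is handled.

```python
from collections import defaultdict


def pseudobulk_sketch(counts, cell_meta, cell_by, sample_by):
    """Sum per-cell raw counts into one profile per (sample, cell identity).

    counts:    list of per-cell dicts mapping gene -> raw count (hypothetical layout)
    cell_meta: list of per-cell metadata dicts, in the same order as counts
    Returns (pb_counts, pb_meta), both keyed by (sample, cell identity).
    """
    sums = defaultdict(lambda: defaultdict(int))
    n_cells = defaultdict(int)
    for cell_counts, meta in zip(counts, cell_meta):
        key = (meta[sample_by], meta[cell_by])
        n_cells[key] += 1
        group = sums[key]  # created on first access, even for an all-zero cell
        for gene, value in cell_counts.items():
            group[gene] += value
    pb_counts = {key: dict(group) for key, group in sums.items()}
    # Track the cell count per pseudobulk so small groups can be trimmed later.
    pb_meta = {
        key: {sample_by: key[0], cell_by: key[1], "n_cells": n}
        for key, n in n_cells.items()
    }
    return pb_counts, pb_meta


# Toy data: three cells, two (sample, cell type) groups.
counts = [
    {"GeneA": 3, "GeneB": 0},
    {"GeneA": 1, "GeneB": 2},
    {"GeneA": 5, "GeneB": 1},
]
cell_meta = [
    {"sample": "s1", "cell_type": "T"},
    {"sample": "s1", "cell_type": "T"},
    {"sample": "s2", "cell_type": "B"},
]
pb_counts, pb_meta = pseudobulk_sketch(counts, cell_meta, "cell_type", "sample")
```

In the toy data, the two "s1" T cells collapse into one pseudobulk and the single "s2" B cell forms its own.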

Planned Standardizations:

function names:
- run_<method> for the functions that directly run DGE on a set of samples
- run_<method>_within_cells for the functions that loop across cell types, running the run_<method> within each cell type.
input names:
- counts = raw count matrix after feature selection
- metadata = sample metadata
- dge_by = column name of metadata holding sample groups
- case_group = name of the case group to be used as numerator in log2FC calculation
- reference_group = name of the reference group to be used as denominator in log2FC calculation
- contrast = vector of contrasts, e.g., c("varCase - varRef", "varCase2 - varRef", etc.)
- fixed_effects = names of metadata columns to be used as fixed effects
- random_effects = names of metadata columns to be used as random effects
- min_frac = threshold for the minimum number of samples with CPM > 1 expression to be used in selecting genes to retain for DGE
- min_cells = threshold for the minimum number of cells that a sample should contain in order to be used.
- ?cell_by = (?=unconfirmed) column name of metadata holding cell annotation or cluster identities. Only needed in functions that require this!
- ?cell_targets = (?=unconfirmed) optional string vector holding cell annotation or cluster identities to focus on.
- dge_groups = (used within internal functions only; simplifies providing / merging info from multiple inputs above) vector of all groups of dge_by that should be retained for the analysis
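
As one way to picture how counts and min_frac could interact in the gene filtering helper, here is a rough sketch. It assumes min_frac is a fraction of samples (the description above words it as a number of samples, so this is a guess), and the function name `filter_genes_sketch`, the dict-of-dicts input, and the 0.5 default are hypothetical rather than anything settled in this PR.

```python
def filter_genes_sketch(counts, min_frac=0.5, cpm_threshold=1.0):
    """Return genes with CPM > cpm_threshold in at least min_frac of samples.

    counts: dict mapping sample -> dict of gene -> raw count (hypothetical layout)
    """
    genes = sorted({gene for sample in counts.values() for gene in sample})
    library_sizes = {name: sum(sample.values()) for name, sample in counts.items()}
    n_samples = len(counts)
    keep = []
    for gene in genes:
        n_expressing = 0
        for name, sample in counts.items():
            if library_sizes[name] == 0:
                continue  # an empty sample cannot express anything
            cpm = sample.get(gene, 0) * 1e6 / library_sizes[name]
            if cpm > cpm_threshold:
                n_expressing += 1
        if n_samples > 0 and n_expressing / n_samples >= min_frac:
            keep.append(gene)
    return keep


# Toy pseudobulk counts: GeneC is only detected in one of three samples.
toy_counts = {
    "s1": {"GeneA": 500_000, "GeneB": 500_000, "GeneC": 0},
    "s2": {"GeneA": 600_000, "GeneB": 400_000, "GeneC": 0},
    "s3": {"GeneA": 999_995, "GeneB": 0, "GeneC": 5},
}
retained = filter_genes_sketch(toy_counts, min_frac=0.5)
```

With the toy data, GeneA passes in 3/3 samples and GeneB in 2/3, so both are retained at min_frac = 0.5, while GeneC (1/3) is dropped.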

Side Note:

I have often found it hard to consistently import python functions across distinct methods of running -- jupyter notebooks, interactive shell, running a script with python -u <script>. The method that has been working for me best recently is:

import sys
sys.path.insert(0, '/path/to/folder/ABOVE/essential_scripts/')
from essential_scripts.python_utils.ts_log import ts_log
from essential_scripts.single_cell.pseudobulk_function import dsco_pseudobulk
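
An optional variation on the same approach, in case the snippet gets re-run often (e.g. in a notebook cell): wrap the sys.path step so the folder is only added once. The helper name `add_repo_parent_to_path` is made up for this sketch and is not part of essential_scripts.

```python
import sys
from pathlib import Path


def add_repo_parent_to_path(repo_parent):
    """Put the folder that CONTAINS essential_scripts/ at the front of sys.path.

    The path is resolved to an absolute path first and is only inserted if it
    is not already present, so repeated calls do not pile up duplicates.
    """
    resolved = str(Path(repo_parent).expanduser().resolve())
    if resolved not in sys.path:
        sys.path.insert(0, resolved)
    return resolved


# Usage (left as comments because the path is a placeholder):
# add_repo_parent_to_path('/path/to/folder/ABOVE/essential_scripts/')
# from essential_scripts.single_cell.pseudobulk_function import dsco_pseudobulk
```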

erflynn · 2026-01-05T19:16:49Z

this looks awesome!
in case you want the dreamlet version:

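 # Assumes a Seurat v3/v4-style object, where raw counts live in object@assays$RNA@counts
 library(SingleCellExperiment)
 library(dreamlet)
 sce <- SingleCellExperiment(list(counts = object@assays$RNA@counts), colData = object@meta.data)
 pb <- aggregateToPseudoBulk(sce,
                             assay = "counts",
                             cluster_id = cell.by,
                             sample_id = sample.by,
                             verbose = FALSE)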

erflynn · 2026-01-09T18:25:59Z

I think it would be worthwhile to include a pre-pseudobulk filter -- e.g. only pseudobulk a sample/cell type pair if there are at least X cells of that cell type in that sample
I'm looking at the dreamlet::ProcessAssays() again and it does this as well, will make a note on my wrapper.

And possibly, downstream, a corresponding DEG filter that only pulls a DEG comparison if there are at least N samples per group?

dtm2451 · 2026-01-09T19:45:20Z

single_cell/pseudobulk_function.R

+        too_small <- psobject@meta.data[,output.metadata.cell.count] < min.cells
+        n_too_small <- sum(too_small)  # number of pseudobulks below the threshold
+        if (n_too_small == ncol(psobject)) {
+            warning(paste0("Skipping trimming pseudobulks smaller than 'min.cells' as NONE were built from at least ", min.cells, " cells."))
+        } else if (n_too_small > 0) {
+            msg_if("\tTrimming ", n_too_small, " pseudobulks built from fewer than ", min.cells, " cells.")
+            psobject <- psobject[,psobject@meta.data[,output.metadata.cell.count] >= min.cells]
+        }


I think it would be worthwhile to include a pre-pseudobulk filter -- e.g. only pseudobulk a sample/cell type pair if there are at least X cells of that cell type in that sample

This is included already, here for the R function! It currently runs after the pseudobulking, but we could move it to before instead if there's good reason.
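
For reference, the trimming behavior described here (drop pseudobulks built from fewer than the minimum number of cells, but skip the trim with a warning if it would leave nothing) can be sketched as below. This is an illustration of the logic only, not the code in either pseudobulk function; `trim_small_pseudobulks` and its dict input are hypothetical.

```python
import warnings


def trim_small_pseudobulks(cell_counts, min_cells=10):
    """Return the pseudobulk ids to keep.

    cell_counts: dict mapping pseudobulk id -> number of cells it was built from
    If every pseudobulk is below min_cells, nothing is trimmed and a warning is
    raised instead, mirroring the "skip trim if it would leave none" behavior.
    """
    too_small = [pb for pb, n in cell_counts.items() if n < min_cells]
    if cell_counts and len(too_small) == len(cell_counts):
        warnings.warn(
            f"Skipping trimming: no pseudobulk was built from at least {min_cells} cells."
        )
        return list(cell_counts)
    return [pb for pb, n in cell_counts.items() if n >= min_cells]


# Example: 'a' is dropped, while 'b' and 'c' meet the default threshold of 10.
kept = trim_small_pseudobulks({"a": 3, "b": 12, "c": 10})
```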

oh awesome! apologies, I should have looked more carefully. I just realized it is not in the dreamlet pseudobulk function, but then is implemented in the processAssays, so was kind of making a note to myself

I do think the downstream DEG filter requiring at least min.samples per category is also useful, though

agreed! got pulled away before posting that half =)

dtm2451 · 2026-01-09T21:34:06Z

And possibly, downstream, a corresponding DEG filter that only pulls a DEG comparison if there are at least N samples per group?

Hmm, agreed. Perhaps a function that assesses the requested DGE comparisons per the dge.group.by and related vars, or per the contrasts, however we end up setting those up to work instead... Adding a ToDo for this, a 'dge_comparison_assessment' function 👍, but I'm feeling there are extra bits to scope out before I'd start filling in this one.
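
To help scope it, a bare-bones version of the sample-count check could look like the sketch below. It only covers the simple case_group vs reference_group setup (no contrasts), and the function name `assess_comparison_sketch`, the `min_samples` argument, and its default of 3 are placeholders rather than agreed choices.

```python
from collections import Counter


def assess_comparison_sketch(metadata, dge_by, case_group, reference_group, min_samples=3):
    """Check whether both groups of a comparison have enough samples.

    metadata: list of per-sample dicts (one dict per pseudobulked sample)
    Returns (ok, n_per_group) where n_per_group maps each group to its sample count.
    """
    tallies = Counter(row[dge_by] for row in metadata)
    n_per_group = {group: tallies.get(group, 0) for group in (case_group, reference_group)}
    ok = all(n >= min_samples for n in n_per_group.values())
    return ok, n_per_group


# Toy sample metadata: three treated samples, two control samples.
sample_meta = [
    {"sample": "s1", "condition": "treated"},
    {"sample": "s2", "condition": "treated"},
    {"sample": "s3", "condition": "treated"},
    {"sample": "s4", "condition": "control"},
    {"sample": "s5", "condition": "control"},
]
ok, n_per_group = assess_comparison_sketch(
    sample_meta, "condition", case_group="treated", reference_group="control"
)
```

Here the comparison would be flagged as not OK, since the control group has only 2 samples against the placeholder minimum of 3.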

dtm2451 added 7 commits December 29, 2025 17:15

initialize 'dsco_pseudobulk' function

2402410

add missed 'Seurat::' callout

af57a19

actually remove ts_log need when 'verbose = FALSE'

f1281b2

initialize python pseudobulk function, slight parity alignments for R function

ec90754

python pseudobulk fxn docs update

9d0aff7

pseudobulk functions, multiple: messaging tweaks, skip 'min.cells' trim if would leave none, py-only catch if no metadata would be added, py-only catch and remove fake pseudobulks created by scanpy

b4ef070

pseudobulk functions: fix 'min.cells' checks

1dfe483

dtm2451 and others added 3 commits January 16, 2026 16:02

stub file to create 'single_cell/dge' folder

1934e13

initial commit of deseq2 function

13078a4

Revert "initial commit of deseq2 function"

290fd95

This reverts commit 13078a4.

dtm2451 wants to merge 10 commits into main from sc-dge-functions

3 participants