Establishes some general functions relating to single-cell DGE #18
…im if would leave none, py-only catch if no metadata would be added, py-only catch and remove fake pseudobulks created by scanpy
this looks awesome!
I think it would be worthwhile to include a pre-pseudobulk filter -- e.g. only pseudobulk a sample/cell type pair if there are at least X cells of that cell type in that sample. And possibly, downstream, a corresponding DEG filter that only pulls a DEG comparison if there are at least N samples per group?
# Count the pseudobulks built from fewer than 'min_cells' cells
too_small <- sum(psobject@meta.data[, output.metadata.cell.count] < min_cells)
if (too_small == ncol(psobject)) {
  # Trimming would leave nothing, so warn and keep everything
  warning(paste0("Skipping trimming of pseudobulks smaller than 'min_cells' as NONE were built from at least ", min_cells, " cells."))
} else if (too_small > 0) {
  msg_if("\tTrimming ", too_small, " pseudobulks built from fewer than ", min_cells, " cells.")
  psobject <- psobject[, psobject@meta.data[, output.metadata.cell.count] >= min_cells]
}
> I think it would be worthwhile to include a pre-pseudobulk filter -- e.g. only pseudobulk a sample/cell type pair if there are at least X cells of that cell type in that sample
This is included already, here for the R function! It currently runs after the pseudobulking, but I could move it to before instead if there's good reason.
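For illustration, the pre-pseudobulk variant of this filter could be sketched as below in plain Python. This is not the PR's code: the function name `pseudobulk_with_min_cells`, the per-cell tuple layout, and the dict-based count vectors are all hypothetical stand-ins for the real objects.

```python
from collections import defaultdict


def pseudobulk_with_min_cells(cells, min_cells):
    """Sum raw counts per (sample, cell_type), skipping pairs with too few cells.

    `cells` is a list of (sample, cell_type, counts) tuples, where `counts`
    maps gene name -> raw count for one cell (hypothetical layout).
    Returns (pseudobulks, dropped_pairs).
    """
    # First pass: count the cells behind each (sample, cell_type) pair.
    n_cells = defaultdict(int)
    for sample, cell_type, _ in cells:
        n_cells[(sample, cell_type)] += 1

    # Only pairs with at least `min_cells` cells get pseudobulked.
    keep = {pair for pair, n in n_cells.items() if n >= min_cells}

    # Second pass: sum counts gene by gene within each retained pair.
    pseudobulks = defaultdict(lambda: defaultdict(int))
    for sample, cell_type, counts in cells:
        pair = (sample, cell_type)
        if pair in keep:
            for gene, count in counts.items():
                pseudobulks[pair][gene] += count

    dropped = sorted(pair for pair in n_cells if pair not in keep)
    return {pair: dict(genes) for pair, genes in pseudobulks.items()}, dropped


# Toy input: sample "s1" has only one B cell, so that pair is skipped.
cells = [
    ("s1", "T", {"GeneA": 2, "GeneB": 0}),
    ("s1", "T", {"GeneA": 1, "GeneB": 3}),
    ("s1", "B", {"GeneA": 5, "GeneB": 1}),
    ("s2", "T", {"GeneA": 0, "GeneB": 4}),
    ("s2", "T", {"GeneA": 2, "GeneB": 2}),
]
pseudobulks, dropped = pseudobulk_with_min_cells(cells, min_cells=2)
```

Filtering before aggregation gives the same retained pseudobulks as trimming afterwards; the main difference is that no work is spent building pseudobulks that would be discarded anyway.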
oh awesome! apologies, I should have looked more carefully. I just realized it is not in the dreamlet pseudobulk function, but then is implemented in the processAssays, so was kind of making a note to myself
I do think a downstream DEG filter requiring at least min.samples per category would also be useful, though
agreed! got pulled away before posting that half =)
Hmm, agreed. Perhaps a function that assesses the requested DGE comps per the
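One way the downstream check discussed in this thread might look is sketched below in plain Python. It is only an assumption about the eventual design: `check_min_samples` and its `min_samples` argument are hypothetical names, while `case_group` and `reference_group` follow the input names proposed in this PR.

```python
from collections import Counter


def check_min_samples(sample_groups, case_group, reference_group, min_samples):
    """Report whether one case-vs-reference comparison has enough samples.

    `sample_groups` maps sample name -> its `dge_by` group label, for the
    samples that survived any earlier filtering (hypothetical layout).
    Returns (ok, per_group_counts).
    """
    counts = Counter(sample_groups.values())
    n_case = counts.get(case_group, 0)
    n_ref = counts.get(reference_group, 0)
    # Both sides of the comparison must reach the threshold.
    ok = n_case >= min_samples and n_ref >= min_samples
    return ok, {case_group: n_case, reference_group: n_ref}


# Toy input: three case samples but only two reference samples.
sample_groups = {"s1": "case", "s2": "case", "s3": "case", "s4": "ref", "s5": "ref"}
ok, group_counts = check_min_samples(sample_groups, "case", "ref", min_samples=3)
```

Within a `run_<method>_within_cells` style loop, a check like this could be applied per cell type after the `min_cells` trimming, so that a comparison is skipped only for the cell types where a group falls short.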
In a recent DS Working Group meeting, we discussed the utility of adding some standardized functions for performing DGE with various tools. We also laid out a few helper functions -- pseudobulking, gene filtering -- that felt required across tools.
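As a rough illustration of the gene-filtering helper, the sketch below keeps genes whose CPM exceeds 1 in enough samples. It rests on assumptions: the function name `filter_genes_by_cpm` is hypothetical, the dict-of-dicts input stands in for a real count matrix, and `min_frac` is read here as a fraction of samples, which may not match the final definition.

```python
def filter_genes_by_cpm(counts, min_frac, min_cpm=1.0):
    """Keep genes with CPM > `min_cpm` in at least `min_frac` of samples.

    `counts` maps sample name -> {gene: raw count}; every sample is assumed
    to carry the same set of genes (hypothetical layout).
    """
    n_samples = len(counts)
    n_passing = {}
    for genes in counts.values():
        library_size = sum(genes.values())
        for gene, count in genes.items():
            # Counts per million, guarding against an empty sample.
            cpm = count / library_size * 1e6 if library_size else 0.0
            if cpm > min_cpm:
                n_passing[gene] = n_passing.get(gene, 0) + 1
    # Retain genes that pass in a large enough fraction of samples.
    return sorted(g for g, n in n_passing.items() if n / n_samples >= min_frac)


# Toy input: only GeneA is expressed in every sample.
counts = {
    "s1": {"GeneA": 900_000, "GeneB": 100_000, "GeneC": 0},
    "s2": {"GeneA": 500_000, "GeneB": 0, "GeneC": 500_000},
    "s3": {"GeneA": 1_000_000, "GeneB": 0, "GeneC": 0},
}
kept = filter_genes_by_cpm(counts, min_frac=0.5)
```

With `min_frac=0.5` only `GeneA` is retained here, because `GeneB` and `GeneC` each pass the CPM threshold in just one of the three samples.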
This PR will directly include the helper functions, and I'd propose that we use this `sc-dge-functions-branch` as the base branch that we'll PR all of our tool-specific DGE function builds into!
Planned functions:
Planned Standardizations:
- `run_<method>` for the functions that directly run DGE on a set of samples
- `run_<method>_within_cells` for the functions that loop across cell types, running the `run_<method>` within each cell type.
- `counts` = raw count matrix after feature selection
- `metadata` = sample metadata
- `dge_by` = column name of metadata holding sample groups
- `case_group` = name of the case group to be used as numerator in log2FC calculation
- `reference_group` = name of the reference group to be used as denominator in log2FC calculation
- `contrast` = vector of contrasts, e.g., `c("varCase - varRef", "varCase2 - varRef", etc)`
- `fixed_effects` = names of metadata columns to be used as fixed effects
- `random_effects` = names of metadata columns to be used as random effects
- `min_frac` = threshold for the minimum number of samples with CPM > 1 expression to be used in selecting genes to retain for DGE
- `min_cells` = threshold for the minimum number of cells that a sample should contain in order to be used.
- `cell_by` = (?=unconfirmed) column name of metadata holding cell annotation or cluster identities. Only needed in functions that require this!
- `cell_targets` = (?=unconfirmed) optional string vector holding cell annotation or cluster identities to focus.
- `dge_groups` = (=used within internal functions only; allows simpler provision / merging info from multiple inputs above) vector of all groups of `dge_by` that should be retained for the analysis
Side Note:
I have often found it hard to consistently import python functions across distinct methods of running -- jupyter notebooks, interactive shell, running a script with `python -u <script>`. The method that has been working for me best recently is: