Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement group_by and with_groups for tidySingleCellExperiment and tidyseurat #71

Open
stemangiola opened this issue Jul 20, 2023 · 15 comments
Assignees
Labels
enhancement New feature or request

Comments

@stemangiola
Copy link
Owner

stemangiola commented Jul 20, 2023

the issue for tidyseurat is here

stemangiola/tidyseurat#65

@stemangiola stemangiola added the enhancement New feature or request label Jul 21, 2023
@stemangiola stemangiola changed the title Implement group_by and with_groups for tidySingleCellExperiment Implement group_by and with_groups for tidySingleCellExperiment and tidyseurat Jul 23, 2023
@stemangiola stemangiola changed the title Implement group_by and with_groups for tidySingleCellExperiment and tidyseurat Implement group_by and with_groups for tidySCE and tidyseurat` Jul 28, 2023
@stemangiola stemangiola changed the title Implement group_by and with_groups for tidySCE and tidyseurat` Implement group_by and with_groups for tidySCE and tidyseurat Jul 28, 2023
@stemangiola stemangiola changed the title Implement group_by and with_groups for tidySCE and tidyseurat Implement group_by and with_groups for tidySingleCellExperiment and tidyseurat Jul 28, 2023
@stemangiola
Copy link
Owner Author

group_split is also a new useful function

@B0ydT
Copy link
Contributor

B0ydT commented Jul 29, 2023

I'm interested in having a go at this. I've looked at the dplyr code and think I have a handle on it. Are there any implementations, especially part of tidyomics that you'd prefer I use as a starting point?

@stemangiola
Copy link
Owner Author

stemangiola commented Jul 29, 2023

Hello @B0ydT, great initiative.

This is probably the transcriptomics hardest challenge (the one for tidySE is even harder).

Currently, group_by leads to a table in return, as I never approached the hard problem of making a proper API for this method.

This challenge needs a design document first, as we need to iron out all possible pitfalls of grouping an arbitrary set of columns and doing some operations on them. Some combinations may lead to valid single cell experiment objects, while others may lead to invalid objects, and therefore A tibble must be returned.

Let's start from group_by, imagine how many ways a user could group_by across cell metadata and reduced dimensions, and how many ways it could operate on this object, e.g. mutate, summarize, something else.

Then pretty much you have creative freedom in proposing a design document, describing the logics of the group_by API for SingleCellExperiment

@stemangiola
Copy link
Owner Author

I'm interested in having a go at this. I've looked at the dplyr code and think I have a handle on it. Are there any implementations, especially part of tidyomics that you'd prefer I use as a starting point?

Any news? Let me know if you need help.

@B0ydT
Copy link
Contributor

B0ydT commented Sep 15, 2023

Hi @stemangiola. Apologies for the slow response, I've been in the middle of changing jobs. I certainly wasn't looking to jump on the thorniest issue of the lot; It took your comment for me to see how this could be quite a lot trickier than the dplyr implementation.

I have tried to think through the issues you've raised. From what I remember of the dplyr implementation, you define the groups when you group_by, but most error detection occurs when you attempt to perform an operation on the grouped object. If we're starting with with_groups as opposed to methods for each relevant function, I could catch those errors and provide informative feedback to the user. Otherwise, I'm very happy to help with implementation, but I'm not sure I'm the right person to work it through from first principles.

@stemangiola
Copy link
Owner Author

Hello @B0ydT, congrats for your new job!

You can start with group_split, it is useful and much easier.

A quick function for backend I found is

splitColData <- function(x, f) {
  # This is by @jma1991
  # at https://github.com/drisso/SingleCellExperiment/issues/55
  
  i <- split(seq_along(f), f)
  
  v <- vector(mode = "list", length = length(i))
  
  names(v) <- names(i)
  
  for (n in names(i)) { v[[n]] <- x[, i[[n]]] }
  
  return(v)
  
}

Of course our interface will allow any of the cell-wise info, including reduced dimensions, etc..

@B0ydT
Copy link
Contributor

B0ydT commented Sep 23, 2023

Thank you very much! That looks great. I'll get it sorted shortly.

@B0ydT B0ydT mentioned this issue Oct 15, 2023
@B0ydT
Copy link
Contributor

B0ydT commented Dec 10, 2023

Stoked to have this merged. I'll repurpose for tidyseurat and tidySE in the next couple days. I'm also pretty confident I could get a decent implementation of with_groups built on top of this. Thanks for all of your feedback, I've been learning heaps.

Just to follow up, the groups column itself is preserved just fine (if .keep = TRUE), but if I try and add a new column for logical comparisons it gets mangled. dplyr adds a column for these comparisons, i.e. a > 5

tibble(a = 1:10) |> group_by(a>5)
# A tibble: 10 × 2
# Groups:   a > 5 [2]
       a `a > 5`
   <int> <lgl>  
 1     1 FALSE  
 2     2 FALSE  
 3     3 FALSE  
 4     4 FALSE  
 5     5 FALSE  
 6     6 TRUE   
 7     7 TRUE   
 8     8 TRUE   
 9     9 TRUE   
10    10 TRUE  

That said, if you're fine with it I'm fine with it, just wanted to be clear.

@stemangiola
Copy link
Owner Author

stemangiola commented Dec 10, 2023

Stoked to have this merged. I'll repurpose for tidyseurat and tidySE in the next couple days.

Repurposing to tidyseurat will take 3 minutes.

Repurposing to tidySE might take 3 weeks.

For tidySE, to avoid inefficiencies (some large pseudobulk might include 1 Billion rows), specific flows for samples-only queries and features-only queries must be designed, if the query is samples and features, we should decide what to do. For example, (1) it does not make sense, or (2) use the inefficient method with message, (3)...

For tidySE look at nest, unnest, where I developed such strategies.

In fact you could get away using the nest quite optimised framework I developed. I would suggest doing sonmething like this in the backend

data |> mutate(...) |> nest(data_nested = -...) |> pull(data_nested)

@B0ydT
Copy link
Contributor

B0ydT commented Dec 10, 2023

Repurposing to tidySE might take 3 weeks.

Aha, noted. I may not have much time between now and Christmas, but I'll see what I can do.

@stemangiola
Copy link
Owner Author

stemangiola commented Dec 10, 2023

Repurposing to tidySE might take 3 weeks.

Aha, noted. I may not have much time between now and Christmas, but I'll see what I can do.

On the other hand, if you manage to pull off my trick could take 5 minutes :)

@william-hutchison
Copy link
Collaborator

william-hutchison commented Jan 4, 2024

Hi @B0ydT, thanks for your great contribution! Please add your details to this authorship list if you would like to be included in our upcoming publication:

https://docs.google.com/spreadsheets/d/19XqhN3xAMekCJ-esAolzoWT6fttruSEermjIsrOFcoo/edit?usp=sharing

@B0ydT
Copy link
Contributor

B0ydT commented Jan 4, 2024

Thanks @william-hutchison, I will need to clear it with my supervisor but should complete it shortly.

@B0ydT
Copy link
Contributor

B0ydT commented Jan 4, 2024

@william-hutchison Have added my details, thanks again

@william-hutchison
Copy link
Collaborator

@B0ydT great, thanks for your work!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants