Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature requeste for full join of GRanges #68

Open
alexyfyf opened this issue Aug 27, 2019 · 11 comments
Open

Feature requeste for full join of GRanges #68

alexyfyf opened this issue Aug 27, 2019 · 11 comments

Comments

@alexyfyf
Copy link

Hi All,

I've been working on combing GRanges from several datasets, and wish to have a function from your package that works similar to the tidyverse full_join which includes any extry from at least one Granges object. Also, metadata full_joined.

I have written a simple function to use by myself and worked on my dataset (GRanges with meta data). But I found it hard to incorporate to your packages because I think yours works on a range of *Ranges objects. Anyway, I would like to help and share it if anyone is interested to develop such a function.

join_overlap_full <- function(x,y){
  options(stringsAsFactors = F)
  hit <- findOverlaps(x,y)
  colnames(mcols(x)) <- paste0(colnames(mcols(x)),".x")
  colnames(mcols(y)) <- paste0(colnames(mcols(y)),".y")
  cns.x <- mcols(x) %>% colnames()
  cns.y <- mcols(y) %>% colnames()
  
  z <- x[queryHits(hit),]
  hitz <- cbind(mcols(y)[subjectHits(hit),],
                mcols(x)[queryHits(hit),])
  colnames(hitz) <- c(cns.y,cns.x)
  mcols(z) <- hitz
  uniqex <- x[-queryHits(hit),]
  uniqey <- y[-subjectHits(hit),]
  mcols(uniqex)[, (length(cns.x)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqex)) <- c(cns.x,cns.y)
  mcols(uniqex) <- mcols(uniqex)[,c((length(cns.x)+1):(length(cns.x)+length(cns.y)),1:(length(cns.x)))]
  mcols(uniqey)[, (length(cns.y)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqey)) <- c(cns.y,cns.x)
  res <- c(z, uniqex,uniqey)
  mcols(res) <- mcols(res) %>% data.frame() %>% mutate_all(as.character())
  return(res)
}
@sa-lee
Copy link
Collaborator

sa-lee commented Aug 27, 2019

Hi there, I'm not sure exactly what a full join would look like? Could you provide a simple example? If it makes sense you are more than welcome to make a pull request. Look at the code for https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-joins-outer.R and https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-find.R to see how I've constructed join_overlap_left()

@alexyfyf
Copy link
Author

Hi there, I'm not sure exactly what a full join would look like? Could you provide a simple example? If it makes sense you are more than welcome to make a pull request. Look at the code for https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-joins-outer.R and https://github.com/sa-lee/plyranges/blob/master/R/ranges-overlap-find.R to see how I've constructed join_overlap_left()

Thank you for your reply. I'll have a look and try to make a pull request.
The example would like this. If there's any function from your package could do this, please let me know.

gr1 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 3, 4)),
               IRanges(1:10, width=5),
               strand='-',
               score=101:110, GC=runif(10))
gr2 <- GRanges(seqnames=Rle(c('chr1', 'chr2', 'chr3'), c(3, 4, 3)),
               IRanges(6:15, width=10),
               strand='-',
               score=21:30)
join_overlap_full(gr1,gr2)

The output will be like:
gr1

GRanges object with 10 ranges and 2 metadata columns:
       seqnames    ranges strand |     score                 GC
          <Rle> <IRanges>  <Rle> | <integer>          <numeric>
   [1]     chr1       1-5      - |       101  0.930739698465914
   [2]     chr1       2-6      - |       102  0.566913234768435
   [3]     chr1       3-7      - |       103  0.682077450212091
   [4]     chr2       4-8      - |       104  0.889251226792112
   [5]     chr2       5-9      - |       105 0.0200071565341204
   [6]     chr2      6-10      - |       106  0.208122601732612
   [7]     chr3      7-11      - |       107 0.0573086701333523
   [8]     chr3      8-12      - |       108  0.578096957644448
   [9]     chr3      9-13      - |       109  0.802684629801661
  [10]     chr3     10-14      - |       110   0.71569463587366
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

gr2

GRanges object with 10 ranges and 1 metadata column:
       seqnames    ranges strand |     score
          <Rle> <IRanges>  <Rle> | <integer>
   [1]     chr1      6-15      - |        21
   [2]     chr1      7-16      - |        22
   [3]     chr1      8-17      - |        23
   [4]     chr2      9-18      - |        24
   [5]     chr2     10-19      - |        25
   [6]     chr2     11-20      - |        26
   [7]     chr2     12-21      - |        27
   [8]     chr3     13-22      - |        28
   [9]     chr3     14-23      - |        29
  [10]     chr3     15-24      - |        30
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

The full_joined

GRanges object with 17 ranges and 3 metadata columns:
       seqnames    ranges strand |   score.y   score.x               GC.x
          <Rle> <IRanges>  <Rle> | <integer> <integer>          <numeric>
   [1]     chr1       2-6      - |        21       102  0.566913234768435
   [2]     chr1       3-7      - |        21       103  0.682077450212091
   [3]     chr1       3-7      - |        22       103  0.682077450212091
   [4]     chr2       5-9      - |        24       105 0.0200071565341204
   [5]     chr2      6-10      - |        24       106  0.208122601732612
   ...      ...       ...    ... .       ...       ...                ...
  [13]     chr3      8-12      - |      <NA>       108  0.578096957644448
  [14]     chr1      8-17      - |        23      <NA>               <NA>
  [15]     chr2     11-20      - |        26      <NA>               <NA>
  [16]     chr2     12-21      - |        27      <NA>               <NA>
  [17]     chr3     15-24      - |        30      <NA>               <NA>
  -------
  seqinfo: 3 sequences from an unspecified genome; no seqlengths

@jacel
Copy link

jacel commented May 7, 2020

Hi, I was wondering if you've made progress on this? I'd really like to be able to do full joins in plyranges!

@sa-lee
Copy link
Collaborator

sa-lee commented May 8, 2020

Hi, no sorry! I'm pretty busy with other projects at the moment, but will see if I have time over the next couple of months to implement it.

@jacel
Copy link

jacel commented May 8, 2020

No problem! Thanks so much for creating plyranges in the first place. It's an awesome package, I use it all the time.

@alexyfyf
Copy link
Author

alexyfyf commented May 9, 2020

I use this function to full_join GRanges, might not fit in everyone's case (or maybe everyone has a different definition full_join).
I'm sure someone can make it better, so I would like to share here.

join_overlap_full <- function(x,y){
  options(stringsAsFactors = F)
  hit <- findOverlaps(x,y)
  colnames(mcols(x)) <- paste0(colnames(mcols(x)),".x")
  colnames(mcols(y)) <- paste0(colnames(mcols(y)),".y")
  cns.x <- mcols(x) %>% colnames()
  cns.y <- mcols(y) %>% colnames()
  
  z <- x[queryHits(hit),]
  hitz <- cbind(mcols(y)[subjectHits(hit),],
                mcols(x)[queryHits(hit),])
  colnames(hitz) <- c(cns.y,cns.x)
  mcols(z) <- hitz
  uniqex <- x[-queryHits(hit),]
  uniqey <- y[-subjectHits(hit),]
  mcols(uniqex)[, (length(cns.x)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqex)) <- c(cns.x,cns.y)
  mcols(uniqex) <- mcols(uniqex)[,c((length(cns.x)+1):(length(cns.x)+length(cns.y)),1:(length(cns.x)))]
  mcols(uniqey)[, (length(cns.y)+1):(length(cns.x)+length(cns.y))] <- NA
  colnames(mcols(uniqey)) <- c(cns.y,cns.x)
  res <- c(z, uniqex,uniqey)
  mcols(res) <- mcols(res) %>% data.frame() %>% mutate_all(as.character())
  return(res)
}

@sa-lee
Copy link
Collaborator

sa-lee commented May 9, 2020

This looks like it's on the right track! Take a look at https://github.com/sa-lee/plyranges/blob/ce998131510d1da7b3c9e47deba5946b5c33ab1d/R/ranges-overlap-joins-outer.R#L55-L86 which provides the template for doing the left outer join. You're basically adding an extra step to pull in the right hand side, i.e. only_right[subjectHits(hits)] <- FALSE. Happy for you to be added as a ctb if you want to put this in as PR with some unit tests :D

@alexyfyf
Copy link
Author

Hi @sa-lee , I will have a try.
So this overlaop_joins_outer is actually the function join_overlap_left?
I didn't see any other function in this wrapper. Do you think I should add my join_overlap_full on top of this?
Sounds a bit confusing.

@egenomics
Copy link

egenomics commented Jul 19, 2021

@sa-lee Have you resolved this issue? I don't understand if the function provided by @alexyfyf is valid or not.

@alexyfyf
Copy link
Author

At least it is valid in the test case, again, I am not sure if everyone has the same definition of full_join GRanges.

@markphillippebworth
Copy link

This is a fantastic package, and it's one of the critical dependencies for my own package in CRAN. Having a full_join function would be good addition, and round out the 'join' operations.

The previous function doesn't sort out the metadata right, tacking on .x and .y labels, even if the metadata column name is unique to one group.

The other way you could do a full join for objects GR1 and GR2 is as follows:
uniqueGR2 <- filter_by_non_overlaps(GR2, GR1)
overlapGR<- join_overlap_left(GR1, GR2)
bind_ranges(overlapGR, uniqueGR2)

This works once I've essentially pre-figured out which metadata columns I want to keep from each GRanges. Another way would be to add the ability to merge metadata if the column names are the same (i.e. sum, list, paste, etc..), or just add the .x and .y.
Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants