Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Strange behaviour of join_nearest* #76

Closed
smped opened this issue Feb 7, 2020 · 9 comments
Closed

Strange behaviour of join_nearest* #76

smped opened this issue Feb 7, 2020 · 9 comments

Comments

@smped
Copy link

smped commented Feb 7, 2020

Hi Stuart,

Love the package (obviously), but I'm having some dramas with join_nearest*(). In my following example, I need to join feature1 with gene1. The gene is upstream, on the opposite strand to my feature.

library(plyranges)
a <- GRanges("1:11-12:+")
mcols(a)$name <- "feature1"
a
GRanges object with 1 range and 1 metadata column:
      seqnames    ranges strand |        name
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1     11-12      + |    feature1
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
b <- GRanges(c("1:1-2:-", "1:21-22:+"))
mcols(b)$gene <- c("gene1", "gene2")
b
GRanges object with 2 ranges and 1 metadata column:
      seqnames    ranges strand |        gene
         <Rle> <IRanges>  <Rle> | <character>
  [1]        1       1-2      - |       gene1
  [2]        1     21-22      + |       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Given that gene1 is on the opposite strand to feature 1, join_nearest_upstream() and join_nearest_downstream() behave exactly how I expect.

join_nearest_upstream(a, b)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name        gene
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
join_nearest_downstream(a, b)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        name        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]        1     11-12      + |    feature1       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

However, if strand information is ignored by join_nearest_left() and join_nearest_right(), I cannot see how the following is able to be explained.

join_nearest_left(a, b)
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name        gene
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
join_nearest_right(a, b)
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        name        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]        1     11-12      + |    feature1       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

Is the strand information only ignored for one of the ranges (x = a, y = b)?.
If I manually remove the strand information, I still cannot join these ranges.

join_nearest_left(unstrand(a), unstrand(b))
GRanges object with 0 ranges and 2 metadata columns:
   seqnames    ranges strand |        name        gene
      <Rle> <IRanges>  <Rle> | <character> <character>
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
join_nearest_right(unstrand(a), unstrand(b))
GRanges object with 1 range and 2 metadata columns:
      seqnames    ranges strand |        name        gene
         <Rle> <IRanges>  <Rle> | <character> <character>
  [1]        1     11-12      * |    feature1       gene2
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths

It seems that feature1 & gene1 are unable to be joined by any method here. Should I be using something else?

Thanks,

Steve

@sa-lee
Copy link
Collaborator

sa-lee commented Feb 7, 2020

Hi Steve,

That does seem odd! I’ll take a look at this in more detail next week. In general the docs for these functions could use a bit of work.

Have you tried -join_follow() and friends they might be useful too?

Thanks
Stuart

@smped
Copy link
Author

smped commented Feb 7, 2020

Thanks Stuart.

join_follow/join_follow_left() do work, so that might be a possible fix. We're trying to migrate a bedtools workflow to plyranges, so I'll compare results across the two methods once my collaborator is back in with the data from the original run.

Cheers,

Steve

@sa-lee
Copy link
Collaborator

sa-lee commented Feb 17, 2020

Hi Steve,

Glad to hear that follow works. Looking further I'm not so sure if this is a bug, but maybe the documentation is confusing.

plyranges uses IRanges::nearest() under the hood here, with select = "all" and ignore.strand = TRUE used by default. From the nearest docs:

The conventional nearest neighbor finder. Returns an integer vector containing the index of the nearest neighbor range in subject for each range in x. If there is no nearest neighbor (if subject is empty), NA's are returned.

In your example, running nearest(a,b) will always return a vector or hits object parallel to the input (i.e. length 1) and in your example the distance of both ranges in b to those in a is 8, so nearest selects the one with the largest index value.

IRanges::nearest(a,b, ignore.strand = TRUE, select = "all")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  -------
  queryLength: 1 / subjectLength: 2

IRanges::nearest(a, b, ignore.strand = FALSE, select = "all")
Hits object with 1 hit and 0 metadata columns:
      queryHits subjectHits
      <integer>   <integer>
  [1]         1           2
  -------
  queryLength: 1 / subjectLength: 2

Hence the identical results for both upstream/right. I realise here that the docs are probably not clear on this matter but hopefully that helps!

@ShanSabri
Copy link

Hi Stuart -- Thanks for the great package. I was wondering if it was possible to find the k-nearest neighbors as opposed to the single nearest. For example, I'm interested in tagging ATAC peaks with the 5 nearest genes. I've opened up an issue on GenomicRanges() regarding an unexported KNN function and was wondering if you had any insight.

The functions below seem to work perfectly for k=1 nearest neighbor, but I'd like to extend this to k>1.

>   IRanges::nearest(peaks, tss, ignore.strand = FALSE, select = "all") # k = 1; nearest peak to loci
Hits object with 295913 hits and 0 metadata columns:
           queryHits subjectHits
           <integer>   <integer>
       [1]         1       15215
       [2]         2       15215
       [3]         3       15215
       [4]         4       15215
       [5]         5       15215
       ...       ...         ...
  [295909]    295640       16535
  [295910]    295641       16535
  [295911]    295642       16535
  [295912]    295643       16535
  [295913]    295644       16535
  -------
  queryLength: 295644 / subjectLength: 18436

>   GenomicRanges::distanceToNearest(peaks, tss, select = "all"))# k = 1; nearest peak to loci
Hits object with 295913 hits and 1 metadata column:
           queryHits subjectHits |  distance
           <integer>   <integer> | <integer>
       [1]         1       15215 |    107265
       [2]         2       15215 |    107065
       [3]         3       15215 |    106865
       [4]         4       15215 |    106665
       [5]         5       15215 |    106465
       ...       ...         ... .       ...
  [295909]    295640       16535 |     42858
  [295910]    295641       16535 |     43058
  [295911]    295642       16535 |     43258
  [295912]    295643       16535 |     43458
  [295913]    295644       16535 |     43658
  -------
  queryLength: 295644 / subjectLength: 18436

Any help would be much appreciated!

BTW - It's Shan. We interned together at Genentech. I hope things are going well for you!

@sa-lee
Copy link
Collaborator

sa-lee commented Mar 2, 2020

Hi Shan,

Nice to hear from you, hope you're doing well!
One strategy you could try in this particular case is to expand on the transcription start site, then use the left overlap join + an aggregation. Take a look a here for an example of this.

also @ShanSabri - if you like could you please make another issue for this? that way i can get to implementing it at some point (but also PRs welcome ;P)

@ShanSabri
Copy link

ShanSabri commented Mar 2, 2020

expand on the transcription start site, then use the left overlap join + an aggregation

Hmm this is still unclear to me even after looking at the example. I should clarify that I'm able to compute the overlap over a range, but that's not really what I'm aiming to do here.

From the example you posted:

For any gene that has no overlaps, the DA peak columns will have NA’s

I'd like to take the, say 3, nearest genes to a given position (even if they are very far) while also returning their corresponding distances.

Let's continue discussion over here.

Thanks for getting back!

@sa-lee
Copy link
Collaborator

sa-lee commented Mar 2, 2020

No problem, I think the example is going the other way around from what you want as well.

@sa-lee
Copy link
Collaborator

sa-lee commented May 8, 2020

@steveped I'm assuming you managed to resolve this, so closing the issue now.

@sa-lee sa-lee closed this as completed May 8, 2020
@smped
Copy link
Author

smped commented May 11, 2020

Thanks @sa-lee. Passed the conversation to my confused collaborator & they took it from here. Really appreciated the help too.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants