
add a leftjoin! (or match! or merge! or whatever it should be called) #2259

Closed
ExpandingMan opened this issue May 15, 2020 · 11 comments · Fixed by #2843
Comments

@ExpandingMan
Contributor

I very frequently find myself needing to do

```julia
df = leftjoin(df, rdf, on=cols, validate=(false, true))
```

It's not that big a deal to have a new dataframe (after all, the underlying columns weren't copied) but I find myself having to do this so often (this is probably one of the most commonly used operations for me) that I would really love a mutating function for this. Maybe something like

```julia
match!(df, rdf, on=cols)
```

I started to look into what would be involved in doing this, but `_join` is pretty monolithic, so it wasn't immediately obvious. I'd think the biggest subtlety would be whether the ordering of rows in `df` can change (but I don't think it can even as things are?).

Thoughts? Am I the only one who does this all the time?
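To make the intent concrete, here is a minimal sketch of what such a mutating left join could do. The name `leftjoin_sketch!` and the single-column `on` are illustrative only, not a proposed API, and it assumes the key column is unique in `rdf` (the `validate=(false, true)` case above):

```julia
using DataFrames

# Illustrative sketch only (leftjoin_sketch! is a hypothetical name):
# add rdf's non-key columns to df in place, keeping df's row order.
# Assumes the key column `on` has unique values in rdf.
function leftjoin_sketch!(df::DataFrame, rdf::DataFrame; on::Symbol)
    rowmap = Dict(k => i for (i, k) in enumerate(rdf[!, on]))  # right key -> row
    idx = [get(rowmap, k, 0) for k in df[!, on]]               # 0 means no match
    for name in propertynames(rdf)
        name == on && continue
        col = rdf[!, name]
        df[!, name] = [i == 0 ? missing : col[i] for i in idx]
    end
    return df
end
```

Note that nothing from `df` is copied here; only the match indices into `rdf` are computed, which is exactly why an in-place variant is attractive.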

@bkamins
Member

bkamins commented May 15, 2020

First we should finalize the API for #2243 as this is related. When we have a decision on "update" stuff, then this can be implemented at the same time.

`leftjoin!` should be relatively easy to add; it should just be `leftjoin`, but in place for its first argument.

@bkamins bkamins added this to the 1.x milestone May 15, 2020
@ExpandingMan
Contributor Author

If you're open to doing it before 1.0, I'd be happy to work on it after #2243 is finalized, because, like I said, I really use this quite a lot. If you think of it, feel free to post here again when #2243 is ready and I'll likely jump on it.

@bkamins
Member

bkamins commented May 15, 2020

Sure! I think the 1.0 release will be at the end of 2020, so there is plenty of time for it. Thank you!

Note, though, that for performance reasons this should be a special implementation anyway (as we will not copy the left data frame and will only need to compute the indices for the right data frame).

@ExpandingMan
Contributor Author

Yeah, if and when I do it, I might look into breaking up `_join` into some more convenient and reusable pieces, if you guys are open to that (we don't necessarily need to discuss it now or here).

@bkamins
Member

bkamins commented May 15, 2020

Well, we will probably redesign `_join` anyway. It has sub-par performance now, but we are not yet sure what we should do to improve it.

@ExpandingMan
Contributor Author

I was really tempted to start messing around with `_join`, but I told myself I have to stay focused and finish what I was already working on. Maybe I'm just being naive, but I feel like it should be easier than split-apply-combine anyway.

I'd really like to rewrite the joins to use `groupby`, then perform set operations on the keys, and then somehow concatenate the views (into `Array`s) given by `groupby`. Anyway, I'll open a separate issue if I ever really start digging into rewriting the joins, but more likely one of you will get to it before I do.

@bkamins
Member

bkamins commented May 16, 2020

Actually, I was considering defining `*join` for `GroupedDataFrame` objects (maybe as an additional definition). However, my initial testing showed that creating two `GroupedDataFrame`s and then joining them was slower than what we have now (though my implementation might have been inefficient).

However, if we can make it fast, the additional benefit of such an approach would be that doing several joins with the same data frame would be much faster (the groups have to be computed only once).

So in general: this direction is very promising and I would like to jointly explore it more soon!
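To illustrate the "compute the groups only once" benefit with plain Base Julia (a toy stand-in for a `GroupedDataFrame`, not DataFrames internals; all names and values here are made up):

```julia
# Toy illustration: precompute the key -> row index mapping once,
# then reuse it for several joins against the same left table.
build_index(keys) = Dict(k => i for (i, k) in enumerate(keys))

left_keys = [10, 20, 30]
index = build_index(left_keys)   # computed a single time

# Each subsequent join only probes the existing index (0 = no match):
matches1 = [get(index, k, 0) for k in [20, 30, 40]]
matches2 = [get(index, k, 0) for k in [10, 40]]
```

With `n` joins against the same table, the grouping cost is paid once instead of `n` times; only the probing work repeats.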

@ExpandingMan
Contributor Author

ExpandingMan commented May 16, 2020

Ok, interesting to hear we had the same thought about this.

I can't think of any reason why it couldn't be efficient. The only potential issue with this approach that concerned me is that generating the group keys and then hashing them might be less efficient than just calling a dead-simple broadcasted hash function on the rows before even getting into the `groupby` machinery. I would think the most expensive part of all this is the actual concatenation (we clearly don't want views for the joins), and I don't see any reason that couldn't be just as fast via `groupby` as via a more direct route. There may also be some extra "fluff" in the user-facing `groupby` function, there to make it nicer and safer to work with, that we might strip out for extra efficiency in joins, but I wouldn't expect any of that to be very expensive.

It seems to me that a groupby is a subroutine of a join, so even if we find some reason that the existing `groupby` would be too slow, it might be worth modifying `groupby` (admittedly quite tricky, since that operation is already impressively fast) or adding an alternative groupby algorithm that joins can use. As you implied, it would be really cool to give users the ability to do joins on `GroupedDataFrame`s.

@bkamins
Member

bkamins commented May 16, 2020

As I think of it I see the following performance optimization:

  1. Do `groupby` on the left data frame (possibly keeping some extra metadata).
  2. Then do `groupby` on the right data frame, but not anew: treat its rows "as if" they were further rows of one bigger data frame. In this way you get proper matching of groups between the two data frames "for free". Actually, for the right data frame you can store, for each group number in the left data frame, the list of matching rows. In this way you can do everything in one pass.

@ExpandingMan
Contributor Author

Hm, interesting, but wouldn't that require large modifications to the groupby code, since it's not currently set up to track locations across multiple data frames? I suppose you could create some sort of lazily concatenated DataFrame type instead (or perhaps you mean horizontal concatenation, which is more or less free, but I don't think that would handle the final concatenation of the joined rows for you). I see that this would let you get your joins just by calling `combine`, but I don't really understand what real performance benefit it would give over calling `groupby` on the two data frames separately (though I haven't looked at the groupby code very deeply, so I'm ignorant on this subject).

@bkamins
Member

bkamins commented May 16, 2020

The benefit is that by calling `groupby` twice you create two vectors, say:

```julia
groups1 = [1,1,2,3,4,5,2,1]
groups2 = [1,2,1,2,3,1]
```

and after this you still need to map between the group numbers in `groups1` and `groups2`, e.g. to determine whether group 1 in `groups1` has a matching number in `groups2` and, if so, what it is. And even once you have these mappings, you still need to perform the proper row matching.

On the other hand, if you do it in a single processing step, you do not need to create `groups2` at all; instead you create an array (probably not the best data structure, but it is the simplest, so I use it to show the idea):

```julia
groups1 = [1,1,2,3,4,5,2,1]
rows2groups1 = [[1,2], [5], [3, 4], [], []]
```

telling you which rows of the second data frame map to which groups of the first data frame. Given such a data structure, it is really easy to properly resize the first data frame (you just look at the lengths of the `rows2groups1` entries; the only special case is length 0, in which case you use length 1). For the second data frame, you just concatenate the vectors from `rows2groups1` matching the groups in `groups1`, with the special case that if a vector has length 0 you instead fill in with missing values.
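The one-pass construction above can be sketched in plain Base Julia. The keys `:a` through `:f` are made up here, chosen so the output reproduces the `groups1` and `rows2groups1` vectors from the example:

```julia
# Sketch of the one-pass idea with plain vectors (illustrative, not DataFrames internals).
left_keys  = [:a, :a, :b, :c, :d, :e, :b, :a]   # 8 rows, 5 distinct groups
right_keys = [:a, :a, :c, :c, :b, :f]           # :f matches no left group

# Pass 1: group the left keys, numbering groups in order of first appearance.
group_of = Dict{Symbol,Int}()
groups1 = Int[]
for k in left_keys
    g = get!(group_of, k, length(group_of) + 1)
    push!(groups1, g)
end

# Pass 2: bucket each right row under the matching left group, if any,
# instead of building an independent groups2 vector.
rows2groups1 = [Int[] for _ in 1:length(group_of)]
for (i, k) in enumerate(right_keys)
    g = get(group_of, k, 0)
    g != 0 && push!(rows2groups1[g], i)
end
```

The second pass reuses `group_of` from the first, which is what makes the "as if it were further rows of a bigger data frame" trick a single-pass matching.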
