Skip to content

Accelerate rename detection and range-diff #519

@dscho

Description

@dscho
Member

When detecting renames between m and n files, as well as when it detects similarities in git range-diff between m and n commits, Git currently performs m times n comparisons, which is quite expensive for larger m and n.

There are a number of ways to accelerate that, primarily by pre-processing the file contents/commits and then trying to find correspondences in a more guided manner on the processed items. The most obvious approaches are all based on performing some sort of Nearest Neighbor Search.

A large part of this project will be to compare the available approaches to determine which one to implement, then implement it in libgit.a and use it for the rename detection and for commit matching in git range-diff.

Ideas

Approximate Nearest Neighbor Search

There are quite a few algorithms to perform "Approximate Nearest Neighbor Search". See e.g. a comparison at https://github.com/erikbern/ann-benchmarks.

Locality-sensitive hashing

Locality-sensitive hashing was made famous by its application in web search.

Other methods

Activity

dscho

dscho commented on Jan 15, 2020

@dscho
MemberAuthor

I just noticed that @newren is scheduled to speak at Git Merge 2020 about "Scaling the Merge Machinery". The link https://git-merge.com/scaling-the-merge-machinery is not yet populated, apparently, but in a promotional email, I saw this description:

An overlooked dimension of scaling Git is the large number of renames. In this talk, Elijah will share how Palantir Technologies has worked to optimize rename detection and his learnings from the process.

This talk is therefore at least related to what is described in this here ticket.

newren

newren commented on Mar 10, 2020

@newren

Some general notes:

  • My talk did not discuss the approaches suggested here, and I have not implemented them. None of my approaches were incompatible with these ideas, thus they remain as possible additional optimizations. I'm not sure if they'll be needed given the improvements I have already made and am working on.
  • My early brief reading on some of these methods from a while ago (and I could have easily misunderstood) is that they are probabilistic in nature, which scared me a bit. Saying that we've only compared 5% of two files but are 95% sure they're a rename may sound good, but if we take that probability as good enough, then we run the risk of false positives and false negatives. In web searches, that's not a big deal. If we do merging based on bad detection and commit the result or force the user to deal with conflicts, then it's a bigger problem. However, if we only used it to guide decisions (e.g. these two files look likely to be renames, let's do the real comparison to check, and if they are, don't compare those files to any of the other files even if there might be a better match), then this could be very reasonable -- in fact it would be very similar to the basename comparisons I mention in my talk.
  • My talk could be summarized as "There are four 'special cases' where we can avoid including certain paths in the matrix of rename comparisons. These 'special cases' aren't special but are actually common. In fact, combined they are so common that they represent an overwhelming majority of cases. We can skip almost all the work, with virtually no change in behavior."

Some specifics about the talk I gave:

  • In a copy of the linux repo, when rebasing some hwmon-updates (35 patches) from 5.5 back to 5.4, I saw a speedup factor of 3. That case had very few renames, and rename detection only accounted for 50% of overall time (which would have suggested a maximum possible speedup factor of 2, but I achieved 3 because I optimized more than just rename detection).
  • I turned that same testcase into a stress test for rename detection by checking out 5.4 and renaming drivers/ -> pilots/ (resulting in 26k file renames), then trying to rebase the patches from 5.5 on top of that. The time required to rebase the patches was over one and a half hours on my laptop with git-2.25.0. With my updates, it dropped to half a minute.
  • The slides for my talk can be found at https://github.com/newren/presentations/blob/pdfs/merge-performance/merge-performance-slides.pdf with steps to reproduce the performance improvements I cited in my talk found at https://github.com/newren/git/blob/git-merge-2020-demo/README.md.

Hope that helps,
Elijah

sheikhhamza012

sheikhhamza012 commented on Mar 21, 2020

@sheikhhamza012

@dscho I would love to work on this or on the conversion of shell to builtin files , can you guide me through the process? :)

newren

newren commented on Mar 21, 2020

@newren

This might not be a good project to work on right now; it may turn out to not be useful at all after my other optimizations, plus you'd have to be working with diffcore-rename.c while I'm busy making all kinds of other modifications to it. I'm not stopping you if you really want to work on it, just giving you a heads up.

newren

newren commented on Mar 21, 2020

@newren

(The conversion of shell to builtin files or other things within git are fine projects to work on, of course.)

sheikhhamza012

sheikhhamza012 commented on Mar 21, 2020

@sheikhhamza012

@newren thanks for replying, actually i am also planning to participate google summer of code and i wanted to contribute to git and i would really appreciate if you could guide me, i have already submitted a patch for and open issue about git bisect not working in sub directories and i want to work more on git. Do you think it would be possible?

newren

newren commented on Mar 21, 2020

@newren

You're going off topic. ;-) This ticket should remain about its subject, accelerating rename detection and range-diff. Briefly, though, I don't have the capacity to take on a GSoC student this year. I did a quick search for your contributions and saw that they haven't made it to the git mailing list, which is where most potential mentors will be watching, not GitHub. I did a search and found git#736 and noticed that the gitgitgadget-git bot provided helpful suggestions on your patches, provided numerous links, and even pointed you at the git-mentoring@googlegroups.com mailing list (which is under-utilized), which I was going to suggest as well. Hopefully that'll help connect you where you need to go. Best of luck!

sheikhhamza012

sheikhhamza012 commented on Mar 22, 2020

@sheikhhamza012

thank you so much for the advice, I have just submitted the patch to mailing list and will look into the group you mentioned. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

      Development

      No branches or pull requests

        Participants

        @dscho@newren@sheikhhamza012

        Issue actions

          Accelerate rename detection and range-diff · Issue #519 · gitgitgadget/git