-
Notifications
You must be signed in to change notification settings - Fork 151
Description
When detecting renames between m
and n
files, as well as when it detects similarities in git range-diff
between m
and n
commits, Git currently performs m
times n
comparisons, which is quite expensive for larger m
and n
.
There are a number of ways to accelerate that, primarily by pre-processing the file contents/commits and then trying to find correspondences in a more guided manner on the processed items. The most obvious approaches are all based on performing some sort of Nearest Neighbor Search.
A large part of this project will be to compare the available approaches to determine which one to implement, then implement it in libgit.a
and use it for the rename detection and for commit matching in git range-diff
.
Ideas
Approximate Nearest Neighbor Search
There are quite a few algorithms to perform "Approximate Nearest Neighbor Search". See e.g. a comparison at https://github.com/erikbern/ann-benchmarks.
- Hierarchical Navigable Small World graphs
- Navigating Spreading-out Graph (there is also some MIT-licensed C++ source code)
- Neighborhood Graph and Tree
Locality-sensitive hashing
Locality-sensitive hashing was made famous by its application in web search.
Other methods
- Classifying with a pre-trained Support Vector Machine (see e.g. how Gerrit does it)
Activity
dscho commentedon Jan 15, 2020
I just noticed that @newren is scheduled to speak at Git Merge 2020 about "Scaling the Merge Machinery". The link https://git-merge.com/scaling-the-merge-machinery is not yet populated, apparently, but in a promotional email, I saw this description:
This talk is therefore at least related to what is described in this here ticket.
newren commentedon Mar 10, 2020
Some general notes:
Some specifics about the talk I gave:
Hope that helps,
Elijah
sheikhhamza012 commentedon Mar 21, 2020
@dscho I would love to work on this or on the conversion of shell to builtin files , can you guide me through the process? :)
newren commentedon Mar 21, 2020
This might not be a good project to work on right now; it may turn out to not be useful at all after my other optimizations, plus you'd have to be working with diffcore-rename.c while I'm busy making all kinds of other modifications to it. I'm not stopping you if you really want to work on it, just giving you a heads up.
newren commentedon Mar 21, 2020
(The conversion of shell to builtin files or other things within git are fine projects to work on, of course.)
sheikhhamza012 commentedon Mar 21, 2020
@newren thanks for replying, actually i am also planning to participate google summer of code and i wanted to contribute to git and i would really appreciate if you could guide me, i have already submitted a patch for and open issue about git bisect not working in sub directories and i want to work more on git. Do you think it would be possible?
newren commentedon Mar 21, 2020
You're going off topic. ;-) This ticket should remain about its subject, accelerating rename detection and range-diff. Briefly, though, I don't have the capacity to take on a GSoC student this year. I did a quick search for your contributions and saw that they haven't made it to the git mailing list, which is where most potential mentors will be watching, not GitHub. I did a search and found git#736 and noticed that the gitgitgadget-git bot provided helpful suggestions on your patches, provided numerous links, and even pointed you at the git-mentoring@googlegroups.com mailing list (which is under-utilized), which I was going to suggest as well. Hopefully that'll help connect you where you need to go. Best of luck!
sheikhhamza012 commentedon Mar 22, 2020
thank you so much for the advice, I have just submitted the patch to mailing list and will look into the group you mentioned. :)