[Discussion] Stacked Conflicts Representation #5

quark-zju · 2021-03-04T15:04:37Z

quark-zju
Mar 4, 2021

You probably have thought about this. I was trying to compare the current conflict representation (two lists, (adds, removes)) with a conservative recursive struct (a tree of (local, other, base)).

To simplify the problem, only file modifications are considered, no deletions.

Consider these recursive structs:

(local: (local: A, other: B, base: C), other: D, base: E)
(local: (local: A, other: D, base: C), other: B, base: E)
(local: (local: A, other: B, base: E), other: D, base: C)

They map to a same flat list representation (IIUC):

(adds: [A, B, D], removes: [C, E])

Apparently some information (ex. relations between merge base and local/other) is lost in this representation but does it matter? To reason about it, here is a question I tried to answer:

Suppose doing 3-way merges using one of the recursive structs is conflict-free. Is there an algorithm that takes (adds: [A, B, D], removes: [C, E]) and produces a similar conflict-free merge result?

I couldn't come up with such an algorithm easily. I wonder if that indicates a recursive representation is more practical. It seems there are other benefits, too, such as easier to integrate with existing merge tools and easier to store resolve progress.

martinvonz · 2021-03-05T06:02:48Z

martinvonz
Mar 5, 2021
Maintainer

I was trying to compare the current conflict representation (two lists, (adds, removes)) with a conservative recursive struct (a tree of (local, other, base)).

Good question, and thanks for asking! That recursive representation was actually how I stored it initially (it seems like the most intuitive representation). The ambiguous representation was actually something that bothered me and I was happy when I realized it could be replaced by two lists :) I don't think I considered that the different recursive representations might carry more information.

Practically, the biggest problem I had with the recursive representation was how to simplify conflicts. As a reminder, simplification is done when merging trees. It makes it so continuously rebasing a commit onto a new upstream doesn't make keep the conflict more complex every time. I found it hard to reason about and code the simplification when representing conflicts as recursive structs. For example, all of these can be simplified:

(local: (local: A, other: B, base: A), other: C, base: D) => (local: B, other: C, base: D) 
(local: (local: A, other: B, base: C), other: D, base: A) => (local: B, other: D, base: C) 
(local: (local: A, other: B, base: C), other: (local: D, other: E, base: A), base: F) => (local: (local: B, other: D, base: C), other: E, base: F)
(local: A, other: B, base: (local: A, other: C, base: D)) => (local: B, other: D, base: C)

At least for the last two there, it's non-trivial to see what the correct simplification is. The list representation makes it very simple to do: you simply remove pairs of adds and removes that are the same.

Suppose doing 3-way merges using one of the recursive structs is conflict-free. Is there an algorithm that takes (adds: [A, B, D], removes: [C, E]) and produces a similar conflict-free merge result?

I'm not sure what you mean. I can think of three different things you may be thinking about :)

Cases like where A==C in your example. That's the "conflict simplification" I described above.
Cases where the nested conflict (local: A, other: B, base: C) gets resolved without conflicts after merging the file contents. Yes, that is indeed simpler in the struct-based representation (you replace that node in the tree by the resolved state). However, it could also be that trying to instead do the content merge of (local: B, other: C, base: D) would result in a conflict-free result. You would presumably not consider trying that in the struct-based representation. You might very well consider trying all combinations (with some upper bound) when using the list-based representation. I also plan to push down the conflict handling to the file level (or maybe I should say chunk level), so that an unresolved conflict in a file can get resolved again if you apply a diff that undoes one side of the conflict, even if that diff also makes other changes. I'm not sure how to deal with chunks that only partially overlap, and how to present that to the user. We'll see, I guess.
Doing a 3-way merge where e.g. the base and the local side are themselves conflicts, but they happen to be the same conflict. This case should be handled the same regardless of representation. The code that does the tree-level merge will see the same value in base and local and choose the other side.

Did at least one of those bullets answer your question?

It seems there are other benefits, too, such as easier to integrate with existing merge tools and easier to store resolve progress.

I'm not sure I follow. Even if the internal representation is the list-based one, we could still call a 3-way merge tool with just "remove" and two "adds" as usual. Then if the tool or the user resolves those, we can replace those entries in the lists by just one "add" for the resolved content. Does that answer your question or did I misunderstand?

By the way, I recently added a new type of conflict markers to Mercurial (https://phab.mercurial-scm.org/D9551). The inspiration to that format came from Jujube's handling of conflicts. I haven't gotten around to using the same format in Jujube, however.

Off topic: I've recently finished making it so the "commit index" (like what Git calls the "split commit graph") is kept up to date within a transaction and so it is written incrementally to disk. I'm now converting existing callers to use the index to speed them up. Then I plan to implement revsets using that. Perhaps I'll rewrite the file diffing and file merging as mentioned above after that.

0 replies

quark-zju · 2021-03-05T17:42:29Z

quark-zju
Mar 5, 2021
Author

Thanks for the detailed explanation!

I don't think I considered that the different recursive representations might carry more information.

This looks surprising at first read. But I understand it now. It's an elegant design!

(local: (local: A, other: B, base: C), other: D, base: A) => (local: B, other: D, base: C)
(local: (local: A, other: B, base: C), other: (local: D, other: E, base: A), base: F) => (local: (local: B, other: D, base: C), other: E, base: F)
(local: A, other: B, base: (local: A, other: C, base: D)) => (local: B, other: D, base: C)

Great examples! It took me some time to see how they work.

However, it could also be that trying to instead do the content merge of (local: B, other: D, base: C) would result in a conflict-free result. You would presumably not consider trying that in the struct-based representation. You might very well consider trying all combinations (with some upper bound) when using the list-based representation.

This explains it (esp. "trying all combinations"). All combinations still seem computationally prohibitive. But I guess the commit graph, past conflicts, or just the content of conflict can be used as heuristics to select promising combinations for 3-way merge.

Even if the internal representation is the list-based one, we could still call a 3-way merge tool with just "remove" and two "adds" as usual.

Yeah, I wasn't thinking about using 3-way merge because it's unclear which "base" to use. I was thinking more about some non-traditional UI that presents more than 3 versions together (ex. hg absorb -e). Now I think commit graph heuristics could provide good "base" selection so 3-way merge can just work.

I recently added a new type of conflict markers to Mercurial

Looks useful. We might just steal it.

I've recently finished making it so the "commit index" (like what Git calls the "split commit graph")

Last time (long ago) I checked the git's commit graph, my understanding about its key ideas are 1) introduce integer identities (file offset is a kind of identity) for O(1) lookup and 2) use generation numbers for early exit in linear algorithms. I wasn't so impressed because hg revlog has achieved both already. Are you trying to be compatible with upstream git or building something different? How would you compare it with revlog?

For our commit graph, initially I thought about using generation numbers, or using commit hashes. Later I decided to use unique integer identities, similar to the hg's revision numbers. The main benefit is that commit hashes (hardly compressible) can now be lazy. We were jokingly talking about cloning an internal repo into a floppy disk size-wise. It is also friendly for large set representation (ex. spanset is O(1) space), and for contains check, ex. i in x:y is O(1).

0 replies

martinvonz · 2021-03-06T00:13:52Z

martinvonz
Mar 6, 2021
Maintainer

All combinations still seem computationally prohibitive.

Yeah, even for an octopus merge with 10 parents, it would get pretty expensive. I'll probably first try that idea of pushing the conflicts down to the chunk level as I said. I suspect that will be good enough for almost all practical cases. It feels like all practical cases I can think of will either have non-overlapping conflicts or will result in conflicts regardless of which order the versions are merged.

Looks useful. We might just steal it.

Please do :) I've found it quite useful, but it takes a little while to get used to.

Last time (long ago) I checked the git's commit graph, my understanding about its key ideas are 1) introduce integer identities (file offset is a kind of identity) for O(1) lookup and 2) use generation numbers for early exit in linear algorithms. I wasn't so impressed because hg revlog has achieved both already.

Yes, that's correct. Mercurial already achieves most of it with its revlog (generation numbers only provide a small benefit).

Are you trying to be compatible with upstream git or building something different?

I have two backends (git and my own naive backend, which is mostly to show that it works) and I'm trying to depend as little as possible on the backend, so I've chosen my own format (I don't think it could be compatible with git and still support my other backend). Like git's commit graph, it's purely a cache/index (can be recalculated).

For our commit graph, initially I thought about using generation numbers, or using commit hashes. Later I decided to use unique integer identities, similar to the hg's revision numbers. The main benefit is that commit hashes (hardly compressible) can now be lazy.

I remember that from your presentation at the SCM summit. I really like the "segmented revlog" idea. Unfortunately, I had already implemented my commit graph at that point (but it's taken until now to keep it up to date in memory within transactions). I'll definitely keep the idea in mind. I'd be happy to copy it, but there are so many more important pieces missing in Jujube still.

FYI, I've tried hard to make Jujube safe to use on a distributed file system (and also to make it lock-free for concurrent local processes), so I don't use any mutable files. My impression is that your segmented revlog idea would work well with that constraint, even if you don't do it that way (or maybe you do, I haven't checked).

0 replies

quark-zju · 2021-03-06T02:03:55Z

quark-zju
Mar 6, 2021
Author

FYI, I've tried hard to make Jujube safe to use on a distributed file system (and also to make it lock-free for concurrent local processes), so I don't use any mutable files. My impression is that your segmented revlog idea would work well with that constraint, even if you don't do it that way (or maybe you do, I haven't checked).

Currently it uses indexedlog, which is append-only and requires flock. I personally think the git-style immutable files are a bit too limiting and causes the repack headache. indexedlog is widely used in EdenSCM: not just for the dag, but also mutationstore (replaces obsstore), file and tree storage, and metalog (a minimalist source control that implements atomic commit of multi-file changes, aiming to remove headache of svfs read/write order race conditions). We no longer use repack.

The dag crate is abstracted so the backend can be other things. Currently there are 3 backends: memory, indexedlog, and (WIP) Mononoke's server-side backend. The memory backend is used for virtual graphs like "other people's commit cloud smartlog", or "mutation (obsolesce) graph of selected nodes", to run calculations (ex. "predecessors" becomes "ancestors" in the mutation graph), or render them out (ex. look at other people's commits). In theory it won't be too difficult to add new backends if it's really needed. In practice there are NFS users and it seems to work okay for them, despite the use of flock, append-only, etc. So I haven't been motivated to add other backends.

0 replies

quark-zju · 2021-03-09T19:07:30Z

quark-zju
Mar 9, 2021
Author

Closing this since the current representation can handle stacked conflicts. If there are anything to change, I think it might be worthwhile adding commit (and path) metadata so the commit graph can be used to pick merge bases (ex. for added A prefer A's p1 if A's p1 is in removes), but that's just nice-to-have features.

I personally like the current design as-is, than the chunk-level conflicts. The reason is chunks (and diff algorithm) are ambiguous. In my view, the diff and chunking belong to more of the "merge tool"'s job. With that mindset, the default internal merge algorithm can be opinionated on diffing (therefore chunking), and try to merge non-conflicted regions as much as possible. The storage remains the snapshot form so the opinionated diff algorithm does not enforce its opinion on the storage layer.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Discussion] Stacked Conflicts Representation #5

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 5 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

[Discussion] Stacked Conflicts Representation #5

quark-zju Mar 4, 2021

Replies: 5 comments

martinvonz Mar 5, 2021 Maintainer

quark-zju Mar 5, 2021 Author

martinvonz Mar 6, 2021 Maintainer

quark-zju Mar 6, 2021 Author

quark-zju Mar 9, 2021 Author

quark-zju
Mar 4, 2021

martinvonz
Mar 5, 2021
Maintainer

quark-zju
Mar 5, 2021
Author

martinvonz
Mar 6, 2021
Maintainer

quark-zju
Mar 6, 2021
Author

quark-zju
Mar 9, 2021
Author