Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updated copy-tracing proposal to handle conflicts #4988

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

martinvonz
Copy link
Member

@martinvonz martinvonz commented Nov 27, 2024

Checklist

If applicable:

  • I have updated CHANGELOG.md
  • I have updated the documentation (README.md, docs/, demos/)
  • I have updated the config schema (cli/src/config-schema.json)
  • I have added tests to cover my changes

@martinvonz martinvonz force-pushed the push-msqlmmqooqmp branch 2 times, most recently from e178ecb to e1d524b Compare December 10, 2024 07:06
@martinvonz martinvonz force-pushed the push-msqlmmqooqmp branch 2 times, most recently from bebb15f to 3ef54f1 Compare January 13, 2025 16:04
@martinvonz martinvonz marked this pull request as ready for review January 13, 2025 16:04
@martinvonz
Copy link
Member Author

I'm getting tired of working on this doc, but I think it's good enough that it's reasonable clear how it will work. So please take a look if you're interested. If you spot problems, it's good to hear about them before I spend time polishing it.

thanks to the same-change rule, we won't consider it a conflict, so maybe it
actually works well in practice thanks.

For example:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These examples appear incomplete - what is their relation, do they even have one?

Copy link
Contributor

@ilyagr ilyagr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've read about half of the doc so far. It is very interesting; I think it's assembling into a sensible theory in my head (though my understanding is still slowly evolving).

There are a few things that I think could be clearer; I think I did manage to answer most of my own questions though. I still point them out; feel free to tell me if you'd like me to work on clarifying them.It might be a good exercise. I might try to get to the end of the doc first, though.

had names `bar` and `foo`.

To support merging two files into one, the list of past names is actually a DAG.
Merging can happen in a merge commit when two sides copy/rename different
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presentation: the simplest example of this in my mind is the backout of a commit that copied a file, we could mention that as well.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's simpler to produce a merged in the copy graph that way I think it's less intuitive, so I prefer to leave it as is.


```
$ jj log
D delete baz
Copy link
Contributor

@ilyagr ilyagr Jan 22, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: This was a draft comment that got into the final review. I haven't answered this for myself yet, but I probably can.


What is the merge before editing? Conflict between bar and baz? Or both?

What if copies?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should ideally make it a conflict and let the user decide what to do. This would presumably mean leaving the three trees in the commit even though they could be resolved if copy info is not considered.

The `2:bar->1:foo` syntax above means that copy ID 2 has file `bar`, which was
copied from copy ID `1`, where it was called `foo`.

Diffing from `L` to `M` finds 2,3,4,5 as changed copy IDs. By walking their
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this meant to include 1?

Also, it isn't clear to me how it is determined which copy IDs are "changed". Are you diffing the pairs (file_contents, copy_id) for each path? I don't immediately see whether this makes sense, but probably.


Update: I think the picture I explain two comments down might explain this.

copied from copy ID `1`, where it was called `foo`.

Diffing from `L` to `M` finds 2,3,4,5 as changed copy IDs. By walking their
graphs, we find that 1,2, and 5 are related, while 3 and 4 are not.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presentation: it might be nice to explain what happens in simpler cases: e.g. the diff between M and K, as well as L and K. I think the latter example might be helpful for illustrating my question above of how "changed copy ids" are determined.

Diffing from `L` to `M` finds 2,3,4,5 as changed copy IDs. By walking their
graphs, we find that 1,2, and 5 are related, while 3 and 4 are not.

Because the `bar` and `baz` files have unrelated copy graphs, we break up their
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add: "in the commit M" I suppose it's true in "L" as well, but it's important which ones you are looking at.

More vaguely, I feel like this and the next paragraph need a clearer explanation if possible. One way to think of it is to draw the copy ID DAGs and point out which commits each node is in. The, we declare a node an add/delete if it's not connected to any node from the other commit. For each pair of (node from L) and (node from M) that are connected, we create a diff. Then explain how the diff is created by following the unique shortest path betweem the nodes.

I can try to write it, if it helps.

Copy link
Contributor

@ilyagr ilyagr Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

graph LR

    subgraph L["Commit L"]
        2["2:bar"]
        3["3:baz"]
    subgraph K["Commit K"]
        1["1:foo"]
    end
    end
    
    subgraph M["Commit M"]
        4["4:bar"]
        5["5:baz"]
    end
    
    2 --> 1
    5 --> 1
Loading

As you say in the text, here 4:bar is new in M, 3:baz is deleted in L, and we have diffs between (5:baz from M and 1:foo from L), as well as (5:baz from M and 2:bar from L).

A good way to think of it is what I said above: for each pair of (node from L) and (node from M) that are connected via the DAG (or represent the same name, i.e. the same node in the DAG, which is a special case), we create a diff.

Copy link
Contributor

@ilyagr ilyagr Jan 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type of diff (whether it's a rename, merge, or a split) seem to depend on which way the arrows in the copy ID DAG point, I'm guessing.

hese might get pretty complicated to describe, e.g. if you get something like the following path between A (a path in the commit that's the source of the diff) and D (a path in the target of the diff) in the DAG. I'm not sure what to do in these cases1.

A <---- B ----> C <----- D

Another weird case: what if we care about the diff between A and D in

        B ----> C <---- D
A <---- B               D
        B <---- E ----> D

I'm guessing (but not sure, this is very much worth double-checking) that these are possible graphs to get; they are certainly a DAG.

Here, B and D are single nodes, each with one incoming and one outgoing node. This is still a DAG.


Here's the second graph, redrawn in mermaid:

graph LR
A["A, diff source"]
D["D, diff target"]
B --> A
B --> C
D --> C
E --> B
E --> D
Loading

Footnotes

  1. there's probably some solution, at worst we could not cover these cases and consider A and D to be unrelated.


The `2:bar->1:foo` syntax above means that copy ID 2 has file `bar`, which was
copied from copy ID `1`, where it was called `foo`.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Presentation: Perhaps I'd add something like:

Let's consider the diff from L to M. This is the same as the diff of the commit M2 we'd get by running jj new L; jj bookmark create M2; jj restore --from M --to M2 (which would result in the commit M2 having the same tree as M).


Commit M:
name: bar, id: M, copy_id: 4:bar
name: baz, id: M, copy_id: 5:baz->1:foo
Copy link
Contributor

@ilyagr ilyagr Jan 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These arrows could be considered "backwards" by analogy to commits. baz has a pointer to foo, but this represents foo becoming baz. In the descriptions in the beginning of the section, the arrows go the other way.

My guess is that it's best to point the arrows the other way from the beginning (so, this arrow should be foo->baz, and the text should say that baz has a pointer to foo), but it's of course arguable.

I'm writing this because when I glanced at the mermaid diagram again, it took me a while to realize that I was thinking of arrows meaning the wrong thing.

Copy link
Contributor

@PhilipMetzger PhilipMetzger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just have some minor thougths and some typo fixes, I also don't consider supporting Git and its backend worth it.

Comment on lines +567 to +602
never find any copies. Do we need an indexing pass to detect all renames in a
repo when running `jj git init`? That can be very expensive for large repos.
For reference, `git log --summary --find-copies-harder` takes about 165 seconds
in the git.git repo on my computer.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could have more examples, like Nixpkgs, mozilla-central and chromium and the results from the implementation you talked about in the Discord.

Comment on lines +556 to +588
### Annotate

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incomplete

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

Comment on lines +561 to +595
Do we ever want to record renames in the Git backend? If we do, we would
presumably store it outside the Git object, similar to how we store the change
id for commits.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some thoughts, it'd be nice to support precise renames and conflicts in a best effort basis. But since the bar to entry here already is quite steep, which you point out nicely in the next paragraph, I'd rather go for a set of special commands to deal with it in the Git backend, when we communicate with a native backend.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we're going to index copy information for Git repos anyway, then it doesn't seem like much extra work to make it possible for a user to modify that state themselves.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we're going to index copy information for Git repos anyway, then it doesn't seem like much extra work to make it possible for a user to modify that state themselves.

Yes, that's a good point. I think I previously over-indexed on the best effort use-case.

@martinvonz martinvonz force-pushed the push-msqlmmqooqmp branch 2 times, most recently from 194943d to 864cf6e Compare February 21, 2025 22:13
This version describes a design that I think will work relatively well
for supporting copy tracing even when there are conflicts.

There are still some open questions, in particular around how to
populate the data in Git repos, and how to efficiently query the data
in very large repos like the one at Google. I think it's still clear
enough that we can start working on the implementation.
Copy link
Member Author

@martinvonz martinvonz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've addressed many of the comments. There are still many left to address.

FYI, @steadmon will be working on implementing the feature.


```
$ jj log
D delete baz
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we should ideally make it a conflict and let the user decide what to do. This would presumably mean leaving the three trees in the commit even though they could be resolved if copy info is not considered.

had names `bar` and `foo`.

To support merging two files into one, the list of past names is actually a DAG.
Merging can happen in a merge commit when two sides copy/rename different
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it's simpler to produce a merged in the copy graph that way I think it's less intuitive, so I prefer to leave it as is.

Comment on lines +556 to +588
### Annotate

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done :)

Comment on lines +561 to +595
Do we ever want to record renames in the Git backend? If we do, we would
presumably store it outside the Git object, similar to how we store the change
id for commits.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we're going to index copy information for Git repos anyway, then it doesn't seem like much extra work to make it possible for a user to modify that state themselves.

The `2:bar->1:foo` syntax above means that copy ID 2 has file `bar`, which was
copied from copy ID `1`, where it was called `foo`.

Diffing from `L` to `M` finds 2,3,4,5 as changed copy IDs. By walking their
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I got tripped up over something silly here. You say "Diffing from L to M", which makes me think L is the source, and M is the destination. However, later you say "foo does not exist on the source side", which implies the opposite.

martinvonz added a commit that referenced this pull request Mar 6, 2025
`jj resolve` will currently try to resolve the executable bit and will
clear it if it can't be resolved. I don't think the executable bit
should be affected by the external merge tool. We could preset the
flags on the files we pass to the merge tool and then expect the merge
tool to update the flags, but we don't do that. I'm not sure that's a
good idea (few merge tools probably expect that), but since we don't
do it, I think it's better to leave the executable bits
unchanged. This patch makes it so by calling
`Merge::with_new_file_ids()` on the original conflict whether or not
the content-level conflict was resolved.

I noticed this was while working on the copy-tracking proposal in PR
#4988, which involves adding copy information in the `TreeValue`. If
we do implement that, then we should preserve copy information here
too (in addition to the executable flag). That will automatically work
after this patch.
martinvonz added a commit that referenced this pull request Mar 8, 2025
`jj resolve` will currently try to resolve the executable bit and will
clear it if it can't be resolved. I don't think the executable bit
should be affected by the external merge tool. We could preset the
flags on the files we pass to the merge tool and then expect the merge
tool to update the flags, but we don't do that. I'm not sure that's a
good idea (few merge tools probably expect that), but since we don't
do it, I think it's better to leave the executable bits
unchanged. This patch makes it so by calling
`Merge::with_new_file_ids()` on the original conflict whether or not
the content-level conflict was resolved.

I noticed this was while working on the copy-tracking proposal in PR
#4988, which involves adding copy information in the `TreeValue`. If
we do implement that, then we should preserve copy information here
too (in addition to the executable flag). That will automatically work
after this patch.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants