Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Munged text leads to incorrect diffs #140

Open
schroederc opened this issue Mar 29, 2023 · 4 comments
Open

Munged text leads to incorrect diffs #140

schroederc opened this issue Mar 29, 2023 · 4 comments

Comments

@schroederc
Copy link

db1b095 introduces a bug in its change from diffLinesToRunesMunge to diffLinesToStringsMunge. Since each line is represented by 1 or more ascii characters, it's possible for the diffing algorithm to split hashed lines incorrectly whereas before the rune indexed lines were indivisible.

For instance, DiffLinesToChars could return hashed strings such as:

"1,2,3,4,5,6,7,8,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,35,36,37,38,39,40,41"
"42,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,43,36,37,38,39,44,45,46,47"

DiffMain may then split the leading 42 such as:

[{Delete 1,2,3,} {Equal 4} {Delete ,5,6,7,8} {Insert 2} {Equal ,9,10,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,9,27,28,9,29,30,31,32,33,9,34,} {Insert 4} {Equal 3} {Delete 5} {Equal ,36,37,38,39,4} {Delete 0} {Insert 4} {Equal ,4} {Delete 1} {Insert 5,46,47}] 

And the resulting diff after hydration is completely wrong.

This affects users of the DiffLinesTo* APIs as well as any user that passes true for checklines in DiffMain or DiffMainRunes.

@schroederc
Copy link
Author

Here's a rough test case:

func TestLineDiff(t *testing.T) {
	t.Run("Chars", func(t *testing.T) {
		before := `1
2
3
4
5
6
7
8
9
`
		after := `10
`

		dmp := diffmatchpatch.New()
		txt1, txt2, lines := dmp.DiffLinesToChars(string(before), string(after))
		diff := dmp.DiffMain(txt1, txt2, false)
		diff = dmp.DiffCharsToLines(diff, lines)

		var foundBefore, foundAfter string
		for _, d := range diff {
			switch d.Type {
			case eq:
				foundBefore += d.Text
				foundAfter += d.Text
			case del:
				foundBefore += d.Text
			case ins:
				foundAfter += d.Text
			}
		}

		if foundBefore != before {
			t.Errorf("Expected before %q; found %q", before, foundBefore)
		}
		if foundAfter != after {
			t.Errorf("Expected after %q; found %q", after, foundAfter)
		}
	})

	t.Run("Runes", func(t *testing.T) {
		before := `1
2
3
4
5
6
7
8
9
`
		after := `10
`

		dmp := diffmatchpatch.New()
		txt1, txt2, lines := dmp.DiffLinesToRunes(string(before), string(after))
		diff := dmp.DiffMainRunes(txt1, txt2, false)
		diff = dmp.DiffCharsToLines(diff, lines)

		var foundBefore, foundAfter string
		for _, d := range diff {
			switch d.Type {
			case eq:
				foundBefore += d.Text
				foundAfter += d.Text
			case del:
				foundBefore += d.Text
			case ins:
				foundAfter += d.Text
			}
		}

		if foundBefore != before {
			t.Errorf("Expected before %q; found %q", before, foundBefore)
		}
		if foundAfter != after {
			t.Errorf("Expected after %q; found %q", after, foundAfter)
		}
	})
}

schroederc added a commit to schroederc/go-diff that referenced this issue Mar 31, 2023
Revert implementation of diffLines to use runes to fix sergi#140.  In order
to not regress sergi#89, skip invalid utf8 runes when munging lines.
schroederc added a commit to schroederc/go-diff that referenced this issue Mar 31, 2023
Revert implementation of diffLines to use runes to fix sergi#140.  In order
to not regress sergi#89, skip invalid utf8 runes when munging lines.
schroederc added a commit to schroederc/go-diff that referenced this issue Mar 31, 2023
Revert implementation of diffLines to use runes to fix sergi#140.  In order
to not regress sergi#89, skip invalid utf8 runes when munging lines.
@schroederc
Copy link
Author

I sent out #141 to revert the implementation to the limited, but not incorrect, rune approach. Some more fundamental changes would probably be preferable if someone has more time to work on it.

@ernestrc
Copy link

Ran into this as well. I don't need to diff large chunks (context: #89 (comment)) so I downgraded to v1.1.0, which seems to be, still to this day, widely used (i.e. go-git).

@fcharlie
Copy link

fcharlie commented Jan 4, 2024

Is there any progress on this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants