Fix the regressions introduced in the fix for #89 #120

ShoshinNikita · 2021-04-07T15:20:30Z

This PR partially reverts commits db1b095 through 0a651d5 and fixes issue #89 with a different approach. The current solution doesn't work well with (*DiffMatchPatch).diffMainRunes method because array indexes in the index string occupy multiple runes.

This fix is based on the approach used before that change (the code in v1.1.0), but elements of the lineArray with indexes from 0xD800 (55296) to 0xDFFF (57343) are skipped, because runes in this range (the surrogate range) are invalid. It requires an additional 32KB of memory ([2048]string{}) but allows us to safely encode line indexes in a string.

It doesn't completely fix #89, but it raises the panic limit to ~1114111 lines (0x10FFFF is the maximum valid Unicode code point). The complete fix will require a lot of changes. At the same time, the current approach has a bug, so I believe it's better to use this fix.

Fixes #115
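
For readers who want to see the padding idea in isolation, here is a minimal sketch. It is not the PR's checkLineArray: the helper name appendLine, the error return, and the unused slot 0 are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

const (
	surrogateMin = 0xD800 // first index that is not a valid rune
	surrogateMax = 0xDFFF // last index that is not a valid rune
)

// appendLine is a hypothetical helper (not the PR's checkLineArray).
// It appends line to lines and returns the rune that encodes its index.
// When the next index would be 0xD800, it first pads the slice with 2048
// empty placeholders so that the index jumps to 0xE000.
func appendLine(lines []string, line string) ([]string, rune, error) {
	if len(lines) == surrogateMin {
		lines = append(lines, make([]string, surrogateMax-surrogateMin+1)...)
	}
	idx := len(lines)
	if idx > utf8.MaxRune {
		return lines, 0, fmt.Errorf("line index %d exceeds the maximum rune", idx)
	}
	return append(lines, line), rune(idx), nil
}

func main() {
	lines := []string{""} // slot 0 is left unused in this sketch
	var last rune
	for i := 0; i < 60000; i++ {
		var err error
		lines, last, err = appendLine(lines, fmt.Sprintf("line %d\n", i))
		if err != nil {
			panic(err)
		}
	}
	// The slice is 2048 entries longer than the number of stored lines,
	// and every returned index is a valid rune.
	fmt.Println(len(lines), utf8.ValidRune(last))
}
```

The placeholders are the 32KB cost mentioned above (2048 string headers of 16 bytes each on a 64-bit platform).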

ShoshinNikita · 2021-04-07T15:22:51Z

There's another possible fix: https://github.com/ShoshinNikita/go-diff/commit/c8591cf97f43b198b269258a0c3f3b1fc07990df. It is also based on the previous approach but uses map[rune]string instead of []string to store lines. It doesn't require the additional 32KB of memory but breaks backward compatibility for the methods DiffLinesToChars, DiffLinesToRunes and DiffCharsToLines.
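
A rough sketch of the map-based idea, for comparison. This is not the code in the linked commit: the lineTable type and its fields are hypothetical, and the sketch does not guard against running past the maximum rune.

```go
package main

import "fmt"

// lineTable is a hypothetical type used only for this illustration.
type lineTable struct {
	byRune map[rune]string // rune -> line, replaces the []string
	byLine map[string]rune // line -> rune, so equal lines share a rune
	next   rune            // next rune to hand out
}

func newLineTable() *lineTable {
	return &lineTable{byRune: map[rune]string{}, byLine: map[string]rune{}, next: 1}
}

// runeFor returns the rune assigned to line, assigning a new one on first use.
// Note: this sketch does not check for overflow past the maximum rune.
func (t *lineTable) runeFor(line string) rune {
	if r, ok := t.byLine[line]; ok {
		return r
	}
	if t.next == 0xD800 {
		t.next = 0xE000 // jump over the surrogate range; nothing is stored for the gap
	}
	r := t.next
	t.next++
	t.byRune[r] = line
	t.byLine[line] = r
	return r
}

func main() {
	t := newLineTable()
	a := t.runeFor("a\n")
	b := t.runeFor("b\n")
	fmt.Println(a, b, t.runeFor("a\n") == a, len(t.byRune))
}
```

Because the gap is skipped in the key space rather than in a slice, no placeholder entries are needed, but callers can no longer index the result as a []string, which is where the compatibility break comes from.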

jlao · 2021-04-25T00:12:41Z

I'm also hitting this regression. Because the lineHash map is no longer shared, the diff assigns different chars/runes to lines that are the same.
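
To illustrate the effect, here is a small sketch. The encode helper is a simplified stand-in written for this example and is not the library's implementation; it only shows what happens when the line-to-index map is or is not shared between the two texts.

```go
package main

import (
	"fmt"
	"strings"
)

// encode is a simplified, hypothetical stand-in for the line-munging step.
// It maps each line of text to an index in *lines, using seen to find
// lines that were already stored.
func encode(text string, lines *[]string, seen map[string]int) []rune {
	var out []rune
	for _, line := range strings.SplitAfter(text, "\n") {
		if line == "" {
			continue // SplitAfter yields a trailing empty string
		}
		idx, ok := seen[line]
		if !ok {
			*lines = append(*lines, line)
			idx = len(*lines) - 1
			seen[line] = idx
		}
		out = append(out, rune(idx))
	}
	return out
}

func main() {
	text1, text2 := "a\nb\n", "b\na\n"

	// Shared map: "a\n" and "b\n" get the same index in both texts.
	lines := []string{""}
	seen := map[string]int{}
	fmt.Println(encode(text1, &lines, seen), encode(text2, &lines, seen))

	// Separate maps: the second text is given fresh indexes, so the
	// diff no longer sees any lines in common.
	lines = []string{""}
	fmt.Println(encode(text1, &lines, map[string]int{}), encode(text2, &lines, map[string]int{}))
}
```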

findleyr · 2021-04-26T20:30:29Z

Hi! We (the gopls team) spoke to @sergi and will help review and test this PR. Specifically, I'll do my best to review the change, and we'll test the fix in gopls. Will probably need a couple days.

sergi · 2021-04-26T20:45:18Z

Thanks for helping @findleyr and team!!

findleyr

Thanks again for this!

This is my first time reading this source, so some of my questions / observations are general in nature. I'll follow up with another pass once you've responded.

findleyr · 2021-05-03T19:38:49Z

diffmatchpatch/diff.go

+// checkLineArray checks the size of the slice and ensures that the index of the next element
+// will be the valid rune.
+func checkLineArray(a *[]string) {
+	// Runes in this range are invalid, utf8.ValidRune() returns false.


This is very subtle. Please explain further in this comment why this is a problem.

ShoshinNikita · May 4, 2021

This comment describes why runes must be valid: #89 (comment)

findleyr · May 4, 2021

Understood. I think it's worth explaining here (you can reference #89).

findleyr · 2021-05-03T19:43:14Z

diffmatchpatch/diff.go

+	return dmp.DiffLinesToRunes(string(text1), string(text2))
+}
+
+// diffLinesToRunesMunge splits a text into an array of strings, and reduces the texts to a []rune where each Unicode character represents one line.


The words 'array', 'list', and 'slice' are used interchangeably throughout, but array is really incorrect in this context.

findleyr · 2021-05-03T19:50:28Z

diffmatchpatch/diff.go

+
+// checkLineArray checks the size of the slice and ensures that the index of the next element
+// will be the valid rune.
+func checkLineArray(a *[]string) {


Why not have functions

    // lineRune returns a valid rune value for the line at the given index
    func lineRune(idx int) rune

    // runeLine looks up the given rune hash in the lines slice.
    func runeLine(idxRune rune, lines []string) string

Thereby avoiding the padding here. It could also be used to avoid the prepended "".

ShoshinNikita · May 4, 2021

I think it can break user code, because DiffLinesToRunes returns the lineArray. If a user has something like DiffCharsToLines, it can panic with an index out of range.

findleyr · May 4, 2021

I meant that DiffLinesToRunes could use lineRune to convert to rune values, and DiffCharsToLines can use runeLine to convert to line indexes. These functions would be inverses of each other.
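
One possible shape for such a pair, as a sketch only. The signatures follow the suggestion above; the bodies are an assumption about how the surrogate range could be skipped arithmetically, and neither function checks the upper bound.

```go
package main

import "fmt"

const (
	surrogateMin  = 0xD800                          // first invalid rune value
	surrogateSize = 0xDFFF - surrogateMin + 1       // 2048 invalid values
	firstAfterGap = surrogateMin + surrogateSize    // 0xE000
)

// lineRune returns a valid rune value for the line at the given index.
// Indexes at or above 0xD800 are shifted up by 2048 to skip the surrogates.
// This sketch does not check that the result stays below the maximum rune.
func lineRune(idx int) rune {
	if idx >= surrogateMin {
		idx += surrogateSize
	}
	return rune(idx)
}

// runeLine looks up the given rune in the lines slice by undoing the
// shift applied in lineRune.
func runeLine(idxRune rune, lines []string) string {
	idx := int(idxRune)
	if idx >= firstAfterGap {
		idx -= surrogateSize
	}
	return lines[idx]
}

func main() {
	lines := make([]string, 60000)
	lines[0xD800] = "first line past the gap\n"
	r := lineRune(0xD800)
	fmt.Printf("%#x %q\n", r, runeLine(r, lines))
}
```

With this mapping the lines slice stays dense, so no padding entries are required.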

findleyr · 2021-05-03T19:53:46Z

diffmatchpatch/diff.go

-}
-
-// diffLinesToStringsMunge splits a text into an array of strings, and reduces the texts to a []string.
-func (dmp *DiffMatchPatch) diffLinesToStringsMunge(text string, lineArray *[]string) []uint32 {


Please reorganize to minimize the diff with this refactored method.

findleyr · 2021-05-03T19:55:35Z

diffmatchpatch/diff.go

+}
+
+func (dmp *DiffMatchPatch) diffLinesToRunes(text1, text2 []rune) ([]rune, []rune, []string) {
+	return dmp.DiffLinesToRunes(string(text1), string(text2))


I'm a bit concerned about all this unnecessary allocation converting to string, when it really shouldn't be necessary.

Did you compare benchmark performance?

findleyr · 2021-05-03T20:04:52Z

diffmatchpatch/diff.go

+
+// diffLinesToRunesMunge splits a text into an array of strings, and reduces the texts to a []rune where each Unicode character represents one line.
+// We use strings instead of []runes as input mainly because you can't use []rune as a map key.
+func (dmp *DiffMatchPatch) diffLinesToRunesMunge(text string, lineArray *[]string, lineHash map[string]int) []rune {


I'm staring at this, and I don't see how the previous algorithm worked without passing in lineHash. Is that the root cause of the bug?

Can you explain somewhere the exact nature of the bug in the fix for #89?

ShoshinNikita · May 4, 2021

No, the root cause of the bug is described in the PR description:

"The current solution doesn't work well with (*DiffMatchPatch).diffMainRunes method because array indexes in the index string occupy multiple runes."

findleyr · May 4, 2021

...so array indexes could be corrupted/split in the diff output, right?

It's worth explaining in a comment.
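
A small sketch of why multi-rune indexes are fragile. It assumes, purely for illustration, that each line index is written out as decimal digits; the exact encoding used by the earlier fix is not reproduced here. A rune-level comparison then finds common digits between two different indexes.

```go
package main

import "fmt"

// commonPrefixLen returns how many leading runes a and b share. A rune-level
// diff treats such a shared run as "equal" text.
func commonPrefixLen(a, b []rune) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

func main() {
	// Line 12 vs line 13, each index spelled out as digits (assumed encoding).
	// The shared '1' can be reported as equal, leaving "2" and "3" as the
	// changed parts, which are fragments of indexes rather than indexes.
	fmt.Println(commonPrefixLen([]rune("12"), []rune("13")))

	// The same two indexes encoded as one rune each share nothing, so an
	// index can never be split.
	fmt.Println(commonPrefixLen([]rune{12}, []rune{13}))
}
```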

findleyr · 2021-05-03T20:06:57Z

diffmatchpatch/diff.go

@@ -392,28 +390,88 @@ func (dmp *DiffMatchPatch) diffBisectSplit(runes1, runes2 []rune, x, y int,
 // DiffLinesToChars splits two texts into a list of strings, and educes the texts to a string of hashes where each Unicode character represents one line.


s/educes/reduces

findleyr · 2021-05-03T20:07:31Z

diffmatchpatch/diff.go

@@ -392,28 +390,88 @@ func (dmp *DiffMatchPatch) diffBisectSplit(runes1, runes2 []rune, x, y int,
 // DiffLinesToChars splits two texts into a list of strings, and educes the texts to a string of hashes where each Unicode character represents one line.
 // It's slightly faster to call DiffLinesToRunes first, followed by DiffMainRunes.


Is this comment still accurate?

findleyr · 2021-05-03T20:12:00Z

diffmatchpatch/diff.go

+			lineEnd = len(text) - 1
+		}
+
+		line := text[lineStart : lineEnd+1]


Yeah, I'd be interested to know how performance would be affected by using an actual hash to look up the rune value in the index, rather than a string. Then we could convert to []byte at API boundaries and avoid allocating again.

But that doesn't need to be done for this PR.
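
A sketch of what a hash-keyed lookup over []byte lines might look like. Everything here is hypothetical (the lineIndex type, the choice of FNV-1a from the standard library, the bucket layout), and no performance claim is made; it only shows that lines can be identified without converting each one to a string.

```go
package main

import (
	"bytes"
	"fmt"
	"hash"
	"hash/fnv"
)

// lineIndex is a hypothetical structure used only for this illustration.
type lineIndex struct {
	hasher  hash.Hash64
	buckets map[uint64][]int // hash -> indexes into lines with that hash
	lines   [][]byte         // stored lines; these alias the input buffer
}

func newLineIndex() *lineIndex {
	return &lineIndex{hasher: fnv.New64a(), buckets: map[uint64][]int{}}
}

// indexOf returns the index of line, storing it if it has not been seen.
// Lines with the same hash are compared byte for byte, so a hash collision
// cannot merge two different lines.
func (x *lineIndex) indexOf(line []byte) int {
	x.hasher.Reset()
	x.hasher.Write(line) // Write on a hash.Hash never returns an error
	h := x.hasher.Sum64()
	for _, i := range x.buckets[h] {
		if bytes.Equal(x.lines[i], line) {
			return i
		}
	}
	x.lines = append(x.lines, line)
	i := len(x.lines) - 1
	x.buckets[h] = append(x.buckets[h], i)
	return i
}

func main() {
	text := []byte("a\nb\na\n")
	x := newLineIndex()
	for _, line := range bytes.SplitAfter(text, []byte("\n")) {
		if len(line) == 0 {
			continue // SplitAfter yields a trailing empty slice
		}
		fmt.Println(x.indexOf(line))
	}
}
```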

findleyr · 2021-05-03T20:16:01Z

@ShoshinNikita

"The complete fix will require a lot of changes."

By this you mean modifying the diff algorithm to operate on []int, rather than string, right?

ShoshinNikita · 2021-05-04T13:49:50Z

@findleyr

I am not very familiar with the codebase. I was only able to determine the cause of the regressions and fix #89 with another approach based on the code in v1.1.0 (the second commit in this PR partially reverts the previous fix). So, the real diff is v1.1.0...d20955a, lines 431-465 (or just the last commit).

I think using []int instead of string can fix the issue completely. But it will break the backward compatibility. However, as I said before, I am not familiar with the code. So, I could be wrong.

iambus · 2021-12-13T09:45:33Z

I would suggest you guys look at my PR #128. It fixes a serious panic issue. It may also fix the issue you were discussing (I'm not sure).

kdarkhan · 2023-08-03T17:25:20Z

This PR can probably be closed now, since my PR #136, which fixes the same thing, was merged yesterday.

ShoshinNikita added 3 commits April 7, 2021 15:26

add test case for the issue #115

e8aac4f

partially revert commits db1b095-0a651d

ebf6a5c

extend lineArray if needed to ensure that line indexes are valid runes

d20955a

findleyr reviewed May 3, 2021

stamblerre mentioned this pull request Jun 24, 2021

x/tools/gopls: address formatting issues with [email protected] golang/go#45732

Closed

kdarkhan mentioned this pull request Feb 20, 2023

Fix line diff by using rune index without a separator #136

Merged

ffluk3 mentioned this pull request May 15, 2023

Extinctions scan fails for large file diffs launchdarkly/ld-find-code-refs#351

Closed

AriehSchneier mentioned this pull request Jul 14, 2023

[BUG] v1.2.0 seems to produce incorrect diff #123

Open

ffluk3 mentioned this pull request Aug 24, 2023

fix: update go-diff package launchdarkly/ld-find-code-refs#386

Closed

ShoshinNikita closed this Nov 15, 2023
