-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
diff.Text OOMs with large different inputs #26
Comments
I added some quick debug printing, and it looks like you're just butting heads with the double loop in I think the right solution is to drop the quadratic behavior. This is already mentioned in the godoc:
Russ also hints at that here: golang/go#45200 (comment) I'm not sure if @josharian plans to work on this soon (I imagine not) or how soon Russ's will be ready, since he said "next month". Until either of those, I'd suggest to avoid diffing if the input is large (e.g. beyond 10k lines or 1MiB) or to add a timeout via a context. I'd probably go with the former, because one could imagine a very fast CPU with not a lot of spare memory still OOMing with a few-second timeout. |
Then we need to fix |
Just wondering if anyone is aware of any progress on the issues/topics related to this one? |
Not that I'm aware of. If you're interested in implementing the linear time diff, I'm happy to point you in the right direction. |
Sadly I don't see myself having the time. But will tweet out in case someone else is interested. A fun little project. |
Hi there 👋. FYI, Russ has a new diff implementation in CL 384255. It is an implementation of patience diff. It runs in O(n log n) time, and I think his version might be O(n + m) space. I copied it here, and added a trivial cmd to be able run it on files: https://github.com/thepudds/patience-diff For the example from this issue, it runs in 0.07 sec and uses 38 MB RSS, vs. pkg/diff runs in 19 sec and 18 GB RSS (roughly ~250x slower and ~500x more RAM). patience-diff:
pkg/diff repro:
The patience-diff version is not a minimal diff:
...but at least in theory, it is often friendlier to humans. A better location for this might be rogpeppe/go-internal (which is an accurate name for where this came from, and would eliminate an extra dependency from rogpeppe/go-internal, and the more reasons there are to go to rogpeppe/go-internal then the more testscripts probably spreads, which is a good thing...). I'd be interested in sending a PR there if there is interest. Alternatively, pkg/diff could be updated to use this (perhaps as a parallel package to the meyers package), but I think that might be more work? And of course, people could use the copy at thepudds/patience-diff, but that might mean random gophers having to wonder whether or not they can trust this "thepudds" character 😅. Finally, I should mention that Russ' CL hasn't been merged yet. |
All that said, there is also golang/go#45200 -- presumably that will be a time and space efficient algorithm. If that ends up happening soon, then perhaps not much benefit having this elsewhere. |
I'm mostly AFK for the next week. But at a high level, I'd be absolutely delighted to add a patience diff implementation here, as long as Russ is OK with it. I've been meaning to write one myself, but it never got to the top of the stack. Ideally it would be just like the myers diff one, that is, in its own package, with a similar API, etc. And then update the default to be patience instead of myers.
Those in the know know that they can. :) |
Hi @josharian, I’ll take a closer look at the API that the myers package presents here, and see if I can send a PR. No guarantees though. And agreed it makes sense to confirm with Russ before getting too far. |
The main reason to prefer a copy of Go's internal/diff over pkg/diff is that internal/diff is much more efficient in both time and memory usage. In particular, pkg/diff required quadratic space in memory, which could easily cause "out of memory" errors in Go tests per pkg/diff#26. Beyond making the `cmp` command better able to handle large files, this also moves us back to having zero external dependencies, which is always a nice to have. The long_diff test still appears to work well; the output is changed since the new package produces a shorter, but still entirely correct, diff. It also seems like the new package includes a leading "diff" line to show the two filenames. That seems like a harmless change.
The main reason to prefer a copy of Go's internal/diff over pkg/diff is that internal/diff is much more efficient in both time and memory usage. In particular, pkg/diff required quadratic space in memory, which could easily cause "out of memory" errors in Go tests per pkg/diff#26. Beyond making the `cmp` command better able to handle large files, this also moves us back to having zero external dependencies, which is always a nice to have. The long_diff test still appears to work well; the output is changed since the new package produces a shorter, but still entirely correct, diff. It also seems like the new package includes a leading "diff" line to show the two filenames. That seems like a harmless change.
For those following along, https://pkg.go.dev/github.com/rogpeppe/go-internal/diff@master exists now :) |
To reproduce:
Above,
repro.zip
is a reference to: repro.zipThe text was updated successfully, but these errors were encountered: