Reimagined: Correctly handle lengths of serialized strings #43
Conversation
e.g. if you have CSS in an option containing `\";`, that can confuse the current way we parse things and lead to incorrect replacements.
To help catch various edge cases
Use the length of the serialized content to determine the scope of the search-replace action. This avoids problems when the serialized content itself contains the closing marker (`\";`), e.g. in CSS content. This is currently much slower and needs to be optimized.
…f identifying token details, as well as an unescaper for a more accurate calculation of byte size, as well as handling edge cases
…sts having the wrong number of bytes for serialized data
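A minimal sketch of the length-based idea described in the commits above (all names here are hypothetical, not the PR's actual code): instead of scanning for the first `";`, trust the byte length that the serialized token declares, then verify the closing marker is where it should be. The sketch ignores SQL-dump escaping, which the PR handles separately.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// tokenRe matches the start of a serialized PHP string: s:<byte length>:"
var tokenRe = regexp.MustCompile(`s:(\d+):"`)

// extract returns the content of the first serialized string in line,
// trusting the declared byte length instead of scanning for `";`.
func extract(line string) (string, bool) {
	loc := tokenRe.FindStringSubmatchIndex(line)
	if loc == nil {
		return "", false
	}
	n, err := strconv.Atoi(line[loc[2]:loc[3]])
	if err != nil {
		return "", false
	}
	start, end := loc[1], loc[1]+n
	// The two bytes after the content must be the real closing marker.
	if end+2 > len(line) || line[end:end+2] != `";` {
		return "", false
	}
	return line[start:end], true
}

func main() {
	// The content itself contains `";`, which would fool a naive scan
	// for the closing marker; counting 10 bytes is not fooled.
	content, ok := extract(`s:10:"ab";cdefgh";`)
	fmt.Println(ok, content) // true ab";cdefgh
}
```

A naive search for `";` in the example above would stop right after `ab` and truncate the value.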
search-replace.go
Outdated
//TODO: We should first check if we found the string when inside a quote or not.
// but currently skipping that scenario because it seems unlikely to find it outside.
Rather hard to do, but it's possible that the regex matches inside, say, a JSON string with some sort of multi-layer escaping. Someone might also accidentally use the same combination of characters, i.e. `s:123:"`, maybe because they're writing a programming tutorial. We'd accidentally match these, so we should match as little as possible, because it affects the parser's stability.
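One way to keep such accidental matches from destabilizing the parser (a hedged sketch, not the PR's code) is to validate that the declared length lands exactly on a closing `";` before trusting the token, and otherwise leave the line alone:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var tokenRe = regexp.MustCompile(`s:(\d+):"`)

// looksLikeToken reports whether the first match in line is a well-formed
// serialized string, i.e. the declared length lands exactly on a closing
// `";`. Prose that merely quotes `s:123:"` fails this check.
func looksLikeToken(line string) bool {
	loc := tokenRe.FindStringSubmatchIndex(line)
	if loc == nil {
		return false
	}
	n, err := strconv.Atoi(line[loc[2]:loc[3]])
	if err != nil {
		return false
	}
	end := loc[1] + n
	return end+2 <= len(line) && line[end:end+2] == `";`
}

func main() {
	fmt.Println(looksLikeToken(`s:3:"foo";`))                        // true
	fmt.Println(looksLikeToken(`tutorial: write s:123:" then data`)) // false
}
```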
Should we just remove the TODO and leave the comment as a note about the edge case?
Sure, that makes sense.
This is awesome!
Just wondering if it would be worth adding some test cases from: https://github.com/php/php-src/tree/master/ext/standard/tests/serialize
It has a pretty good set of edge cases.
Nice work! I was thinking about this problem recently and found these two packages for handling PHP serialized strings in Go. I asked in an issue on the 2nd repo what the difference was, and the dev gave a nice answer. I wonder if this (finding the serialized strings, unserialize, modify, reserialize) would be a more robust way of handling this? (I brought it up in #42 but perhaps it wasn't noticed/relevant.)
Also, do you think it would be worth adding some fuzz tests?
Yeah, I definitely need more tests. Thanks for those suggestions, @luismulinari!
@nickchomey I did notice your post about these libraries, and yeah, I agree I should add some comments about it :) . The reason I didn't use these libraries comes down to simplicity and speed. Neither of these two libraries is designed to parse anything out of a sqldump, so we'd first need to parse the sqldump before a serialization library can do its work. Which means either we write an SQL parser, or use something existing, like
And so we're hitting a complexity point here: if we use a pre-existing SQL parser, we're at the whims of that parser library. Certain lines in the sqldump might not work; maybe it doesn't support certain MySQL syntax, maybe we hit some nasty edge case in the dump, and we'd need to check mydumper compatibility as well. Admittedly, I only ran thought experiments and didn't test-drive it for real, and I think it would be slower than this PR. That's how I decided to just write my own parser. The core code took at most a day.
Fair enough! Hopefully this can get merged soon enough!
@luismulinari There are quite a lot of test cases in there, and much of it is unrelated to string serialization. Anything specific you wish to point out?
Probably not worth it: it's rather difficult to fuzz because of the escaping, so the fuzzer would need escaping logic itself 😅, and the function pretty obviously short-circuits to just returning the line if it fails to parse it. Not enough branching for it, I suppose. At least that's my take on it right now.
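For reference, the round-trip property such a harness would check can be sketched as below; this is a deterministic seed-corpus version with hypothetical helper names, and it deliberately omits the SQL-dump escaping layer, which is the hard part mentioned above. A real `go test -fuzz` harness would wrap the same property in an `f.Fuzz` callback.

```go
package main

import "fmt"

// serialize builds a PHP-style serialized string token from raw content;
// len() counts bytes, matching PHP serialize() semantics.
func serialize(content string) string {
	return fmt.Sprintf(`s:%d:"%s";`, len(content), content)
}

// parse recovers the content via the declared length. There is no
// SQL-dump unescaping here; a real harness would need that layer too.
func parse(line string) (string, bool) {
	var n int
	if _, err := fmt.Sscanf(line, `s:%d:`, &n); err != nil {
		return "", false
	}
	prefix := fmt.Sprintf(`s:%d:"`, n)
	start, end := len(prefix), len(prefix)+n
	if end+2 > len(line) || line[end:end+2] != `";` {
		return "", false
	}
	return line[start:end], true
}

func main() {
	// Seed-corpus-style inputs, including the troublesome `";` sequence
	// and a serialized token nested inside the content itself.
	for _, c := range []string{``, `foo`, `";`, `a{x:\";}`, `s:3:"foo";`} {
		got, ok := parse(serialize(c))
		fmt.Println(ok && got == c)
	}
}
```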
…erial replacements
…acing off-by-one errors by improving the data structure
Benchmarks in case they are a concern: 20x slow-down? Real world (2.8 GB DB dump): not too bad, I suppose...
Shared some minor notes; otherwise this looks good.
I did some separate performance testing as well (different file sizes, including some that are 20GB+ and have lots of serialized data). In most cases this update is slower, but not enough to be a concern. In one case (lots of serialized content) it was actually faster :) But I think results will vary depending on size, lines, complexity, etc.
…hat we wanted on purpose
… conditions to prevent confusion in case of error
Hello, this PR is an alternative to PR #39 and should fix #33.
Regex isn't great for this purpose, as it has a tendency to over-match. Hence, this PR offers an alternative solution.
This PR uses a mini-tokenizer/parser to detect instances of PHP strings, does the search-replace within the constraints of the token, then goes back up the stack to recalculate the new byte size.
Special care has been taken so that we don't match MySQL's sqldump escaping.
It should technically work with mydumper as well: upon inspecting a mydumper dump, the escaping is very much the same.
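The core steps described above (find the token, replace within its bounds, recompute the declared byte length) might look roughly like this. All names are illustrative, not the PR's actual code, and the sketch ignores sqldump escaping, which the PR handles separately.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// serializedStr matches the start of a serialized PHP string: s:<len>:"
var serializedStr = regexp.MustCompile(`s:(\d+):"`)

// replaceInToken replaces from->to inside the first serialized PHP string
// in line and fixes up the declared byte length. It returns the line
// unchanged whenever the token cannot be parsed safely.
func replaceInToken(line, from, to string) string {
	loc := serializedStr.FindStringSubmatchIndex(line)
	if loc == nil {
		return line
	}
	n, err := strconv.Atoi(line[loc[2]:loc[3]])
	if err != nil {
		return line
	}
	start, end := loc[1], loc[1]+n
	if end+2 > len(line) || line[end:end+2] != `";` {
		return line
	}
	replaced := strings.ReplaceAll(line[start:end], from, to)
	// len() counts bytes, matching PHP serialize() length semantics.
	return line[:loc[0]] + fmt.Sprintf(`s:%d:"%s";`, len(replaced), replaced) + line[end+2:]
}

func main() {
	fmt.Println(replaceInToken(`s:18:"http://example.org";`, "http://", "https://"))
	// s:19:"https://example.org";
}
```

The byte length grows from 18 to 19 because `https://` is one byte longer than `http://`; getting this recalculation wrong is exactly what breaks PHP unserialization downstream.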
There were a few errors with the tests in PR #39 which I've fixed. I've also added new tests to handle some edge cases with excessive escaping, as well as the very plausible scenario of someone using the string combination `\";`, e.g. in WP posts. If approved, I'll re-point the branch to trunk and merge it from there.