
Conversation

@abdullah-kasim (Contributor) commented Oct 7, 2024

Hello, this PR is an alternative to PR #39 and should fix #33.

Regex isn't too great for this purpose, as it has a tendency to over-match. Hence, this PR offers an alternative solution.

This PR uses a mini-tokenizer/parser to detect instances of PHP serialized strings, performs the search-replace within the bounds of each token, and then goes back up the stack to recalculate the new byte size.

Special care has been taken so that MySQL's sqldump escaping is not counted toward the byte size.

It should technically work with mydumper as well: upon inspecting a mydumper dump, the escaping is very much the same.
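
To illustrate the core idea, here is a minimal sketch, not the PR's actual code; `replaceInToken` is a hypothetical helper. Once the tokenizer has isolated one `s:N:"...";` token, the replacement happens inside the token body only, and the length prefix is recomputed from the unescaped byte count:

```go
package main

import (
	"fmt"
	"strings"
)

// replaceInToken rewrites one PHP serialized string token: it performs the
// search-replace on the (already unescaped) string body, then recomputes the
// byte-length prefix. PHP's s:N:"..."; counts raw bytes, so the count must
// ignore any sqldump escaping.
func replaceInToken(body, from, to string) string {
	body = strings.ReplaceAll(body, from, to)
	return fmt.Sprintf(`s:%d:"%s";`, len(body), body)
}

func main() {
	// s:19:"http://domain-a.com"; stays 19 bytes here, but the prefix
	// would be recounted if the replacement changed the length.
	fmt.Println(replaceInToken("http://domain-a.com", "domain-a.com", "domain-b.com"))
}
```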

There were a few errors with the tests in PR #39, which I've fixed. I've also added new tests for some edge cases with excessive escaping, as well as for the very plausible scenario of someone using the string combination `\";`, e.g. in WP posts (see the sketch below).
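
As an illustration of that `\";` edge case (example data, not from the PR's test suite): a dump line whose serialized value contains an escaped `\";` fools a scanner that stops at the first `";` it sees.

```go
package main

import (
	"fmt"
	"strings"
)

func main() {
	// A serialized CSS option as it appears in a sqldump; the value
	// itself contains the escaped sequence \"; .
	line := `s:19:"a { content: \"x\"; }";`

	// A naive scanner that cuts at the first "; ends inside the CSS.
	end := strings.Index(line, `";`) + 2
	fmt.Println(line[:end]) // s:19:"a { content: \"x\";  (truncated)
}
```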

If approved, I'll re-point the branch to trunk and merge it from there.

mjangda and others added 7 commits (February 7, 2024 21:43):

- e.g. If you have CSS in an option with `\";`, that can confuse the current way we parse stuff and lead to incorrect replacements.
- To help catch various edge cases
- Use the length of the serialized content to determine the scope of the search-replace action. This is to avoid problems when serialized content has a closing marker (`\";`), e.g. in CSS content. This is much slower and needs to be optimized.
- …f identifying token details, as well as an unescaper for a more accurate calculation of byte size, as well as handling edge cases
- …sts having the wrong number of bytes for serialized data
@abdullah-kasim abdullah-kasim requested review from a team and mjangda October 7, 2024 09:59
@abdullah-kasim abdullah-kasim changed the base branch from trunk to update/serialized-lengths October 7, 2024 10:00
Comment on lines 231 to 232:

```go
//TODO: We should first check if we found the string when inside a quote or not.
// but currently skipping that scenario because it seems unlikely to find it outside.
```
@abdullah-kasim (Contributor, Author):
Rather hard to do, but it's possible that the regex matches inside, say, a JSON string, where there would be some sort of multi-layer escaping. Someone might also accidentally use the same combination of characters, i.e. `s:123:"`, for instance while writing a programming tutorial. We'd accidentally match these, so we should match as few of them as possible, because they affect the parser's stability.
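
For example (illustrative data only, not from the PR's test suite), a programming tutorial stored in wp_posts could contain the opener as plain prose, with no real serialized token behind it:

```
INSERT INTO wp_posts (post_content) VALUES ('PHP serializes strings as s:123:"..." followed by the raw bytes');
```

Treating that `s:123:"` as a token opener would desynchronize the parser for the rest of the line, which is why matching as little as possible matters.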

@mjangda (Member) replied:
Should we just remove the TODO and leave the comment as a note about the edge case?

@abdullah-kasim (Contributor, Author) replied Oct 14, 2024:
Sure, that makes sense.

@luismulinari (Contributor) left a comment:

This is awesome!

Just wondering if it would be worth adding some test cases from: https://github.com/php/php-src/tree/master/ext/standard/tests/serialize

It has a pretty good set of edge cases.

@nickchomey commented Oct 7, 2024

Nice work! I was thinking about this problem recently and found these two packages for handling PHP serialized strings in Go.

I asked in an issue in the 2nd repo what the difference was and the dev gave a nice answer.
trim21/go-phpserialize#53

I wonder if this (finding the serialized strings, unserialize, modify, reserialize) would be a more robust way of handling this? (I brought it up in #42 but perhaps it wasn't noticed/relevant)

@luismulinari (Contributor) left another comment:

Also, do you think it would be worth adding some fuzz tests?

@abdullah-kasim (Contributor, Author) replied:

Yeah, I definitely need more tests. Thanks for those suggestions, @luismulinari!

@abdullah-kasim (Contributor, Author) commented Oct 7, 2024

> Nice work! I was thinking about this problem recently and found these two packages for handling PHP serialized strings in Go.
>
> I asked in an issue in the 2nd repo what the difference was and the dev gave a nice answer. trim21/go-phpserialize#53
>
> I wonder if this (finding the serialized strings, unserialize, modify, reserialize) would be a more robust way of handling this? (I brought it up in #42 but perhaps it wasn't noticed/relevant)

@nickchomey I did notice your post about these libraries, and yeah, I agree I should add some comments about it :). The reason I didn't use these libraries is that it came down to simplicity and speed.

Neither of these two libraries is designed to parse anything out of a sqldump, so we would first need to parse the sqldump before these serialization libraries could do their work.

That means we either write a SQL parser or use something existing, like vitess-sqlparser. But we can't simply use a SQL parser, for the reason quoted below:

> Why
>
> xwb1989/sqlparser is a famous SQL parser in Go, but it cannot parse some queries (like OFFSET or bulk INSERT ...) because it customizes vitess's SQL parser.
>
> Also, some libraries use vitess's SQL parser directly, but vitess's SQL parser only partially supports DDL parsing.
>
> We want to perfectly support parsing for SQL and DDL. Therefore we use the vitess SQL parser directly and also use the TiDB parser for DDL parsing.

And so we hit a complexity point here: if we use a pre-existing SQL parser, we're at the whims of that parser library. Certain lines in the sqldump might not work; maybe it doesn't support certain MySQL syntax, maybe we have some nasty edge case in the sqldump, and we'd need to check for mydumper compatibility as well. Admittedly, I only did some thought experiments and didn't test-drive it for real, and I think it would be slower than this PR.

And that's how I decided to just write my own parser. It took at most a day for the core code.

@nickchomey replied:
Fair enough! Hopefully this can get merged soon enough!

@abdullah-kasim (Contributor, Author) commented:

> This is awesome!
>
> Just wondering if it would be worth adding some test cases from: php/php-src@master/ext/standard/tests/serialize
>
> It has a pretty good set of edge cases.

@luismulinari There are a bit too many test cases in there, and a lot of them are unrelated to string serialization. Is there anything specific you'd like to point out?

@abdullah-kasim (Contributor, Author) commented Oct 8, 2024

> Also, do you think it would be worth adding some fuzz tests?

@luismulinari

Probably not worth it. It's rather difficult to fuzz because of the escaping, so the fuzzer would need escaping logic of its own 😅, and the function pretty obviously short-circuits to just returning the line if it fails to parse it. There isn't enough branching for it, I suppose.

At least that's my take on it right now.
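
For reference, a native Go fuzz target (Go 1.18+) would look roughly like this. `Replace` is a hypothetical stand-in for the package's actual entry point, and, as noted above, useful seeds would need to be pre-escaped:

```go
package searchreplace

import (
	"strings"
	"testing"
)

// Replace is a hypothetical stub standing in for the real implementation.
func Replace(line, from, to string) string { return strings.ReplaceAll(line, from, to) }

// FuzzReplace mostly catches panics in the parser; the one property asserted
// is that a line not containing the search term passes through unchanged.
func FuzzReplace(f *testing.F) {
	f.Add(`s:3:"abc";`)
	f.Add(`s:19:"a { content: \"x\"; }";`)
	f.Fuzz(func(t *testing.T, line string) {
		const needle = "zzz-not-present"
		if strings.Contains(line, needle) {
			t.Skip()
		}
		if got := Replace(line, needle, "y"); got != line {
			t.Errorf("line without match was modified: %q -> %q", line, got)
		}
	})
}
```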

@abdullah-kasim abdullah-kasim changed the base branch from update/serialized-lengths to trunk October 8, 2024 08:50
@abdullah-kasim (Contributor, Author) commented Oct 8, 2024

Benchmarks in case they are a concern:

```
cpu: Apple M2 Max
BenchmarkFix-12                          1864381               641.4 ns/op
BenchmarkNoReplaceOld-12                149462286                8.210 ns/op
BenchmarkNoReplaceNew-12                 6201714               193.3 ns/op
BenchmarkSimpleReplaceOld-12            100000000               10.71 ns/op
BenchmarkSimpleReplaceNew-12             5087581               238.7 ns/op
BenchmarkSerializedReplaceOld-12        70218040                17.00 ns/op
BenchmarkSerializedReplaceNew-12         4164714               286.2 ns/op
```

20x slow-down?

Real world (2.8 GB DB dump):

```
[ 18:18:09 ] time cat dump.sql | ./go-search-replace-new domain-a.com domain-b.com > test-1.sql

real    0m2.093s
user    0m4.952s
sys     0m3.376s

go-search-replace on  trunk [📝] via  v1.23.2 via 🐘 v8.3.12 took 2s
[ 18:18:14 ] time cat dump.sql | ./go-search-replace-old domain-a.com domain-b.com > test-1.sql

real    0m1.332s
user    0m1.661s
sys     0m1.818s
```

Not too bad I suppose...
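
For context, these numbers come from standard Go benchmarks (run with `go test -bench .`). One benchmark from the pairs above would look roughly like this; `Replace` is again a hypothetical stand-in for the implementations being compared:

```go
package searchreplace

import (
	"strings"
	"testing"
)

// Replace is a hypothetical stub standing in for the implementation under test.
func Replace(line, from, to string) string { return strings.ReplaceAll(line, from, to) }

// BenchmarkSerializedReplaceNew times one replacement inside a serialized token.
func BenchmarkSerializedReplaceNew(b *testing.B) {
	line := `s:19:"http://domain-a.com";`
	for i := 0; i < b.N; i++ {
		Replace(line, "domain-a.com", "domain-b.com")
	}
}
```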

@mjangda (Member) left a comment:

Shared some minor notes; otherwise this looks good.

I did some separate performance testing as well (files of different sizes, including some that are 20GB+ with lots of serialized data). In most cases this update is slower, but not enough to be a concern. In one case (lots of serialized content), it was actually faster :) But I think results will vary depending on size, lines, complexity, etc.


@abdullah-kasim abdullah-kasim merged commit afaa434 into trunk Oct 14, 2024
1 check passed
@abdullah-kasim abdullah-kasim deleted the update/seralized-lengths-v2 branch October 14, 2024 02:31

Successfully merging this pull request may close: Incorrect length calculation in serialization of string data (#33).