Reimagined: Correctly handle lengths of serialized strings #43
Conversation
e.g. if you have CSS in an option containing `\";`, that can confuse the current way we parse things and lead to incorrect replacements.
To help catch various edge cases
Use the length of the serialized content to determine the scope of the search-replace action. This avoids problems when the serialized content itself contains the closing marker (`\";`), e.g. in CSS content. This is currently much slower and needs to be optimized.
…f identifying token details, as well as an unescaper for a more accurate calculation of byte size, as well as handling edge cases
…sts having the wrong number of bytes for serialized data
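A minimal sketch of the length-based idea described in the commits above (all names here are hypothetical, not the PR's actual code): instead of scanning for the first `";`, trust the byte length that the serialized token declares, then verify the closing marker is where it should be. The sketch ignores SQL-dump escaping, which the PR handles separately.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// tokenRe matches the start of a serialized PHP string: s:<byte length>:"
var tokenRe = regexp.MustCompile(`s:(\d+):"`)

// extract returns the content of the first serialized string in line,
// trusting the declared byte length instead of scanning for `";`.
func extract(line string) (string, bool) {
	loc := tokenRe.FindStringSubmatchIndex(line)
	if loc == nil {
		return "", false
	}
	n, err := strconv.Atoi(line[loc[2]:loc[3]])
	if err != nil {
		return "", false
	}
	start, end := loc[1], loc[1]+n
	// The two bytes after the content must be the real closing marker.
	if end+2 > len(line) || line[end:end+2] != `";` {
		return "", false
	}
	return line[start:end], true
}

func main() {
	// The content itself contains `";`, which would fool a naive scan
	// for the closing marker; counting 10 bytes is not fooled.
	content, ok := extract(`s:10:"ab";cdefgh";`)
	fmt.Println(ok, content) // true ab";cdefgh
}
```

A naive search for `";` in the example above would stop right after `ab` and truncate the value.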
search-replace.go
Outdated
//TODO: We should first check if we found the string when inside a quote or not.
// but currently skipping that scenario because it seems unlikely to find it outside.
Rather hard to do, but it's possible that the regex matches inside, say, a JSON string with some sort of multi-layer escaping. Someone might also accidentally use the same combination of characters, i.e. `s:123:"`, maybe because they're writing a programming tutorial. We'd accidentally match these, so we should match as little as possible, because it affects the parser's stability.
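One way to keep such accidental matches from destabilizing the parser (a hedged sketch, not the PR's code) is to validate that the declared length lands exactly on a closing `";` before trusting the token, and otherwise leave the line alone:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

var tokenRe = regexp.MustCompile(`s:(\d+):"`)

// looksLikeToken reports whether the first match in line is a well-formed
// serialized string, i.e. the declared length lands exactly on a closing
// `";`. Prose that merely quotes `s:123:"` fails this check.
func looksLikeToken(line string) bool {
	loc := tokenRe.FindStringSubmatchIndex(line)
	if loc == nil {
		return false
	}
	n, err := strconv.Atoi(line[loc[2]:loc[3]])
	if err != nil {
		return false
	}
	end := loc[1] + n
	return end+2 <= len(line) && line[end:end+2] == `";`
}

func main() {
	fmt.Println(looksLikeToken(`s:3:"foo";`))                        // true
	fmt.Println(looksLikeToken(`tutorial: write s:123:" then data`)) // false
}
```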
Should we just remove the TODO and leave the comment as a note about the edge case?
Sure, that makes sense.
This is awesome!
Just wondering if it would be worth adding some test cases from: https://github.com/php/php-src/tree/master/ext/standard/tests/serialize
It has a pretty good set of edge cases.
Nice work! I was thinking about this problem recently and found these two packages for handling PHP serialized strings in Go. I asked in an issue on the 2nd repo what the difference was, and the dev gave a nice answer. I wonder if this (finding the serialized strings, unserialize, modify, reserialize) would be a more robust way of handling this? (I brought it up in #42 but perhaps it wasn't noticed/relevant.)
Also, do you think it would be worth adding some fuzz tests?
Yeah, I definitely need more tests. Thanks for those suggestions, @luismulinari!
@nickchomey I did notice your post about these libraries, and yeah, I agree I should add some comments about it :) . The reason I didn't use these libraries comes down to simplicity and speed. Neither of these two libraries is designed to parse anything out of a sqldump, so we'd first need to parse the sqldump before a serialization library can do its work. Which means either we write an SQL parser, or use something existing, like
And so we're hitting a complexity point here: if we use a pre-existing SQL parser, we're at the whims of that parser library. Certain lines in the sqldump might not work; maybe it doesn't support certain MySQL syntax, maybe we hit some nasty edge case in the dump, and we'd need to check mydumper compatibility as well. Admittedly, I only ran thought experiments and didn't test-drive it for real, and I think it would be slower than this PR. That's how I decided to just write my own parser. The core code took at most a day.
Fair enough! Hopefully this can get merged soon enough!
@luismulinari There are quite a lot of test cases in there, and much of it is unrelated to string serialization. Anything specific you wish to point out?
Probably not worth it: it's rather difficult to fuzz because of the escaping, so the fuzzer would need escaping logic itself 😅, and the function pretty obviously short-circuits to just returning the line if it fails to parse it. Not enough branching for it, I suppose. At least that's my take on it right now.
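For reference, the round-trip property such a harness would check can be sketched as below; this is a deterministic seed-corpus version with hypothetical helper names, and it deliberately omits the SQL-dump escaping layer, which is the hard part mentioned above. A real `go test -fuzz` harness would wrap the same property in an `f.Fuzz` callback.

```go
package main

import "fmt"

// serialize builds a PHP-style serialized string token from raw content;
// len() counts bytes, matching PHP serialize() semantics.
func serialize(content string) string {
	return fmt.Sprintf(`s:%d:"%s";`, len(content), content)
}

// parse recovers the content via the declared length. There is no
// SQL-dump unescaping here; a real harness would need that layer too.
func parse(line string) (string, bool) {
	var n int
	if _, err := fmt.Sscanf(line, `s:%d:`, &n); err != nil {
		return "", false
	}
	prefix := fmt.Sprintf(`s:%d:"`, n)
	start, end := len(prefix), len(prefix)+n
	if end+2 > len(line) || line[end:end+2] != `";` {
		return "", false
	}
	return line[start:end], true
}

func main() {
	// Seed-corpus-style inputs, including the troublesome `";` sequence
	// and a serialized token nested inside the content itself.
	for _, c := range []string{``, `foo`, `";`, `a{x:\";}`, `s:3:"foo";`} {
		got, ok := parse(serialize(c))
		fmt.Println(ok && got == c)
	}
}
```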
…erial replacements
…acing off-by-one errors by improving the data structure
Benchmarks in case they are a concern: 20x slow-down? Real world (2.8 GB DB dump): not too bad, I suppose...
Shared some minor notes; otherwise this looks good.
I did some separate performance testing as well (different file sizes, including some that are 20GB+ and have lots of serialized data). In most cases this update is slower, but not enough to be a concern. In one case (lots of serialized content) it was actually faster :) But I think results will vary depending on size, lines, complexity, etc.
…hat we wanted on purpose
… conditions to prevent confusion in case of error
Hello, this PR is an alternative to PR #39 and should fix #33.
Regex isn't great for this purpose, as it has a tendency to over-match. Hence, this PR offers an alternative solution.
This PR uses a mini-tokenizer/parser to detect instances of PHP strings, does the search-replace within the constraints of the token, then goes back up the stack to recalculate the new byte size.
Special care has been taken so that we don't match MySQL's sqldump escaping.
It should technically work with mydumper as well: upon inspecting a mydumper dump, the escaping is very much the same.
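The core steps described above (find the token, replace within its bounds, recompute the declared byte length) might look roughly like this. All names are illustrative, not the PR's actual code, and the sketch ignores sqldump escaping, which the PR handles separately.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// serializedStr matches the start of a serialized PHP string: s:<len>:"
var serializedStr = regexp.MustCompile(`s:(\d+):"`)

// replaceInToken replaces from->to inside the first serialized PHP string
// in line and fixes up the declared byte length. It returns the line
// unchanged whenever the token cannot be parsed safely.
func replaceInToken(line, from, to string) string {
	loc := serializedStr.FindStringSubmatchIndex(line)
	if loc == nil {
		return line
	}
	n, err := strconv.Atoi(line[loc[2]:loc[3]])
	if err != nil {
		return line
	}
	start, end := loc[1], loc[1]+n
	if end+2 > len(line) || line[end:end+2] != `";` {
		return line
	}
	replaced := strings.ReplaceAll(line[start:end], from, to)
	// len() counts bytes, matching PHP serialize() length semantics.
	return line[:loc[0]] + fmt.Sprintf(`s:%d:"%s";`, len(replaced), replaced) + line[end+2:]
}

func main() {
	fmt.Println(replaceInToken(`s:18:"http://example.org";`, "http://", "https://"))
	// s:19:"https://example.org";
}
```

The byte length grows from 18 to 19 because `https://` is one byte longer than `http://`; getting this recalculation wrong is exactly what breaks PHP unserialization downstream.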
There were a few errors with the tests in PR #39 which I've fixed. I've also added new tests to handle some edge cases with excessive escaping, as well as the very plausible scenario of someone using the string combination `\";`, e.g. in WP posts. If approved, I'll re-point the branch to trunk and merge it from there.