💸 SQL parsing is expensive...and you know it! #2065

breedx-splk · 2021-01-16T01:03:04Z

breedx-splk
Jan 16, 2021
Collaborator

Hi.

Rather than just open an issue, I thought I would open a discussion because I know there has been a fair amount of talk and history around this subject. The gist is that SQL string parsing is expensive, and the default JDBC instrumentation parses every single SQL statement. This parsing introduces noteworthy application overhead, and for simple database-heavy apps (like spring-petclinic-rest) it can be significant (I've measured a 350% increase in response times).

It's even worse than that. The default implementation both normalizes the statement AND parses out the operation and table name. These are two distinct passes through a complex language string, and they use two different parser implementations. Normalization can be disabled of course, but the result is that potentially sensitive data could be leaked into an attribute and/or the accuracy of the statement table/operation extractor gets weakened. I looked and couldn't find anything in the spec that suggests SQL normalization/obfuscation.

So why do we parse? Surprisingly, it doesn't look like we populate the db.sql.table nor db.operation attributes (maybe that's a separate bug?). The table name and operation name, though, are parsed so that they can be inserted into a descriptive, low cardinality span name.

The database semantic conventions have some interesting things to say about this. There are more than 3 places where the spec discourages client-side parsing of sql in order to determine the attributes or span name components.

[span name] It is not recommended to attempt any client-side parsing of db.statement just to get these properties, they should only be used if the library being instrumented already provides them.

[4]: When setting this to an SQL keyword, it is not recommended to attempt any client-side parsing of db.statement just to get this property, but it should be set if the operation name is provided by the library being instrumented. If the SQL statement has an ambiguous operation, or performs more than one operation, this value may be omitted.

[2]: It is not recommended to attempt any client-side parsing of db.statement just to get this property, but it should be set if it is provided by the library being instrumented.

Yet we still do it, and it's still slow. The span names are fantastic, but at what cost?

There's also this hint that the simplest/fastest thing should be done client side, and that more complex parsing could be undertaken by a smart backend:

Back ends could, for example, use the provided identifier to determine the appropriate SQL dialect for parsing the db.statement.

My recommendation now is to disable the default behavior of always parsing/extracting database/operation info here and ALSO to default the otel.instrumentation.jdbc.query.normalizer.enabled setting to false. My estimation is that most users will prefer a more lightweight implementation. I like the parsing, though, so I think we should keep it around but guarded by a config setting, such as otel.instrumentation.jdbc.query.extractor.enabled (doesn't exist yet). Users that need both the security/utility of the existing normalizer and extractor will set both of these to true, slowing down the instrumentation considerably in exchange for better span names. Default users will get vaguer span names and unadulterated statements, but with the benefit of better performance.

What do you think?

anuraaga · 2021-01-16T02:08:55Z

anuraaga
Jan 16, 2021

Making expensive normalization opt-in makes sense to me. I would probably still do a tiny "parse" to get a span name like "SELECT table_name". I suspect statement.substring(0, statement.indexOf(' ')) will not hurt much and is all we need.
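For illustration, here is a minimal sketch of that cheap first-word extraction. The class and method names (`CheapSpanName`, `firstWord`) and the fallback parameter are hypothetical, not existing agent API. Note that the one-liner as written yields only the leading verb rather than the table name, and that `indexOf` returns -1 for a statement with no space (e.g. `COMMIT`), which makes `substring` throw, so the sketch scans for the word boundaries by hand instead.

```java
final class CheapSpanName {
  private CheapSpanName() {}

  /**
   * Returns the first whitespace-delimited word of the statement (typically the SQL verb),
   * or the fallback when the statement is null or blank.
   */
  static String firstWord(String statement, String fallback) {
    if (statement == null) {
      return fallback;
    }
    int length = statement.length();
    int start = 0;
    // Skip leading whitespace so "  SELECT ..." still yields "SELECT".
    while (start < length && Character.isWhitespace(statement.charAt(start))) {
      start++;
    }
    int end = start;
    // Advance to the end of the first word; no exception if there is no space at all.
    while (end < length && !Character.isWhitespace(statement.charAt(end))) {
      end++;
    }
    return end == start ? fallback : statement.substring(start, end);
  }
}
```

Usage would be along the lines of `CheapSpanName.firstWord(sql, "DB Query")`, where the fallback string is whatever generic span name the instrumentation already uses. The only allocation is the single substring for the result.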

1 reply

anuraaga Jan 16, 2021

FWIW, it goes with my "when in doubt, just copy Brave" model :)

https://github.com/openzipkin/brave/blob/master/instrumentation/mysql8/src/main/java/brave/mysql8/TracingQueryInterceptor.java#L57

mateuszrzeszutek · 2021-01-18T15:54:55Z

mateuszrzeszutek
Jan 18, 2021

The statement info extractor should have been guarded with a config property in the first place; it's an oversight that we should definitely correct.
I'd like to propose the following configuration changes (inspired by InstrumentationModule enabled property handling):

Add a new property otel.instrumentation.statement.info-extractor.default-enabled, default value = false
Each instrumentation that uses the statement info extractor should have its own property otel.instrumentation.<instrumentationName>.statement.info-extractor.enabled, default value same as property above
Add a new property otel.instrumentation.statement.sanitizer.default-enabled (I think that we should consider renaming "normalizer" to "sanitizer" everywhere, since the main purpose is data sanitization); default value = false (? I'd probably be in favor of default value = true because of the possible PCI/PII data that could be present in the query)
Each instrumentation that uses the statement sanitizer should have its own property otel.instrumentation.<instrumentationName>.statement.sanitizer.enabled, default value same as property above

We could also consider using db-statement instead of plain statement to be a bit closer to the spec with the naming.

Anyway, we would have two general properties that enable/disable query parsing everywhere, and each instrumentation that does it would have a separate configuration switch in case somebody wants to disable/enable it just in one place for some reason. WDYT?
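A minimal sketch of how that fallback could be resolved, assuming the property names proposed above and a plain `Map<String, String>` of already-loaded properties. `StatementParsingConfig` is a hypothetical helper for illustration, not the agent's actual config class.

```java
import java.util.Map;

final class StatementParsingConfig {
  private final Map<String, String> properties;

  StatementParsingConfig(Map<String, String> properties) {
    this.properties = properties;
  }

  /** Per-instrumentation extractor switch, falling back to the general default. */
  boolean isInfoExtractorEnabled(String instrumentationName) {
    return resolve(
        "otel.instrumentation." + instrumentationName + ".statement.info-extractor.enabled",
        "otel.instrumentation.statement.info-extractor.default-enabled");
  }

  /** Per-instrumentation sanitizer switch, falling back to the general default. */
  boolean isSanitizerEnabled(String instrumentationName) {
    return resolve(
        "otel.instrumentation." + instrumentationName + ".statement.sanitizer.enabled",
        "otel.instrumentation.statement.sanitizer.default-enabled");
  }

  private boolean resolve(String specificKey, String defaultKey) {
    String specific = properties.get(specificKey);
    if (specific != null) {
      // An explicit per-instrumentation value always wins.
      return Boolean.parseBoolean(specific);
    }
    // Otherwise use the general switch; a missing value parses as false,
    // which matches the proposed default of "disabled".
    return Boolean.parseBoolean(properties.get(defaultKey));
  }
}
```

With this shape, setting only the general property to `true` enables parsing everywhere, and a single instrumentation can still be turned off by setting its own property to `false`.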

2 replies

breedx-splk Jan 20, 2021
Collaborator Author

I like the idea of changing normalizer -> sanitizer -- this is definitely a potential point of confusion! 🆗

I like the idea of adding config to help with this, but I have a couple of thoughts:

I don't really like default-enabled...why not just enabled?
Do we really think it's important to have such fine-grained control? It seems like allowing individual instrumentations to have their own configs for overriding the broader config just adds complexity with little benefit.

Can we just keep it as simple as two settings:

otel.instrumentation.statement.info-extractor.enabled=true (because it's not enabled by default)
otel.instrumentation.sanitizer.enabled=true (because it's not enabled by default)

I guess it's a little 🌶️ to consider disabling sanitization by default, but if users are hell-bent on dynamically generated SQL statements that contain sensitive data they can re-enable this (and the corresponding performance penalty).

trask Jan 26, 2021
Maintainer

I don't really like default-enabled...why not just enabled?

this is a good point. I could see even renaming otel.instrumentation.default-enabled to otel.instrumentation.enabled.

just came here to add a note that we will want to check info-extractor setting in cassandra also (see #1314)

iNikem · 2021-01-19T09:12:56Z

iNikem
Jan 19, 2021

@johnbley any thoughts on that?

0 replies

johnbley · 2021-01-19T14:27:52Z

johnbley
Jan 19, 2021

I will definitely vote that we should disable pulling out table name / sql verb by default - and that such parsing (in the sense of having knowledge of grammar) belongs outside the instrumentation agent. I don't want to give up on the tokenization that the SqlNormalizer is doing however, for obvious data sensitivity reasons. It seems that my initial (somewhat hasty) choice of javacup is probably not the best since it doesn't offer a low-thrash way to do things. Consider the basic algorithm here in terms of allocations:

1. allocate a small state-tracking object for tokenization and accept the input string
2. tokenize, and while tokenizing copy/transform tokens to an output string which is size-limited

Step 2 there should basically be stack-local fields (e.g., token start/end positions in the original string) and selective copying from one char[] to another. However, it appears that javacup allocates a new String object for every token and I don't see a clean way to fix that. I would recommend the following steps:

1. Experiment with other lexer libraries and see if one offers better performance
2. If not, consider hand-rolling a lexer or tweaking the generated output from javacup to not allocate Strings (and check in the result plus a very healthy set of comments about how/why)
3. Possibly upstream a non-allocating patch to javacup

It is absolutely possible to tokenize sql in an instrumentation agent at reasonable overheads; it just requires engineering effort.
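To make the allocation argument concrete, here is a rough hand-rolled sketch of such a tokenizing pass. It is not the agent's SqlNormalizer and is far from a complete SQL lexer: it assumes a non-null input, only recognizes single-quoted string literals and plain decimal numbers, and the name `SimpleSqlSanitizer` is hypothetical. Token boundaries are kept in local ints, and the only allocations per statement are one `StringBuilder` and the final `String`.

```java
final class SimpleSqlSanitizer {
  private SimpleSqlSanitizer() {}

  /** Replaces string and numeric literals with '?', truncating output to maxLength chars. */
  static String sanitize(String sql, int maxLength) {
    int n = sql.length();
    StringBuilder out = new StringBuilder(Math.min(n, maxLength));
    int i = 0;
    while (i < n && out.length() < maxLength) {
      char c = sql.charAt(i);
      if (c == '\'') {
        // String literal: scan to the closing quote, treating '' as an escaped quote.
        i++;
        while (i < n) {
          if (sql.charAt(i) == '\'') {
            if (i + 1 < n && sql.charAt(i + 1) == '\'') {
              i += 2;
              continue;
            }
            break;
          }
          i++;
        }
        i++; // step past the closing quote (or past the end if unterminated)
        out.append('?');
      } else if (Character.isDigit(c)) {
        // Numeric literal: digits with optional decimal point.
        while (i < n && (Character.isDigit(sql.charAt(i)) || sql.charAt(i) == '.')) {
          i++;
        }
        out.append('?');
      } else if (Character.isLetter(c) || c == '_') {
        // Keyword or identifier: copy the range directly, no intermediate String.
        int start = i;
        while (i < n && (Character.isLetterOrDigit(sql.charAt(i)) || sql.charAt(i) == '_')) {
          i++;
        }
        int remaining = maxLength - out.length();
        out.append(sql, start, Math.min(i, start + remaining));
      } else {
        // Whitespace, operators, punctuation: copy through unchanged.
        out.append(c);
        i++;
      }
    }
    return out.toString();
  }
}
```

For example, `sanitize("SELECT * FROM owners WHERE id = 42", 100)` would produce `SELECT * FROM owners WHERE id = ?`. Hex literals, exponents, comments, and dialect-specific quoting are deliberately left out of the sketch.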

Separately, I agree there should be a similar tokenizing normalizer in the otel collector, though given the breadth of data it deals with, a design for such might involve a more general grammar, regex options, etc. etc.

1 reply

anuraaga Jan 20, 2021

I will definitely vote that we should disable pulling out table name / sql verb by default - and that such parsing (in the sense of having knowledge of grammar) belongs outside the instrumentation agent.

I hope we can at least use the simple indexOf + substring for a simple name - extremely cheap but with reasonable bang for buck I think.

mateuszrzeszutek · 2021-01-21T14:14:21Z

mateuszrzeszutek
Jan 21, 2021

I've run some very simple tests against different versions of the query normalizer/sanitizer - here's the repo with all the code.

I've tested 4 different scenarios:

1. Reading a huge input file with SQLs and printing them out to another file, times 100 (base-scenario);
2. Same as 1. but the query is sanitized with our current implementation before it gets printed out (javacc-original);
3. Same as 2. but the lexer implementation was modified to avoid unnecessary String creation (javacc-modified);
4. Same as 1. but using a lexer written from scratch using JFlex (jflex).

All of them ran on a JVM with a 64 MB heap - I wanted to see how frequent the GC pauses would be.
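A comparable measurement can be approximated in-process. The sketch below is not the code from the linked repo: `SanitizerBenchmark` is a hypothetical name, it skips the file I/O entirely, and it counts collections reported by the standard `GarbageCollectorMXBean` rather than GC pauses taken from logs, so absolute numbers would not line up with the ones in this thread. It would be run with a small heap (e.g. `-Xmx64m`) to make allocation pressure visible.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;
import java.util.function.UnaryOperator;

final class SanitizerBenchmark {
  private SanitizerBenchmark() {}

  static final class Result {
    final long elapsedMillis;
    final long gcCount;
    final long checksum;

    Result(long elapsedMillis, long gcCount, long checksum) {
      this.elapsedMillis = elapsedMillis;
      this.gcCount = gcCount;
      this.checksum = checksum;
    }
  }

  /** Sums collection counts across all collectors of the running JVM. */
  static long totalGcCount() {
    long total = 0;
    for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
      long count = gc.getCollectionCount(); // -1 if the collector does not report it
      if (count > 0) {
        total += count;
      }
    }
    return total;
  }

  /** Applies the transform to every statement, `repetitions` times over. */
  static Result run(List<String> statements, int repetitions, UnaryOperator<String> transform) {
    long gcBefore = totalGcCount();
    long start = System.nanoTime();
    long checksum = 0;
    for (int r = 0; r < repetitions; r++) {
      for (String statement : statements) {
        // Consume the output so the JIT cannot discard the call as dead code.
        checksum += transform.apply(statement).length();
      }
    }
    long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
    return new Result(elapsedMillis, totalGcCount() - gcBefore, checksum);
  }
}
```

Passing `s -> s` as the transform gives the baseline, and each sanitizer variant can be plugged in the same way. For numbers worth checking into a test suite, a dedicated harness such as JMH would handle warm-up and JIT effects more reliably than this loop.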

Results:

base-scenario took ~1.2 secs and had 6 GC pauses;
javacc-original took ~9 secs and had 686 (!) GC pauses;
javacc-modified took ~8 secs and had 585 GC pauses (there is a noticeable difference here, but not that large I guess);
jflex took ~4.2 secs and had 350 GC pauses.

The test application is exceedingly simple and not a real-life scenario, but it still unambiguously shows that jflex is faster and generates less trash.

I'm going to do the same test for the SQL info extractor: I believe that the difference here will be even more noticeable, since the only thing the extractor does is find two strings; nothing else needs to be copied.
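As a rough illustration of why an extractor needs so little copying, the sketch below scans once, compares keywords in place with `String.regionMatches`, and allocates only the table-name substring plus a small result object. It is a simplification under stated assumptions, not the agent's extractor: `SimpleSqlInfoExtractor` is a hypothetical name, the input is assumed non-null, only SELECT/INSERT/UPDATE/DELETE are recognized, and it gives up on the table name for subqueries or quoted identifiers.

```java
final class SimpleSqlInfoExtractor {
  private SimpleSqlInfoExtractor() {}

  static final class Info {
    final String operation; // upper-case verb, or null if not recognized
    final String table;     // first table name found, or null

    Info(String operation, String table) {
      this.operation = operation;
      this.table = table;
    }
  }

  private static final String[] OPERATIONS = {"SELECT", "INSERT", "UPDATE", "DELETE"};

  static Info extract(String sql) {
    String operation = null;
    String table = null;
    boolean expectTable = false;
    int i = 0;
    int n = sql.length();
    while (i < n && table == null) {
      char c = sql.charAt(i);
      if (Character.isLetter(c) || c == '_') {
        int start = i;
        while (i < n && isIdentifierPart(sql.charAt(i))) {
          i++;
        }
        if (operation == null) {
          operation = matchOperation(sql, start, i);
          if (operation == null) {
            break; // unknown leading keyword: report nothing
          }
          expectTable = operation.equals("UPDATE"); // UPDATE <table> ...
        } else if (expectTable) {
          table = sql.substring(start, i);
        } else if (matches(sql, start, i, "FROM") || matches(sql, start, i, "INTO")) {
          expectTable = true;
        }
      } else if (Character.isWhitespace(c)) {
        i++;
      } else if (expectTable) {
        break; // subquery, quoted identifier, etc.: give up on the table name
      } else if (c == '\'') {
        // Skip string literals so words inside them are not mistaken for keywords.
        i++;
        while (i < n && sql.charAt(i) != '\'') {
          i++;
        }
        i++;
      } else {
        i++;
      }
    }
    return new Info(operation, table);
  }

  private static boolean isIdentifierPart(char c) {
    return Character.isLetterOrDigit(c) || c == '_' || c == '.';
  }

  private static String matchOperation(String sql, int start, int end) {
    for (String op : OPERATIONS) {
      if (matches(sql, start, end, op)) {
        return op;
      }
    }
    return null;
  }

  // Case-insensitive in-place comparison; does not allocate.
  private static boolean matches(String sql, int start, int end, String keyword) {
    return end - start == keyword.length()
        && sql.regionMatches(true, start, keyword, 0, keyword.length());
  }
}
```

The scan stops as soon as a table name is found, so for typical statements it touches only a prefix of the input.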

FYI: @johnbley @breedx-nr

4 replies

johnbley Jan 21, 2021

Awesome work! This is a promising direction and after the dust settles I hope some form of this microbenchmark can make it into the automated test suite so that we don't regress performance on this critical bit in the future.

mateuszrzeszutek Jan 22, 2021

Tested the extractors - scenario is the same as in the previous post, read a huge file with SQLs 100 times and print them out to another file:

javacc-extractor-original took ~3.4 seconds and had 452 GC pauses;
jflex-extractor took ~2.5 seconds and had 338 GC pauses.

The fact that the jflex extractor implementation did not have many fewer GC pauses than the sanitizer suggests that most memory is used by the lexer's internal logic and whatever we're doing with the output does not matter that much. If that's the case, then combining the sanitizer and info extractor into a single lexer that does a single pass should have little overhead over just the sanitizer. I'm going to verify this next: I'll compare the current implementation (javacc, two separate lexers) with a single jflex lexer that does both things.

mateuszrzeszutek Jan 22, 2021

Some more results:

javacc-both (two separate JavaCC lexers; the current implementation in the agent) took ~12.2 seconds and had 1130 GC pauses;
jflex-both (a single JFlex lexer that does both the sanitization and extraction) took ~5 seconds and had 355 GC pauses.

I've also noticed that by default JFlex uses buffer size = 16 KB. That's a lot of memory for parsing SQL queries - fortunately that value is configurable; I've decided to experiment with it a little.

With buffer size = 4KB:

jflex took ~3.9 seconds and had 108 GC pauses;
jflex-extractor took ~2 seconds and had 90 GC pauses;
jflex-both took ~4.7 seconds and had 111 GC pauses.

With buffer size = 2KB:

jflex took ~3.5 seconds and had 67 GC pauses;
jflex-extractor took ~1.8 seconds and had 50 GC pauses;
jflex-both took ~4.4 seconds and had 71 GC pauses.

With buffer size = 1KB:

jflex took ~3.7 seconds and had 47 GC pauses;
jflex-extractor took ~2 seconds and had 30 GC pauses;
jflex-both took ~4.6 seconds and had 51 GC pauses.

mateuszrzeszutek Jan 22, 2021

Anyway, all these tests clearly show that JFlex-based lexers have superior performance to the ones that we have now - especially when buffer size is set to 1-2 KB (probably very few SQL statements are larger than that). What's more, it might be worth it to integrate the sanitizer and info extractor into a single lexer, since the difference between sanitizer and sanitizer+extractor performance does not seem to be that large - in my simplified test case of course; but I suspect that in a "real" application that does more than reading & writing to files the difference might be even more negligible.

And if we decide to use a single lexer, a single config property should be enough to enable/disable it everywhere.

breedx-splk · 2021-01-26T01:10:48Z

breedx-splk
Jan 26, 2021
Collaborator Author

@mateuszrzeszutek has submitted a new lexer with greatly improved performance and a considerable reduction in allocations and GC overhead. I ran some benchmarks and put the results in that PR here: #2113 (comment)

0 replies
