💸 SQL parsing is expensive...and you know it! #2065
Replies: 6 comments 8 replies
-
Making expensive normalization opt-in makes sense to me. I would probably still do a tiny "parse" for the table name for a span name like "SELECT table_name". |
Beta Was this translation helpful? Give feedback.
-
The statement info extractor should have been guarded with a config property in the first place, it's an oversight that we should definitely correct.
We could also consider using Anyway, we would have two general properties that enable/disable query parsing everywhere, and each instrumentation that does it would have a separate configuration switch in case somebody wants to disable/enable it just in one place for some reason. WDYT? |
Beta Was this translation helpful? Give feedback.
-
@johnbley any thoughts on that? |
Beta Was this translation helpful? Give feedback.
-
I will definitely vote that we should disable pulling out table name / sql verb by default - and that such parsing (in the sense of having knowledge of grammar) belongs outside the instrumentation agent. I don't want to give up on the tokenization that the SqlNormalizer is doing however, for obvious data sensitivity reasons. It seems that my initial (somewhat hasty) choice of javacup is probably not the best since it doesn't offer a low-thrash way to do things. Consider the basic algorithm here in terms of allocations:
It is absolutely possible to tokenize sql in an instrumentation agent at reasonable overheads; it just requires engineering effort. Separately, I agree there should be a similar tokening normalizer in the otel collector, though given the breadth of data it deals with a design for such might involve a more general grammar, regex options, etc. etc. |
Beta Was this translation helpful? Give feedback.
-
I've run some very simple tests against different versions of the query normalizer/sanitizer - here's the repo with all the code. I've tested 4 different scenarios:
All of them run on a JVM that had 64m heap - I wanted to see how frequent the GC pauses would be. Results:
The test application is exceedingly simple and not a real-life scenario, but it still unambiguously shows that jflex is faster and generates less trash. I'm going to do the same test for the SQL info extractor: I believe that the difference here will be even more noticeable, since the only thing the extractor does is finding two strings, nothing else needs to be copied. FYI: @johnbley @breedx-nr |
Beta Was this translation helpful? Give feedback.
-
@mateuszrzeszutek has submitted a new lexer with greatly improved performance and a considerable reduction in allocations and GC overhead. I ran some benchmarks and put the results in that PR here: #2113 (comment) |
Beta Was this translation helpful? Give feedback.
-
Hi.
Rather than just open an issue, I thought I would open a discussion because I know there has been a fair amount of talk and history around this subject. The gist is that sql string parsing is expensive, and the default jdbc instrumentation parses every single sql statement. This parsing introduces noteworthy application overhead, and for simple database-heavy apps (like spring-petclinic-rest) it can be significant (I've measured 350% increase in response times).
It's even worse than that. The default implementation both normalizes AND parses the operation and table name. These are two distinct passes through a complex language string and it leverages two different parser implementations. Normalization can be disabled of course, but the result is that potentially sensitive data could be leaked into an attribute and/or the statement table/operation extractor accuracy gets weakened. I looked and couldn't find anything in the spec that suggests sql normalization/obfuscation.
So why do we parse? Surprisingly, it doesn't look like we populate the
db.sql.table
nordb.operation
attributes (maybe that's a separate bug?). The table name and operation name, though, are parsed so that they can be inserted into a descriptive, low cardinality span name.The database semantic conventions have some interesting things to say about this. There are more than 3 places where the spec discourages client-side parsing of sql in order to determine the attributes or span name components.
Yet we still do it, and it's still slow. The span names are fantastic, but at what cost?
There's also this hint that the simplest/fastest thing should be done client side, and that more complex parsing could be undertaken by a smart backend:
My recommendation now is to disable the default behavior of always parsing/extracting database/operation info here and ALSO to default the
otel.instrumentation.jdbc.query.normalizer.enabled
setting to false. My estimation is that most users will prefer a more lightweight implementation, and I like the parsing, so I think we should keep it around but guarded by a config setting, such asotel.instrumentation.jdbc.query.extractor.enabled
(doesn't exist yet). Users that need both the security/utility of the existing normalizer and extractor will set both of these totrue
and will slow down the instrumentation considerably while trading it for better span names. Default users will get more vague span names and unadulterated span names, but at the benefit of performance.What do you think?
Beta Was this translation helpful? Give feedback.
All reactions