
Sentry SDK serializes stack frames before determining whether the error was already handled earlier. #4365


Description

@jvineet

How do you use Sentry?

Self-hosted/on-premise

Version

"2.25.1"

Steps to Reproduce

We noticed that a lot of our FastAPI transactions in Sentry were showing an unexpectedly long delay after the typical request-response cycle for a route. On further investigation, we realized this increase in latency was coming from how Sentry was capturing the error event in those cases. After setting up spans around Sentry's own internal event capture and serialization functions, we noticed that while sentry-sdk only sent an error event to the Sentry server once, as we expected, it would still serialize the stack frames for subsequent event captures before checking whether the event had already been sent earlier. This wastes CPU cycles and needlessly slows down response latency by a decent amount. Can we update the logic here to first check whether the event needs to be sent before we start serializing stack frames?

The issue is even more stark in a test/prod environment running a real app, which can easily have 25-30 stack frames that need serializing; we are seeing latency increases of 30 ms or more from these unnecessary extra stack frame serializations.

Here is a real world example:
Image

I have a simple setup below to reproduce the issue with fastapi, though it is not limited to fastapi:

# main.py

import fastapi
import logging

import sentry_sdk
from sentry_sdk import serializer
from sentry_sdk.integrations import asgi

# Wrap the SDK's own capture, serialization, and ASGI exception-capture
# functions in tracing spans so their runtime shows up in the transaction.
sentry_sdk.capture_event = sentry_sdk.trace(sentry_sdk.capture_event)
serializer.serialize = sentry_sdk.trace(serializer.serialize)
asgi._capture_exception = sentry_sdk.trace(asgi._capture_exception)

sentry_sdk.init("<your sentry dsn>", traces_sample_rate=1)

app = fastapi.FastAPI()

@app.get("/test")
def test() -> str:
    raise ValueError("Check the linked trace")


@app.exception_handler(ValueError)
def exception_handler(request: fastapi.Request, exc: Exception):
    # logging error here so that sentry-sdk can serialize stack frames, capture exception,
    # send it to sentry and also mark the error as already "handled"
    logging.exception("Log Exception")

    # reraise so it bubbles up to outer middleware/layers.
    raise exc

Run with uvicorn

$ uvicorn main:app --port 8000

Expected Result

We expect stack frame serialization to happen only once, and capture_event() to send the raised exception to Sentry only once.

Actual Result

capture_event() does seem to send the event to Sentry only once, as we expect. But the stack frame serialization happens three separate times, and the last two happen only for capture_event to realize that it had already sent the exception earlier.

Image

In this example trace, we can see that while the normal request-response cycle, including the first error capture, would have taken 8-9 ms, the overall transaction ended up taking 21 ms, mostly due to the two subsequent stack frame serializations for the same error event.

The following snapshot expands one of the three auto-grouped spans, to show that these spans are indeed coming from the serialization step for every stack frame in the exception.

Image

The transaction is only linked to one exception, so these are not three separate error events either.

Image

Activity

sentrivana (Contributor) commented on May 6, 2025

Hey @jvineet, thanks for raising this. There are a couple of things going on here.

First of all, transactions and errors are serialized separately. In the sample app you're capturing an exception as well as a transaction, so the expected number of full serialization calls would be two, not one.

For some background, the logic responsible for determining whether an exception is a duplicate and should be dropped lives in its own integration and is applied via a global event processor. Event processors run before serialization, so events that are going to be dropped don't go through the whole serialization routine. However, and this is most likely what you're seeing, we do pre-serialize parts of the event (namely, the stacktrace) early. This was a conscious decision in order to avoid potentially tampering with actual userland objects (see #3409). It happens before the dedupe process takes place.
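As a rough, illustrative sketch (not the SDK's actual code, and with made-up names), a dedupe-style global event processor conceptually looks like this: it remembers the last exception object it saw and returns None, which drops the event, when it sees the same object again.

import contextvars

# Hypothetical names; the real DedupeIntegration in sentry_sdk differs in detail.
_last_seen_exception = contextvars.ContextVar("last_seen_exception", default=None)

def dedupe_event_processor(event, hint):
    exc_info = (hint or {}).get("exc_info")
    if exc_info is None:
        # Not an exception event (e.g. a transaction); pass it through.
        return event
    exc = exc_info[1]
    if _last_seen_exception.get() is exc:
        # Duplicate of the previously captured exception: drop the event.
        return None
    _last_seen_exception.set(exc)
    return event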

There's improvement potential here. Some things we could consider:

  • moving the stacktrace serialization later in the process (probably tricky to do this right while making sure we don't mess with actual refs)
  • moving the dedupe check earlier (maybe this doesn't make sense as an integration anymore and should rather live on the scope directly)
  • caching (we'd need to be mindful of the memory footprint there; see the sketch after this list)
  • ???
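As a rough sketch of the caching idea mentioned in the list above (names are hypothetical, not SDK code), the serialized stacktrace could be memoized per exception object, so that repeated captures of the same exception pay the serialization cost only once; a weak mapping keeps the cache from prolonging the exceptions' lifetimes.

import weakref

# Hypothetical cache: maps exception objects to their already-serialized
# stacktraces without keeping the exceptions alive.
_stacktrace_cache = weakref.WeakKeyDictionary()

def serialized_stacktrace_for(exc, serialize_fn):
    try:
        return _stacktrace_cache[exc]
    except KeyError:
        result = serialize_fn(exc)  # expensive work happens at most once per exception
        _stacktrace_cache[exc] = result
        return result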

I'll leave this open, but I don't believe there's a quick fix for this, so it might take some time. (PRs welcome though!)

jvineet (Author) commented on May 6, 2025

Hi @sentrivana, thanks for the quick reply. I am almost certain that all three serialization groups I mentioned above are coming from stack frame serialization (which makes sense, since sentry-sdk pre-serializes stack traces like you mentioned). I also see three logging.exception logs per request that should correspond to each of the three serialization groups. Those three span groups do not represent a transaction serialization.

I think this can also be easily confirmed by changing the FastAPI exception handler code to this:

@app.exception_handler(ValueError)
def exception_handler(request: fastapi.Request, exc: Exception):
    # Re-raise and log the same exception N times; each logging.exception
    # call triggers another round of stack frame serialization.
    for i in range(4):
        try:
            raise exc
        except Exception:
            logging.exception(f"Log Exception again: attempt {i}")

    return fastapi.responses.PlainTextResponse("some error")

With this change in place, the number of "serializing" span groups in the transaction (and the number of logging.exception logs per request) matches the range value.

Here is what it looks like for a range value of 4.

Image

For a range value of 2.

Image

For a range value of 10.

Image

The other reason I know that these huge serialization groups are not from transaction serialization is that we don't see these giant span groups in our production traces when no exception is linked to the transaction. In fact, in those cases we can hardly tell that any time was spent on serializing spans at all (it is possible that a Sentry transaction does not capture its own serialization time).

I can mimic that behavior by setting a range value of 0 in our sample case.

Image

I think that adding a ~10 ms penalty to response latency every time the app calls logging.exception() is pretty brutal (I say ~10 ms because real production stack traces are rarely as small as in our example above), especially when we incur this serialization cost for stack frames that we know will never be sent. I am also assuming that any team relying on Sentry errors and transactions as much as we do has API performance near the top of their mind, so this feels a bit counterproductive (we really do love Sentry as a tool here, by the way, so amazing job!). Therefore, I am really hoping we can bump the priority of this particular issue.

Another thing to consider is optionally reducing the number of stack frames that we ship to Sentry, for example by giving users the option to omit stack frames from certain libraries. In FastAPI, we often see 17-18 stack frames from starlette/fastapi before we even reach the frames from our app.

Image

While those frames might be nice to have, we personally don't look at them ~99% of the time, and would much rather get back the ~15-20 ms that not serializing them would save, turning them on only when required under some rare circumstances.

sl0thentr0py (Member) commented on May 7, 2025

@jvineet agree largely with what you said and we need to fix this in the long term, but the way it is now is due to a non-trivial interaction between holding references to frames and other parts of the SDK.

For now, I can recommend you use LoggingIntegration with event_level=None so we will only capture actual exceptions and not the ones from logging.exception. I'm sorry I can't give you a cleaner solution right now.

import logging
import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration

sentry_sdk.init(
    # ...
    integrations=[
        LoggingIntegration(
            level=logging.INFO,   # Capture INFO and above as breadcrumbs
            event_level=None,     # Don't send any logging statements as events
        ),
    ],
)

For @sentrivana: note that PEP 667 solves some of this on the Python side, but we're stuck on what to do for older versions. The theoretically proper solution is to build a read-only proxy/view of frame.f_locals that doesn't overwrite any of the locals. Then we can remove a lot of the cruft we have.
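As a rough illustration of the read-only view idea (not the SDK's implementation), the standard library's mapping proxy already provides the basic building block:

import types

def readonly_frame_locals(frame):
    # types.MappingProxyType returns a read-only view: callers can read the
    # locals, but any attempt to mutate the mapping raises TypeError.
    # (Before Python 3.13 / PEP 667, frame.f_locals for function frames is a
    # snapshot dict anyway, so this guards only that snapshot.)
    return types.MappingProxyType(frame.f_locals)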

jvineet (Author) commented on May 7, 2025

@sl0thentr0py We rely on Sentry quite a bit, which includes relying on logging.exception and logging.error in many places (and even if we weren't, a lot of the third-party libraries we use surface exceptions with logging.exception). So disabling the logging integration would probably be bad for us. Do you recall whether there is an easy way to reduce the size of the stack traces that are serialized (maybe as part of before_send or something like that), along the lines of the second point I tried to make earlier? Maybe we could look into that for the time being.

Maybe before_send is not going to work, since all of the frames are probably already serialized by that point.

To restate that earlier suggestion: consider optionally letting users omit stack frames from certain libraries. In FastAPI we often see 17-18 frames from starlette/fastapi before we even reach our app's frames; we rarely look at them and would much rather get back the ~15-20 ms it costs to serialize them, turning them on only when needed.
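For what it's worth, here is a hedged sketch of what trimming library frames in before_send could look like; as noted above (and confirmed in the replies below), the frames are already serialized by the time before_send runs, so this shrinks the event payload but does not avoid the serialization cost. The module prefixes are examples, not a recommendation.

import sentry_sdk

NOISY_MODULE_PREFIXES = ("starlette.", "fastapi.")

def trim_library_frames(event, hint):
    # Drop frames from the listed packages out of the outgoing error event.
    for exc in event.get("exception", {}).get("values", []):
        frames = exc.get("stacktrace", {}).get("frames")
        if frames:
            exc["stacktrace"]["frames"] = [
                frame for frame in frames
                if not (frame.get("module") or "").startswith(NOISY_MODULE_PREFIXES)
            ]
    return event

sentry_sdk.init(
    # dsn=..., traces_sample_rate=...,
    before_send=trim_library_frames,
)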

antonpirker (Member) commented on May 8, 2025

Hey @jvineet, your assumption is right. The serialization of the stack traces is done before the before_send hook, so there is no easy, clean solution, unfortunately.

The only thing I can think of is this constant that limits how many stack frames are serialized: https://github.com/getsentry/sentry-python/blob/potel-base/sentry_sdk/utils.py#L81

We do not have an option to change this value, so you could only monkey-patch the SDK to change it. But that is a pretty hacky "solution" to your problem...
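For completeness, a sketch of that monkey-patch idea; the constant name used here (MAX_STACK_FRAMES) is an assumption, so check the linked utils.py line for the actual name in your installed SDK version.

import sentry_sdk
import sentry_sdk.utils

# Hacky workaround sketch: lower the frame-count limit before initializing.
# Assumes the limit is the module-level constant MAX_STACK_FRAMES (verify
# against the linked utils.py); if the SDK imports the value elsewhere at
# import time, patching it here may have no effect.
sentry_sdk.utils.MAX_STACK_FRAMES = 20

sentry_sdk.init("<your sentry dsn>", traces_sample_rate=1)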
