Celery task_postrun is called only after the span has finished in chains (due to a regression) #11479

Open
oranav opened this issue Nov 21, 2024 · 1 comment

oranav (Contributor) commented Nov 21, 2024

#10676 introduced a regression that prevents the task_postrun signal from ever attaching tags to a span when working with Celery canvases (chains, callbacks, chords, etc.).

When working with a chain, for instance, the task always calls apply_async on its callback just before it finishes (i.e. before task_postrun is fired). With the changes introduced in #10676, the span has already been finished (and sent to the tracer) by that point, so adding new tags does nothing, leaving the span degenerate.
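To make the failure mode concrete: once a ddtrace span has been finished it has already been handed to the writer, so tags set afterwards never show up in the flushed payload. A minimal sketch (illustration only, using the public ddtrace API rather than the Celery integration's internals):

from ddtrace import tracer

# Create and immediately finish a span, roughly what the forced close
# described below does to the worker span.
span = tracer.trace("celery.run", service="celery-worker", resource="tasks.a")
span.finish()

# Anything set after finish() is not part of what was already flushed,
# which is why the resulting span ends up without its celery.* tags.
span.set_tag("celery.state", "SUCCESS")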

This means that the following happens (a minimal way to observe this ordering is sketched after the list):

  1. task_prerun is called, creating a new span and registering it under prerun_span
  2. The task runs (either successfully or not)
  3. Celery sees that a callback is registered and fires the next task in the chain
  4. Task.apply_async is invoked, but it's wrapped with _traced_apply_async_inner; inside its finally block it sees that there's a prerun_span belonging to the worker span, and therefore forcefully finishes it
  5. task_postrun is fired and tries to attach tags to the span, but at this point it's too late, since the span has already finished
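The ordering in steps 3–5 can be observed with plain Celery signal handlers. The handlers below are hypothetical and independent of ddtrace; when added to tasks.py they log the publish of the chained callback happening before task_postrun of the current task:

from celery.signals import task_prerun, task_postrun, before_task_publish

@task_prerun.connect
def log_prerun(task_id=None, **kwargs):
    print(f"prerun {task_id}")

@before_task_publish.connect
def log_publish(headers=None, **kwargs):
    # In a chain, the callback is published via apply_async before the
    # current task's postrun fires; this is where the integration
    # force-closes the worker span.
    print(f"publishing {headers.get('id') if headers else None}")

@task_postrun.connect
def log_postrun(task_id=None, **kwargs):
    # By now the worker span is already finished, so tags attached here
    # are lost.
    print(f"postrun {task_id}")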

All in all, I would recommend reverting #10676 and coming up with a better implementation, or at least removing the task_postrun part of it, since it is utterly broken.

Reproduce

tasks.py:

from celery import Celery, chain

app = Celery("tasks")


@app.task
def a():
    print("a")


@app.task
def b():
    print("b")

fire.py:

from tasks import a, b

(a.si() | b.si()).delay()

Run Celery with:

DD_TRACE_ENABLED=1 ddtrace-run celery -A tasks.app worker -l debug -P solo -c 1

Then fire the chain with:

python3 fire.py

You'll see the following debug logs in the Celery worker:

[2024-11-21 14:43:11,621: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] received
...
prerun signal start task_id=8d17b2e1-421d-410f-a81b-a49663fba490
...
The task_postrun signal was not called, so manually closing span: name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=None duration=None error=0 tags={'component': 'celery', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, 'process_id': 48412} links='' events=''
...
finishing span name='celery.run' id=11656151137401636297 trace_id=137238467797017623115752292864617901008 parent_id=None service='celery-worker' resource='tasks.a' type='worker' start=1732192991.621946 end=1732192991.624561 duration=0.002615 error=0 tags={'_dd.base_service': '', '_dd.p.dm': '-0', '_dd.p.tid': '673f2adf00000000', 'component': 'celery', 'language': 'python', 'runtime-id': '935bf9ad4d684312a4a8ab390bab4646', 'span.kind': 'consumer'} metrics={'_dd.measured': 1, '_dd.top_level': 1, '_dd.tracer_kr': 1.0, '_sampling_priority_v1': 1, 'process_id': 48412} links='' events='' (enabled:True)
...
[2024-11-21 14:43:11,625: INFO/MainProcess] Task tasks.a[8d17b2e1-421d-410f-a81b-a49663fba490] succeeded in 0.003730249998625368s: None
...
postrun signal task_id=8d17b2e1-421d-410f-a81b-a49663fba490

And you'll see that the first span doesn't have any celery.* tags, for example celery.state.

wantsui self-assigned this Nov 21, 2024

wantsui (Collaborator) commented Nov 21, 2024

Thank you for the report, and sorry for the effect the change had on your spans. The update was meant to fix a bug where we weren't closing spans. I really appreciate the level of detail you provided on how to reproduce this, as well as your suggestion!

I will take a look and discuss with my team the best way to safely get the experience you're looking for.

As a workaround, are you OK with pinning your version for now?

Apologies for the inconvenience!
