Skip to content

fix(google): guard transport.send() in proc() so IAM/gRPC errors don't crash supervised components#68250

Open
goingforstudying-ctrl wants to merge 3 commits into
apache:mainfrom
goingforstudying-ctrl:fix/stackdriver-guard-transport-send
Open

fix(google): guard transport.send() in proc() so IAM/gRPC errors don't crash supervised components#68250
goingforstudying-ctrl wants to merge 3 commits into
apache:mainfrom
goingforstudying-ctrl:fix/stackdriver-guard-transport-send

Conversation

@goingforstudying-ctrl

Copy link
Copy Markdown

Fixes Bug 3 from #68240.

What

StackdriverRemoteLogIO.processorsproc() calls transport.send() without
any error handling. In Airflow 3's supervisor model, REMOTE_TASK_LOG applies
to ALL supervised components (scheduler, dag-processor, triggerer, workers).
A single IAM misconfiguration or gRPC error would crash the entire process.

Observed: dag-processor pod enters CrashLoopBackOff on every log emit when the
Kubernetes Service Account lacks the logging.logEntries.create IAM binding.

Fix

Wrap _transport.send() in try/except Exception and emit a
logging.warning instead of propagating. Log delivery is best-effort;
a Cloud Logging error should never kill a task-executing process.

Changes

  • Guarded _transport.send() call in proc() with try/except
  • New test test_processors_survives_transport_send_failure verifies:
    • proc() returns the event (doesn't crash)
    • warning is logged with the failure details

relates to #68240

@henry3260 henry3260 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you clarify how the observed IAM, gRPC error propagates from _transport.send()? StackdriverRemoteLogIO uses BackgroundThreadTransport by default, whosesend()only enqueues the entry. The network request happens in the background worker, where exceptions frombatch.commit()are already caught.

"""
mock_get_creds_and_project_id.return_value = ("creds", "project_id")

mock_transport_type = mock.MagicMock()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
mock_transport_type = mock.MagicMock()
mock_transport_type = mock.create_autospec(Transport)

This ensures the mock follows the Transport interface and catches invalid method calls or arguments.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with mock.create_autospec(Transport) — should catch invalid method calls now.

# Must still return the event — crash would mean no return at all
assert result is event
# Must log the failure so the operator can see it
assert "Failed to send log to Cloud Logging" in caplog.text

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we avoid using caplog here? It can modify the logging configuration and make tests flaky.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed caplog and patched _logger directly instead — no logging config modification needed.

@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/stackdriver-guard-transport-send branch 7 times, most recently from e261850 to d894c3f Compare June 10, 2026 11:38
…t crash supervised components

In AF3's supervisor model REMOTE_TASK_LOG applies to ALL supervised
components (scheduler, dag-processor, triggerer, workers).  An unguarded
transport.send() failure — e.g. missing logging.logEntries.create IAM
binding — would crash the entire process.  The fix wraps send() in
try/except and logs a warning instead of propagating the exception.

relates to apache#68240
Address review feedback:
- Use mock.create_autospec(Transport) per henry3260's suggestion
- Replace caplog with mock.patch on _logger to avoid logging config
  modification
… instances

Apply reviewer's suggestion consistently across all three test methods
that instantiate mock_transport_type, not just the one originally
flagged.
@goingforstudying-ctrl goingforstudying-ctrl force-pushed the fix/stackdriver-guard-transport-send branch from d894c3f to 56c906b Compare June 10, 2026 14:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:logging area:providers provider:google Google (including GCP) related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants