fix(google): guard transport.send() in proc() so IAM/gRPC errors don't crash supervised components#68250
Conversation
0555885 to
72bd974
Compare
henry3260
left a comment
There was a problem hiding this comment.
Could you clarify how the observed IAM, gRPC error propagates from _transport.send()? StackdriverRemoteLogIO uses BackgroundThreadTransport by default, whosesend()only enqueues the entry. The network request happens in the background worker, where exceptions frombatch.commit()are already caught.
| """ | ||
| mock_get_creds_and_project_id.return_value = ("creds", "project_id") | ||
|
|
||
| mock_transport_type = mock.MagicMock() |
There was a problem hiding this comment.
| mock_transport_type = mock.MagicMock() | |
| mock_transport_type = mock.create_autospec(Transport) |
This ensures the mock follows the Transport interface and catches invalid method calls or arguments.
There was a problem hiding this comment.
Replaced with mock.create_autospec(Transport) — should catch invalid method calls now.
| # Must still return the event — crash would mean no return at all | ||
| assert result is event | ||
| # Must log the failure so the operator can see it | ||
| assert "Failed to send log to Cloud Logging" in caplog.text |
There was a problem hiding this comment.
Could we avoid using caplog here? It can modify the logging configuration and make tests flaky.
There was a problem hiding this comment.
Removed caplog and patched _logger directly instead — no logging config modification needed.
e261850 to
d894c3f
Compare
…t crash supervised components In AF3's supervisor model REMOTE_TASK_LOG applies to ALL supervised components (scheduler, dag-processor, triggerer, workers). An unguarded transport.send() failure — e.g. missing logging.logEntries.create IAM binding — would crash the entire process. The fix wraps send() in try/except and logs a warning instead of propagating the exception. relates to apache#68240
Address review feedback: - Use mock.create_autospec(Transport) per henry3260's suggestion - Replace caplog with mock.patch on _logger to avoid logging config modification
… instances Apply reviewer's suggestion consistently across all three test methods that instantiate mock_transport_type, not just the one originally flagged.
d894c3f to
56c906b
Compare
Fixes Bug 3 from #68240.
What
StackdriverRemoteLogIO.processors→proc()callstransport.send()withoutany error handling. In Airflow 3's supervisor model,
REMOTE_TASK_LOGappliesto ALL supervised components (scheduler, dag-processor, triggerer, workers).
A single IAM misconfiguration or gRPC error would crash the entire process.
Observed: dag-processor pod enters
CrashLoopBackOffon every log emit when theKubernetes Service Account lacks the
logging.logEntries.createIAM binding.Fix
Wrap
_transport.send()intry/except Exceptionand emit alogging.warninginstead of propagating. Log delivery is best-effort;a Cloud Logging error should never kill a task-executing process.
Changes
_transport.send()call inproc()with try/excepttest_processors_survives_transport_send_failureverifies:relates to #68240