Azure service bus: K8 pods become idle after AMQP connection error #38973

Open
sasich821 opened this issue Dec 23, 2024 · 2 comments

@sasich821

  • Package Name: azure-servicebus
  • Package Version: 7.11.2
  • Operating System: Debian Container (kubernetes pod)
  • Python Version: 3.12

Describe the bug
Our K8s pods continuously listen for Service Bus messages, process them, and send another message to a Service Bus topic. We often see log messages like "Connection keep-alive for 'SendClientAsync' failed: AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')." These are mostly temporary, and the pods reconnect on their own. But we have seen instances where this does not happen: the pod becomes idle and stops processing messages until it is restarted. We have had many outages because of this behaviour. The pod emits no further logs after this last message. The message is also logged at info level rather than raised as an error from the SDK, so we cannot catch it in our try/except and retry on our end.
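
To illustrate the gap described above: since the failure only shows up as a log line, one possible stopgap is a `logging.Handler` that watches for it and sets a flag the application can act on. This is a minimal sketch, not a fix for the underlying problem. `KeepAliveFailureHandler` and `install_keep_alive_watch` are hypothetical names, and it assumes the SDK emits this message under the `azure.servicebus` logger namespace, which should be verified against the installed version.

```python
import logging
import threading

# Substring taken from the log line quoted in this issue.
KEEP_ALIVE_PREFIX = "Connection keep-alive for"


class KeepAliveFailureHandler(logging.Handler):
    """Hypothetical helper: flags keep-alive failures seen in log records."""

    def __init__(self) -> None:
        super().__init__(level=logging.INFO)
        self.failed = threading.Event()  # set once a failure has been logged
        self.failure_count = 0

    def emit(self, record: logging.LogRecord) -> None:
        try:
            message = record.getMessage()
        except Exception:
            return  # never let a malformed record break logging
        if KEEP_ALIVE_PREFIX in message and "failed" in message:
            self.failure_count += 1
            self.failed.set()


def install_keep_alive_watch(
    logger_name: str = "azure.servicebus",  # assumed logger namespace
) -> KeepAliveFailureHandler:
    """Attach the handler so info-level records from child loggers reach it."""
    handler = KeepAliveFailureHandler()
    logger = logging.getLogger(logger_name)
    # The message is logged at info level, so the logger must let INFO through.
    # Note that this also lets INFO records propagate to any root handlers.
    if logger.getEffectiveLevel() > logging.INFO:
        logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return handler
```

The application could then check `handler.failed.is_set()` from its main loop and decide whether to rebuild its clients or exit so that Kubernetes restarts the pod. Matching on log text is brittle, because the wording may change between SDK versions.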

To Reproduce
Steps to reproduce the behavior:
We do not have a foolproof way to reproduce this, as it happens randomly, but one way to try would be to:

  1. Build a Service Bus consumer and publisher (SendClientAsync seems to be in the publisher module, so we think the error might be coming from there).
  2. Keep the channel idle for 2-3 hours.
  3. Check that the last log message is "Connection keep-alive for 'SendClientAsync' failed: AMQPConnectionError('Error condition: ErrorCondition.SocketError\n Error Description: Can not read frame due to exception: [Errno 104] Connection reset by peer')."
  4. Put a message in Service Bus; the message will not be consumed.

Expected behavior
Pods should be able to reconnect.
If not, an error should be thrown that can help us handle retries or automatically kill pods.
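
Until the SDK surfaces such an error, the "automatically kill pods" part can be approximated outside the SDK with a heartbeat file and a Kubernetes exec liveness probe. The sketch below is only one way to do it: the file path, the age threshold, and the function names are all hypothetical, and it assumes the receive loop calls `beat()` on every iteration (including iterations where no message arrived), so that a hung loop stops updating the file.

```python
import time
from pathlib import Path
from typing import Optional

HEARTBEAT_FILE = Path("/tmp/servicebus-heartbeat")  # hypothetical location
MAX_AGE_SECONDS = 300.0  # hypothetical threshold; tune to the receive timeout


def beat(path: Path = HEARTBEAT_FILE) -> None:
    """Record that the receive loop is still making progress."""
    path.touch()  # creates the file or refreshes its modification time


def is_alive(
    path: Path = HEARTBEAT_FILE,
    max_age: float = MAX_AGE_SECONDS,
    now: Optional[float] = None,
) -> bool:
    """Return True if the heartbeat file was refreshed within max_age seconds."""
    try:
        mtime = path.stat().st_mtime
    except FileNotFoundError:
        return False  # the loop never started, or the file was removed
    current = time.time() if now is None else now
    return (current - mtime) <= max_age


def probe_exit_code(
    path: Path = HEARTBEAT_FILE, max_age: float = MAX_AGE_SECONDS
) -> int:
    """Exit code for an exec liveness probe: 0 is healthy, 1 is stale."""
    return 0 if is_alive(path, max_age) else 1
```

A liveness probe command would call `probe_exit_code()` and exit with its result; once the heartbeat goes stale, the kubelet restarts the container, which matches the manual restart that currently recovers the pod.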

Screenshots
Logs attached:
debugLogs.txt
basicLogs.txt

Additional context

  1. As mentioned in this issue, we cannot enable debug logs because this is a production environment, and when we did enable them, certain sensitive data was also logged.
  2. Also, the Service Bus listener and senders are singleton objects that are created once when the service starts and then reused throughout the lifetime of the service, unless the service is redeployed or the pods restart.
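
Because the sender is a long-lived singleton, one workaround to experiment with is wrapping it so that a failed or stalled send discards the old object and builds a fresh one. The following is a rough sketch under stated assumptions, not the SDK's recommended pattern. `RecreatingSender`, its parameters, and the `factory` callable are hypothetical; the wrapped object is only assumed to expose awaitable `send_messages` and `close` methods, as `azure.servicebus.aio.ServiceBusSender` does.

```python
import asyncio
from typing import Any, Awaitable, Callable, Optional


class RecreatingSender:
    """Hypothetical wrapper: rebuilds the sender after a failed or stalled send."""

    def __init__(
        self,
        factory: Callable[[], Awaitable[Any]],  # returns a new sender object
        send_timeout: float = 30.0,
        max_attempts: int = 3,
        backoff_seconds: float = 1.0,
    ) -> None:
        self._factory = factory
        self._send_timeout = send_timeout
        self._max_attempts = max_attempts
        self._backoff_seconds = backoff_seconds
        self._sender: Optional[Any] = None
        self._lock = asyncio.Lock()  # serialize sends and rebuilds

    async def _discard(self) -> None:
        """Drop the current sender, closing it on a best-effort basis."""
        sender, self._sender = self._sender, None
        if sender is not None:
            try:
                await asyncio.wait_for(sender.close(), timeout=5.0)
            except Exception:
                pass  # the connection may already be dead

    async def send(self, message: Any) -> None:
        async with self._lock:
            last_error: Optional[BaseException] = None
            for attempt in range(1, self._max_attempts + 1):
                if self._sender is None:
                    self._sender = await self._factory()
                try:
                    # The timeout turns a silent stall into a visible error.
                    await asyncio.wait_for(
                        self._sender.send_messages(message),
                        timeout=self._send_timeout,
                    )
                    return
                except Exception as exc:
                    last_error = exc
                    await self._discard()
                    if attempt < self._max_attempts:
                        await asyncio.sleep(self._backoff_seconds * attempt)
            raise RuntimeError(
                f"send failed after {self._max_attempts} attempts"
            ) from last_error
```

In the real service the factory would create the topic sender from the existing client setup. If the stall lives in the underlying connection rather than the sender, the factory may need to rebuild the client as well; the logs attached here do not show which of the two applies.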
@github-actions bot added the Client, customer-reported, needs-team-attention, question, and Service Bus labels on Dec 23, 2024

Thank you for your feedback. Tagging and routing to the team member best able to assist.

@sasich821
Author

Hey team, any suggestions on how we can move forward and debug this issue further?
