IotEdgeMetricsCollector Stops Working #7092
@huguesBouvier would you mind also taking a look at this one?
Does the log contain "stops"? Changing the restart policy of the pod should mitigate the issue, so if it stops it gets restarted automatically.
Hello,
The restart policy is “always”. The module restarts, but it’s always in a failed state. It seems like it’s caching something, because deleting the container and letting iotedge recreate it solves the problem.
From: hugues bouvier ***@***.***>
Sent: Tuesday, August 29, 2023 5:44 PM
To: Azure/iotedge ***@***.***>
Cc: Pedro Buhigas ***@***.***>; Author ***@***.***>
Subject: Re: [Azure/iotedge] IotEdgeMetricsCollector Stops Working (Issue #7092)
Does the log contain "stops"?
I see: [2023-08-18 07:48:42.915 INF] MetricsCollector Main() finished.
Changing the restart policy of the pod should mitigate the issue, so if it stops it gets restarted automatically.
The restart policies are:
* Default: do not restart
* always: always restart
* unless-stopped: restart always, except when the user has manually stopped the container
* on-failure: restart only when the container exit code is non-zero
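The policies listed above are Docker container restart policies. For an IoT Edge module, the restart policy is typically set per module in the deployment manifest instead. A minimal sketch of the relevant manifest fragment, assuming the module is named IotEdgeMetricsCollector (the image tag and version here are illustrative, and IoT Edge uses its own policy values such as "never", "on-failed", "on-unhealthy", and "always"):

```json
{
  "modules": {
    "IotEdgeMetricsCollector": {
      "version": "1.0",
      "type": "docker",
      "status": "running",
      "restartPolicy": "always",
      "settings": {
        "image": "mcr.microsoft.com/azureiotedge-metrics-collector:1.0",
        "createOptions": "{}"
      }
    }
  }
}
```

With "restartPolicy": "always", edgeAgent restarts the module whenever it exits, regardless of exit code; note that this restarts the existing container rather than recreating it, which matters for the caching behavior described in this thread.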
Restart policy is set to Always already
I ran a test for a few weeks to see if I could repro, but up to this day it has been working fine for me. The logs don't show much; it looks like the interruption doesn't come from the metrics collector, but it's hard to say.
I am having a chronic problem with IotEdgeMetricsCollector. It will work for several months, and all of a sudden it will stop working with the log entries below. Removing the Docker container and forcing IoT Edge to recreate it solves the problem.
[2023-08-18 07:47:18.810 INF] Started operation Reconnect to IoT Hub
[2023-08-18 07:47:18.828 INF] Started operation Scrape and Upload Metrics
[2023-08-18 07:48:18.833 INF] Starting periodic operation Scrape and Upload Metrics...
[2023-08-18 07:48:18.833 INF] Starting periodic operation Reconnect to IoT Hub...
[2023-08-18 07:48:18.862 INF] Scraping endpoint http://edgeHub:9600/metrics
[2023-08-18 07:48:18.864 INF] Trying to initialize module client using transport type [Amqp_Tcp_Only]
[2023-08-18 07:48:18.995 INF] Scraping endpoint http://edgeAgent:9600/metrics
[2023-08-18 07:48:18.997 INF] Scraping endpoint http://IotEdgeEcoView:9600/metrics
[2023-08-18 07:48:19.332 INF] Scraping finished, received 37 metrics from endpoint http://IotEdgeEcoView:9600/metrics
[2023-08-18 07:48:19.351 INF] Scraping finished, received 60 metrics from endpoint http://edgeHub:9600/metrics
[2023-08-18 07:48:19.357 INF] Scraping finished, received 141 metrics from endpoint http://edgeAgent:9600/metrics
[2023-08-18 07:48:22.700 INF] Successfully created self-signed certificate for agentGuid : {3f26d4e0-64b6-46c3-8bea-e0540a3b3aa0} and workspace: 61ca6587-9689-4c80-acc4-58f22c2676c9
[2023-08-18 07:48:22.710 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:22.714 INF] sending registration request
[2023-08-18 07:48:22.720 INF] waiting for response to registration request
[2023-08-18 07:48:27.808 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:27.810 INF] sending registration request
[2023-08-18 07:48:27.810 INF] waiting for response to registration request
[2023-08-18 07:48:32.828 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:32.828 INF] sending registration request
[2023-08-18 07:48:32.828 INF] waiting for response to registration request
[2023-08-18 07:48:37.856 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:37.856 INF] sending registration request
[2023-08-18 07:48:37.856 INF] waiting for response to registration request
[2023-08-18 07:48:42.875 WRN] exception occurred : One or more errors occurred. (Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443))
[2023-08-18 07:48:42.878 ERR] Registering agent with OMS failed (are the Log Analytics Workspace ID and Key correct?) : One or more errors occurred. (Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443))
[2023-08-18 07:48:42.908 FTL] System.AggregateException: One or more errors occurred. (Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443))
---> System.Net.Http.HttpRequestException: Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443)
---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Net.Sockets.Socket.g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.Wait()
at Microsoft.Azure.Devices.Edge.Azure.Monitor.Certificategenerator.CertGenerator.RegisterWithOms(X509Certificate2 cert, String AgentGuid, String logAnalyticsWorkspaceId, String logAnalyticsWorkspaceKey, String logAnalyticsWorkspaceDomainPrefixOms) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/CertificateGenerator/CertGenerator.cs:line 188
at Microsoft.Azure.Devices.Edge.Azure.Monitor.Certificategenerator.CertGenerator.RegisterWithOmsWithBasicRetryAsync(X509Certificate2 cert, String AgentGuid, String logAnalyticsWorkspaceId, String logAnalyticsWorkspaceKey, String logAnalyticsWorkspaceDomainPrefixOms) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/CertificateGenerator/CertGenerator.cs:line 208
at Microsoft.Azure.Devices.Edge.Azure.Monitor.Certificategenerator.CertGenerator.RegisterAgentWithOMS(String logAnalyticsWorkspaceId, String logAnalyticsWorkspaceKey, String logAnalyticsWorkspaceDomainPrefixOms) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/CertificateGenerator/CertGenerator.cs:line 267
[2023-08-18 07:48:42.911 INF] Termination requested, initiating shutdown.
[2023-08-18 07:48:42.912 INF] Waiting for cleanup to finish
[2023-08-18 07:48:42.914 INF] Done with cleanup. Shutting down.
[2023-08-18 07:48:42.915 INF] MetricsCollector Main() finished.