
IotEdgeMetricsCollector Stops Working #7092

Closed
PedroBuhigas opened this issue Aug 25, 2023 · 5 comments

@PedroBuhigas

I am having a chronic problem with IotEdgeMetricsCollector. It works for several months, and then all of a sudden it stops working with the log entries below. Removing the Docker container and forcing IoT Edge to recreate it solves the problem.

[2023-08-18 07:47:18.810 INF] Started operation Reconnect to IoT Hub
[2023-08-18 07:47:18.828 INF] Started operation Scrape and Upload Metrics
[2023-08-18 07:48:18.833 INF] Starting periodic operation Scrape and Upload Metrics...
[2023-08-18 07:48:18.833 INF] Starting periodic operation Reconnect to IoT Hub...
[2023-08-18 07:48:18.862 INF] Scraping endpoint http://edgeHub:9600/metrics
[2023-08-18 07:48:18.864 INF] Trying to initialize module client using transport type [Amqp_Tcp_Only]
[2023-08-18 07:48:18.995 INF] Scraping endpoint http://edgeAgent:9600/metrics
[2023-08-18 07:48:18.997 INF] Scraping endpoint http://IotEdgeEcoView:9600/metrics
[2023-08-18 07:48:19.332 INF] Scraping finished, received 37 metrics from endpoint http://IotEdgeEcoView:9600/metrics
[2023-08-18 07:48:19.351 INF] Scraping finished, received 60 metrics from endpoint http://edgeHub:9600/metrics
[2023-08-18 07:48:19.357 INF] Scraping finished, received 141 metrics from endpoint http://edgeAgent:9600/metrics
[2023-08-18 07:48:22.700 INF] Successfully created self-signed certificate for agentGuid : {3f26d4e0-64b6-46c3-8bea-e0540a3b3aa0} and workspace: 61ca6587-9689-4c80-acc4-58f22c2676c9
[2023-08-18 07:48:22.710 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:22.714 INF] sending registration request
[2023-08-18 07:48:22.720 INF] waiting for response to registration request
[2023-08-18 07:48:27.808 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:27.810 INF] sending registration request
[2023-08-18 07:48:27.810 INF] waiting for response to registration request
[2023-08-18 07:48:32.828 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:32.828 INF] sending registration request
[2023-08-18 07:48:32.828 INF] waiting for response to registration request
[2023-08-18 07:48:37.856 INF] OMS endpoint Url : https://61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com/AgentService.svc/AgentTopologyRequest
[2023-08-18 07:48:37.856 INF] sending registration request
[2023-08-18 07:48:37.856 INF] waiting for response to registration request
[2023-08-18 07:48:42.875 WRN] exception occurred : One or more errors occurred. (Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443))
[2023-08-18 07:48:42.878 ERR] Registering agent with OMS failed (are the Log Analytics Workspace ID and Key correct?) : One or more errors occurred. (Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443))
[2023-08-18 07:48:42.908 FTL] System.AggregateException: One or more errors occurred. (Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443))
---> System.Net.Http.HttpRequestException: Resource temporarily unavailable (61ca6587-9689-4c80-acc4-58f22c2676c9.oms.opinsights.azure.com:443)
---> System.Net.Sockets.SocketException (11): Resource temporarily unavailable
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ThrowException(SocketError error, CancellationToken cancellationToken)
at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.System.Threading.Tasks.Sources.IValueTaskSource.GetResult(Int16 token)
at System.Net.Sockets.Socket.g__WaitForConnectWithCancellation|277_0(AwaitableSocketAsyncEventArgs saea, ValueTask connectTask, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
--- End of inner exception stack trace ---
at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(HttpRequestMessage request)
at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.GetHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
at System.Net.Http.HttpClient.g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
--- End of inner exception stack trace ---
at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions)
at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken)
at System.Threading.Tasks.Task.Wait()
at Microsoft.Azure.Devices.Edge.Azure.Monitor.Certificategenerator.CertGenerator.RegisterWithOms(X509Certificate2 cert, String AgentGuid, String logAnalyticsWorkspaceId, String logAnalyticsWorkspaceKey, String logAnalyticsWorkspaceDomainPrefixOms) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/CertificateGenerator/CertGenerator.cs:line 188
at Microsoft.Azure.Devices.Edge.Azure.Monitor.Certificategenerator.CertGenerator.RegisterWithOmsWithBasicRetryAsync(X509Certificate2 cert, String AgentGuid, String logAnalyticsWorkspaceId, String logAnalyticsWorkspaceKey, String logAnalyticsWorkspaceDomainPrefixOms) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/CertificateGenerator/CertGenerator.cs:line 208
at Microsoft.Azure.Devices.Edge.Azure.Monitor.Certificategenerator.CertGenerator.RegisterAgentWithOMS(String logAnalyticsWorkspaceId, String logAnalyticsWorkspaceKey, String logAnalyticsWorkspaceDomainPrefixOms) in /mnt/vss/_work/1/s/edge-modules/metrics-collector/src/CertificateGenerator/CertGenerator.cs:line 267
[2023-08-18 07:48:42.911 INF] Termination requested, initiating shutdown.
[2023-08-18 07:48:42.912 INF] Waiting for cleanup to finish
[2023-08-18 07:48:42.914 INF] Done with cleanup. Shutting down.
[2023-08-18 07:48:42.915 INF] MetricsCollector Main() finished.

@jlian
Member

jlian commented Aug 29, 2023

@huguesBouvier would you mind also taking a look at this one?

@huguesBouvier
Contributor

huguesBouvier commented Aug 29, 2023

Does the container actually stop?
I see: [2023-08-18 07:48:42.915 INF] MetricsCollector Main() finished.

Changing the restart policy of the container should mitigate the issue, so that if it stops it gets restarted automatically.
The restart policies are:

  • no (the default): do not restart
  • always: always restart
  • unless-stopped: always restart, except when the user has manually stopped the container
  • on-failure: restart only when the container exit code is non-zero
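
To make the difference between these policies concrete, here is a minimal Python sketch that models the restart decision as described in the list above. It is a simplified illustration, not how the container runtime is implemented: the names `RestartPolicy` and `should_restart` are hypothetical, and real behavior has more nuances (for example around daemon restarts) that this model ignores.

```python
from enum import Enum


class RestartPolicy(Enum):
    """Hypothetical enum mirroring the four policy names listed above."""
    NO = "no"                          # default: never restart
    ALWAYS = "always"
    UNLESS_STOPPED = "unless-stopped"
    ON_FAILURE = "on-failure"


def should_restart(policy: RestartPolicy, exit_code: int,
                   stopped_by_user: bool = False) -> bool:
    """Simplified model: decide whether an exited container is restarted.

    Assumption: only the exit code and a "manually stopped" flag matter.
    """
    if policy is RestartPolicy.NO:
        return False
    if policy is RestartPolicy.ALWAYS:
        return True
    if policy is RestartPolicy.UNLESS_STOPPED:
        return not stopped_by_user
    if policy is RestartPolicy.ON_FAILURE:
        # A clean exit (code 0) is not treated as a failure.
        return exit_code != 0
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy in RestartPolicy:
        print(policy.value, should_restart(policy, exit_code=0))
```

One point this sketch highlights: under `on-failure`, a process that shuts down cleanly with exit code 0 is not restarted. The logs in this issue do not show the module's exit code, so whether `on-failure` alone would be enough here is an open question; `always` or `unless-stopped` do not depend on the exit code.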

@PedroBuhigas
Author

PedroBuhigas commented Aug 30, 2023 via email

@PedroBuhigas
Author

PedroBuhigas commented Sep 22, 2023 via email

@huguesBouvier
Contributor

I ran a test for a few weeks to see if I could repro, but so far it has been working fine for me.
I also checked the code to see if there was something wrong, but I couldn't find anything.

The logs don't show much. It looks like the interruption doesn't come from the metrics collector, but it's hard to say.
Without a simpler repro it will be hard to troubleshoot.
