Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500 #2110

Open
antikorol opened this issue Jun 22, 2024 · 24 comments · May be fixed by #2111
Open
Assignees
Labels
accepted Bug or feature would be accepted as a PR or is being worked on bug Identified as a potential bug Kubernetes Service discovery by Kubernetes Load Balancer Ocelot feature: Load Balancer Service Discovery Ocelot feature: Service Discovery Summer'24 Summer 2024 release
Milestone

Comments

@antikorol
Copy link

antikorol commented Jun 22, 2024

Expected Behavior / New Feature

Resolving a service address to route an API request

Actual Behavior / Motivation for New Feature

Sometimes, about every 20th - 30th request finished with status code 500

requestId: 0HN4ISPBSIV4V:000037FF, previousRequestId: No PreviousRequestId, 
message: 'Exception caught in global error handler, 
exception message: Object reference not set to an instance of an object.,
exception stack:
   at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)
   at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.DownstreamPathManipulation.Middleware.ClaimsToDownstreamPathMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.QueryStrings.Middleware.ClaimsToQueryStringMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Headers.Middleware.ClaimsToHeadersMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Authorization.Middleware.AuthorizationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Claims.Middleware.ClaimsToClaimsMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Authentication.Middleware.AuthenticationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.RequestId.Middleware.RequestIdMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.RateLimiting.Middleware.RateLimitingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Request.Middleware.DownstreamRequestInitialiserMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Headers.Middleware.HttpHeadersTransformationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Security.Middleware.SecurityMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Multiplexer.MultiplexingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.DownstreamRouteFinder.Middleware.DownstreamRouteFinderMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Responder.Middleware.ResponderMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Errors.Middleware.ExceptionHandlerMiddleware.Invoke(HttpContext httpContext) RequestId: 0HN4ISPBSIV4V:000037FF'

Steps to Reproduce the Problem

{
  "Routes": [
    {
      "UpstreamPathTemplate": "/public/{everything}",
      "UpstreamHttpMethod": [
        "POST"
      ],
      "ServiceName": "serviceName",
      "DownstreamPathTemplate": "/public/{everything}",
      "DownstreamScheme": "http",
      "LoadBalancerOptions": {
        "Type": "RoundRobin"
      }
    }
  ],
  "GlobalConfiguration": {
    "ServiceDiscoveryProvider": {
      "Namespace": "namespace",
      "Type": "Kube"
    }
}
}

Specifications

  • Version: 23.3.3 (NET6 and NET8)
  • Platform: Linux
  • Subsystem:
@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

Hello Roman!
Welcome to Ocelot world! 🐯

It's difficult to determine what happened. Could you please upload the entire solution to GitHub for review? Alternatively, you could attach the following artifacts to this thread:

  • The Kubernetes service configurations, particularly the serviceName configuration
  • The Ocelot application startup C# code, also known as the app services setup
  • The Kubernetes client JSON configuration, also referred to as appsettings.json

This information will likely assist in identifying the issue.


After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500

You didn't observe the same errors in the Ocelot logs for version 23.2.x, did you?
Are you now operating in two different environments, or did you only upgrade a single environment?

Since you're utilizing Kubernetes, could you provide more details about your K8s server deployment?
How do you initiate the server, is it locally or in a cloud remotely?
And how long have you been using Kubernetes in conjunction with Ocelot service discovery?

@raman-m raman-m added Service Discovery Ocelot feature: Service Discovery Load Balancer Ocelot feature: Load Balancer Kubernetes Service discovery by Kubernetes waiting Waiting for answer to question or feedback from issue raiser labels Jun 24, 2024
@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

Regarding the exception details...

requestId: 0HN4ISPBSIV4V:000037FF, previousRequestId: No PreviousRequestId, 
message: 'Exception caught in global error handler, 
exception message: Object reference not set to an instance of an object.,
exception stack:
   at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)
   at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)

Are you a C# developer? Can you debug C# code? Do you use the official Ocelot NuGet package?

Deploying the Release version of the Ocelot package DLLs means there is no debug information for the called lines in the Call Stack trace. However, we can analyze the code of the RoundRobin.Lease method.

public async Task<Response<ServiceHostAndPort>> Lease(HttpContext httpContext)
{
var services = await _servicesDelegate?.Invoke() ?? new List<Service>();
if (services?.Count != 0)
{
lock (_lock)
{
if (_last >= services.Count)
{
_last = 0;
}
var next = services[_last++];
return new OkResponse<ServiceHostAndPort>(next.HostAndPort);
}
}
return new ErrorResponse<ServiceHostAndPort>(new ServicesAreEmptyError($"There were no services in {nameof(RoundRobin)} during {nameof(Lease)} operation."));
}

It appears the issue may lie here:

var next = services[_last++];
return new OkResponse<ServiceHostAndPort>(next.HostAndPort);

There's a missing null check for line 33. Theoretically, the service provider could return a null object, which should be accounted for.

To conclude, it seems that for some reason, the Kube service provider and its service builder returned a null list item, leading to the raising of NullReferenceException.
Therefore, further investigation into the problem is warranted...

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

@antikorol The subsequent step in the investigation involves searching for subsequent messages in Ocelot's log.
Here is the service builder logic:

public virtual IEnumerable<Service> BuildServices(KubeRegistryConfiguration configuration, EndpointsV1 endpoint)
{
ArgumentNullException.ThrowIfNull(configuration);
ArgumentNullException.ThrowIfNull(endpoint);
var services = endpoint.Subsets
.SelectMany(subset => _serviceCreator.Create(configuration, endpoint, subset))
.ToArray();
_logger.LogDebug(() => $"K8s '{Check(endpoint.Kind)}:{Check(endpoint.ApiVersion)}:{Check(endpoint.Metadata?.Name)}' endpoint: Total built {services.Length} services.");
return services;
}

We need to search for the following message in the log:

_logger.LogDebug(() => $"K8s '{Check(endpoint.Kind)}:{Check(endpoint.ApiVersion)}:{Check(endpoint.Metadata?.Name)}' endpoint: Total built {services.Length} services.");

In appsettings.json file the Debug logging level for the application can be defined as follows:

{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}

Deploy and review the log for the message:

  • "K8s 'kind:version:name' endpoint: Total built N services."

If this message appears in the log, it indicates that the Kubernetes setup is correct, and we can proceed to investigate the issue more thoroughly.

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

every 20th - 30th request finished with status code 500

It seems that the requests to the Kubernetes endpoint were failing around the 20th to 30th attempts, as indicated by the code in these lines:

var endpoint = await _kubeApi
.ResourceClient(client => new EndPointClientV1(client))
.GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);

This suggests that the Kubernetes client connectivity might be unstable. It would be advisable to switch from the Kube to the PollKube provider and observe the behavior again. The error should no longer occur if this change is made. Additionally, the PollingInterval should be set to less than half the average duration between the occurrences of the 500 error.

Let's figure out what's happening and possibly provide a hotfix for your user scenario to eliminate the Null Reference Error logging. We need to handle this 500 status case and process it accordingly.
Will you be opening a PR soon?

@antikorol
Copy link
Author

antikorol commented Jun 24, 2024

  1. I am using the official NuGet package.
    There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port now I have started using your logic with schema matching with the named port

  2. I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces.
    https://github.com/antikorol/sample-of-error

  3. When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error. (I replaced it this way because I wanted to see in which line the null reference exception appears)

  4. When I use PollKube, I get an error in the logs "requestId: 0HN4ISPBSIV51:000023E9, previousRequestId: No PreviousRequestId, message: 'Error Code: ServicesAreEmptyError Message: There were no services in RoundRobin during Lease operation. errors found in ResponderMiddleware"

  5. When I changed log level to debug, I was only able to reproduce this exception once

Below is the log with debug information.


  | Jun 24, 2024 @ 15:13:45.018 | requestId: 0HN4K8QDBB2FT:00000016, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
-- | --

  | Jun 24, 2024 @ 15:13:39.565 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'Exception caught in global error handler, exception message: Object reference not set to an instance of an object., exception stack:    at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)    at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.DownstreamPathManipulation.Middleware.ClaimsToDownstreamPathMiddleware.Invoke(HttpContext httpContext)    at Ocelot.QueryStrings.Middleware.ClaimsToQueryStringMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Headers.Middleware.ClaimsToHeadersMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Authorization.Middleware.AuthorizationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Claims.Middleware.ClaimsToClaimsMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Authentication.Middleware.AuthenticationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.RequestId.Middleware.RequestIdMiddleware.Invoke(HttpContext httpContext)    at Ocelot.RateLimiting.Middleware.RateLimitingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Request.Middleware.DownstreamRequestInitialiserMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Headers.Middleware.HttpHeadersTransformationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Security.Middleware.SecurityMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Multiplexer.MultiplexingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.DownstreamRouteFinder.Middleware.DownstreamRouteFinderMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Responder.Middleware.ResponderMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Errors.Middleware.ExceptionHandlerMiddleware.Invoke(HttpContext httpContext) RequestId: 0HN4K8QDBB2FO:00000015'

  | Jun 24, 2024 @ 15:13:39.515 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.511 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.209 | requestId: 0HN4K8QDBB2FR:00000014, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'


@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

Dear Roman,
What's your full name?
What's your LinkedIn? We can continue discussion in LinkedIn chat...

@antikorol
Copy link
Author

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

  1. I am using the official NuGet package.
    There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port now I have started using your logic with schema matching with the named port

Please be aware that any custom logic may compromise stability, and the Ocelot team cannot be held accountable for it. It appears your existing custom logic is only compatible with version 23.2.
I guess, you might be facing a Kubernetes misconfiguration issue related to the Downstream Scheme versus Port Names feature. Are you utilizing this feature?

It would be helpful to examine the service definition in your Kubernetes configuration. Could you provide it, please?

@antikorol
Copy link
Author

No, I removed that logic because your changes already cover the case I needed with multiple ports.

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

  1. I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces.
    https://github.com/antikorol/sample-of-error

Good! After a quick review:

  • ApiGateway.Service: Where is the ocelot.json file? In other words, where is the primary configuration file for Ocelot?
  • appsettings.json: I'm very sorry, but why have you defined routes inside appsettings.json? The Ocelot configuration should be in ocelot.json! For configuration guidance: configuration.

I wonder that your Ocelot app started at all!
Did you develop the Ocelot app in the past using version 23.2 or someone else?

Did you read Configuration docs?
Especially, the Multiple Environments paragraph which say how to configure app with .AddJsonFile("ocelot.json") // primary config file

So, I don't see configuration part in these lines L27-L42
You have to setup configuration:

    builder.WebHost
        .ConfigureAppConfiguration((context, config) =>
        {
            var env = context.HostingEnvironment;
            config.SetBasePath(env.ContentRootPath)
                .AddJsonFile("appsettings.json", true, true)
                .AddJsonFile($"appsettings.{env.EnvironmentName}.json", true, true)
                .AddOcelot(env) // happy path
                .AddEnvironmentVariables();
        })

So, it's very important to configure via AddOcelot method:

 .ConfigureAppConfiguration((context, config) => config.AddOcelot())

This is minimal required configuring code! But better to use official template from docs.

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

@antikorol commented:

No, I removed that logic because your changes already cover the case I needed with multiple ports.

Good! You wanna use the feature: Downstream Scheme vs Port Names

Also you enabled the port finder via "DownstreamScheme": "http".
Now, are you sure your K8s definition is correct?
Should we review K8s config together?

@antikorol
Copy link
Author

antikorol commented Jun 24, 2024

There was update from 18.0.0 to 23.2.2 a long time ago and now to 23.3.3
I had a configuration in a YAML file, but to exclude all possible libraries, I converted it to JSON and added it to the appsettings, and yes, everything worked.
Now I've moved it to a separate ocelot.json and fixed registration

According to the Microsoft documentation, it is not mandatory to explicitly specify appsettings.json or its variations depending on the environment.
https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-6.0

Should we review K8s config together?

yes, what I should provide?

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

  1. When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error.

I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same. I don't understand what you're referring to.

(I replaced it this way because I wanted to see in which line the null reference exception appears)

I apologize, but uncommenting line L53 does not result in any replacement. You need to implement actual service injection using the official helpers: Custom Load Balancers. Have you consulted this documentation?

Even if you substitute the services with custom ones, it won't aid in identifying the failing line, as this necessitates the inclusion of Debug version DLLs in your Docker file, as seen here: L21. Therefore, you must build with the debug configuration -c Debug to ensure the log contains Call Stack code lines within messages for exceptions.

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

@antikorol commented:

yes, what I should provide?

Please provide the Kubernetes service definition in text, JSON format, or the Kubernetes client response JSON. From the log you've shared, it appears you have defined only one service; however, the number of ports in the definition is unclear.

Additionally, I'm curious as to why every twentieth Kubernetes client request fails or returns an invalid response, causing the Ocelot provider to throw NullReferenceException. We need to pinpoint this faulty request and obtain the response data in JSON format to replicate the user scenario through tests.

Lastly, I trust that you have the expertise to debug the provider, identify the issue, and potentially write tests or even submit a pull request with a fix.

@antikorol
Copy link
Author

I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same.

Sorry for the confusion. The files are identical. I didn't want to change the project build type. In the release build, I can see lines in the stack trace in my code. I wanted to understand in which exact line the error occurred so that I could add logs later. Surprisingly, the error didn't reproduce when I used just a copy

@raman-m
Copy link
Member

raman-m commented Jun 24, 2024

I am awaiting a PR that will resolve the issue...
At present, I cannot replicate your user scenario as there are no reproduction steps provided. Writing these steps is challenging because everything appears to function correctly for the first 29 requests, and only the 30th request fails with an exception. This is integration issue with invalid data from integrated service which is K8s client.
Since I cannot reproduce the issue and lack access to your actual production environment, emulating your production setting is also difficult.
Please pinpoint the issue and return to us with a bugfix PR that better illustrates the problem.
Best of luck!

@antikorol
Copy link
Author

antikorol commented Jun 24, 2024

@raman-m
I haven't still found 100% reproduction case, but I delved deeper into the code and noticed one detail.


This class is registered as a singleton for each ServiceProviderConfiguration, and the class that resolves it, RoundRobinCreator is also registered as a singleton. At the same time, we use the Kube.GetAsync method call in async code, where we can easily encounter a "Collection was modified" exception. Maybe somewhere deep in the List.Clear or elsewhere, due to resize or clear array, we might lost references and encounter an "Object reference not set to an instance..." exception.
What do you think about this part of code?

@raman-m
Copy link
Member

raman-m commented Jun 25, 2024

This class is registered as a singleton for each ServiceProviderConfiguration, and the class that resolves it

Look at the actual registration method

public static IOcelotBuilder AddKubernetes(this IOcelotBuilder builder, bool usePodServiceAccount = true)
{
builder.Services
.AddKubeClient(usePodServiceAccount)
.AddSingleton(KubernetesProviderFactory.Get)
.AddSingleton<IKubeServiceBuilder, KubeServiceBuilder>()
.AddSingleton<IKubeServiceCreator, KubeServiceCreator>();
return builder;
}

So, the Kube class is not registered at all. Only KubernetesProviderFactory.Get delegate is registered:
.AddSingleton(KubernetesProviderFactory.Get)

Kube is instantiated on every request. The logic is quite complicated. And I think Kube is scoped service.
Also, IKubeApiClient is scoped registered service.
The problem could be bad/empty response from IKubeApiClient


RoundRobinCreator is also registered as a singleton.

The issue seems to be with the load balancing logic, which involves a singleton RoundRobin service that is poorly synchronized. The RoundRobin is receiving an unsynchronized IServiceDiscoveryProvider object. Furthermore, the Lease method invokes an asynchronous method that may return lists of varying lengths, which I believe is problematic. Additionally, the exception call stack pinpoints the Lease method as the source of exceptions. It would be prudent to rewrite the Lease method to operate synchronously and stabilize the list length immediately.

What's happening in my opinion.
1st thread gets the list of 1 item (count == 1)
2nd thread gets list of 1 null-item (count == 1)
1st thread goes inside of the if-block considering the list is not empty
2nd thread has finished calling async call of await _servicesDelegate?.Invoke() which returns list of null items
2nd thread re-initializes services variable with list of null items
1st thread goes to the line 32 and reads null-object
1st thread read data on line 33 from null object and the exception is generated

I believe the problem is on the line 33

return new OkResponse<ServiceHostAndPort>(next.HostAndPort);

It must be rewritten!

@antikorol
Copy link
Author

antikorol commented Jun 25, 2024

return new OkResponse<ServiceHostAndPort>(next.HostAndPort);

You are right that problem at this row. But the main problem in concurrent issue that I described before in the class Kube because the collection of services is declared at the class field level.

I spent a lot of time for confirming the issue, I hope it will be enough.
I added debug logs and can see that after clearing the list, sometimes after inserting one service, the list becames with a size of 2.

image

The second element of list is null I can see it in class RoundRoubin

image

Example of logs:
Kube class where I can see that captured list and _services list are different:

public virtual async Task<List<Ocelot.Values.Service>> GetAsync()
{
        var endpoint = await _kubeApi
           .ResourceClient(client => new EndPointClientV1(client))
           .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);

        _services.Clear();
        if (endpoint?.Subsets.Count != 0)
        {
            var array = BuildServices(_configuration, endpoint!);

            _services.AddRange(array);

            var captured = new List<Ocelot.Values.Service>(array);

            _logger.LogWarning(() => $"RoundRobin2: Kube.GetAsync;"
                + $"_services.Count: {_services.Count}; captured.Count: {captured.Count};"
                + $" Services: {string.Join(Environment.NewLine, captured.Select(s => s is null ? "null" : JsonConvert.SerializeObject(s)))}");
        }
        else
        {
            _logger.LogWarning(() => $"K8s Namespace:{_configuration.KubeNamespace}, Service:{_configuration.KeyOfServiceInK8s}; Unable to use: it is invalid. Address must contain host only e.g. localhost and port must be greater than 0!");
        }

        return _services;
}

If the logs aren't entirely clear, I created a code sample that shows if the list is modified from different threads, we can get null where it shouldn't be. In the sample, I always add one element to the list, but in different threads, and I end up with the list size being more than 1 and the second element is null. And we don't get an exeption that the collection was modified.
Example is here https://github.com/antikorol/incorrect-list-usage-sample

@raman-m raman-m added bug Identified as a potential bug Summer'24 Summer 2024 release and removed waiting Waiting for answer to question or feedback from issue raiser labels Jun 25, 2024
@raman-m raman-m added this to the Summer'24 milestone Jun 25, 2024
@raman-m
Copy link
Member

raman-m commented Jun 25, 2024

@antikorol Roman,
This code analysis and research are excellent, thank you!
I concur that the Kube design warrants a review, and synchronization issues need resolution. It appears that the forthcoming refactoring may alter between 2 to 5 classes in the K8s provider. Following your report, it seems the Kube instance is acting as a singleton, likely due to the storage of provider delegates in a static field.

If you open a fix PR, I will prepare an additional hotfix release. However, I'm concerned that we need to refactor the Kube provider logic, which may require some time to develop smoothly.

FYI, I've included this bug issue in the Summer'24 milestone. This represents the least favorable delivery scenario. I am optimistic that we will manage to deliver the fix within a week or two, or alternatively, we can assign the issue to the Annual'23 release which is early bird aka current release.

@raman-m
Copy link
Member

raman-m commented Jun 25, 2024

TODO

  • Acceptance test which should catch the problem

@antikorol
Copy link
Author

A quick fix is to move the creation of the List from the class fields to a method.
This will increase allocations, but it will help avoid exception in high-load APIs.
A memory-optimized solution can be written or refactored later.

@raman-m
Copy link
Member

raman-m commented Jun 26, 2024

Would you be interested in contributing by submitting a PR? If so, please fork the repository into your account.

@antikorol
Copy link
Author

Hello, Raman @raman-m
I've opened the PR

@raman-m raman-m added the accepted Bug or feature would be accepted as a PR or is being worked on label Jun 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
accepted Bug or feature would be accepted as a PR or is being worked on bug Identified as a potential bug Kubernetes Service discovery by Kubernetes Load Balancer Ocelot feature: Load Balancer Service Discovery Ocelot feature: Service Discovery Summer'24 Summer 2024 release
Projects
None yet
2 participants