ThreeMammals / Ocelot

.NET API Gateway

Home Page:https://www.nuget.org/packages/Ocelot

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500

antikorol opened this issue · comments

commented

Expected Behavior / New Feature

Resolving a service address to route an API request

Actual Behavior / Motivation for New Feature

Sometimes, about every 20th - 30th request finished with status code 500

requestId: 0HN4ISPBSIV4V:000037FF, previousRequestId: No PreviousRequestId, 
message: 'Exception caught in global error handler, 
exception message: Object reference not set to an instance of an object.,
exception stack:
   at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)
   at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.DownstreamPathManipulation.Middleware.ClaimsToDownstreamPathMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.QueryStrings.Middleware.ClaimsToQueryStringMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Headers.Middleware.ClaimsToHeadersMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Authorization.Middleware.AuthorizationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Claims.Middleware.ClaimsToClaimsMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Authentication.Middleware.AuthenticationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.RequestId.Middleware.RequestIdMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.RateLimiting.Middleware.RateLimitingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Request.Middleware.DownstreamRequestInitialiserMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Headers.Middleware.HttpHeadersTransformationMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Security.Middleware.SecurityMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Multiplexer.MultiplexingMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.DownstreamRouteFinder.Middleware.DownstreamRouteFinderMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Responder.Middleware.ResponderMiddleware.Invoke(HttpContext httpContext)
   at Ocelot.Errors.Middleware.ExceptionHandlerMiddleware.Invoke(HttpContext httpContext) RequestId: 0HN4ISPBSIV4V:000037FF'

Steps to Reproduce the Problem

{
  "Routes": [
    {
      "UpstreamPathTemplate": "/public/{everything}",
      "UpstreamHttpMethod": [
        "POST"
      ],
      "ServiceName": "serviceName",
      "DownstreamPathTemplate": "/public/{everything}",
      "DownstreamScheme": "http",
      "LoadBalancerOptions": {
        "Type": "RoundRobin"
      }
    }
  ],
  "GlobalConfiguration": {
    "ServiceDiscoveryProvider": {
      "Namespace": "namespace",
      "Type": "Kube"
    }
}
}

Specifications

  • Version: 23.3.3 (NET6 and NET8)
  • Platform: Linux
  • Subsystem:

Hello Roman!
Welcome to Ocelot world! 🐯

It's difficult to determine what happened. Could you please upload the entire solution to GitHub for review? Alternatively, you could attach the following artifacts to this thread:

  • The Kubernetes service configurations, particularly the serviceName configuration
  • The Ocelot application startup C# code, also known as the app services setup
  • The Kubernetes client JSON configuration, also referred to as appsettings.json

This information will likely assist in identifying the issue.


After Upgrade to 23.3.3 from 23.2.2 about every 20th - 30th request finished with status code 500

You didn't observe the same errors in the Ocelot logs for version 23.2.x, did you?
Are you now operating in two different environments, or did you only upgrade a single environment?

Since you're utilizing Kubernetes, could you provide more details about your K8s server deployment?
How do you initiate the server, is it locally or in a cloud remotely?
And how long have you been using Kubernetes in conjunction with Ocelot service discovery?

Regarding the exception details...

requestId: 0HN4ISPBSIV4V:000037FF, previousRequestId: No PreviousRequestId, 
message: 'Exception caught in global error handler, 
exception message: Object reference not set to an instance of an object.,
exception stack:
   at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)
   at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)

Are you a C# developer? Can you debug C# code? Do you use the official Ocelot NuGet package?

Deploying the Release version of the Ocelot package DLLs means there is no debug information for the called lines in the Call Stack trace. However, we can analyze the code of the RoundRobin.Lease method.

public async Task<Response<ServiceHostAndPort>> Lease(HttpContext httpContext)
{
var services = await _servicesDelegate?.Invoke() ?? new List<Service>();
if (services?.Count != 0)
{
lock (_lock)
{
if (_last >= services.Count)
{
_last = 0;
}
var next = services[_last++];
return new OkResponse<ServiceHostAndPort>(next.HostAndPort);
}
}
return new ErrorResponse<ServiceHostAndPort>(new ServicesAreEmptyError($"There were no services in {nameof(RoundRobin)} during {nameof(Lease)} operation."));
}

It appears the issue may lie here:

var next = services[_last++];
return new OkResponse<ServiceHostAndPort>(next.HostAndPort);

There's a missing null check for line 33. Theoretically, the service provider could return a null object, which should be accounted for.

To conclude, it seems that for some reason, the Kube service provider and its service builder returned a null list item, leading to the raising of NullReferenceException.
Therefore, further investigation into the problem is warranted...

@antikorol The subsequent step in the investigation involves searching for subsequent messages in Ocelot's log.
Here is the service builder logic:

public virtual IEnumerable<Service> BuildServices(KubeRegistryConfiguration configuration, EndpointsV1 endpoint)
{
ArgumentNullException.ThrowIfNull(configuration);
ArgumentNullException.ThrowIfNull(endpoint);
var services = endpoint.Subsets
.SelectMany(subset => _serviceCreator.Create(configuration, endpoint, subset))
.ToArray();
_logger.LogDebug(() => $"K8s '{Check(endpoint.Kind)}:{Check(endpoint.ApiVersion)}:{Check(endpoint.Metadata?.Name)}' endpoint: Total built {services.Length} services.");
return services;
}

We need to search for the following message in the log:

_logger.LogDebug(() => $"K8s '{Check(endpoint.Kind)}:{Check(endpoint.ApiVersion)}:{Check(endpoint.Metadata?.Name)}' endpoint: Total built {services.Length} services.");

In appsettings.json file the Debug logging level for the application can be defined as follows:

{
  "Logging": {
    "LogLevel": {
      "Default": "Debug",
      "Microsoft.AspNetCore": "Warning"
    }
  }
}

Deploy and review the log for the message:

  • "K8s 'kind:version:name' endpoint: Total built N services."

If this message appears in the log, it indicates that the Kubernetes setup is correct, and we can proceed to investigate the issue more thoroughly.

every 20th - 30th request finished with status code 500

It seems that the requests to the Kubernetes endpoint were failing around the 20th to 30th attempts, as indicated by the code in these lines:

var endpoint = await _kubeApi
.ResourceClient(client => new EndPointClientV1(client))
.GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);

This suggests that the Kubernetes client connectivity might be unstable. It would be advisable to switch from the Kube to the PollKube provider and observe the behavior again. The error should no longer occur if this change is made. Additionally, the PollingInterval should be set to less than half the average duration between the occurrences of the 500 error.

Let's figure out what's happening and possibly provide a hotfix for your user scenario to eliminate the Null Reference Error logging. We need to handle this 500 status case and process it accordingly.
Will you be opening a PR soon?

commented
  1. I am using the official NuGet package.
    There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port now I have started using your logic with schema matching with the named port

  2. I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces.
    https://github.com/antikorol/sample-of-error

  3. When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error. (I replaced it this way because I wanted to see in which line the null reference exception appears)

  4. When I use PollKube, I get an error in the logs "requestId: 0HN4ISPBSIV51:000023E9, previousRequestId: No PreviousRequestId, message: 'Error Code: ServicesAreEmptyError Message: There were no services in RoundRobin during Lease operation. errors found in ResponderMiddleware"

  5. When I changed log level to debug, I was only able to reproduce this exception once

Below is the log with debug information.


  | Jun 24, 2024 @ 15:13:45.018 | requestId: 0HN4K8QDBB2FT:00000016, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'
-- | --

  | Jun 24, 2024 @ 15:13:39.565 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'Exception caught in global error handler, exception message: Object reference not set to an instance of an object., exception stack:    at Ocelot.LoadBalancer.LoadBalancers.RoundRobin.Lease(HttpContext httpContext)    at Ocelot.LoadBalancer.Middleware.LoadBalancingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.DownstreamPathManipulation.Middleware.ClaimsToDownstreamPathMiddleware.Invoke(HttpContext httpContext)    at Ocelot.QueryStrings.Middleware.ClaimsToQueryStringMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Headers.Middleware.ClaimsToHeadersMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Authorization.Middleware.AuthorizationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Claims.Middleware.ClaimsToClaimsMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Authentication.Middleware.AuthenticationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.RequestId.Middleware.RequestIdMiddleware.Invoke(HttpContext httpContext)    at Ocelot.RateLimiting.Middleware.RateLimitingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Request.Middleware.DownstreamRequestInitialiserMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Headers.Middleware.HttpHeadersTransformationMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Security.Middleware.SecurityMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Multiplexer.MultiplexingMiddleware.Invoke(HttpContext httpContext)    at Ocelot.DownstreamRouteFinder.Middleware.DownstreamRouteFinderMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Responder.Middleware.ResponderMiddleware.Invoke(HttpContext httpContext)    at Ocelot.Errors.Middleware.ExceptionHandlerMiddleware.Invoke(HttpContext httpContext) RequestId: 0HN4K8QDBB2FO:00000015'

  | Jun 24, 2024 @ 15:13:39.515 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FU:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.514 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FO:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.512 | requestId: 0HN4K8QDBB2FP:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.511 | requestId: 0HN4K8QDBB2FQ:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.506 | requestId: 0HN4K8QDBB2FR:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FT:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.505 | requestId: 0HN4K8QDBB2FN:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s service with key 'login-service' and address 172.31.40.225; Detected port is http:80. Total 1 ports of [http].'

  | Jun 24, 2024 @ 15:13:39.504 | requestId: 0HN4K8QDBB2FS:00000015, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'

  | Jun 24, 2024 @ 15:13:39.209 | requestId: 0HN4K8QDBB2FR:00000014, previousRequestId: No PreviousRequestId, message: 'K8s 'Endpoints:v1:login-service' endpoint: Total built 1 services.'


Dear Roman,
What's your full name?
What's your LinkedIn? We can continue discussion in LinkedIn chat...

  1. I am using the official NuGet package.
    There was no error in version 23.2.2, but I had my own logic for extending the config to specify the service port now I have started using your logic with schema matching with the named port

Please be aware that any custom logic may compromise stability, and the Ocelot team cannot be held accountable for it. It appears your existing custom logic is only compatible with version 23.2.
I guess, you might be facing a Kubernetes misconfiguration issue related to the Downstream Scheme versus Port Names feature. Are you utilizing this feature?

It would be helpful to examine the service definition in your Kubernetes configuration. Could you provide it, please?

commented

No, I removed that logic because your changes already cover the case I needed with multiple ports.

  1. I have cleaned the service code from corporate code, leaving the minimum on which the error definitely reproduces.
    https://github.com/antikorol/sample-of-error

Good! After a quick review:

  • ApiGateway.Service: Where is the ocelot.json file? In other words, where is the primary configuration file for Ocelot?
  • appsettings.json: I'm very sorry, but why have you defined routes inside appsettings.json? The Ocelot configuration should be in ocelot.json! For configuration guidance: configuration.

I wonder that your Ocelot app started at all!
Did you develop the Ocelot app in the past using version 23.2 or someone else?

Did you read Configuration docs?
Especially, the Multiple Environments paragraph which say how to configure app with .AddJsonFile("ocelot.json") // primary config file

So, I don't see configuration part in these lines L27-L42
You have to setup configuration:

    builder.WebHost
        .ConfigureAppConfiguration((context, config) =>
        {
            var env = context.HostingEnvironment;
            config.SetBasePath(env.ContentRootPath)
                .AddJsonFile("appsettings.json", true, true)
                .AddJsonFile($"appsettings.{env.EnvironmentName}.json", true, true)
                .AddOcelot(env) // happy path
                .AddEnvironmentVariables();
        })

So, it's very important to configure via AddOcelot method:

 .ConfigureAppConfiguration((context, config) => config.AddOcelot())

This is minimal required configuring code! But better to use official template from docs.

@antikorol commented:

No, I removed that logic because your changes already cover the case I needed with multiple ports.

Good! You wanna use the feature: Downstream Scheme vs Port Names

Also you enabled the port finder via "DownstreamScheme": "http".
Now, are you sure your K8s definition is correct?
Should we review K8s config together?

commented

There was update from 18.0.0 to 23.2.2 a long time ago and now to 23.3.3
I had a configuration in a YAML file, but to exclude all possible libraries, I converted it to JSON and added it to the appsettings, and yes, everything worked.
Now I've moved it to a separate ocelot.json and fixed registration

According to the Microsoft documentation, it is not mandatory to explicitly specify appsettings.json or its variations depending on the environment.
https://learn.microsoft.com/en-us/aspnet/core/fundamentals/configuration/?view=aspnetcore-6.0

Should we review K8s config together?

yes, what I should provide?

  1. When I make a copy of the RoundRobin class and simply use it in the config as RoundRobin2, I can no longer reproduce this error.

I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same. I don't understand what you're referring to.

(I replaced it this way because I wanted to see in which line the null reference exception appears)

I apologize, but uncommenting line L53 does not result in any replacement. You need to implement actual service injection using the official helpers: Custom Load Balancers. Have you consulted this documentation?

Even if you substitute the services with custom ones, it won't aid in identifying the failing line, as this necessitates the inclusion of Debug version DLLs in your Docker file, as seen here: L21. Therefore, you must build with the debug configuration -c Debug to ensure the log contains Call Stack code lines within messages for exceptions.

@antikorol commented:

yes, what I should provide?

Please provide the Kubernetes service definition in text, JSON format, or the Kubernetes client response JSON. From the log you've shared, it appears you have defined only one service; however, the number of ports in the definition is unclear.

Additionally, I'm curious as to why every twentieth Kubernetes client request fails or returns an invalid response, causing the Ocelot provider to throw NullReferenceException. We need to pinpoint this faulty request and obtain the response data in JSON format to replicate the user scenario through tests.

Lastly, I trust that you have the expertise to debug the provider, identify the issue, and potentially write tests or even submit a pull request with a fix.

commented

I apologize, but I see absolutely no difference from the official code. There's no difference at all; every line is exactly the same.

Sorry for the confusion. The files are identical. I didn't want to change the project build type. In the release build, I can see lines in the stack trace in my code. I wanted to understand in which exact line the error occurred so that I could add logs later. Surprisingly, the error didn't reproduce when I used just a copy

I am awaiting a PR that will resolve the issue...
At present, I cannot replicate your user scenario as there are no reproduction steps provided. Writing these steps is challenging because everything appears to function correctly for the first 29 requests, and only the 30th request fails with an exception. This is integration issue with invalid data from integrated service which is K8s client.
Since I cannot reproduce the issue and lack access to your actual production environment, emulating your production setting is also difficult.
Please pinpoint the issue and return to us with a bugfix PR that better illustrates the problem.
Best of luck!

commented

@raman-m
I haven't still found 100% reproduction case, but I delved deeper into the code and noticed one detail.


This class is registered as a singleton for each ServiceProviderConfiguration, and the class that resolves it, RoundRobinCreator is also registered as a singleton. At the same time, we use the Kube.GetAsync method call in async code, where we can easily encounter a "Collection was modified" exception. Maybe somewhere deep in the List.Clear or elsewhere, due to resize or clear array, we might lost references and encounter an "Object reference not set to an instance..." exception.
What do you think about this part of code?

This class is registered as a singleton for each ServiceProviderConfiguration, and the class that resolves it

Look at the actual registration method

public static IOcelotBuilder AddKubernetes(this IOcelotBuilder builder, bool usePodServiceAccount = true)
{
builder.Services
.AddKubeClient(usePodServiceAccount)
.AddSingleton(KubernetesProviderFactory.Get)
.AddSingleton<IKubeServiceBuilder, KubeServiceBuilder>()
.AddSingleton<IKubeServiceCreator, KubeServiceCreator>();
return builder;
}

So, the Kube class is not registered at all. Only KubernetesProviderFactory.Get delegate is registered:
.AddSingleton(KubernetesProviderFactory.Get)

Kube is instantiated on every request. The logic is quite complicated. And I think Kube is scoped service.
Also, IKubeApiClient is scoped registered service.
The problem could be bad/empty response from IKubeApiClient


RoundRobinCreator is also registered as a singleton.

The issue seems to be with the load balancing logic, which involves a singleton RoundRobin service that is poorly synchronized. The RoundRobin is receiving an unsynchronized IServiceDiscoveryProvider object. Furthermore, the Lease method invokes an asynchronous method that may return lists of varying lengths, which I believe is problematic. Additionally, the exception call stack pinpoints the Lease method as the source of exceptions. It would be prudent to rewrite the Lease method to operate synchronously and stabilize the list length immediately.

What's happening in my opinion.
1st thread gets the list of 1 item (count == 1)
2nd thread gets list of 1 null-item (count == 1)
1st thread goes inside of the if-block considering the list is not empty
2nd thread has finished calling async call of await _servicesDelegate?.Invoke() which returns list of null items
2nd thread re-initializes services variable with list of null items
1st thread goes to the line 32 and reads null-object
1st thread read data on line 33 from null object and the exception is generated

I believe the problem is on the line 33

return new OkResponse<ServiceHostAndPort>(next.HostAndPort);

It must be rewritten!

commented

return new OkResponse<ServiceHostAndPort>(next.HostAndPort);

You are right that problem at this row. But the main problem in concurrent issue that I described before in the class Kube because the collection of services is declared at the class field level.

I spent a lot of time for confirming the issue, I hope it will be enough.
I added debug logs and can see that after clearing the list, sometimes after inserting one service, the list becames with a size of 2.

image

The second element of list is null I can see it in class RoundRoubin

image

Example of logs:
Kube class where I can see that captured list and _services list are different:

public virtual async Task<List<Ocelot.Values.Service>> GetAsync()
{
        var endpoint = await _kubeApi
           .ResourceClient(client => new EndPointClientV1(client))
           .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);

        _services.Clear();
        if (endpoint?.Subsets.Count != 0)
        {
            var array = BuildServices(_configuration, endpoint!);

            _services.AddRange(array);

            var captured = new List<Ocelot.Values.Service>(array);

            _logger.LogWarning(() => $"RoundRobin2: Kube.GetAsync;"
                + $"_services.Count: {_services.Count}; captured.Count: {captured.Count};"
                + $" Services: {string.Join(Environment.NewLine, captured.Select(s => s is null ? "null" : JsonConvert.SerializeObject(s)))}");
        }
        else
        {
            _logger.LogWarning(() => $"K8s Namespace:{_configuration.KubeNamespace}, Service:{_configuration.KeyOfServiceInK8s}; Unable to use: it is invalid. Address must contain host only e.g. localhost and port must be greater than 0!");
        }

        return _services;
}

If the logs aren't entirely clear, I created a code sample that shows if the list is modified from different threads, we can get null where it shouldn't be. In the sample, I always add one element to the list, but in different threads, and I end up with the list size being more than 1 and the second element is null. And we don't get an exeption that the collection was modified.
Example is here https://github.com/antikorol/incorrect-list-usage-sample

@antikorol Roman,
This code analysis and research are excellent, thank you!
I concur that the Kube design warrants a review, and synchronization issues need resolution. It appears that the forthcoming refactoring may alter between 2 to 5 classes in the K8s provider. Following your report, it seems the Kube instance is acting as a singleton, likely due to the storage of provider delegates in a static field.

If you open a fix PR, I will prepare an additional hotfix release. However, I'm concerned that we need to refactor the Kube provider logic, which may require some time to develop smoothly.

FYI, I've included this bug issue in the Summer'24 milestone. This represents the least favorable delivery scenario. I am optimistic that we will manage to deliver the fix within a week or two, or alternatively, we can assign the issue to the Annual'23 release which is early bird aka current release.

TODO

commented

A quick fix is to move the creation of the List from the class fields to a method.
This will increase allocations, but it will help avoid exception in high-load APIs.
A memory-optimized solution can be written or refactored later.

Would you be interested in contributing by submitting a PR? If so, please fork the repository into your account.

commented

Hello, Raman @raman-m
I've opened the PR

Thank you, Roman!
To understand the subsequent steps, please review our Dev Process.

Could you provide the response data or JSON that your client returns, please? Understanding the actual return values of these lines would be helpful:

     var endpoint = await _kubeApi
           .ResourceClient(client => new EndPointClientV1(client))
           .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);
commented

Could you provide the response data or JSON that your client returns, please? Understanding the actual return values of these lines would be helpful:

     var endpoint = await _kubeApi
           .ResourceClient(client => new EndPointClientV1(client))
           .GetAsync(_configuration.KeyOfServiceInK8s, _configuration.KubeNamespace);
{
   "kind":"Endpoints",
   "apiVersion":"v1",
   "metadata":{
      "name":"fake-service",
      "namespace":"fake-namespace-dev",
      "uid":"a00000a0-a0aa-0000-0000-0a00aaa0a00a",
      "resourceVersion":"111111",
      "creationTimestamp":"2023-11-01T13:07:32Z",
      "labels":{
         "app.kubernetes.io/instance":"fake-service",
         "app.kubernetes.io/managed-by":"Helm",
         "app.kubernetes.io/name":"fake-service",
         "app.kubernetes.io/version":"1.0.1",
         "helm.sh/chart":"fake-service-0.1.0"
      },
      "managedFields":[
         {
            "manager":"kube-controller-manager",
            "operation":"Update",
            "apiVersion":"v1",
            "time":"2024-07-11T11:04:39Z",
            "fieldsType":"FieldsV1",
            "fieldsV1":{
               "f:metadata":{
                  "f:labels":{
                     ".":{
                        
                     },
                     "f:app.kubernetes.io/instance":{
                        
                     },
                     "f:app.kubernetes.io/managed-by":{
                        
                     },
                     "f:app.kubernetes.io/name":{
                        
                     },
                     "f:app.kubernetes.io/version":{
                        
                     },
                     "f:helm.sh/chart":{
                        
                     }
                  }
               },
               "f:subsets":{
                  
               }
            }
         }
      ]
   },
   "subsets":[
      {
         "addresses":[
            {
               "ip":"111.22.33.10",
               "nodeName":"ip-111.22.33.10.eu-central-1.compute.internal",
               "targetRef":{
                  "kind":"Pod",
                  "namespace":"fake-namespace-dev",
                  "name":"fake-service-ddsadasd-1111",
                  "uid":"a00000a0-a0aa-0000-0000-0a00aaa0a00a"
               }
            },
            {
               "ip":"111.33.33.11",
               "nodeName":"ip-111.33.33.11.eu-central-1.compute.internal",
               "targetRef":{
                  "kind":"Pod",
                  "namespace":"fake-namespace-dev",
                  "name":"fake-service-ddsadasd-2222",
                  "uid":"a00000a0-a0aa-0000-0000-0a00aaa0a00a"
               }
            }
         ],
         "ports":[
            {
               "name":"http",
               "port":80,
               "protocol":"TCP"
            }
         ]
      }
   ]
}

Thank you for the JSON samples! I will utilize them to write an acceptance test.

Roman, I have the next question regarding load balancing in your project...
Why does your team use Kubernetes for service registrations? Is the project big enough?
Using containerization tools makes sense in case of large projects, in a cloud to deploy containerization tools like Kubernetes or Consul.

If the project is small and the number of downstream services is constant (with all service IPs known), you can use static service definitions in the DownstreamHostAndPorts route options and manually load balance, as described in this documentation section:

So, it's just a question of...

commented

In my opinion, Kubernetes is a good choice for a startup/initial stage of the project as a simple and quick way to implement service discovery (it can be a five-minute job). In the future, after analyzing the load, you can implement another solution.

Using a static IP for a pod, in my view, is an anti-pattern.
Using pods IPs in static configuration is a bad idea because, in Kubernetes, there are many ways when IP address will be changed: deployment, scaling, and even simply rescheduling the pod to a different node by Kubernetes itself.

At the same time, to reduce the load on Kubernetes, I can simply use the PollKube. However, unfortunately, this bug affects this provider too, but this fix resolves it. :)

Is the project big enough?

I hope it will be huge in the future :D, but even now there is scaling during peak load hours.

@antikorol Hi Roman!
I was able to reproduce the problem in this branch: raman-m/2110-steps-to-reproduce → the KubernetesServiceDiscoveryTests class. You need to run ShouldReturnServicesFromK8s_HavingHighLoadOnTheProviderAndRoundRobinBalancer tests.

Be aware that there are stable and unstable theory cases; however, both exhibit the 500 status issue. To reveal the 500 error in the green test, you must use the "Run Until Failed" option. Additionally, both cases, when in Debug mode, might display the exception under discussion, because debug mode show exceptions being thrown.
I'm going to reuse my research results in the fixing PR...

commented

@raman-m Hello Raman
Thank you for acceptance tests
I've tried to run this test, but I have the error (on macOS)

System.Net.Sockets.SocketException: Can't assign requested address

This is a known issue for macOS. I tried a couple of suggestions from the internet, but they didn't help, and many of the suggestions are such that I wouldn't want to implement on my work laptop.

I see that your CI failed on this test case. Can I simply cherry-pick the commit and check it on your CI?

We use Windows and Linux only for now. Don't use MacOS please.
CircleCI runs builds using Linux env.
Now the test is green

I've tried to run this test, but I have the error (on macOS)

System.Net.Sockets.SocketException: Can't assign requested address

This is a known issue for macOS. I tried a couple of suggestions from the internet, but they didn't help, and many of the suggestions are such that I wouldn't want to implement on my work laptop.

@antikorol, you've highlighted an important aspect concerning MacOS builds. We will soon establish automated Windows CI-CD builds as certain features of Ocelot necessitate it. Linux builds satisfy 99.9% of our requirements, and the .NET SDK versus Linux is highly stable since Microsoft conducts extensive testing of the .NET SDK on Linux OS. I am in the process of ascertaining whether the issue lies with the .NET SDK on MacOS or if it is an issue unique to Ocelot?

commented

@raman-m Loopback does not work on macOS. If I want to use addresses 127.0.0.2+, I need to manually configure the alias.

https://github.com/ThreeMammals/Ocelot/blob/raman-m/2110-steps-to-reproduce/test/Ocelot.AcceptanceTests/Steps.cs#L69

If I don't use loopback here, the tests pass (the method is called with different ports).

Understood! As a MacOS user, you can suggest a directive preprocessor correction to ensure this helper runs successfully in MacOS environments. MacOS machines should indeed have localhost as an embedded local IP. Have you investigated this issue thoroughly? Could you run the Terminal command ping localhost or an equivalent and share the Terminal output, please? Unfortunately, I do not have access to any MacOS machines.

Loopback does not work on macOS. If I want to use addresses 127.0.0.2+, I need to manually configure the alias.

You can manually configure aliases such as localhost1 and localhost2 in macOS; however, this method is not suitable for automated scripts.

commented

I encountered this problem many years ago, and it seems that nothing has changed since then.
I've attached a screenshot of the ping results.

image

I solved this problem last time by using the command for each needed ips

sudo ifconfig lo0 alias 127.0.0.2 up

There might be a more elegant solution, but I didn't have free time to investigate further.