solo-io / gloo

The Feature-rich, Kubernetes-native, Next-Generation API Gateway Built on Envoy

Home Page: https://docs.solo.io/

Allow management of the Envoy configuration parameter: `close_connections_on_host_set_change`.

evsasha opened this issue

Gloo Edge Product

Open Source

Gloo Edge Version

v1.16.13

Is your feature request related to a problem? Please describe.

Feature Request: Solving the WebSocket Split-Brain Problem.

In my scenario, multiple Envoy Proxy instances serve several backend instances.
The backend communicates over the WebSocket protocol.
Users connect to the backend and, via the Maglev load-balancing algorithm, consistently reach the same pod regardless of which Envoy Proxy instance they connect through.

During a network failure, different Envoy instances can end up seeing different sets of healthy backend instances.
WebSocket sessions then get rehashed onto different pods, disrupting communication between users.

I expect that once the network issue is resolved and the pod is back in the load balancer, the sessions will automatically rebalance and once again route to a single pod.
However, this does not happen with Envoy's default behavior.

This logic in Envoy is controlled by the configuration parameter `close_connections_on_host_set_change`, which is currently not exposed by Gloo.

https://www.envoyproxy.io/docs/envoy/latest/api-v3/config/cluster/v3/cluster.proto#config-cluster-v3-cluster-commonlbconfig
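
For reference, this is where the parameter sits in a raw Envoy cluster definition. A minimal sketch; the cluster name, address, and port are placeholders:

```yaml
# Raw Envoy cluster configuration (not Gloo): shows where the parameter
# lives. Cluster name, address, and port are placeholders.
clusters:
- name: websocket-backend
  type: STRICT_DNS
  lb_policy: MAGLEV
  common_lb_config:
    # Drain all active connections to this cluster whenever the host set
    # changes, so clients reconnect and are re-hashed consistently.
    close_connections_on_host_set_change: true
  load_assignment:
    cluster_name: websocket-backend
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            socket_address:
              address: backend.default.svc.cluster.local
              port_value: 8080
```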

Describe the solution you'd like

Add a configuration block that exposes the Envoy parameter `close_connections_on_host_set_change`.
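
For example, the field could sit next to the existing load-balancer settings on the Upstream CRD. This is purely illustrative: `closeConnectionsOnHostSetChange` does not exist in Gloo today, and its name and placement here are assumptions about what the API could look like:

```yaml
apiVersion: gloo.solo.io/v1
kind: Upstream
metadata:
  name: websocket-backend
  namespace: gloo-system
spec:
  kube:
    serviceName: websocket-backend
    serviceNamespace: default
    servicePort: 8080
  loadBalancerConfig:
    maglev: {}
    # Hypothetical field -- not part of Gloo v1.16. Name and placement
    # are assumptions illustrating the requested API shape.
    closeConnectionsOnHostSetChange: true
```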

Describe alternatives you've considered

  1. Monitoring and Terminating Sessions on the Client Side: This requires implementation on the backend side and is potentially slower than an Envoy-side implementation.

  2. Using a Single Instance of Envoy Proxy: This would eliminate the split brain but at the cost of fault tolerance.

  3. Increasing Health-Check Timeouts and the Number of Failed Checks Before Removing a Pod from the Load Balancer (see the sketch after this list): This reduces the number of incidents but also slows the reaction to a real pod failure.
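
For alternative 3, a sketch of health-check tuning, assuming the standard `healthChecks` block on the Gloo Upstream; the thresholds and path are illustrative only:

```yaml
# Assumes the standard healthChecks block on the Gloo Upstream; the
# thresholds and path are illustrative only.
spec:
  healthChecks:
  - timeout: 2s
    interval: 5s
    unhealthyThreshold: 5   # consecutive failures before eviction
    healthyThreshold: 2     # consecutive successes before re-admission
    httpHealthCheck:
      path: /healthz
```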

Additional Context

It might be worth considering exposing the entire `config.cluster.v3.Cluster.CommonLbConfig` block.
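
A sketch of what mirroring the whole message might look like; the field names follow the Envoy proto, but their camelCased placement under `loadBalancerConfig` is an assumption:

```yaml
# Hypothetical mirror of config.cluster.v3.Cluster.CommonLbConfig; field
# names follow the Envoy proto, placement and casing are assumptions.
spec:
  loadBalancerConfig:
    commonLbConfig:
      healthyPanicThreshold: 50          # percent healthy hosts below which panic routing starts
      updateMergeWindow: 1s              # coalesce rapid host-set updates
      ignoreNewHostsUntilFirstHc: true   # don't route to hosts before their first health check passes
      closeConnectionsOnHostSetChange: true
```

If I read the current API correctly, Gloo's `loadBalancerConfig` already exposes `healthyPanicThreshold` and `updateMergeWindow`, so in practice only the missing `CommonLbConfig` fields would need new plumbing.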