api7 / lua-resty-etcd

Nonblocking Lua etcd driver library for OpenResty

Home Page: https://api7.ai/


discuss: endpoint choose issue

nic-chen opened this issue

Background

  1. Currently, lua-resty-etcd supports cluster mode, but the implementation is too simple: each connection just switches to the next endpoint in round-robin order.

  2. This mechanism works well under normal circumstances, but once an API or an instance has a problem, the consequences are unpredictable.

Issues with the current solution

  1. In cluster mode, when an instance is down, there is no way to skip it; it is still polled every time.

  2. When a certain API (such as the auth API) fails on all instances, it triggers runaway retries, which may eventually overwhelm the etcd cluster.

Suggested changes

  1. Implement a health check mechanism. No active checking is needed, only passive checking: a failure is recorded whenever a connection fails.

  2. There is no need to poll through all instances; switch to another instance only when a connection fails.

  3. If an instance fails n consecutive times within a certain period, it is considered unhealthy and will not be connected for a certain period of time afterwards (both the duration and the failure count should be configurable); see the sketch after this list.
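To make the proposed passive check concrete, here is a minimal sketch; every name in it (report_failure, report_success, is_healthy, fail_timeout, max_fails) is hypothetical rather than the library's actual API:

```lua
-- Hypothetical per-worker passive health checker; names are illustrative.
local _M = {
    fail_timeout = 10,  -- seconds an endpoint stays marked unhealthy
    max_fails    = 3,   -- consecutive failures before marking it unhealthy
    status       = {},  -- endpoint -> { fails = n, unhealthy_until = ts }
}

function _M.report_failure(endpoint)
    local st = _M.status[endpoint] or { fails = 0, unhealthy_until = 0 }
    st.fails = st.fails + 1
    if st.fails >= _M.max_fails then
        -- too many consecutive failures: stop using this endpoint for a while
        st.unhealthy_until = ngx.now() + _M.fail_timeout
        st.fails = 0
    end
    _M.status[endpoint] = st
end

function _M.report_success(endpoint)
    _M.status[endpoint] = nil  -- any success resets the failure counter
end

function _M.is_healthy(endpoint)
    local st = _M.status[endpoint]
    return not st or st.unhealthy_until <= ngx.now()
end

return _M
```

This per-worker table is only the simplest possible version; the discussion further down moves the status into a shared dict so all workers see the same view.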

related issue: apache/apisix#2899

What do you think?

Thanks.

Looks great to me. The only question is how much we could benefit from it.

related PR: #96

commented

Got it, assigned to me.

commented

Hi folks,
after two attempts and a discussion with @membphis, I found that the previous designs were flawed, so now I'm going to post some design ideas and ask for your opinions before I try again.

[flowchart: health check design]

  1. The health check instance is global and manages an endpoints pool; each etcd instance registers its own endpoints into the pool.
  2. The health check instance tracks and updates the status of each endpoint in the endpoints pool.
  3. Each etcd instance reports endpoint failures to the health check instance, and when choosing its own endpoint it checks the status of the corresponding endpoint in the pool and skips any unhealthy one (see the sketch below).
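A rough sketch of how a client could use such a global checker when picking an endpoint; the module path resty.etcd.health_check and the register/is_healthy functions are assumptions for illustration:

```lua
-- Hypothetical usage sketch; module path and function names are assumed.
local checker = require("resty.etcd.health_check")

-- each etcd instance registers its own endpoints into the global pool
checker.register({ "http://10.0.0.1:2379", "http://10.0.0.2:2379" })

-- endpoint selection consults the checker and skips unhealthy endpoints
local function choose_endpoint(endpoints, start_idx)
    for i = 0, #endpoints - 1 do
        local idx = (start_idx + i - 1) % #endpoints + 1
        local endpoint = endpoints[idx]
        if checker.is_healthy(endpoint) then
            return endpoint, idx
        end
    end
    return nil, "all etcd endpoints are unhealthy"
end
```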

@tzssangglass
How do we implement the health check instance? It's not mentioned above.

@tzssangglass I think you can take a look at the [api-breaker] plugin: https://github.com/apache/apisix/blob/master/apisix/plugins/api-breaker.lua#L168

It should be useful for you.

commented


Got it, let me study it.

commented

health check instance

The health check instance is independent of the etcd instances; I think it can be created in the init_worker_by_lua phase.
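If it lives in init_worker_by_lua, the wiring might look like this (the dict name, module path, and init() signature are all assumptions, not the library's actual API):

```nginx
http {
    # assumed shared dict for endpoint status (name is illustrative)
    lua_shared_dict etcd_cluster_health_check 10m;

    init_worker_by_lua_block {
        -- hypothetical module path and init() signature
        local health_check = require("resty.etcd.health_check")
        health_check.init({
            shm_name     = "etcd_cluster_health_check",
            fail_timeout = 10,
            max_fails    = 3,
        })
    }
}
```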

commented

I am busy these days, so this work will be slow.

commented

I've updated the flowchart a bit to hopefully convey my thoughts more clearly.
[updated flowchart: health check design]

  • Checker parameter setting: the checker parameters (fail_timeout, max_fails) are global, not per etcd client.
  • Choose endpoint: the etcd client's choose_endpoint function asks the checker's check_endpoint_status function whether the selected endpoint is healthy, and skips it if it is not.
  • Endpoint status: the status of each endpoint is stored in the shared dict, so it is global: shared across workers and across all etcd clients (see the sketch below).
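A sketch of the shared-dict version, under the same naming assumptions as before (the dict name and key scheme are made up): report_failure counts failures with fail_timeout as the counter's TTL, so stale failures age out on their own, and check_endpoint_status treats an endpoint as unhealthy once the count reaches max_fails.

```lua
-- Hypothetical sketch; assumes nginx.conf declares:
--   lua_shared_dict etcd_cluster_health_check 10m;
local shm = ngx.shared["etcd_cluster_health_check"]

local fail_timeout = 10  -- global checker parameters, not per client
local max_fails    = 3

local function report_failure(endpoint)
    -- incr with init = 0 and init_ttl = fail_timeout: the counter is
    -- created with a TTL, so old failures stop counting automatically
    -- (init_ttl needs a reasonably recent lua-nginx-module, v0.10.12+)
    local fails, err = shm:incr("fails#" .. endpoint, 1, 0, fail_timeout)
    if not fails then
        ngx.log(ngx.WARN, "failed to record endpoint failure: ", err)
    end
end

local function check_endpoint_status(endpoint)
    local fails = shm:get("fails#" .. endpoint)
    return not fails or fails < max_fails
end
```

Because the dict is shared memory, the status is naturally global: every worker and every etcd client sees the same counters.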

@tzssangglass I think we can avoid using the init_worker_by_lua phase. A Lua top-level variable should be enough.

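For reference, the top-level-variable alternative could look like this (again a sketch with assumed names): a module-level table is created once per worker on the first require(), so no init_worker_by_lua hook is needed, while the endpoint status itself stays in the shared dict.

```lua
-- health_check.lua (hypothetical module)
-- Module-level state: initialized once per worker when the module is
-- first require()d, so no init_worker_by_lua hook is necessary.
local _M = {
    fail_timeout = 10,
    max_fails    = 3,
    -- the per-endpoint status still lives in the shared dict, so it
    -- remains shared across workers and across etcd clients
    shm = ngx.shared["etcd_cluster_health_check"],
}

return _M
```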

The rest LGTM. ^_^

commented

Got it.