api7 / lua-resty-etcd

Nonblocking Lua etcd driver library for OpenResty

Home Page: https://api7.ai/


discuss: endpoint choose issue

nic-chen opened this issue

Background

  1. Currently, lua-resty-etcd supports cluster mode, but the implementation is too simple: each connection just switches to the next endpoint in round-robin order.

  2. This mechanism works well under normal circumstances, but once an API or an instance has a problem, the consequences are unpredictable.

Issues with the current solution

  1. In cluster mode, when an instance is down, there is no way to skip it; it is still polled every time.

  2. When a certain API (such as the auth API) fails on all instances, it triggers runaway retries, which may eventually overwhelm the etcd cluster.

Suggested changes

  1. Implement a health check mechanism. No active checking is needed, only passive checking: a failure is recorded whenever a connection fails.

  2. There is no need to poll through all instances; switch to another instance only when a connection fails.

  3. If an instance fails n consecutive times within a certain period, it is considered unhealthy and will not be connected for a certain period of time afterwards (both the duration and the failure count should be configurable); see the sketch after this list.
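To make the proposed passive check concrete, here is a minimal sketch; every name in it (report_failure, report_success, is_healthy, fail_timeout, max_fails) is hypothetical rather than the library's actual API:

```lua
-- Hypothetical per-worker passive health checker; names are illustrative.
local _M = {
    fail_timeout = 10,  -- seconds an endpoint stays marked unhealthy
    max_fails    = 3,   -- consecutive failures before marking it unhealthy
    status       = {},  -- endpoint -> { fails = n, unhealthy_until = ts }
}

function _M.report_failure(endpoint)
    local st = _M.status[endpoint] or { fails = 0, unhealthy_until = 0 }
    st.fails = st.fails + 1
    if st.fails >= _M.max_fails then
        -- too many consecutive failures: stop using this endpoint for a while
        st.unhealthy_until = ngx.now() + _M.fail_timeout
        st.fails = 0
    end
    _M.status[endpoint] = st
end

function _M.report_success(endpoint)
    _M.status[endpoint] = nil  -- any success resets the failure counter
end

function _M.is_healthy(endpoint)
    local st = _M.status[endpoint]
    return not st or st.unhealthy_until <= ngx.now()
end

return _M
```

This per-worker table is only the simplest possible version; the discussion further down moves the status into a shared dict so all workers see the same view.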

related issue: apache/apisix#2899

What do you think?

Thanks.

Looks great to me. The only question is how much we could benefit from it.

related PR: #96

commented

Got it, assigned to me.

commented

Hi folks,
after two attempts and a discussion with @membphis, I found that the previous designs were flawed, so now I'm going to post some design ideas and ask for your opinions before I try again.

[flowchart: health check design]

  1. The health check instance is global and manages an endpoints pool; each etcd instance registers its own endpoints into the pool.
  2. The health check instance tracks and updates the status of each endpoint in the endpoints pool.
  3. Each etcd instance reports endpoint failures to the health check instance, and when choosing its own endpoint it checks the status of the corresponding endpoint in the pool and skips any unhealthy one (see the sketch below).
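A rough sketch of how a client could use such a global checker when picking an endpoint; the module path resty.etcd.health_check and the register/is_healthy functions are assumptions for illustration:

```lua
-- Hypothetical usage sketch; module path and function names are assumed.
local checker = require("resty.etcd.health_check")

-- each etcd instance registers its own endpoints into the global pool
checker.register({ "http://10.0.0.1:2379", "http://10.0.0.2:2379" })

-- endpoint selection consults the checker and skips unhealthy endpoints
local function choose_endpoint(endpoints, start_idx)
    for i = 0, #endpoints - 1 do
        local idx = (start_idx + i - 1) % #endpoints + 1
        local endpoint = endpoints[idx]
        if checker.is_healthy(endpoint) then
            return endpoint, idx
        end
    end
    return nil, "all etcd endpoints are unhealthy"
end
```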

@tzssangglass
How do we implement the health check instance? It's not mentioned above.

@tzssangglass I think you can take a look at the [api-breaker] plugin: https://github.com/apache/apisix/blob/master/apisix/plugins/api-breaker.lua#L168

It should be useful for you.

commented


Got it, let me study it.

commented

health check instance

The health check instance is independent of the etcd instances; I think it can be created in the init_worker_by_lua phase.
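If it lives in init_worker_by_lua, the wiring might look like this (the dict name, module path, and init() signature are all assumptions, not the library's actual API):

```nginx
http {
    # assumed shared dict for endpoint status (name is illustrative)
    lua_shared_dict etcd_cluster_health_check 10m;

    init_worker_by_lua_block {
        -- hypothetical module path and init() signature
        local health_check = require("resty.etcd.health_check")
        health_check.init({
            shm_name     = "etcd_cluster_health_check",
            fail_timeout = 10,
            max_fails    = 3,
        })
    }
}
```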

commented

I am busy these days, so this work will be slow.

commented

I've updated the flowchart a bit to hopefully convey my thoughts more clearly.
[updated flowchart: health check design]

  • Checker parameter setting: the checker parameters (fail_timeout, max_fails) are global, not per etcd client.
  • Choose endpoint: the etcd client's choose_endpoint function asks the checker's check_endpoint_status function whether the selected endpoint is healthy, and skips it if it is not.
  • Endpoint status: the status of each endpoint is stored in the shared dict, so it is global: shared across workers and across all etcd clients (see the sketch below).
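A sketch of the shared-dict version, under the same naming assumptions as before (the dict name and key scheme are made up): report_failure counts failures with fail_timeout as the counter's TTL, so stale failures age out on their own, and check_endpoint_status treats an endpoint as unhealthy once the count reaches max_fails.

```lua
-- Hypothetical sketch; assumes nginx.conf declares:
--   lua_shared_dict etcd_cluster_health_check 10m;
local shm = ngx.shared["etcd_cluster_health_check"]

local fail_timeout = 10  -- global checker parameters, not per client
local max_fails    = 3

local function report_failure(endpoint)
    -- incr with init = 0 and init_ttl = fail_timeout: the counter is
    -- created with a TTL, so old failures stop counting automatically
    -- (init_ttl needs a reasonably recent lua-nginx-module, v0.10.12+)
    local fails, err = shm:incr("fails#" .. endpoint, 1, 0, fail_timeout)
    if not fails then
        ngx.log(ngx.WARN, "failed to record endpoint failure: ", err)
    end
end

local function check_endpoint_status(endpoint)
    local fails = shm:get("fails#" .. endpoint)
    return not fails or fails < max_fails
end
```

Because the dict is shared memory, the status is naturally global: every worker and every etcd client sees the same counters.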

@tzssangglass I think we can avoid using the init_worker_by_lua phase. A Lua top-level variable should be enough.

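For reference, the top-level-variable alternative could look like this (again a sketch with assumed names): a module-level table is created once per worker on the first require(), so no init_worker_by_lua hook is needed, while the endpoint status itself stays in the shared dict.

```lua
-- health_check.lua (hypothetical module)
-- Module-level state: initialized once per worker when the module is
-- first require()d, so no init_worker_by_lua hook is necessary.
local _M = {
    fail_timeout = 10,
    max_fails    = 3,
    -- the per-endpoint status still lives in the shared dict, so it
    -- remains shared across workers and across etcd clients
    shm = ngx.shared["etcd_cluster_health_check"],
}

return _M
```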

The rest LGTM. ^_^

commented

Got it.