[question] hostnetwork

kuizhiqing opened this issue 3 years ago

In the implementation of the host network feature, a service is created with clusterIP set to "".

In summary:

the pod binds to the host with hostnetwork
the service binds to the pod, with a port redirection
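
To make the two points above concrete, here is a minimal sketch of what such a pod/service pair might look like, written as plain Python dictionaries that mirror Kubernetes manifests. It is not taken from the KubeDL code: the names, labels, image, and port numbers (`worker-0`, `example/trainer:latest`, `23456`, `2222`) are all hypothetical, and only the `hostNetwork` flag, the empty `clusterIP`, and the port redirection come from the description above.

```python
import json


def build_pod(name, host_port):
    """Pod manifest that shares the node's network namespace (hostNetwork)."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"worker": name}},
        "spec": {
            "hostNetwork": True,  # the pod binds directly to the host network
            "containers": [{
                "name": "trainer",                  # hypothetical container name
                "image": "example/trainer:latest",  # hypothetical image
                "ports": [{"containerPort": host_port}],
            }],
        },
    }


def build_service(name, service_port, host_port):
    """Service manifest that redirects a stable port to the pod's host port."""
    return {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": name},
        "spec": {
            "clusterIP": "",  # left empty, as described in the question
            "selector": {"worker": name},
            # port redirection: peers dial service_port, traffic lands on host_port
            "ports": [{"port": service_port, "targetPort": host_port}],
        },
    }


if __name__ == "__main__":
    pod = build_pod("worker-0", host_port=23456)
    svc = build_service("worker-0", service_port=2222, host_port=23456)
    print(json.dumps({"pod": pod, "service": svc}, indent=2))
```

In this sketch, peers keep dialing `worker-0:2222` even if the host port behind the service changes, which is the redirection the question refers to.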

My question is: can we achieve good performance when we access the pod via the service, compared with accessing it directly via the host IP?

Simon_CQK · Answer 1 · Sat Nov 13 2021 13:26:38 GMT+0800 (China Standard Time)

@kuizhiqing hi zhiqing, good point! Our motivation for the job hostnetwork mode is to enable capabilities like RDMA, which require hostnetwork and bypassing overlay networking.
That said, although hostnetwork bypasses the virtualized overlay container network, service-based traffic routing in the standard k8s networking model brings new overhead. Theoretically, it still has some performance advantage over the overlay network when the cluster is not that large. More importantly, high-performance networking capabilities are fully enabled, and workers can fail over seamlessly regardless of changes to peer workers' host ports.

Simon_CQK · Answer 2 · Sat Nov 13 2021 13:35:46 GMT+0800 (China Standard Time)

@kuizhiqing the performance of service traffic routing has improved since IPVS was introduced, and as far as I know there are other contributions working on this, e.g. Tencent optimizes service performance by leveraging IPVS-eBPF.

Chitsing KUI · Answer 3 · Mon Nov 15 2021 14:01:38 GMT+0800 (China Standard Time)

@SimonCqk Thank you very much for your explanation, it helps me a lot to understand your design.

There is one more thing, which may be beyond the scope of this issue, but I would really like to know whether you have considered it further. As you mentioned, a service is meant to handle the failover situation. In ML scenarios, however, a communication library such as NCCL is NOT fault tolerant, which prevents us from benefiting from the service. I mean, if we have to reconstruct the communication group with the new pod anyway, doing it with the same service or with a new IP costs almost the same.

That's one reason why I raised the initial question: accessing the pod via the host IP directly can give better performance (although I was persuaded that the difference may be negligible). Is the argument for introducing a service strong enough here?

Simon_CQK · Answer 4 · Mon Nov 15 2021 16:17:25 GMT+0800 (China Standard Time)

@kuizhiqing since an NCCL-based allreduce training job is not fault tolerant, why not just mark it as Failed and re-submit the job?
If a worker crashes unexpectedly during training, kube-scheduler cannot guarantee that it will be re-scheduled onto the node along with the other worker pods, so the NCCL communication ring always breaks.

Chitsing KUI · Answer 5 · Wed Feb 16 2022 11:08:55 GMT+0800 (China Standard Time)

Hi @SimonCqk, one more question please: it seems that the allocation of the hostport depends on nothing but a random function, which does not cover the case of a port conflict. Am I right?

Jian He · Answer 6 · Tue Jun 28 2022 07:13:16 GMT+0800 (China Standard Time)

The random function can avoid conflicts to some extent. If a conflict really does happen, kubedl will fail over and try again until an unused port is found.
Hope this explanation helps.
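
The retry behaviour described above can be sketched roughly as follows. This is not kubedl's actual code, only an illustration of "pick a random port, and on conflict try again": the function name `allocate_host_port`, the port range, the attempt limit, and the `try_bind` callback (which stands in for "start the pod on this host port and report whether it conflicted") are all assumptions made for the example.

```python
import random


def allocate_host_port(try_bind, low=20000, high=30000, max_attempts=100, rng=None):
    """Pick random ports in [low, high] until try_bind(port) succeeds.

    try_bind is a hypothetical callback returning True when the port was free.
    Raises RuntimeError if no free port is found within max_attempts draws.
    """
    rng = rng or random.Random()
    for _ in range(max_attempts):
        port = rng.randint(low, high)  # random draw; may collide with a used port
        if try_bind(port):
            return port
        # conflict: fall through and draw another port
    raise RuntimeError("no unused host port found after %d attempts" % max_attempts)


if __name__ == "__main__":
    in_use = {20001, 20002, 20003}  # ports pretended to be taken on the node
    port = allocate_host_port(lambda p: p not in in_use)
    print("allocated host port:", port)
```

Note that a purely random draw only makes conflicts unlikely; it is the retry loop that actually handles them, and the attempt limit keeps the loop from spinning forever on a node with no free ports in the range.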