kubedl-io / kubedl

Run your deep learning workloads on Kubernetes more easily and efficiently.

Home Page: https://kubedl.io/

[question] hostnetwork

kuizhiqing opened this issue

In the implementation of the host network feature, a service is created with clusterIP set to "".

In summary,

  • the pod binds to the host with hostnetwork
  • the service binds to the pod, with a port redirection

My question is: can we achieve good performance if we access the pod via the service, compared with accessing it via the host IP?

@kuizhiqing hi zhiqing, good point! Our motivation for the job hostnetwork mode is to enable capabilities like RDMA, which require hostnetwork and bypassing overlay networking.
That said, although hostnetwork bypasses the virtualized overlay container network, service-based traffic routing on the standard k8s networking model brings new overhead. Theoretically it still has some performance advantage over the overlay network when the cluster is not that large. More importantly, high-performance networking capabilities are fully enabled, and workers can fail over seamlessly even if the host ports of peer workers change.

@kuizhiqing the performance of service traffic routing has improved since IPVS was introduced, and as far as I know there are other contributions working on this, e.g. Tencent optimizes service performance by leveraging IPVS-eBPF.

@SimonCqk Thank you very much for your explanation, it helps me a lot to understand your design.

There is one more thing, which may be beyond the scope of this issue, but I would really like to know whether you have considered it. As you mentioned, the service is there to handle failover. In ML scenarios, however, communication libraries such as NCCL are NOT fault tolerant, which prevents us from benefiting from the service. I mean, if we have to reconstruct the communication group with the new pod anyway, doing it with the same service or with a new IP costs almost the same.

That is one reason I asked the initial question: accessing the pod via the host IP directly could give better performance (though I am persuaded that the difference may be negligible). Is the argument for introducing a service strong enough here?

@kuizhiqing since an NCCL-based allreduce training job is not fault-tolerant, why not just mark it as Failed and re-submit the job?
If a worker crashes unexpectedly during training, kube-scheduler does not guarantee that it can be rescheduled onto the node alongside the other worker pods, so the NCCL communication ring always breaks.

Hi @SimonCqk, one more question please: it seems that host port allocation relies on nothing but a random function, which does not cover the case of a port conflict. Am I right?

The random function avoids conflicts to some extent. If a conflict does happen, kubedl will fail over and try again until an unused port is found.
Hope this explanation helps.