pinterest / DoctorK

DoctorK is a service for Kafka cluster auto healing and workload balancing

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Question: Why does DoctorKafka rely on same cluster it is trying to heal? Isn't that an architecture problem?

saherahwal opened this issue · comments

commented

Hi

I am curious about the architecture and design of DoctorKafka.

Usually a system (e.g DoctorKafka) that checks on the health of another system should be isolated from the main system to be checked (Kafka Cluster). So I am curious about the design decision to use Kafka topic from within the cluster (the same cluster we are trying to auto-heal and check health of) to report metrics to be consumed by DoctorKafka.

Can someone please clarify and help me understand why this is a good design decision?
What was the thought process on the fact DoctorKafka relies on a topic within the same cluster it is trying to heal?

commented

Dr. Kafka is designed to manage 10s of clusters. Ideally deployments should separate those clusters but it's not necessary since the replication factor of the stats topic should ideally be 3 or 4 with min.isr 2 or higher. This provides sufficient fault tolerance to not really create any issues if you do deploy with a cyclical dependencies in the clusters.

Fault tolerance in question here should be achieved via replication and consistency in Kafka not necessarily via node isolation. Dr. Kafka also has built-in throttling controls to limit multi-node impact i.e. 1 node replaced at most in X hours therefore if the min.isr is > 2 then there shouldn't be any faults.

commented

Thanks for the explanation but I am not sure I understand your reply.
I am asking a simple architecture question regarding dependency here and the fact that the health system is part of the actual system we are reporting health for. It seems logical to me that health system should be isolated from the system it is trying to heal / detect problems in.

I am not discussing fault tolerance here. Just the fact that "how can I trust a health system that is part of the same exact system I am trying to heal?!". It just seems to be fragile architecture.