krkn sandbox submission review
joshgav opened this issue
krkn is a chaos testing project proposed for the CNCF sandbox in cncf/sandbox#44. This issue tracks discussions and reviews of krkn to support its acceptance into the CNCF sandbox.
We've asked @psuriset and team to join an upcoming TAG meeting to discuss the following about the project:
- the value this project offers to end users
- the project's high-level technical architecture
- the project's near-term roadmap
- state of the project's community and governance
- comparison with existing projects
krkn will present to the TAG at our general meeting on 10/18. Agenda/notes here: https://docs.google.com/document/d/1OykvqvhSG4AxEdmDMXilrupsX2n1qCSJUWwTc3I7AOs/edit#heading=h.5676wfk2ybjv
Thank you @psuriset and team for presenting krkn to us last week. The following are notes from the presentation. The TAG believes krkn is a good fit for CNCF sandbox!
- Recording: https://youtu.be/nXQkBFK_MWc?t=722
- Presentation: https://drive.google.com/file/d/1jaTWROCtruWyBvLB0xI5qZhbavVCSwEe/
Value Props
- find unexpected problems by injecting unexpected scenarios. Needed by Red Hat performance & scale team to ensure max performance of clusters and apps.
- emphasis on performance - SLAs and SLOs
- AI and recommender increase chaos coverage
Architecture
- Components include krkn, cerberus, chaos recommender, chaos AI, & telemetry collector
- client-side tool; runs outside the cluster so that it doesn't become a victim of its own actions
- calls APIs to inject chaos
- for supported scenarios, has built-in checks that the failure was handled successfully
- users can configure PromQL queries that define success
- Cerberus: utility that aggregates health into a single go/no-go signal
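
To make the last two points concrete, here is a minimal sketch of how PromQL-based success criteria could be evaluated and collapsed into a single go/no-go signal. It is not krkn's or Cerberus's actual code: `SloCheck`, `check_passes`, and `go_no_go` are hypothetical names, and the sketch assumes the caller has already fetched each result from the Prometheus HTTP API (`/api/v1/query`), whose instant-query JSON shape is what `check_passes` reads.

```python
from dataclasses import dataclass
import operator

# Comparison operators a check may use (illustrative set).
_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}


@dataclass
class SloCheck:
    """Hypothetical success criterion: a PromQL query plus a threshold."""
    name: str
    promql: str       # query text; the caller sends it to Prometheus
    op: str           # one of the keys in _OPS
    threshold: float


def check_passes(check, response):
    """Evaluate one check against a Prometheus instant-query JSON response."""
    if response.get("status") != "success":
        return False
    samples = response["data"]["result"]
    if not samples:
        return False  # assumption: missing data counts as a failure
    compare = _OPS[check.op]
    # Each sample's "value" is [timestamp, "<value as string>"].
    return all(compare(float(s["value"][1]), check.threshold) for s in samples)


def go_no_go(results):
    """Collapse per-check booleans into one signal: go only if all pass."""
    return all(results.values())


# Example with a canned response instead of a live Prometheus server.
check = SloCheck(
    name="api_p99_latency_seconds",
    promql="histogram_quantile(0.99, sum(rate("
           "apiserver_request_duration_seconds_bucket[5m])) by (le))",
    op="<",
    threshold=1.0,
)
response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1700000000, "0.42"]}],
    },
}
results = {check.name: check_passes(check, response)}
print("GO" if go_no_go(results) else "NO-GO")
```

Treating an empty result as a failure is a design choice in this sketch; a real setup might prefer to flag missing data separately from an SLO breach.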
Chaos AI and Recommender, Telemetry collector
- Why? To improve and increase chaos test coverage
- Can watch telemetry from application or other components and create appropriate chaos test cases
- developed by IBM
- Recommender - based on static rules
- Chaos Recommender already part of project, Chaos AI still in development but will be part of project
- Chaos AI will include a mechanism to continually train a model based on actual telemetry and observation
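
As a rough illustration of what "based on static rules" can mean for a recommender, the sketch below maps observed telemetry to suggested chaos scenarios. Everything in it is an assumption for illustration: the rule set, the telemetry keys, the thresholds, and the scenario labels are hypothetical and are not taken from the Chaos Recommender itself.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Rule:
    """Hypothetical static rule: suggest a scenario when a predicate holds."""
    scenario: str                      # illustrative scenario label
    applies: Callable[[dict], bool]    # predicate over a telemetry snapshot


# Illustrative rules; thresholds and metric names are made up for the sketch.
RULES = [
    Rule("cpu-stress", lambda t: t.get("cpu_utilization", 0.0) > 0.8),
    Rule("memory-stress", lambda t: t.get("memory_utilization", 0.0) > 0.8),
    Rule("network-latency", lambda t: t.get("network_rx_bytes_per_s", 0.0) > 50e6),
]


def recommend(telemetry, rules=RULES):
    """Return the scenarios whose rule matches the telemetry, in rule order."""
    return [rule.scenario for rule in rules if rule.applies(telemetry)]


# A CPU-bound service triggers only the CPU rule.
print(recommend({"cpu_utilization": 0.93, "memory_utilization": 0.41}))
```

The idea in this sketch is that a service already running hot on one resource is a plausible candidate for a chaos test that stresses that resource further.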
Roadmap
- implement chaos tests for more known scenarios, for example a Kafka cluster in K8s or DNS
- want to learn from more users via CNCF
- want to create visualizations and reports from tests
Community
- Other contributors: IBM (AI)
- Users: universities using and providing feedback, FSIs (banks, finance)
Questions
- What do you mean by "focus on performance"?
  - use kube-burner
  - provide some recommended default SLOs to test against
- Contrast with LitmusChaos and others
  - runs outside of cluster
  - cover more perf use cases
  - AI capability - automate creating test cases
- How do you anticipate users using this? In a pipeline, ad-hoc?
  - Recommend using in a continuous chaos system
  - Use in a test environment first
Closing as this is now complete, thanks all.