krkn sandbox submission review

Question

joshgav opened this issue 8 months ago

krkn is a chaos testing project proposed for the CNCF sandbox in cncf/sandbox#44. This issue tracks discussions and reviews of krkn to inform its acceptance into the CNCF sandbox.

We've asked @psuriset and team to join an upcoming TAG meeting to discuss the following about the project:

the value this project offers to end users
the project's high-level technical architecture
the project's near-term roadmap
the state of the project's community and governance
how the project compares with existing projects

Josh Gavant · Answer 1 · Wed Oct 04 2023 21:14:50 GMT+0800 (China Standard Time)

krkn will present to the TAG at our general meeting on 10/18. Agenda/notes here: https://docs.google.com/document/d/1OykvqvhSG4AxEdmDMXilrupsX2n1qCSJUWwTc3I7AOs/edit#heading=h.5676wfk2ybjv

Josh Gavant · Answer 2 · Tue Oct 24 2023 05:48:37 GMT+0800 (China Standard Time)

Thank you @psuriset and team for presenting krkn to us last week. Notes from the presentation follow. The TAG believes krkn is a good fit for CNCF sandbox!

Recording: https://youtu.be/nXQkBFK_MWc?t=722
Presentation: https://drive.google.com/file/d/1jaTWROCtruWyBvLB0xI5qZhbavVCSwEe/

Value Props

find unexpected problems by injecting unexpected failure scenarios. Needed by the Red Hat performance & scale team to ensure maximum performance of clusters and apps.
emphasis on performance - SLAs and SLOs
AI and recommender increase chaos coverage

Architecture

Components include krkn, cerberus, chaos recommender, chaos AI, & telemetry collector
client-side tool: runs outside the cluster so that it doesn't become a victim of its own actions
calls APIs to inject chaos
for supported scenarios, has built-in checks that the failure was handled successfully
users can configure PromQL queries that define success
Cerberus: utility that aggregates health into a single go/no-go signal
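
The go/no-go idea noted for Cerberus can be illustrated with a minimal Python sketch. This is not Cerberus's actual code or API: the `HealthCheck` type, the component names, and the "every check must pass" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    name: str      # hypothetical component name, e.g. "nodes"
    healthy: bool  # outcome of that component's health check


def go_no_go(checks: list[HealthCheck]) -> bool:
    """Collapse many health results into one signal: go only if all are healthy."""
    return all(check.healthy for check in checks)


def failing(checks: list[HealthCheck]) -> list[str]:
    """Names of the components that caused a no-go."""
    return [check.name for check in checks if not check.healthy]


# Illustrative input, not real cluster data.
checks = [
    HealthCheck("nodes", True),
    HealthCheck("etcd", True),
    HealthCheck("ingress", False),
]
print("go" if go_no_go(checks) else "no-go", failing(checks))
```

A chaos run could consult such a signal before and after injecting a failure, so a scenario is only counted as handled when the cluster returns to "go".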

Chaos AI and Recommender, Telemetry collector

Why? improve and increase coverage for chaos
Can watch telemetry from application or other components and create appropriate chaos test cases
developed by IBM
Recommender - based on static rules
Chaos Recommender is already part of the project; Chaos AI is still in development but will be part of the project
Chaos AI will include a mechanism to continually train a model based on actual telemetry and observation
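
To make "based on static rules" concrete, here is a rough Python sketch of how a rule-based recommender could map observed telemetry to chaos test cases. It is not the Chaos Recommender's implementation: the metric names, thresholds, scenario names, and the heuristic that a service leaning on a resource is most sensitive to faults on that resource are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable

Telemetry = dict[str, float]  # metric name -> observed value (hypothetical names)


@dataclass(frozen=True)
class Rule:
    scenario: str                         # hypothetical chaos scenario label
    applies: Callable[[Telemetry], bool]  # static predicate over telemetry


# Assumed heuristic: recommend faults on the resources a service uses heavily.
RULES = [
    Rule("cpu-hog", lambda t: t.get("cpu_utilization", 0.0) > 0.8),
    Rule("memory-hog", lambda t: t.get("memory_utilization", 0.0) > 0.8),
    Rule("network-latency", lambda t: t.get("requests_per_second", 0.0) > 1000),
]


def recommend(telemetry: Telemetry) -> list[str]:
    """Return the scenarios whose static rule matches the observed telemetry."""
    return [rule.scenario for rule in RULES if rule.applies(telemetry)]


# Illustrative telemetry, not measurements from a real service.
print(recommend({"cpu_utilization": 0.93, "requests_per_second": 1500}))
```

Because the rules are fixed, the output is fully determined by the telemetry. That differs from the continually trained model described for Chaos AI.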

Roadmap

implement chaos tests for more known scenarios, for example a Kafka cluster in K8s or DNS
want to learn from more users via CNCF
want to create visualizations and reports from tests

Community

Other contributors: IBM (AI)
Users: universities using and providing feedback, FSIs (banks, finance)

Questions

What do you mean by "focus on performance"?
- use kube-burner
- provide some recommended default SLOs to test against
Contrast with LitmusChaos and others
- runs outside of cluster
- cover more perf use cases
- AI capability - automate creating test cases
How do you anticipate users using this? In a pipeline, ad-hoc?
- Recommend using in a continuous chaos system
- Use in a test environment first
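
As a rough illustration of testing against default SLOs expressed as PromQL queries, the Python sketch below evaluates a small SLO list through an injected query function. The SLO names, queries, and thresholds are illustrative assumptions, not krkn's shipped defaults, and `run_query` stands in for whatever client actually talks to Prometheus.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Slo:
    name: str
    query: str        # PromQL expression (illustrative)
    max_value: float  # SLO holds while the query result <= max_value


# Hypothetical defaults chosen for this sketch only.
DEFAULT_SLOS = [
    Slo(
        "api_p99_latency_seconds",
        "histogram_quantile(0.99, sum(rate("
        "apiserver_request_duration_seconds_bucket[5m])) by (le))",
        1.0,
    ),
    Slo(
        "firing_critical_alerts",
        'count(ALERTS{severity="critical", alertstate="firing"}) or vector(0)',
        0.0,
    ),
]


def evaluate_slos(slos: list[Slo], run_query: Callable[[str], float]) -> dict[str, bool]:
    """Run each SLO's query and report whether it stayed within its threshold."""
    return {slo.name: run_query(slo.query) <= slo.max_value for slo in slos}


# Stubbed query results so the sketch runs without a Prometheus server.
stub_results = {DEFAULT_SLOS[0].query: 0.42, DEFAULT_SLOS[1].query: 2.0}
results = evaluate_slos(DEFAULT_SLOS, stub_results.__getitem__)
print(results)
```

In a continuous chaos pipeline, a stage like this could run after each injected scenario and fail the run when any SLO reports `False`.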

Josh Gavant · Answer 3 · Fri Oct 27 2023 21:13:10 GMT+0800 (China Standard Time)

Closing as this is now complete, thanks all.