Data on Kubernetes - Getting Started Guide

Contributors: Paul Au, Ryan Wallner, Johnathon Battiato, Kallio Prinewill

Project Proposal

Project Description

The barrier to entry for any technology and community surrounding that technology can often be quite high. The purpose of this document is to capture existing resources that we can leverage to help create a roadmap for DoK beginners.

The secondary purpose of this document is to capture gaps in existing content so we can, as a community, create content to fill those gaps.

Goals

Provide a set of resources to help guide a complete beginner with the base set of knowledge to start running data workloads on Kubernetes.

We will break the content into sections in a logical order to guide someone from DoK beginner to deploying their first stateful application on kubernetes.

Why Stateful on Kubernetes

According to past DoK reports, the following are some of the reasons

Kubernetes has become a core part of IT – half of the respondents are running 50% or more of their production workloads • on it, and they are very satisfied and more productive as a result. The most advanced users report 2x or greater productivity gains. -Business demands are creating pressures for further adoption. The increasing importance of real-time data to competitive advantage will sharpen companiesʼ need to run data on Kubernetes. A majority believe standards will improve data management and that data should become declarative. -Standardization is the key driver for Kubernetes Leaders

Intro to Stateful

Purpose

Provide basic knowledge of what stateful means in Kubernetes.

Resources

Types of workloads

Purpose

Provide a list of stateful workloads that exist on Kubernetes and a description/examples of each workload Stateful Workloads

Databases (stateful sets or CRD)
AI/ML (usually jobs) - https://developers.redhat.com/aiml/ai-workloads
Batch processing jobs
Stream processing
Machine learning and AI workloads
Data analytics
ETL (Extract, Transform, Load) pipelines
Data warehousing
Distributed databases
In-memory data grids
Time series databases
Search and indexing engines

Operators 101

Purpose

Provide resources explaining what operators are and what role they play in running data workloads on kubernetes.

Resources:

Common Operators

Common Tools

Purpose

Provide resources explaining what operators are and what role they play in running data workloads on kubernetes.

Resources:

MiniKube: Used for running kubernetes locally

Ecosystem 101

Purpose

List and describe open source projects that are a part of the DoK Ecosystem. This list is not comprehensive.

DoK Landscape

Databases

Vitess - MySQL-compatible, horizontally scalable, cloud-native database solution
Cassandra - Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.
PostgreSQL - PostgreSQL is a powerful, open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads.

Cloud Native Storage

Rook - Rook is an open source cloud-native storage orchestrator, providing the platform, framework, and support for Ceph storage to natively integrate with cloud-native environments.
CubeFS - CubeFS is a new generation cloud-native open source storage system that supports access protocols such as S3, HDFS, and POSIX.
Longhorn - Longhorn is a lightweight, reliable and easy-to-use distributed block storage system for Kubernetes.

Streaming

Kafka
- Running Apache Spark on Kubernetes
Spark
- Run Apache Spark jobs on Amazon EKS using the OSS Spark Operator
Flink
Kubeflow
Strimzi
- Running Kafka on Kubernetes using Strimzi

AI/ML

Ray
- Managing Ray clusters for ML on Kubernetes with KubeRay

Deploy your first database on kubernetes

Purpose

In this section, you'll learn how to use the knowledge you've accumulated to deploy a database to kubernetes.

Resources:

Next Steps

Purpose

In this section, we'll list some resources to push you to the next level of understanding.

Resources:

About

Apache License 2.0

sklarsa / DoK-Get-Started

Data on Kubernetes - Getting Started Guide

Project Proposal

Project Description

Goals

Table of Contents

Why Stateful on Kubernetes

Intro to Stateful

Purpose

Resources

Types of workloads

Purpose

Operators 101

Purpose

Resources:

Common Operators

Common Tools

Purpose

Resources:

Ecosystem 101

Purpose

Databases

Cloud Native Storage

Streaming

AI/ML

Deploy your first database on kubernetes

Purpose

Resources:

Next Steps

Purpose

Resources:

About