eliphatfs / kubetk

A set of tools to work efficiently with IO in a distributed cluster.

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

kubetk

kubetk is a set of tools designed to utilize a cluster with high-latency volumes better.

Installation

pip install git+https://github.com/eliphatfs/kubetk.git

Single-Node CLI Scripts

These scripts use concurrency to boost the throughput of high-latency volumes. Use -h option to see usage.

kubetk-bulk-zip

The script reads files concurrently, then stores into a zip file without compression. The script stores contents of files of the threads are reading in memory. It is suitable for archiving large amount of small files before transfer. It also recurses into directories with good concurrency.

Benchmark

Cross-zone (files stored in us-west and execution node is in us-central) archiving 5000 small thumbnails that sum up to 559MB:

Binary Threads Wall Time (s)
zip - 1216
kubetk-bulk-zip 32 (default) 16.3
kubetk-bulk-zip 256 3.0

kubetk-rmtree

The script recurses into directories and removes all the contents concurrently.

Benchmark

Cross-zone (files stored in us-east and execution node is in us-central):

Binary Threads Throughput (files/s)
rm - 14.01
kubetk-rmtree 32 (default) 15.52
kubetk-rmtree 1024 15.93

kubetk-cp

kubetk-cp src dst is a concurrent version of cp -r src dst.

Benchmark

Cross-zone (files stored in us-west and execution node is in us-central) copying 5000 small thumbnails that sum up to 559MB (destination is local SSD):

Binary Threads Wall Time (s)
cp - 423.7
kubetk-cp 32 (default) 22.7
kubetk-cp 256 4.3

Distributed Scheduler-Worker Architecture

Scheduler

The implementation is in kubetk.arch.scheduler.

Monitoring

See statsched -h.

Worker

The implementation is in kubetk.arch.worker.

Pipelining

We provide pipelining functionality in worker for better compute to IO utility ratio.

About

A set of tools to work efficiently with IO in a distributed cluster.

License:Apache License 2.0


Languages

Language:Python 100.0%