zhaiyuyong / cloud

PaddleCloud distributed training job scheduling

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

PaddlePaddle Cloud

Build Status

PaddlePaddle Cloud is a combination of PaddlePaddle and Kubernetes. It supports fault-recoverable and fault-tolerant large-scaled distributed deep learning. We can deploy it on public cloud and on-premise clusters.

PaddlePaddle Cloud includes the following components:

  • paddlectl: A command-line tool that talks to paddlecloud and paddle-fs.
  • paddlecloud: An HTTP server that exposes Kubernetes as a Web service.
  • paddle-fs: An HTTP server that exposes the CephFS distributed filesystem as a Web service.
  • EDL (elastic deep learning): A Kubernetes controller that supports elastic scheduling of deep learning jobs and other jobs.
  • Fault-tolerant distributed deep learning: This part is in the Paddle repo.

Tutorials

How To

Directory structure

.
├── demo: distributed version of https://github.com/PaddlePaddle/book programs
├── doc: documents
├── docker: scripts to build Docker image to run PaddlePaddle distributed
├── go
│   ├── cmd
│   │   ├── edl: entry of EDL controller binary
│   │   ├── paddlecloud: the command line client of PaddlePaddle Cloud (will be deprecated)
│   │   ├── paddlectl: the command line client of PaddlePaddle Cloud
│   │   └── pfsserver: entry of PaddleFS binary
│   ├── edl: EDL implementation
│   ├── filemanager: PaddleFS implementation
│   ├── paddlecloud: command line client implement (will be deprecated)
│   ├── paddlectl: command line client implement
│   ├── scripts: scripts for Go code generation
├── k8s: YAML files to create different components of PaddlePaddle Cloud
│   ├── edl: TPR definition and EDL controller for TraningJob resource
│   │   ├── autoscale_job: A sample TrainingJob that can scale
│   │   └── autoscale_load: A sample cluster job demonstrating a common workload
│   ├── minikube: YAML files to deploy on local mini-kube environment
│   └── raw_job: A demo job demonstrates how to run PaddlePaddle jobs in cluster
└── python: PaddlePaddle Cloud REST API server

About

PaddleCloud distributed training job scheduling

License:Apache License 2.0


Languages

Language:JavaScript 41.9%Language:Go 31.1%Language:Python 23.3%Language:Shell 2.1%Language:HTML 1.4%Language:CSS 0.2%Language:Dockerfile 0.1%