kubeflow/pytorch-operator
PyTorch on Kubernetes
Stargazers: 302
Watchers: 31
Issues: 147
Forks: 143
kubeflow/pytorch-operator Issues
unable to build image for ppc64le
Updated
3 years ago
PytorchJob DDP training will stop if I delete a worker pod
Updated
3 years ago
Comments count
2
What is the difference between master and worker?
Closed
3 years ago
Comments count
6
Multi-gpu in a single pod
Updated
3 years ago
Comments count
2
run https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/test/test_e2e.py failed
Updated
3 years ago
Comments count
1
How to use DDP in pytorch operator?
Closed
3 years ago
Comments count
3
whether multi-gpu-per-pod setup be supported in PytorchJob
Updated
3 years ago
Comments count
1
service label mismatches selector, which results in inconsistency
Updated
3 years ago
Comments count
3
The training hangs after reloading one of master/worker pods
Updated
3 years ago
Comments count
5
Is python sdk still being maintained?
Updated
3 years ago
Comments count
7
container "pytorch" is waiting to start: PodInitializing
Updated
3 years ago
Comments count
20
Can not use volcano for Gang Scheduling
Closed
3 years ago
Can I freeze pytorchjob training pods and migrate them to other nodes?
Updated
3 years ago
Comments count
9
Pytorch version may have an effect on the training reproduction
Updated
3 years ago
Comments count
4
Different DDP training results of PytorchJob and Bare Metal
Updated
3 years ago
Comments count
6
Can I use hostNetwork to run PytorchJob like on bare metal
Closed
3 years ago
Comments count
3
Can PytorchJob skip or cancel the init container?
Updated
3 years ago
Comments count
2
Mnist dataset server is down
Updated
3 years ago
Comments count
5
volcano change the PodGroup CRD APIGroup to volcano.sh
Updated
3 years ago
Comments count
1
why worker need initContainer in pytorch-operator?
Closed
3 years ago
Comments count
2
[feat] Support PyTorch 1.9
Updated
3 years ago
Comments count
3
Upgrade to v1 CRDs
Updated
3 years ago
Comments count
1
kubeflow pipelines sdk, distributed multi-node training with autoscaling
Closed
3 years ago
Comments count
4
PytorchJob replicas has different node affinity behaviors compared with Deployment
Updated
3 years ago
Comments count
4
feel confused about world_size
Closed
3 years ago
`init-pytorch` init container image configurable
Closed
3 years ago
Comments count
4
PyTorch Lightning Example.
Closed
3 years ago
Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator
Updated
3 years ago
Comments count
4
'./pytorch_job_sendrecv.yaml' missing in pytorch-operator/examples/smoke-dist
Closed
3 years ago
Comments count
6
can I use PyTorchJobClient inside a pod of the cluster?
Updated
3 years ago
Comments count
1
Worker template should be configurable.
Updated
3 years ago
Comments count
1
'host not found' error occurs during PyTorch distributed learning
Updated
3 years ago
Comments count
1
NCCL "Connection Refused" for Worker Pods
Updated
3 years ago
Comments count
1
worker get connection timed out error in user namespace with sidecar.istio.io/inject=false
Closed
3 years ago
Comments count
1
is there a simpler way to install pytorch-operator
Closed
3 years ago
Comments count
2
Please create v1.2-branch
Closed
3 years ago
Comments count
3
pytorch-operator: Consolidate manifests
Closed
3 years ago
Comments count
1
PyTorch Operator: Move manifests development upstream
Closed
3 years ago
Operator has invalid memory address error on specific pytorchjob spec
Updated
3 years ago
Comments count
1
pytorch-operator pod CheckCRDExist failed
Closed
3 years ago
Comments count
3
dist.init_process_group stuck
Updated
3 years ago
Comments count
9
Does pytorch-operator just simplify the use of nn.parallel.DistributedDataParallel on multi nodes of multi gpu?
Closed
4 years ago
Comments count
2
can I use gpus on specific node to train
Closed
4 years ago
Comments count
5
how can I run a pytorch job with all my Gpu resources
Closed
4 years ago
Comments count
4
Make manifest test friendly
Closed
4 years ago
Comments count
2
Do not trigger presubmit jobs for simple changes
Updated
4 years ago
Comments count
1
Support Torch Elastic in pytorch operator
Updated
4 years ago
Comments count
2
Activate Travis in PR check
Updated
4 years ago
Comments count
2
[bug] Unit test is broken
Updated
4 years ago
Comments count
4
how to create a local non-distributed training
Closed
4 years ago
Comments count
7