aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow
Distributed training using Kubeflow on Amazon EKS
Stargazers: 72
Watchers: 12
Issues: 100
Forks: 43
aws-samples/amazon-eks-machine-learning-with-terraform-and-kubeflow Issues
nemo-megatron container needs fixed version for transformers and datasets
Closed
2 months ago
neuronx-nemo-megatron examples have runtime error saving checkpoint with save_xser=True
Closed
2 months ago
Machine learning data process chart needs support for creating inline scripts
Closed
3 months ago
Add Helm chart for databricks-dolly-15k dataset
Closed
3 months ago
Need Helm chart for Hugging Face model snapshot download
Closed
3 months ago
Red pajama dataset download link is defunct
Closed
3 months ago
Neuronx distributed Llama2 examples do not load latest checkpoint if it exists
Closed
3 months ago
How do I set MAXPOD in EKS?
Updated
3 months ago
Comments count
6
neuronx-distributed examples save up to last 10 checkpoints which consumes too much disk space
Closed
3 months ago
neuronx-nemo-megatron examples need checkpointing enabled
Closed
3 months ago
Can I use the Kiali dashboard?
Updated
3 months ago
Comments count
1
Neuronx distributed Llama2 7B PyTorch Lightning example has fatal error during checkpointing
Updated
3 months ago
Creating FSx for Lustre Data Repository Association: BadRequest: Amazon FSx is unable to validate access to the S3 bucket.
Closed
4 months ago
Comments count
12
Torch distributed RuntimeError: Socket Timeout
Updated
4 months ago
Comments count
1
Machine learning charts for training need to support dynamic EBS volume
Closed
4 months ago
Data process machine learning charts need to support dynamic EBS volume
Closed
4 months ago
Need EBS CSI driver storage class with volumeBindingMode WaitForFirstConsumer for EBS volume type gp3
Closed
4 months ago
Katib UI is not detecting auth request header
Closed
4 months ago
Allow FSx for Lustre file-system storage capacity to be configurable via Terraform variable
Closed
4 months ago
Comments count
1
Need a way to have Karpenter create single AZ GPU clusters when using EFA
Closed
4 months ago
The manifest file eks-cluster/utils/attach-pvc.yaml should attach to both efs and fsx pvcs
Closed
4 months ago
Git clone directory created by machine learning charts is not getting cleaned in case of failure
Closed
4 months ago
Some training jobs require VPC CIDR Ingress in EKS cluster managed security group
Closed
4 months ago
Helm chart pipeline step does not complete when the job completes
Closed
4 months ago
Comments count
1
Helm charts pipeline does not show output of helm install command
Closed
5 months ago
Trainium clusters need to be in a single subnet for EFA collective communications
Closed
5 months ago
Helm chart kfp component does not need to include default values file
Closed
5 months ago
In machine-learning charts, pre_script needs to execute after git clone
Closed
5 months ago
Comments count
1
MaskRCNN related helm charts need to be relocated
Closed
5 months ago
Need to add script for configuring S3 backend for Terraform state
Closed
5 months ago
Need to refactor kubeflow platform charts into a single sub-folder
Closed
5 months ago
Comments count
1
FSx for Lustre automatic export to S3 is not configured correctly
Closed
5 months ago
EFA plugin helm chart install values are incorrect
Closed
5 months ago
Need to refactor top-level container and container-optimized folders under a new top-level containers folder
Closed
5 months ago
Comments count
1
build-ecr-images.sh script fails due to AWS login failure
Closed
5 months ago
Remove unused files
Closed
5 months ago
Need to add support for kubeflow components used in training
Closed
6 months ago
Helm chart pv-fsx template YAML files do not explicitly reference storage class name
Closed
6 months ago
EFS and FSx for Lustre PVC attach pods get stuck in Terminating state
Closed
6 months ago
The version of aws-ia/eks-blueprints-addons/aws used is not fixed
Closed
6 months ago
Karpenter module is being used without a version which is breaking the module
Closed
6 months ago
Need to refactor Mask RCNN tutorial to separate training and testing Helm charts
Closed
6 months ago
Refactor kubectl_manifest into helm charts
Closed
6 months ago
Required file is in legacy folder
Closed
7 months ago
Need to refactor karpenter and kubeflow components into separate helm charts
Closed
7 months ago
Need to refactor terraform script into separate files
Closed
7 months ago
Replace provisioner local-exec with kubectl_manifest or helm_release
Closed
7 months ago
Add new terraform variables for system node group instance types and instance volume size
Closed
7 months ago
Use Karpenter to manage accelerator nodes
Closed
7 months ago
Need to add support for EC2 trn1 and inf2 instance types
Closed
7 months ago