aws / aws-parallelcluster

AWS ParallelCluster is an AWS-supported, open-source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

Unable to launch job in a p2 instance

diegoamoros opened this issue

Environment:

AWS ParallelCluster [aws-parallelcluster-3.7.2]
region: eu-west-1
OS: [alinux2]
Scheduler: [slurm]
HeadNode instance type: [c5a.large]
Compute CPU instance type: [c5a.8xlarge] Partition = comp4x-cpu16-mem32-od
Compute GPU instance type: [p2.xlarge] Partition = gpu1x-gpu1-cpu4-od

Service Quota limit:

Running On-Demand P instances: 8
Running On-Demand Standard (A, C, D, H, I, M, R, T, Z) instances: 320

Custom scripts

pclusterTagsAndBudget post install script

Bug description and how to reproduce:

When I launch a job on the c5a.8xlarge instance, it works OK. But when I launch the same job on the p2.xlarge instance (JobId=35), the instance is launched but then terminated after a few minutes, and the job remains in CF or PD state in squeue the whole time.

P2 configuration

- Name: gpu1x-gpu1-cpu4-od
  CapacityType: ONDEMAND
  ComputeResources:
    - Name: p21xlargeod
      MinCount: 0
      MaxCount: 30
      Instances:
        - InstanceType: p2.xlarge
  Networking:
    SubnetIds:
      - subnet-xxxxxxxxx
  CustomActions:
    OnNodeConfigured:
      Script: s3://xxxxx/post_install.sh
  Iam:
    S3Access:
      - BucketName: xxxxxxxx
        EnableWriteAccess: false
    AdditionalIamPolicies:
      - Policy: >-
          arn:aws:iam::xxxxx
  ComputeSettings:
    LocalStorage:
      RootVolume:
        VolumeType: gp3

/var/log/slurmctld.log

[2024-01-12T10:05:46.156] sched: Allocate JobId=35 NodeList=gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2 #CPUs=1 Partition=gpu1x-gpu1-cpu4-od
[2024-01-12T10:05:47.500] POWER: no more nodes to resume for job JobId=35
[2024-01-12T10:05:47.500] POWER: power_save: waking nodes gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2
[2024-01-12T10:08:58.020] update_node: node gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2 reason set to: Scheduler health check failed
[2024-01-12T10:08:58.020] requeue job JobId=35 due to failure of node gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2
[2024-01-12T10:08:58.020] Requeuing JobId=35
[2024-01-12T10:08:58.020] powering down node gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2
[2024-01-12T10:09:17.516] POWER: power_save: suspending nodes gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2
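
For reference, the reason Slurm recorded for the failed node can also be checked directly on the head node; a minimal sketch using standard Slurm commands (the node name is the one from the log above):

# List down/drained nodes together with the reason Slurm recorded
sinfo -R

# Full node record, including State and Reason
scontrol show node gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2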

/var/log/parallelcluster/slurm_resume.log

2024-01-12 10:05:47,688 - [slurm_plugin.resume:_resume] - INFO - Launching EC2 instances for the following Slurm nodes: gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2
2024-01-12 10:05:47,704 - [slurm_plugin.resume:_resume] - INFO - Current state of Slurm nodes to resume: [('gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2', 'MIXED+CLOUD+NOT_RESPONDING+POWERING_UP')]
2024-01-12 10:05:47,735 - [botocore.credentials:load] - INFO - Found credentials from IAM Role: pcluster-RoleHeadNode-xxxx
2024-01-12 10:05:47,781 - [slurm_plugin.instance_manager:_add_instances_for_nodes] - INFO - Launching instances for Slurm nodes (x1) ['gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2']
2024-01-12 10:05:47,781 - [slurm_plugin.fleet_manager:create_fleet] - INFO - Launching instances with create_fleet API. Parameters: {'LaunchTemplateConfigs': [{'LaunchTemplateSpecification': {'LaunchTemplateName': 'pcluster-gpu1x-gpu1-cpu4-od-p21xlargeod', 'Version': '$Latest'}, 'Overrides': [{'InstanceType': 'p2.xlarge', 'SubnetId': 'subnet-015b1e1ad2c5f32de'}]}], 'TargetCapacitySpecification': {'TotalTargetCapacity': 1, 'DefaultTargetCapacityType': 'on-demand'}, 'Type': 'instant', 'OnDemandOptions': {'AllocationStrategy': 'lowest-price', 'SingleInstanceType': True, 'SingleAvailabilityZone': True, 'MinTargetCapacity': 1, 'CapacityReservationOptions': {'UsageStrategy': 'use-capacity-reservations-first'}}}
2024-01-12 10:05:50,022 - [slurm_plugin.instance_manager:_update_slurm_node_addrs_and_failed_nodes] - INFO - Nodes are now configured with instances: (x1) ["('gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2', EC2Instance(id='i-0679xxx', private_ip='10.0.1.36', hostname='ip-10-0-1-36', launch_time=datetime.datetime(2024, 1, 12, 10, 5, 49, tzinfo=tzlocal()), slurm_node=None))"]
2024-01-12 10:05:50,022 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - Saving assigned hostnames in DynamoDB
2024-01-12 10:05:50,055 - [slurm_plugin.instance_manager:_store_assigned_hostnames] - INFO - Database update: COMPLETED
2024-01-12 10:05:50,055 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - Updating DNS records for XXXXXXXXXXXXXXXX - pcluster.pcluster.
2024-01-12 10:05:50,433 - [slurm_plugin.instance_manager:_update_dns_hostnames] - INFO - DNS records update: COMPLETED
2024-01-12 10:05:50,434 - [slurm_plugin.resume:_resume] - INFO - Successfully launched nodes (x1) ['gpu1x-gpu1-cpu4-od-dy-p21xlargeod-2']
2024-01-12 10:05:50,435 - [slurm_plugin.resume:main] - INFO - ResumeProgram finished.

Hi @diegoamoros

are you using ParallelCluster 3.7.2 or 3.8.0?

In 3.8.0 this is a known limitation due to an incompatibility between the OpenRM NVIDIA drivers and the NVIDIA Kepler architecture used in p2 instances (if you're using 3.8.0 you need to create a custom AMI with older NVIDIA drivers).

If you're using 3.7.2 instead, we need to check what's happening on one of the nodes failing the bootstrap. In this case you can retrieve the console logs of the terminated instance with: aws ec2 get-console-output --instance-id i-1234567890abcdef0 --output text
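
For example, saving the output to a file for later inspection (the instance ID is a placeholder; the real one for the failed node appears in /var/log/parallelcluster/slurm_resume.log, e.g. the i-0679xxx entry above):

# Console output may only be available for a limited time after the instance terminates
aws ec2 get-console-output \
    --instance-id <instance-id-of-failed-node> \
    --region eu-west-1 \
    --output text > console-output.txt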

Hi @enrico-usai
Thank you!

I'm using 3.7.2. I just retrieved the console logs and I don't see any errors.
Here are the last 10 lines:

<13>Jan 12 13:00:45 user-data: Recipe: aws-parallelcluster-platform::sudo_config
<13>Jan 12 13:00:45 user-data:   * template[/etc/sudoers.d/99-parallelcluster-user-tty] action create[2024-01-12T13:00:45+00:00] INFO: Processing template[/etc/sudoers.d/99-parallelcluster-user-tty] action create (aws-parallelcluster-platform::sudo_config line 15)
<13>Jan 12 13:00:45 user-data: [2024-01-12T13:00:45+00:00] INFO: template[/etc/sudoers.d/99-parallelcluster-user-tty] created file /etc/sudoers.d/99-parallelcluster-user-tty
<13>Jan 12 13:00:45 user-data: 
<13>Jan 12 13:00:45 user-data:     - create new file /etc/sudoers.d/99-parallelcluster-user-tty[2024-01-12T13:00:45+00:00] INFO: template[/etc/sudoers.d/99-parallelcluster-user-tty] updated file contents /etc/sudoers.d/99-parallelcluster-user-tty
<13>Jan 12 13:00:45 user-data: 
<13>Jan 12 13:00:45 user-data:     - update content in file /etc/sudoers.d/99-parallelcluster-user-tty from none to 584e08
<13>Jan 12 13:00:45 user-data:     --- /etc/sudoers.d/99-parallelcluster-user-tty       2024-01-12 13:00:45.091170156 +0000
<13>Jan 12 13:00:45 user-data:     +++ /etc/sudoers.d/.chef-99-parallelcluster-user-tty20240112-5573-morzlt     2024-01-12 13:00:45.091170156 +0000
<13>Jan 12 13:00:45 user-data:     @@ -1 +1,2 @@
<13>Jan 12 13:00:45 user-data:     +Defaults:ec2-user !requiretty[2024-01-12T13:00:45+00:00] INFO: templ        2024-01-12T13:01:11+00:00

Hi @diegoamoros

I don't think these logs are complete: here I see the sudo_config recipe execution, but that is just an initial step of the configuration flow.
You should try retrieving the log again after a while to see if there is more recent data. Please attach the entire log.

The other thing you can check is networking. Are you using the same subnet for p2 and c5a?

Hi @enrico-usai

Yes, these logs are incomplete.
Attached you can find the entire console-output.
Thank you!

Yes. I'm using the same subnet for p2 and c5a.

fullLog.txt

Hi @diegoamoros

I've looked through the attached log and it still looks truncated (it is possible that the console log itself did not record everything). I see no errors in the log, but it ends well before the point where I would expect output from the bootstrap process.

Do you have custom actions defined for the compute resource? Have you tried creating a cluster without any custom actions?

Are there any CloudWatch logs present in the cluster's log group for the failed compute nodes?

Hi @davprat.

Do you have custom actions defined for the compute resource?
Yes, I have custom actions for the compute node (OnNodeConfigured): post_install.sh for cost allocation tags (https://aws.amazon.com/es/blogs/compute/using-cost-allocation-tags-with-aws-parallelcluster/).

Have you tried creating a cluster without any custom actions?
No, I haven't. On-demand c5 instances work without problems, but I have also tried c5 spot instances and they do not work either.

Are there any CloudWatch logs present in the cluster's log group for the failed compute nodes?
No, only the head node logs that I already have. How can I access the compute node logs?

Thank you!

Hi @diegoamoros,

How can I access the compute node logs?

If the compute nodes made it far enough through the boot process, they should have started the CloudWatch agent, which publishes most of the important logs to CloudWatch. If you go to the CloudWatch console, there should be a log group for the cluster. In that log group, there should be several log streams from all the instances launched by ParallelCluster, unless you have specifically disabled this feature.
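
If the log group is there, a minimal AWS CLI sketch for locating it and reading a compute node's streams (the log group name pattern and stream names are assumptions; check your account for the exact values):

# Find the cluster's log group (names typically start with /aws/parallelcluster/)
aws logs describe-log-groups --log-group-name-prefix /aws/parallelcluster/

# List streams for a given compute node (the failed node's hostname was ip-10-0-1-36)
aws logs describe-log-streams --log-group-name <cluster-log-group> \
    --log-stream-name-prefix ip-10-0-1-36

# Read one stream
aws logs get-log-events --log-group-name <cluster-log-group> \
    --log-stream-name <stream-name>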

Here is a link with instructions on retrieving logs using the pcluster CLI as an alternative.
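
A short sketch of the pcluster CLI route (3.x command names; cluster name, stream name and bucket are placeholders):

# List the log streams available for the cluster (head node and compute nodes)
pcluster list-cluster-log-streams --cluster-name <cluster-name> --region eu-west-1

# Fetch the events of one stream, or export all cluster logs for offline analysis
pcluster get-cluster-log-events --cluster-name <cluster-name> \
    --log-stream-name <stream-name> --region eu-west-1
pcluster export-cluster-logs --cluster-name <cluster-name> \
    --bucket <s3-bucket-name> --region eu-west-1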

Hi @diegoamoros, are you still running into this issue? Were you able to retrieve the logs?

Hi @judysng. I am sorry. I have been busy and haven't been able to continue with the issue, but I will try to retrieve the logs shortly.

We have encountered the same issue. The drivers in the default AMI for PC 3.7.2 are too new for p2 (K80s).

AWS ParallelCluster AMI for ubuntu2204, kernel-5.15.0-1026-aws, lustre-5.15.0.1031.29, efa-2.5.0-1.amzn1, dcv-2023.0.15487-1, **nvidia-535.54.03**, cuda-12.2.20230626

https://www.nvidia.com/download/driverResults.aspx/206014/en-us/

K80s are not supported

You would have to build your own AMI if you need to use p2 instances with 3.7.2.
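
If you do build such an AMI, a minimal sketch of how it would be referenced in the cluster configuration (the AMI ID is a placeholder, and the driver choice is an assumption: it needs to be an older, pre-OpenRM branch that still supports Kepler/K80):

Image:
  Os: alinux2   # or your chosen OS
  # Placeholder AMI baked with an older NVIDIA driver branch that still supports Kepler (K80)
  CustomAmi: ami-xxxxxxxxxxxxxxxxx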

Thanks @roshantm !!

Would it work on an earlier version of ParallelCluster?