aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster


slurm job not able to run

JeffNing opened this issue

aws-parallelcluster v3.7.2

config file:
Region: us-west-2
Image:
  Os: ubuntu2004
  CustomAmi: ami-06b643a05666f7089
HeadNode:
  InstanceType: t2.micro
  Networking:
    SubnetId: subnet-0c9aa7683d6da903b
    ElasticIp: false
    SecurityGroups: [sg-09cddbd7a2d8a2d4a, sg-063975629d1e61300]
  DisableSimultaneousMultithreading: false
  Ssh:
    KeyName: us-west-2-rnd
    AllowedIps: 0.0.0.0/0
  LocalStorage:
    RootVolume:
      Size: 40
      Encrypted: true
      VolumeType: gp3
      Iops: 3000
      DeleteOnTermination: true
  Iam:
    InstanceRole: arn:aws:iam::aws:role/CustomHeadNodeRole
  Imds:
    Secured: True
  Image:
    CustomAmi: ami-06b643a05666f7089
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10
    Dns:
      DisableManagedDns: true
  SlurmQueues:
    - Name: ua1-pipeline
      CapacityType: ONDEMAND
      Networking:
        SubnetIds:
          - subnet-0c9aa7683d6da903b
      ComputeResources:
        - Name: r6in4
          InstanceType: r6in.4xlarge
          MinCount: 0
          MaxCount: 10
        - Name: i6i4
          InstanceType: r6i.4xlarge
          MinCount: 0
          MaxCount: 500
      Iam:
        S3Access:
          - BucketName: rnd-nextseq-raw
            KeyName: read_only/*
            EnableWriteAccess: false
          - BucketName: rnd-nextseq-analysis
            KeyName: read_and_write/*
            EnableWriteAccess: true
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AdministratorAccess
      Image:
        CustomAmi: ami-06b643a05666f7089
SharedStorage:
  - MountDir: /shared
    Name: reference
    StorageType: Ebs
    EbsSettings:
      VolumeType: gp3
      Iops: 3000
      Size: 1000
      Encrypted: False
      SnapshotId: snap-086d5c648153019f5
      DeletionPolicy: Retain
  - MountDir: /data
    Name: data
    StorageType: Efs
    EfsSettings:
      ThroughputMode: provisioned
      ProvisionedThroughput: 1024
Monitoring:
  DetailedMonitoring: true
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 30
      DeletionPolicy: Retain
  Dashboards:
    CloudWatch:
      Enabled: true
AdditionalPackages:
  IntelSoftware:
    IntelHpcPlatform: false
Tags:
  - Key: Name
    Value: UA1-PCluster3
  - Key: department
    Value: DAT
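
As a quick sanity check, a configuration like the one above can be validated before any resources are created by using the CLI's dry-run option (a sketch, assuming the file is saved locally, e.g. as cluster-config.yaml):

# Validate the configuration without creating the cluster
pcluster create-cluster --cluster-name ua1-test \
  --cluster-configuration cluster-config.yaml --dryrun true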

Output of pcluster describe-cluster command:
{
"creationTime": "2023-12-05T23:02:57.712Z",
"headNode": {
"launchTime": "2023-12-05T23:06:50.000Z",
"instanceId": "i-0f244c62107de987a",
"instanceType": "t2.micro",
"state": "running",
"privateIpAddress": "10.108.1.32"
},
"version": "3.7.2",
"clusterConfiguration": {
"url": "https://parallelcluster-d40a76bfb998e45c-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.7.2/clusters/ua1-test-zsjy87n1zvuzp9xl/configs/cluster-config.yaml?versionId=f1anT9j6NqnkfwA6k0Ou5Vmn9yCdJaeW&AWSAccessKeyId=ASIA5IO3M4DQJHSKNVZ2&Signature=8SFAUdGDPwED7uKURi2zJ%2FSA8e8%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEBgaCXVzLXdlc3QtMiJGMEQCIFdoK44vZyDe%2BnoYpW6XexFL7HRakSUBPLUxsIHZKfFjAiB0Nt2vEAl749Vv1yLsel3GOp8phD%2Bxn%2F1QiGcOZbZuRyqeAwiB%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAEaDDkxMTUzMDA1Nzk1MiIMTBuOqXBWay%2BHm177KvICevHAQbyIvvYEC%2BznOBGJ1fJr3OOK%2FetSxZt8OFQoL3G%2BvxqDKp2kIoCjhZzj9eHiViJP2gT0zXRQum79IWOF8rgIMDXKKXGY6vmolpheSdlnGpW1cvIbBdbrOBbE%2BJg3t8MGUuuh8IyJDUq3K9rVak7%2BFBqGYRieBQeoom9Nwc0sFPenHMj9BbjFxzLsOb%2BMnFupuDtn2wnW7qsVxy701Nw%2FH5xa35gf4pI6ZdW42%2FQafcGKQI258f5eZ9g3fGzrGZoTH2iR7EaGObUDE6ymt84IgvX8aWcdz7V1%2BSfBOHVfVFiRfJzaXX5AZ6Q631TXdcIEJHAxacV1y3R0NAGqWOGxgFrd7rpZKWGVDFCHKj2lorNvTbCN98q9WVXxEGkPBygTYg1gvAciMBPVy0o4m3UC6Ba8qd7KRg45GsN2BIHZCvVlAHxrYtrBIGcXsuuqaLB9aqpqSlAMmhobGO9rhN1v0fdz5k%2FHig2yWe1G6QmlpjDl8r6rBjqnAT4%2BaKaQxDqCC4CLaOn7oXYULZU2dj8beWUcxOZ3hjrPdpOAVXojxp9fAJxsjXEUwq0GTPJG0fo2XNM18bonAZ1wAgW2avXhCsDc5nKXi4wPtuReU6n564HpUha%2Bwkg%2BXL2k80O6tCxnsHP3HTEqiJTkTnp1DE4gazQq5yzaY3B5%2BxigwU4IKIiqQzxKP1ULJd6xOM43y6Ug5Ie63K%2BzZo1h3lfzuIpa&Expires=1701824387"
},
"tags": [
{
"value": "3.7.2",
"key": "parallelcluster:version"
},
{
"value": "ua1-test",
"key": "parallelcluster:cluster-name"
},
{
"value": "DAT",
"key": "department"
},
{
"value": "UA1-PCluster3",
"key": "Name"
}
],
"cloudFormationStackStatus": "CREATE_COMPLETE",
"clusterName": "ua1-test",
"computeFleetStatus": "RUNNING",
"cloudformationStackArn": "arn:aws:cloudformation:us-west-2:911530057952:stack/ua1-test/6e193bb0-93c2-11ee-9936-02100fd5ef73",
"lastUpdatedTime": "2023-12-05T23:02:57.712Z",
"region": "us-west-2",
"clusterStatus": "CREATE_COMPLETE",
"scheduler": {
"type": "slurm"
}
}

squeue output:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1 ua1-pipel Jeff_tes ubuntu CF 5:52 1 ua1-pipeline-dy-i6i4-2
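
A job sitting in CF like this means the dynamic node is still powering up. Assuming access to the head node, the Slurm-side state and reason for that node can be checked with something along these lines:

# Nodes that are down/drained, together with the reason Slurm recorded
sinfo -R

# Details for the specific dynamic node the job is waiting on
scontrol show node ua1-pipeline-dy-i6i4-2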

$ sudo ls -l /var/spool/slurm.state/
total 380
-rw------- 1 slurm slurm 20 Dec 6 00:10 assoc_mgr_state
-rw------- 1 slurm slurm 20 Dec 6 00:05 assoc_mgr_state.old
-rw------- 1 slurm slurm 10 Dec 6 00:10 assoc_usage
-rw------- 1 slurm slurm 10 Dec 6 00:05 assoc_usage.old
-rw-r--r-- 1 slurm slurm 8 Dec 5 23:10 clustername
-rw------- 1 slurm slurm 19 Dec 6 00:10 fed_mgr_state
-rw------- 1 slurm slurm 19 Dec 6 00:05 fed_mgr_state.old
drwx------ 3 slurm slurm 4096 Dec 5 23:28 hash.1
-rw------- 1 slurm slurm 1203 Dec 6 00:10 job_state
-rw------- 1 slurm slurm 1203 Dec 6 00:05 job_state.old
-rw------- 1 slurm slurm 38 Dec 5 23:10 last_config_lite
-rw------- 1 slurm slurm 289 Dec 6 00:10 last_tres
-rw------- 1 slurm slurm 289 Dec 6 00:05 last_tres.old
-rw------- 1 slurm slurm 151221 Dec 6 00:10 node_state
-rw------- 1 slurm slurm 151221 Dec 6 00:05 node_state.old
-rw------- 1 slurm slurm 151 Dec 6 00:10 part_state
-rw------- 1 slurm slurm 151 Dec 6 00:05 part_state.old
-rw------- 1 slurm slurm 10 Dec 6 00:10 qos_usage
-rw------- 1 slurm slurm 10 Dec 6 00:05 qos_usage.old
-rw------- 1 slurm slurm 35 Dec 6 00:10 resv_state
-rw------- 1 slurm slurm 35 Dec 6 00:05 resv_state.old
-rw------- 1 slurm slurm 31 Dec 6 00:10 trigger_state
-rw------- 1 slurm slurm 31 Dec 6 00:05 trigger_state.old

/var/log/parallelcluster/slurm_fleet_status_manager.log:

slurm-log.zip

Sorry, I no longer have those log files because the cluster was already taken down. As far as I can remember, the three log files attached to the previous report have content; all the other logs were either not available or empty.

From the slurmctld logs, job 1 was not able to start because the compute node failed to join the cluster before ResumeTimeout, which is 30 minutes:

[2023-12-05T23:59:28.620] node ua1-pipeline-dy-i6i4-1 not resumed by ResumeTimeout(1800) - marking down and power_save
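
For reference, the timeout values slurmctld is actually running with can be confirmed on the head node (a sketch, assuming the standard ParallelCluster Slurm install):

# Print the effective resume/suspend settings
scontrol show config | grep -Ei 'ResumeTimeout|SuspendTime|ResumeProgram'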

The instance backing the compute node is launched, but it is not ready within 30 minutes, so we need the logs from the compute node to see why it gets stuck or why it takes so long to come up.
There are many things that can prevent a compute node from coming up within 30 minutes; one thing to check is the networking configuration. It looks like you have some customization on the head node security groups: SecurityGroups: [sg-09cddbd7a2d8a2d4a, sg-063975629d1e61300]. Please verify that the head node can talk to the compute nodes in the cluster with this setting.
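One way to review those rules, assuming the AWS CLI is configured for the same account and region, is something like the following; the head node and compute nodes need to be able to reach each other for Slurm control traffic and the shared-storage exports:

# Dump the ingress/egress rules of the two custom security groups
aws ec2 describe-security-groups \
  --region us-west-2 \
  --group-ids sg-09cddbd7a2d8a2d4a sg-063975629d1e61300 \
  --query 'SecurityGroups[].{Id:GroupId,Ingress:IpPermissions,Egress:IpPermissionsEgress}'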
Also, it would help if you could reproduce the problem, log into the compute instance, and provide its logs; /var/log/cloud-init-output.log on the compute node is a good place to look for something that indicates the root cause.
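
If the failure does reproduce, something along these lines (assuming SSH access from the head node and the default log locations) should capture the relevant compute-node logs before the instance is terminated; replace the placeholder with the compute node's private address:

# From the head node, while the compute instance is still up
ssh <compute-node-private-ip> \
  'sudo tail -n 200 /var/log/cloud-init-output.log /var/log/parallelcluster/computemgtd.log /var/log/slurmd.log'

# Alternatively, archive the cluster logs for later inspection
pcluster export-cluster-logs --cluster-name ua1-test --bucket <your-bucket> --region us-west-2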

Thank you!

This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.