slurm job not able to run
JeffNing opened this issue · comments
AWS ParallelCluster v3.7.2
Config file:
```yaml
Region: us-west-2
Image:
  Os: ubuntu2004
  CustomAmi: ami-06b643a05666f7089
HeadNode:
  InstanceType: t2.micro
  Networking:
    SubnetId: subnet-0c9aa7683d6da903b
    ElasticIp: false
    SecurityGroups: [sg-09cddbd7a2d8a2d4a, sg-063975629d1e61300]
  DisableSimultaneousMultithreading: false
  Ssh:
    KeyName: us-west-2-rnd
    AllowedIps: 0.0.0.0/0
  LocalStorage:
    RootVolume:
      Size: 40
      Encrypted: true
      VolumeType: gp3
      Iops: 3000
      DeleteOnTermination: true
  Iam:
    InstanceRole: arn:aws:iam::aws:role/CustomHeadNodeRole
  Imds:
    Secured: True
  Image:
    CustomAmi: ami-06b643a05666f7089
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10
    Dns:
      DisableManagedDns: true
  SlurmQueues:
    - Name: ua1-pipeline
      CapacityType: ONDEMAND
      Networking:
        SubnetIds:
          - subnet-0c9aa7683d6da903b
      ComputeResources:
        - Name: r6in4
          InstanceType: r6in.4xlarge
          MinCount: 0
          MaxCount: 10
        - Name: i6i4
          InstanceType: r6i.4xlarge
          MinCount: 0
          MaxCount: 500
      Iam:
        S3Access:
          - BucketName: rnd-nextseq-raw
            KeyName: read_only/*
            EnableWriteAccess: false
          - BucketName: rnd-nextseq-analysis
            KeyName: read_and_write/*
            EnableWriteAccess: true
        AdditionalIamPolicies:
          - Policy: arn:aws:iam::aws:policy/AdministratorAccess
      Image:
        CustomAmi: ami-06b643a05666f7089
SharedStorage:
  - MountDir: /shared
    Name: reference
    StorageType: Ebs
    EbsSettings:
      VolumeType: gp3
      Iops: 3000
      Size: 1000
      Encrypted: False
      SnapshotId: snap-086d5c648153019f5
      DeletionPolicy: Retain
  - MountDir: /data
    Name: data
    StorageType: Efs
    EfsSettings:
      ThroughputMode: provisioned
      ProvisionedThroughput: 1024
Monitoring:
  DetailedMonitoring: true
  Logs:
    CloudWatch:
      Enabled: true
      RetentionInDays: 30
      DeletionPolicy: Retain
  Dashboards:
    CloudWatch:
      Enabled: true
AdditionalPackages:
  IntelSoftware:
    IntelHpcPlatform: false
Tags:
  - Key: Name
    Value: UA1-PCluster3
  - Key: department
    Value: DAT
```
Output of the `pcluster describe-cluster` command:
```json
{
  "creationTime": "2023-12-05T23:02:57.712Z",
  "headNode": {
    "launchTime": "2023-12-05T23:06:50.000Z",
    "instanceId": "i-0f244c62107de987a",
    "instanceType": "t2.micro",
    "state": "running",
    "privateIpAddress": "10.108.1.32"
  },
  "version": "3.7.2",
  "clusterConfiguration": {
    "url": "https://parallelcluster-d40a76bfb998e45c-v1-do-not-delete.s3.amazonaws.com/parallelcluster/3.7.2/clusters/ua1-test-zsjy87n1zvuzp9xl/configs/cluster-config.yaml?versionId=f1anT9j6NqnkfwA6k0Ou5Vmn9yCdJaeW&AWSAccessKeyId=ASIA5IO3M4DQJHSKNVZ2&Signature=8SFAUdGDPwED7uKURi2zJ%2FSA8e8%3D&x-amz-security-token=IQoJb3JpZ2luX2VjEBgaCXVzLXdlc3QtMiJGMEQCIFdoK44vZyDe%2BnoYpW6XexFL7HRakSUBPLUxsIHZKfFjAiB0Nt2vEAl749Vv1yLsel3GOp8phD%2Bxn%2F1QiGcOZbZuRyqeAwiB%2F%2F%2F%2F%2F%2F%2F%2F%2F%2F8BEAEaDDkxMTUzMDA1Nzk1MiIMTBuOqXBWay%2BHm177KvICevHAQbyIvvYEC%2BznOBGJ1fJr3OOK%2FetSxZt8OFQoL3G%2BvxqDKp2kIoCjhZzj9eHiViJP2gT0zXRQum79IWOF8rgIMDXKKXGY6vmolpheSdlnGpW1cvIbBdbrOBbE%2BJg3t8MGUuuh8IyJDUq3K9rVak7%2BFBqGYRieBQeoom9Nwc0sFPenHMj9BbjFxzLsOb%2BMnFupuDtn2wnW7qsVxy701Nw%2FH5xa35gf4pI6ZdW42%2FQafcGKQI258f5eZ9g3fGzrGZoTH2iR7EaGObUDE6ymt84IgvX8aWcdz7V1%2BSfBOHVfVFiRfJzaXX5AZ6Q631TXdcIEJHAxacV1y3R0NAGqWOGxgFrd7rpZKWGVDFCHKj2lorNvTbCN98q9WVXxEGkPBygTYg1gvAciMBPVy0o4m3UC6Ba8qd7KRg45GsN2BIHZCvVlAHxrYtrBIGcXsuuqaLB9aqpqSlAMmhobGO9rhN1v0fdz5k%2FHig2yWe1G6QmlpjDl8r6rBjqnAT4%2BaKaQxDqCC4CLaOn7oXYULZU2dj8beWUcxOZ3hjrPdpOAVXojxp9fAJxsjXEUwq0GTPJG0fo2XNM18bonAZ1wAgW2avXhCsDc5nKXi4wPtuReU6n564HpUha%2Bwkg%2BXL2k80O6tCxnsHP3HTEqiJTkTnp1DE4gazQq5yzaY3B5%2BxigwU4IKIiqQzxKP1ULJd6xOM43y6Ug5Ie63K%2BzZo1h3lfzuIpa&Expires=1701824387"
  },
  "tags": [
    {
      "value": "3.7.2",
      "key": "parallelcluster:version"
    },
    {
      "value": "ua1-test",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "DAT",
      "key": "department"
    },
    {
      "value": "UA1-PCluster3",
      "key": "Name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "ua1-test",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-west-2:911530057952:stack/ua1-test/6e193bb0-93c2-11ee-9936-02100fd5ef73",
  "lastUpdatedTime": "2023-12-05T23:02:57.712Z",
  "region": "us-west-2",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}
```
Output of `squeue`:

```
JOBID PARTITION NAME     USER   ST TIME NODES NODELIST(REASON)
1     ua1-pipel Jeff_tes ubuntu CF 5:52 1     ua1-pipeline-dy-i6i4-2
```
```
$ sudo ls -l /var/spool/slurm.state/
total 380
-rw------- 1 slurm slurm     20 Dec  6 00:10 assoc_mgr_state
-rw------- 1 slurm slurm     20 Dec  6 00:05 assoc_mgr_state.old
-rw------- 1 slurm slurm     10 Dec  6 00:10 assoc_usage
-rw------- 1 slurm slurm     10 Dec  6 00:05 assoc_usage.old
-rw-r--r-- 1 slurm slurm      8 Dec  5 23:10 clustername
-rw------- 1 slurm slurm     19 Dec  6 00:10 fed_mgr_state
-rw------- 1 slurm slurm     19 Dec  6 00:05 fed_mgr_state.old
drwx------ 3 slurm slurm   4096 Dec  5 23:28 hash.1
-rw------- 1 slurm slurm   1203 Dec  6 00:10 job_state
-rw------- 1 slurm slurm   1203 Dec  6 00:05 job_state.old
-rw------- 1 slurm slurm     38 Dec  5 23:10 last_config_lite
-rw------- 1 slurm slurm    289 Dec  6 00:10 last_tres
-rw------- 1 slurm slurm    289 Dec  6 00:05 last_tres.old
-rw------- 1 slurm slurm 151221 Dec  6 00:10 node_state
-rw------- 1 slurm slurm 151221 Dec  6 00:05 node_state.old
-rw------- 1 slurm slurm    151 Dec  6 00:10 part_state
-rw------- 1 slurm slurm    151 Dec  6 00:05 part_state.old
-rw------- 1 slurm slurm     10 Dec  6 00:10 qos_usage
-rw------- 1 slurm slurm     10 Dec  6 00:05 qos_usage.old
-rw------- 1 slurm slurm     35 Dec  6 00:10 resv_state
-rw------- 1 slurm slurm     35 Dec  6 00:05 resv_state.old
-rw------- 1 slurm slurm     31 Dec  6 00:10 trigger_state
-rw------- 1 slurm slurm     31 Dec  6 00:05 trigger_state.old
```
/var/log/parallelcluster/slurm_fleet_status_manager.log:
slurm-log.zip
Could you provide the full archive of logs by following this guide: https://docs.aws.amazon.com/parallelcluster/latest/ug/troubleshooting-v3-get-logs.html#troubleshooting-v3-get-logs-archive
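For reference, the guide boils down to one CLI call. A sketch of the export command for this cluster (the S3 bucket name below is a placeholder; use any bucket the `pcluster` CLI credentials can write to):

```shell
# Export the full archive of head-node and compute-node logs to S3
# and download a local copy. "my-log-bucket" is hypothetical.
pcluster export-cluster-logs \
  --cluster-name ua1-test \
  --region us-west-2 \
  --bucket my-log-bucket
```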
Sorry, I no longer have those log files because the cluster has already been taken down. As far as I can remember, the three log files attached to the previous report had content; all other logs were either not available or empty.
From the slurmctld logs, job 1 was not able to start because the compute node failed to join the cluster before ResumeTimeout, which is 30 minutes:

```
[2023-12-05T23:59:28.620] node ua1-pipeline-dy-i6i4-1 not resumed by ResumeTimeout(1800) - marking down and power_save
```
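As an aside: if the nodes are healthy but simply slow to boot (for example, a large custom AMI on first launch), the timeout itself can be raised. On ParallelCluster 3.6 and later, extra `slurm.conf` parameters can be passed through `CustomSlurmSettings`; a sketch, assuming `ResumeTimeout` is not on the deny-list of managed parameters in your version (the 3600-second value is illustrative):

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    ScaledownIdletime: 10
    CustomSlurmSettings:
      # Give dynamic nodes up to 60 minutes to join before
      # slurmctld marks them down and powers them off.
      - ResumeTimeout: 3600
```

This treats the symptom, not the cause; the compute-node logs are still the right place to look first.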
The instance backing the compute node was launched, but it did not become ready within 30 minutes. We need to check the logs of the compute node to see why it got stuck or why it took so long to come up.
There are many things that can prevent a compute node from coming up within 30 minutes; one thing to check is the networking configuration. It looks like you have some customization on the head-node security groups: `SecurityGroups: [sg-09cddbd7a2d8a2d4a, sg-063975629d1e61300]`. Please verify that, with this setting, the head node is able to talk to the compute nodes in the cluster.
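One way to sanity-check this is to confirm that each custom head-node security group has an ingress rule allowing all traffic from the compute nodes' security group (and vice versa). The helper below is a minimal, self-contained sketch of that rule check against the `IpPermissions` shape returned by `aws ec2 describe-security-groups`; the group IDs and rules in the example are hypothetical, not taken from this cluster:

```python
def allows_all_from(ip_permissions, source_sg):
    """Return True if any ingress rule permits all traffic from source_sg.

    `ip_permissions` follows the IpPermissions structure returned by
    `aws ec2 describe-security-groups`.
    """
    for rule in ip_permissions:
        # IpProtocol "-1" means all protocols on all ports.
        if rule.get("IpProtocol") != "-1":
            continue
        for pair in rule.get("UserIdGroupPairs", []):
            if pair.get("GroupId") == source_sg:
                return True
    return False

# Hypothetical ingress rules for a custom head-node security group
# that only opens SSH to the world.
head_node_ingress = [
    {"IpProtocol": "tcp", "FromPort": 22, "ToPort": 22,
     "IpRanges": [{"CidrIp": "0.0.0.0/0"}]},
]

# SSH-only ingress does not satisfy the head/compute connectivity
# requirement, so this prints False.
print(allows_all_from(head_node_ingress, "sg-0compute0example00"))
```

If either direction comes back `False` for your custom groups, the compute nodes may be unable to reach slurmctld on the head node, which would produce exactly this ResumeTimeout failure.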
Also, if you are able to reproduce the problem, it would be helpful to log into the compute instance and provide its logs; you could also check `/var/log/cloud-init-output.log` on the compute node for anything that indicates the root cause.
Thank you!
This issue has been automatically closed because there has been no response to our request for more information from the original author. With only the information that is currently in the issue, we don't have enough information to take action. Please reach out if you have or find the answers we need so that we can investigate further.