aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

(3.9.0-3.9.1) Default ThreadsPerCore Slurm setting causes reduced CPU utilization

nihitsaxena4 opened this issue

Bug description

ParallelCluster does not explicitly set ThreadsPerCore in the compute node configuration, so Slurm falls back to its default value of 1. Slurm v23.11 introduced a change that requires the ThreadsPerCore setting to match the number of threads per physical core on the underlying instance. On compute resources that support hardware multi-threading and have not disabled it, this limits CPU utilization to around 50%, because Slurm never allocates work to the secondary virtual cores.
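The arithmetic behind the roughly 50% figure can be sketched as follows. This is a simplified illustration, not Slurm's actual allocation logic: the function name `schedulable_fraction` and its parameters are hypothetical, and the model assumes that Slurm uses at most the configured ThreadsPerCore threads on each physical core.

```python
def schedulable_fraction(vcpus: int,
                         hw_threads_per_core: int,
                         configured_threads_per_core: int = 1) -> float:
    """Estimate the fraction of vCPUs that can receive work.

    Hypothetical helper for illustration only. Assumption: each physical
    core contributes at most `configured_threads_per_core` schedulable
    threads, capped by what the hardware actually provides.
    """
    if vcpus <= 0 or hw_threads_per_core <= 0 or configured_threads_per_core <= 0:
        raise ValueError("all arguments must be positive")
    if vcpus % hw_threads_per_core != 0:
        raise ValueError("vcpus must be a multiple of hw_threads_per_core")

    physical_cores = vcpus // hw_threads_per_core
    # Threads per core that the scheduler would actually use in this model.
    usable_threads = min(configured_threads_per_core, hw_threads_per_core)
    return physical_cores * usable_threads / vcpus


if __name__ == "__main__":
    # A 96-vCPU instance with 2 hardware threads per core, left at the
    # default ThreadsPerCore=1 versus set to match the hardware.
    print(schedulable_fraction(96, 2, 1))
    print(schedulable_fraction(96, 2, 2))
```

Under this model, an instance with two hardware threads per core and ThreadsPerCore left at 1 has half of its vCPUs schedulable, while an instance with multi-threading disabled (one thread per core) is unaffected.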

Affected versions (OSes, schedulers)

  • ParallelCluster 3.9.0, 3.9.1
  • Slurm 23.11.4
  • All operating systems supported by ParallelCluster

Mitigation

You can find a detailed explanation and the mitigation of the problem here.