aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

OverSubscribe in ParallelCluster 3.8.0 does not work

michaelmayer2 opened this issue

When trying to use the OverSubscribe parameter via CustomSlurmSettings at the Queue level, the pcluster command produces a validation error:

  "configurationValidationErrors": [
    {
      "level": "ERROR",
      "type": "CustomSlurmSettingsValidator",
      "message": "Using the following custom Slurm settings at Queue level is not allowed: OverSubscribe"
    },

Setting the same parameter at the ComputeResources level passes validation, but it leads to a broken SLURM configuration, because OverSubscribe is only allowed at the Queue/Partition level.

In the pcluster source code, OverSubscribe is associated with the Queue level but is placed on the DENY_LIST:
https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/validators/slurm_settings_validator.py#L50
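
To illustrate the behaviour described above, here is a minimal sketch of a per-level deny-list check. It is not the code from slurm_settings_validator.py: the dictionary layout, its contents, and the function names `denied_settings` and `validation_message` are all assumptions made for illustration. Only the error message wording is taken from the output quoted in this issue.

```python
# Illustrative only: NOT the actual contents of slurm_settings_validator.py.
# The level names and the entries below are assumptions chosen to mirror the
# behaviour reported in this issue.
ILLUSTRATIVE_DENY_LIST = {
    "Queue": {"oversubscribe"},
    "ComputeResource": set(),
}


def denied_settings(custom_settings, level):
    """Return the setting names in custom_settings that are denied at this level."""
    denied = ILLUSTRATIVE_DENY_LIST.get(level, set())
    # Compare in lower case so that "OverSubscribe" and "oversubscribe" match.
    return sorted(name for name in custom_settings if name.lower() in denied)


def validation_message(custom_settings, level):
    """Build an error message like the one quoted above, or None if nothing is denied."""
    names = denied_settings(custom_settings, level)
    if not names:
        return None
    return (
        f"Using the following custom Slurm settings at {level} level "
        f"is not allowed: {', '.join(names)}"
    )


print(validation_message({"OverSubscribe": "FORCE"}, "Queue"))
print(validation_message({"OverSubscribe": "FORCE"}, "ComputeResource"))
```

In this sketch the Queue-level call returns an error message while the ComputeResource-level call returns None, which matches the mismatch reported here: the setting is rejected where SLURM accepts it and accepted where SLURM does not.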

If OverSubscribe is removed from the DENY_LIST, pcluster creates a valid SLURM configuration, e.g. with this partial YAML:

....
  SlurmQueues: 
    - Name: interactive 
      ComputeResources:
        - Name: interactive 
          InstanceType: t2.xlarge 
          MaxCount: 20
          MinCount: 1
          Efa:
            Enabled: FALSE
      CustomSlurmSettings:
        OverSubscribe: FORCE
...

We have OverSubscribe in the deny list because it is mutually exclusive with the JobExclusiveAllocation parameter.

If you really want to use OverSubscribe, make sure you don't have JobExclusiveAllocation and suppress the validator when creating the cluster.
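
Before suppressing the validator, it may help to check the configuration for the conflict yourself. The following is a minimal sketch, assuming the SlurmQueues section has already been parsed into a list of dictionaries, that JobExclusiveAllocation is a queue-level boolean, and that queue-level CustomSlurmSettings is a mapping. The function name `find_oversubscribe_conflicts` is hypothetical and not part of pcluster.

```python
def find_oversubscribe_conflicts(queues):
    """Return the names of queues that set both OverSubscribe and JobExclusiveAllocation.

    `queues` is assumed to be the parsed SlurmQueues list from the cluster
    configuration (a list of dicts). This helper is hypothetical.
    """
    conflicts = []
    for queue in queues:
        custom = queue.get("CustomSlurmSettings") or {}
        # Compare in lower case so that any capitalisation of the key is caught.
        has_oversubscribe = any(key.lower() == "oversubscribe" for key in custom)
        # Assumes the YAML loader produced a real boolean for this field.
        exclusive = queue.get("JobExclusiveAllocation", False) is True
        if has_oversubscribe and exclusive:
            conflicts.append(queue["Name"])
    return conflicts


queues = [
    {"Name": "interactive", "CustomSlurmSettings": {"OverSubscribe": "FORCE"}},
    {
        "Name": "batch",
        "JobExclusiveAllocation": True,
        "CustomSlurmSettings": {"OverSubscribe": "FORCE"},
    },
]
print(find_oversubscribe_conflicts(queues))  # prints ['batch']
```

An empty result means no queue combines the two settings, which is the condition described above for safely bypassing the validator.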

Fair enough - happy to use the --validation-failure-level WARNING workaround for the time being.

It would still be great to allow OverSubscribe at the Queue/Partition level, though: as things stand, setting it at the ComputeResources level passes the default validators yet produces a non-functional SLURM cluster.