OverSubscribe in ParallelCluster 3.8.0 does not work
michaelmayer2 opened this issue · comments
When trying to use the OverSubscribe parameter via CustomSlurmSettings at the Queue level, the pcluster command produces a validation error:
"configurationValidationErrors": [
  {
    "level": "ERROR",
    "type": "CustomSlurmSettingsValidator",
    "message": "Using the following custom Slurm settings at Queue level is not allowed: OverSubscribe"
  },
If the same setting is placed at the ComputeResources level, the validator accepts it, but this leads to a broken SLURM configuration, given that the OVERSUBSCRIBE parameter is only allowed at the Queue/Partition level.
In the pcluster source code, OverSubscribe is associated with the Queue level but is also put on the DENY_LIST:
https://github.com/aws/aws-parallelcluster/blob/develop/cli/src/pcluster/validators/slurm_settings_validator.py#L50
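To make the behaviour above concrete, here is a minimal sketch of how a deny-list check of this kind could produce that error. It is not the actual pcluster implementation: the names `QUEUE_DENY_LIST`, `find_denied_settings` and `validate_queue_settings` are hypothetical, the deny list is reduced to a single entry for illustration, and the case-insensitive comparison is an assumption based on the lowercase `oversubscribe` entry mentioned in this issue.

```python
# Illustrative sketch only; names and structure are hypothetical,
# not copied from slurm_settings_validator.py.
QUEUE_DENY_LIST = {"oversubscribe"}  # assumed single entry for illustration


def find_denied_settings(custom_settings, deny_list=QUEUE_DENY_LIST):
    """Return setting names (original casing) whose lowercase form is denied."""
    return [name for name in custom_settings if name.lower() in deny_list]


def validate_queue_settings(custom_settings):
    """Return an error dict shaped like the one in this issue, or None if OK."""
    denied = find_denied_settings(custom_settings)
    if not denied:
        return None
    return {
        "level": "ERROR",
        "type": "CustomSlurmSettingsValidator",
        "message": "Using the following custom Slurm settings at Queue level "
        "is not allowed: " + ", ".join(denied),
    }


if __name__ == "__main__":
    print(validate_queue_settings({"OverSubscribe": "FORCE"}))
```

Under these assumptions, removing the entry from the deny list is all it takes for the setting to pass the check, which matches what is described next.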
If oversubscribe is removed from the DENY_LIST, pcluster will create a valid SLURM configuration, e.g. using this partial YAML code:
....
SlurmQueues:
  - Name: interactive
    ComputeResources:
      - Name: interactive
        InstanceType: t2.xlarge
        MaxCount: 20
        MinCount: 1
        Efa:
          Enabled: FALSE
    CustomSlurmSettings:
      OverSubscribe: FORCE
...
The reason we have OverSubscribe in the deny list is that it is mutually exclusive with the JobExclusiveAllocation parameter.
If you really want to use OverSubscribe, make sure you don't have JobExclusiveAllocation set, and suppress the validator when creating the cluster.
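Since suppressing the validator also removes the safety net, a small pre-flight check can confirm that no queue combines the two settings. The following is a rough sketch under stated assumptions: the function name `conflicting_queues` is hypothetical, the queues are passed in as plain Python dictionaries mirroring the YAML, and JobExclusiveAllocation is assumed to be a boolean set directly on each queue entry.

```python
def conflicting_queues(slurm_queues):
    """Return names of queues that set both JobExclusiveAllocation and OverSubscribe.

    `slurm_queues` is a list of dicts mirroring the SlurmQueues YAML entries.
    Assumption: JobExclusiveAllocation is a boolean key on the queue itself.
    """
    conflicts = []
    for queue in slurm_queues:
        custom = queue.get("CustomSlurmSettings") or {}
        # Compare case-insensitively, since Slurm parameter names are
        # commonly written with varying capitalisation.
        has_oversubscribe = any(key.lower() == "oversubscribe" for key in custom)
        if queue.get("JobExclusiveAllocation") and has_oversubscribe:
            conflicts.append(queue["Name"])
    return conflicts


if __name__ == "__main__":
    queues = [
        {"Name": "interactive", "CustomSlurmSettings": {"OverSubscribe": "FORCE"}},
        {
            "Name": "batch",
            "JobExclusiveAllocation": True,
            "CustomSlurmSettings": {"OverSubscribe": "FORCE"},
        },
    ]
    print(conflicting_queues(queues))
```

An empty result means it should be safe, as far as this particular conflict goes, to proceed with the validator suppressed.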
Fair enough - happy to use the --validation-failure-level WARNING workaround for the time being.
It would still be great to add OverSubscribe on the Queue/Partition level though - adding OverSubscribe there will provide a non-functional SLURM cluster, even with validators as default.