aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page:https://github.com/aws/aws-parallelcluster

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Slurm Accounting seemingly not fully functional

snemir2 opened this issue · comments

commented

Required Info:

  • AWS ParallelCluster version [e.g. 3.1.1]: 3.8.0

  • Full cluster configuration without any credentials or personal data. ( see next comment, The important point is that slurm accounting database on RDS/mysql aurora is configured)

  • Cluster name: A2AiClustersergey

  • Output of pcluster describe-cluster command.

  • [Optional] Arn of the cluster CloudFormation main stack: arn:aws:cloudformation:us-east-2:654225707598:stack/A2AiClustersergey/7e9b0c10-aa75-11ee-b08b-0a01aa5f1f7b

Bug description and how to reproduce:
With slurm accounting enabled, sreport does not contain any entries at all.

root@ip-10-2-42-167:~# qstat
Job id              Name             Username        Time Use S Queue          
------------------- ---------------- --------------- -------- - ---------------
1                   hostname         slurm           365:00:0 R cpu2mem4gb     
2                   uptime           root            365:00:0 R cpu2mem4gb     
3                   whoami           root            365:00:0 R cpu2mem4gb     
4                   qstat            sergey          365:00:0 R cpu2mem4gb

# sacct
JobID           JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
1              hostname cpu2mem4gb  pcdefault          1  COMPLETED      0:0 
1.0            hostname             pcdefault          1  COMPLETED      0:0 
2                uptime cpu2mem4gb       root          1  COMPLETED      0:0 
2.0              uptime                  root          1  COMPLETED      0:0 
3                whoami cpu2mem4gb       root          1    RUNNING      0:0 
4                 qstat cpu2mem4gb                     1    RUNNING      0:0 

BUT, sreport returns nothing

root@ip-10-2-42-167:~# sreport cluster AccountUtilizationByUser 
--------------------------------------------------------------------------------
Cluster/Account/User Utilization 2024-01-02T00:00:00 - 2024-01-02T23:59:59 (86400 secs)
Usage reported in CPU Minutes
--------------------------------------------------------------------------------
  Cluster         Account     Login     Proper Name     Used   Energy 
--------- --------------- --------- --------------- -------- -------- 

If you are reporting issues about scaling or job failure:
We cannot work on issues without proper logs. We STRONGLY recommend following this guide and attach the complete cluster log archive with the ticket.

For issues with Slurm scheduler, please attach the following logs:

  • From Head node: /var/log/parallelcluster/clustermgtd, /var/log/parallelcluster/clusterstatusmgtd (if version >= 3.2.0), /var/log/parallelcluster/slurm_resume.log, /var/log/parallelcluster/slurm_suspend.log, /var/log/parallelcluster/slurm_fleet_status_manager.log (if version >= 3.2.0) and/var/log/slurmctld.log.
  • From Compute node: /var/log/parallelcluster/computemgtd.log and /var/log/slurmd.log.

Additional context:
Setup is basically per https://aws.amazon.com/blogs/hpc/leveraging-slurm-accounting-in-aws-parallelcluster/ with rds/aurora database.

Any other context about the problem. E.g.:

  • CLI logs: ~/.parallelcluster/pcluster-cli.log
  • Custom bootstrap scripts, if any
  • Screenshots, if useful.
commented

config file

HeadNode: 
  CustomActions:
    OnNodeStart: 
      Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
      Args:
        - "HeadNodeOnNodeStart"
        - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
        - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

    OnNodeConfigured: 
      Script:  "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
      Args:
        - "HeadNodeOnNodeConfigured"
        - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
        - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
    OnNodeUpdated: 
      Script:  "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
      Args:
        - "HeadNodeOnNodeUpdated"
        - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
        - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        
  Iam: 
    InstanceRole: arn:aws:iam::654225707598:role/HeadNodeRole
  InstanceType: t3.medium
  Networking: 
    AdditionalSecurityGroups: 
    # posit workbench launcher callback
      - sg-0e3cf05f5dc1a59da 
    SubnetId: subnet-00c65375744fb5ea5   ##public for now, but could/should be private same as compute
    ElasticIp: false
  Ssh: 
     KeyName: admin_ED25519
  LocalStorage:
     RootVolume:
       Size: 500 # we allocate 2xRAM swap file on root filesystem
       
       
  Dcv:
    Enabled: true
    Port: 8443
    AllowedIps: 10.0.0.0/8 #only letting to access from private ip space. Accessed via cabby/proxy most of the times anywazys
    

Image: 
  Os: ubuntu2204
  CustomAmi: ami-06d0647411b892c8d
Region: us-east-2
Scheduling: 
  Scheduler: slurm
  SlurmSettings:
    QueueUpdateStrategy: DRAIN     #Must be one of: DRAIN, COMPUTE_FLEET_STOP, TERMINATE.
    Dns:
      DisableManagedDns: False
      #UseEc2Hostnames: True
    Database:
      Uri: slurmaccountingdb-slurmaccountingdb5933648d-brbj4rghp0xk.cluster-c8saxlvocbwo.us-east-2.rds.amazonaws.com:3306
      UserName: admin
      PasswordSecretArn: arn:aws:secretsmanager:us-east-2:654225707598:secret:/a2ai/a2ai-cloud/slurm-accounting-db-password-LLm2tb
  SlurmQueues: 
    - Name: cpu2mem4gb
      
      CapacityType: SPOT
       
      ComputeResources: 
        - Name: c6i-large
          Instances: 
            - InstanceType: c6i.large
         
          MaxCount: 100 
          MinCount: 0 
      CustomActions: 
        OnNodeStart: 
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeConfigured"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
        InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"
    - Name: cpu4mem32gb
      
      CapacityType: SPOT
      
      ComputeResources:  
        - Name: r6i-xlarge
          Instances:
            - InstanceType: r6i.xlarge
          MaxCount: 100 
          MinCount: 0 
         
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"
    - Name: cpu8mem64gb
      
      CapacityType: SPOT
       
      ComputeResources: 
        - Name: r6i-2xlarge
          Instances:
            - InstanceType: r6i.2xlarge
          MaxCount: 100 
          MinCount: 0 
          
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"
  

    - Name: cpu2mem8gb
      
      CapacityType: SPOT
      
      ComputeResources: 
        - Name: c6i-xlarge
          Instances:
            - InstanceType: c6i.xlarge
          MaxCount: 100 
          MinCount: 0 
          DisableSimultaneousMultithreading: true
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"
  
  
    - Name: cpu4mem16gb
      
      CapacityType: SPOT
      
      ComputeResources: 
        - Name: c6i-2xlarge
          Instances:
            - InstanceType: c6i.2xlarge
          MaxCount: 100
          MinCount: 0 
          DisableSimultaneousMultithreading: true
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"

    - Name: cpu16mem128gb
      
      CapacityType: SPOT
       
      ComputeResources: 
        - Name: r6i-4xlarge
          Instances:
            - InstanceType: r6i.4xlarge
          MaxCount: 100 
          MinCount: 0 
          
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"
  

    - Name: cpu8mem32gb
      
      CapacityType: SPOT
      
      ComputeResources: 
        - Name: c6i-4xlarge
          Instances:
            - InstanceType: c6i.4xlarge
          MaxCount: 100
          MinCount: 0 
          DisableSimultaneousMultithreading: true
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"
    - Name: cpu16mem64gb
      
      CapacityType: SPOT
      
      ComputeResources: 
        - Name: c6i-8xlarge
          Instances:
            - InstanceType: c6i.8xlarge
          MaxCount: 100
          MinCount: 0 
          DisableSimultaneousMultithreading: true
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
          SubnetIds: 
            - "subnet-00c65375744fb5ea5"
            - "subnet-080fb7908dc6c99df"
        
    - Name: cpu32mem128gb
      
      CapacityType: SPOT
      
      ComputeResources: 
        - Name: c6i-16xlarge
          Instances:
            - InstanceType: c6i.16xlarge
          MaxCount: 100
          MinCount: 0 
          DisableSimultaneousMultithreading: true
      CustomActions: 
        OnNodeStart:  
          Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
          Args:
            - "ComputeNodeOnNodeStart"
            - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
            - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"
        OnNodeConfigured: 
            Script: "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/scripts/branch/release-v3/download_and_run_cookbook.sh"
            Args:
              - "ComputeNodeOnNodeConfigured"
              - "s3://a2ai-cloud-build-artifacts-dev-654225707598-us-east-2/a2ai-cloud-cookbook/branch/release-v3/a2a-cloud.tar.gz"
              - "s3://a2ai-cluster-provision-artifacts-dev-654225707598-us-east-2/A2AiClustersergey/config/a2ai.json"

      Iam: 
         InstanceRole: arn:aws:iam::654225707598:role/ComputeFleetRole
   
      Networking: 
        SubnetIds: 
          - "subnet-00c65375744fb5ea5"
          - "subnet-080fb7908dc6c99df"


    
   

SharedStorage:

  - MountDir: /data
    Name: FsxSharedDataFilesystem
    StorageType: FsxLustre
    FsxLustreSettings:
      FileSystemId: fs-0db92f52bb64ade74
   

  - MountDir: /cluster-data
    Name: ClusterSharedDataFilesystem
    StorageType: Ebs
    EbsSettings:
      VolumeType: gp3
      Size: 100
      Throughput: 500
      DeletionPolicy: Delete
      Encrypted: true



DirectoryService:
  DomainName: ad.dev.a2-ai.cloud
  DomainAddr: ldaps://ad.dev.a2-ai.cloud
  PasswordSecretArn: arn:aws:secretsmanager:us-east-2:654225707598:secret:/a2ai/a2ai-cloud/ad_bind_user_password-0oxi9e
  DomainReadOnlyUser: CN=AD ReadOnly,OU=Users,OU=ad-dev,DC=ad,DC=dev,DC=a2-ai,DC=cloud
  LdapTlsCaCert: /opt/parallelcluster/shared/directory_service/domain-certificate.crt
  LdapTlsReqCert: demand
  #LdapAccessFilter: string
  GenerateSshKeysForUsers: true
  #AdditionalSssdConfigs: 

Tags:
  - Key: A2AI:a2ai-cloud-version
    Value: branch/release-v3 
  - Key: A2AI:a2ai-cloud-env
    Value: dev 
  - Key: A2AI:creator
    Value: sergey```
commented

describe-cluster output

pcluster describe-cluster -n A2AiClustersergey --region us-east-2
{
  "creationTime": "2024-01-03T20:20:10.856Z",
  "headNode": {
    "launchTime": "2024-01-03T20:24:23.000Z",
    "instanceId": "i-0a44846fde3aa054c",
    "instanceType": "t3.medium",
    "state": "running",
    "privateIpAddress": "10.2.42.167"
  },
  "version": "3.8.0",
  "clusterConfiguration": {
    "url": "https://parallelcluster-97a8b56da16cbe1e-v1-do-not-delete.s3.us-east-2.amazonaws.com/parallelcluster/3.8.0/clusters/a2aiclustersergey-8blzpedizys4pybh/configs/cluster-config.yaml?versionId=EeeUlRSiAZpVZ2ADoj.2GL7keIHf1aqZ&X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=ASIAZQUXECJHNEE4JIEM%2F20240103%2Fus-east-2%2Fs3%2Faws4_request&X-Amz-Date=20240103T204603Z&X-Amz-Expires=3600&X-Amz-SignedHeaders=host&X-Amz-Security-Token=IQoJb3JpZ2luX2VjEMz%2F%2F%2F%2F%2F%2F%2F%2F%2F%2FwEaCXVzLWVhc3QtMiJIMEYCIQCn82A%2BYvjtedv3jOdnJkxl6QwYN46lBZ2NJel3jpeRrAIhAIwSIQCZ3mAmuqBN6E5sToZbqGq6C9qrujHbdFoFAdz%2FKv4CCGUQABoMNjU0MjI1NzA3NTk4IgyswAK689OXTeh%2B3sEq2wL%2B2gfLNIVpbFK1X3ZTnGhvzhuMg9tA5aWKE6c494QTf4dTjimy7oIA1uxi3O8ITg5fRw8V3MGKuiGPgx%2BIxYXUQNDyHEa2cIq4aFnpOTZzExMzJVjjDf7TY8BJz%2B9brhgy21ehRvEf5E7lgDSXbuwGvJC4b7fP3uzHA7pKEPbBzNOcOfow%2FSMDfsSGhNDjp7JK%2FrEh7v%2BK10W2Wgul0vomd1UoNGGzN3M8a4FOKSdE0OSOa6Z%2FrLFhEZnzb8LcT4hI3o2yuvywG7v50cngrmdlg7u1s0JFC6F8KGeUrK1d46BdQWzkaVZ7ev97EN8Bah4urfbmThict7Zv10%2Fc%2Fqa%2B7x6PCcg8V5arnNWHsjAP23VfCfpMGYGIk2my%2BvAVDR%2FA0AUBTDBAoOVwplxTLQCzCnXBVXIxW%2B%2Bul9gkLyAleX0ySGXxh5xbCRiAl2nzKo440zLjaHzlqfe7cjCvgtesBjqmAYz6w4z3fdbHcTCqbzqvnzrtYjkRp%2BkiGc1zAHfd%2FjI%2B4puJZYQe5GoTqTdmOAH5O1xWvAgFGG8BkX8cDkqo5xzU%2B6KB5rGG3Z5TDW5w2ERhbZLGJKUugo14IlntcgTpjO9%2BKcUURrqeZUnWrdadvnWauZ%2BqLga03xCgLPxzUoJc2%2FnfaJXnbi%2Fff%2FcbuzJf5CIxFT48w3Gu2faH1wbQCY%2FF2PbjKTk%3D&X-Amz-Signature=a3ae7089f1513eb8261041530e7dd0fe32b237ef2d695de0e47c17bbae5f8ef2"
  },
  "tags": [
    {
      "value": "branch/release-v3",
      "key": "A2AI:a2ai-cloud-version"
    },
    {
      "value": "3.8.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "A2AiClustersergey",
      "key": "parallelcluster:cluster-name"
    },
    {
      "value": "sergey",
      "key": "A2AI:creator"
    },
    {
      "value": "dev",
      "key": "A2AI:a2ai-cloud-env"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "A2AiClustersergey",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:us-east-2:654225707598:stack/A2AiClustersergey/7e9b0c10-aa75-11ee-b08b-0a01aa5f1f7b",
  "lastUpdatedTime": "2024-01-03T20:20:10.856Z",
  "region": "us-east-2",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}

Hi @snemir2,

we performed some tests on some simple internal clusters and we were able to use the sreport command. Though, I admit that it's as immediate as sacct.

Two things we noticed:

  • you have to make sure that jobs are associated to a Slurm account (e.g. via the --account option of sbatch). This seems to be present on your side;
  • you probably have to run applications that last a while: your examples (hostname, uptime, whoami and qstat) are commands that return immediately and won't keep nodes allocated for long.

For any other issue in using sreport, I would recommend contacting SchedMD.

commented

Thank you so much for the feedback; the short demo jobs are just red herrings -- something that I run for demo purposes. We had the same problem in prod with days-long jobs.

A perfect solution for us would be to have a way to map an AD group to a slurm account. Simmingly, no good way of doing that.

A perfect solution for us would be to have a way to map an AD group to a slurm account. Simmingly, no good way of doing that.

I would say that must be done manually when you set up the Slurm accounting database: in ParallelCluster you may have a post-install script that bootstraps the Slurm database from the AD information (although I'm not sure about how to best implement this).