Wrong memory efficiency when using "srun"

angel-devicente opened this issue a year ago · comments

Hello,

This is probably related to #37, but it's a bit different, so I thought I'd open a new issue.

When I run:
srun -n 8 stress -m 1 -t 52 --vm-keep --vm-bytes 1800M
I use 8 CPUs and almost 16GB of memory. reportseff gets the CPU efficiency right, but the memory efficiency is way off (it basically reports that I only used 1800M).

$ seff 131042
######################## JOB EFFICIENCY REPORT ########################
# Job ID:          131042
# State:           COMPLETED (exit code 0)
# Cores:           8
# CPU Utilized:    00:06:58
# CPU Efficiency:  98.58% of 00:07:04 core-walltime
# Wall-clock time: 00:00:53
# Memory Utilized: 14.86 GB (estimated maximum)
#######################################################################

$ reportseff --debug 131042
^|^8^|^00:00:53^|^131042^|^131042^|^^|^1^|^16000M^|^COMPLETED^|^00:01:00^|^06:57.815
^|^8^|^00:00:53^|^131042.batch^|^131042.batch^|^20264K^|^1^|^^|^COMPLETED^|^^|^00:00.034
^|^8^|^00:00:53^|^131042.extern^|^131042.extern^|^1052K^|^1^|^^|^COMPLETED^|^^|^00:00.001
^|^8^|^00:00:53^|^131042.0^|^131042.0^|^1947276K^|^1^|^^|^COMPLETED^|^^|^06:57.779

   JobID    State       Elapsed  TimeEff   CPUEff   MemEff 
  131042  COMPLETED    00:00:53   88.3%    98.3%    11.9%

Troy Comi · Answer 1 · Wed Sep 13 2023 03:23:41 GMT+0800 (China Standard Time)

Can you run seff -d 131042 to get the raw data? It seems the memory reported by sacct should be scaled by ntasks, as shown here
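
As a rough illustration of that scaling (a sketch, not reportseff's actual code), the numbers from the reportseff --debug output above can be plugged in directly. This assumes the 1947276K field of step 131042.0 is the per-task MaxRSS, that 16000M is the requested memory, and that the job ran 8 tasks (from srun -n 8). The helper to_kib and the variable names are hypothetical.

```python
def to_kib(value: str) -> float:
    """Convert a Slurm-style size such as '1947276K' or '16000M' to KiB."""
    units = {"K": 1, "M": 1024, "G": 1024**2, "T": 1024**3}
    return float(value[:-1]) * units[value[-1]]


# Values copied from the debug output; their interpretation is an assumption.
max_rss_per_task = to_kib("1947276K")  # step 131042.0
req_mem = to_kib("16000M")             # job-level requested memory
ntasks = 8                             # from `srun -n 8`

# Unscaled: treat the per-task peak as the whole job's memory use.
unscaled = 100 * max_rss_per_task / req_mem
# Scaled: multiply the per-task peak by the number of tasks.
scaled = 100 * max_rss_per_task * ntasks / req_mem

print(f"unscaled MemEff: {unscaled:.1f}%")  # 11.9%
print(f"scaled MemEff:   {scaled:.1f}%")    # 95.1%
```

The unscaled figure matches the 11.9% that reportseff printed, while the scaled one is close to what seff's 14.86 GB out of 16000M would imply.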

Angel de Vicente · Answer 2 · Wed Sep 13 2023 15:44:25 GMT+0800 (China Standard Time)

$ seff -d 131042
Slurm data: JobID ArrayJobID User Group State Clustername Ncpus Nnodes Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus
Slurm data: 131042  xxx xxx COMPLETED xxxx 8 1 8 16384000 0 418 53 15578208 0

######################## JOB EFFICIENCY REPORT ########################
# Job ID:          131042
# Cluster:         xxx
# User/Group:      xxx/xxx
# State:           COMPLETED (exit code 0)
# Cores:           8
# CPU Utilized:    00:06:58
# CPU Efficiency:  98.58% of 00:07:04 core-walltime
# Wall-clock time: 00:00:53
# Memory Utilized: 14.86 GB (estimated maximum)
#######################################################################
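
For what it's worth, the raw line looks consistent with scaling by Ntasks. Here is a quick Python check, assuming Mem and Reqmem are in KB, Cput and Walltime are in seconds, and the double space after the job ID is the empty ArrayJobID field. The 1947276 value is the step 131042.0 figure from the reportseff --debug output.

```python
header = ("JobID ArrayJobID User Group State Clustername Ncpus Nnodes "
          "Ntasks Reqmem PerNode Cput Walltime Mem ExitStatus")
data = "131042  xxx xxx COMPLETED xxxx 8 1 8 16384000 0 418 53 15578208 0"

# Split the data on single spaces so the empty ArrayJobID field is kept
# and the values line up with the header names.
fields = dict(zip(header.split(), data.split(" ")))

ncpus = int(fields["Ncpus"])
ntasks = int(fields["Ntasks"])
reqmem_kb = int(fields["Reqmem"])
cput_s = int(fields["Cput"])
walltime_s = int(fields["Walltime"])
mem_kb = int(fields["Mem"])

step_max_rss_kb = 1947276  # from reportseff --debug, step 131042.0

print(mem_kb == ntasks * step_max_rss_kb)                              # True
print(f"Memory utilized: {mem_kb / 1024**2:.2f} GB")                   # 14.86 GB
print(f"CPU efficiency: {100 * cput_s / (ncpus * walltime_s):.2f}%")   # 98.58%
print(f"Memory efficiency: {100 * mem_kb / reqmem_kb:.2f}%")           # 95.08%
```

So seff's Mem value is exactly 8 times the per-step figure that reportseff used on its own.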

Troy Comi · Answer 3 · Wed Sep 13 2023 21:12:03 GMT+0800 (China Standard Time)

Thank you for providing this information and opening the issue. I don't do many multi-task jobs so their test coverage is lighter than it should be. I should have time to fix this in a week or so.

Angel de Vicente · Answer 4 · Thu Sep 14 2023 00:21:28 GMT+0800 (China Standard Time)

Great, thanks. If you need to run any tests, please let me know.
And, BTW, many thanks for developing this tool!

Troy Comi · Answer 5 · Thu Sep 14 2023 23:49:13 GMT+0800 (China Standard Time)

Should be addressed with version 2.7.6. Please reopen if you notice any problems.

Angel de Vicente · Answer 6 · Fri Sep 15 2023 00:45:19 GMT+0800 (China Standard Time)

Awesome. I'll give it a try as soon as I can. Thanks.