feat: energy data might be interesting, too!
sventor opened this issue
Besides efficiency (which is already great!), energy consumption would be a great feature, too.
As soon as Slurm is configured with AcctGatherEnergyType and AccountingStorageTRES=...,energy,... in slurm.conf, jobs carry rough estimates of consumed energy in J (Ws), so it would be great if reportseff could be enhanced to show that as a column in addition to the efficiency data.
I'm an administrator and not much of a power pythoneer, so I didn't want to risk messing up your code by trying that myself...
I like it but I'm in the other camp, a pythoneer without a sysadmin background.
$ reportseff -u tcomi --format +tresusageoutave
JobID State Elapsed TimeEff CPUEff MemEff TRESUsageOutAve
43967087 COMPLETED 00:52:56 11.0% 97.0% 40.4% energy=0,fs/disk=1
43967088 COMPLETED 00:58:09 12.1% 86.3% 39.4% energy=0,fs/disk=1
43967089 COMPLETED 00:54:16 11.3% 86.3% 38.4% energy=0,fs/disk=1
Seems like my system isn't set up to collect that information. I think adding an --energy option would be a good interface.
Can you:
- tell me which tres usage I should be checking?
- supply some reportseff --debug --format +tresusage output? I need that for building realistic unit tests.
Thanks for the suggestion!
If there is still interest in this, I wanted to see if anyone with a system configured to collect energy usage could provide sample reportseff --debug output to help build tests.
... sure, there is still keen interest - sorry for the delay. Here is my reportseff --debug output for a small job array:
$ reportseff --debug 37403870 --format +tresusageoutave
32|00:01:09|37403870_1|37403937||1|32000M|COMPLETED||00:02:00|00:47.734
32|00:01:09|37403870_1.batch|37403937.batch|6300K|1||COMPLETED|energy=33,fs/disk=0||00:47.733
32|00:01:09|37403870_1.extern|37403937.extern|4312K|1||COMPLETED|energy=33,fs/disk=0||00:00.001
32|00:01:21|37403870_2|37403938||1|32000M|COMPLETED||00:02:00|00:41.211
32|00:01:21|37403870_2.batch|37403938.batch|6316K|1||COMPLETED|energy=32,fs/disk=0||00:41.210
32|00:01:21|37403870_2.extern|37403938.extern|4312K|1||COMPLETED|energy=32,fs/disk=0||00:00:00
32|00:01:34|37403870_3|37403939||1|32000M|COMPLETED||00:02:00|00:51.669
32|00:01:34|37403870_3.batch|37403939.batch|6184K|1||COMPLETED|energy=30,fs/disk=0||00:51.667
32|00:01:35|37403870_3.extern|37403939.extern|4312K|1||COMPLETED|energy=30,fs/disk=0||00:00.001
32|00:01:11|37403870_4|37403870||1|32000M|COMPLETED||00:02:00|01:38.184
32|00:01:11|37403870_4.batch|37403870.batch|6300K|1||COMPLETED|energy=27,fs/disk=0||01:38.183
32|00:01:11|37403870_4.extern|37403870.extern|4312K|1||COMPLETED|energy=27,fs/disk=0||00:00.001
JobID State Elapsed TimeEff CPUEff MemEff TRESUsageOutAve
37403870_1 COMPLETED 00:01:09 57.5% 2.1% 0.0% energy=33,fs/disk=0
37403870_2 COMPLETED 00:01:21 67.5% 1.6% 0.0% energy=32,fs/disk=0
37403870_3 COMPLETED 00:01:34 78.3% 1.7% 0.0% energy=30,fs/disk=0
37403870_4 COMPLETED 00:01:11 59.2% 4.3% 0.0% energy=27,fs/disk=0
$
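For building unit tests from the debug output above, here is a minimal Python sketch of how the energy value could be pulled out of those pipe-separated lines. It is not reportseff's actual implementation. The field positions (JobID as the third field, TRESUsageOutAve as the ninth) are read off the sample and are an assumption, as are the helper names `parse_energy` and `energy_by_job`. Taking the maximum across job steps is also an assumption, chosen because the .batch and .extern steps report the same value and the summary table shows 33 rather than 66.

```python
# Assumed field positions, inferred from the sample debug output only.
JOBID_FIELD = 2
TRES_FIELD = 8


def parse_energy(tres: str) -> int:
    """Return the energy value from a string like 'energy=33,fs/disk=0'.

    Missing or non-numeric values are treated as 0 in this sketch.
    """
    for item in tres.split(","):
        key, _, value = item.partition("=")
        if key == "energy" and value.isdigit():
            return int(value)
    return 0


def energy_by_job(lines):
    """Map each job (steps folded into their parent) to its largest energy reading."""
    energy = {}
    for line in lines:
        fields = line.strip().split("|")
        if len(fields) <= TRES_FIELD:
            continue  # not a full record, skip it
        # '37403870_1.batch' and '37403870_1.extern' both belong to '37403870_1'
        job_id = fields[JOBID_FIELD].split(".")[0]
        reading = parse_energy(fields[TRES_FIELD])
        energy[job_id] = max(energy.get(job_id, 0), reading)
    return energy


sample = [
    "32|00:01:09|37403870_1|37403937||1|32000M|COMPLETED||00:02:00|00:47.734",
    "32|00:01:09|37403870_1.batch|37403937.batch|6300K|1||COMPLETED|energy=33,fs/disk=0||00:47.733",
    "32|00:01:09|37403870_1.extern|37403937.extern|4312K|1||COMPLETED|energy=33,fs/disk=0||00:00.001",
]
print(energy_by_job(sample))
```

The parent row has an empty TRES field, so it contributes 0 and the step readings decide the result.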
And our system is configured to simply collect energy readings from the Intel RAPL interface (running average power limit):
$ scontrol show config | grep -i energy
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:a100,gres/gpu:v100
AcctGatherEnergyType = acct_gather_energy/rapl
That is of course finer-grained than IPMI's power sensors, but it includes only the energy consumed by CPUs and memory, not by peripherals (which IPMI would cover) - JFYI.
Cheers
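The two settings mentioned in this thread can also be checked programmatically. Below is a rough Python sketch that inspects text in the `scontrol show config` format shown above and reports whether energy accounting looks enabled. The function names are hypothetical, and treating an empty value, "(null)", or a plugin name ending in "/none" as "not collecting" is an assumption rather than something confirmed here.

```python
def parse_config(text: str) -> dict:
    """Turn 'Key = value' lines into a dictionary, ignoring anything else."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def energy_collection_enabled(text: str) -> bool:
    """True if 'energy' is a tracked TRES and an energy plugin appears to be set."""
    settings = parse_config(text)
    tres = [t.strip() for t in settings.get("AccountingStorageTRES", "").split(",")]
    plugin = settings.get("AcctGatherEnergyType", "")
    # Assumption: these values mean no energy plugin is active.
    plugin_active = plugin not in ("", "(null)") and not plugin.endswith("/none")
    return "energy" in tres and plugin_active


config = """\
AccountingStorageTRES = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:a100,gres/gpu:v100
AcctGatherEnergyType = acct_gather_energy/rapl
"""
print(energy_collection_enabled(config))
```

A system that lists energy as a TRES but has no gather plugin would fail this check, which may be what produces the energy=0 readings earlier in the thread.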
No worries, it's the time of year full of delays!
As for the interface, I think having it as a formatting option will be better since you can specify any other formatting you want (column width, justification). For your example:
$ reportseff 37403870 --format +energy
JobID State Elapsed TimeEff CPUEff MemEff Energy
37403870_1 COMPLETED 00:01:09 57.5% 2.1% 0.0% 33
37403870_2 COMPLETED 00:01:21 67.5% 1.6% 0.0% 32
37403870_3 COMPLETED 00:01:34 78.3% 1.7% 0.0% 30
37403870_4 COMPLETED 00:01:11 59.2% 4.3% 0.0% 27
Is that good or do I need to deal with units?
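If units do turn out to matter, one possible approach is sketched below in Python. It assumes the raw value is in joules, as described at the top of this issue, and scales it to Wh or kWh for larger jobs using 1 Wh = 3600 J. The function name `format_energy` and the thresholds are illustrative choices only, not an existing reportseff feature.

```python
JOULES_PER_WH = 3600
JOULES_PER_KWH = 3_600_000


def format_energy(joules: float) -> str:
    """Render an energy reading in joules with a human-friendly unit."""
    if joules >= JOULES_PER_KWH:
        return f"{joules / JOULES_PER_KWH:.2f} kWh"
    if joules >= JOULES_PER_WH:
        return f"{joules / JOULES_PER_WH:.2f} Wh"
    return f"{joules:.0f} J"


for value in (33, 7200, 3_600_000):
    print(format_energy(value))
```

Keeping the plain number as the default and making scaling optional would keep the column easy to post-process.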