feat: energy data might be interesting, too!

sventor opened this issue 2 years ago

Besides efficiency (which is already great!), energy consumption would be a great feature, too.
Once AcctGatherEnergyType and AccountingStorageTRES=...,energy,... are configured in slurm.conf, Slurm records a rough estimate of each job's consumed energy in J (Ws), so it would be great if reportseff could be enhanced to show that as a column in addition to the efficiency data.
I'm an administrator and not much of a power pythoneer, so I didn't want to risk messing up your code by trying that myself...

Troy Comi · Answer 1 · Mon Nov 21 2022 21:59:07 GMT+0800 (China Standard Time)

I like it, but I'm in the other camp: a pythoneer without a sysadmin background.

$ reportseff -u tcomi --format +tresusageoutave
     JobID    State       Elapsed  TimeEff   CPUEff   MemEff        TRESUsageOutAve
  43967087  COMPLETED    00:52:56   11.0%    97.0%    40.4%        energy=0,fs/disk=1
  43967088  COMPLETED    00:58:09   12.1%    86.3%    39.4%        energy=0,fs/disk=1
  43967089  COMPLETED    00:54:16   11.3%    86.3%    38.4%        energy=0,fs/disk=1

Seems like my system isn't set up to collect that information. I think adding an option --energy would be a good interface.

Can you:

- tell me which TRES usage field I should be checking?
- supply some reportseff --debug --format +tresusage output? I need that for building realistic unit tests.

Thanks for the suggestion!

Troy Comi · Answer 2 · Tue Jan 24 2023 03:36:30 GMT+0800 (China Standard Time)

If there is still interest in this, I wanted to see if anyone with a system configured to collect energy usage could provide sample reportseff --debug output to help build tests.

Christian · Answer 3 · Tue Jan 24 2023 06:11:22 GMT+0800 (China Standard Time)

... sure, there is still keen interest - sorry for the delay. Here is my reportseff output for a small job array:

$ reportseff --debug 37403870 --format +tresusageoutave
32|00:01:09|37403870_1|37403937||1|32000M|COMPLETED||00:02:00|00:47.734
32|00:01:09|37403870_1.batch|37403937.batch|6300K|1||COMPLETED|energy=33,fs/disk=0||00:47.733
32|00:01:09|37403870_1.extern|37403937.extern|4312K|1||COMPLETED|energy=33,fs/disk=0||00:00.001
32|00:01:21|37403870_2|37403938||1|32000M|COMPLETED||00:02:00|00:41.211
32|00:01:21|37403870_2.batch|37403938.batch|6316K|1||COMPLETED|energy=32,fs/disk=0||00:41.210
32|00:01:21|37403870_2.extern|37403938.extern|4312K|1||COMPLETED|energy=32,fs/disk=0||00:00:00
32|00:01:34|37403870_3|37403939||1|32000M|COMPLETED||00:02:00|00:51.669
32|00:01:34|37403870_3.batch|37403939.batch|6184K|1||COMPLETED|energy=30,fs/disk=0||00:51.667
32|00:01:35|37403870_3.extern|37403939.extern|4312K|1||COMPLETED|energy=30,fs/disk=0||00:00.001
32|00:01:11|37403870_4|37403870||1|32000M|COMPLETED||00:02:00|01:38.184
32|00:01:11|37403870_4.batch|37403870.batch|6300K|1||COMPLETED|energy=27,fs/disk=0||01:38.183
32|00:01:11|37403870_4.extern|37403870.extern|4312K|1||COMPLETED|energy=27,fs/disk=0||00:00.001

       JobID    State       Elapsed  TimeEff   CPUEff   MemEff     TRESUsageOutAve
  37403870_1  COMPLETED    00:01:09   57.5%     2.1%     0.0%    energy=33,fs/disk=0
  37403870_2  COMPLETED    00:01:21   67.5%     1.6%     0.0%    energy=32,fs/disk=0
  37403870_3  COMPLETED    00:01:34   78.3%     1.7%     0.0%    energy=30,fs/disk=0
  37403870_4  COMPLETED    00:01:11   59.2%     4.3%     0.0%    energy=27,fs/disk=0

Our system is configured to simply collect energy readings from the Intel RAPL (Running Average Power Limit) interface:

$ scontrol show config | grep -i energy
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:a100,gres/gpu:v100
AcctGatherEnergyType    = acct_gather_energy/rapl

That is of course finer-grained than IPMI's power sensors, but it covers only the energy consumed by CPUs and memory, not peripherals (which IPMI would include) - JFYI.
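For anyone unsure whether their own cluster records energy, the check above can be sketched in Python. This is only an illustration: `energy_accounting_enabled` is a hypothetical helper, not part of reportseff, and it assumes that an unset AcctGatherEnergyType corresponds to the `acct_gather_energy/none` plugin.

```python
def energy_accounting_enabled(scontrol_output: str) -> bool:
    """Guess from `scontrol show config` text whether job energy is recorded."""
    settings = {}
    for line in scontrol_output.splitlines():
        # Split on the first "=" only, since values may contain "=" themselves.
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    tres = settings.get("AccountingStorageTRES", "").split(",")
    # Assumption: a missing setting behaves like the "none" plugin.
    plugin = settings.get("AcctGatherEnergyType", "acct_gather_energy/none")
    return "energy" in tres and not plugin.endswith("/none")


# The two lines quoted in this thread.
sample = """\
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:a100,gres/gpu:v100
AcctGatherEnergyType    = acct_gather_energy/rapl
"""
print(energy_accounting_enabled(sample))
```

On a live system the text could come from running `scontrol show config` and passing its output to the function.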

Cheers

Troy Comi · Answer 4 · Tue Jan 24 2023 23:21:35 GMT+0800 (China Standard Time)

No worries, it's the time of year full of delays!

As for the interface, I think having it as a formatting option will be better since you can specify any other formatting you want (column width, justification). For your example:

$ reportseff 37403870 --format +energy
       JobID    State       Elapsed  TimeEff   CPUEff   MemEff Energy
  37403870_1  COMPLETED    00:01:09   57.5%     2.1%     0.0%    33
  37403870_2  COMPLETED    00:01:21   67.5%     1.6%     0.0%    32
  37403870_3  COMPLETED    00:01:34   78.3%     1.7%     0.0%    30
  37403870_4  COMPLETED    00:01:11   59.2%     4.3%     0.0%    27

Is that good or do I need to deal with units?
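As a rough sketch of how such an Energy column could be derived from the debug lines posted earlier: the column order in `COLUMNS` is inferred from the sample values, `parse_tres` and `energy_by_job` are hypothetical names, and taking the largest value reported by any step of a job is just one possible choice. None of this is taken from the reportseff source.

```python
from typing import Dict, Iterable

# Assumed column order, inferred from the sample debug output in this thread.
COLUMNS = [
    "AllocCPUS", "Elapsed", "JobID", "JobIDRaw", "MaxRSS", "NNodes",
    "REQMEM", "State", "TRESUsageOutAve", "Timelimit", "TotalCPU",
]


def parse_tres(tres: str) -> Dict[str, str]:
    """Turn 'energy=33,fs/disk=0' into {'energy': '33', 'fs/disk': '0'}."""
    pairs = (item.split("=", 1) for item in tres.split(",") if "=" in item)
    return {key: value for key, value in pairs}


def energy_by_job(debug_lines: Iterable[str]) -> Dict[str, int]:
    """Collect one energy value (in joules) per job from pipe-delimited lines."""
    energy: Dict[str, int] = {}
    for line in debug_lines:
        if not line.strip():
            continue
        row = dict(zip(COLUMNS, line.strip().split("|")))
        job = row["JobID"].split(".")[0]  # fold .batch/.extern into the job
        joules = int(parse_tres(row["TRESUsageOutAve"]).get("energy", 0))
        # Assumption: keep the largest value seen for any step of the job.
        energy[job] = max(energy.get(job, 0), joules)
    return energy


lines = [
    "32|00:01:09|37403870_1|37403937||1|32000M|COMPLETED||00:02:00|00:47.734",
    "32|00:01:09|37403870_1.batch|37403937.batch|6300K|1||COMPLETED|energy=33,fs/disk=0||00:47.733",
    "32|00:01:09|37403870_1.extern|37403937.extern|4312K|1||COMPLETED|energy=33,fs/disk=0||00:00.001",
]
print(energy_by_job(lines))  # {'37403870_1': 33}
```

In the sample, the .batch and .extern steps report the same value, so summing across steps would double-count; that is why this sketch takes the maximum instead.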

Christian · Answer 5 · Tue Jan 24 2023 23:50:39 GMT+0800 (China Standard Time)

... no, that's good as you showed it. If you want to add a unit: it's always in joules (= watt-seconds, Ws). Great! Thanks very much!

Troy Comi · Answer 6 · Wed Jan 25 2023 00:15:27 GMT+0800 (China Standard Time)

Closed with #22. Decided against adding a unit since it may interfere with sorting or other downstream analysis.
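To illustrate the sorting concern with a small Python sketch: the values below are made up for the example, and `joules_to_watt_hours` is a hypothetical helper showing how a unit could still be applied downstream, not something reportseff provides.

```python
# Hypothetical energy readings in joules, chosen only to show the effect.
energies = [33, 9, 120]
with_units = [f"{e}J" for e in energies]

print(sorted(energies))    # [9, 33, 120]  numeric order
print(sorted(with_units))  # ['120J', '33J', '9J']  lexicographic order


def joules_to_watt_hours(joules: float) -> float:
    """Convert joules (watt-seconds) to watt-hours: 1 Wh = 3600 J."""
    return joules / 3600.0


print(joules_to_watt_hours(3600))  # 1.0
```

Keeping the column as a bare number leaves this kind of conversion to whoever consumes the output.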

Pushing to PyPI now, let me know how it works!