troycomi / reportseff

Tabular seff

feat: energy data might be interesting, too!

sventor opened this issue · comments

Besides efficiency (which is already great!), energy consumption would be a great feature, too.
Once Slurm is configured with AcctGatherEnergyType and AccountingStorageTRES=...,energy,... in slurm.conf, jobs carry a rough estimate of consumed energy in J (Ws). It would be great if reportseff could show that as a column alongside the efficiency data.
I'm an administrator and not much of a power pythoneer, so I didn't want to mess up your code by trying that myself...
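
Since the value is reported in joules (watt-seconds), converting it to more familiar units is simple arithmetic. A minimal sketch, assuming you already have the energy in joules and the elapsed time in seconds; the function names are made up for illustration and are not part of reportseff:

```python
JOULES_PER_WATT_HOUR = 3600.0  # 1 Wh = 3600 J (1 W sustained for 3600 s)


def joules_to_watt_hours(joules: float) -> float:
    """Convert an energy reading in joules (Ws) to watt-hours."""
    return joules / JOULES_PER_WATT_HOUR


def average_power_watts(joules: float, elapsed_seconds: float) -> float:
    """Average power over the job, in watts (joules per second)."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return joules / elapsed_seconds


# Hypothetical numbers, not taken from a real job:
print(joules_to_watt_hours(7200))     # 2.0
print(average_power_watts(7200, 60))  # 120.0
```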

I like it, but I'm in the other camp: a pythoneer without a sysadmin background.

$ reportseff -u tcomi --format +tresusageoutave
     JobID    State       Elapsed  TimeEff   CPUEff   MemEff        TRESUsageOutAve
  43967087  COMPLETED    00:52:56   11.0%    97.0%    40.4%        energy=0,fs/disk=1
  43967088  COMPLETED    00:58:09   12.1%    86.3%    39.4%        energy=0,fs/disk=1
  43967089  COMPLETED    00:54:16   11.3%    86.3%    38.4%        energy=0,fs/disk=1

Seems like my system isn't set up to collect that information. I think adding an --energy option would be a good interface.

Can you:

  • tell me which tres usage I should be checking?
  • supply some reportseff --debug --format +tresusage output? I need that for building realistic unit tests.

Thanks for the suggestion!

If there is still interest in this, I wanted to see if anyone with a system configured to collect energy usage could provide sample reportseff --debug output to help build tests.

... sure, there is still strong interest, sorry for the delay. Here is my reportseff output for a small job array:

$ reportseff --debug 37403870 --format +tresusageoutave
32|00:01:09|37403870_1|37403937||1|32000M|COMPLETED||00:02:00|00:47.734
32|00:01:09|37403870_1.batch|37403937.batch|6300K|1||COMPLETED|energy=33,fs/disk=0||00:47.733
32|00:01:09|37403870_1.extern|37403937.extern|4312K|1||COMPLETED|energy=33,fs/disk=0||00:00.001
32|00:01:21|37403870_2|37403938||1|32000M|COMPLETED||00:02:00|00:41.211
32|00:01:21|37403870_2.batch|37403938.batch|6316K|1||COMPLETED|energy=32,fs/disk=0||00:41.210
32|00:01:21|37403870_2.extern|37403938.extern|4312K|1||COMPLETED|energy=32,fs/disk=0||00:00:00
32|00:01:34|37403870_3|37403939||1|32000M|COMPLETED||00:02:00|00:51.669
32|00:01:34|37403870_3.batch|37403939.batch|6184K|1||COMPLETED|energy=30,fs/disk=0||00:51.667
32|00:01:35|37403870_3.extern|37403939.extern|4312K|1||COMPLETED|energy=30,fs/disk=0||00:00.001
32|00:01:11|37403870_4|37403870||1|32000M|COMPLETED||00:02:00|01:38.184
32|00:01:11|37403870_4.batch|37403870.batch|6300K|1||COMPLETED|energy=27,fs/disk=0||01:38.183
32|00:01:11|37403870_4.extern|37403870.extern|4312K|1||COMPLETED|energy=27,fs/disk=0||00:00.001

       JobID    State       Elapsed  TimeEff   CPUEff   MemEff     TRESUsageOutAve
  37403870_1  COMPLETED    00:01:09   57.5%     2.1%     0.0%    energy=33,fs/disk=0
  37403870_2  COMPLETED    00:01:21   67.5%     1.6%     0.0%    energy=32,fs/disk=0
  37403870_3  COMPLETED    00:01:34   78.3%     1.7%     0.0%    energy=30,fs/disk=0
  37403870_4  COMPLETED    00:01:11   59.2%     4.3%     0.0%    energy=27,fs/disk=0
  $
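
For anyone turning the debug lines above into test fixtures, the energy value can be pulled out of the pipe-delimited records with a few lines of Python. This is only a sketch and not how reportseff does it internally; the function names are hypothetical, and the column index 8 is an assumption based on where the TRES field sits in this particular output:

```python
import re
from typing import Optional

# Matches the "energy=<integer>" entry in a comma-separated TRES usage string.
_ENERGY_RE = re.compile(r"(?:^|,)energy=(\d+)")


def extract_energy(tres_field: str) -> Optional[int]:
    """Return the energy value from e.g. 'energy=33,fs/disk=0', or None."""
    match = _ENERGY_RE.search(tres_field)
    return int(match.group(1)) if match else None


def energy_from_debug_line(line: str, tres_column: int = 8) -> Optional[int]:
    """Split one pipe-delimited debug line and read the TRES column.

    tres_column=8 matches the sample output in this thread; other
    --format settings will move the column.
    """
    fields = line.strip().split("|")
    if len(fields) <= tres_column:
        return None
    return extract_energy(fields[tres_column])


batch = "32|00:01:09|37403870_1.batch|37403937.batch|6300K|1||COMPLETED|energy=33,fs/disk=0||00:47.733"
parent = "32|00:01:09|37403870_1|37403937||1|32000M|COMPLETED||00:02:00|00:47.734"
print(energy_from_debug_line(batch))   # 33
print(energy_from_debug_line(parent))  # None
```

Note that the parent job record has an empty TRES field, so only the .batch and .extern steps carry an energy value here.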

Our system is configured to simply collect energy readings from the Intel RAPL (Running Average Power Limit) interface:

$ scontrol show config | grep -i energy
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:a100,gres/gpu:v100
AcctGatherEnergyType    = acct_gather_energy/rapl

That is of course finer-grained than IPMI's power sensors, but it only includes the energy consumed by CPUs and memory, not peripherals (which IPMI would cover) - JFYI.
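
To check whether another cluster is set up the same way, the two settings shown above can be read out of the `scontrol show config` text. A rough sketch, assuming the output has the `Key = value` layout shown here; the function name is made up, and treating `acct_gather_energy/none` as "disabled" is my reading of the Slurm default:

```python
def energy_accounting_enabled(config_text: str) -> bool:
    """Return True if energy is tracked as a TRES and a gather plugin is set."""
    settings = {}
    for line in config_text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()

    tres = settings.get("AccountingStorageTRES", "").split(",")
    plugin = settings.get("AcctGatherEnergyType", "acct_gather_energy/none")
    return "energy" in tres and plugin != "acct_gather_energy/none"


sample = """\
AccountingStorageTRES   = cpu,mem,energy,node,billing,fs/disk,vmem,pages,gres/gpu,gres/gpu:a100,gres/gpu:v100
AcctGatherEnergyType    = acct_gather_energy/rapl
"""
print(energy_accounting_enabled(sample))  # True
```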

Cheers

No worries, it's the time of year full of delays!

As for the interface, I think having it as a formatting option will be better since you can specify any other formatting you want (column width, justification). For your example:

$ reportseff 37403870 --format +energy
       JobID    State       Elapsed  TimeEff   CPUEff   MemEff Energy
  37403870_1  COMPLETED    00:01:09   57.5%     2.1%     0.0%    33
  37403870_2  COMPLETED    00:01:21   67.5%     1.6%     0.0%    32
  37403870_3  COMPLETED    00:01:34   78.3%     1.7%     0.0%    30
  37403870_4  COMPLETED    00:01:11   59.2%     4.3%     0.0%    27

Is that good or do I need to deal with units?

Closed with #22. Decided against adding a unit since it may interfere with sorting or other downstream analysis.
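
To illustrate the sorting concern: once a unit suffix is attached, the column has to be handled as text, and text sorts character by character rather than by magnitude. A small sketch with made-up values (these are not from a real job):

```python
# Hypothetical energy readings, with and without a unit suffix.
with_units = ["33J", "9J", "120J"]
plain = [33, 9, 120]

# Strings compare character by character, so "120J" sorts before "33J".
print(sorted(with_units))  # ['120J', '33J', '9J']

# Plain integers sort by magnitude, which is what downstream tools expect.
print(sorted(plain))       # [9, 33, 120]
```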

Pushing to PyPI now, let me know how it works!