HSF / prmon

Standalone monitor for process resource consumption

Constant update of prmon.json

sciaba opened this issue

It was brought to my attention that a very useful feature PrMon could enable is publishing the most recent metric values to a job monitoring system at regular intervals. This would give a useful hint as to why, for example, a job was killed by a batch system (in which case the entire job output is often lost).
However, this would leave out the aggregate values contained in prmon.json.
Would it make sense to update prmon.json as PrMon runs instead of producing it just at the end? Maybe as a non-default behaviour controlled by a new command line option (as it would probably create a slight additional overhead)?

Cheers,
Andrea

Hi Andrea,

We already do this (to first order): the file is called prmon.json_snapshot, and it is moved to prmon.json only when execution terminates successfully. Isn't this the case for you?

Best,
Serhan

I'm sure it is the case, I simply did not know... Thank you!
Just to be sure, how often is the snapshot generated? Can it be used safely, or is there a danger of ending up with a corrupted snapshot if PrMon is abruptly killed?

The JSON snapshot is generated at every sampling interval (the text file is also updated at the same intervals). I tested this in the past by killing prmon and haven't yet seen a case where either the snapshot JSON or the text file was corrupted (though I suppose it might be possible if something goes horribly wrong). So in general it can be used safely, and I have done so myself in the past, but please let us know if you find otherwise. It'd be interesting to see. Many thanks 😄
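For anyone wanting to consume the file from a job wrapper, here is a minimal Python sketch of the lookup described above: prefer prmon.json if it exists, otherwise fall back to prmon.json_snapshot. The function name `load_prmon_summary` and its parameters are made up for illustration and are not part of prmon. The sketch assumes both files sit in the same directory and makes no assumption about which keys the JSON contains.

```python
import json
from pathlib import Path


def load_prmon_summary(workdir, final_name="prmon.json", snapshot_suffix="_snapshot"):
    """Return (file_name, parsed_json) for the best available prmon summary.

    Hypothetical helper: tries the final file first, then the snapshot.
    Returns (None, None) if neither file exists or parses as JSON.
    """
    workdir = Path(workdir)
    candidates = (
        workdir / final_name,                      # written on successful exit
        workdir / (final_name + snapshot_suffix),  # refreshed every sampling interval
    )
    for path in candidates:
        try:
            with path.open() as fh:
                return path.name, json.load(fh)
        except FileNotFoundError:
            continue  # file not there (yet), try the next candidate
        except json.JSONDecodeError:
            continue  # not expected, but skip an unreadable file
    return None, None
```

A monitoring hook could call this once per reporting interval and forward the returned dictionary as-is; if the job is killed, the last snapshot that was read is the best available aggregate.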

I'm fully satisfied with the answer, thanks to you!

Hi @sciaba - yep, thanks for raising that as an issue, we should make it more public that the snapshot file exists.

I rechecked the code and it's quite robust. Every cycle prmon writes the snapshot to a temporary file and closes it. Then it renames that file to prmon.json_snapshot. The rename ensures that when you open the snapshot file you get something complete: if prmon were killed while writing the temporary file, the rename wouldn't happen.
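To illustrate the pattern (this is not prmon's actual code, which is C++), here is a small Python sketch of write-to-temporary-then-rename. The function name `write_json_atomically` is hypothetical. The sketch assumes the temporary file is created in the same directory as the target, so that the rename stays on one filesystem, and it adds an `fsync` before the rename as an extra precaution that the description above does not mention.

```python
import json
import os
import tempfile


def write_json_atomically(data, target):
    """Write data as JSON to target so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(target))
    # Temporary file lives next to the target so the rename is a same-filesystem move.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp_", suffix=".json")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())  # assumption: flush to disk before renaming
        # Only a fully written and closed file is ever moved into place.
        os.replace(tmp_path, target)
    except BaseException:
        # On failure, clean up the temporary file; the old target is untouched.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
```

If the writer is killed before `os.replace` runs, the target still holds the previous complete snapshot and only a stray temporary file is left behind.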