HSF / prmon

Standalone monitor for process resource consumption

Final values with 0 -> no IO values

elmsheus opened this issue · comments

Hi,

as discussed by email, here is the GitHub issue:

We discussed the following prmon problem a while ago and thought it was
fixed: the final values for the IO entries in the summary JSON file are
set to 0.

Here is a PanDA job of the recent 22.0.18 validation task:
https://bigpanda.cern.ch/job?pandaid=4858116576

The values of rchar, read_bytes, wchar and write_bytes are set to 0 in
memory_monitor_summary.json:
http://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/memory_monitor_summary.json

because the values in the last line of memory_monitor_output.txt are
mostly 0:
https://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/memory_monitor_output.txt

The jobs use the latest stable prmon version, set up by the PanDA pilot
from ALRB via: setupATLAS; lsetup prmon

It does not seem to be site dependent for this task (it happens at
CERN-TO, CERN, CERN-PROD_EOS, TW-FTT, LRZ-LMU_CLOUD and UNI_FREIBURG)
and it affects 952 jobs out of 2054 (so it's pretty bad...).

For all these jobs, 0 values are reported back to PanDA and the
analytics ES cluster in Chicago, so we cannot do a full IO analysis for
many jobs - the individual job plotting is obviously
unaffected: https://bigpanda.cern.ch/prmonplots/4858116576/

Do you want me to open a GitHub issue? Perhaps the logic in prmon has to
be changed for PanDA jobs, since prmon is "killed" too late in the
update cycle?

Cheers, Johannes

I think I have identified the bug here. The stats read from /proc are cumulative, so we reset all counters to zero each cycle, then loop over the PID list and add everything together. This works great until a child process exits, disappears from the PID list, and then we forget the IO numbers for that process. In the ATLAS jobs the real work is done by processes that exit before the end of the job, so all of the children that actually did IO are forgotten about and prmon ends up reporting zero, or close to it.

P.S. In the case where the children exit, but prmon does not do a polling cycle before the end of the job then we do report the correct numbers, because the last values read from the children are effectively remembered internally.
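
To make the mechanism concrete, here is a minimal sketch (not prmon's actual code) of the reset-and-sum pattern described above; the function names and the choice of the wchar counter are just for illustration:

```cpp
// Sketch of the reset-and-sum pattern: counters are zeroed each cycle
// and rebuilt from the current PID list, so the contribution of a child
// that has already exited simply vanishes from the total.
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Read one counter (e.g. "wchar") from /proc/<pid>/io; returns 0 if the
// process has already exited and the file cannot be read.
unsigned long long read_io_counter(int pid, const std::string& key) {
  std::ifstream io("/proc/" + std::to_string(pid) + "/io");
  std::string line;
  while (std::getline(io, line)) {
    std::istringstream ss(line);
    std::string k;
    unsigned long long v;
    if (ss >> k >> v && k == key + ":") return v;
  }
  return 0;
}

// One monitoring cycle: the sum only covers PIDs that are still alive,
// so I/O done by children that have since exited is forgotten.
unsigned long long poll_wchar(const std::vector<int>& pids) {
  unsigned long long total = 0;  // reset every cycle
  for (int pid : pids) total += read_io_counter(pid, "wchar");
  return total;
}
```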

Just an update... things are weirder than I thought. It seems that in the simple case where a parent process forks a child, and that child does some i/o and exits, the parent inherits the i/o in its /proc counters.

This means that

  • In simple cases the defect above did not manifest, as the parent collected all of the child's i/o
  • Remembering all of the child's i/o and summing over both the child's and the parent's i/o values will likely lead to double counting

So diagnosing why the ATLAS jobs are missing i/o (maybe the pilot is now doing a daemon-style double fork for part of the job?) is harder than expected, and the fix is not so clear.
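
For reference, the inheritance behaviour can be checked in isolation with a small standalone program like the sketch below (an illustration only, assuming a Linux /proc filesystem, an arbitrary temporary file name and roughly 1 MB of writes): after the parent reaps the child, the child's i/o shows up in the parent's own /proc/self/io.

```cpp
// Standalone check of the i/o inheritance described above: the child
// writes ~1 MB and exits; once the parent reaps it with wait(), the
// child's wchar is accumulated into the parent's /proc/self/io.
#include <fstream>
#include <iostream>
#include <string>
#include <sys/wait.h>
#include <unistd.h>

// Print the whole of /proc/self/io with a label.
void dump_self_io(const std::string& label) {
  std::ifstream io("/proc/self/io");
  std::cout << "--- " << label << " ---\n" << io.rdbuf();
}

int main() {
  dump_self_io("parent, before fork");

  if (fork() == 0) {
    // Child: do some visible I/O, then exit without returning.
    std::ofstream out("/tmp/io_inherit_test.dat");
    out << std::string(1 << 20, 'x');  // ~1 MB of writes
    out.close();
    _exit(0);
  }

  wait(nullptr);  // reap the child
  // The second dump shows the child's ~1 MB added to the parent's wchar
  // (write_bytes may lag behind because of the page cache).
  dump_self_io("parent, after reaping child");
  return 0;
}
```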

BTW, it's perhaps worth noting that the prmon that is launched by the job transformation to listen to each step separately doesn't show this behavior:

http://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/prmon.summary.RAWtoESD.json

http://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/prmon.summary.ESDtoAOD.json

It's only the instance that's launched by the pilot that seems to have this issue.

Thanks for checking @amete - I was indeed going to look at that. I have asked @PalNilsson for insight into how the pilot is launching the payloads.

The pilot starts prmon just after the singularity container for the payload is launched and "listens" to this process.

In https://bigpanda.cern.ch//media/filebrowser/35939f9e-bbac-4f18-9978-8efb01e7db46/valid3/tarball_PandaJob_4858116576_LRZ-LMU_CLOUD/pilotlog.txt search for the first occurrence of "prmon" - the lines just before are where the pilot identifies the singularity process that was launched just above.

Cheers, Johannes

Oh, I wonder if it's something to do with running in singularity then? Is that ~new? (I'm also thinking we haven't really changed this logic in years.)

If I look at the reproducer that PanDA gives me...

http://pandaserver.cern.ch:25080//server/panda/getScriptOfflineRunning?pandaID=4858116576

There's no mention of singularity there. Is there an easy recipe to start the job in the same way as the pilot does?

We have been running singularity for every PanDA payload since summer 2019 - this is also written in the pilot logs (admittedly one has to search for it).

The singularity launch is hidden in: "...atlasLocalSetup.sh -c $thePlatform ..." in the final command (see below).

Create the following two files, and adjust "/srv" in the commands below (and probably X509_USER_PROXY):

my_release_setup.sh:

source $AtlasSetup/scripts/asetup.sh Athena,22.0.18,notest --platform x86_64-centos7-gcc8-opt --makeflags='$MAKEFLAGS'

container_script.sh:

export TMPDIR=/srv;export GFORTRAN_TMPDIR=/srv;export ATHENA_CORE_NUMBER=10;export ATHENA_PROC_NUMBER=10;export PANDA_RESOURCE='LRZ-LMU_CLOUD';export FRONTIER_ID="[22770742_4858116576]";export CMSSW_VERSION=$FRONTIER_ID;export PandaID=4858116576;export PanDA_TaskID='22770742';export PanDA_AttemptNr='2';export INDS='data18_13TeV:data18_13TeV.00357750.physics_Main.daq.RAW';export RUCIO_APPID='reprocessing';export RUCIO_ACCOUNT='pilot'; Reco_tf.py --inputBSFile=data18_13TeV.00357750.physics_Main.daq.RAW._lb0106._SFO-8._0005.data --multithreaded=True --postExec 'all:from IOVDbSvc.CondDB import conddb; conddb.addOverride("/TRT/Cond/StatusHT", "TrtStrawStatusHT-RUN2-Reprocessing");' 'RAWtoESD:StreamESD.ExtraOutputs = [('"'"'DataHeader'"'"','"'"'StoreGateSvc+StreamESD'"'"')]' --preExec 'all:from AthenaCommon.AthenaCommonFlags import athenaCommonFlags; from InDetRecExample.InDetJobProperties import InDetFlags; from RecExConfig.RecFlags import rec; rec.doTrigger = False; InDetFlags.useDynamicAlignFolders.set_Value_and_Lock(True); InDetFlags.useMuForTRTErrorScaling.set_Value_and_Lock(True if not rec.doRDOTrigger else False); from MuonCombinedRecExample.MuonCombinedRecFlags import muonCombinedRecFlags; from AthenaMonitoring.DQMonFlags import DQMonFlags; DQMonFlags.doMonitoring=False' --autoConfiguration=everything --conditionsTag all:CONDBR2-BLKPA-2018-14 --geometryVersion=default:ATLAS-R2-2016-01-00-01 --runNumber=357750 --AMITag=r12148 --outputAODFile=AOD.22770742._002116.pool.root.1 --outputESDFile=ESD.22770742._002116.pool.root.1 --jobNumber=1177;

final command:

export X509_USER_PROXY=/var/lib/condor/execute/dir_28465/prod;if [ -z "$ATLAS_LOCAL_ROOT_BASE" ]; then export ATLAS_LOCAL_ROOT_BASE=/cvmfs/atlas.cern.ch/repo/ATLASLocalRootBase; fi;export thePlatform="x86_64-centos7-gcc8-opt";export PandaID=4858116576;source ${ATLAS_LOCAL_ROOT_BASE}/user/atlasLocalSetup.sh -c $thePlatform -s /srv/my_release_setup.sh -r /srv/container_script.sh -e "-C"

Cheers, Johannes

Thanks a lot @elmsheus. Can I ask one more favour (you or @amete)? Could you put the input file somewhere I can see it? I just generated a new grid certificate (it has been years since I needed one), but the ATLAS VOMS server is not letting me re-register. I'd like to avoid this rabbit hole, so if I can just get the input file it will speed things up a lot.

@graeme-a-stewart, no problem - I've copied the RAW file to: /afs/cern.ch/user/e/elmsheus/public/forGraeme/data18_13TeV/data18_13TeV.00357750.physics_Main.daq.RAW._lb0106._SFO-8._0005.data

Side note: X509_USER_PROXY should not be needed for the local job execution.

Thanks a lot for the input file @elmsheus. So, I have re-run the job, both plain and in singularity. However, I have not been able to reproduce the problem as seen in the real production job, even when pointing prmon right at the singularity launch command.

As it's not 100% clear what the problem actually is, it's also not completely clear how to fix it properly. I can envisage a monkey patch where prmon remembers the highest i/o values it has seen and never allows the reported values to be less than these (as they have to be monotonically increasing). If it's only a glitch happening at the end of the job, for reasons we still don't quite understand, then it will be a decent workaround, but it's not entirely satisfactory.
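
For illustration, a guard along those lines might look like the sketch below; this is only a conceptual sketch of the monotonic clamp, not the actual change that went into prmon:

```cpp
// Conceptual sketch of the monotonic guard: keep the highest value seen
// for each counter and never let the reported value drop below it.
#include <iostream>
#include <map>
#include <string>

std::map<std::string, unsigned long long> max_seen;

// Clamp a freshly polled counter against the maximum seen so far,
// warning if a drop is detected (e.g. because the children that did
// the i/o have exited and fallen out of the PID list).
unsigned long long clamp_monotonic(const std::string& counter,
                                   unsigned long long polled) {
  unsigned long long& best = max_seen[counter];
  if (polled < best) {
    std::cerr << "prmon: Warning, detected drop in i/o values for "
              << counter << " (" << best << " became " << polled << ")\n";
    return best;  // report the last good value instead
  }
  best = polled;
  return polled;
}
```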

@graeme-a-stewart, thanks again for looking into this. I guess it's most probably a timing/time-out problem at the end of the singularity process, with prmon or the pilot terminating the remaining processes (though I have no real evidence for this...).

The suggested "monkey-patch" makes conceptually sense, since the json file reports the total sum of the I/O numbers and they should be the "last good" values of the txt file - could we go ahead with this ?

Hi @elmsheus - so it's fixed in #174. I will let @amete check it over and then we'll get a new minor release out for you.

Hi @graeme-a-stewart, Thanks a lot for this !

P.S. If prmon detects this problem you'll see messages like this:

prmon: Warning, detected drop in i/o values for wchar (15085652 became 0)

Thinking about this a little more, I realise that there can be a race condition. The list of PIDs is gathered, then passed to each of the different monitors in turn. If the monitored process exits before it can be polled then it's impossible to read the I/O stats and we could end up with a spurious reading of zero.
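
A sketch of how that race could be handled (again only an illustration, not prmon's code): distinguish "could not read /proc/<pid>/io because the process is gone" from a genuine zero, and skip unreadable PIDs instead of folding a spurious 0 into the sum.

```cpp
// Return std::nullopt if /proc/<pid>/io is unreadable (the process has
// exited between gathering the PID list and polling it), rather than
// silently contributing zero.
#include <fstream>
#include <optional>
#include <string>
#include <vector>

std::optional<unsigned long long> try_read_wchar(int pid) {
  std::ifstream io("/proc/" + std::to_string(pid) + "/io");
  if (!io) return std::nullopt;
  std::string key;
  unsigned long long value;
  while (io >> key >> value)
    if (key == "wchar:") return value;
  return std::nullopt;
}

// Sum wchar over a PID list, skipping PIDs that could not be read.
unsigned long long sum_wchar(const std::vector<int>& pids) {
  unsigned long long total = 0;
  for (int pid : pids)
    if (auto v = try_read_wchar(pid)) total += *v;
  return total;
}
```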

Usually this would be really rare, so I have no idea why it would happen to almost 50% of the jobs in this task. So I suspect there is still a different or additional cause.