XENON1T / cax

Simple data management tool

Reduce job submission count

pdeperio opened this issue

We need to reduce the number of short jobs being submitted on Midway. Some possible solutions that may or may not be combined:

  1. Bundling runs (which should be fine since we're not running very long pax processing anymore) so each job runs longer,

  2. Using job arrays to reduce the number of jobs the scheduler has to handle (I think; a sketch combining this with option 1 follows this list),

  3. Running Corrections locally (this seems to be fast now after previous hax improvements) and implementing local checks for intensive processes (e.g. AddChecksum, ProcessBatchQueueHax) before submitting jobs that actually run those tasks.

  4. Adding minitrees to RunsDB, to facilitate the local checking in 3.
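
As a minimal sketch of options 1 and 2 combined, the submission side could write a single SLURM job array in which each element processes a bundle of runs. The sbatch options, file names, and the per-run cax flags below are assumptions for illustration, not the actual cax submission code:

    #!/usr/bin/env python
    """Sketch: bundle runs into one SLURM job array instead of many short jobs."""
    import subprocess

    BUNDLE_SIZE = 10  # runs per array element; an assumed tuning knob

    def submit_bundles(run_names, config='midway'):
        # Write the run list that each array element will slice into.
        with open('runs.txt', 'w') as f:
            f.write('\n'.join(run_names) + '\n')

        n_bundles = (len(run_names) + BUNDLE_SIZE - 1) // BUNDLE_SIZE
        script = '\n'.join([
            '#!/bin/bash',
            '#SBATCH --job-name=cax-bundle',
            '#SBATCH --array=0-%d' % (n_bundles - 1),  # one element per bundle
            '#SBATCH --time=04:00:00',
            # Each element handles BUNDLE_SIZE consecutive lines of runs.txt,
            # so the scheduler sees one array rather than thousands of jobs.
            'START=$((SLURM_ARRAY_TASK_ID * %d))' % BUNDLE_SIZE,
            'for RUN in $(sed -n "$((START + 1)),$((START + %d))p" runs.txt); do'
            % BUNDLE_SIZE,
            '    cax --once --config %s --run "$RUN"  # assumed invocation' % config,
            'done',
        ])
        with open('submit_bundles.sbatch', 'w') as f:
            f.write(script + '\n')
        subprocess.check_call(['sbatch', 'submit_bundles.sbatch'])

With 1000 runs and BUNDLE_SIZE=10, the scheduler would see a single 100-element array instead of 1000 short jobs.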

So, I did a first test on the datamanager, adding the correction tasks together with the checksum, but they take too much time because each task runs over all the runs, and in the meantime the runs are waiting to be verified.
I'll try to create a new cax session in parallel to check whether it works in a reasonable time.

The test was definitely negative! Even with a new cax process, AddElectronLifetime and AddGains take a huge amount of time to run over all the runs. On a single run they are fast, but cax --once --config ... took a very long time.
I think the bottleneck is that each task runs over all the runs before passing to the next task. It might be more efficient to do the opposite: for each run, do all the tasks.
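
For illustration, the two loop orders look roughly like this (tasks, runs, and process() are simplified placeholders, not the actual cax interfaces):

    # Current behaviour (as I understand it): each task scans every run
    # before the next task starts, so early runs sit unverified for a long time.
    for task in tasks:
        for run in runs:
            task.process(run)   # placeholder for the task's per-run work

    # Possibly more efficient: finish every task for one run before moving on,
    # so each run is checked and released as soon as possible.
    for run in runs:
        for task in tasks:
            task.process(run)
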
In any case, we saturate the RAM on the datamanager machine (values in MB):

             total       used       free     shared    buffers     cached
Mem:         20424      20423          0       2649          0      17792
-/+ buffers/cache:       2631      17792 
Swap:         1024       1023          0 

I think this is related to #108 and #114, which we never understood, i.e.:

  1. Why is it looping over all runs per task? I thought it was looping over each task per run.
  2. Why does it skip tasks?

Please review that issue and PR.

Hi, maybe I found a way to stop massive-cax from submitting thousands of useless jobs.

Basically, I added a check on the variables present in the RunDB "processor" field, verifying that all entries of "correction_versions" are present.
Only if that is true does the code generate the script to submit the jobs.
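
As a minimal sketch of that check, assuming a pymongo connection to the runs database and corrections recorded under processor.correction_versions (the URI, database and collection names, and the required-correction list are assumptions):

    import pymongo

    # Correction tasks whose versions must already be in RunDB
    # (task names taken from this thread; adjust as needed).
    REQUIRED = ('AddElectronLifetime', 'AddGains')

    def ready_for_submission(run_doc):
        """True only if every required correction version is recorded."""
        versions = run_doc.get('processor', {}).get('correction_versions', {})
        return all(name in versions for name in REQUIRED)

    client = pymongo.MongoClient('mongodb://runsdb.example:27017')  # placeholder URI
    runs = client['run']['runs_new']  # assumed database/collection names

    to_submit = [doc['name']
                 for doc in runs.find({}, ['name', 'processor'])
                 if ready_for_submission(doc)]
    print('%d runs are ready for job submission' % len(to_submit))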

Of course, another check is whether the processed and minitree files are already present in the local directories on Midway, which also requires knowing where the code is running (i.e. on which host).
I still have to complete the code, but in a first test it works.
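
A sketch of that second check, under assumed Midway paths and a crude hostname test (the directories, run name, and pax version below are placeholders, not the real cax configuration):

    import os
    import socket

    # Assumed Midway locations; the real cax config defines these paths.
    PROCESSED_DIR = '/project/lgrandi/xenon1t/processed'
    MINITREE_DIR = '/project/lgrandi/xenon1t/minitrees'

    def on_midway():
        # Crude host detection via the fully qualified domain name.
        return 'rcc.uchicago.edu' in socket.getfqdn()

    def already_done(run_name, pax_version):
        processed = os.path.join(PROCESSED_DIR, 'pax_' + pax_version,
                                 run_name + '.root')
        minitree = os.path.join(MINITREE_DIR, run_name + '_Basics.root')
        return os.path.exists(processed) and os.path.exists(minitree)

    # Hypothetical run name and pax version, purely for illustration.
    if on_midway() and not already_done('161118_0759', 'v6.5.0'):
        print('Not yet processed here: generate the submission script.')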

@lucrlom I think we can raise the memory on xetransfer for the virtual machine xe1t-datamanager if it helps. In any case, I run two cax-like sessions (massive-cax and massive-ruciax) as the user xe1ttransfer. Each process needs ~12 GB of memory. I haven't yet understood why these processes need so much memory (it seems a lot to me).
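
As a starting point for understanding the ~12 GB, one could periodically sample the resident set size of the two sessions; a minimal sketch with psutil (the process-name matching is an assumption about how the sessions appear in the process list):

    import psutil

    # Print the resident memory of the two cax-like sessions run as xe1ttransfer.
    for proc in psutil.process_iter(['pid', 'username', 'cmdline', 'memory_info']):
        cmdline = ' '.join(proc.info['cmdline'] or [])
        if (proc.info['username'] == 'xe1ttransfer'
                and ('massive-cax' in cmdline or 'massive-ruciax' in cmdline)):
            rss_gb = proc.info['memory_info'].rss / 1024.0 ** 3
            print('%6d  %5.1f GB  %s' % (proc.info['pid'], rss_gb, cmdline))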