OPM / opm-models

The models module for the Open Porous Media Simulation (OPM) framework

Barriers when writing ECL output with tasklets.

blattms opened this issue · comments

I just took a look at the code of EclWriter and could not help but wonder:

Why do we need to wait here until a previous write finishes before dispatching the next write?
Yet there is no barrier in the destructor of EclWriter to make sure that the last write actually finishes before program shutdown. Don't we need that, and have we just been lucky (maybe because of the other barrier) that all writes have finished so far? Or did I miss something?

Found the answer to the 2nd question: TaskletRunner seems to make sure all tasks finish in its destructor (unless there is more than one worker thread, see next comment).

It also seems like there might be issues when the number of worker threads of the TaskletRunner is more than 1. In that case the BarrierTasklet is only executed by one thread, so we are waiting for only one thread. Not a big problem (even less synchronization), but the function name is a bit misleading.

But in the destructor of the TaskletRunner this seems to mean that only one thread will get the TerminateThreadTasklet here that will end its run method. All other threads seem to run forever, and since they are joined, the program should run forever as well.

Ok, finally got that one. There is a reference count in the TaskletInterface, which in the case of the barrier and terminate tasklets is the number of threads. That count is decremented whenever the task is run by a thread; if the count reaches zero, the task is removed from the queue. The only thing I cannot see is how we make sure that each thread runs such a task only once. Otherwise, with two threads, a task could also be run by one thread twice and then removed from the queue, right?

Thanks for looking into this! I think that everything is working as it is supposed to, but I'll try to answer your questions below anyway. Threading-related code is really non-trivial, so don't take this as criticism.

But in the destructor of the TaskletRunner this seems to mean that only one thread will get the TerminateThreadTasklet here that will end its run method.

This is indeed confusing, but correct AFAICS: end markers are not removed from the queue: https://github.com/OPM/ewoms/blob/3a03f35acf78b9aa3b3a954a5254a7c3161e1219/ewoms/parallel/tasklets.hh#L303-L310

The only thing I cannot see is how we make sure that each thread runs a task only once.

We don't; we just guarantee that each job is executed the specified number of times, i.e., it is not guaranteed which thread runs a job. Barriers take advantage of the fact that each worker thread that becomes idle gets assigned the first tasklet in the queue: in essence, they block until all workers are stuck in the condition variable, then consider themselves finished.

Otherwise with two threads a task could also be run by one thread twice and removed from the queue, right?

No: to exclude these kinds of races, std::condition_variable operates on a std::unique_lock, so the queue is only inspected and modified while the mutex is held. (I know this is quite tricky, but at least it is mandated by the STL ;).)

So the barrier is only a barrier if you are lucky,

No, because they are dispatched with refcount = numWorkerThreads. (In sequential mode they are no-ops.)

If you want to play around with this: it is also tested by test_tasklets.cpp.

And why is barrier needed at all in the output layer?

To not swamp the filesystem with I/O requests, because libecl is not thread safe, and to prevent using obscene amounts of RAM on systems with really slow I/O (which are the whole point of asynchronous I/O in the first place).

Sorry for stupidity, but: Doesn't this just make sure that the BarrierTasklet is run as many times as there are threads but does not guarantee that each thread runs it only once?

Did not read the previous answer

to not swamp the filesystem with I/O requests, because libecl is not thread safe, and to prevent using obscene amounts of RAM

There is only one worker thread, and therefore there will always be just one call to I/O. But the barrier prevents us from computing...
I doubt that the simulator needs more RAM than compiling OPM with more than 3 threads does, and by the time we reach the barrier while the output thread is still busy, we already have more than twice the output data in memory.

There is only one worker thread, and therefore there will always be just one call to I/O.

Right, so the thread-safety and filesystem-swamping arguments were indeed wrong. What can still happen, though, is that results accumulate in RAM because they come in faster than the I/O system writes them. Having the result data for about 50 time steps of model 2 in RAM at some point is not a terribly good idea, I think. (In particular because we have to wait for these tasklets to finish before the process terminates anyway.)

But the barrier prevents us from computing...

That's only the case if the writing tasklet from the previous time step has not yet finished. When choosing between sometimes having to wait for a while and potentially running out of RAM, I'll choose the former.

I doubt that the simulator needs more RAM than compiling OPM with more than 3 threads does, and by the time we reach the barrier while the output thread is still busy, we already have more than twice the output data in memory.

That strongly depends on the deck: e.g., SPE-10 needs > 6 GB per process. (The result data to be written per time step is smaller, but not insignificant.) Also, the simulator is not necessarily compiled on the same machine as the one that runs it...

I hope all questions have been answered and no problems have been found here. Please reopen if this is not the case.