microsoft / SparseSC

Fit Sparse Synthetic Control Models in Python


`estimate_effects` sometimes hangs indefinitely when using `n_multi`

dannycunningham-8451 opened this issue

If I have the `n_multi` parameter set for multithreading (for example, `n_multi=os.cpu_count()`) when calling `estimate_effects`, it sometimes hangs indefinitely. I have a dataset where `estimate_effects` typically takes about 10 seconds, but occasionally (maybe 1 in 20 runs) it hangs and I have to kill the process. I haven't been able to figure out why it's happening.

Not urgent, since the workaround is to just run it single-threaded. But I'm curious whether anyone has experienced similar issues or knows why it might be happening.
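For context, here is a minimal sketch of the call pattern being described. The toy data and the `outcomes` / `unit_treatment_periods` argument names are assumptions (check your installed version's `estimate_effects` signature); only the `n_multi` usage comes from the report above.

```python
import os

import numpy as np
import SparseSC

# Toy placeholder data: 100 units observed over 40 periods, 5 of them
# treated starting at period 30. (The argument names and treatment
# encoding here are assumptions -- see the SparseSC docs.)
rng = np.random.default_rng(0)
outcomes = rng.normal(size=(100, 40))
unit_treatment_periods = np.full(100, np.nan)  # NaN = never treated
unit_treatment_periods[:5] = 30

# Multiprocess call that occasionally hangs:
estimates = SparseSC.estimate_effects(
    outcomes,
    unit_treatment_periods,
    n_multi=os.cpu_count(),  # one worker per CPU core
)

# Workaround: leave n_multi at its default to run single-threaded.
estimates = SparseSC.estimate_effects(outcomes, unit_treatment_periods)
```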

Sorry for the late response. I wonder if using multiprocessing is causing Python to allocate more memory than you have RAM, which pushes you into paged memory and is very slow (thousands of times slower for memory-intensive applications like these). Memory requirements for Sparse SC grow much worse than linearly with the size of your data, with the bottleneck being a single matrix-inverse calculation. I hit that issue early on, which is why I developed the Azure Batch utilities, which let you run the jobs in parallel in the cloud.
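To make that scaling concrete, here is a back-of-envelope sketch (my own illustration, not code from SparseSC) of how the storage for a dense n × n matrix, like the one inverted in the bottleneck step, grows with the number of units:

```python
import numpy as np

# A dense n x n float64 matrix needs n^2 * 8 bytes, so doubling the number
# of units roughly quadruples memory use -- and the inverse computation
# needs working copies on top of that. With several worker processes each
# holding their own copies, RAM can be exhausted quickly, at which point
# the OS starts paging and everything slows to a crawl.
def dense_matrix_gb(n, dtype=np.float64):
    return n * n * np.dtype(dtype).itemsize / 1e9

for n in (1_000, 4_000, 16_000):
    print(f"n={n:>6}: {dense_matrix_gb(n):8.2f} GB per matrix copy")
```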

If you (or other readers) go the Azure Batch route, it can be surprisingly affordable when using Azure Spot instances. A couple of things I should probably add to the Azure Batch implementation docs (and will note here for now):

  • If the Docker container that runs the Synthetic Controls tasks on Azure Batch tries to allocate too much memory, it will fail with an exit code of 137 (which you can view in the task logs for failed tasks using Azure Batch Explorer); 137 is 128 + SIGKILL, i.e. the container was killed, typically by the out-of-memory killer. If/when this happens, just delete the pool and create a new one with a larger VM size (see the sketch after this list).

  • Using "Spot / Low Priority" VMs in your pool can reduce your costs significantly. The downside is that if your VM is preempted while your task is running the task will have to be re-started on another VM (which Azure Batch will do for you automatically). For reference, I'm currently fitting a model with ~4000 units and 50 observations, and I'm running it in a pool of 62 Azure Spot / Low Priority "standard_ds13_v2" instances (combined 498 cores and 3472 GB of memory!), and at the current spot prices, it costs me around $3 for the 40 minutes that my job runs (AS LONG AS YOU REMEMBER TO RESCALE THE POOL TO ZERO WHEN YOUR JOB IS DONE). During my most recent run, 7 of my 62 VMs were pre-empted, (though typically none of them are), which made my tasks take about 10 minutes longer to complete than they would have otherwise.
    The possibility of having your VM's preempted is the "cost" of getting (as of this writing) a 90% discount on the VMs. (The discount on Spot VM's change monthly depending on their demand).
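For readers who want to script the two pool operations above, here is a minimal sketch using the `azure-batch` Python SDK. Treat it as an illustration under stated assumptions: the account URL, credentials, pool id, VM sizes, and the Ubuntu image reference are all placeholders, and the image / node-agent SKU available to your Batch account may differ.

```python
import azure.batch.models as batchmodels
from azure.batch import BatchServiceClient
from azure.batch.batch_auth import SharedKeyCredentials

# Placeholder credentials and endpoint -- substitute your own.
credentials = SharedKeyCredentials("mybatchaccount", "<account-key>")
client = BatchServiceClient(
    credentials, batch_url="https://mybatchaccount.eastus.batch.azure.com"
)

POOL_ID = "sparsesc-pool"  # placeholder pool id

# 1) After an exit-code-137 (out-of-memory) failure: delete the pool and
#    recreate it with a larger VM size. Note that pool deletion is
#    asynchronous, so in practice you may need to wait for it to finish
#    before re-adding a pool with the same id.
client.pool.delete(POOL_ID)
client.pool.add(
    batchmodels.PoolAddParameter(
        id=POOL_ID,
        vm_size="standard_ds14_v2",  # placeholder: one size up from ds13_v2
        virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
            # Placeholder image; use whatever your Batch account supports.
            image_reference=batchmodels.ImageReference(
                publisher="canonical",
                offer="0001-com-ubuntu-server-jammy",
                sku="22_04-lts",
            ),
            node_agent_sku_id="batch.node.ubuntu 22.04",
        ),
        target_low_priority_nodes=62,  # Spot / Low Priority nodes
    )
)

# 2) When the job is done: rescale the pool to zero so idle Spot VMs
#    stop accruing charges.
client.pool.resize(
    POOL_ID,
    batchmodels.PoolResizeParameter(
        target_dedicated_nodes=0,
        target_low_priority_nodes=0,
    ),
)
```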