Error when running theano on a cluster
fbartolic opened this issue · comments
Hi,
I'm running a PyMC3 model with Theano 1.0.4
on a cluster running Red Hat Enterprise Linux Server 6.4 (Santiago). After some time, I get the following error:
"error_stackoverflow" 7081L, 101687476C
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/op.py", line 955, in make_thunk
no_recycling)
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/op.py", line 858, in make_c_thunk
output_storage=node_output_storage)
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py", line 1217, in make_thunk
keep_lock=keep_lock)
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py", line 1157, in __compile__
keep_lock=keep_lock)
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py", line 1624, in cthunk_factory
key=key, lnk=self, keep_lock=keep_lock)
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 1155, in module_from_key
module = self._get_from_hash(module_hash, key, keep_lock=keep_lock)
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 1055, in _get_from_hash
key_data.add_key(key, save_pkl=bool(key[0]))
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 519, in add_key
self.save_pkl()
File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 540, in save_pkl
with open(self.key_pkl, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs1/home/fb90/.theano/compiledir_Linux-2.6-el6.Bull.122.x86_64-x86_64-with-redhat-6.4-Santiago-x86_64-3.7.3-64/tmpm9ri1on2/key.pkl'
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "fit_multiple_events.py", line 307, in <module>
run_parallel_analysis(sys.argv[1], sys.argv[2])
File "fit_multiple_events.py", line 292, in run_parallel_analysis
for directory in dirs
File "/home/fb90/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 934, in __call__
self.retrieve()
File "/home/fb90/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "/home/fb90/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "/home/fb90/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 425, in result
return self.__get_result()
File "/home/fb90/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs1/home/fb90/.theano/compiledir_Linux-2.6-el6.Bull.122.x86_64-x86_64-with-redhat-6.4-Santiago-x86_64-3.7.3-64/tmpm9ri1on2/key.pkl'
slurmstepd: task_p_post_term: rmdir(/dev/cpuset/slurm143774/slurm143774.4294967294_0) failed Device or resource busy
Seems related to #5694
Initially I thought this only happened when running the script on multiple nodes, but that doesn't seem to be the case: the jobs end at a random time whether I'm using one node or several. I don't have any issues on Ubuntu or macOS.
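Not a fix from this thread, but since the `FileNotFoundError` points at a shared GPFS compiledir being used by several joblib workers at once, one workaround people use for compiledir races is to give each worker its own `base_compiledir` (via the real `THEANO_FLAGS` environment variable) before Theano is imported. The helper name `unique_compiledir` and the `job_id` argument below are illustrative, not from the issue:

```python
import os
import tempfile


def unique_compiledir(job_id):
    """Point Theano at a private, node-local compile directory.

    Sketch of a workaround for compiledir races on shared filesystems:
    each parallel job gets its own base_compiledir on local /tmp instead
    of all jobs sharing ~/.theano on GPFS. Must run in the worker process
    BEFORE `import theano`, since Theano reads THEANO_FLAGS at import time.
    `job_id` is any string unique to the job (e.g. the SLURM task id).
    """
    path = os.path.join(tempfile.gettempdir(), "theano_compile_%s" % job_id)
    os.makedirs(path, exist_ok=True)
    # base_compiledir is a documented Theano config option; Theano will
    # create a per-platform compiledir underneath it.
    os.environ["THEANO_FLAGS"] = "base_compiledir=%s" % path
    return path
```

With joblib, this would be called at the top of the function passed to `Parallel`, so each spawned worker configures its own directory before any Theano import happens.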
I realize that Theano is no longer maintained; I'm just posting here in case someone has had the same issue.
I am having the same issue and cannot figure out how to fix it.
Theano is unmaintained; https://github.com/aesara-devs/aesara is its successor.