Theano / Theano

Theano was a Python library that allowed you to define, optimize, and evaluate mathematical expressions involving multi-dimensional arrays efficiently. It is being continued as PyTensor: www.github.com/pymc-devs/pytensor

Home Page: https://www.github.com/pymc-devs/pytensor

Error when running theano on a cluster

fbartolic opened this issue

Hi,
I'm running a PyMC3 model with Theano 1.0.4 on a cluster running Red Hat Enterprise Linux Server 6.4 (Santiago). After some time I get the following error:

"error_stackoverflow" 7081L, 101687476C
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/op.py", line 955, in make_thunk
    no_recycling)
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/op.py", line 858, in make_c_thunk
    output_storage=node_output_storage)
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py", line 1217, in make_thunk
    keep_lock=keep_lock)
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py", line 1157, in __compile__
    keep_lock=keep_lock)
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cc.py", line 1624, in cthunk_factory
    key=key, lnk=self, keep_lock=keep_lock)
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 1155, in module_from_key
    module = self._get_from_hash(module_hash, key, keep_lock=keep_lock)
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 1055, in _get_from_hash
    key_data.add_key(key, save_pkl=bool(key[0]))
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 519, in add_key
    self.save_pkl()
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/theano/gof/cmodule.py", line 540, in save_pkl
    with open(self.key_pkl, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs1/home/fb90/.theano/compiledir_Linux-2.6-el6.Bull.122.x86_64-x86_64-with-redhat-6.4-Santiago-x86_64-3.7.3-64/tmpm9ri1on2/key.pkl'
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "fit_multiple_events.py", line 307, in <module>
    run_parallel_analysis(sys.argv[1], sys.argv[2])
  File "fit_multiple_events.py", line 292, in run_parallel_analysis
    for directory in dirs
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 934, in __call__
    self.retrieve()
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/joblib/parallel.py", line 833, in retrieve
    self._output.extend(job.get(timeout=self.timeout))
  File "/home/fb90/anaconda3/lib/python3.7/site-packages/joblib/_parallel_backends.py", line 521, in wrap_future_result
    return future.result(timeout=timeout)
  File "/home/fb90/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/home/fb90/anaconda3/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
FileNotFoundError: [Errno 2] No such file or directory: '/gpfs1/home/fb90/.theano/compiledir_Linux-2.6-el6.Bull.122.x86_64-x86_64-with-redhat-6.4-Santiago-x86_64-3.7.3-64/tmpm9ri1on2/key.pkl'
slurmstepd: task_p_post_term: rmdir(/dev/cpuset/slurm143774/slurm143774.4294967294_0) failed Device or resource busy

Seems related to #5694

Initially I thought that it only happened when running the script on multiple nodes, but that doesn't seem to be the case. The jobs just end at a random time, regardless of whether I'm using one node or several. I don't have any issues on Ubuntu or macOS.

I realize that Theano is no longer maintained; I'm just posting here in case someone else has had the same issue.

I am having the same issue and cannot figure out how to fix it.
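
One thing that might be worth trying, offered as an untested sketch rather than a confirmed fix: the traceback shows every joblib worker writing into the same `~/.theano` compile directory on the shared GPFS home, so the failure may come from processes interfering with each other's temporary directories there. That cause is an assumption; nothing in this thread verifies it. Theano does have a `base_compiledir` config option that can be set through the `THEANO_FLAGS` environment variable, so each worker could be given its own compile directory. The helper name `use_private_compiledir` and the choice of the system temp directory as the root are hypothetical.

```python
import os
import tempfile


def use_private_compiledir(root=None):
    """Give the current process its own Theano compile directory.

    Call this before the first `import theano` (or `import pymc3`) in the
    process, since Theano reads THEANO_FLAGS when it is imported.
    """
    # Assumption: the system temp dir is node-local scratch space.
    # Substitute whatever scratch location your cluster provides.
    root = root or tempfile.gettempdir()
    path = tempfile.mkdtemp(prefix="theano_{}_".format(os.getpid()), dir=root)

    # Keep any flags that are already set, but replace an existing
    # base_compiledir entry so that only one is present.
    existing = os.environ.get("THEANO_FLAGS", "")
    flags = [
        flag
        for flag in existing.split(",")
        if flag and not flag.startswith("base_compiledir=")
    ]
    flags.append("base_compiledir=" + path)
    os.environ["THEANO_FLAGS"] = ",".join(flags)
    return path
```

In a setup like the one above, this would be called at the top of the function that each joblib worker runs, with the Theano or PyMC3 import placed after it inside that function. The trade-off is that each worker starts with an empty cache and recompiles everything, so startup is slower.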

Theano is unmaintained; https://github.com/aesara-devs/aesara is its successor.