Readonly database error on HPC servers
ethanthoma opened this issue
Hello,
I am trying to use tinygrad on an HPC server that uses Slurm. I created a simple virtual environment and submitted the job, but I get this error:
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/decision-transformer/app/__init__.py", line 7, in train
    train_rl();
  File "/decision-transformer/app/train.py", line 94, in train_rl
    loss, accuracy = step(states, actions, returns_to_go, timesteps, targets)
  File "/decision-transformer/app/train.py", line 71, in step
    optim.step()
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/nn/optim.py", line 34, in step
    Tensor.realize(*self.schedule_step())
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/tensor.py", line 201, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 190, in run_schedule
    for ei in lower_schedule(schedule):
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 183, in lower_schedule
    while len(schedule): yield lower_schedule_item(schedule.pop(0))
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 169, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 142, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 60, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/device.py", line 178, in compile_cached
    if self.cachekey is not None: diskcache_put(self.cachekey, src, lib)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/helpers.py", line 217, in diskcache_put
    cur.execute(f"CREATE TABLE IF NOT EXISTS '{table}_{VERSION}' ({ltypes}, val blob, PRIMARY KEY ({', '.join(key.keys())}))")
sqlite3.OperationalError: attempt to write a readonly database
Please let me know what additional info is needed. Thanks!
My job script is:
#!/usr/bin/env bash
#SBATCH --time=16:0:0
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --output=output/%j.out
#SBATCH --cpus-per-task=24
#SBATCH --mem=64G
#SBATCH --gres=gpu:v100:1
#SBATCH --job-name=decision-transformer
lscpu
nvidia-smi
# Load modules for environment
module load gcc/9.4.0
module load python/3.8.10
module load py-virtualenv/16.7.6
virtualenv decision-transformer
source decision-transformer/bin/activate
# run training
CUDA=1 python -c 'import app; app.train()'
This is my pip list:
Package Version
-------------------------- --------
ale-py 0.8.1
AutoROM 0.4.2
AutoROM.accept-rom-license 0.6.1
certifi 2024.7.4
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
Farama-Notifications 0.0.4
gymnasium 0.29.1
idna 3.7
importlib_metadata 8.2.0
importlib_resources 6.4.0
numpy 1.24.4
pip 24.2
requests 2.32.3
setuptools 68.0.0
Shimmy 0.2.1
tinygrad 0.9.1
tqdm 4.66.4
typing_extensions 4.12.2
urllib3 2.2.2
virtualenv 16.7.6
wheel 0.43.0
zipp 3.19.2
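For context: the failing diskcache_put in helpers.py just opens an SQLite database, creates a table, and inserts the compiled value, so the same OperationalError can be reproduced outside tinygrad whenever the process cannot write to the database file. A minimal sketch, using hypothetical temp paths:

import os, sqlite3, tempfile

# Minimal repro of the same error, outside tinygrad: diskcache_put
# boils down to connect -> CREATE TABLE -> insert on cache.db.
dbdir = tempfile.mkdtemp()
dbpath = os.path.join(dbdir, "cache.db")
sqlite3.connect(dbpath).close()  # create an empty database file
os.chmod(dbpath, 0o444)          # readable but not writable
try:
    conn = sqlite3.connect(dbpath)
    conn.execute("CREATE TABLE IF NOT EXISTS t (key TEXT PRIMARY KEY, val BLOB)")
except sqlite3.OperationalError as e:
    print(e)  # -> attempt to write a readonly database

Note that SQLite also needs the containing directory to be writable (it creates journal/lock files next to the database), and its file locking can fail on network filesystems like NFS, so this error can appear even when cache.db itself has rw permissions.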
Is ~/.cache/tinygrad/cache.db writable by the user running the train script?
I used chmod and reran it, but I still get the same error. I confirmed the permissions are set:
(decision-transformer) [user decision-transformer]$ ll ~/.cache/tinygrad/cache.db
-rw-rw-rw- 1 user user 0 Jul 29 20:30 /home/user/.cache/tinygrad/cache.db
My guess is it may be related to how Slurm manages jobs. Is there a way to set the cache to a different location?
You can specify the location using the env var CACHEDB.
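For example, a minimal sketch that redirects the cache to node-local scratch before tinygrad is imported (helpers.py reads CACHEDB at import time; SLURM_TMPDIR is an assumption here, any node-local writable path works):

import os, tempfile

# Point tinygrad's compile cache at node-local storage. This must run
# before anything from tinygrad is imported, because helpers.py reads
# CACHEDB at import time. SLURM_TMPDIR is cluster-specific; fall back
# to the system temp dir if it is not set.
scratch = os.environ.get("SLURM_TMPDIR", tempfile.gettempdir())
os.environ["CACHEDB"] = os.path.join(scratch, "tinygrad_cache.db")

from tinygrad.helpers import CACHEDB  # imported after CACHEDB is set
print(CACHEDB)  # should now point at the node-local path

Equivalently, it is a one-line change in the job script: CACHEDB=$SLURM_TMPDIR/tinygrad_cache.db CUDA=1 python -c 'import app; app.train()'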
I'll try that, thanks!
I changed the variable and it works. I think it's because the virtual environment is shared between nodes. Thank you!