Readonly database error on HPC servers

ethanthoma opened this issue · comments


I am trying to use tinygrad on an HPC server that uses Slurm. I created a simple virtual environment and submitted the job, but I get this error:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/decision-transformer/app/__init__.py", line 7, in train
  File "/decision-transformer/app/train.py", line 94, in train_rl
    loss, accuracy = step(states, actions, returns_to_go, timesteps, targets)
  File "/decision-transformer/app/train.py", line 71, in step
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/nn/optim.py", line 34, in step
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/tensor.py", line 201, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 190, in run_schedule
    for ei in lower_schedule(schedule):
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 183, in lower_schedule
    while len(schedule): yield lower_schedule_item(schedule.pop(0))
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 169, in lower_schedule_item
    runner = get_runner(si.outputs[0].device, si.ast)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 142, in get_runner
    method_cache[ckey] = method_cache[bkey] = ret = CompiledRunner(replace(prg, dname=dname))
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/engine/realize.py", line 60, in __init__
    self.lib:bytes = precompiled if precompiled is not None else Device[p.dname].compiler.compile_cached(p.src)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/device.py", line 178, in compile_cached
    if self.cachekey is not None: diskcache_put(self.cachekey, src, lib)
  File "/decision-transformer/decision-transformer/lib/python3.8/site-packages/tinygrad/helpers.py", line 217, in diskcache_put
    cur.execute(f"CREATE TABLE IF NOT EXISTS '{table}_{VERSION}' ({ltypes}, val blob, PRIMARY KEY ({', '.join(key.keys())}))")
sqlite3.OperationalError: attempt to write a readonly database

Please let me know what additional info is needed. Thanks!

My job script is:

#!/usr/bin/env bash

#SBATCH --time=16:0:0
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --output=output/%j.out
#SBATCH --cpus-per-task=24
#SBATCH --mem=64G
#SBATCH --gres=gpu:v100:1
#SBATCH --job-name=decision-transformer


# Load modules for environment
module load gcc/9.4.0
module load python/3.8.10
module load py-virtualenv/16.7.6

virtualenv decision-transformer

source decision-transformer/bin/activate

# run training
CUDA=1 python -c 'import app; app.train()'

This is my pip list:

Package                    Version
-------------------------- --------
ale-py                     0.8.1
AutoROM                    0.4.2
AutoROM.accept-rom-license 0.6.1
certifi                    2024.7.4
charset-normalizer         3.3.2
click                      8.1.7
cloudpickle                3.0.0
Farama-Notifications       0.0.4
gymnasium                  0.29.1
idna                       3.7
importlib_metadata         8.2.0
importlib_resources        6.4.0
numpy                      1.24.4
pip                        24.2
requests                   2.32.3
setuptools                 68.0.0
Shimmy                     0.2.1
tinygrad                   0.9.1
tqdm                       4.66.4
typing_extensions          4.12.2
urllib3                    2.2.2
virtualenv                 16.7.6
wheel                      0.43.0
zipp                       3.19.2

Is ~/.cache/tinygrad/cache.db writable by the user running the train script?

I used chmod and reran it but still get the same error. I confirmed the perms are set:

(decision-transformer) [user decision-transformer]$ ll ~/.cache/tinygrad/cache.db
-rw-rw-rw- 1 user user 0 Jul 29 20:30 /home/user/.cache/tinygrad/cache.db
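
File permissions alone may not settle this. SQLite also creates journal files next to the database, so the directory that holds cache.db has to be writable from the compute node where the job runs, not just from the login node. The listing above also shows a 0-byte file, so nothing has been written to it yet. Below is a rough check that could be added to the job script; `can_write_dir` is a hypothetical helper written for this sketch, and the default path is the one mentioned in this thread.

```shell
# Hypothetical helper: succeeds only if a file can actually be created in the directory.
can_write_dir() {
  probe="$1/.write_probe.$$"
  if touch "$probe" 2>/dev/null; then
    rm -f "$probe"
    return 0
  fi
  return 1
}

# Use CACHEDB if it is set, otherwise the default location from this thread.
db="${CACHEDB:-$HOME/.cache/tinygrad/cache.db}"
db_dir="$(dirname "$db")"

# Print the node name so the result can be tied to where the job ran.
hostname
if can_write_dir "$db_dir"; then
  echo "writable: $db_dir"
else
  echo "NOT writable: $db_dir"
fi
```

If this reports the directory as not writable inside the job but writable on the login node, the filesystem is probably mounted differently on the compute nodes.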

My guess is that it may be related to how Slurm manages jobs. Is there a way to set the cache to a different location?

You can specify the cache location with the CACHEDB environment variable.
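
For a Slurm job, one way to do this is to point CACHEDB at a per-job directory before the Python command runs. This is only a sketch: it assumes the compute node has a writable temporary directory (taken from `TMPDIR`, falling back to `/tmp`), and the `tinygrad-cache-...` directory name is made up for illustration.

```shell
# Assumption: TMPDIR (or /tmp) is writable on the compute node.
# SLURM_JOB_ID is set by Slurm inside a job; fall back to the shell PID otherwise.
cache_dir="${TMPDIR:-/tmp}/tinygrad-cache-${SLURM_JOB_ID:-$$}"
mkdir -p "$cache_dir"

# tinygrad reads the cache database path from CACHEDB.
export CACHEDB="$cache_dir/cache.db"
echo "tinygrad cache: $CACHEDB"
```

These lines would go in the job script above the existing `CUDA=1 python -c 'import app; app.train()'` line. A per-job cache is not reused between runs, so kernels are recompiled each time; a fixed writable path on scratch storage would avoid that if one is available.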

I'll try that, thank you.

I changed the variable and it works. I think it's because the virtual environment is shared between nodes. Thank you!