Metadata has hardcoded paths which prevent training from being run
chaitjo opened this issue
Thanks for the great repository!
I've been unable to run training after setting up the repository, as there seem to be hardcoded paths from which the datamodule loads preprocessed data that do not exist on my system.
Here's an example output:
$ python -W ignore experiments/train_se3_flows.py
[2024-02-08 17:50:16,382][__main__][INFO] - Checkpoints saved to ckpt/se3-fm/baseline/2024-02-08_17-49-56
[2024-02-08 17:50:16,436][__main__][INFO] - Using devices: [0]
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
[2024-02-08 17:50:17,002][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2024-02-08 17:50:17,003][torch.distributed.distributed_c10d][INFO] - Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 1 processes
----------------------------------------------------------------------------------------------------
[2024-02-08 17:50:18,461][data.pdb_dataloader][INFO] - Training: 3938 examples
[2024-02-08 17:50:18,531][data.pdb_dataloader][INFO] - Validation: 40 examples with lengths [ 20 38 53 68 83 98 113 128]
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
------------------------------------
0 | model | FlowModel | 16.7 M
------------------------------------
16.7 M Trainable params
0 Non-trainable params
16.7 M Total params
66.984 Total estimated model params size (MB)
Sanity Checking: 0it [00:00, ?it/s]Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2ymza1.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2ymza1.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2ymza1.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2xw6a1.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2xw6a1.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2xw6a1.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2hewf1.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2hewf1.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2hewf1.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2xw6a1.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2xw6a1.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2xw6a1.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2hewf1.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2hewf1.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2hewf1.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2ymza1.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2ymza1.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2ymza1.pkl'
Failed to read /data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d5uj5a1.pkl. First error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d5uj5a1.pkl'
Second error: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d5uj5a1.pkl'
Error executing job with overrides: []
Traceback (most recent call last):
File "/home/ckj24/protein-frame-flow/experiments/train_se3_flows.py", line 97, in main
exp.train()
File "/home/ckj24/protein-frame-flow/experiments/train_se3_flows.py", line 72, in train
trainer.fit(
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 532, in fit
call._call_and_handle_interrupt(
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/trainer/call.py", line 42, in _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
return function(*args, **kwargs)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 571, in _fit_impl
self._run(model, ckpt_path=ckpt_path)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 980, in _run
results = self._run_stage()
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1021, in _run_stage
self._run_sanity_check()
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/trainer/trainer.py", line 1050, in _run_sanity_check
val_loop.run()
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/loops/utilities.py", line 181, in _decorator
return loop_run(self, *args, **kwargs)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/loops/evaluation_loop.py", line 108, in run
batch, batch_idx, dataloader_idx = next(data_fetcher)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 137, in __next__
self._fetch_next_batch(self.dataloader_iter)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/loops/fetchers.py", line 151, in _fetch_next_batch
batch = next(iterator)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 285, in __next__
out = next(self._iterator)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/pytorch_lightning/utilities/combined_loader.py", line 123, in __next__
out = next(self.iterators[0])
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 633, in __next__
data = self._next_data()
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1345, in _next_data
return self._process_data(data)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 1371, in _process_data
data.reraise()
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/torch/_utils.py", line 644, in reraise
raise exception
FileNotFoundError: Caught FileNotFoundError in DataLoader worker process 0.
Original Traceback (most recent call last):
File "/home/ckj24/protein-frame-flow/data/utils.py", line 195, in read_pkl
with open(read_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 308, in _worker_loop
data = fetcher.fetch(index)
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ckj24/miniforge-pypy3/envs/fm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 51, in <listcomp>
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/home/ckj24/protein-frame-flow/data/pdb_dataloader.py", line 157, in __getitem__
chain_feats = self._process_csv_row(processed_file_path)
File "/home/ckj24/protein-frame-flow/data/pdb_dataloader.py", line 119, in _process_csv_row
processed_feats = du.read_pkl(processed_file_path)
File "/home/ckj24/protein-frame-flow/data/utils.py", line 200, in read_pkl
raise(e)
File "/home/ckj24/protein-frame-flow/data/utils.py", line 191, in read_pkl
with open(read_path, 'rb') as handle:
FileNotFoundError: [Errno 2] No such file or directory: '/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d2voua2.pkl'
Here's what preprocessed/metadata.csv looks like:
pdb_name,processed_path,raw_path,num_chains,quaternary_category,seq_len,modeled_seq_len,coil_percent,helix_percent,strand_percent,radius_gyration
d1hp1a2,/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d1hp1a2.pkl,/data/rsg/chemistry/jyim/large_data/scope/d1hp1a2.pdb,1,homomer,328,328,0.45121951219512196,0.2682926829268293,0.2804878048780488,1.9195774410000415
d1w25a2,/data/rsg/chemistry/jyim/projects/flow-matching/preprocessed/d1w25a2.pkl,/data/rsg/chemistry/jyim/large_data/scope/d1w25a2.pdb,1,homomer,153,153,0.43137254901960786,0.4117647058823529,0.1568627450980392,1.64664663551551
...
I worked around this by changing all the hardcoded paths in metadata.csv to relative paths pointing at the preprocessed data, which fixed the issue and let training run. However, the maintainers may want to fix this in a subsequent release so that the metadata works out of the box.
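For anyone hitting the same error, here is a minimal sketch of that workaround. It is not code from the repository: the helper names relocate_paths and rewrite_metadata are hypothetical, and it assumes only the processed_path and raw_path columns shown in the metadata excerpt above, plus that the .pkl files sit in a directory you pass in. Each file name is kept and only the directory prefix is swapped.

```python
import csv
import os


def relocate_paths(rows, new_processed_dir, new_raw_dir=None):
    """Return copies of metadata rows with their directory prefixes swapped.

    Hypothetical helper: keeps each file name and replaces only the
    directory it lives in. raw_path is left alone unless new_raw_dir is given.
    """
    relocated = []
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not mutated
        row["processed_path"] = os.path.join(
            new_processed_dir, os.path.basename(row["processed_path"])
        )
        if new_raw_dir is not None and row.get("raw_path"):
            row["raw_path"] = os.path.join(
                new_raw_dir, os.path.basename(row["raw_path"])
            )
        relocated.append(row)
    return relocated


def rewrite_metadata(csv_in, csv_out, new_processed_dir, new_raw_dir=None):
    """Read a metadata CSV, relocate its paths, and write the result."""
    # Read everything first so csv_in and csv_out may be the same file.
    with open(csv_in, newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)

    rows = relocate_paths(rows, new_processed_dir, new_raw_dir)

    with open(csv_out, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Running rewrite_metadata("preprocessed/metadata.csv", "preprocessed/metadata.csv", "preprocessed") from the repository root should leave processed_path entries such as preprocessed/d1hp1a2.pkl. Relative paths resolve against the working directory, so training would then need to be launched from the same place; passing an absolute directory avoids that constraint.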
Yeah, this is a bug. I plan to release code for motif-scaffolding, at which point I will update the datasets and metadata. Thanks for pointing this out so that I remember to fix it.