deAlmeida22 re-analysis error
BartTheeuwes opened this issue · comments
First off, thank you for creating this package, I'm hoping it will speed up some of the analysis that I am planning to do. As a trial run, I was hoping to re-analyse the deAlmeida22 data in the same way as in the original paper. However, I'm running into multiple issues along the way.
Here is the code that I tried to run to analyse the data and the output describing the error. (downloading the data using seqdatasets.deAlmeida22() failed to correctly load in the sequence data, hence the strange workaround I implemented)
# Change this to where you would like to save all your results
import os
os.chdir("/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR") # TODO: change this to your own directory
cwd = os.getcwd()
cwd
# Configure EUGENe directories, if you do not set these, EUGENe will use the default directories
from eugene import settings
settings.logging_dir = "./tutorial_logs" # Directory where EUGENe will save Tensorboard training logs and model checkpoints to
settings.output_dir = "./tutorial_output" # Directory where EUGENe will save output files to
#### Data
import seqdatasets
sdata_train = seqdatasets.deAlmeida22(dataset='train')
sdata_val = seqdatasets.deAlmeida22(dataset='val')
sdata_test = seqdatasets.deAlmeida22(dataset='test')
from eugene import preprocess as pp
# Something goes wrong with the ZARR creation so have to manually change the seq folder in the zarr... no idea why this goes wrong
import seqdata as sd
sdata = sd.read_flat_fasta(
name="seq",
out='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/deAlmeida22_train.zarr',
fasta='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/Sequences_Train.fa',
batch_size=128,
fixed_length=True,
overwrite=True
)
sdata = sd.read_flat_fasta(
name="seq",
out='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/deAlmeida22_val.zarr',
fasta='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/Sequences_Val.fa',
batch_size=128,
fixed_length=True,
overwrite=True
)
sdata = sd.read_flat_fasta(
name="seq",
out='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/deAlmeida22_test.zarr',
fasta='/mnt/scratchc/bggroup/bt392/eugene_practice/Tutorial/DeepSTARR/eugene_data/deAlmeida22/Sequences_Test.fa',
batch_size=128,
fixed_length=True,
overwrite=True
)
sdata_train = seqdatasets.deAlmeida22(dataset='train')
sdata_val = seqdatasets.deAlmeida22(dataset='val')
sdata_test = seqdatasets.deAlmeida22(dataset='test')
# One hot encode all the sequences in the sdata using the wrapper function
pp.ohe_seqs_sdata(sdata_train, alphabet="DNA")
pp.ohe_seqs_sdata(sdata_val, alphabet="DNA")
pp.ohe_seqs_sdata(sdata_test, alphabet="DNA")
# Make unique ids for each sequence in the sdata
pp.make_unique_ids_sdata(sdata_train)
pp.make_unique_ids_sdata(sdata_val)
pp.make_unique_ids_sdata(sdata_test)
#########
## CANT FIGURE OUT HOW TO COMBINE TRAIN AND VAL & ADD COLUMN SO JUST SPLITTING TRAIN UP AGAIN
#########
# Split the training set into training and validation
pp.train_test_random_split(sdata_train, dim="_sequence", train_var="train_val", test_size=0.1)
model = zoo.DeepSTARR(
input_len=249,
output_dim=2
)
#### Model
from eugene.models import zoo
from eugene import models
model = models.SequenceModule(
arch=model,
task="regression",
loss_fxn= "mse",
scheduler='reduce_lr_on_plateau',
optimizer="adam",
metric="spearman",
optimizer_lr=0.002
)
models.init_weights(model)
from eugene import train
import torch
train.fit_sequence_module(
model=model,
sdata=sdata_train,
seq_var="ohe_seq",
target_vars=["Hk_log2_enrichment", "Dev_log2_enrichment"],
in_memory=True,
train_var="train_val",
epochs=10,
batch_size=128,
num_workers=2,
prefetch_factor=2,
drop_last=False,
name="DeepSTARR",
version="Trial5",
log_dir=settings.logging_dir,
transforms={"ohe_seq": lambda x: x.swapaxes(1, 2)}, #"target": lambda x: torch.tensor(x, dtype=torch.float32)}
# early_stopping_metric = "val_loss_epoch",
# early_stopping_patience = 10
)
This resulted in the following error:
Dropping 0 sequences with NaN targets.
Loading ohe_seq and ['Hk_log2_enrichment', 'Dev_log2_enrichment'] into memory
No seed set
/Users/theeuw01/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/plugins/environments/slurm.py:191: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /Users/theeuw01/miniconda3/envs/eugene_cuda/lib/pyth ...
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/Users/theeuw01/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/plugins/environments/slurm.py:191: The `srun` command is available on your system but is not used. HINT: If your intention is to run Lightning on SLURM, prepend your python command with `srun` like so: srun python /Users/theeuw01/miniconda3/envs/eugene_cuda/lib/pyth ...
2023-12-10 17:29:58.588960: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2023-12-10 17:29:58.589003: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2023-12-10 17:29:58.590824: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2023-12-10 17:29:58.600109: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-10 17:29:59.952313: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
| Name | Type | Params
--------------------------------------------------
0 | arch | DeepSTARR | 7.7 M
1 | train_metric | SpearmanCorrCoef | 0
2 | val_metric | SpearmanCorrCoef | 0
3 | test_metric | SpearmanCorrCoef | 0
--------------------------------------------------
7.7 M Trainable params
0 Non-trainable params
7.7 M Total params
30.745 Total estimated model params size (MB)
Sanity Checking DataLoader 0: 100%|██████████| 2/2 [00:00<00:00, 4.94it/s]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In[33], line 4
1 from eugene import train
2 import torch
----> 4 train.fit_sequence_module(
5 model=model,
6 sdata=sdata_train,
7 seq_var="ohe_seq",
8 target_vars=["Hk_log2_enrichment", "Dev_log2_enrichment"],
9 in_memory=True,
10 train_var="train_val",
11 epochs=10,
12 batch_size=128,
13 num_workers=2,
14 prefetch_factor=2,
15 drop_last=False,
16 name="DeepSTARR",
17 version="Trial5",
18 log_dir=settings.logging_dir,
19 transforms={"ohe_seq": lambda x: x.swapaxes(1, 2)}, #"target": lambda x: torch.tensor(x, dtype=torch.float32)}
20 # early_stopping_metric = "val_loss_epoch",
21 # early_stopping_patience = 10
22 )
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/eugene/train/_fit.py:273, in fit_sequence_module(model, sdata, seq_var, target_vars, in_memory, train_var, epochs, gpus, batch_size, num_workers, prefetch_factor, transforms, drop_last, logger, log_dir, name, version, early_stopping_metric, early_stopping_patience, early_stopping_verbose, model_checkpoint_k, model_checkpoint_monitor, seed, return_trainer, **kwargs)
270 name = name if name is not None else model_name
272 # Use the above fit function
--> 273 trainer = fit(
274 model=model,
275 train_dataloader=train_dataloader,
276 val_dataloader=val_dataloader,
277 epochs=epochs,
278 gpus=gpus,
279 logger=logger,
280 log_dir=log_dir,
281 name=name,
282 version=version,
283 early_stopping_metric=early_stopping_metric,
284 early_stopping_patience=early_stopping_patience,
285 early_stopping_verbose=early_stopping_verbose,
286 model_checkpoint_k=model_checkpoint_k,
287 model_checkpoint_monitor=model_checkpoint_monitor,
288 seed=seed,
289 return_trainer=return_trainer,
290 **kwargs,
291 )
293 if return_trainer:
294 return trainer
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/eugene/train/_fit.py:123, in fit(model, train_dataloader, val_dataloader, epochs, gpus, logger, log_dir, name, version, early_stopping_metric, early_stopping_patience, early_stopping_verbose, model_checkpoint_k, model_checkpoint_monitor, seed, return_trainer, **kwargs)
113 trainer = Trainer(
114 max_epochs=epochs,
115 logger=logger,
(...)
119 **kwargs,
120 )
122 # Fit
--> 123 trainer.fit(
124 model, train_dataloaders=train_dataloader, val_dataloaders=val_dataloader
125 )
126 if return_trainer:
127 return trainer
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:544, in Trainer.fit(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
542 self.state.status = TrainerStatus.RUNNING
543 self.training = True
--> 544 call._call_and_handle_interrupt(
545 self, self._fit_impl, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path
546 )
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py:44, in _call_and_handle_interrupt(trainer, trainer_fn, *args, **kwargs)
42 if trainer.strategy.launcher is not None:
43 return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
---> 44 return trainer_fn(*args, **kwargs)
46 except _TunerExitException:
47 _call_teardown_hook(trainer)
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:580, in Trainer._fit_impl(self, model, train_dataloaders, val_dataloaders, datamodule, ckpt_path)
573 assert self.state.fn is not None
574 ckpt_path = self._checkpoint_connector._select_ckpt_path(
575 self.state.fn,
576 ckpt_path,
577 model_provided=True,
578 model_connected=self.lightning_module is not None,
579 )
--> 580 self._run(model, ckpt_path=ckpt_path)
582 assert self.state.stopped
583 self.training = False
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:989, in Trainer._run(self, model, ckpt_path)
984 self._signal_connector.register_signal_handlers()
986 # ----------------------------
987 # RUN THE TRAINER
988 # ----------------------------
--> 989 results = self._run_stage()
991 # ----------------------------
992 # POST-Training CLEAN UP
993 # ----------------------------
994 log.debug(f"{self.__class__.__name__}: trainer tearing down")
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1033, in Trainer._run_stage(self)
1031 if self.training:
1032 with isolate_rng():
-> 1033 self._run_sanity_check()
1034 with torch.autograd.set_detect_anomaly(self._detect_anomaly):
1035 self.fit_loop.run()
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py:1062, in Trainer._run_sanity_check(self)
1059 call._call_callback_hooks(self, "on_sanity_check_start")
1061 # run eval step
-> 1062 val_loop.run()
1064 call._call_callback_hooks(self, "on_sanity_check_end")
1066 # reset logger connector
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/utilities.py:182, in _no_grad_context.<locals>._decorator(self, *args, **kwargs)
180 context_manager = torch.no_grad
181 with context_manager():
--> 182 return loop_run(self, *args, **kwargs)
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py:141, in _EvaluationLoop.run(self)
139 self._restarting = False
140 self._store_dataloader_outputs()
--> 141 return self.on_run_end()
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py:253, in _EvaluationLoop.on_run_end(self)
250 self.trainer._logger_connector._evaluation_epoch_end()
252 # hook
--> 253 self._on_evaluation_epoch_end()
255 logged_outputs, self._logged_outputs = self._logged_outputs, [] # free memory
256 # include any logged outputs on epoch_end
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/loops/evaluation_loop.py:331, in _EvaluationLoop._on_evaluation_epoch_end(self)
328 call._call_callback_hooks(trainer, hook_name)
329 call._call_lightning_module_hook(trainer, hook_name)
--> 331 trainer._logger_connector.on_epoch_end()
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:187, in _LoggerConnector.on_epoch_end(self)
185 def on_epoch_end(self) -> None:
186 assert self._first_loop_iter is None
--> 187 metrics = self.metrics
188 self._progress_bar_metrics.update(metrics["pbar"])
189 self._callback_metrics.update(metrics["callback"])
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py:226, in _LoggerConnector.metrics(self)
224 on_step = self._first_loop_iter is not None
225 assert self.trainer._results is not None
--> 226 return self.trainer._results.metrics(on_step)
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/pytorch_lightning/trainer/connectors/logger_connector/result.py:492, in _ResultCollection.metrics(self, on_step)
490 # populate progress_bar metrics. convert tensors to numbers
491 if result_metric.meta.prog_bar:
--> 492 metrics["pbar"][forked_name] = convert_tensors_to_scalars(value)
494 return metrics
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/utilities/apply_func.py:128, in convert_tensors_to_scalars(data)
123 raise ValueError(
124 f"The metric `{value}` does not contain a single element, thus it cannot be converted to a scalar."
125 )
126 return value.item()
--> 128 return apply_to_collection(data, Tensor, to_item)
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_utilities/core/apply_func.py:64, in apply_to_collection(data, dtype, function, wrong_dtype, include_none, allow_frozen, *args, **kwargs)
62 # fast path for the most common cases:
63 if isinstance(data, dtype): # single element
---> 64 return function(data, *args, **kwargs)
65 if isinstance(data, list) and all(isinstance(x, dtype) for x in data): # 1d homogeneous list
66 return [function(x, *args, **kwargs) for x in data]
File ~/miniconda3/envs/eugene_cuda/lib/python3.9/site-packages/lightning_fabric/utilities/apply_func.py:123, in convert_tensors_to_scalars.<locals>.to_item(value)
121 def to_item(value: Tensor) -> Union[int, float, bool]:
122 if value.numel() != 1:
--> 123 raise ValueError(
124 f"The metric `{value}` does not contain a single element, thus it cannot be converted to a scalar."
125 )
126 return value.item()
ValueError: The metric `tensor([-0.0243, -0.0024], device='cuda:0')` does not contain a single element, thus it cannot be converted to a scalar.
Thanks for trying out the package!
This stems from an annoying nuance in using "spearman"
as the metric in the SequenceModule (SpearmanCorrCoef
). It won't automatically average the metric across tasks like R2Score
does, so PyTorch Lightning complains when you give it a 2D Tensor of correlations to log.
After playing with it for a bit, I wasn't able to get a workaround for using "spearman" with the current release on PyPI. For now, I'd recommend just training with "r2score"
as the passed in metric.
model = models.SequenceModule(
arch=model,
task="regression",
loss_fxn= "mse",
scheduler='reduce_lr_on_plateau',
optimizer="adam",
metric="r2score",
optimizer_lr=0.002
)
The metric doesn't affect the fitting of the model, but you won't get to see how Spearman changes across training. You can always calculate it on the test set post-hoc and I'm going to clean up how metrics are handled in a future release so this isn't an issue.
I was able to reproduce the issue with the DeepSTARR zarr files. There's a bug that I will patch and you can install SeqDatasets from source if you want to go that route (this line shouldn't exist: (https://github.com/ML4GLand/SeqDatasets/blob/main/seqdatasets/_datasets.py#L318C13-L318C49). However, your workaround should work just fine.
Sorry about the difficulty with concatenating objects. I love XArray, but it's not the most intuitive to use. This is how I would concatenate two SeqDatas:
import xarray as xr
sdata_train["train_val"] = True
sdata_val["train_val"] = False
sdata_training = xr.concat([sdata_train, sdata_val], dim="_sequence")
Thank you Adam,
After looking around on the ML4GLand I came across the use_cases repository. It seems like these are really useful tutorials, I'm running it for the DeepSTARR using eugene now, using the following notebooks: https://github.com/ML4GLand/use_cases/tree/main/DeepSTARR/eugene
There are some minor errors in the code that I'm tweaking one by one. The first 2 notebooks now run fine, I'm testing the attribution analysis now. I noticed that the original Eugene documentation does not link to the use_cases repository, I think it might be very helpful for new users to have tutorial more easily accessible.
Once again, thank you for creating this package.
Great, glad you found those useful. Some were created with an earlier version of the package and do need to be updated. If you end up with a set of working notebooks, feel free to submit a pull request! :)
That's a great suggestion, I will add the link to the landing page of the documentation.
Going to close this for now, but feel free to reopen if you run into any other sticking points!