TabTransformer CUDA issue
duncanmcelfresh opened this issue
duncanmcelfresh commented
The error occurs on the following datasets:
- openml__Amazon_employee_access__34539
- openml__PhishingWebsites__14952
- openml__analcatdata_dmft__3560
- openml__breast-cancer__145799
- openml__car__146821
- openml__connect-4__146195
- openml__dna__167140
- openml__kr-vs-kp__3
- openml__primary-tumor__146032
- openml__soybean__41
- openml__splice__45
- openml__tic-tac-toe__49
traceback:
Traceback (most recent call last):
  File "/home/shared/tabzilla/TabSurvey/tabzilla_experiment.py", line 137, in __call__
    result = cross_validation(model, self.dataset, self.time_limit)
  File "/home/shared/tabzilla/TabSurvey/tabzilla_utils.py", line 236, in cross_validation
    loss_history, val_loss_history = curr_model.fit(
  File "/home/shared/tabzilla/TabSurvey/models/tabtransformer.py", line 120, in fit
    loss.backward()
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
duncanmcelfresh commented
Update: this is a nasty bug. There are a handful of discussions on Stack Exchange and in other GitHub repos trying to diagnose this "CUDA error: invalid configuration argument" error.
It is also an intermittent (dataset-dependent) bug: it occurs on the datasets listed in the original post but not on many others (e.g., "openml__credit-approval__29" is fine).
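For anyone trying to narrow this down: the traceback suggests running with CUDA_LAUNCH_BLOCKING=1, so that kernel launches are synchronous and the reported stack frame points closer to the failing call. Below is a minimal sketch of setting it from Python. The helper name `enable_cuda_launch_blocking` is hypothetical (it is not part of TabZilla or PyTorch), and it assumes the variable needs to be in place before CUDA is initialized, which is why it checks whether `torch` has already been imported.

```python
import os
import sys


def enable_cuda_launch_blocking() -> bool:
    """Set CUDA_LAUNCH_BLOCKING=1 for the current process.

    Hypothetical helper, for illustration only.
    Returns True if torch had not yet been imported when the variable
    was set, i.e. the setting is in place before CUDA can be initialized.
    """
    torch_already_imported = "torch" in sys.modules
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
    return not torch_already_imported


# Call this at the very top of the entry script, before any torch import.
set_early = enable_cuda_launch_blocking()
if not set_early:
    print("warning: torch was imported first; the setting may not take effect")
```

The same effect can be had by prefixing the usual launch command with `CUDA_LAUNCH_BLOCKING=1` in the shell. Expect training to run slower while it is enabled, since each kernel launch blocks until it completes.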