naszilla / tabzilla

TabTransformer CUDA issue

duncanmcelfresh opened this issue

occurs on datasets:

  • openml__Amazon_employee_access__34539
  • openml__PhishingWebsites__14952
  • openml__analcatdata_dmft__3560
  • openml__breast-cancer__145799
  • openml__car__146821
  • openml__connect-4__146195
  • openml__dna__167140
  • openml__kr-vs-kp__3
  • openml__primary-tumor__146032
  • openml__soybean__41
  • openml__splice__45
  • openml__tic-tac-toe__49

traceback:

Traceback (most recent call last):
  File "/home/shared/tabzilla/TabSurvey/tabzilla_experiment.py", line 137, in __call__
    result = cross_validation(model, self.dataset, self.time_limit)
  File "/home/shared/tabzilla/TabSurvey/tabzilla_utils.py", line 236, in cross_validation
    loss_history, val_loss_history = curr_model.fit(
  File "/home/shared/tabzilla/TabSurvey/models/tabtransformer.py", line 120, in fit
    loss.backward()
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
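
as the message notes, CUDA kernel launches are asynchronous, so the Python frame that raises is often not the one that launched the failing kernel. a minimal sketch of the suggested debugging setup (the env var must be set before CUDA is first initialized; the tensor op below is just a placeholder, not the actual TabTransformer code):

```python
import os

# Must be set before torch initializes CUDA (i.e., before the first CUDA call),
# so every kernel launch runs synchronously and errors surface at the real op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

if torch.cuda.is_available():
    x = torch.randn(8, 4, device="cuda", requires_grad=True)  # placeholder op
    x.sum().backward()  # with blocking launches, a bad kernel raises here
```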

update - this is a nasty bug. there are a handful of discussions on Stack Exchange and in other GitHub repos trying to diagnose this "CUDA error: invalid configuration argument" error.

this is also an intermittent bug - it occurs on the datasets listed in the original post, but not on many others (e.g., "openml__credit-approval__29" runs fine).
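
a cheaper way to localize which backward step fails, without paying the global slowdown of CUDA_LAUNCH_BLOCKING=1, is to synchronize right after each backward call. this is a hedged sketch, not tabzilla code - `checked_backward` is a hypothetical helper you could drop in place of the `loss.backward()` call at models/tabtransformer.py:120 from the traceback:

```python
import torch

def checked_backward(loss: torch.Tensor) -> None:
    # Run backward, then force a device sync so any asynchronously reported
    # CUDA launch error is raised here, tied to this specific batch/step.
    # Debugging aid only: synchronizing every step slows training down.
    loss.backward()
    if loss.is_cuda:
        torch.cuda.synchronize()
```

with the sync in place, the exception should point at the actual failing batch, which makes it easier to inspect the input shapes that trigger the bad launch configuration.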