naszilla / tabzilla

TabTransformer CUDA issue

duncanmcelfresh opened this issue

occurs on datasets:

  • openml__Amazon_employee_access__34539
  • openml__PhishingWebsites__14952
  • openml__analcatdata_dmft__3560
  • openml__breast-cancer__145799
  • openml__car__146821
  • openml__connect-4__146195
  • openml__dna__167140
  • openml__kr-vs-kp__3
  • openml__primary-tumor__146032
  • openml__soybean__41
  • openml__splice__45
  • openml__tic-tac-toe__49

traceback:

Traceback (most recent call last):
  File "/home/shared/tabzilla/TabSurvey/tabzilla_experiment.py", line 137, in __call__
    result = cross_validation(model, self.dataset, self.time_limit)
  File "/home/shared/tabzilla/TabSurvey/tabzilla_utils.py", line 236, in cross_validation
    loss_history, val_loss_history = curr_model.fit(
  File "/home/shared/tabzilla/TabSurvey/models/tabtransformer.py", line 120, in fit
    loss.backward()
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/_tensor.py", line 363, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/opt/conda/envs/torch/lib/python3.10/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: invalid configuration argument
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
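
as the message notes, CUDA kernel launches are asynchronous, so the Python frame that raises is often not the one that launched the failing kernel. a minimal sketch of the suggested debugging setup (the env var must be set before CUDA is first initialized; the tensor op below is just a placeholder, not the actual TabTransformer code):

```python
import os

# Must be set before torch initializes CUDA (i.e., before the first CUDA call),
# so every kernel launch runs synchronously and errors surface at the real op.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

if torch.cuda.is_available():
    x = torch.randn(8, 4, device="cuda", requires_grad=True)  # placeholder op
    x.sum().backward()  # with blocking launches, a bad kernel raises here
```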

update - this is a nasty bug. there are a handful of discussions on Stack Exchange and in other GitHub repos trying to diagnose this "CUDA error: invalid configuration argument" error.

this is also an intermittent bug - it occurs on the datasets listed in the original post, but not on many others (e.g., "openml__credit-approval__29" runs fine).
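
a cheaper way to localize which backward step fails, without paying the global slowdown of CUDA_LAUNCH_BLOCKING=1, is to synchronize right after each backward call. this is a hedged sketch, not tabzilla code - `checked_backward` is a hypothetical helper you could drop in place of the `loss.backward()` call at models/tabtransformer.py:120 from the traceback:

```python
import torch

def checked_backward(loss: torch.Tensor) -> None:
    # Run backward, then force a device sync so any asynchronously reported
    # CUDA launch error is raised here, tied to this specific batch/step.
    # Debugging aid only: synchronizing every step slows training down.
    loss.backward()
    if loss.is_cuda:
        torch.cuda.synchronize()
```

with the sync in place, the exception should point at the actual failing batch, which makes it easier to inspect the input shapes that trigger the bad launch configuration.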