GoogleCloudPlatform / practical-ml-vision-book

Some notebooks cause GPU OOM with TF2.6

enakai00 opened this issue

Hi, I found that some notebooks cause GPU OOM with TF2.6. This does not happen with TF2.3.

Steps to reproduce:

  1. Launch a Vertex AI notebook from the menu "TensorFlow Enterprise 2.6 (with LTS) -> With 1 NVIDIA Tesla T4" (leaving the other options at their defaults).
  2. Execute the notebook: https://github.com/GoogleCloudPlatform/practical-ml-vision-book/blob/master/02_ml_models/02b_neural_network.ipynb
  3. At the following cell,
model = train_and_evaluate(batch_size=32, lrate=0.0001, l1=0, l2=0, num_hidden=128)

you will see the following OOM error.

2021-12-03 08:48:43.805276: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 73.50MiB (rounded to 77070336)requested by op Mul
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 

The full error message is here.

My apologies. I found that this has nothing to do with the TF version. It happened because I had multiple notebooks open at the same time.

I will close this issue, but I will send a pull request for README.md that adds an explicit note telling users to shut down the kernels of other notebooks to avoid OOM.
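
For readers who need to keep more than one notebook kernel running, a possible mitigation (not the fix proposed in this issue, which is to shut down the other kernels) is to ask TensorFlow to allocate GPU memory on demand. By default a TensorFlow process reserves nearly all of the GPU's memory as soon as it initializes the device, so a second kernel finds little left. The sketch below is an assumption about how one might set this up: the helper name `enable_gpu_memory_growth` is hypothetical, and the call has to run at the top of each notebook, before any op touches the GPU.

```python
import importlib.util


def enable_gpu_memory_growth():
    """Hypothetical helper: switch every visible GPU to on-demand allocation.

    Returns the number of GPUs that were configured. Returns 0 when
    TensorFlow is not installed or no GPU is visible.
    """
    if importlib.util.find_spec("tensorflow") is None:
        return 0

    import tensorflow as tf

    configured = 0
    for gpu in tf.config.list_physical_devices("GPU"):
        try:
            # Must run before the GPU is initialized; otherwise
            # TensorFlow raises RuntimeError.
            tf.config.experimental.set_memory_growth(gpu, True)
            configured += 1
        except RuntimeError:
            # The device was already initialized in this process, so the
            # setting can no longer be changed. Restart the kernel first.
            break
    return configured


if __name__ == "__main__":
    print("GPUs set to memory growth:", enable_gpu_memory_growth())
```

Memory growth only stops a kernel from reserving memory it does not use. Two notebooks that each really need most of the T4's memory will still hit OOM, so shutting down idle kernels remains the more reliable option.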