tensorflow / serving

A flexible, high-performance serving system for machine learning models

Home Page: https://www.tensorflow.org/serving

Docker container does not crash or get killed/aborted on GPU out of memory; it only prints errors!

spate141 opened this issue

Bug Report

Describe the problem

How should I restart or reload a model running inside a TF Serving Docker container when it runs out of GPU memory? When the model goes OOM, the container does not crash; it only prints errors. I want the process to crash so that I can reload the model with a cron/bash script. If I cannot tell that the model has stopped working, how do I know when to restart it? I already have a service script that starts the Docker container at system startup, and I want something similar that reloads the model once I know it has crashed.

  • When the model goes OOM, the health check API still returns 200 OK. It should not!

curl --request GET 'http://0.0.0.0:6007/v1/models/taxonomy'

{
    "model_version_status": [
        {
            "version": "1",
            "state": "AVAILABLE",
            "status": {
                "error_code": "OK",
                "error_message": ""
            }
        }
    ]
}
  • The API above should not return a 200 OK response when the model running inside the Docker container has gone OOM.
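To make the limitation concrete, here is a minimal Python sketch that parses the status response quoted above. The helper names `model_states` and `looks_available` are hypothetical and introduced only for illustration. As reported in this issue, the payload still reads AVAILABLE/OK while the model is out of memory, so a check like this cannot be the only health signal.

```python
import json

# The status response quoted in this report.
SAMPLE = """
{
    "model_version_status": [
        {
            "version": "1",
            "state": "AVAILABLE",
            "status": {"error_code": "OK", "error_message": ""}
        }
    ]
}
"""


def model_states(status_payload: str):
    """Return (version, state, error_code) for each version in a status response."""
    doc = json.loads(status_payload)
    return [
        (v["version"], v["state"], v["status"]["error_code"])
        for v in doc.get("model_version_status", [])
    ]


def looks_available(status_payload: str) -> bool:
    """True if at least one version reports AVAILABLE with error_code OK."""
    return any(
        state == "AVAILABLE" and code == "OK"
        for _, state, code in model_states(status_payload)
    )


if __name__ == "__main__":
    print(model_states(SAMPLE))
    print(looks_available(SAMPLE))
```

Because the status endpoint reflects whether the servable was loaded, not whether inference currently succeeds, a real inference request is a more reliable probe.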

Batching Documentation:

  • The last major update to the TF-Serving batching documentation was made in 2018, and since then we have seen an influx of large models running on TF-Serving. It would be nice to have an updated guide to how batching works, with examples that improve throughput, latency, or both.

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • TensorFlow Serving installed from (source or binary): binary
  • TensorFlow Serving version: tensorflow/serving:2.8.0-gpu

Exact Steps to Reproduce

Load any model with tensorflow/serving:2.8.0-gpu and send it a large batch of inputs to classify. Once the model starts generating out-of-memory errors, you can see that the server does not crash; it only keeps printing those errors.

Logs:

2022-10-06 18:58:38.750542: I tensorflow_serving/model_servers/server_core.cc:465] Adding/updating models.
2022-10-06 18:58:38.750627: I tensorflow_serving/model_servers/server_core.cc:594]  (Re-)adding model: taxonomy
2022-10-06 18:58:38.918394: I tensorflow_serving/core/basic_manager.cc:740] Successfully reserved resources to load servable {name: taxonomy version: 1}
2022-10-06 18:58:38.918426: I tensorflow_serving/core/loader_harness.cc:66] Approving load for servable version {name: taxonomy version: 1}
2022-10-06 18:58:38.918447: I tensorflow_serving/core/loader_harness.cc:74] Loading servable version {name: taxonomy version: 1}
2022-10-06 18:58:38.918497: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:43] Reading SavedModel from: /models/taxonomy/1
2022-10-06 18:58:59.281162: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:78] Reading meta graph with tags { serve }
2022-10-06 18:58:59.281207: I external/org_tensorflow/tensorflow/cc/saved_model/reader.cc:119] Reading SavedModel debug info (if present) from: /models/taxonomy/1
2022-10-06 18:58:59.475622: I external/org_tensorflow/tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-10-06 18:59:00.880752: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-06 18:59:05.394605: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-06 18:59:05.395301: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-06 19:00:11.262985: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-06 19:00:11.263648: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-06 19:00:11.264206: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:936] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2022-10-06 19:00:11.264763: I external/org_tensorflow/tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 13795 MB memory:  -> device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5
2022-10-06 19:00:24.566042: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:228] Restoring SavedModel bundle.
2022-10-06 19:29:30.346777: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:212] Running initialization op on SavedModel bundle at path: /models/taxonomy/1
2022-10-06 19:29:35.348343: I external/org_tensorflow/tensorflow/cc/saved_model/loader.cc:301] SavedModel load for tags { serve }; Status: success: OK. Took 1856429847 microseconds.
2022-10-06 19:29:36.323289: I tensorflow_serving/servables/tensorflow/saved_model_bundle_factory.cc:162] Wrapping session to perform batch processing
2022-10-06 19:29:36.323342: I tensorflow_serving/servables/tensorflow/bundle_factory_util.cc:65] Wrapping session to perform batch processing
2022-10-06 19:29:36.323438: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:71] Starting to read warmup data for model at /models/taxonomy/1/assets.extra/tf_serving_warmup_requests with model-warmup-options 
2022-10-06 19:31:34.905796: I external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8100
2022-10-06 19:35:21.087528: I tensorflow_serving/servables/tensorflow/saved_model_warmup_util.cc:122] Finished reading warmup data for model at /models/taxonomy/1/assets.extra/tf_serving_warmup_requests. Number of warmup records read: 1. Elapsed time (microseconds): 344764098.
2022-10-06 19:35:21.091480: I tensorflow_serving/core/loader_harness.cc:87] Successfully loaded servable version {name: taxonomy version: 1}
2022-10-06 19:35:21.092675: I tensorflow_serving/model_servers/server_core.cc:486] Finished adding/updating models
2022-10-06 19:35:21.094343: I tensorflow_serving/model_servers/server.cc:133] Using InsecureServerCredentials
2022-10-06 19:35:21.094363: I tensorflow_serving/model_servers/server.cc:391] Profiler service is enabled
2022-10-06 19:35:21.260532: I tensorflow_serving/model_servers/server.cc:417] Running gRPC ModelServer at 0.0.0.0:8500 ...
[warn] getaddrinfo: address family for nodename not supported
[evhttp_server.cc : 245] NET_LOG: Entering the event loop ...
2022-10-06 19:35:21.262967: I tensorflow_serving/model_servers/server.cc:438] Exporting HTTP/REST API at:localhost:8501 ...
2022-10-06 19:35:37.572271: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:462] Allocator (GPU_0_bfc) ran out of memory trying to allocate 19.53MiB (rounded to 20480000)requested by op model/model_merged_t9/bilstm_0_G8_SG0_T9/forward_lstm_0_G8_SG0_T9/CudnnRNNV2
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation. 
Current allocation summary follows.
Current allocation summary follows.
2022-10-06 19:35:37.572927: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1010] BFCAllocator dump for GPU_0_bfc
2022-10-06 19:35:37.572955: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (256): 	Total Chunks: 2655, Chunks in use: 2655. 663.8KiB allocated for chunks. 663.8KiB in use in bin. 662.0KiB client-requested in use in bin.
2022-10-06 19:35:37.572967: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (512): 	Total Chunks: 2924, Chunks in use: 2924. 1.43MiB allocated for chunks. 1.43MiB in use in bin. 1.27MiB client-requested in use in bin.
2022-10-06 19:35:37.572977: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1024): 	Total Chunks: 2666, Chunks in use: 2665. 2.62MiB allocated for chunks. 2.61MiB in use in bin. 2.61MiB client-requested in use in bin.
2022-10-06 19:35:37.572991: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2048): 	Total Chunks: 58, Chunks in use: 58. 159.2KiB allocated for chunks. 159.2KiB in use in bin. 152.9KiB client-requested in use in bin.
2022-10-06 19:35:37.573004: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4096): 	Total Chunks: 640, Chunks in use: 640. 2.64MiB allocated for chunks. 2.64MiB in use in bin. 2.63MiB client-requested in use in bin.
2022-10-06 19:35:37.573018: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8192): 	Total Chunks: 98, Chunks in use: 98. 1.08MiB allocated for chunks. 1.08MiB in use in bin. 1.07MiB client-requested in use in bin.
2022-10-06 19:35:37.573034: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16384): 	Total Chunks: 278, Chunks in use: 278. 7.57MiB allocated for chunks. 7.57MiB in use in bin. 7.57MiB client-requested in use in bin.
2022-10-06 19:35:37.573048: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (32768): 	Total Chunks: 1288, Chunks in use: 1288. 63.03MiB allocated for chunks. 63.03MiB in use in bin. 62.88MiB client-requested in use in bin.
2022-10-06 19:35:37.573060: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (65536): 	Total Chunks: 1850, Chunks in use: 1850. 132.49MiB allocated for chunks. 132.49MiB in use in bin. 122.73MiB client-requested in use in bin.
2022-10-06 19:35:37.573075: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (131072): 	Total Chunks: 738, Chunks in use: 738. 151.49MiB allocated for chunks. 151.49MiB in use in bin. 151.49MiB client-requested in use in bin.
2022-10-06 19:35:37.573090: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (262144): 	Total Chunks: 2065, Chunks in use: 2065. 687.08MiB allocated for chunks. 687.08MiB in use in bin. 655.84MiB client-requested in use in bin.
2022-10-06 19:35:37.573106: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (524288): 	Total Chunks: 816, Chunks in use: 816. 502.73MiB allocated for chunks. 502.73MiB in use in bin. 406.69MiB client-requested in use in bin.
2022-10-06 19:35:37.573122: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (1048576): 	Total Chunks: 75, Chunks in use: 75. 101.29MiB allocated for chunks. 101.29MiB in use in bin. 101.29MiB client-requested in use in bin.
2022-10-06 19:35:37.573132: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (2097152): 	Total Chunks: 1880, Chunks in use: 1880. 5.96GiB allocated for chunks. 5.96GiB in use in bin. 5.95GiB client-requested in use in bin.
2022-10-06 19:35:37.573147: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (4194304): 	Total Chunks: 138, Chunks in use: 138. 778.58MiB allocated for chunks. 778.58MiB in use in bin. 513.43MiB client-requested in use in bin.
2022-10-06 19:35:37.573160: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (8388608): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-10-06 19:35:37.573172: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (16777216): 	Total Chunks: 276, Chunks in use: 276. 5.14GiB allocated for chunks. 5.14GiB in use in bin. 5.13GiB client-requested in use in bin.
2022-10-06 19:35:37.573184: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (33554432): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-10-06 19:35:37.573192: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (67108864): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-10-06 19:35:37.573205: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (134217728): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-10-06 19:35:37.573218: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1017] Bin (268435456): 	Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2022-10-06 19:35:37.573229: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1033] Bin for 19.53MiB was 16.00MiB, Chunk State: 
2022-10-06 19:35:37.573243: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1046] Next region of size 14465892352
2022-10-06 19:35:37.573258: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c000000 of size 1280 next 1
2022-10-06 19:35:37.573270: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c000500 of size 512 next 2
2022-10-06 19:35:37.573283: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c000700 of size 512 next 3
2022-10-06 19:35:37.573290: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c000900 of size 512 next 4
2022-10-06 19:35:37.573307: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c000b00 of size 512 next 5
2022-10-06 19:35:37.573312: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c000d00 of size 512 next 6
2022-10-06 19:35:37.573323: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c000f00 of size 512 next 7
2022-10-06 19:35:37.573335: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c001100 of size 512 next 8
2022-10-06 19:35:37.573345: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c001300 of size 512 next 9
2022-10-06 19:35:37.573354: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c001500 of size 512 next 10
2022-10-06 19:35:37.573362: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c001700 of size 512 next 11
2022-10-06 19:35:37.573370: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c001900 of size 1644800 next 12
2022-10-06 19:35:37.573385: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c193200 of size 512 next 13
2022-10-06 19:35:37.573395: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c193400 of size 828928 next 14
2022-10-06 19:35:37.573407: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c25da00 of size 16384 next 15
2022-10-06 19:35:37.573420: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c261a00 of size 2816 next 16
2022-10-06 19:35:37.573430: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c262500 of size 14592 next 17
2022-10-06 19:35:37.573441: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c265e00 of size 333312 next 18
2022-10-06 19:35:37.573457: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c2b7400 of size 9728 next 19
2022-10-06 19:35:37.573471: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c2b9a00 of size 8192 next 20
2022-10-06 19:35:37.573478: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c2bba00 of size 1177344 next 21
2022-10-06 19:35:37.573485: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c3db100 of size 675584 next 22
2022-10-06 19:35:37.573492: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c480000 of size 674304 next 23
2022-10-06 19:35:37.573500: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c524a00 of size 585728 next 24
2022-10-06 19:35:37.573510: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5b3a00 of size 11776 next 25
2022-10-06 19:35:37.573517: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5b6800 of size 4096 next 26
2022-10-06 19:35:37.573525: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5b7800 of size 512 next 27
2022-10-06 19:35:37.573533: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5b7a00 of size 262144 next 28
2022-10-06 19:35:37.573541: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5f7a00 of size 512 next 29
2022-10-06 19:35:37.573549: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5f7c00 of size 512 next 30
2022-10-06 19:35:37.573557: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5f7e00 of size 512 next 31
2022-10-06 19:35:37.573564: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c5f8000 of size 262144 next 32
2022-10-06 19:35:37.573572: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c638000 of size 262144 next 33
2022-10-06 19:35:37.573580: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c678000 of size 262144 next 34
2022-10-06 19:35:37.573588: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c6b8000 of size 262144 next 35
2022-10-06 19:35:37.573596: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c6f8000 of size 7424 next 36
2022-10-06 19:35:37.573604: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c6f9d00 of size 15872 next 37
2022-10-06 19:35:37.573611: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c6fdb00 of size 1559040 next 38
2022-10-06 19:35:37.573620: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c87a500 of size 14848 next 39
2022-10-06 19:35:37.573628: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c87df00 of size 2560 next 40
2022-10-06 19:35:37.573635: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c87e900 of size 4096 next 41
2022-10-06 19:35:37.573644: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382c87f900 of size 1808640 next 42
2022-10-06 19:35:37.573656: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382ca39200 of size 311808 next 43
2022-10-06 19:35:37.573663: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382ca85400 of size 1612032 next 44
2022-10-06 19:35:37.573675: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cc0ed00 of size 5376 next 45
2022-10-06 19:35:37.573685: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cc10200 of size 1083904 next 46
2022-10-06 19:35:37.573696: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cd18c00 of size 897536 next 47
2022-10-06 19:35:37.573713: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cdf3e00 of size 6656 next 48
2022-10-06 19:35:37.573725: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cdf5800 of size 12800 next 49
2022-10-06 19:35:37.573738: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cdf8a00 of size 2048 next 50
2022-10-06 19:35:37.573757: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cdf9200 of size 2816 next 51
2022-10-06 19:35:37.573769: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cdf9d00 of size 1317120 next 52
2022-10-06 19:35:37.573787: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cf3b600 of size 512 next 53
2022-10-06 19:35:37.573799: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382cf3b800 of size 1004544 next 54
2022-10-06 19:35:37.573807: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d030c00 of size 4864 next 55
2022-10-06 19:35:37.573817: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d031f00 of size 137216 next 56
2022-10-06 19:35:37.573834: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d053700 of size 512 next 57
2022-10-06 19:35:37.573845: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d053900 of size 512 next 58
2022-10-06 19:35:37.573853: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d053b00 of size 1036800 next 59
2022-10-06 19:35:37.573864: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d150d00 of size 1280 next 60
2022-10-06 19:35:37.573873: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d151200 of size 809728 next 61
2022-10-06 19:35:37.573884: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d216d00 of size 1765632 next 62
2022-10-06 19:35:37.573894: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d3c5e00 of size 10496 next 63
2022-10-06 19:35:37.573905: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d3c8700 of size 1650944 next 64
2022-10-06 19:35:37.573912: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d55b800 of size 283648 next 65
2022-10-06 19:35:37.573919: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d5a0c00 of size 447744 next 66
2022-10-06 19:35:37.573932: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d60e100 of size 15616 next 67
2022-10-06 19:35:37.573940: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d611e00 of size 6912 next 68
2022-10-06 19:35:37.573947: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d613900 of size 11776 next 69
2022-10-06 19:35:37.573955: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d616700 of size 577536 next 70
2022-10-06 19:35:37.573962: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d6a3700 of size 5376 next 71
2022-10-06 19:35:37.573972: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d6a4c00 of size 14336 next 72
2022-10-06 19:35:37.573982: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d6a8400 of size 745472 next 73
2022-10-06 19:35:37.573988: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d75e400 of size 1414400 next 74
2022-10-06 19:35:37.574000: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d8b7900 of size 220160 next 75
2022-10-06 19:35:37.574011: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d8ed500 of size 297984 next 76
2022-10-06 19:35:37.574019: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d936100 of size 12544 next 77
2022-10-06 19:35:37.574026: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d939200 of size 447232 next 78
2022-10-06 19:35:37.574033: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d9a6500 of size 8704 next 79
2022-10-06 19:35:37.574040: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382d9a8700 of size 527872 next 80
2022-10-06 19:35:37.574048: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382da29500 of size 2048 next 81
2022-10-06 19:35:37.574057: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382da29d00 of size 6400 next 82
2022-10-06 19:35:37.574065: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382da2b600 of size 143104 next 83
2022-10-06 19:35:37.574072: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382da4e500 of size 4608 next 84
2022-10-06 19:35:37.574083: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382da4f700 of size 17664 next 85
2022-10-06 19:35:37.574090: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382da53c00 of size 1172992 next 86
2022-10-06 19:35:37.574097: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382db72200 of size 12032 next 87
2022-10-06 19:35:37.574109: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382db75100 of size 3840 next 88
2022-10-06 19:35:37.574118: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382db76000 of size 3328 next 89
2022-10-06 19:35:37.574126: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382db76d00 of size 1737984 next 90
2022-10-06 19:35:37.574137: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382dd1f200 of size 760320 next 91
2022-10-06 19:35:37.574148: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382ddd8c00 of size 1301504 next 92
2022-10-06 19:35:37.574157: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382df16800 of size 8192 next 93
2022-10-06 19:35:37.574165: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382df18800 of size 584192 next 94
2022-10-06 19:35:37.574172: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382dfa7200 of size 1595392 next 95
2022-10-06 19:35:37.574180: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e12ca00 of size 3328 next 96
2022-10-06 19:35:37.574188: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e12d700 of size 12288 next 97
2022-10-06 19:35:37.574196: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e130700 of size 1792 next 98
2022-10-06 19:35:37.574202: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e130e00 of size 6912 next 99
2022-10-06 19:35:37.574209: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e132900 of size 1380352 next 100
2022-10-06 19:35:37.574221: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e283900 of size 2304 next 101
2022-10-06 19:35:37.574231: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e284200 of size 957952 next 102
2022-10-06 19:35:37.574246: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e36e000 of size 7168 next 103
2022-10-06 19:35:37.574256: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e36fc00 of size 229376 next 104
2022-10-06 19:35:37.574270: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e3a7c00 of size 710144 next 105
2022-10-06 19:35:37.574282: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e455200 of size 2560 next 106
2022-10-06 19:35:37.574297: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e455c00 of size 498688 next 107
2022-10-06 19:35:37.574309: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e4cf800 of size 1978368 next 108
2022-10-06 19:35:37.574317: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e6b2800 of size 2816 next 109
2022-10-06 19:35:37.574329: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e6b3300 of size 1327872 next 110
2022-10-06 19:35:37.574342: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e7f7600 of size 424448 next 111
2022-10-06 19:35:37.574355: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e85f000 of size 361984 next 112
2022-10-06 19:35:37.574368: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e8b7600 of size 13312 next 113
2022-10-06 19:35:37.574381: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e8baa00 of size 15104 next 114
2022-10-06 19:35:37.574393: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e8be500 of size 6656 next 115
2022-10-06 19:35:37.574401: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e8bff00 of size 911360 next 116
2022-10-06 19:35:37.574411: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e99e700 of size 3072 next 117
2022-10-06 19:35:37.574419: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e99f300 of size 14592 next 118
2022-10-06 19:35:37.574426: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e9a2c00 of size 349952 next 119
2022-10-06 19:35:37.574433: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382e9f8300 of size 1369088 next 120
2022-10-06 19:35:37.574446: I external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:1066] InUse at 7f382eb46700 of size 179200 next 121
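Since the process keeps running, one way to detect the failure is to watch the log for the allocator warning shown above. The following is a minimal sketch, assuming the warning text keeps the form seen in this log; the pattern and the function name `find_oom_events` are illustrative and not part of TF Serving.

```python
import re

# Matches the BFC allocator warning as it appears in the log above, e.g.
# "Allocator (GPU_0_bfc) ran out of memory trying to allocate 19.53MiB".
OOM_PATTERN = re.compile(
    r"Allocator \((?P<allocator>[^)]+)\) ran out of memory "
    r"trying to allocate (?P<size>[\d.]+\s*[KMGT]?i?B)"
)


def find_oom_events(lines):
    """Return (allocator, requested_size) for every OOM warning in the given lines."""
    events = []
    for line in lines:
        match = OOM_PATTERN.search(line)
        if match:
            events.append((match.group("allocator"), match.group("size")))
    return events


if __name__ == "__main__":
    import sys

    # Example: feed container logs on stdin and report any OOM warnings found.
    for allocator, size in find_oom_events(sys.stdin):
        print(f"OOM on {allocator} while allocating {size}")
```

A supervisor script could read the container's log stream through this function and trigger a restart on the first match.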

@spate141,

Can you try using the rest_api_timeout_in_ms argument to time out HTTP/REST API calls?

You can also try to batch requests for better throughput. Thank you!
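For the batching suggestion, TF Serving reads server-side batching settings from a text-format file passed with `--enable_batching` and `--batching_parameters_file`. Below is a minimal sketch that renders such a file; the field names are the standard batching parameters, while the numeric values and the helper name `render_batching_parameters` are placeholders for illustration, not tuned recommendations for this model.

```python
def render_batching_parameters(
    max_batch_size: int,
    batch_timeout_micros: int,
    max_enqueued_batches: int,
    num_batch_threads: int,
) -> str:
    """Render a batching parameters file in the text format TF Serving expects."""
    fields = {
        "max_batch_size": max_batch_size,
        "batch_timeout_micros": batch_timeout_micros,
        "max_enqueued_batches": max_enqueued_batches,
        "num_batch_threads": num_batch_threads,
    }
    return "".join(f"{name} {{ value: {value} }}\n" for name, value in fields.items())


if __name__ == "__main__":
    # Placeholder values only; a smaller max_batch_size caps how much input
    # reaches the GPU in one step, which is the knob most relevant to OOM.
    print(render_batching_parameters(32, 1000, 100, 4), end="")
```

Capping `max_batch_size` limits the size of each batch the server forms, which bounds per-step GPU memory at the cost of more steps for the same input.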

@singhniraj08 yes, I've put together a simple script that tries to classify some sample data. Based on the response to that request, it kills the Docker process and reloads my model when the model goes OOM.
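A minimal sketch of the kind of watchdog described above could look like this. It is not the actual script from this thread: the container name `tf-serving-taxonomy` and the sample payload are hypothetical, and the payload must match the real model's input signature. The port and model name are taken from the curl command earlier in the report.

```python
import json
import subprocess
import urllib.request

# Port and model name come from the report; the container name is hypothetical.
PREDICT_URL = "http://0.0.0.0:6007/v1/models/taxonomy:predict"
CONTAINER = "tf-serving-taxonomy"
SAMPLE_PAYLOAD = {"instances": ["some sample text"]}  # hypothetical input


def probe(url, payload, timeout=30.0, opener=urllib.request.urlopen):
    """Send one real inference request; return True only if it succeeds."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with opener(request, timeout=timeout) as response:
            status = response.status
            body = json.loads(response.read().decode("utf-8"))
    except (OSError, ValueError):
        # Covers connection errors, timeouts, HTTP error statuses, and bad JSON.
        return False
    return status == 200 and "error" not in body


def restart_container(name=CONTAINER):
    """Restart the serving container with the Docker CLI."""
    subprocess.run(["docker", "restart", name], check=True)


def check_and_restart(is_healthy, restart):
    """Call restart() if the probe fails; return True when a restart was triggered."""
    if is_healthy():
        return False
    restart()
    return True


if __name__ == "__main__":
    check_and_restart(lambda: probe(PREDICT_URL, SAMPLE_PAYLOAD), restart_container)
```

Run from cron or a systemd timer, this gives the restart-on-failure behavior without relying on the model status endpoint. Note that a slow model load, as in the log above, means the probe will keep failing until the model is serving again, so the check interval should allow for that.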

The model status API still needs a fix so that it tells the user the model is not available for requests when it goes OOM. Right now, it still returns "state": "AVAILABLE" for a model that has gone OOM.

IMHO, the batching documentation available here sucks. The last major update was in late 2018, and since then we have seen much bigger models that everyone is trying to put in production. But guess what!? There's no proper documentation on how to handle batching and large models on GPU! Could you please flag this on your side so that someone can take a look at the batching documentation and release a new version with more examples?

@spate141,

Could you please create a bug to report the above issue and the documentation request?
Thank you!