Multiple CUDA contexts and multithreading
bhargavravat opened this issue · comments
hi @SthPhoenix :
Can you give some hints on the use case below?
I want to create two different CUDA contexts. I have sufficient GPU memory to hold both contexts.
In this case, I will receive frames in a queue, and every alternate frame will be processed by a single CUDA context.
Any headway?
Hi! This is a bit out of scope for this API.
What you are looking for can be achieved by running multiple FastAPI workers: change the appropriate param in deploy_trt.sh
to the number of workers your GPU can afford. Then you can use the REST API to process data from multiple threads.
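The workers-plus-REST approach can be sketched roughly like this. The endpoint URLs, port numbers, and request payload are hypothetical placeholders; only the round-robin dispatch ("every alternate frame goes to the other worker") is the point here:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical worker endpoints -- one per FastAPI worker started by deploy_trt.sh.
WORKERS = [f"http://localhost:{18080 + i}/extract" for i in range(2)]

def pick_worker(frame_idx: int, workers: list) -> str:
    """Round-robin: alternate frames go to alternate workers (each with its own context)."""
    return workers[frame_idx % len(workers)]

def process_frame(frame_idx: int, frame) -> str:
    url = pick_worker(frame_idx, WORKERS)
    # A real client would POST the encoded frame here, e.g. with the
    # requests library; the payload shape depends on the API version.
    return url

if __name__ == "__main__":
    # Threads block on network I/O, so a thread pool keeps both workers busy.
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        targets = list(pool.map(process_frame, range(4), ["frame"] * 4))
    print(targets)
```

This keeps all CUDA state inside the worker processes, so the client never has to manage contexts itself.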
hello @SthPhoenix, I tried to create multiple threads with your TensorRT project by following this thread,
but I get a problem when I stop the thread:
PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.
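The error means every `push()` must have been matched by a `pop()` before the interpreter shuts down. The invariant can be illustrated without a GPU using a stand-in for the context stack (`FakeContext` below is purely illustrative; with PyCUDA the calls are `ctx.push()` / `ctx.pop()`):

```python
# Stand-in for PyCUDA's per-thread context stack, to illustrate the invariant
# that module cleanup requires an empty stack.
class FakeContext:
    stack = []  # plays the role of the CUDA context stack

    def push(self):
        FakeContext.stack.append(self)

    def pop(self):
        assert FakeContext.stack and FakeContext.stack[-1] is self
        FakeContext.stack.pop()

def run_inference(ctx):
    ctx.push()
    try:
        pass  # ... do_inference(...) would go here ...
    finally:
        ctx.pop()  # always pop, even on early return or exception

ctx = FakeContext()
run_inference(ctx)
# The stack is empty again, so "module cleanup" would succeed.
```

If any code path returns after `push()` without reaching `pop()`, the stack is non-empty at shutdown and PyCUDA aborts with exactly this message.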
reproduce:
trt loader
import threading

import numpy as np
import pycuda.driver as cuda
import tensorrt as trt

# TRT_LOGGER, allocate_buffers and do_inference come from the project's trt_loader helpers.


class TrtModel(object):
    def __init__(self, model):
        self.cfx = None
        self.engine_file = model
        self.engine = None
        self.inputs = None
        self.outputs = None
        self.bindings = None
        self.stream = None
        self.context = None
        self.input_shapes = None
        self.out_shapes = None
        self.max_batch_size = 1

    def build(self):
        self.cfx = cuda.Device(0).make_context()
        with open(self.engine_file, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.inputs, self.outputs, self.bindings, self.stream, self.input_shapes, \
            self.out_shapes, self.out_names, self.max_batch_size = allocate_buffers(self.engine)
        self.context = self.engine.create_execution_context()
        self.context.active_optimization_profile = 0

    def run(self, input, deflatten: bool = True, as_dict=False):
        threading.Thread.__init__(self)
        self.cfx.push()
        # lazy load implementation
        engine = self.engine
        bindings = self.bindings
        inputs = self.inputs
        outputs = self.outputs
        stream = self.stream
        context = self.context
        out_shapes = self.out_shapes
        out_names = self.out_names
        if engine is None:
            self.build()
        input = np.asarray(input)
        batch_size = input.shape[0]
        allocate_place = np.prod(input.shape)
        inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.float32)
        context.set_binding_shape(0, input.shape)
        trt_outputs = do_inference(
            context, bindings=bindings,
            inputs=inputs, outputs=outputs, stream=stream)
        # Reshape TRT outputs to original shape instead of flattened array
        if deflatten:
            trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, out_shapes)]
        if as_dict:
            return {name: trt_outputs[i] for i, name in enumerate(self.out_names)}
        # self.cfx.pop()  # NOTE: pop() is commented out, so the push() above is never balanced
        return [trt_outputs[0][:batch_size]]

    def destroy(self):
        self.cfx.pop()
        del self.cfx
        del self.engine
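A likely fix for the reproduction above is to balance the `push()` in `run()` with a `pop()` in a `finally` block, so every exit path (including the `as_dict` return and exceptions) pops the context. This is a sketch of the structure only; the CUDA calls are replaced by a dummy context so the snippet runs without a GPU:

```python
class _DummyCfx:
    """Stands in for PyCUDA's context; just counts pushes and pops."""
    def __init__(self):
        self.depth = 0

    def push(self):
        self.depth += 1

    def pop(self):
        self.depth -= 1

class SafeModel:
    def __init__(self):
        self.cfx = _DummyCfx()  # in real code: cuda.Device(0).make_context()

    def run(self, data):
        self.cfx.push()
        try:
            return data  # in real code: do_inference(...) and reshaping
        finally:
            self.cfx.pop()  # runs on every exit path, including exceptions

    def destroy(self):
        # With PyCUDA, the context created by make_context() would be
        # popped/detached exactly once here; with the dummy we just check
        # that run() left the stack balanced.
        assert self.cfx.depth == 0

model = SafeModel()
out = model.run("frame")
model.destroy()
```

With this structure the context stack is guaranteed empty by the time the module is cleaned up, which is exactly what the PyCUDA error message asks for.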
kill thread code:
det_model = thread_buckets[thread_name]['det_model']
rec_model = thread_buckets[thread_name]['rec_model']
processor_thread = thread_buckets[thread_name]['processor_thread']
processor_thread.kill()
processor_thread.join()
det_model.retina.model.rec_model.destroy()
rec_model.rec_model.destroy()
del det_model
del rec_model
del thread_buckets[thread_name]
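One possible cause of the error in this shutdown sequence: `make_context()`/`push()` happen inside the processor thread, but `destroy()` is called from the main thread after `join()`, and the CUDA context stack is per-thread. A safer pattern is to let the worker pop its own context before exiting, signalled via an event instead of `kill()`. Sketch only, with a dummy context recording which thread touched it so the snippet runs without a GPU:

```python
import threading

class _DummyCfx:
    """Records which thread pushed and popped, like PyCUDA's per-thread stack."""
    def __init__(self):
        self.pushed_in = None
        self.popped_in = None

    def push(self):
        self.pushed_in = threading.get_ident()

    def pop(self):
        self.popped_in = threading.get_ident()

def worker(cfx, stop_event):
    cfx.push()  # context becomes current on *this* thread
    try:
        while not stop_event.is_set():
            stop_event.wait(0.01)  # stand-in for the frame-processing loop
    finally:
        cfx.pop()  # popped by the same thread that pushed it

cfx = _DummyCfx()
stop = threading.Event()
t = threading.Thread(target=worker, args=(cfx, stop))
t.start()
stop.set()  # instead of processor_thread.kill(): ask the worker to exit
t.join()    # by now the worker has popped its own context
```

After `join()` returns, the main thread can safely free the Python-side objects; the CUDA cleanup has already happened on the thread that owned the context.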
Hi, @ThiagoMateo! I haven't tested my code in such a scenario; as I said before, it's a bit out of scope for this project. I might check it later, but I can't give you any guarantees right now.
You can check #18 for now; it seems to be connected to your problem, but I haven't figured out how to avoid the GPU RAM overhead.
Closing for now, since the problem isn't related to the current intended use cases.