SthPhoenix / InsightFace-REST

InsightFace REST API for easy deployment of face recognition services with TensorRT in Docker.


Multiple Cuda Context and multi threading

bhargavravat opened this issue · comments

hi @SthPhoenix :

Can you give some hints on implementing the use case below?

I want to create 2 different CUDA contexts; I have sufficient GPU memory to hold both of them.

In this case, frames will arrive in a queue, and every alternate frame will be dispatched to the other CUDA context.

Any pointers?

Hi! This is a bit out of scope of this API.
What you are looking for can be achieved by running multiple FastAPI workers: change the corresponding parameter in deploy_trt.sh to the number of workers your GPU can afford. Then you can use the REST API to process data from multiple threads.
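The suggestion above can be sketched client-side as a round-robin dispatcher that alternates frames between workers, one worker per CUDA context. This is a minimal sketch, not part of the API: the worker URLs and ports are made up, and the actual HTTP POST (e.g. with `requests`) is stubbed out so only the dispatch logic is shown.

```python
# Round-robin dispatch of frames across FastAPI workers, each worker
# assumed to own its own CUDA context. Ports here are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

WORKERS = ["http://localhost:18081", "http://localhost:18082"]  # hypothetical

def assign_workers(frames, workers=WORKERS):
    """Every alternate frame goes to the other worker."""
    rr = cycle(workers)
    return [(frame, next(rr)) for frame in frames]

def process(frame, worker_url):
    # A real client would POST the frame to the worker's REST endpoint
    # here; this stub just returns the assignment for illustration.
    return frame, worker_url

def run_pipeline(frames):
    assignments = assign_workers(frames)
    with ThreadPoolExecutor(max_workers=len(WORKERS)) as pool:
        futures = [pool.submit(process, f, w) for f, w in assignments]
        return [fut.result() for fut in futures]
```

With two workers, frames 0, 2, 4, … land on one worker and frames 1, 3, 5, … on the other, which matches the alternate-frame scheme described in the question.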

hello @SthPhoenix, I tried to create multiple threads with your TensorRT project by following this thread,
but I get a problem when I stop a thread:

PyCUDA ERROR: The context stack was not empty upon module cleanup.
-------------------------------------------------------------------
A context was still active when the context stack was being
cleaned up. At this point in our execution, CUDA may already
have been deinitialized, so there is no way we can finish
cleanly. The program will be aborted now.
Use Context.pop() to avoid this problem.

reproduce:
trt loader

class TrtModel(object):
    def __init__(self, model):
        self.cfx = None
        self.engine_file = model
        self.engine = None
        self.inputs = None
        self.outputs = None
        self.bindings = None
        self.stream = None
        self.context = None
        self.input_shapes = None
        self.out_shapes = None
        self.max_batch_size = 1

    def build(self):
        self.cfx = cuda.Device(0).make_context()
        with open(self.engine_file, 'rb') as f, trt.Runtime(TRT_LOGGER) as runtime:
            self.engine = runtime.deserialize_cuda_engine(f.read())
        self.inputs, self.outputs, self.bindings, self.stream, self.input_shapes, self.out_shapes, self.out_names, self.max_batch_size = allocate_buffers(
            self.engine)

        self.context = self.engine.create_execution_context()
        self.context.active_optimization_profile = 0

    def run(self, input, deflatten: bool = True, as_dict=False):
        threading.Thread.__init__(self)  # note: TrtModel does not subclass threading.Thread
        self.cfx.push()  # pushed here, but see the commented-out pop() below
        # lazy load implementation
        engine = self.engine
        bindings = self.bindings
        inputs = self.inputs
        outputs = self.outputs
        stream = self.stream
        context = self.context
        out_shapes = self.out_shapes
        out_names = self.out_names
        if engine is None:
            self.build()

        input = np.asarray(input)
        batch_size = input.shape[0]
        allocate_place = np.prod(input.shape)
        inputs[0].host[:allocate_place] = input.flatten(order='C').astype(np.float32)
        context.set_binding_shape(0, input.shape)
        trt_outputs = do_inference(
            context, bindings=bindings,
            inputs=inputs, outputs=outputs, stream=stream)
        #Reshape TRT outputs to original shape instead of flattened array
        if deflatten:
            trt_outputs = [output.reshape(shape) for output, shape in zip(trt_outputs, out_shapes)]
        if as_dict:
            # note: this early return skips any pop(), leaving the context pushed
            return {name: trt_outputs[i] for i, name in enumerate(self.out_names)}
        # self.cfx.pop()  # commented out, so the push() above is never balanced
        return [trt_outputs[0][:batch_size]]

    def destroy(self):
        self.cfx.pop()
        del self.cfx
        del self.engine

kill thread code:

det_model = thread_buckets[thread_name]['det_model']
rec_model = thread_buckets[thread_name]['rec_model']
processor_thread = thread_buckets[thread_name]['processor_thread']
processor_thread.kill()
processor_thread.join()
det_model.retina.model.rec_model.destroy()
rec_model.rec_model.destroy()
del det_model
del rec_model
del thread_buckets[thread_name]

Hi, @ThiagoMateo ! I haven't tested my code in such a scenario; as I said before, it's a bit out of scope of this project. I might check it later, but I can't make any guarantees right now.

You can check #18 for now; it seems related to your problem, but I haven't figured out how to avoid the GPU RAM overhead.

Closing for now, since the problem isn't related to the currently intended use cases.