gustavz / realtime_object_detection

Plug and Play Real-Time Object Detection App with Tensorflow and OpenCV

Over 20 FPS on TX2 with thread

naisy opened this issue · comments

commented

Hi, GustavZ

Nice work!
Splitting the model into a Detection part and an NMS part seems to work very well.
I changed the CPU part to run in its own thread. It now runs at over 20 FPS on the Jetson TX2.

Thank you.

Would you like to share the code modifications you made?

Nice!
Where did you find session_worker.py?
I have not worked with TensorFlow's multithreading yet.
Do you know of any manual or tutorial on how to use it?

commented

I wrote it. Here:
https://github.com/naisy/realtime_object_detection/blob/master/lib/session_worker.py
https://github.com/naisy/realtime_object_detection/blob/master/lib/__init__.py

  1. make worker.
    gpu_worker = SessionWorker("GPU",detection_graph,config)
  2. set queue.
    gpu_opts = [score_out, expand_out]
    gpu_feeds = {image_tensor: image_expanded}
    gpu_extras = image # for visualization frame
    gpu_worker.put_sess_queue(gpu_opts,gpu_feeds,gpu_extras)
  3. get result.
    g = gpu_worker.get_result_queue()
    score,expand,image = g["results"][0],g["results"][1],g["extras"]

Sorry, here is the simple usage:

# usage:
# before:
#     results = sess.run([opt1,opt2],feed_dict={input_x:x,input_y:y})
# after:
#     opts = [opt1,opt2]
#     feeds = {input_x:x,input_y:y}
#     worker = SessionWorker("TAG",graph,config)
#     worker.put_sess_queue(opts,feeds)
#     q = worker.get_result_queue()
#     if q is None:
#         continue
#     results = q['results']
#     extras = q['extras']
#
# extras: None, or the frame image data used for drawing. The GPU detection thread doesn't wait for the result, so pass the frame image in extras if you want to draw the detection boxes on it later.
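
The put/get pattern described above can be sketched without TensorFlow. This is only an illustration of the idea, not the code in session_worker.py: `QueueWorker`, `fake_run` and `wait_for_result` are hypothetical names, and `fake_run` stands in for `sess.run`.

```python
import queue
import threading
import time


class QueueWorker:
    """Hypothetical stand-in for SessionWorker: runs run_fn on a background thread."""

    def __init__(self, run_fn):
        self._run_fn = run_fn                  # plays the role of sess.run
        self._inputs = queue.Queue()
        self._results = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            item = self._inputs.get()          # blocks until work arrives
            if item is None:                   # sentinel: shut the thread down
                break
            opts, feeds, extras = item
            results = self._run_fn(opts, feeds)
            self._results.put({"results": results, "extras": extras})

    def put_sess_queue(self, opts, feeds, extras=None):
        self._inputs.put((opts, feeds, extras))

    def get_result_queue(self):
        """Return the next result dict, or None if nothing is ready yet."""
        try:
            return self._results.get_nowait()
        except queue.Empty:
            return None

    def stop(self):
        self._inputs.put(None)
        self._thread.join()


def fake_run(opts, feeds):
    # Assumption for the demo: each "op" is a key in feeds and the result is doubled.
    return [feeds[name] * 2 for name in opts]


def wait_for_result(worker, timeout=1.0, poll_interval=0.005):
    """Poll the way a main loop would, giving up after timeout seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = worker.get_result_queue()
        if result is not None:
            return result
        time.sleep(poll_interval)
    return None
```

Because `get_result_queue()` never blocks, the caller can keep grabbing frames while the worker is busy, which is why the frame has to travel along in `extras`.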

Ah nice, so this is not TensorFlow code but your own?
May I use it? I will credit and link you, of course.

But one more question: are you sure the FPS calculation is not affected by this?

commented

It is not TensorFlow code. I wrote it myself. Anyone can use it freely.

About FPS: the pipeline now runs as follows.
IMAGE(main-thread) -> GPU(thread-1) -> GPU RESULT(main-thread) -> CPU(thread-2) -> CPU RESULT(main-thread) -> VISUALIZE(main-thread) -> FPS UPDATE(main-thread)
If CPU RESULT has not been set yet, FPS is not updated.

                    c = cpu_worker.get_result_queue()
                    if c is None:
                        cpu_counter += 1
                        '''
                        cpu thread has no output queue. ok, nothing to do. continue
                        '''
                        time.sleep(0.005)
                        continue
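
A minimal, TensorFlow-free sketch of that rule, assuming a hypothetical `FpsCounter` and a `poll_result` callable that returns None when no CPU result is ready. A frame is counted only when a finished result comes back, so idle polls do not inflate the FPS.

```python
import time


class FpsCounter:
    """Hypothetical counter: frames are counted only via update()."""

    def __init__(self):
        self.frames = 0
        self.start = time.time()

    def update(self):
        self.frames += 1

    def fps(self):
        elapsed = time.time() - self.start
        return self.frames / elapsed if elapsed > 0 else 0.0


def main_loop(poll_result, fps, max_polls):
    """Poll for results; count a frame only when one is actually available."""
    misses = 0
    for _ in range(max_polls):
        c = poll_result()
        if c is None:
            misses += 1          # nothing ready: no visualization, no FPS update
            time.sleep(0.005)
            continue
        fps.update()             # a finished frame reached the main thread
    return misses
```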

Thanks for the explanation :)
I have read a lot that multithreading in Python is not such a good idea because of the global interpreter lock (GIL).
Is this a problem here? Or is it not, because the threads are I/O bound?

Are you familiar with multiprocessing? Could that be even faster here?

commented

OK, let's check the GIL slowdown.

single thread code:

import time
def count(n):
    while n > 0:
        n -= 1
 
if __name__ == '__main__':
    start_time,start_clock=time.time(),time.process_time()
    count(100000000)
    count(100000000)
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("Single-Thread time:{:.8}, clock:{:.8}".format(end_time,end_clock))

multi thread code:

import time
from threading import Thread
 
def count(n):
    while n > 0:
        n -= 1
 
if __name__ == '__main__':
    start_time,start_clock=time.time(),time.process_time()
    t1 = Thread(target=count, args=(100000000,))
    t1.start()
    t2 = Thread(target=count, args=(100000000,))
    t2.start()
    t1.join()
    t2.join()
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("Multi-Thread time:{:.8}, clock:{:.8}".format(end_time,end_clock))

result on JetsonTX2:

Single-Thread time:23.624587, clock:23.624219
Multi-Thread time:122.06561, clock:123.2893

Normally, multithreading CPU-bound code like this is much slower in Python.

How about TF?
let's check.

tf single thread code:

import tensorflow as tf
import time

class Variable():
    def __init__(self):
        self.x = tf.Variable(1.0,dtype=tf.float32,name="variable_x")
        self.y = tf.Variable(1.0,dtype=tf.float32,name="variable_y")

def addOp(tag,variable):
    add_op = tf.add(variable.x,variable.y,name=tag+"_add_op")
    return add_op

v = Variable()

with tf.device('/gpu:0'):
    tag = "gpu"
    gpu = addOp(tag,v)

with tf.device('/cpu:0'):
    tag = "cpu"
    cpu = addOp(tag,v)

def work(sess,op,n):
    while n > 0:
        _ = sess.run(op)
        n -= 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    start_time,start_clock = time.time(),time.process_time()
    work(sess,gpu,100000)
    work(sess,cpu,100000)
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("TF Single-Thread(GPU/CPU) time:{:.8}, clock:{:.8}".format(end_time,end_clock))

tf multi thread code:

import time
import tensorflow as tf
from threading import Thread

class Variable():
    def __init__(self):
        self.x = tf.Variable(1.0,dtype=tf.float32,name="variable_x")
        self.y = tf.Variable(1.0,dtype=tf.float32,name="variable_y")

def addOp(tag,variable):
    add_op = tf.add(variable.x,variable.y,name=tag+"_add_op")
    return add_op

v = Variable()

with tf.device('/gpu:0'):
    tag = "gpu"
    gpu = addOp(tag,v)

with tf.device('/cpu:0'):
    tag = "cpu"
    cpu = addOp(tag,v)

def work(sess,op,n):
    while n > 0:
        _ = sess.run(op)
        n -= 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    start_time,start_clock = time.time(),time.process_time()
    t1 = Thread(target=work, args=(sess,gpu,100000,))
    t1.start()
    t2 = Thread(target=work, args=(sess,cpu,100000,))
    t2.start()
    t1.join()
    t2.join()
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("TF Multi-Thread(GPU/CPU) time:{:.8}, clock:{:.8}".format(end_time,end_clock))

in TF, multi thread is faster than single thread.

TF Single-Thread(GPU/CPU) time:79.972953, clock:173.22859
TF Multi-Thread(GPU/CPU) time:49.060713, clock:151.49888

I also checked with both threads on the same device.

TF Multi-Thread(CPU/CPU) time:49.992958, clock:156.02945
TF Multi-Thread(GPU/GPU) time:42.76184, clock:140.32868

This is also faster.

In TF, it seems that there is no need to worry about the bottleneck of multithreading in python.
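
A likely reason, stated here as an assumption rather than something measured in this thread, is that a blocking call which releases the GIL lets the other thread run in the meantime. The sketch below uses `time.sleep` (which does release the GIL) as a stand-in for `sess.run`; the function names are made up for the illustration.

```python
import time
from threading import Thread


def blocking_work(n, delay):
    # time.sleep releases the GIL, standing in for a call that spends its time in native code.
    for _ in range(n):
        time.sleep(delay)


def run_sequential(n, delay):
    start = time.perf_counter()
    blocking_work(n, delay)
    blocking_work(n, delay)
    return time.perf_counter() - start


def run_threaded(n, delay):
    start = time.perf_counter()
    threads = [Thread(target=blocking_work, args=(n, delay)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == '__main__':
    print("sequential: {:.3f}s".format(run_sequential(10, 0.02)))
    print("threaded:   {:.3f}s".format(run_threaded(10, 0.02)))
```

Unlike the pure-Python `count()` loop, the two threads here overlap almost completely, so the threaded run should take roughly half the sequential time.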

Really interesting test and explanation, thank you!

Other Questions:

  • Have you already used TF's graph transform tool to decrease the network's size and speed it up on mobile devices like the Jetson (https://www.tensorflow.org/mobile/prepare_models)?
  • I saw you are using TensorRT. Did you manage to optimize ssd_mobilenet with it?
commented

Multiprocessing cannot share objects directly between processes, so it is necessary to use file I/O or something similar.
However, sess.run() takes much longer than the GIL overhead, so I don't think there is much benefit in using multiprocessing.

Sorry, I am not familiar with TensorRT or model tuning.
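
To illustrate the sharing point: Python's multiprocessing queues and pipes pickle an object on send and unpickle it on receive, so the other process gets a copy, not the same object. The sketch below only simulates that round trip with `pickle` in a single process; `send_across_process_boundary` is a hypothetical helper, not a multiprocessing API.

```python
import pickle


def send_across_process_boundary(obj):
    # Simulates what a multiprocessing Queue or Pipe does: serialize, then rebuild a copy.
    return pickle.loads(pickle.dumps(obj))


frame = {"id": 0, "pixels": [1, 2, 3]}
received = send_across_process_boundary(frame)

received["pixels"].append(4)     # changing the copy...
print(frame["pixels"])           # ...leaves the sender's object untouched
```

For large camera frames that copy is paid on every hand-off, which is one more cost to weigh against any gain from avoiding the GIL.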

@naisy maybe you have heard of or even used Mask R-CNN.
My plan is to do a Mask SSD implementation, so that the SSD outputs not only a bounding box per class but also a segmentation mask.
Would you be interested in joining?

commented

I think that it is very interesting.
I have seen TensorFlow's Mask R-CNN, but I have not used it yet.
Since SSD seems to be faster than R-CNN, I am excited about Mask SSD.
I would like to use it if it works with Jetson TX2.

commented

I found my mistake in download_model().
Please change it to 'frozen_inference_graph.pb'.

You need to make sure that you set the right paths in the config.yaml:

model_name: 'ssd_mobilenet_v11_coco'
model_path: 'models/ssd_mobilenet_v11_coco/frozen_inference_graph.pb'
label_path: 'object_detection/data/mscoco_label_map.pbtxt'
num_classes: 90

If your frozen graph has a different name, you can easily modify those lines.

commented

I see. Thank you!

Do I need to change the download model name?
model_name: 'ssd_mobilenet_v1_coco_2017_11_17' # download model name

The download URL returns "HTTP Error 403: Forbidden" from my network.
http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v11_coco.tar.gz

@naisy I think you are right--the model zoo shows the link as http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2017_11_17.tar.gz
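
Going by the two URLs quoted in this thread, the download link appears to be just the model name appended to a fixed base. A small sketch of that pattern, with `model_zoo_url` as a hypothetical helper (the repo's download_model() may build the URL differently):

```python
MODEL_ZOO_BASE = "http://download.tensorflow.org/models/object_detection/"


def model_zoo_url(model_name):
    # Assumes the "<base><model_name>.tar.gz" pattern seen in the links above.
    return MODEL_ZOO_BASE + model_name + ".tar.gz"


print(model_zoo_url("ssd_mobilenet_v1_coco_2017_11_17"))
```

A name that does not exist in the model zoo still produces a well-formed URL, so a wrong model_name only shows up as an HTTP error at download time.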

I have a quick question about the thread implementation--does this work for a self-trained version of ssd-mobilenet instead of the frozen checkpoint based on COCO? From what I can tell, they've altered the implementation of that model a bit in the repo, and they've also added a new, even lighter embedded ssd mobilenet.

Also @gustavz I am quite interested in implementing the mask. Were you able to get Mask-RCNN running on the TX2? I am also considering implementing a mask based on SSD, but from my limited understanding, doing instance segmentation is pretty difficult without region proposals like in RCNN. I would be interested in working with you two on figuring this out.

commented

@Kowasaki Thank you for your information.

The multi-threading implementation does not depend on the model implementation.
It is effective when the processing can be separated, like the Detection part and the NMS part.

Mask-RCNN works on the CPU on the Jetson TX2. Add the following line to the code:
os.environ['CUDA_VISIBLE_DEVICES'] = ''
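
A short sketch of where that line would typically go. The placement is an assumption about a typical script, not taken from the repo: the variable has to be set before TensorFlow initializes CUDA, so the safest spot is before the import.

```python
import os

# Hide all CUDA devices from this process so TensorFlow falls back to the CPU.
# Set this before TensorFlow is imported (or at least before any session is created).
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# import tensorflow as tf   # import only after the variable is set
print(repr(os.environ['CUDA_VISIBLE_DEVICES']))
```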

@naisy, @Kowasaki "ssd_mobilenet_v1_coco_2017_11_17" is the original model from the model zoo.
ssd_mobilenet_v11_coco is my own model, which I modified and re-exported based on the original. It is not available in the model zoo, only in the model folder of my repo, so of course trying to download it will fail. The automated model download only works for the model zoo.
I think I wrote this in the readme.

And yes, you are right: the multithreading also works for self-trained models, but they must be based on ssd_mobilenet. Splitting and threading R-CNN will not work with this code.

About mask-ssd: I talked to a guy who made a first attempt at combining psp-net with ssd-net to predict segmentation masks in parallel with bounding boxes.

@gustavz Thank you for the implementation.
Is it possible to do the same in C++? I am working on real-time object detection with TensorFlow C++.

@SANTHAKUMAR91 please open another issue for C++-related questions, as this is a different topic.
But to give a brief answer: I have no idea, as I have never worked with TensorFlow in C++.

@naisy @gustavz Thanks for the heads-up! I'll create another issue for my mask-related question.

Closing this issue now