Over 20 FPS on TX2 with thread

naisy opened this issue 6 years ago

Hi, GustavZ

Nice work!
Splitting the model into a detection part and an NMS part seems like a very good idea.
I changed the CPU part to run in a thread. It now runs at over 20 FPS on Jetson TX2.

Thank you.

Gustav von Zitzewitz · Answer 1 · Sun Mar 04 2018 05:29:10 GMT+0800 (China Standard Time)

would you like to share the code modifications you did?

naisy · Answer 2 · Mon Mar 05 2018 14:44:17 GMT+0800 (China Standard Time)

of course. here
https://github.com/naisy/realtime_object_detection

Gustav von Zitzewitz · Answer 3 · Mon Mar 05 2018 18:34:29 GMT+0800 (China Standard Time)

nice!
where did you find the session_worker.py ?
I have not worked with TensorFlow's multithreading yet.
Do you know any manual or tutorial on how to use it?

naisy · Answer 4 · Mon Mar 05 2018 18:48:03 GMT+0800 (China Standard Time)

i wrote it. here.
https://github.com/naisy/realtime_object_detection/blob/master/lib/session_worker.py
https://github.com/naisy/realtime_object_detection/blob/master/lib/__init__.py

Make the worker:

gpu_worker = SessionWorker("GPU",detection_graph,config)

Put work on the queue:

gpu_opts = [score_out, expand_out]
gpu_feeds = {image_tensor: image_expanded}
gpu_extras = image # for visualization frame
gpu_worker.put_sess_queue(gpu_opts,gpu_feeds,gpu_extras)

Get the result:

g = gpu_worker.get_result_queue()
score,expand,image = g["results"][0],g["results"][1],g["extras"]

Sorry, here is the simple usage:

# usage:
# before:
#     results = sess.run([opt1,opt2],feed_dict={input_x:x,input_y:y})
# after:
#     opts = [opt1,opt2]
#     feeds = {input_x:x,input_y:y}
#     worker = SessionWorker("TAG",graph,config)
#     worker.put_sess_queue(opts,feeds)
#     q = worker.get_result_queue()
#     if q is None:
#         continue
#     results = q['results']
#     extras = q['extras']
#
# extras: None, or the frame image data for drawing. The GPU detection thread doesn't wait for the result, so keep the frame image data in extras if you want to draw the detection boxes on the image.
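
For readers who want to see the put/get pattern in isolation, here is a minimal sketch using only the Python standard library. It is not the SessionWorker from the linked repository: `QueueWorker`, `put`, `get_result`, `stop` and `demo` are hypothetical names, and a plain function stands in for `sess.run`.

```python
import queue
import threading


class QueueWorker:
    """Hypothetical stand-in for a session worker: runs `fn` on queued inputs in a thread."""

    def __init__(self, fn):
        self._fn = fn
        self._inputs = queue.Queue()
        self._results = queue.Queue()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self):
        while True:
            item = self._inputs.get()
            if item is None:  # sentinel: stop the worker
                break
            feeds, extras = item
            # In the real use case this call would be sess.run(opts, feed_dict=feeds).
            self._results.put({"results": self._fn(feeds), "extras": extras})

    def put(self, feeds, extras=None):
        self._inputs.put((feeds, extras))

    def get_result(self):
        """Non-blocking: return a result dict, or None if nothing is ready yet."""
        try:
            return self._results.get_nowait()
        except queue.Empty:
            return None

    def stop(self):
        self._inputs.put(None)
        self._thread.join()


def demo():
    worker = QueueWorker(lambda x: [x * 2, x + 1])
    worker.put(10, extras="frame-0")
    worker.stop()  # returns once the queued item has been processed
    return worker.get_result()
```

The `extras` value travels with the input so that the frame that produced a result is still at hand when the result comes back later.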

Gustav von Zitzewitz · Answer 5 · Mon Mar 05 2018 19:18:10 GMT+0800 (China Standard Time)

Ah nice, so this is not TensorFlow code but your own?
May I use it? I will credit and link you, of course.

But one more question: are you sure the fps calculation is not affected by this?

naisy · Answer 6 · Tue Mar 06 2018 09:18:32 GMT+0800 (China Standard Time)

not tensorflow code. I wrote it myself. Anyone can use it freely.

About FPS, the pipeline is now as follows:
IMAGE(main-thread) -> GPU(thread-1) -> GPU RESULT(main-thread) -> CPU(thread-2) -> CPU RESULT(main-thread) -> VISUALIZE(main-thread) -> FPS UPDATE(main-thread)
If CPU RESULT has not been set yet, FPS will not be updated.

                    c = cpu_worker.get_result_queue()
                    if c is None:
                        cpu_counter += 1
                        '''
                        cpu thread has no output queue. ok, nothing to do. continue
                        '''
                        time.sleep(0.005)
                        continue
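
The effect on the FPS number can be sketched without any TensorFlow code. In the hypothetical helpers below (`count_frames` and `fps` are illustration names, not from the repository), an empty poll is skipped and only a real result counts as a finished frame, so the FPS reflects completed detections rather than loop iterations.

```python
import time


def count_frames(poll_results, sleep=0.0):
    """Walk over poll outcomes; only a non-None result counts as a finished frame."""
    frames = 0
    misses = 0
    for result in poll_results:
        if result is None:
            misses += 1  # same role as cpu_counter in the snippet above
            time.sleep(sleep)
            continue
        frames += 1  # visualization and the FPS update would happen here
    return frames, misses


def fps(frames, elapsed_seconds):
    """Frames per second over a measured interval."""
    return frames / elapsed_seconds if elapsed_seconds > 0 else 0.0
```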

Gustav von Zitzewitz · Answer 7 · Tue Mar 06 2018 16:38:13 GMT+0800 (China Standard Time)

thanks for the explanation :)
I read a lot that using multi threading in python is not such a good idea because of the global interpreter lock.
Is this a problem here? Or is it not because the Threads are IO bound?

Are you familiar with using multiprocessing? Could this be even faster here?

naisy · Answer 8 · Tue Mar 06 2018 18:51:40 GMT+0800 (China Standard Time)

OK, let's check the GIL slowdown.

single thread code:

import time
def count(n):
    while n > 0:
        n -= 1
 
if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; time.process_time() reports CPU time
    start_time,start_clock=time.time(),time.process_time()
    count(100000000)
    count(100000000)
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("Single-Thread time:{:.8}, clock:{:.8}".format(end_time,end_clock))

multi thread code:

import time
from threading import Thread
 
def count(n):
    while n > 0:
        n -= 1
 
if __name__ == '__main__':
    # time.clock() was removed in Python 3.8; time.process_time() reports CPU time
    start_time,start_clock=time.time(),time.process_time()
    t1 = Thread(target=count, args=(100000000,))
    t1.start()
    t2 = Thread(target=count, args=(100000000,))
    t2.start()
    t1.join()
    t2.join()
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("Multi-Thread time:{:.8}, clock:{:.8}".format(end_time,end_clock))

result on JetsonTX2:

Single-Thread time:23.624587, clock:23.624219
Multi-Thread time:122.06561, clock:123.2893

Normally, multithreading is much slower in Python for CPU-bound work like this.

How about TF?
let's check.

tf single thread code:

import tensorflow as tf
import time

class Variable():
    def __init__(self):
        self.x = tf.Variable(1.0,dtype=tf.float32,name="variable_x")
        self.y = tf.Variable(1.0,dtype=tf.float32,name="variable_y")

def addOp(tag,variable):
    add_op = tf.add(variable.x,variable.y,name=tag+"_add_op")
    return add_op

v = Variable()

with tf.device('/gpu:0'):
    tag = "gpu"
    gpu = addOp(tag,v)

with tf.device('/cpu:0'):
    tag = "cpu"
    cpu = addOp(tag,v)

def work(sess,op,n):
    while n > 0:
        _ = sess.run(op)
        n -= 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    start_time,start_clock = time.time(),time.process_time()
    work(sess,gpu,100000)
    work(sess,cpu,100000)
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("TF Single-Thread(GPU/CPU) time:{:.8}, clock:{:.8}".format(end_time,end_clock))

tf multi thread code:

import time
import tensorflow as tf
from threading import Thread

class Variable():
    def __init__(self):
        self.x = tf.Variable(1.0,dtype=tf.float32,name="variable_x")
        self.y = tf.Variable(1.0,dtype=tf.float32,name="variable_y")

def addOp(tag,variable):
    add_op = tf.add(variable.x,variable.y,name=tag+"_add_op")
    return add_op

v = Variable()

with tf.device('/gpu:0'):
    tag = "gpu"
    gpu = addOp(tag,v)

with tf.device('/cpu:0'):
    tag = "cpu"
    cpu = addOp(tag,v)

def work(sess,op,n):
    while n > 0:
        _ = sess.run(op)
        n -= 1

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())

    start_time,start_clock = time.time(),time.process_time()
    t1 = Thread(target=work, args=(sess,gpu,100000,))
    t1.start()
    t2 = Thread(target=work, args=(sess,cpu,100000,))
    t2.start()
    t1.join()
    t2.join()
    end_time,end_clock=time.time()-start_time,time.process_time()-start_clock
    print("TF Multi-Thread(GPU/CPU) time:{:.8}, clock:{:.8}".format(end_time,end_clock))

In TF, the multi-threaded version is faster than the single-threaded one.

TF Single-Thread(GPU/CPU) time:79.972953, clock:173.22859
TF Multi-Thread(GPU/CPU) time:49.060713, clock:151.49888

I also checked with both threads on the same device.

TF Multi-Thread(CPU/CPU) time:49.992958, clock:156.02945
TF Multi-Thread(GPU/GPU) time:42.76184, clock:140.32868

This is also faster.

In TF, it seems that there is no need to worry about the bottleneck of multithreading in python.
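
A likely explanation, stated here as an assumption rather than something measured in this thread, is that `sess.run` spends its time in native code with the GIL released, so other Python threads can proceed. The sketch below shows the same effect with `time.sleep`, which also releases the GIL while it waits. `blocking_call`, `run_sequential` and `run_threaded` are hypothetical names for illustration.

```python
import time
from threading import Thread


def blocking_call(seconds):
    # time.sleep releases the GIL while waiting; it stands in for sess.run here.
    time.sleep(seconds)


def run_sequential(n, seconds):
    """Run n blocking calls one after another and return the wall time."""
    start = time.perf_counter()
    for _ in range(n):
        blocking_call(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    """Run n blocking calls in parallel threads and return the wall time."""
    start = time.perf_counter()
    threads = [Thread(target=blocking_call, args=(seconds,)) for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start
```

With calls that release the GIL, the threaded version should take roughly as long as one call, unlike the pure-Python `count` loop above, which holds the GIL the whole time.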

Gustav von Zitzewitz · Answer 9 · Tue Mar 06 2018 22:12:02 GMT+0800 (China Standard Time)

Really interesting test and explanation, thank you!

Other Questions:

Did you already use TF's graph transform tool to decrease the network's size to speed it up on mobile devices like the Jetson (https://www.tensorflow.org/mobile/prepare_models)?
I saw you are using TensorRT, did you manage to optimize ssd_mobilenet with it?

naisy · Answer 10 · Wed Mar 07 2018 09:29:02 GMT+0800 (China Standard Time)

Multiprocessing cannot share objects directly between processes, so it is necessary to pass data through file I/O or a similar mechanism.
However, sess.run() takes much longer than the GIL overhead, so I think there is no big merit in using multiprocessing.
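
The difference in sharing can be sketched as follows. A thread works on the very same Python object, while data sent to another process (for example through `multiprocessing.Queue`) is pickled and arrives as a copy. The example uses `pickle` directly to imitate that round trip instead of starting a real process; `frame` and `mark_processed` are hypothetical names.

```python
import pickle
import threading

frame = {"id": 1, "pixels": bytearray(16)}


def mark_processed(obj):
    # Runs in a thread: mutates the caller's object in place.
    obj["id"] = 2


t = threading.Thread(target=mark_processed, args=(frame,))
t.start()
t.join()

# Imitate what crossing a process boundary does: serialize, then deserialize.
copy_in_child = pickle.loads(pickle.dumps(frame))
copy_in_child["id"] = 3  # changes only the copy, not the original frame
```

For large image frames this copying adds cost per frame, which is one reason threads can be the simpler choice here.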

Sorry, I am not familiar with TensorRT and model tuning.

Gustav von Zitzewitz · Answer 11 · Thu Mar 08 2018 21:06:34 GMT+0800 (China Standard Time)

@naisy maybe you have heard of or even used Mask R-CNN.
My plan is to do a Mask SSD implementation, so that the SSD outputs not only a bounding box per class but also a segmentation mask.
Would you be interested in joining?

naisy · Answer 12 · Fri Mar 09 2018 13:49:28 GMT+0800 (China Standard Time)

I think that it is very interesting.
I have seen TensorFlow's Mask R-CNN, but I have not used it yet.
Since SSD seems to be faster than R-CNN, I am excited about Mask SSD.
I would like to use it if it works with Jetson TX2.

naisy · Answer 13 · Fri Mar 09 2018 17:39:15 GMT+0800 (China Standard Time)

I found my mistake in download_model().
please fix to 'frozen_inference_graph.pb'.

Gustav von Zitzewitz · Answer 14 · Fri Mar 09 2018 17:48:19 GMT+0800 (China Standard Time)

you need to make sure that you set the right paths in the config.yaml

model_name: 'ssd_mobilenet_v11_coco'
model_path: 'models/ssd_mobilenet_v11_coco/frozen_inference_graph.pb'
label_path: 'object_detection/data/mscoco_label_map.pbtxt'
num_classes: 90

if your frozen graph is named differently, you can easily modify those lines

naisy · Answer 15 · Fri Mar 09 2018 18:48:01 GMT+0800 (China Standard Time)

I see. Thank you!

Do I need to change the download model name?
model_name: 'ssd_mobilenet_v1_coco_2017_11_17' # download model name

The download URL returns "HTTP Error 403: Forbidden" from my network.
http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v11_coco.tar.gz

Wei-Yin Ko · Answer 16 · Fri Mar 09 2018 23:48:44 GMT+0800 (China Standard Time)

@naisy I think you are right--the link in the model zoo shows the link as http://download.tensorflow.org/models/object_detection/ssd_mobilenet_v1_coco_2017_11_17.tar.gz

I have a quick question about the thread implementation--does this work for a self-trained version of ssd-mobilenet instead of the frozen checkpoint based off of COCO? From what I can tell they've altered the implementation of that model a bit in the repo and they've also added a new, even lighter embedded ssd mobilenet.

Also @gustavz I am quite interested in implementing the mask. Were you able to get Mask-RCNN running on the TX2? I am also considering implementing a mask based off of SSD but from my limited understanding doing instance segmentation is pretty difficult without region proposals like in RCNN. I would be interested in working with you two on figuring this out.

naisy · Answer 17 · Mon Mar 12 2018 09:47:31 GMT+0800 (China Standard Time)

@Kowasaki Thank you for your information.

The implementation of multi-threading is not dependent on the model implementation.
It is effective for separate processing like Detection part and NMS part.

Mask-RCNN works with CPU on Jetson TX2. Add the next line to the code.
os.environ['CUDA_VISIBLE_DEVICES'] = ''
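
Placement matters for this line. As a minimal sketch, assuming the usual behavior that CUDA reads the variable when the GPU is first initialized, it should be set before TensorFlow is imported:

```python
import os

# Hide all CUDA devices from this process so TensorFlow falls back to the CPU.
# Set this before TensorFlow is imported (or at least before it touches the GPU).
os.environ['CUDA_VISIBLE_DEVICES'] = ''

# import tensorflow as tf  # import only after the variable is set
```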

Gustav von Zitzewitz · Answer 18 · Mon Mar 12 2018 16:36:58 GMT+0800 (China Standard Time)

@naisy , @Kowasaki "ssd_mobilenet_v1_coco_2017_11_17" is the original model from the modelzoo.
ssd_mobilenet_v11_coco is my own model, which I modified and re-exported based on the original. It is not available in the model zoo, only in the model folder of my repo, so of course trying to download it will fail. The automated model download only works for the model zoo.
I think i wrote this in the readme.
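
To make the failure concrete, here is a small sketch of how such a download URL is formed from a model name, using the base URL quoted earlier in this thread. `MODEL_ZOO_BASE` and `model_zoo_url` are hypothetical names, not the repository's actual download code.

```python
MODEL_ZOO_BASE = "http://download.tensorflow.org/models/object_detection/"


def model_zoo_url(model_name):
    """Build the tarball URL for a model zoo model name."""
    return MODEL_ZOO_BASE + model_name + ".tar.gz"


# A model zoo name yields a URL that exists on the server.
zoo_url = model_zoo_url("ssd_mobilenet_v1_coco_2017_11_17")

# A custom name yields a well-formed URL that the server does not host,
# which matches the 403 error reported above.
custom_url = model_zoo_url("ssd_mobilenet_v11_coco")
```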

And yes, you are right: the multithreading also works for self-trained models, but they must be based on ssd_mobilenet. Splitting and threading R-CNN will not work with this code.

About mask-ssd, i talked to a guy who did a first try of combining psp-net with ssd-net to be able to predict segmentation masks parallel to bounding boxes.

Deleted user · Answer 19 · Mon Mar 12 2018 20:57:12 GMT+0800 (China Standard Time)

@gustavz Thank you for implementation.
Is it possible to do the same in c++? i am working on real time object detection with tensorflow c++.

Gustav von Zitzewitz · Answer 20 · Mon Mar 12 2018 21:28:32 GMT+0800 (China Standard Time)

@SANTHAKUMAR91 please open another issue for c++ related questions as this is another topic.
But to give a brief answer: I have no idea as i have never worked with tensorflow in connection with c++.

Wei-Yin Ko · Answer 21 · Mon Mar 12 2018 23:04:43 GMT+0800 (China Standard Time)

@naisy @gustavz Thanks for the heads-up! I'll create another issue for my mask-related question.

Deleted user · Answer 22 · Tue Mar 13 2018 12:51:02 GMT+0800 (China Standard Time)

Sure, thanks.

Gustav von Zitzewitz · Answer 23 · Wed Mar 14 2018 17:14:33 GMT+0800 (China Standard Time)

Closing this issue now