wenjie710 / PivotNet

Source code of PivotNet (ICCV2023, PivotNet: Vectorized Pivot Learning for End-to-end HD Map Construction)

Very slow training speed

gaosanyuan opened this issue · comments

I found that the training speed is very slow when the pivot-related logic is used.
I guess the main reason is the dynamic-programming logic in the match cost and map loss, which runs on the CPU.
I would like to know whether it can be implemented on the GPU.
Thanks.

  1. Although pivot dynamic matching can be seamlessly implemented on a GPU, the minimal matrix multiplication involved suggests that migrating this part to a GPU may not yield significant time savings.
  2. To improve training time efficiency, we can adapt the current implementation by either increasing the batch size or decreasing the input image size.
  3. Another potential area for improvement lies in the assignment process. Currently, we perform instance-level assignment followed by point-level assignment. However, given that point-level assignment is inherent in instance-level assignment, obtaining the point-level assignment results concurrently with the instance-level assignment could enhance time efficiency (see the sketch after this list).
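As a rough illustration of the third point, the per-pair point-level results produced while filling the instance-level cost matrix could simply be cached and reused after the Hungarian step, instead of re-running the sequence DP. A minimal sketch, assuming a hypothetical `sequence_match(pred, gt)` that returns both the pair cost and the point indices (not the repository's actual API):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def assign_with_cached_points(preds, gts, sequence_match):
    """Instance-level assignment that reuses the point-level DP results.

    preds / gts are lists of point arrays; sequence_match(pred, gt) is
    assumed to return (cost, point_indices) from the per-pair sequence DP.
    Both names are illustrative, not the repository's actual API.
    """
    m, n = len(preds), len(gts)
    cost = np.zeros((m, n))
    cached = {}
    for i, pred in enumerate(preds):
        for j, gt in enumerate(gts):
            c, idx = sequence_match(pred, gt)
            cost[i, j] = c
            cached[(i, j)] = idx  # point-level result, computed anyway

    rows, cols = linear_sum_assignment(cost)  # instance-level (Hungarian)
    # The point-level assignment now comes for free from the cache.
    return [(i, j, cached[(i, j)]) for i, j in zip(rows, cols)]
```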

Thanks @wenjie710

But

  1. When you train the model with a hybrid matching strategy and shift the polygons many times, you can see that pivot dynamic matching really is a bottleneck.
  2. I think we should always change only one group of variables when comparing two experiments.
  3. When computing the match cost, the time complexity is O(m x n), where m is the number of queries and n is the number of GTs multiplied by the number of shifts. When computing the loss, the time complexity is only O(#GTs). We can see that #GTs << #queries x #shifts x #GTs (a toy count follows this list), so although the point-level assignment could be obtained during the instance-level assignment, it may not be necessary.
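To make the asymmetry concrete, here is a toy count with made-up sizes (the real numbers depend on the configuration):

```python
# Illustrative only: how many times the sequence DP runs in each phase.
num_queries, num_gts, num_shifts = 100, 20, 5

dp_calls_match_cost = num_queries * num_gts * num_shifts  # every (query, GT, shift) pair
dp_calls_loss = num_gts                                   # only the matched pairs

print(dp_calls_match_cost, dp_calls_loss)  # 10000 vs. 20
```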

I think I may have misunderstood you before, so I reopened this issue.

  1. "I think we should always change only one group of variables when comparing two experiments" Could you clarify which two experiments you are referring to? Are they detailed in the paper or the accompanying code?

  2. Could you elaborate on the concept of "shifting polygon many times"? Are you referring to the approach employed by MapTR, or is it a different concept?

  3. The complexity of the sequence matching cost is O(NT), where N is the max number of points in an instance and T is the max length of the ground-truth sequences, which is independent of the number of GT and DT instances (see the code). Therefore, obtaining the point-level assignment results concurrently with the instance-level assignment can enhance time efficiency. A sketch of such a DP follows this list.

  4. I think implementing the matching part as a C++/CUDA extension would make it faster, if necessary.
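For concreteness, here is a minimal sketch of an O(NT) monotone-alignment DP of the kind described in point 3. It is written under my own simplifying assumptions (function name, L2 distance, T <= N, free endpoints), not the exact formulation in the repository:

```python
import numpy as np

def pivot_dynamic_matching(pred, gt):
    """O(N*T) monotone alignment of T GT pivots to N predicted points.

    pred: (N, 2) predicted points; gt: (T, 2) GT pivots, assuming T <= N.
    Returns (total_cost, indices), where indices[j] is the predicted point
    matched to gt[j]. Illustrative sketch only.
    """
    N, T = len(pred), len(gt)
    dist = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, T)

    INF = float("inf")
    D = np.full((N, T), INF)  # D[i, j]: best cost with pred[i] matched to gt[j]
    parent = np.full((N, T), -1, dtype=int)
    D[:, 0] = dist[:, 0]      # gt[0] may match any predicted point

    for j in range(1, T):
        best_val, best_k = INF, -1
        for i in range(j, N):  # at least j earlier points are needed for gt[:j]
            # The running minimum gives min over k < i of D[k, j-1] in O(1)
            # per step, keeping the whole DP at O(N*T) rather than O(N^2*T).
            if D[i - 1, j - 1] < best_val:
                best_val, best_k = D[i - 1, j - 1], i - 1
            D[i, j] = dist[i, j] + best_val
            parent[i, j] = best_k

    # Backtrack from the best endpoint of the last pivot.
    i = int(np.argmin(D[:, T - 1]))
    total_cost = float(D[i, T - 1])
    indices = [i]
    for j in range(T - 1, 0, -1):
        i = int(parent[i, j])
        indices.append(i)
    indices.reverse()
    return total_cost, indices
```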

Please let me know if you have any further concerns.


Thanks for your reply.
I believe there are many ways to improve the training speed.
But training speed becomes a problem when several strategies are used together, such as shifting the polygons many times (as mentioned in MapTR) and hybrid matching. In that setting, the pivot matching process is very slow on the CPU (based on my experiments) when the number of queries is large.
So I think it is necessary to have a CUDA version of pivot dynamic matching that computes both the matching score and the matching indices.
Thanks.
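As a possible middle ground before writing a C++/CUDA extension, the pairwise DP can be batched in plain PyTorch so that only the short loop over DP steps remains in Python and all (query, GT, shift) pairs are processed together on the GPU. A rough sketch under my own assumptions about shapes and names (not the repository's actual interface):

```python
import torch

def batched_match_cost(pred, gt):
    """Run the O(N*T) monotone-alignment DP for B (query, GT) pairs at once.

    pred: (B, N, 2) predicted points; gt: (B, T, 2) GT pivots, with T <= N.
    Returns a (B,) tensor of matching costs. Shapes and names are my own
    assumptions, not the repository's interface.
    """
    B, N, _ = pred.shape
    T = gt.size(1)
    dist = torch.cdist(pred, gt)  # (B, N, T) pairwise L2 distances

    INF = torch.finfo(dist.dtype).max / 4  # large sentinel, safe to add to
    D = torch.full_like(dist, INF)
    D[:, :, 0] = dist[:, :, 0]
    for j in range(1, T):  # only this short loop over pivots stays in Python
        # Prefix minimum over the point axis implements min_{k < i} D[k, j-1]
        # for all i (and all B pairs) simultaneously, on the GPU.
        prev = torch.cummin(D[:, :, j - 1], dim=1).values
        D[:, 1:, j] = dist[:, 1:, j] + prev[:, :-1]
    # Matching indices for the assigned pairs can be recovered afterwards
    # with an argmin-based backtracking pass over D.
    return D[:, :, T - 1].min(dim=1).values
```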