tsunghan-wu / RandLA-Net-pytorch

:four_leaf_clover: Pytorch Implementation of RandLA-Net (https://arxiv.org/abs/1911.11236)

About the test code

yuan-zm opened this issue

Hello, thanks for your amazing work.

After training the model, I used test_SemanticKITTI.py for inference. However, I found that self.test_dataset.min_possibility is not updating during test time. Could you please give me some suggestions?

Hi,

According to the source code here, we pick the points with the lowest possibility while testing. As you can see, the minimum possibility increases (+= delta) each time. In fact, it takes a long time to reach the termination condition (i.e. min_possibility > threshold), but I think the implementation is correct.
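For intuition, here is a minimal sketch of the possibility-based sampling loop; the names and numbers are illustrative, not this repo's exact code:

```python
import numpy as np

# Illustrative sketch of possibility-based test sampling (not this repo's exact code).
# Every point starts with a tiny random "possibility"; each crop is centred on the
# point with the lowest value, and the sampled points' possibilities are increased,
# so successive crops sweep the whole cloud until min_possibility > threshold.

def pick_patch(points, possibility, num_points=4096):
    center_idx = int(np.argmin(possibility))
    dists = np.sum((points - points[center_idx]) ** 2, axis=1)
    selected = np.argsort(dists)[:num_points]   # brute-force k-NN; the real code uses a KD-tree
    # nearer points receive a larger increment
    delta = np.square(1 - dists[selected] / np.max(dists[selected]))
    possibility[selected] += delta
    return selected, float(np.min(possibility))

points = np.random.rand(10000, 3).astype(np.float32)
possibility = np.random.rand(len(points)) * 1e-3
threshold = 0.5

while True:
    selected, min_possibility = pick_patch(points, possibility)
    # ... run the network on points[selected] and merge the prediction here ...
    if min_possibility > threshold:
        break
```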

If you have further questions, feel free to discuss them with me. (Maybe I made a mistake and just haven't found it XD)

Thank you for your helpful reply!

Yes, the minimum possibility increases (+= delta) each time. As mentioned in the README, I use PyTorch 1.4 at inference time. It's strange that id(self.test_data.min_possibility) in test_SemanticKITTI.py (line 107) and id(self.test_data.min_possibility) in semkitti_testset.py (line 58) show the same id, yet self.test_data.min_possibility does not increase in test_SemanticKITTI.py (line 107) during inference. So I used the collate_fn (line 138 in semkitti_testset.py) to return test_data.min_possibility, and it works.
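A hedged sketch of that workaround (make_collate_fn is my own illustrative helper, not the function at line 138 of semkitti_testset.py): the collate_fn appends the dataset's current min_possibility to every batch, so the test loop reads a value that is fresh at batch-creation time instead of a possibly stale attribute reference.

```python
import numpy as np
from torch.utils.data.dataloader import default_collate

def make_collate_fn(test_dataset):
    def collate_fn(batch):
        # collate the samples as usual, then attach the live min_possibility table
        data = default_collate(batch)
        min_possibility = np.asarray(test_dataset.min_possibility, dtype=np.float32)
        return data, min_possibility
    return collate_fn

# usage (assuming the dataset yields tensors/arrays that default_collate can handle):
# loader = DataLoader(test_dataset, batch_size=..., collate_fn=make_collate_fn(test_dataset))
# for data, min_possibility in loader:
#     if float(np.min(min_possibility)) > threshold:
#         break
```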

After solving the above question, I met another problem. When I watch nvidia-smi during inference, the Volatile GPU-Util is quite low (about 23%). Increasing num_workers or batch_size brings no gain in GPU utilization. Are you facing the same problem?

Sorry, I am not good at English. Thanks for your helpful answer.

Hi, thanks for your kind reply.

First, the min_possibility issue you pointed out above might be a bug. Maybe I made a mistake in the implementation; sorry for the confusion. However, I've been busy recently, so I may only be able to verify and fix the bug in a few days. Really glad to hear you found a solution! (You can raise a PR if you want.)

Second, about the low GPU-utilization issue: I've suffered from it too. In the official implementation, at inference time RandLA-Net always sequentially selects the points with the lowest possibility, predicts patch by patch, and finally merges/ensembles all the predictions. The commonly used Dataset class cannot support this, because multiple workers would see the same min_possibility table and all pick the same lowest-possibility point (rather than the last-1, last-2, ..., last-N points), which causes duplicated samples in a batch. That's why I use PyTorch's IterableDataset instead of the Dataset class to implement the sequential selection. In this implementation, I set the batch_size when declaring the dataset class instead of when calling the DataLoader; a sketch of this idea follows below. As for num_workers, I face the same problem: whenever I increase the number of workers above 0, the speed drops dramatically. So far, I haven't found a better solution to this strange phenomenon.
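A minimal sketch of that IterableDataset idea, under my own simplified assumptions (the class name, fields, and details differ from semkitti_testset.py):

```python
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class SequentialTestDataset(IterableDataset):
    """Illustrative sketch: yields whole batches sequentially so every crop is
    chosen against the already-updated possibility table (no duplicated picks)."""

    def __init__(self, clouds, batch_size=8, num_points=4096, threshold=0.5):
        self.clouds = clouds                  # list of (N_i, 3) point arrays
        self.batch_size = batch_size          # batch size lives in the dataset, not the DataLoader
        self.num_points = num_points
        self.threshold = threshold
        self.possibility = [np.random.rand(len(c)) * 1e-3 for c in clouds]
        self.min_possibility = [float(np.min(p)) for p in self.possibility]

    def __iter__(self):
        while min(self.min_possibility) < self.threshold:
            batch_pts, batch_idx, batch_cloud = [], [], []
            for _ in range(self.batch_size):
                ci = int(np.argmin(self.min_possibility))
                pts = self.clouds[ci]
                center = pts[int(np.argmin(self.possibility[ci]))]
                dists = np.sum((pts - center) ** 2, axis=1)
                sel = np.argsort(dists)[:self.num_points]
                # update the possibility table *before* picking the next crop
                self.possibility[ci][sel] += np.square(1 - dists[sel] / np.max(dists[sel]))
                self.min_possibility[ci] = float(np.min(self.possibility[ci]))
                batch_pts.append(pts[sel])
                batch_idx.append(sel)
                batch_cloud.append(ci)
            yield (torch.from_numpy(np.stack(batch_pts)),
                   torch.from_numpy(np.stack(batch_idx)),
                   torch.tensor(batch_cloud))

# batch_size=None disables the DataLoader's automatic batching, since the dataset
# already yields complete batches:
# loader = DataLoader(SequentialTestDataset(clouds), batch_size=None, num_workers=0)
```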

Lastly, I am not a native English speaker (I was born in Taiwan), so my English is not proficient.
If you find further problems, feel free to ask me questions or report anything.
Thank you.

Thank you for your helpful suggestions despite your busy schedule.

I've never made a PR on GitHub before, but I'm going to give it a try.

Em, when I use your code for training or inference, I also meet some other problems. For example, I have to use PyTorch 1.1 for training and PyTorch 1.4 for inference. When I use PyTorch 1.1 to train RandLA-Net, I meet an error at line 119 in train_SemanticKITTI.py: the loss is not contiguous. And when I use PyTorch 1.1 for inference, I cannot use IterableDataset. Sadly, I didn't find a good solution for this.
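I can't be sure what line 119 contains, but in my experience this kind of "not contiguous" error in older PyTorch versions is usually worked around by making the tensor contiguous (or using .reshape(), which copies when needed) before the reshaping op. A purely hypothetical illustration:

```python
# Hypothetical illustration only (the actual tensors at train_SemanticKITTI.py:119 may differ):
# call .contiguous() before .view(), or use .reshape(), which copies when necessary.
logits = logits.transpose(1, 2).contiguous().view(-1, num_classes)
labels = labels.contiguous().view(-1)
loss = criterion(logits, labels)
```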

Also, you are missing two functions in data_process.py:
```python
@staticmethod
def load_pc_kitti(pc_path):
    scan = np.fromfile(pc_path, dtype=np.float32)
    scan = scan.reshape((-1, 4))
    points = scan[:, 0:3]  # get xyz
    return points

@staticmethod
def load_label_kitti(label_path, remap_lut):
    label = np.fromfile(label_path, dtype=np.uint32)
    label = label.reshape((-1))
    sem_label = label & 0xFFFF  # semantic label in lower half
    inst_label = label >> 16  # instance id in upper half
    assert ((sem_label + (inst_label << 16) == label).all())
    sem_label = remap_lut[sem_label]
    return sem_label.astype(np.int32)
```
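In case it helps others, here is a hedged usage sketch for these two functions. The yaml path, the DP class name, and the file paths are my own assumptions; the remap table itself is built from the standard learning_map in semantic-kitti.yaml.

```python
import yaml
import numpy as np
# from data_process import DP   # adjust to wherever the class holding the staticmethods lives

# Build the remap LUT from SemanticKITTI's config (learning_map: raw ids -> training class ids).
DATA = yaml.safe_load(open('utils/semantic-kitti.yaml', 'r'))   # assumed path
remap_dict = DATA['learning_map']
max_key = max(remap_dict.keys())
remap_lut = np.zeros((max_key + 100,), dtype=np.int32)
remap_lut[list(remap_dict.keys())] = list(remap_dict.values())

points = DP.load_pc_kitti('sequences/08/velodyne/000000.bin')                 # (N, 3) xyz
labels = DP.load_label_kitti('sequences/08/labels/000000.label', remap_lut)   # (N,) class ids
```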

Your work helps me a lot. Many thanks for sharing!

Closing since the conversation is finished.

@tsunghan-mama @dream-toy Hello,
I have a question: how long does it take to run test_SemanticKITTI.py?

About 40 minutes. I think the bottleneck is the self.update_predict function, because the network's output has to be saved into self.test_probs, so there is a lot of I/O between the GPU and host memory during prediction. So far, I don't have a good solution for this.
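For readers hitting the same bottleneck, here is a rough sketch of what that vote accumulation looks like; the names are illustrative and not necessarily the repo's exact update_predict:

```python
import numpy as np
import torch

# Illustrative vote accumulation: every patch prediction is moved from GPU to CPU
# and smoothed into a per-cloud probability buffer; the .cpu() copy inside the test
# loop is the main source of the GPU<->host traffic mentioned above.
def update_predict(test_probs, logits, point_idx, cloud_idx, smooth=0.98):
    # logits: (B, num_points, num_classes) tensor still on the GPU
    # point_idx: (B, num_points) numpy indices into each cloud; cloud_idx: (B,) cloud ids
    probs = torch.softmax(logits, dim=-1).detach().cpu().numpy()   # GPU -> CPU transfer
    for b in range(probs.shape[0]):
        inds = point_idx[b]
        c = int(cloud_idx[b])
        test_probs[c][inds] = smooth * test_probs[c][inds] + (1 - smooth) * probs[b]
    return test_probs
```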

Is it similar to the original TensorFlow code?

@dream-toy @caoyifeng001 I agree that voting over several predictions is not efficient. I've tried removing the voting procedure but got really bad results. Maybe you can use other methods that do not need this step, such as MinkowskiNet or SPVCNN.

Hi, thanks a lot for the great work.
Just want to know how long it takes to finish one epoch of training.
I'm using one V100 GPU, and the training time is over 2 hours per epoch. Not sure if this is normal.

@13952522076 A single 2080 Ti needs about 42 min per epoch in my experiment.

@huixiancheng Thanks a lot. Just want to know if you made any modifications?

@13952522076 I just modified num_workers and reduced val_batch_size to 15 to fit my device, plus some unrelated log-output modifications.
From the logs, it seems to be working correctly.
(screenshot of the training log, 2021-07-07)

Sorry, please allow me to refuse

@huixiancheng No problem, thanks a lot for your kind help.

Hi, I would like to know how much memory you need for testing on SemanticKITTI. With batch_size=1, I need almost 32 GB of RAM (not GPU memory). Is this normal? Or is there any way to reduce that demand? @dream-toy