Yang7879 / 3D-BoNet

🔥3D-BoNet in Tensorflow (NeurIPS 2019, Spotlight)

Home Page:https://arxiv.org/abs/1906.01140

Problem training on my own dataset

lifeiwen opened this issue

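Hello Dr. Yang, I am trying to train your network on my own dataset. My training environment is:
TensorFlow 1.15
CUDA 10
cuDNN 7.0.4
Compilation and data loading work fine, but training aborts after 30-odd epochs. The training log is below; I hope you can help.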
epoch 32 end time is : 2021-01-08 14:21:29.075227
train files shuffled!
is training ep : 33
total train batch num: 100
ep 33 i 0 psemce 0.0 bbvert -0.23279694 l2 0.060497742 ce 0.26028115 siou -0.5535758 bbscore 0.0038582273 pmask 0.6467816
ep 33 i 0 test psem 0.0 bbvert 1.9851223 l2 0.059121773 ce 2.2534707 siou -0.3274702 bbscore 0.0048341155 pmask 3.4519947
test pred bborder [[2 1 0]]
ep 33 i 20 psemce 0.0 bbvert -0.44303733 l2 0.030050844 ce 0.16678412 siou -0.6398723 bbscore 0.0026201883 pmask 0.63400114
ep 33 i 20 test psem 0.0 bbvert -0.4450612 l2 0.04898257 ce 0.23810822 siou -0.732152 bbscore 0.0016184862 pmask 0.47476012
test pred bborder [[2 0 1]]
ep 33 i 40 psemce 0.0 bbvert 0.43996847 l2 0.08725857 ce 0.8109299 siou -0.45822 bbscore 0.0034747643 pmask 0.9270937
ep 33 i 40 test psem 0.0 bbvert -0.07955924 l2 0.040802542 ce 0.4581066 siou -0.5784684 bbscore 0.0087565 pmask 1.0110209
test pred bborder [[2 0 1]]
ep 33 i 60 psemce 0.0 bbvert 0.057684183 l2 0.071036406 ce 0.5709717 siou -0.58432394 bbscore 0.00046247765 pmask 0.48829234
ep 33 i 60 test psem 0.0 bbvert -0.3817188 l2 0.03684793 ce 0.2770622 siou -0.69562894 bbscore 0.0019779946 pmask 0.54431504
test pred bborder [[2 0 1]]
ep 33 i 80 psemce 0.0 bbvert 0.038050413 l2 0.034902867 ce 0.6071266 siou -0.60397905 bbscore 0.015431552 pmask 0.8978894
ep 33 i 80 test psem 0.0 bbvert 1.1076844 l2 0.07785928 ce 1.3761435 siou -0.34631833 bbscore 0.033992507 pmask 1.7343999
test pred bborder [[2 0 1]]
model saved in : ./log/train_mod/model033.cptk
epoch 33 end time is : 2021-01-08 14:21:44.245053
train files shuffled!
is training ep : 34
total train batch num: 100
ep 34 i 0 psemce 0.0 bbvert -0.41581324 l2 0.057975773 ce 0.28829214 siou -0.76208115 bbscore 0.0003172583 pmask 0.34254307
ep 34 i 0 test psem 0.0 bbvert 1.7912706 l2 0.08668331 ce 2.017744 siou -0.31315675 bbscore 0.00576146 pmask 2.1253805
test pred bborder [[0 2 1]]
ep 34 i 20 psemce 0.0 bbvert -0.14073128 l2 0.034625944 ce 0.4937689 siou -0.6691261 bbscore 0.0047056335 pmask 0.7088615
ep 34 i 20 test psem 0.0 bbvert 1.9534252 l2 0.0907397 ce 2.1705353 siou -0.3078499 bbscore 0.0019757028 pmask 2.682012
test pred bborder [[0 2 1]]
ep 34 i 40 psemce 0.0 bbvert 0.27091432 l2 0.053299602 ce 0.7194822 siou -0.5018675 bbscore 0.001409175 pmask 0.60204554
ep 34 i 40 test psem 0.0 bbvert -0.28416353 l2 0.09286666 ce 0.19349718 siou -0.5705274 bbscore 0.003192804 pmask 0.21589296
test pred bborder [[1 0 2]]
2021-01-08 14:21:52.432564: W tensorflow/core/framework/op_kernel.cc:1639] Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)

File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only
row_ind, col_ind = linear_sum_assignment(valid_cost)

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment
raise ValueError("matrix contains invalid numeric entries")

ValueError: matrix contains invalid numeric entries

Traceback (most recent call last):
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
return fn(*args)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
target_list, run_metadata)
File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)

File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only
row_ind, col_ind = linear_sum_assignment(valid_cost)

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment
raise ValueError("matrix contains invalid numeric entries")

ValueError: matrix contains invalid numeric entries

 [[{{node bbox/PyFunc}}]]
 [[gradients/backbone/fa_layer1/ThreeInterpolate_grad/ThreeInterpolateGrad/_425]]

(1) Invalid argument: ValueError: matrix contains invalid numeric entries
Traceback (most recent call last):

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 235, in call
ret = func(*args)

File "/home/liu/disk1/Life/3DBoNetPoint818a(linux)/helper_net.py", line 115, in assign_mappings_valid_only
row_ind, col_ind = linear_sum_assignment(valid_cost)

File "/home/liu/anaconda3/envs/tf1.15/lib/python3.6/site-packages/scipy/optimize/_hungarian.py", line 93, in linear_sum_assignment
raise ValueError("matrix contains invalid numeric entries")

ValueError: matrix contains invalid numeric entries

Hi @lifeiwen, it seems that some value(s) in the cost matrix are NaN. This numerical issue can arise when computing the costs (siou or ce).
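One way to confirm this diagnosis is to inspect the cost matrix just before it reaches `linear_sum_assignment`, which the traceback shows being called on `valid_cost` inside `assign_mappings_valid_only` in `helper_net.py`. The snippet below is only a minimal sketch: the helper name `find_invalid_entries` and the example matrix are hypothetical and are not part of the 3D-BoNet code. It uses only the standard library and works on any nested sequence of numbers, which should include a 2D NumPy array.

```python
import math

def find_invalid_entries(cost):
    """Return (row, col, value) for every NaN or infinite entry of a 2D cost matrix."""
    bad = []
    for r, row in enumerate(cost):
        for c, value in enumerate(row):
            if not math.isfinite(value):
                bad.append((r, c, float(value)))
    return bad

# Hypothetical 3x3 cost matrix containing one NaN (illustration only,
# not values taken from the training log).
example_cost = [
    [0.2, 0.5, 0.9],
    [0.4, float("nan"), 0.1],
    [0.7, 0.3, 0.6],
]
invalid = find_invalid_entries(example_cost)
print(invalid)  # lists the position of each non-finite entry
```

Calling such a check on `valid_cost` right before the assignment step would show whether the matrix already contains NaN or infinite values when the error is raised, and at which positions.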

Hi @Yang7879

Same error here; I am using the S3DIS dataset.

What is the solution to this?

@Yang7879 I also trained the model on S3DIS, but I got the same error.

Hi @lifeiwen

I used the solution from https://github.com/Yang7879/3D-BoNet/issues/24#issuecomment-666012822 but it still produced the same error.

Hi @piseabhijeet, I used TensorFlow 1.14 and that solved the problem, though I don't know why. You can try TensorFlow 1.14.

Hi @lifeiwen

You used TF 1.14 and the same source code for S3DIS dataset without any changes?

Thanks again for your response.

Hi @piseabhijeet, yes, I did not change the source code or the data. What environment are you using? I have tried versions 1.15 and above and the problem persists, but with versions below 1.15 it disappears. It may be caused by one of the functions.

Hi @lifeiwen

Thank you for your quick response.

I just tried running the code on TF 1.13 without any changes, and it is working fine so far.

Yes, I agree with your observation: it does not work on TF 1.15. Thanks to your input, I was able to quickly downgrade and experiment.

@piseabhijeet OK. If you find the cause of the problem on TF 1.15, please let me know. Thanks.

@lifeiwen - sure will do, thanks

Hi, I also want to train on my own data. Can I ask you some questions by email? My email is huangwei_zs@163.com.

@clare19997 @lifeiwen Could you please share your data preprocessing code for dividing a point cloud into blocks to generate the h5 files? My email is souri123@163.com. Thanks in advance.

How do you make your own dataset?

May I ask how you prepared your own dataset? Mine is in .ply format, captured by a depth camera. How should I convert it to the .h5 format the network requires?

Hello, may I ask how you process your own dataset?

For those who hit this error while tuning parameters such as the learning rate (as I just did): a learning rate that is too large may be the cause, making training diverge out of control. Switching to a smaller learning rate fixed it in my case.
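To illustrate how an overly large learning rate can end in NaN values, here is a toy sketch, unrelated to the actual 3D-BoNet training loop: plain gradient descent on f(x) = x^2. The function name `gradient_descent` and the learning rates 0.1 and 1.5 are chosen purely for illustration.

```python
import math

def gradient_descent(lr, steps, x0=1.0):
    """Minimise f(x) = x**2 with fixed-step gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x          # derivative of x**2
        x = x - lr * grad       # each step multiplies x by (1 - 2 * lr)
    return x

small_lr_result = gradient_descent(lr=0.1, steps=200)    # |1 - 2*lr| < 1: shrinks toward 0
large_lr_result = gradient_descent(lr=1.5, steps=1100)   # |1 - 2*lr| > 1: grows until overflow

print(small_lr_result)
print(math.isfinite(large_lr_result))
```

With the small step the iterate contracts toward the minimum, while with the large step its magnitude doubles every iteration until it overflows to a non-finite value. Any loss or cost computed from such a value is then non-finite as well, which is consistent with the "invalid numeric entries" error above.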