ZJULearning / RMI

Great work!

We ran into an error raised by the call chol = torch.cholesky(matrix). The error output is pasted below:

RuntimeError: cholesky_cuda: For batch 0: U(1,1) is zero, singular U.
2020-08-11T12:37:40.000Z /container_e2240_1583898264103_325873_01_000010: rmi_now = 0.5 * log_det_by_cholesky(appro_var + diag_matrix.type_as(appro_var) * _POS_ALPHA)
2020-08-11T12:37:40.000Z /container_e2240_1583898264103_325873_01_000010:   File "/teamdrive/yuyua/code/segmentation/mmsegmentation/mmseg/models/losses/rmi_loss.py", line 118, in log_det_by_cholesky
2020-08-11T12:37:40.000Z /container_e2240_1583898264103_325873_01_000010:     chol = torch.cholesky(matrix)
2020-08-11T12:37:40.000Z /container_e2240_1583898264103_325873_01_000010: RuntimeError: cholesky_cuda: For batch 0: U(1,1) is zero, singular U.

Well, this is a problem that shows up roughly one time in ten.

cholesky_cuda: For batch 0: U(1,1) is zero, singular U.

This shows that appro_var + diag_matrix.type_as(appro_var) * _POS_ALPHA is numerically singular, although in theory it cannot be singular.
I think this is caused by the numerical instability of torch.cholesky.

You can increase _POS_ALPHA to reduce the probability that this error occurs. It will cause little disturbance to the final result.
You can also find some new linear algebra APIs in recent versions of PyTorch for better computational stability (see pytorch/pytorch#7500).
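For anyone who wants to try the "increase _POS_ALPHA" suggestion without hard-coding a larger value, here is a minimal sketch, not the repository's code. The function name log_det_with_jitter, the starting alpha, the growth factor of 10, and max_tries are all assumptions made for illustration. It also assumes a PyTorch version that provides torch.linalg.cholesky.

```python
import torch


def log_det_with_jitter(matrix, alpha=1e-4, max_tries=5):
    """Return log(det(matrix + alpha * I)) for a (batch of) symmetric matrix.

    Hypothetical helper: alpha plays the role of _POS_ALPHA and is multiplied
    by 10 (an arbitrary choice) each time the factorization fails.
    """
    eye = torch.eye(matrix.shape[-1], dtype=matrix.dtype, device=matrix.device)
    for _ in range(max_tries):
        try:
            chol = torch.linalg.cholesky(matrix + alpha * eye)
        except RuntimeError:
            # The factorization failed (matrix not numerically positive definite);
            # retry with a larger diagonal term.
            alpha *= 10.0
            continue
        # det(A) = prod(diag(L))^2, so log(det(A)) = 2 * sum(log(diag(L))).
        diag = torch.diagonal(chol, dim1=-2, dim2=-1)
        return 2.0 * torch.sum(torch.log(diag), dim=-1)
    raise RuntimeError("Cholesky factorization failed even after increasing alpha")


if __name__ == "__main__":
    # A rank-deficient batch that plain Cholesky cannot factorize.
    singular = torch.zeros(2, 3, 3, dtype=torch.float64)
    print(log_det_with_jitter(singular))
```

Casting the covariance to double precision before the factorization may also help, but that is a guess on my side and has not been checked against the RMI code.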

I am pretty busy right now, and I will try to update the code in the near future.

Thanks for your reply.

So I guess you mean that a better solution should be to modify the following code:

RMI/losses/rmi/rmi.py

Lines 195 to 197 in 1846461

    
    # https://github.com/pytorch/pytorch/issues/7500
    # waiting for batched torch.cholesky_inverse()
    pr_cov_inv = torch.inverse(pr_cov + diag_matrix.type_as(pr_cov) * _POS_ALPHA)

--->

pr_cov_inv = torch.cholesky_inverse(pr_cov)

based on PyTorch 1.6.0?

Right. Using torch.cholesky_inverse to get the inverse is more numerically stable than torch.inverse.

When I use pr_cov_inv = torch.cholesky_inverse(pr_cov) instead of pr_cov_inv = torch.inverse(pr_cov + diag_matrix.type_as(pr_cov) * _POS_ALPHA), a new problem occurs: RuntimeError: invalid argument 2: A should be non-empty 2 dimensional at /opt/conda/conda-bld/pytorch_1607370156314/work/aten/src/THC/generic/THCTensorMathMagma.cu:223

	# https://github.com/pytorch/pytorch/issues/7500
	# waiting for batched torch.cholesky_inverse()
	pr_cov_inv = torch.inverse(pr_cov + diag_matrix.type_as(pr_cov) * _POS_ALPHA)
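A possible reading of the new error: torch.cholesky_inverse expects the Cholesky factor of the matrix, not the matrix itself, and the message "A should be non-empty 2 dimensional" suggests that this build only accepts a single 2-D matrix rather than a batch. The sketch below is one way around both points, under those assumptions. It factorizes the regularized covariance first and then uses torch.cholesky_solve against the identity, which works on batches. The helper name spd_inverse and its alpha default are hypothetical, and torch.linalg.cholesky assumes a PyTorch version that provides it (on older versions torch.cholesky plays the same role).

```python
import torch


def spd_inverse(cov, alpha=5e-4):
    """Invert cov + alpha * I through its Cholesky factor.

    Hypothetical helper: alpha stands in for _POS_ALPHA; the default value
    is only an assumption for this sketch.
    """
    eye = torch.eye(cov.shape[-1], dtype=cov.dtype, device=cov.device)
    # Keep the diagonal term; dropping it makes the factorization more fragile.
    chol = torch.linalg.cholesky(cov + alpha * eye)
    # cholesky_solve(B, L) solves (L @ L^T) X = B, so B = I yields the inverse.
    rhs = eye.expand_as(cov).contiguous()
    return torch.cholesky_solve(rhs, chol)


if __name__ == "__main__":
    torch.manual_seed(0)
    a = torch.randn(4, 3, 3, dtype=torch.float64)
    pr_cov = a @ a.transpose(-1, -2)  # random symmetric PSD batch
    pr_cov_inv = spd_inverse(pr_cov, alpha=1e-3)
    print(pr_cov_inv.shape)
```

Note that the proposed replacement in the comment above also drops the diag_matrix * _POS_ALPHA term. Keeping it, as this sketch does, seems safer given the singular-matrix error reported at the top of the thread.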