xhchrn / MS4L2O

A bug in the implementation of Equation (17)?

simmonssong opened this issue

Hi, thanks for your great work and code.

I wonder if there is a bug in the implementation of Equation (17) in lines 192 to 199 of optimizers/coord_math_lstm.py: the roles of Z and X appear to be swapped, and the bias term should be - B1 rather than + B1. The referenced code is listed as follows.

prox_in = B * (optimizees.X + updateX) + (1 - B) * (optimizees.get_var('Z') + updateZ) + B1
prox_out = optimizees.prox({'P':P * self.step_size, 'X':prox_in}, compute_grad=self.training)
prox_diff = prox_out - optimizees.get_var('Z')
optimizees.X = prox_out + A * prox_diff + B2

# Clean up after the current iteration
# optimizees.Z = prox_out
optimizees.set_var('Z', prox_out)

From my understanding, Z in the code represents $y$ in Equation (17). I think the correct implementation should be like this.

prox_in = (1 - B) * (optimizees.X + updateX) + B * (optimizees.get_var('Z') + updateZ) - B1
prox_out = optimizees.prox({'P':P * self.step_size, 'X':prox_in}, compute_grad=self.training)
prox_diff = prox_out - optimizees.X
optimizees.set_var('Z', prox_out + A * prox_diff + B2)
optimizees.X = prox_out

Best,
Qingyu

Hi Qingyu, thank you for bringing up the issue. Yes, you are correct that the code is not strictly aligned with Equation (17) in the paper. I think the misalignment is due to a change of the formula over the course of the project.

That said, I would not expect this to make a big difference to the empirical performance.

The first line that computes prox_in makes little difference because B and B1 are learned. Moreover, Z will be very close to X once the sequence is converging, and B2 gets close to zero.
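For concreteness, here is a minimal standalone toy check of that first point (a sketch, not the repository code; the tensor shapes and random values are made up, and only the two formulas for prox_in are taken from the snippets above). Since B and B1 are produced by the network, the current ordering with parameters (B, B1) feeds the prox the same input as the proposed ordering would with (1 - B, -B1).

import torch

def prox_in_repo(X, Z, updateX, updateZ, B, B1):
    # first line of the update as it currently appears in coord_math_lstm.py
    return B * (X + updateX) + (1 - B) * (Z + updateZ) + B1

def prox_in_proposed(X, Z, updateX, updateZ, B, B1):
    # first line of the update as proposed in this issue
    return (1 - B) * (X + updateX) + B * (Z + updateZ) - B1

torch.manual_seed(0)
n = 8
X, Z = torch.randn(n), torch.randn(n)
updateX, updateZ = torch.randn(n), torch.randn(n)
B, B1 = torch.rand(n), torch.randn(n)

# If the network learns (1 - B, -B1) instead of (B, B1), both orderings
# produce the same prox input (up to floating-point rounding).
same = torch.allclose(
    prox_in_repo(X, Z, updateX, updateZ, B, B1),
    prox_in_proposed(X, Z, updateX, updateZ, 1 - B, -B1),
)
print(same)  # True

The remaining lines then differ mainly in whether prox_out + A * prox_diff + B2 is stored in X or in Z; once Z is close to X and B2 is close to zero, prox_diff is nearly identical in both versions as well, which is consistent with the two variants performing similarly.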

Thank you again for pointing out this misalignment; it will definitely help other readers.

Yes, both implementations' performances are close in my experiments.

Thank you for confirming :D

Hi, I have another follow-up question about the Sigmoid activation function for B1 and B2. I notice that although the NORM_FUNC settings for B1 and B2 are 'eye' in the provided configurations, self.b_norm (Sigmoid) is applied to them in lines 173 and 174 of optimizers/coord_math_lstm.py. Is this construction motivated by the "bounded" properties proposed in the related theorem? In my experience, if 'eye' is actually applied, the results get dramatically worse.

Yes, the sigmoid provides the boundedness that is aligned with the theorem. Our experience also tells us that the sigmoid helps stabilize the training process by avoiding "crazy" updates to the optimizees. Different components are indeed sensitive to the selection of the norm function.
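As a tiny illustration of that point (a sketch with made-up numbers, not the repository code): the Sigmoid norm function squashes whatever the LSTM outputs for B1 and B2 into (0, 1), while 'eye' (the identity) lets arbitrarily large values flow straight into the update.

import torch

raw = torch.tensor([-7.5, -0.3, 0.0, 2.1, 40.0])  # hypothetical raw LSTM outputs for B1 / B2

def eye(t):
    # 'eye' norm function from the config: identity, so the offsets are unbounded
    return t

b_norm = torch.sigmoid  # what lines 173-174 apply: every entry squashed into (0, 1)

print(eye(raw))     # values pass through unchanged and can be arbitrarily large
print(b_norm(raw))  # bounded offsets, matching the boundedness assumption in the theorem

Bounded B1 and B2 keep the per-coordinate offsets small, which is what helps avoid the "crazy" updates mentioned above.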

Got it. Thanks a lot!