taki0112 / RAdam-Tensorflow

Simple TensorFlow implementation of "On the Variance of the Adaptive Learning Rate and Beyond"



Variable update does not match the official TensorFlow implementation

zrbzrb1106 opened this issue

I ran the TensorFlow-addons RAdam test cases against both the official implementation and the implementation here. In my testing the results do not match for the sparse case.

The test case provided by TF:

var_0 = tf.Variable([1.0, 2.0])
var_1 = tf.Variable([3.0, 4.0])
grad_0 = tf.IndexedSlices(tf.constant([0.1]), tf.constant([0]), tf.constant([2]))
grad_1 = tf.IndexedSlices(tf.constant([0.04]), tf.constant([1]), tf.constant([2]))
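
For reference, this is roughly how the ten rounds below are produced; a minimal sketch assuming the TF-addons RectifiedAdam optimizer with learning_rate=1e-3 (the exact hyperparameters of the original test are an assumption), run together with the variable and gradient definitions above:

import tensorflow as tf
import tensorflow_addons as tfa

# Swap in this repo's RAdam optimizer to reproduce the second set of numbers.
opt = tfa.optimizers.RectifiedAdam(learning_rate=1e-3)
for _ in range(10):
    # Apply the sparse gradients defined above and print both variables.
    opt.apply_gradients(zip([grad_0, grad_1], [var_0, var_1]))
    print(var_0.value(), var_1.value())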

For example, the first ten iterations of the TF code:

tf.Tensor([0.99989 2.     ], shape=(2,), dtype=float32) tf.Tensor([3.      3.99992], shape=(2,), dtype=float32)
tf.Tensor([0.99978006 2.        ], shape=(2,), dtype=float32) tf.Tensor([3.        3.9998398], shape=(2,), dtype=float32)
tf.Tensor([0.9996701 2.       ], shape=(2,), dtype=float32) tf.Tensor([3.        3.9997597], shape=(2,), dtype=float32)
tf.Tensor([0.9995601 2.       ], shape=(2,), dtype=float32) tf.Tensor([3.        3.9996796], shape=(2,), dtype=float32)
tf.Tensor([0.99945015 2.        ], shape=(2,), dtype=float32) tf.Tensor([3.        3.9995995], shape=(2,), dtype=float32)
tf.Tensor([0.99941427 2.        ], shape=(2,), dtype=float32) tf.Tensor([3.        3.9995337], shape=(2,), dtype=float32)
tf.Tensor([0.9993716 2.       ], shape=(2,), dtype=float32) tf.Tensor([3.       3.999461], shape=(2,), dtype=float32)
tf.Tensor([0.9993229 2.       ], shape=(2,), dtype=float32) tf.Tensor([3.        3.9993823], shape=(2,), dtype=float32)
tf.Tensor([0.99926883 2.        ], shape=(2,), dtype=float32) tf.Tensor([3.       3.999298], shape=(2,), dtype=float32)
tf.Tensor([0.9992098 2.       ], shape=(2,), dtype=float32) tf.Tensor([3.        3.9992092], shape=(2,), dtype=float32)

The first ten iterations of this code:

tf.Tensor([0.99989 1.99998], shape=(2,), dtype=float32) tf.Tensor([2.99997 3.99992], shape=(2,), dtype=float32)
tf.Tensor([0.99978006 1.99996   ], shape=(2,), dtype=float32) tf.Tensor([2.99994   3.9998398], shape=(2,), dtype=float32)
tf.Tensor([0.9996701 1.9999399], shape=(2,), dtype=float32) tf.Tensor([2.9999099 3.9997597], shape=(2,), dtype=float32)
tf.Tensor([0.9995601 1.9999199], shape=(2,), dtype=float32) tf.Tensor([2.9998798 3.9996796], shape=(2,), dtype=float32)
tf.Tensor([0.99945015 1.9998999 ], shape=(2,), dtype=float32) tf.Tensor([2.9998498 3.9995995], shape=(2,), dtype=float32)
tf.Tensor([0.99941444 1.9998798 ], shape=(2,), dtype=float32) tf.Tensor([2.9998198 3.9995337], shape=(2,), dtype=float32)
tf.Tensor([0.99937177 1.9998598 ], shape=(2,), dtype=float32) tf.Tensor([2.9997897 3.999461 ], shape=(2,), dtype=float32)
tf.Tensor([0.99932307 1.9998398 ], shape=(2,), dtype=float32) tf.Tensor([2.9997597 3.9993823], shape=(2,), dtype=float32)
tf.Tensor([0.999269  1.9998198], shape=(2,), dtype=float32) tf.Tensor([2.9997296 3.999298 ], shape=(2,), dtype=float32)
tf.Tensor([0.99921   1.9997997], shape=(2,), dtype=float32) tf.Tensor([2.9996996 3.9992092], shape=(2,), dtype=float32)

It seems that for this case:

var_0 = tf.Variable([1.0, 2.0])
grad_0 = tf.IndexedSlices(tf.constant([0.1]), tf.constant([0]), tf.constant([2]))
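
In dense form this gradient is simply [0.1, 0.0]: only index 0 carries a gradient, so a sparse-aware update should leave var_0[1] at 2.0. This can be checked by densifying the slices:

# Densify the IndexedSlices to see which positions actually carry gradient.
print(tf.convert_to_tensor(grad_0))  # tf.Tensor([0.1 0. ], shape=(2,), dtype=float32)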

The value of the variable at index 1 (2.0) is also updated by this repo's code, which is not expected. Looking at the code, the main difference is shown below:

TF code:

with tf.control_dependencies([var_t]):
    var_update = self._resource_scatter_add(
        var, indices, tf.gather(-lr_t * var_t, indices))

This code:

var_update = state_ops.assign_sub(
    var, lr_t * var_t, use_locking=self._use_locking)

So the mismatch seems to be caused by the assign_sub operation, which updates var at both indices. I think this could be a problem: after 2000 iterations this code produces [-2, 1.96] for the first example, while the official implementation gives [-2, 2], so the test case does not pass. Changing assign_sub to _resource_scatter_add solves this problem.
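
To make the difference concrete, here is a toy sketch of the two update paths (the numbers and names are illustrative, not taken from either implementation): a dense assign_sub touches every element of the variable, while the scatter-based update only touches the indices carried by the IndexedSlices.

import tensorflow as tf

var = tf.Variable([1.0, 2.0])
step = tf.constant([0.1, 0.2])   # stands in for lr_t * var_t
indices = tf.constant([0])       # the indices carried by the sparse gradient

# Dense path (this repo): every element of var changes.
var.assign_sub(step)             # var is now [0.9, 1.8]

var.assign([1.0, 2.0])
# Sparse path (TF-addons): only the rows listed in `indices` change.
var.scatter_sub(tf.IndexedSlices(tf.gather(step, indices), indices))  # var is now [0.9, 2.0]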

UPDATE:
This implementation is also correct. After checking the paper, I think the TF implementation is actually a lazy version of RAdam (it only applies the update at the indices present in the sparse gradient), so I will close this issue.