google-deepmind / dnc

A TensorFlow implementation of the Differentiable Neural Computer.

You can avoid the top_k and allow usage to be differentiable

Joshuaalbert opened this issue

Context

I have replicated the DNC described in the Nature paper and implemented in this repository, with several modifications to the addressing. In my case I am using Keras rather than Sonnet. Originally, implementing the DNC exactly as presented in the Nature paper led to a fairly unstable model for some problems (initialization could have a huge impact on learnability). This led me to reformulate each of the dynamic addressing mechanisms.

Enhancement

Here I request/point out an enhancement to the usage allocation weighting.
I have chosen to implement it without sorting, which means you can remove this line.
This also allows the user to specify an inferrable batch_size. Pardon me if I'm mistaken, but I think it is impossible to have inferrable dimensions and use tf.unstack without resorting to TensorArrays or dynamic partitioning.
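
For reference, this is the sorting-based allocation rule from the Nature paper that the change replaces (in the paper's notation, where $\phi_t$ is the free list obtained by sorting the usage $u_t$ in ascending order, computed with top_k in this repository):

$$a_t[\phi_t[j]] = \big(1 - u_t[\phi_t[j]]\big)\prod_{i=1}^{j-1} u_t[\phi_t[i]]$$

The index sort is the non-differentiable part that the softmax-based formulation below avoids.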

This was done as follows (you can infer what the variable names are, and ignore the self references, as this is pasted from some of my classes):

before write weights

# retention: product over read heads of (1 - free_gate * previous read weights)
free_weighting = K.prod(1. - K.tile(free_gates, (1, 1, self.num_slots)) * w_read_tm1, 1)
# decay the previous usage by the retention; write contributions are added per head below
u_t = u_tm1 * free_weighting
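
In the paper's notation, this amounts to decaying the previous usage by a memory retention vector built from the free gates and the previous read weights (the write-weight contribution is added per head further down):

$$\psi_t = \prod_{i=1}^{R}\big(1 - f_t^i\, w_{t-1}^{r,i}\big), \qquad u_t \leftarrow u_{t-1} \odot \psi_t$$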

get write weights

def _allocation_weights(self, u_t):
    # u_t: (batch_size, num_slots), unbounded non-negative usage
    relative_usage = K.softmax(u_t)
    relative_non_usage = 1. - relative_usage
    relative_non_usage -= K.min(relative_non_usage)
    # a softmax over relative non-usage puts the most weight on the least-used slots
    allocation_weights = K.softmax(relative_non_usage)
    return allocation_weights

# batch_size, W, num_slots
content_address_write = self._content_address(M_tm1, xi['write_keys'], xi['write_strengths'])
# batch_size, W, 1
write_gates = xi['write_gates']
# batch_size, W, 1
allocation_gates = xi['allocation_gates']
# batch_size, W, num_slots
tiled_write_gates = K.tile(write_gates, (1, 1, self.num_slots))
# batch_size, W, num_slots
tiled_allocation_gates = K.tile(allocation_gates, (1, 1, self.num_slots))

# write allocation weights
w_write = []
for w in range(self.num_write_heads):
    allocation_weights = self._allocation_weights(u_t)
    w_write.append(tiled_write_gates[:, w, :] * (tiled_allocation_gates[:, w, :] * allocation_weights
                                                 + (1. - tiled_allocation_gates[:, w, :]) * content_address_write[:, w, :]))
    # update usage with this head's write weights before the next head allocates
    u_t += w_write[-1]
w_write = K.stack(w_write, axis=1)
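
As a quick sanity check, here is a minimal standalone sketch (toy values and shapes I chose for illustration, not from the post above) of the softmax-based allocation on a batch of one with four slots. The resulting weights sum to 1, put the most mass on the least-used slot, and have a well-defined gradient with respect to usage, since no sort or top_k is involved:

import numpy as np
from keras import backend as K

# toy usage for a batch of 1 and 4 memory slots; slot 0 is the least used
u_t = K.variable(np.array([[0.0, 2.0, 5.0, 1.0]], dtype='float32'))

relative_usage = K.softmax(u_t)
relative_non_usage = 1. - relative_usage
relative_non_usage -= K.min(relative_non_usage)
allocation_weights = K.softmax(relative_non_usage)

# gradient of one allocation weight w.r.t. usage is defined everywhere
grad = K.gradients(K.sum(allocation_weights[:, 0]), [u_t])[0]

print(K.eval(allocation_weights))  # rows sum to 1, largest entry at slot 0
print(K.eval(grad))                # finite gradient, no stop_gradient needed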

Intuition of change

The usage is better represented as an unbounded positive number of accesses per slot rather than a number between 0 and 1. The free gates can reset these numbers, as in the original implementation.
The allocation weighting is then a simple (albeit approximate) distribution over the relative non-usage.
It deviates from the way a computer works (in that memory locations cannot be both used and unused on a computer), but it results in a smoother response to changes in memory access patterns. This approximation is counter-balanced by the fact that the write weights remain differentiable, and the sharpness of the allocation weights, as a result, remains quite nominal.
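
In the paper's notation, the original implementation keeps usage bounded in $[0, 1]$ via

$$u_t = \big(u_{t-1} + w_{t-1}^{w} - u_{t-1}\odot w_{t-1}^{w}\big)\odot\psi_t,$$

whereas the update above simply accumulates, $u_t \leftarrow u_{t-1}\odot\psi_t + \sum_j w_t^{w,j}$, so usage behaves like a soft access count that the free gates can decay.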

Result

In the problems I applied it to, I saw noticeably faster training, and the allocation gates were slightly more often close to 1 (a preference for usage-based addressing).

Note: the faster learning might also be related to the temporal-linking modifications I also implemented.

System

Keras 2.0.8, TensorFlow 1.3.0

This is a cool modification, thanks for sharing! We are going to keep this repo frozen as a reference implementation of the Nature paper, so we will not update the addressing mechanisms here, but it's interesting to hear about improvements.

I have tried your method that avoids top_k, and I replaced the _allocation function in the addressing class with

    def _allocation(self, usage):
      with tf.name_scope('allocation'):
        relative_usage = tf.nn.softmax(usage)
        relative_non_usage = 1. - relative_usage
        relative_non_usage -= tf.reduce_min(relative_non_usage)
        allocation_weights = tf.nn.softmax(relative_non_usage)
        return allocation_weights

and also removed the stop_gradient line you mentioned. It seems to work; compared with the original DNC, the loss looks like this:
[screenshot: training loss, new vs. original DNC]

It seems to work (orange is the new DNC, blue is the original one), and it performs the copy task with fewer glitches in the loss. However, looking at what happens inside the memory, the following are the memory contents and link matrix in the new DNC:
[screenshots: memory contents and temporal link matrix, new DNC]

and in the original DNC:
[screenshots: memory contents and temporal link matrix, original DNC]

It turns out the original DNC learns this memory-content and ordering pattern, while the new one using your suggested method does not seem to. I wonder whether you have seen such problems, or whether I did something wrong in my modification?

Thanks