[image captioning] model picture

Question

rubencart opened this issue 5 years ago · comments

Hi,

In your picture here, the output of the LSTM at the first timestep (when the input is the image feature vector) is "<start>", which is then fed back into the LSTM at the second timestep. However, I don't think you actually train the LSTM to output the "<start>" token when it is given the image features, right?

So a more accurate picture would be something like this: image. This is also closer to the figure on page 4 of the Show & Tell paper by Vinyals et al. (link).
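
To illustrate the alignment I have in mind, here is a minimal sketch in plain Python. It follows the scheme as I understand it from the Show & Tell paper, not this repository's actual training code, which may differ. The function name `align_inputs_and_targets`, the `IMAGE` placeholder, and the example caption are all made up for illustration.

```python
from typing import List, Optional, Tuple

# Hypothetical placeholder standing in for the CNN image feature vector.
IMAGE = "<image-features>"


def align_inputs_and_targets(caption: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each LSTM input with the token it is trained to predict.

    Assumed scheme: the image features are the first input and carry no
    loss; "<start>" is only ever an input, never a prediction target.
    """
    tokens = ["<start>"] + caption + ["<end>"]
    inputs = [IMAGE] + tokens[:-1]   # image, <start>, w1, ..., wN
    targets = [None] + tokens[1:]    # no loss, w1, ..., wN, <end>
    return list(zip(inputs, targets))


pairs = align_inputs_and_targets(["a", "dog", "runs"])
for step, (inp, tgt) in enumerate(pairs):
    print(step, inp, "->", tgt if tgt is not None else "(no loss)")
```

Under this assumption, "<start>" appears in the input column at the second timestep but never in the target column, which is the difference from the current picture.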

Unless I'm mistaken of course :). Cheers!