[image captioning] model picture
rubencart opened this issue
rubencart commented
Hi,
In your picture here, the output of the LSTM at the 1st timestep (when the input is the image feature vector) is "<start>", which is then fed back into the LSTM at the 2nd timestep. However, I don't think you actually train your LSTM to output the "<start>" token when the image features are the input, right?
So a more accurate picture would be something like this: image. This is also closer to the figure on page 4 of the Show & Tell paper by Vinyals et al. (link).
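To make the difference concrete, here is a minimal sketch of the two input/target alignments in plain Python. The caption tokens, the `"<image>"` placeholder (standing in for the image feature vector), and both function names are hypothetical, chosen only for illustration; the sketch does not claim which alignment the repository's training code actually uses.

```python
# Hypothetical example caption; "<image>" stands in for the image feature vector.
caption = ["<start>", "a", "dog", "runs", "<end>"]


def alignment_as_drawn(caption):
    """Alignment in the current picture: the image step is trained to emit "<start>"."""
    inputs = ["<image>"] + caption[:-1]
    targets = list(caption)
    return list(zip(inputs, targets))


def alignment_show_and_tell(caption):
    """Show & Tell-style alignment: no target at the image step (None),
    so "<start>" is only ever an input, never a prediction."""
    inputs = ["<image>"] + caption[:-1]
    targets = [None] + caption[1:]
    return list(zip(inputs, targets))


if __name__ == "__main__":
    for name, fn in [("as drawn", alignment_as_drawn),
                     ("Show & Tell", alignment_show_and_tell)]:
        print(name)
        for step, (inp, tgt) in enumerate(fn(caption)):
            print(f"  t={step}: input={inp!r} -> target={tgt!r}")
```

The two variants differ only at the first timestep: in the first, the loss would include predicting "<start>" from the image features; in the second, that step contributes no loss.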
Unless I'm mistaken of course :). Cheers!