ematvey / tensorflow-seq2seq-tutorials

Dynamic seq2seq in TensorFlow, step by step

My own embedding

tanmay2893 opened this issue · comments

I am new to TensorFlow, so pardon me if some of these things are too basic.

I just wanted to confirm what changes I will have to make if I want to use my own embedding. As you mentioned, I have my input as a tensor of shape [max_time, batch_size, embedding_size]. Here is my thinking.
I will remove these lines:
encoder_inputs = tf.placeholder(shape=(None, None), dtype=tf.int32, name='encoder_inputs')
decoder_targets = tf.placeholder(shape=(None, None), dtype=tf.int32, name='decoder_targets')
decoder_inputs = tf.placeholder(shape=(None, None), dtype=tf.int32, name='decoder_inputs')

Instead of these lines:

embeddings = tf.Variable(tf.random_uniform([vocab_size, input_embedding_size], -1.0, 1.0), dtype=tf.float32)
encoder_inputs_embedded = tf.nn.embedding_lookup(embeddings, encoder_inputs)
decoder_inputs_embedded = tf.nn.embedding_lookup(embeddings, decoder_inputs)

I'll put these:

encoder_inputs_embedded = tf.placeholder(shape=(time, None, 4), dtype=tf.float32, name='encoder_embeddings')
decoder_inputs_embedded = tf.placeholder(shape=(time, None, 4), dtype=tf.float32, name='decoder_embeddings')
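
Putting that together, the wiring into the encoder would look roughly like the sketch below (assuming TensorFlow 1.x as in the tutorial; the LSTM cell, encoder_hidden_units, and the embedding size of 4 are just illustrative, one way to consume these placeholders):

import tensorflow as tf  # assumes TensorFlow 1.x

embedding_size = 4         # dimensionality of the pre-computed vectors (illustrative)
encoder_hidden_units = 20  # illustrative hidden size

# Pre-embedded inputs, shape [max_time, batch_size, embedding_size]; no embedding_lookup needed.
encoder_inputs_embedded = tf.placeholder(
    shape=(None, None, embedding_size), dtype=tf.float32, name='encoder_embeddings')
decoder_inputs_embedded = tf.placeholder(
    shape=(None, None, embedding_size), dtype=tf.float32, name='decoder_embeddings')

encoder_cell = tf.contrib.rnn.LSTMCell(encoder_hidden_units)

# time_major=True because the inputs are laid out as [max_time, batch_size, ...].
encoder_outputs, encoder_final_state = tf.nn.dynamic_rnn(
    encoder_cell, encoder_inputs_embedded, dtype=tf.float32, time_major=True)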

But my doubt is about the projection layer. You are projecting to vocabulary size, whereas I only have the embedding size. So, if I change everything to embedding_size, will the logic and working of the code still remain the same? It seems to me that it should.

Thanks

(I assume you are talking about the output projection layer, i.e. the softmax layer.)
embedding_size is the dimensionality of the word vectors, not the number of different words the decoder can produce. In the end you need to output token IDs, not embeddings, so you have to get from the decoder's hidden state to a probability distribution over possible words, i.e. over the vocabulary. That is what the softmax layer does.
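
Concretely, such an output projection can look like the sketch below (assuming TensorFlow 1.x; the linear-layer call and the hidden size are illustrative, not necessarily the exact code in the notebooks):

import tensorflow as tf  # assumes TensorFlow 1.x

vocab_size = 10            # number of distinct tokens the decoder can emit (illustrative)
decoder_hidden_units = 20  # illustrative hidden size

# Stand-ins for the decoder's outputs, [max_time, batch_size, decoder_hidden_units],
# and the target token IDs, [max_time, batch_size].
decoder_outputs = tf.placeholder(
    shape=(None, None, decoder_hidden_units), dtype=tf.float32, name='decoder_outputs')
decoder_targets = tf.placeholder(shape=(None, None), dtype=tf.int32, name='decoder_targets')

# Project to vocab_size, not embedding_size: one logit per vocabulary entry.
decoder_logits = tf.contrib.layers.linear(decoder_outputs, vocab_size)

# Softmax over the vocabulary turns the logits into a distribution over token IDs.
stepwise_cross_entropy = tf.nn.softmax_cross_entropy_with_logits(
    labels=tf.one_hot(decoder_targets, depth=vocab_size, dtype=tf.float32),
    logits=decoder_logits)
loss = tf.reduce_mean(stepwise_cross_entropy)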

I imagine you could try decoding embeddings directly and then searching for the nearest word via cosine similarity against the whole vocabulary, but this would be way more computationally expensive than softmax. Softmax is already the most expensive operation in this architecture.
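
To illustrate, a rough sketch of that alternative (assuming TensorFlow 1.x; all names and sizes here are hypothetical) still has to score every decoded vector against every vocabulary entry, so it does the same order of work per output step as the softmax it would replace:

import tensorflow as tf  # assumes TensorFlow 1.x

vocab_size, embedding_size = 10, 4  # hypothetical sizes

# Embedding matrix [vocab_size, embedding_size] and decoder outputs that live
# directly in embedding space, [max_time, batch_size, embedding_size].
embeddings = tf.placeholder(
    shape=(vocab_size, embedding_size), dtype=tf.float32, name='embeddings')
decoded_vectors = tf.placeholder(
    shape=(None, None, embedding_size), dtype=tf.float32, name='decoded_vectors')

# Cosine similarity = dot product of L2-normalized vectors.
normed_emb = tf.nn.l2_normalize(embeddings, dim=1)
normed_dec = tf.nn.l2_normalize(decoded_vectors, dim=2)

# Score every decoded vector against every vocabulary row: [max_time, batch_size, vocab_size].
similarities = tf.tensordot(normed_dec, normed_emb, axes=[[2], [1]])

# Nearest word per position -- still O(vocab_size) comparisons per output step.
predicted_ids = tf.argmax(similarities, axis=2)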