xhan77 / ssd-lm

Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control


A few questions regarding the paper and the code.

jzhang38 opened this issue · comments

Hi Xiaochuan,

Thanks for your wonderful work. It is really eye-opening to come up with the diffusion process on the logits space instead of the token embeddings!
I have a few questions regarding the paper and the code. Could you kindly respond to them?

  1. It is also possible to perform diffusion directly on the probability space. We would just need to normalize the probabilities by dividing by their sum at each timestep. Why did you choose to perform diffusion on the logit space instead of the probability space?
  2. Why don't you link the parameter matrices of "embedding_sum_layer" and "model_embedding_lut"? (You copy the parameters at the beginning of training, but you do not link them, so the two matrices may diverge during training.)
  3. You did not use the transformer-style sine-cosine function for the timestep embedding. Is this choice based on empirical results or intuition?

Thank you so much!

Thanks for your interest in our work!

  1. If we perform diffusion on the probability space then yes, we would need to normalize the probabilities at each timestep. With Gaussian noise, that normalization breaks the nice property that the noisy representation at any timestep has a closed form (where we can merge Gaussians with different variances; this blog has a good explanation). Diffusion on logits is just a straightforward extension of the original diffusion definitions (e.g., DDPM); the final fully noisy representation in our logit approach follows a logit-normal distribution. That said, I think diffusion on the probability space is still possible; a concurrent work here actually offers an interesting perspective.
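The closed-form property mentioned above can be sketched in a few lines. This is a minimal numpy illustration, not the repo's code: the +K/−K "almost-one-hot" logit construction, the schedule values, and all function names here are illustrative assumptions.

```python
import numpy as np

def tokens_to_logit_simplex(token_ids, vocab_size, k=5.0):
    # Hypothetical sketch: map token ids to almost-one-hot logits,
    # +k on the true token and -k everywhere else.
    w0 = -k * np.ones((len(token_ids), vocab_size))
    w0[np.arange(len(token_ids)), token_ids] = k
    return w0

def noisy_logits(w0, t, alpha_bar, rng):
    # Closed-form forward diffusion at an arbitrary timestep t:
    #   w_t ~ N(sqrt(alpha_bar_t) * w0, (1 - alpha_bar_t) * I)
    # This one-shot sampling is exactly what would break if we had to
    # renormalize onto the probability simplex at every intermediate step.
    a = alpha_bar[t]
    return np.sqrt(a) * w0 + np.sqrt(1.0 - a) * rng.standard_normal(w0.shape)

# toy noise schedule
T = 100
betas = np.linspace(1e-4, 0.05, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
w0 = tokens_to_logit_simplex(np.array([3, 1]), vocab_size=8)
wt = noisy_logits(w0, t=50, alpha_bar=alpha_bar, rng=rng)
# a probability view is available whenever needed via softmax
probs = np.exp(wt) / np.exp(wt).sum(-1, keepdims=True)
```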

  2. "embedding_sum_layer" takes in the noisy simplexes, while "model_embedding_lut" takes in the clean context word ids. We think the model may learn different information from the context than from the noisy diffusion representation, so we use separate embedding layers to encode them. I remember that in a very early pilot study we linked the two and the results were worse, but more rigorous experiments would be needed for a clear conclusion.
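The copy-but-don't-tie setup described above can be illustrated with a minimal numpy sketch. This is a hypothetical re-implementation for clarity, not the repo's actual code; the matrix shapes and helper names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, dim = 8, 4

# Two independent embedding matrices, initialized as copies of each other.
# Because they are copied rather than tied, gradient updates during
# training can make them diverge.
model_embedding_lut = rng.standard_normal((vocab, dim))
embedding_sum_layer = model_embedding_lut.copy()

def embed_context(token_ids):
    # clean context word ids: an ordinary lookup
    return model_embedding_lut[token_ids]

def embed_simplex(probs):
    # noisy simplexes: a probability-weighted sum over embedding rows
    return probs @ embedding_sum_layer

# A one-hot simplex reproduces the lookup only while the two copies
# still agree, i.e. at initialization.
one_hot = np.eye(vocab)[[3]]
out_simplex = embed_simplex(one_hot)
out_lookup = embed_context([3])
```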

  3. This is just for simplicity; we did not try the sin/cos function for the timestep embedding (we simply scale the timestep to a real number between 0 and 1 and pass it through a learned linear layer). Similarly, the original definition of positional embeddings involves the sin/cos function, but many current implementations just learn the embeddings directly. That said, we would be interested to see a comparison between different timestep embedding instantiations, though that is not a focus of this work.
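The two timestep-embedding styles discussed above can be sketched side by side. This is an illustrative numpy sketch under my own assumptions (function names, parameter shapes), not code from the repo.

```python
import numpy as np

def linear_timestep_embedding(t, T, W, b):
    # Scale the timestep to [0, 1], then apply one learned linear layer;
    # W (dim,) and b (dim,) stand in for the layer's learned parameters.
    return (t / T) * W + b

def sinusoidal_timestep_embedding(t, dim, max_period=10000.0):
    # Transformer-style sin/cos embedding, shown only for contrast
    # (dim assumed even).
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    ang = t * freqs
    return np.concatenate([np.sin(ang), np.cos(ang)])

# toy usage with hypothetical learned parameters
W, b = np.ones(4), np.zeros(4)
lin_emb = linear_timestep_embedding(50, 100, W, b)
sin_emb = sinusoidal_timestep_embedding(0, 8)
```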

Thanks for your detailed explanation! I will close this thread.