kuprel / min-dalle

min(DALL·E) is a fast, minimal port of DALL·E Mini to PyTorch

Streaming intermediate images?

sabetAI opened this issue · comments

Is it possible to publish an update of the model that supports streaming intermediate images during reverse diffusion, i.e., with an iterator? It would greatly help the UX if users could see their image form while they wait for the process to finish.

This isn't a diffusion model, so that wouldn't work

Diffusion models iteratively update the image over multiple steps. These iterates can be streamed out (e.g., see the GLIDE demo). 'Reverse diffusion' is simply the image generation step ('diffusion' is the noising process during training), which is what your model is doing during inference. Can you update the code to output intermediate images?

Using the term 'reverse diffusion' might have caused some confusion with what I was asking.

This model is not like GLIDE or VQGAN+CLIP.
DALL-E works on an entirely different principle. The image is generated as tiny squares (tokens), square by square, from left to right and top to bottom. It does not change the whole image at once at every iteration like diffusion models do. At each iteration it just fills in another tiny bit of the empty area with a completely finished tiny portion of the final image.
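For illustration, here is a minimal toy sketch of that generation order. The decoder and vocabulary size below are stand-ins, not the real min-dalle components:

```python
import torch

GRID = 16      # assumed 16x16 grid of image tokens
VOCAB = 16384  # assumed image-token vocabulary size

def toy_decoder_logits(tokens_so_far: torch.Tensor) -> torch.Tensor:
    """Stand-in for the real decoder: returns logits for the next token."""
    return torch.randn(VOCAB)

tokens: list[int] = []
for i in range(GRID * GRID):  # left to right, top to bottom
    logits = toy_decoder_logits(torch.tensor(tokens))
    tokens.append(int(torch.argmax(logits)))
    # tokens[:i+1] is a finished prefix of the final image; already-generated
    # tokens are never revisited, unlike the repeated whole-image updates of
    # a diffusion model
```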

Ah good point @iScriptLex, I made assumptions about the model architecture. Even if it's outputting autoregressively, tokens can still be streamed out to incrementally update a canvas a piece at a time. The main use case here is to show intermediate results to the user, as waiting kills the UX.

It might be possible to generate the images each time a row of tokens is decoded, and use some kind of blank token for the missing rows
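A rough sketch of that idea, assuming a 16x16 token grid and a placeholder filler token id (both are assumptions, not taken from the codebase):

```python
import torch

GRID = 16         # assumed 16x16 grid of image tokens
BLANK_TOKEN = 0   # hypothetical filler id for not-yet-generated positions

def padded_grid(tokens: list[int]) -> torch.Tensor:
    """Pad a partially decoded token sequence to a full grid so it can be
    passed through the detokenizer for an intermediate preview image."""
    filler = [BLANK_TOKEN] * (GRID * GRID - len(tokens))
    return torch.tensor(tokens + filler).view(GRID, GRID)

# e.g. run the detokenizer on padded_grid(tokens) each time another full
# row of 16 tokens has finished decoding
```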

@kuprel yes exactly. Also, would it be more efficient to just stream rows of tokens and have the client handle everything else? I want to minimize any latency that streaming might add.

This would still look cool while it was loading, but I worry about latency and bandwidth. Wouldn't a loading bar or something work just as well?

@w4ffl35 can you quantify the marginal latency/bandwidth costs? Loaders may work for one-time uses, but users will churn if they're stuck looking at loaders 95% of the time. See urzas.ai for an example of UX with intermediate outputs. IMO, if a flag were made available, it would be hugely valuable for devs.

@sabetAI those are great points

OK, I got it working in the colab; now I just have to figure out how to get it on Replicate. An intermediate image count of 8 only adds a couple of seconds to the overall decoding time on the P100.

Here's what it looks like (open in a new tab to see the animation): [animated GIF]

@kuprel so good 👏. When can you merge 🙏?

I merged it. You can try it in the colab. Hopefully it will be on Replicate by tomorrow.
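A hypothetical usage sketch for trying it out; the constructor, method, and parameter names below are assumptions rather than a confirmed API, so check the colab or README for the actual call:

```python
import torch
from min_dalle import MinDalle

# hypothetical arguments; the real constructor may differ
model = MinDalle(is_mega=True, is_reusable=True,
                 dtype=torch.float16, device='cuda')

# assumed streaming interface that yields intermediate images as rows of
# tokens finish decoding
for i, image in enumerate(model.generate_image_stream(
        text='a comfy chair that looks like an avocado',
        seed=-1,
        grid_size=1,
        progressive_outputs=True)):
    image.save(f'intermediate_{i}.png')
```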

OK, it's live on Replicate now.