unum-cloud / uform

Pocket-Sized Multimodal AI for content understanding and generation across multilingual texts, images, and 🔜 video, up to 5x faster than OpenAI CLIP and LLaVA 🖼️ & 🖋️

https://unum-cloud.github.io/uform/

CLIP for Voice

chadbrewbaker opened this issue 6 months ago

Chad Brewbaker commented 6 months ago

Would it be sane to get your model to support text to audio clips like this?

One of the DALLE3 engineers has a personal project called Tortoise-TTS, where he has a voice version of CLIP that he calls CLVP.

https://github.com/neonbjb/tortoise-tts/blob/1e061bc6752f05bccb59748c8bd7c7fc85d54988/tortoise/models/clvp.py#L24

I think he used lucidrains' CLIP as a template: https://github.com/lucidrains/DALLE-pytorch/blob/58c1e1a4fef10725a79bd45cdb5581c03e3e59e7/dalle_pytorch/dalle_pytorch.py#L272
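
For context, here is a rough sketch of the CLIP-style symmetric contrastive objective that a text/voice model like this would presumably train with. It is not taken from either linked file and is not UForm code. The function names, the pure-Python lists standing in for tensors, and the `temperature=0.07` default are all assumptions made for illustration.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(sim):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(sim)

def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    # text_emb[i] and audio_emb[i] are assumed to be a matching pair.
    # The temperature value is an illustrative assumption, not a tuned setting.
    text = [l2_normalize(v) for v in text_emb]
    audio = [l2_normalize(v) for v in audio_emb]
    # Pairwise cosine similarities, scaled by the temperature.
    sim = [[sum(a * b for a, b in zip(t, s)) / temperature for s in audio]
           for t in text]
    # Transpose to score audio-to-text as well as text-to-audio.
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))
```

Calling `contrastive_loss(text_batch, audio_batch)` on a batch of paired embeddings should give a lower value when each text embedding is closest to its own audio clip than when the pairs are shuffled.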

Ash Vardanian commented 6 months ago

@VoVoR and @kimihailv, what do you think about this?

Mikhail Kim commented 6 months ago

Hello. It is an interesting suggestion. However, it is not a priority for us right now.