Project Roadmap

Question

Project Roadmap

tgaddair opened this issue 9 months ago · comments

Travis Addair commented 9 months ago

WIP project roadmap for LoRAX. We'll continue to update this over time.

v0.10

Speculative decoding adapters
AQLM

v0.11

Previous Releases

v0.9

Adapter memory pool

Backlog

Models

Adapters

Throughput / Latency

Paged Attention v2
Lookahead Decoding
SGMV with variable ranks
SGMV with tensor parallelism

Quantization

bitsandbytes
GPT-Q
AWQ

Usability

Prebuilt server wheels
SkyPilot usage guide
Example notebooks

RileyCodes · Answer 1 · Thu Nov 23 2023 06:45:03 GMT+0800 (China Standard Time)

is AWQ supported?

Travis Addair · Answer 2 · Thu Nov 23 2023 06:57:22 GMT+0800 (China Standard Time)

Hey @RileyCodes, not yet, will add that to the roadmap!

abhibst · Answer 3 · Thu Nov 23 2023 23:53:14 GMT+0800 (China Standard Time)

does we have tested bitsandbytes Quantization ?

Travis Addair · Answer 4 · Fri Nov 24 2023 04:22:51 GMT+0800 (China Standard Time)

Hey @abhibst, I've done some basic sanity checks on it, but haven't tested it very thoroughly. Please feel free to report any issues you encounter and I'll take a look!

abhibst · Answer 5 · Fri Nov 24 2023 05:44:58 GMT+0800 (China Standard Time)

Sure Thanks for confirming

sansavision · Answer 6 · Thu Nov 30 2023 04:48:29 GMT+0800 (China Standard Time)

How would you go about adding this in Stable Diffusion? I am really interested in experimenting with that.

Travis Addair · Answer 7 · Thu Nov 30 2023 06:16:03 GMT+0800 (China Standard Time)

Hey @sansavision, at a high level it would look a lot like the LoRA pipeline used in Diffusers: https://github.com/huggingface/api-inference-community/blob/main/docker_images/diffusers/app/pipelines/text_to_image.py#L25

A v0 shouldn't be too bad, we would basically just run a single forward pass to generate the image and perform postprocessing (as part of the existing Prefill step) and short-circuit the Decode step.

Florian Zimmermeister · Answer 8 · Mon Dec 04 2023 05:38:00 GMT+0800 (China Standard Time)

If no one has started I will start working on awq tomorrow

Travis Addair · Answer 9 · Mon Dec 04 2023 06:14:21 GMT+0800 (China Standard Time)

Nice! Thanks @flozi00, that would be awesome!

Samuel Galanakis · Answer 10 · Wed Dec 06 2023 20:33:00 GMT+0800 (China Standard Time)

Any plans to support vision transformers from huggingface / timm? A lot of potential use cases there for deploying many classifiers. If not what would that entail? Would be open to contributing if possible.

Travis Addair · Answer 11 · Thu Dec 07 2023 01:49:06 GMT+0800 (China Standard Time)

Hey @SamGalanakis, great suggestion! The plan at the moment is to start by supporting text classifiers. Once that framework is in place for that, it should be hopefully relatively straightforward to support image classifiers as well. Happy to start a thread on Discord to discuss!

Florian Zimmermeister · Answer 12 · Thu Dec 07 2023 02:17:45 GMT+0800 (China Standard Time)

Whisper would be also very cool 😄

Samuel Galanakis · Answer 13 · Thu Dec 07 2023 02:25:26 GMT+0800 (China Standard Time)

@tgaddair Ok clear, joined the discord will look out for it!

Hap-Zhang · Answer 14 · Fri Dec 15 2023 15:51:56 GMT+0800 (China Standard Time)

Hi, @tgaddair , could I know how long it will take to support the stable diffusion model?

Travis Addair · Answer 15 · Sun Dec 17 2023 05:19:15 GMT+0800 (China Standard Time)

Hey @Hap-Zhang, the plan at the moment is to add it after we add support for embedding generation and text classification. Both of those are planned for January 2024, so in the next month.

Hap-Zhang · Answer 16 · Mon Dec 18 2023 09:51:50 GMT+0800 (China Standard Time)

@tgaddair Okay, got it. Thank you very much for your efforts. Stay tuned for it.

AdithyanI · Answer 17 · Tue Jan 09 2024 00:10:49 GMT+0800 (China Standard Time)

If we could have OpenAI compatible endpoints that would be great too. So we can use this as drop in replacement for OpenAI models :)

Travis Addair · Answer 18 · Tue Jan 09 2024 01:19:43 GMT+0800 (China Standard Time)

Hey @AdithyanI, yes, this should be coming this week or next! See #145 to follow progress.

AdithyanI · Answer 19 · Tue Jan 09 2024 06:36:26 GMT+0800 (China Standard Time)

@tgaddair oh wow that would be awesome! Thank you so much for the work here.
If you need someone to test it out; let me know. Happy to test it out.

Is the discord still open for others to join :) ?
I followed the link of the repo, and it says it is expired.

Travis Addair · Answer 20 · Wed Jan 10 2024 06:06:20 GMT+0800 (China Standard Time)

@AdithyanI this should be landing some time today :)

#170

Travis Addair · Answer 21 · Wed Jan 10 2024 06:07:03 GMT+0800 (China Standard Time)

Hey @AdithyanI, the Discord should be available. Are you using this link?

https://discord.gg/CBgdrGnZjy

AdithyanI · Answer 22 · Thu Jan 11 2024 15:54:22 GMT+0800 (China Standard Time)

@tgaddair I asked for outlines repo authors to add support to this : outlines-dev/outlines#523
Then it would be great to have text guided generation :)

I don't know how hard is it to integrate that here.
Do you folks know if this is something that can be supported by LORAX?

Travis Addair · Answer 23 · Fri Jan 12 2024 13:22:20 GMT+0800 (China Standard Time)

Thanks for starting the Outlines thread @AdithyanI! Looks like the maintainer created an issue #176. Excited to explore this integration!

Kyle Mistele · Answer 24 · Wed Feb 21 2024 05:52:49 GMT+0800 (China Standard Time)

Would it be possible to add in context length-scaling methods like Self-Extend , Rope scaling, and/or yarn scaling? I know that llama.cpp has a good implementation of these in their server, and self-extend in particular is much more stable than rope or yarn. Having long context or doing context enhancement is super important for RAG applications.

LS · Answer 25 · Tue Feb 27 2024 02:42:57 GMT+0800 (China Standard Time)

About the supported models, could you consider the ChatGLM3 ? @tgaddair

LS · Answer 26 · Mon Mar 11 2024 01:22:09 GMT+0800 (China Standard Time)

LongLoRA

It seems that LongLoRA proposed shifted short attention is compatible with Flash-Attention, and not required during inference (ref: https://huggingface.co/Yukang/Llama-2-13b-longlora-8k#highlights), if that is true, could you share what's the planed support in LoRAX inference side? thanks @tgaddair

remiconnesson · Answer 27 · Sun Mar 17 2024 23:05:21 GMT+0800 (China Standard Time)

Do you plan on supporting AQLM to setve LoRa of Mixtral Instruct with Lorax?

Travis Addair · Answer 28 · Mon Mar 18 2024 04:37:58 GMT+0800 (China Standard Time)

Hey @thincal, the last thing we need to support LongLoRA, if I remember correctly, is #231 which @geoffreyangus is planning to pick up next week.

@remiconnesson, we have PR #233 from @flozi00 for AQLM. It's pretty close to landing, but just needs a little additional work to finish it up. If no one else picks it up, I can probably take a look in the next week or two.

amir-in-a-cynch · Answer 29 · Tue Apr 02 2024 01:07:31 GMT+0800 (China Standard Time)

Are T5 based models on the Roadmap?

remiconnesson · Answer 30 · Tue Apr 02 2024 05:27:34 GMT+0800 (China Standard Time)

@tgaddair

@remiconnesson, we have PR #233 from @flozi00 for AQLM. It's pretty close to landing, but just needs a little additional work to finish it up. If no one else picks it up, I can probably take a look in the next week or two.

Hello :) How far do you think we are for this PR to be merged? :)

Travis Addair · Answer 31 · Thu Apr 04 2024 00:50:20 GMT+0800 (China Standard Time)

Hey @remiconnesson, will probably be the next thing I take a look at after wrapping up speculative decoding this week.

@amir-in-a-cynch we can definitely add T5 to the roadmap!

Thomas Ranzenberger · Answer 32 · Mon Apr 22 2024 22:46:57 GMT+0800 (China Standard Time)

Hello, will you integrate / merge / migrate to the latest hugging face text-generation-inference as it is back now with Apache 2.0 license?

Binoy Dalal · Answer 33 · Sat Aug 10 2024 01:45:27 GMT+0800 (China Standard Time)

Is there an expected release date for v0.11?