fauxpilot / fauxpilot

FauxPilot - an open-source alternative to GitHub Copilot server


Could this be used by a small team?

cloudwitch opened this issue · comments

commented

I would like to use FauxPilot for a small team of developers (fewer than 10) and am curious whether a single instance could support my whole team.

Is it possible for a single instance (let's assume a single GPU; a p2.xlarge EC2 instance is what I had in mind) to support more than one developer?

If so, how many users would a single GPU be able to support before adding a second GPU?

Is there a certain point where you should stop adding GPUs to a single instance and start distributing load across multiple servers?

Thanks! Looking forward to trying this out sometime!

Here are some rough calculations:

p2.xlarge instances provide K80 GPUs, which can manage about 8.7 TFLOPS at 16-bit precision.

If you want to use a 16B param model, that'll require (16B params) * (2 FLOPs/param) * (1500 tokens) / (8.7 TFLOPS * 0.7 utilization) = 7.8 seconds to provide one completion, assuming a context length of 1500 tokens. With an A100 GPU (312 TFLOPS at 16-bit), it'd take about 219 ms!

So using a 16B param model on a K80 will be a non-starter, even for one person.

Edit: Also, I forgot to mention that the 16B param model will want at least 32 GB of VRAM, and the K80 only has 24 GB. So the model won't even load. You could try a 6B param model instead, which would bring the latency down to ~3 seconds, but that still won't be usable.
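
For reference, here's that back-of-envelope estimate as a small Python sketch (the TFLOPS figures and the 0.7 utilization factor are the assumptions stated above, not measured values):

```python
def est_latency_s(params_b, context_tokens, gpu_tflops, utilization=0.7):
    """Rough estimate: 2 FLOPs per parameter per token,
    divided by the GPU's usable 16-bit throughput."""
    flops_needed = params_b * 1e9 * 2 * context_tokens
    usable_flops = gpu_tflops * 1e12 * utilization
    return flops_needed / usable_flops

# K80 (~8.7 TFLOPS @ 16-bit) vs. A100 (~312 TFLOPS @ 16-bit), 1500-token context
print(f"16B on K80:  {est_latency_s(16, 1500, 8.7):.1f} s")       # ~7.9 s
print(f"6B  on K80:  {est_latency_s(6, 1500, 8.7):.1f} s")        # ~3.0 s
print(f"16B on A100: {est_latency_s(16, 1500, 312) * 1000:.0f} ms")  # ~220 ms
```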

Option 1: You could look at g5.xlarge instances -- they come with one A10 GPU (24 GB, 125 TFLOPS at 16-bit), which would let you run a 6B param model on a single GPU at ~205 ms latency, allowing roughly 5 completions/second without any batching. With batching the numbers should be better, but I need to look at how the calculation works for that. This would be borderline-decent for a 10-member team, I think? And if you have 2 such machines, you could even fit a 16B model. But again, I'll need to look at how to correctly calculate latencies for that. Will get back on this.

Option 2: g4dn.xlarge instances come with a T4 GPU (16 GB, 65 TFLOPS at 16-bit). This comes at a slightly higher price than p2.xlarge but with much better performance. You could use a 6B model with ~400 ms latency, which, while noticeable, might still be tolerable. With 2 such instances, you can get better latency or a larger model.
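
Applying the same sketch to the A10 and T4 options (again just spec-sheet numbers with the assumed 0.7 utilization):

```python
# Same formula as above: params * 2 FLOPs * context_tokens / (TFLOPS * utilization)
def est_latency_ms(params_b, gpu_tflops, context_tokens=1500, utilization=0.7):
    return params_b * 1e9 * 2 * context_tokens / (gpu_tflops * 1e12 * utilization) * 1000

print(f"6B on A10 (125 TFLOPS): {est_latency_ms(6, 125):.0f} ms")  # ~206 ms -> roughly 5 completions/s
print(f"6B on T4  (65 TFLOPS):  {est_latency_ms(6, 65):.0f} ms")   # ~396 ms
```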


Btw, these are just estimates (someone please double-check my calculations, @moyix). I'll try to do a small experiment to get empirical latency numbers in a couple of days.

Oh btw I realized that using Lambda Labs might be a much better option.

They provide an A10 for $0.60/hour, a tad more than the p2.xlarge ($0.40/hour).
They also provide an A100 for $1.10/hour -- about the same rate as one g5.xlarge or two g4dn.xlarge. That's a much better option (40 GB VRAM and 312 TFLOPS) if you're willing to pay that much. It would let you use the 16B param model without issues, and with good latency (~219 ms).

There might be other GPU providers like Lambda that offer similarly cheap rates, and they may be worth looking into. I'm not an expert in GPU infrastructure yet, so I can't give a definitive answer beyond hard constraints like the memory requirements for each model size.

commented

I really appreciate the feedback.

Does it make sense to move this information to the Wiki?

Yes, that makes sense. I'll add it there. I'm planning to run some experiments to get latency/GPU numbers to verify these estimates (held up because the system I have access to is a RHEL box without Docker). I'll add this info to the wiki once I have it.

commented


I'm curious about the meaning of "context length of 1500 tokens" in the calculation above, so I made a small test:

  • Hardware: Tesla T4
  • Model: codegen-multi-2B
  • Inference backend: FasterTransformer

I used Postman to make some requests and calculated the average latency; here are the results:

| prompt_tokens | completion_tokens | total_tokens | latency (ms) |
|---|---|---|---|
| 128 | 5 | 133 | 200 |
| 128 | 10 | 138 | 254 |
| 128 | 16 | 144 | 356 |
| 128 | 20 | 148 | 410 |
| 128 | 32 | 160 | 582 |
| 128 | 64 | 192 | 1075 |

So I want to know: is the "1500 tokens" equal to the total_tokens?

Hi @Jaeker0512, by a context length of 1500 tokens I mean a prompt of 1500 tokens; that's the number used by Copilot. Can you try benchmarking with a prompt length of 1500 tokens and 2-16 completion tokens? I'm planning to run these benchmarks myself, but I've been blocked by some other work, so your numbers would be helpful. Thanks!
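
For anyone who wants to script this instead of using Postman, here's a rough benchmarking sketch; it assumes FauxPilot's OpenAI-compatible completions endpoint at http://localhost:5000/v1/engines/codegen/completions, so adjust the URL, engine name, prompt padding, and request count for your setup:

```python
import time
import requests

URL = "http://localhost:5000/v1/engines/codegen/completions"  # adjust for your deployment
PROMPT = "# benchmark prompt\n" * 150  # pad this until the server reports ~1500 prompt tokens

def time_completion(max_tokens, runs=5):
    """Average wall-clock latency over several requests for a given completion length."""
    latencies = []
    for _ in range(runs):
        start = time.time()
        r = requests.post(URL, json={
            "prompt": PROMPT,
            "max_tokens": max_tokens,
            "temperature": 0.1,
        })
        r.raise_for_status()
        latencies.append(time.time() - start)
    return sum(latencies) / len(latencies)

for n in (2, 4, 8, 16):
    print(f"{n:>2} completion tokens: {time_completion(n) * 1000:.0f} ms avg")
```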

So I finally got around to doing some benchmarking (it took way longer than I'd have liked, but many things got in the way). Some notes from my benchmarking:

[Plot: latency vs. number of output tokens]

Prompt size: 1536 tokens. Output tokens: varied from 1 to 64.

These benchmarks were run on A10 machines on my university cluster. Running a 6B model with a decent experience would require at least 4 A10 GPUs; that gives a latency of around 350 ms for most single-line completions. Multi-line completions will take longer (0.5s-2s depending on the length). You could use the 2B model with 2 GPUs to get similar latency.

In my benchmarking, I have not found a lot of difference between the 2B and 6B models in terms of quality.

[Plot: suggestion quality of the 2B vs. 6B models]

This graph shows how close the 2B and 6B models are in terms of quality of suggestions.


The overall latency depends on two things: processing the context (the 1500-token prompt) and generating the output token by token. The first phase is a fixed cost you always have to pay; the second phase costs O(output length). Most suggestions are expected to be single-line, so expect ~10-20 output tokens, not a lot more.

Here are latency numbers for different numbers of GPUs:

| Model | GPUs | Phase 1 | Phase 2 | Total for 10 tokens | Total for 20 tokens | Total for 50 tokens |
|---|---|---|---|---|---|---|
| 2B | 1 | 250 ms | 15 ms/token | 400 ms | 550 ms | 1000 ms |
| 2B | 2 | 150 ms | 9 ms/token | 240 ms | 330 ms | 600 ms |
| 2B | 4 | 100 ms | 6 ms/token | 160 ms | 220 ms | 400 ms |
| 6B | 1 | 410 ms | 33 ms/token | 740 ms | 1070 ms | 2060 ms |
| 6B | 2 | 240 ms | 19 ms/token | 430 ms | 620 ms | 1190 ms |
| 6B | 4 | 170 ms | 12 ms/token | 290 ms | 410 ms | 770 ms |
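
As a sanity check, the totals in the table follow directly from the two-phase model above; a minimal sketch reproducing one row:

```python
# total latency = fixed prompt-processing cost + per-token decode cost * output length
def total_ms(phase1_ms, per_token_ms, output_tokens):
    return phase1_ms + per_token_ms * output_tokens

# e.g. 6B model on 4 GPUs (phase 1 = 170 ms, phase 2 = 12 ms/token)
for n in (10, 20, 50):
    print(f"{n} tokens: {total_ms(170, 12, n)} ms")  # 290, 410, 770 ms -- matches the table
```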