fauxpilot / fauxpilot

FauxPilot - an open-source alternative to GitHub Copilot server


Could this be used by a small team?

cloudwitch opened this issue · comments

commented

I would like to use FauxPilot for a small team of developers (fewer than 10) and am curious whether a single instance could support my whole team.

Is it possible for a single instance (let's assume a single GPU; a p2.xlarge EC2 instance is what I had in mind) to support more than one developer?

If so, how many users would a single GPU be able to support before adding a second GPU?

Is there a certain point where you should stop adding GPUs to a single instance and start distributing load across multiple servers?

Thanks! Looking forward to trying this out sometime!

Here are some rough calculations:

p2.xlarge instances provide K80 GPUs, which can manage about 8.7 TFLOPS at 16-bit precision.

If you want to use a 16B param model, that'll require (16B params) * (2 FLOPs/param) * (1500 tokens) / (8.7 TFLOPS * 0.7 utilization) = 7.8 seconds to provide one completion, assuming a context length of 1500 tokens. With an A100 GPU (312 TFLOPS at 16-bit), it'd take about 219 ms!

So using a 16B param model on a K80 will be a non-starter, even for one person.

Edit: Also, I forgot to mention that the 16B param model will want at least 32 GB of VRAM, and the K80 only has 24 GB. So the model won't even load. You could try a 6B param model instead, which would bring the latency down to ~3 seconds, but that still won't be usable.
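
For reference, here's that back-of-envelope estimate as a small Python sketch (the TFLOPS figures and the 0.7 utilization factor are the assumptions stated above, not measured values):

```python
def est_latency_s(params_b, context_tokens, gpu_tflops, utilization=0.7):
    """Rough estimate: 2 FLOPs per parameter per token,
    divided by the GPU's usable 16-bit throughput."""
    flops_needed = params_b * 1e9 * 2 * context_tokens
    usable_flops = gpu_tflops * 1e12 * utilization
    return flops_needed / usable_flops

# K80 (~8.7 TFLOPS @ 16-bit) vs. A100 (~312 TFLOPS @ 16-bit), 1500-token context
print(f"16B on K80:  {est_latency_s(16, 1500, 8.7):.1f} s")       # ~7.9 s
print(f"6B  on K80:  {est_latency_s(6, 1500, 8.7):.1f} s")        # ~3.0 s
print(f"16B on A100: {est_latency_s(16, 1500, 312) * 1000:.0f} ms")  # ~220 ms
```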

Option 1: You could look at g5.xlarge instances -- they come with one A10 GPU (24 GB, 125 TFLOPS at 16-bit), which would let you run a 6B param model on a single GPU at ~205 ms latency, allowing roughly 5 completions/second without any batching. With batching the numbers should be better, but I need to look at how the calculation works for that. This would be borderline-decent for a 10-member team, I think? And if you have 2 such machines, you could even fit a 16B model. But again, I'll need to look at how to correctly calculate latencies for that. Will get back on this.

Option 2: g4dn.xlarge instances come with a T4 GPU (16 GB, 65 TFLOPS at 16-bit). This comes at a slightly higher price than p2.xlarge but with much better performance. You could use a 6B model with ~400 ms latency, which, while noticeable, might still be tolerable. With 2 such instances, you can get better latency or a larger model.
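
Applying the same sketch to the A10 and T4 options (again just spec-sheet numbers with the assumed 0.7 utilization):

```python
# Same formula as above: params * 2 FLOPs * context_tokens / (TFLOPS * utilization)
def est_latency_ms(params_b, gpu_tflops, context_tokens=1500, utilization=0.7):
    return params_b * 1e9 * 2 * context_tokens / (gpu_tflops * 1e12 * utilization) * 1000

print(f"6B on A10 (125 TFLOPS): {est_latency_ms(6, 125):.0f} ms")  # ~206 ms -> roughly 5 completions/s
print(f"6B on T4  (65 TFLOPS):  {est_latency_ms(6, 65):.0f} ms")   # ~396 ms
```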


Btw, these are just estimates (someone please double-check my calculations, @moyix). I'll try to do a small experiment to get empirical latency numbers in a couple of days.

Oh btw I realized that using Lambda Labs might be a much better option.

They provide an A10 for $0.60/hour, a tad more than the p2.xlarge ($0.40/hour).
They also provide an A100 for $1.10/hour -- about the same rate as one g5.xlarge or two g4dn.xlarge. That's a much better option (40 GB VRAM and 312 TFLOPS) if you're willing to pay that much. It would let you use the 16B param model without issues, and with good latency (~219 ms).

There might be other GPU providers like Lambda that offer similarly cheap rates, and they may be worth looking into. I'm not an expert in GPU infrastructure yet, so I can't give a definitive answer beyond hard constraints like the memory requirements for each model size.

commented

I really appreciate the feedback.

Does it make sense to move this information to the Wiki?

Yes, that makes sense. I'll add it there. I'm planning to run some experiments to get latency/GPU numbers to verify these estimates (held up because the system I have access to is a RHEL box without Docker). I'll add this info to the wiki once I have it.

commented


I'm curious about the meaning of "context length of 1500 tokens" in the calculation above, so I made a small test:

  • Hardware: Tesla T4
  • Model: codegen-multi-2B
  • Inference backend: FasterTransformer

I used Postman to make some requests and calculated the average latency; here are the results:

| prompt_tokens | completion_tokens | total_tokens | latency (ms) |
|---|---|---|---|
| 128 | 5 | 133 | 200 |
| 128 | 10 | 138 | 254 |
| 128 | 16 | 144 | 356 |
| 128 | 20 | 148 | 410 |
| 128 | 32 | 160 | 582 |
| 128 | 64 | 192 | 1075 |

So I want to know: is the "1500 tokens" equal to the total_tokens?

Hi @Jaeker0512, by a context length of 1500 tokens I mean a prompt of 1500 tokens; that's the number used by Copilot. Can you try benchmarking with a prompt length of 1500 tokens and 2-16 completion tokens? I'm planning to run these benchmarks myself, but I've been blocked by some other work, so your numbers would be helpful. Thanks!
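
For anyone who wants to script this instead of using Postman, here's a rough benchmarking sketch; it assumes FauxPilot's OpenAI-compatible completions endpoint at http://localhost:5000/v1/engines/codegen/completions, so adjust the URL, engine name, prompt padding, and request count for your setup:

```python
import time
import requests

URL = "http://localhost:5000/v1/engines/codegen/completions"  # adjust for your deployment
PROMPT = "# benchmark prompt\n" * 150  # pad this until the server reports ~1500 prompt tokens

def time_completion(max_tokens, runs=5):
    """Average wall-clock latency over several requests for a given completion length."""
    latencies = []
    for _ in range(runs):
        start = time.time()
        r = requests.post(URL, json={
            "prompt": PROMPT,
            "max_tokens": max_tokens,
            "temperature": 0.1,
        })
        r.raise_for_status()
        latencies.append(time.time() - start)
    return sum(latencies) / len(latencies)

for n in (2, 4, 8, 16):
    print(f"{n:>2} completion tokens: {time_completion(n) * 1000:.0f} ms avg")
```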

So I finally got around to doing some benchmarking (it took way longer than I'd have liked, but many things got in the way). Some notes from my benchmarking:

[Plot: latency vs. number of output tokens]

Prompt size: 1536 tokens. Output tokens: varied from 1 to 64.

These benchmarks were run on A10 machines on my university cluster. Running a 6B model with a decent experience would require at least 4 A10 GPUs; that gives a latency of around 350 ms for most single-line completions. Multi-line completions will take longer (0.5s-2s depending on the length). You could use the 2B model with 2 GPUs to get similar latency.

In my benchmarking, I have not found a lot of difference between the 2B and 6B models in terms of quality.

[Plot: suggestion quality of the 2B vs. 6B models]

This graph shows how close the 2B and 6B models are in terms of quality of suggestions.


The overall latency depends on two things: processing the context (the 1500-token prompt) and generating the output token by token. The first phase is a fixed cost you always have to pay; the second phase costs O(output length). Most suggestions are expected to be single-line, so expect ~10-20 output tokens, not a lot more.

Here are latency numbers for different numbers of GPUs:

| Model | GPUs | Phase 1 | Phase 2 | Total for 10 tokens | Total for 20 tokens | Total for 50 tokens |
|---|---|---|---|---|---|---|
| 2B | 1 | 250 ms | 15 ms/token | 400 ms | 550 ms | 1000 ms |
| 2B | 2 | 150 ms | 9 ms/token | 240 ms | 330 ms | 600 ms |
| 2B | 4 | 100 ms | 6 ms/token | 160 ms | 220 ms | 400 ms |
| 6B | 1 | 410 ms | 33 ms/token | 740 ms | 1070 ms | 2060 ms |
| 6B | 2 | 240 ms | 19 ms/token | 430 ms | 620 ms | 1190 ms |
| 6B | 4 | 170 ms | 12 ms/token | 290 ms | 410 ms | 770 ms |
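
As a sanity check, the totals in the table follow directly from the two-phase model above; a minimal sketch reproducing one row:

```python
# total latency = fixed prompt-processing cost + per-token decode cost * output length
def total_ms(phase1_ms, per_token_ms, output_tokens):
    return phase1_ms + per_token_ms * output_tokens

# e.g. 6B model on 4 GPUs (phase 1 = 170 ms, phase 2 = 12 ms/token)
for n in (10, 20, 50):
    print(f"{n} tokens: {total_ms(170, 12, n)} ms")  # 290, 410, 770 ms -- matches the table
```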