lhenault / simpleAI

An easy way to host your own AI API and expose alternative models, while being compatible with "open" AI clients.

Home Page: https://pypi.org/project/simple-ai-server/

Parallelism

Nintorac opened this issue

Hey,

Do you know how to set parallelism? I have wrapped a few APIs, e.g. an Azure OpenAI endpoint, and I can't seem to get it to serve requests in parallel.

I have tried modifying the number of threads assigned to the server, but I don't get any speed-ups (i.e. like here). Any ideas what I'm missing?
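
For context, one common way to raise the thread count for synchronous endpoints in FastAPI/Starlette is to resize AnyIO's default thread limiter. A minimal sketch of that approach, which may or may not match what was tried here (the limit of 100 is arbitrary, and this is not SimpleAI's own configuration):

```python
# Minimal sketch: raising the size of the thread pool that serves
# synchronous endpoints in a FastAPI/Starlette app.
# The limit of 100 is arbitrary, for illustration only.
from anyio import to_thread
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def raise_thread_limit():
    # Sync ("def", not "async def") endpoints run on this shared limiter.
    limiter = to_thread.current_default_thread_limiter()
    limiter.total_tokens = 100
```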

I suspect the number of threads isn't giving you any speed-ups because threading helps with IO-bound tasks, while here you're probably more CPU-bound (or GPU-bound).

How about using something like a Kubernetes Deployment and increasing the number of replicas (e.g. with kubectl scale)? It's a more involved process than just increasing the number of threads, but it might also be more flexible.
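
For illustration, scaling an existing Deployment would look like this (the name simpleai is a placeholder, not an actual resource from this project):

```sh
# Scale a hypothetical Deployment named "simpleai" to 4 replicas.
kubectl scale deployment/simpleai --replicas=4
```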

Nah, this is calling out to OpenAI (in Azure) to do the inference.

I'm slightly confused: are you forwarding requests to OpenAI through a SimpleAI instance?

Yeah, it was easier than rewriting my consumer code to accommodate the config differences for Azure.

I was thinking about adding some proxy type of backend that just passes the query to another URL, exactly for that kind of use case. That should be quick to implement and would get rid of the gRPC dependency / bottleneck here. Happy to start working on this if you think it's worth it.
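
For a rough idea of the shape such a backend could take, here is a minimal sketch of a pass-through proxy (not the actual SimpleAI implementation; UPSTREAM_URL and the route are placeholders):

```python
# Minimal sketch of a pass-through proxy backend: forwards an
# OpenAI-style chat completion request to another URL and returns the
# upstream response as-is. All names here are placeholders.
import httpx
from fastapi import FastAPI, Request, Response

UPSTREAM_URL = "https://example-upstream/v1/chat/completions"  # placeholder

app = FastAPI()

@app.post("/chat/completions")
async def proxy_chat_completions(request: Request) -> Response:
    async with httpx.AsyncClient(timeout=60.0) as client:
        upstream = await client.post(
            UPSTREAM_URL,
            content=await request.body(),
            headers={
                # Forward only the headers the upstream needs.
                "Authorization": request.headers.get("authorization", ""),
                "Content-Type": "application/json",
            },
        )
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type="application/json",
    )
```

Since the work is pure network IO, a single async worker can await many such requests concurrently.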

ooh yeh, that would be cool!

Oh, silly me, I just needed to scale the number of FastAPI workers using Gunicorn! This also seems to work with actual models; I didn't realise it would be that easy to share the weights across workers.
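
For reference, the Gunicorn invocation looks something like this (main:app is a placeholder for the actual module path of the FastAPI app):

```sh
# Run 4 worker processes; "main:app" is a placeholder for the real app path.
gunicorn main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080
```

Each Uvicorn worker is a separate process, so requests are handled in parallel across workers.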