lhenault / simpleAI

An easy way to host your own AI API and expose alternative models, while being compatible with "open" AI clients.

Home Page: https://pypi.org/project/simple-ai-server/

Parallelism

Nintorac opened this issue

Hey,

Do you know how to set parallelism? I have wrapped a few APIs, e.g. an Azure OpenAI endpoint, and I can't seem to get it to serve requests in parallel.

I have tried modifying the number of threads assigned to the server, but I don't get any speed-ups (i.e. like here). Any ideas what I'm missing?
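
For context, one common way to raise the thread count for synchronous endpoints in FastAPI/Starlette is to resize AnyIO's default thread limiter. A minimal sketch of that approach, which may or may not match what was tried here (the limit of 100 is arbitrary, and this is not SimpleAI's own configuration):

```python
# Minimal sketch: raising the size of the thread pool that serves
# synchronous endpoints in a FastAPI/Starlette app.
# The limit of 100 is arbitrary, for illustration only.
from anyio import to_thread
from fastapi import FastAPI

app = FastAPI()

@app.on_event("startup")
async def raise_thread_limit():
    # Sync ("def", not "async def") endpoints run on this shared limiter.
    limiter = to_thread.current_default_thread_limiter()
    limiter.total_tokens = 100
```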

I suspect the number of threads isn't giving you any speed-ups because threading helps with IO-bound tasks, while here you're probably more CPU-bound (or GPU-bound).

How about using something like a Kubernetes Deployment and increasing the number of replicas (e.g. with kubectl scale)? It's a more involved process than just increasing the number of threads, but it might also be more flexible.
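
For illustration, scaling an existing Deployment would look like this (the name simpleai is a placeholder, not an actual resource from this project):

```sh
# Scale a hypothetical Deployment named "simpleai" to 4 replicas.
kubectl scale deployment/simpleai --replicas=4
```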

Nah, this is calling out to OpenAI (in Azure) to do the inference.

I'm slightly confused: are you forwarding requests to OpenAI through a SimpleAI instance?

Yeah, it was easier than rewriting my consumer code to accommodate the config differences for Azure.

I was thinking about adding some proxy type of backend that just passes the query to another URL, exactly for that kind of use case. That should be quick to implement and would get rid of the gRPC dependency / bottleneck here. Happy to start working on this if you think it's worth it.
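
For a rough idea of the shape such a backend could take, here is a minimal sketch of a pass-through proxy (not the actual SimpleAI implementation; UPSTREAM_URL and the route are placeholders):

```python
# Minimal sketch of a pass-through proxy backend: forwards an
# OpenAI-style chat completion request to another URL and returns the
# upstream response as-is. All names here are placeholders.
import httpx
from fastapi import FastAPI, Request, Response

UPSTREAM_URL = "https://example-upstream/v1/chat/completions"  # placeholder

app = FastAPI()

@app.post("/chat/completions")
async def proxy_chat_completions(request: Request) -> Response:
    async with httpx.AsyncClient(timeout=60.0) as client:
        upstream = await client.post(
            UPSTREAM_URL,
            content=await request.body(),
            headers={
                # Forward only the headers the upstream needs.
                "Authorization": request.headers.get("authorization", ""),
                "Content-Type": "application/json",
            },
        )
    return Response(
        content=upstream.content,
        status_code=upstream.status_code,
        media_type="application/json",
    )
```

Since the work is pure network IO, a single async worker can await many such requests concurrently.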

ooh yeh, that would be cool!

Oh, silly me, I just needed to scale the number of FastAPI workers using Gunicorn! This also seems to work with actual models; I didn't realise it would be that easy to share the weights across workers.
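
For reference, the Gunicorn invocation looks something like this (main:app is a placeholder for the actual module path of the FastAPI app):

```sh
# Run 4 worker processes; "main:app" is a placeholder for the real app path.
gunicorn main:app --workers 4 --worker-class uvicorn.workers.UvicornWorker --bind 0.0.0.0:8080
```

Each Uvicorn worker is a separate process, so requests are handled in parallel across workers.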