Feature Request: Create a multi-threaded API server demo

Question

juntao opened this issue 5 months ago · comments

This only works when the available RAM is several times the size of the model. I think we could demo a PoC using TinyLlama.

1. Start several API server instances, each on a different port.

2. Start an nginx server to proxy incoming requests to these API servers.

3. The nginx server will choose the first API server that does not return 503 for the request.
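
A rough sketch of what the nginx side of steps 2 and 3 might look like. The ports (8080 to 8082), the listen port, and the upstream name `api_servers` are placeholders I made up for illustration, not anything the API server requires. It assumes three instances are already running locally and that a busy instance answers with 503.

```nginx
# Sketch only: ports, listen address, and upstream name are assumptions.
events {}

http {
    upstream api_servers {
        # One entry per API server instance started in step 1.
        # max_fails=0 keeps an instance in rotation even after it returns 503,
        # since a busy instance is expected to become free again shortly.
        server 127.0.0.1:8080 max_fails=0;
        server 127.0.0.1:8081 max_fails=0;
        server 127.0.0.1:8082 max_fails=0;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://api_servers;
            # Pass the request to the next instance when one is unreachable
            # or answers 503. non_idempotent is needed so that POST requests
            # are retried as well.
            proxy_next_upstream error timeout http_503 non_idempotent;
        }
    }
}
```

Note that nginx balances round-robin by default, so the starting instance rotates between requests rather than always being the first one in the list; the request then moves on to the next instance on a 503. If every instance is busy, the client should still see the 503 from the last attempt.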
