Feature Request: Create a multi-threaded API server demo
juntao opened this issue
Michael Yuan commented
Summary
This approach only works when the available RAM is several times the size of the model, because each API server instance loads its own copy of the model. I think we could demo a PoC using TinyLlama.
1. Start several API server instances, each on a different port.
2. Start an nginx server to proxy incoming requests to these API servers.
3. For each request, the nginx server chooses the first API server that does not return 503.
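The selection rule in step 3 can be sketched in a few lines of Python. This is only an illustration of the "first server that does not return 503" logic, not the nginx setup itself. The function name `first_available`, the `send` callback, and the ports are hypothetical and chosen for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

BUSY = 503  # status an API server returns when it cannot take the request


def first_available(
    backends: Iterable[str],
    send: Callable[[str], int],
) -> Tuple[Optional[str], int]:
    """Try each backend in order; return the first whose status is not 503.

    `send` is a hypothetical callback that forwards the request to one
    backend and returns the HTTP status code it answered with.
    """
    for backend in backends:
        try:
            status = send(backend)
        except OSError:
            # Assumption: an unreachable backend is treated like a busy one.
            continue
        if status != BUSY:
            return backend, status
    # Every backend was busy or unreachable, so the client also gets 503.
    return None, BUSY


if __name__ == "__main__":
    # Hypothetical ports; the first two servers pretend to be busy.
    fake_status = {
        "http://127.0.0.1:8080": 503,
        "http://127.0.0.1:8081": 503,
        "http://127.0.0.1:8082": 200,
    }
    print(first_available(fake_status, fake_status.__getitem__))
```

In nginx itself, the closest built-in mechanism is probably an `upstream` block listing the instances together with `proxy_next_upstream http_503`, which passes a request on to the next server when one answers 503. Note that nginx does not retry non-idempotent requests such as POST by default, so the demo may also need the `non_idempotent` parameter.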
Appendix
No response