rhasspy / piper

A fast, local neural text to speech system

Home Page: https://rhasspy.github.io/piper-samples/

High inference memory usage

siddhatiwari opened this issue

If a piper HTTP server comes under heavy load, GPU memory usage can spike by multiple gigabytes and remains high until the server is stopped. Requests can also fail with OOM errors if memory usage climbs too far.

I'm not sure if these are bugs or expected behavior:

  • Why does memory usage remain high permanently instead of decreasing when inference load drops? (see the sketch after this list)
  • Initial memory usage for a loaded model is around 500 MB, far larger than the low/medium-quality model file itself (~50 MB)
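
For what it's worth, piper runs its voices through onnxruntime, and ONNX Runtime's CUDA execution provider allocates GPU memory from an arena that grows under load but is not returned to the driver afterward, which would explain both the spike-and-hold behavior and the footprint being much larger than the model file. A minimal sketch of loading a voice model with a capped arena follows; the model path and the 2 GB limit are illustrative assumptions, not piper's actual loader code:

```python
# Sketch: load an ONNX voice model with a capped CUDA memory arena.
# Assumes onnxruntime-gpu is installed; the model path and the 2 GB
# limit are illustrative, not values taken from piper itself.
import onnxruntime as ort

cuda_options = {
    "device_id": 0,
    # Cap the arena so it cannot grow past 2 GB (value is in bytes).
    "gpu_mem_limit": 2 * 1024 * 1024 * 1024,
    # Grow the arena only by what each request needs instead of
    # doubling, which reduces the worst-case spike under bursty load.
    "arena_extend_strategy": "kSameAsRequested",
}

session = ort.InferenceSession(
    "en_US-lessac-medium.onnx",
    providers=[("CUDAExecutionProvider", cuda_options), "CPUExecutionProvider"],
)
```

With the default "kNextPowerOfTwo" strategy the arena roughly doubles each time it runs out, so a burst of concurrent requests can balloon the allocation well past what steady-state inference needs, and the freed space is kept for reuse rather than released.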

To reproduce, run the HTTP server and send it a high rate of requests per second:
python3 -m piper.http_server -m en_US-lessac-medium --cuda --port 6000
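
A small load generator along these lines should trigger the spike (hypothetical client code; the query-parameter request shape is an assumption about http_server's API, not taken from its docs):

```python
# Sketch: hammer the piper HTTP server with concurrent requests to
# reproduce the memory spike. The endpoint and payload format are
# assumptions about http_server's API.
import concurrent.futures
import urllib.parse
import urllib.request

URL = "http://localhost:6000"
TEXT = "The quick brown fox jumps over the lazy dog."

def synthesize(_):
    # Assumes the server accepts the text as a query parameter and
    # returns WAV audio in the response body.
    query = urllib.parse.urlencode({"text": TEXT})
    with urllib.request.urlopen(f"{URL}/?{query}") as resp:
        return len(resp.read())

# 32 concurrent requests, 1000 total; watch GPU memory with nvidia-smi.
with concurrent.futures.ThreadPoolExecutor(max_workers=32) as pool:
    for _ in pool.map(synthesize, range(1000)):
        pass
```

Watching nvidia-smi while this runs shows the allocation climbing with concurrency and staying flat after the load stops.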