
FauxPilot - an open-source alternative to GitHub Copilot server



How to run multiple models on one machine (w/ one GPU)

leemgs opened this issue

I am running inference tasks conveniently with CodeGen models, thanks to the FauxPilot community. Thank you again.
Additionally, I wonder whether it is possible to run multiple models on a single GPU.
Below is the environment I am experimenting with.

  • H/W: Intel Core i7 CPU x1, 32 GB system RAM, Nvidia Titan Xp (12 GB VRAM) x1, 1 TB SSD x1
  • OS: Ubuntu 22.04 (Linux 5.15)
  • Inferencing Server: FauxPilot (based on Triton and Docker)
  • Models: CodeGen (350M-multi) model + fine-tuned CodeGen (350M-multi) model

How can I run two or more CodeGen models on one machine (with one GPU)?
Any hints and comments are welcome. 😄

For example, what would be a recipe or technique for running inference tasks simultaneously with (1) an original CodeGen 350M model and (2) a fine-tuned CodeGen 350M model on one machine (with one GPU)?
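For reference, the standard Triton pattern for this is to place both models in one model repository and let a single Triton instance serve them side by side. Below is a minimal sketch, assuming a generic Triton image and hypothetical model names/paths (FauxPilot's converted FasterTransformer layout may differ):

```bash
# Sketch only: the image tag, paths, and model names are placeholders, not
# FauxPilot's actual layout. One Triton instance serves both models as long
# as they fit together in the 12 GB of GPU memory.
#
# /models/codegen-350M-multi/config.pbtxt      <- original model
# /models/codegen-350M-multi/1/...
# /models/codegen-350M-multi-ft/config.pbtxt   <- fine-tuned model
# /models/codegen-350M-multi-ft/1/...

docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v /models:/models \
  nvcr.io/nvidia/tritonserver:22.06-py3 \
  tritonserver --model-repository=/models   # loads every model it finds
```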


As we all know, time continues to pass. Yes. You are correct. It is already the year 2023.

In the interim, I pondered the architecture of an inference system capable of running several models on a single GPU.
As depicted in the diagram below, the idea is to build the inference system with a multi-model, one-container-per-model structure. I believe many models can then run in parallel on a single machine equipped with a single GPU, provided the GPU's memory has sufficient capacity.

Any and all feedback is welcome; such insights would be of tremendous assistance to me.

My proposal: Version 1.1

[Diagram: proposed multi-model inference architecture, version 1.1]
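To make the one-container-per-model idea in the diagram concrete, here is a rough sketch using plain Docker; the image tag, paths, container names, and ports are assumptions, and both containers share the single GPU:

```bash
# One Triton container per model, both sharing the single GPU.
# Names, ports, paths, and the image tag are illustrative assumptions.
docker run -d --name triton-codegen-base --gpus all \
  -p 8000:8000 -v /models/base:/model \
  nvcr.io/nvidia/tritonserver:22.06-py3 \
  tritonserver --model-repository=/model

docker run -d --name triton-codegen-ft --gpus all \
  -p 9000:8000 -v /models/finetuned:/model \
  nvcr.io/nvidia/tritonserver:22.06-py3 \
  tritonserver --model-repository=/model
```

Each client would then talk to its own endpoint (port 8000 for the base model, port 9000 for the fine-tuned one).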

@leemgs Looks nice. I think it's a feasible approach, but I have some questions.

  1. In your diagram, I understand that each container holds its own model, rather than the models living directly on the GPU host. Is that right? I guess what you really want to do is split the models out of a single Nvidia Triton instance.
  2. How does your suggestion differ from using the Model Repository feature of NVIDIA Triton Server?

I have not used NVIDIA Triton Server before, but according to the manual, Triton already supports cloud object storage such as AWS S3, and the HTTP/GRPC API appears to let you pass the model name with each inference request; these features seem to cover this scenario.
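For illustration, Triton's HTTP (KServe v2) API addresses a model by name in the URL, so a single server with both models in its repository can be queried per model; the model names below are placeholders:

```bash
# Ask a single Triton server about each model by name (placeholder names).
curl localhost:8000/v2/models/codegen-350M-multi/ready       # original model
curl localhost:8000/v2/models/codegen-350M-multi-ft/ready    # fine-tuned model

# Inference requests also target a specific model by name:
#   POST /v2/models/<model_name>/infer
curl -X POST localhost:8000/v2/models/codegen-350M-multi-ft/infer \
     -H 'Content-Type: application/json' \
     -d @request.json    # request.json: model inputs in KServe v2 format
```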

@joonjeong First, I want to thank you for your kind words.

A1. You are correct. To implement the diagram, each container must operate independently with its own model. Based on your feedback, I revised the original diagram (from ver. 1.0 to ver. 1.1) to make it clearer.

A2. The model repository (e.g., `--model-repository`) in my proposal is a local repository. However, if we need to make changes in the future, we can reconfigure it as a distributed system over the network. :)
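As a small illustration of that point, the same flag accepts either a local path or a remote object-storage URI, so moving from the local setup toward a distributed, shared repository is mostly a configuration change; the bucket name below is hypothetical:

```bash
# Local repository, as in the proposal above:
tritonserver --model-repository=/models

# Remote repository (e.g., AWS S3), one step toward a distributed setup;
# the bucket name is hypothetical.
tritonserver --model-repository=s3://my-bucket/triton-models
```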

Hi @leemgs, I am curious how to set up multiple instances on multiple GPUs. I have 4 GPUs and I would like to run CodeGen-16B on each of them. However, if I just edit .env and launch an extra instance with ./launch.sh, I am told that it is not allowed. Do you know any convenient way to run multiple instances (./launch.sh) without reinstalling FauxPilot? Thanks!
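Not a FauxPilot-specific recipe, but one generic way to sketch this with plain Docker is to pin one Triton container to each GPU and give each container its own name and host port; the image tag and model path are assumptions:

```bash
# Generic sketch only (not FauxPilot's launch.sh): one container per GPU,
# each pinned to a different device, with distinct names and host ports.
for i in 0 1 2 3; do
  docker run -d --name triton-gpu$i --gpus "device=$i" \
    -p $((8000 + i)):8000 \
    -v /models/codegen-16B:/model \
    nvcr.io/nvidia/tritonserver:22.06-py3 \
    tritonserver --model-repository=/model
done
```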