mistralai / mistral-inference

Official inference library for Mistral models

Home Page: https://mistral.ai/

Using base model on GPU with no bfloat16

yichen0104 opened this issue · comments

Hi. I'm trying to run the mistral-7B-v0.1 model using mistral-inference on an NVIDIA Tesla V100 32GB GPU. Since my GPU doesn't support bfloat16, I would like to know whether it is possible to configure the runtime code to run in fp16 mode, or whether it will raise an error identical to the one in Issue #160. I've tried both mistral-demo and the sample Python code in the README, and both yield the same error. Thanks in advance.

@yichen0104 The underlying library actually supports it; the problem is just that the dtype is not exposed via the CLI. I was able to make it work on my 2x3060 + 2xP100 machine by applying the following patch:

diff --git a/src/mistral_inference/main.py b/src/mistral_inference/main.py
index a5ef3a0..d97c4c9 100644
--- a/src/mistral_inference/main.py
+++ b/src/mistral_inference/main.py
@@ -42,7 +42,7 @@ def load_tokenizer(model_path: Path) -> MistralTokenizer:

 def interactive(
     model_path: str,
-    max_tokens: int = 35,
+    max_tokens: int = 512,
     temperature: float = 0.7,
     num_pipeline_ranks: int = 1,
     instruct: bool = False,
@@ -62,7 +62,7 @@ def interactive(
     tokenizer: Tokenizer = mistral_tokenizer.instruct_tokenizer.tokenizer

     transformer = Transformer.from_folder(
-        Path(model_path), max_batch_size=3, num_pipeline_ranks=num_pipeline_ranks
+        Path(model_path), max_batch_size=3, num_pipeline_ranks=num_pipeline_ranks, dtype=torch.float16
     )

     # load LoRA
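
If you are calling the library from Python (as in the README sample) rather than through the CLI, you can pass the dtype directly to Transformer.from_folder without patching anything. Below is a minimal sketch along those lines; the module paths follow the current README (older releases expose Transformer under mistral_inference.model instead of mistral_inference.transformer), and the model folder path and prompt are placeholders, not from this thread.

import torch

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer

model_path = "/path/to/mistral-7B-v0.1"  # hypothetical local model folder

tokenizer = MistralTokenizer.from_file(f"{model_path}/tokenizer.model")

# dtype=torch.float16 avoids bfloat16, which the V100 does not support
model = Transformer.from_folder(model_path, max_batch_size=1, dtype=torch.float16)

# Plain text completion with the base model
prompt = "The capital of France is"
tokens = tokenizer.instruct_tokenizer.tokenizer.encode(prompt, bos=True, eos=False)

out_tokens, _ = generate(
    [tokens],
    model,
    max_tokens=64,
    temperature=0.7,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
print(tokenizer.decode(out_tokens[0]))

The only change relative to the README sample is the dtype argument, which is the same knob the patch above wires into the CLI path.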