lyogavin / Anima

33B Chinese LLM, DPO QLORA, 100K context, AirLLM 70B inference with single 4GB GPU

Mac AirLLM inference with tigerbot-70b-chat-v2

ageorgios opened this issue:

from sys import platform
from airllm import AutoModel
import mlx.core as mx

assert platform == "darwin", "this example is supposed to be run on mac os"

# model = AutoModel.from_pretrained("01-ai/Yi-34B")#"garage-bAInd/Platypus2-7B")
model = AutoModel.from_pretrained("/Users/ageorgios/Models/tigerbot-70b-chat-v2")

input_text = [
    'Tell me the purpose of life',
]

MAX_LENGTH = 128
# Tokenize the prompt as numpy arrays for the MLX backend
input_tokens = model.tokenizer(input_text,
    return_tensors="np",
    return_attention_mask=False,
    truncation=True,
    max_length=MAX_LENGTH,
    padding=False)

# input_tokens  # bare expression: a no-op when run as a script (notebook leftover)

# Run generation through AirLLM's MLX backend; max_new_tokens=3 caps the reply at three tokens
generation_output = model.generate(
    mx.array(input_tokens['input_ids']),
    max_new_tokens=3,
    use_cache=True,
    return_dict_in_generate=True)

print(generation_output)

This is my code, and I don't think the output is correct.

(.venv) ageorgios@mac airllm % python main.py
/Users/ageorgios/Models/airllm/.venv/lib/python3.11/site-packages/bitsandbytes/cextension.py:34: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers, 8-bit multiplication, and GPU quantization are unavailable.
  warn("The installed version of bitsandbytes was compiled without GPU support. "
'NoneType' object has no attribute 'cadam32bit_grad_fp32'
saved layers already found in /Users/ageorgios/Models/tigerbot-70b-chat-v2/splitted_model
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
        - Avoid using `tokenizers` before the fork if possible
        - Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
running layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:56<00:00,  1.40it/s]
running layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:55<00:00,  1.44it/s]
running layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 80/80 [00:56<00:00,  1.41it/s]
.</s> 
(.venv) ageorgios@mac airllm %
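
One possible factor (an assumption, not a confirmed diagnosis): with max_new_tokens=3 the model can emit at most three tokens, so ".</s>" (a period followed by the end-of-sequence token) may simply be the generation stopping at its budget rather than a broken model. Below is a minimal sketch of the same call with a larger, illustrative budget, using only the AirLLM MLX API already shown above:

# Same call as above, but with a larger token budget so the reply is easier to judge.
# The value 256 is illustrative, not a recommendation.
generation_output = model.generate(
    mx.array(input_tokens['input_ids']),
    max_new_tokens=256,
    use_cache=True,
    return_dict_in_generate=True)

print(generation_output)

If the output is still just an end-of-sequence token with a larger budget, the problem likely lies elsewhere (for example, in the prompt format the chat model expects).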