turboderp / exllamav2

A fast inference library for running LLMs locally on modern consumer-class GPUs

dbrx architecture

veryVANYA opened this issue · comments

commented

Any way to create a custom template?

Yes, I want to run this fatboy. I have 3x3090 now, so that should handle it. I thought it could run as GPTQ, but since it's MoE I'm not sure if it will work at all without changes. The template seems like the least of the worries.

also: https://huggingface.co/databricks/dbrx-instruct/discussions/10

I've been working on it and it's coming along, so expect an update soon.

The fused tensors aren't a huge concern, they can just be unfused when loading. Template I'm not sure about yet, since the biggest obstacle is the huge amount of RAM and/or VRAM required for quantizing the expert layers. Figuring out prompting comes after.
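
For context, "unfusing" here just means slicing the stacked per-expert weights back out of a combined tensor at load time. A rough sketch of the idea; the names and shapes are illustrative assumptions, not the actual DBRX tensor keys:

    import torch

    # Illustrative fused MoE weight: 16 experts' [out_dim, in_dim] matrices stacked along dim 0.
    num_experts, out_dim, in_dim = 16, 10752, 6144
    fused = torch.empty(num_experts * out_dim, in_dim)  # assumed checkpoint layout

    # "Unfuse" by viewing the stack as one weight matrix per expert.
    per_expert = fused.view(num_experts, out_dim, in_dim)
    expert_weights = [per_expert[i] for i in range(num_experts)]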

Supported with the most recent commit

I've uploaded conversions here:

dbrx-base-exl2
dbrx-instruct-exl2

Note that measuring requires about 90 GB of system RAM because of all the state that has to be recorded for each expert layer. I've included measurement files though, so you can skip that step if you just want to make more bitrates.

How did you construct the tokenizer.json file, or where did you get it from? The dbrx Hugging Face repo doesn't include one.

I've updated to the latest commit and have enough RAM to do the quantization. I'm starting it up and seeing the following error when trying to quantize dbrx:

No supported tokenizer found.

I went to your huggingface links and grabbed the "tokenizer.json" file and stuck it into the fp16 original dbrx instruct folder, and the quantization code is currently running.

Thank you so much for updating for dbrx, I've been very interested in trying it out <3

It's using this tokenizer.

Note that for instruct versions, you'll want to add this to the config.json to make the model work correctly with a ChatML template:

    "pad_token_id": 100277,
    "bos_token_id": 100278,
    "eos_token_id": 100279

Thanks for your great work. I've just tried the 3.0bpw version and updated to your latest master branch, but I get this error:

    site-packages/exllamav2/fasttensors.py", line 165, in get_cm
        f = safe_open(self.filename, framework = "pt", device = device)
    safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer

BTW, I'm using exl2 directly in text-generation-webui and uninstalled the previous version.

Check that you downloaded all the safetensors files correctly. git clone doesn't work unless you have git-lfs installed.
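
One way to spot a truncated shard is to try opening each file's header with the safetensors library; a sketch, with the directory path as a placeholder:

    import glob
    from safetensors import safe_open

    # Placeholder path to the downloaded model directory.
    for path in sorted(glob.glob("dbrx-instruct-exl2/*.safetensors")):
        try:
            with safe_open(path, framework="pt") as f:
                n = len(f.keys())
            print(f"OK  {path} ({n} tensors)")
        except Exception as e:  # a truncated file raises e.g. MetadataIncompleteBuffer
            print(f"BAD {path}: {e}")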

Thanks, an incomplete download caused this error; my machine keeps shutting down unexpectedly. I'll try again later.

    -rw-r--r-- 1 root root 4854093256 Mar 31 02:09 output-00005-of-00005.safetensors
    -rw-r--r-- 1 root root 8586669360 Mar 31 03:10 output-00004-of-00005.safetensors
    -rw-r--r-- 1 root root 2977955840 Mar 31 03:12 output-00003-of-00005.safetensors
    -rw-r--r-- 1 root root 8590259644 Mar 31 10:26 output-00002-of-00005.safetensors
    -rw-r--r-- 1 root root 8590167688 Mar 31 10:27 output-00001-of-00005.safetensors

That error is gone, but the dbrx model will only load on one GPU even though I set HIP_VISIBLE_DEVICES to 0,1.

torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 18.00 MiB. GPU 0 has a total capacity of 23.98 GiB of which 26.00 MiB is free. Of the allocated memory 21.98 GiB is allocated by PyTorch, and 1.76 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management

I have no idea what's causing this. I've tried both auto split mode and manual mode in text-generation-webui.

I was also able to load a previous large EXL2 model across two GPUs with this version, and that works.
The hardware environment is two 7900 XTX cards in the cloud.

Oh, I found the reason. It's probably caused by text-generation-webui itself; when I run the command given by exllamav2 directly, it splits across both GPUs successfully.

This model is so huge. I tried 2.3 bpw and it still OOMs. I'll try to download the smallest one. Thank you for creating this wonderful project!

So cool, I've loaded it, but something strange is happening, maybe the same question as @veryVANYA:

 -- Loaded model in 16.0700 seconds
 -- Warmup...
 -- Generating...

helloThe post The Future of Work – A New Perspective appeared first on HR Tech World.Talent Management and Recruitment
HR Tech World
HR Tech World
HR Tech World
HR Tech World
HR Tech World
HR Tech World

If you're testing the instruct version this way, with a prompt like Hello, the prompt is prefixed with the BOS token, which in this case makes it <|im_start|>Hello. I have noticed that DBRX-Instruct doesn't do well with a partial ChatML prompt like that. It's not unusual for models to be a little picky with prompt formatting.

I get, for the 2.2bpw version with the prompt Hello:

 -- Warmup...
 -- Generating...

Hello$\\$`$?'?$$'?'?$?'?'$?'?'?'?$'? ...

You can add the -pnb argument to avoid adding the BOS token for the test, which in my case gives:

 -- Warmup...
 -- Generating...

Hello! I am a 28 year old female looking for accommodation in Melbourne CBD. ...

I tried in ExUI with the following prompt:

<|im_start|>system
You are a helpful AI assistant.<|im_end|>
<|im_start|>user
Are there trees on the moon?<|im_end|>
<|im_start|>assistant

giving the completion:

No, there are no trees on the moon. The moon is a natural satellite of Earth, and it lacks an atmosphere and life as we know it on Earth. The moon's surface is covered in regolith, a layer of dust and debris that would not support tree growth.<|im_end|>

So it seems to be working correctly.
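
For anyone scripting this outside ExUI, the prompt above can be assembled with a small helper like the following (a sketch; the function name is mine, and the messages are just the example from this comment):

    def chatml_prompt(system: str, user: str) -> str:
        # Build a ChatML prompt ending with an open assistant turn, as in the example above.
        return (
            f"<|im_start|>system\n{system}<|im_end|>\n"
            f"<|im_start|>user\n{user}<|im_end|>\n"
            f"<|im_start|>assistant\n"
        )

    prompt = chatml_prompt("You are a helpful AI assistant.", "Are there trees on the moon?")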

Thanks for your reply! I really appreciate your work! I just found the template in dbrx's code, and it uses a Jinja2 ChatML template as well. I was trying to turn the Python code into something oobabooga's TGW supports.
My machine had to reboot again because of an AMD GPU segfault, and I find it always happens when VRAM is fully used or the GPU compute is fully loaded.
After rebooting the machine, the command doesn't work anymore 🤯

I'm sure it's the same command as I ran before rebooting the machine:

python test_inference.py -m ~/models/dbrx-instruct-exl2-2.3/ -p "hello" -gs auto -lm -t 2048 -l 2048
 -- Model: ~/models/dbrx-instruct-exl2-2.3/
 -- Options: ['gpu_split: auto', 'length: 2048', 'low_mem']
 -- Loading tokenizer...
 -- Loading model...
 -- Loaded model in 15.7389 seconds
 -- Warmup...
 -- Generating...

Traceback (most recent call last):
  File "~/exllamav2/test_inference.py", line 197, in <module>
    output = generator.generate_simple(args.prompt, settings, args.tokens, token_healing = True, add_bos = not args.prompt_no_bos)
  File "~/exllamav2/exllamav2/generator/base.py", line 181, in generate_simple
    position_offsets = position_offsets).float().cpu()
AttributeError: 'NoneType' object has no attribute 'float'

The AMD machine is always a bit unstable. I'm looking into the reason.

I sadly only have one AMD GPU so autosplit is a little untested on ROCm. You could still try with a manual split, something like -gs 18,24

It seems whatever model I load causes this error. I tried to load Mixtral 8x7B and it failed again, so I just switched to manual split mode. Strangely, when I load from text-generation-webui, it succeeds.
[screenshot]
I've updated the dependency to exllamav2 0.0.17+rocm, so TGW uses the same lib as when I run test_inference.py. Something strange is going on...

It's possibly a tokenization issue in the generator. Maybe just try a longer prompt? "Hello" is a single token and might be confusing it. Although it seems unlikely since I can't make it happen here.

I just try "once upon a time" , it happened again.
python test_inference.py -m ~/models/Mixtral-8x7B-instruct-exl2/ -p "Once upon a time," -gs 12,12 -lm -t 2048 -l 2048 -nfa -fst

Traceback (most recent call last):
  File "~/exllamav2/test_inference.py", line 197, in <module>
    output = generator.generate_simple(args.prompt, settings, args.tokens, token_healing = True, add_bos = not args.prompt_no_bos)
  File "~/exllamav2/exllamav2/generator/base.py", line 181, in generate_simple
    position_offsets = position_offsets).float().cpu()
AttributeError: 'NoneType' object has no attribute 'float'

I'm trying to debug it; I'll add some prints to the code in the exllama lib.

It happens here:
[screenshot]

Oh, it might be the -t 2048 -l 2048 combination. The test script doesn't do a whole lot of sanity checking, so if you ask it to generate 2048 tokens with a maximum context length of 2048, it truncates the prompt to zero tokens to make room for the response. Try -t 1900 -l 2048 or something.
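
In other words, the prompt only gets whatever context is left after reserving room for the response. A back-of-the-envelope sketch of that arithmetic (illustrative numbers only, not the script's actual code):

    # -l sets the maximum context length, -t the number of tokens to generate.
    max_seq_len = 2048
    max_new_tokens = 2048
    print(max_seq_len - max_new_tokens)  # 0 -> the prompt is truncated to nothing

    # Leaving headroom for the prompt avoids the empty-input failure:
    max_new_tokens = 1900
    print(max_seq_len - max_new_tokens)  # 148 tokens of prompt survive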

-t 1900 -l 2048

👍👍👍👍👍 Wow, so cool!
I just asked the model "I think turboderp is handsome, do you think so?"
dbrx answered:
[screenshot]

I'll try the dbrx model with the template you gave this time, and try to find out why text-generation-webui can't load it. 😁

Make sure to update TGW, since there was an issue with recent EXL2 quants that include quantization_config metadata. It should be fixed, but only just a few hours ago.

OK, I'll update that.

My system runs 2x 7900 XTX 24 GB, for a total of 48 GB of VRAM... it works for some things.
With older models, the auto-split seems to work as expected now...
Using nvtop, I can see it fill up the first card and then the second.

I have downloaded the 2.2 bpw versions of both EXL2 models...
They're smaller than models I have loaded before on my cards.
With DBRX, it loads onto one card and then crashes:
"torch.cuda.OutOfMemoryError: HIP out of memory. Tried to allocate 20.00 MiB. GPU"

Clearly this is new, and thanks so much for your work - it is wonderful...

I am posting here so that I see the updates, and wanted to ask if there's anything I can do to help resolve this loading issue.

This is already fixed in TGW, though it's still in the dev branch, it seems. The relevant commit is here so you could either apply that change manually or install the dev branch if you want to test.

So anyone have issues with this model falling apart? It starts out ok and then it begins repeating itself. I add repetition penalty and then it loses coherence, especially as context builds. Starts ignoring the format, outputting code blocks, etc. I have tried sampling it many ways or using presence penalty and it degenerates to the same place. Sometimes reloading the model makes it better but that could just be placebo. 3.75 quant and perplexity is within norms.

The model sure does like code. There's a lot of very code-specific words in the vocabulary too:

      "ĠobjectType": 93902,
      "ĠFileAccess": 93943,
      ".rightBarButtonItem": 94017, 
      ".ImageField": 94061, 
      "LoadIdentity": 94085, 
      "ĠNotSupportedException": 94090,
      "Ġ/****************************************************************": 100089,

I haven't had issues with long prompts myself, though, or with long conversations. If you just let it generate on its own, it does sometimes reach an <|endoftext|> token, after which it just starts generating something random, often code or boring lists. Now, your framework of choice may or may not recognize that as a control token (the config.json doesn't mention it), and the token may or may not be decoded and shown as part of the output stream.
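
If you want to check how your setup will treat <|endoftext|>, one option is to inspect the model's tokenizer.json directly; a sketch using only the standard library, with the file path as a placeholder:

    import json

    # Placeholder path to the converted model directory.
    with open("dbrx-instruct-exl2/tokenizer.json", "r", encoding="utf-8") as f:
        tok = json.load(f)

    # added_tokens lists the added/control tokens and whether they're flagged as "special".
    for t in tok.get("added_tokens", []):
        if t["content"] in ("<|endoftext|>", "<|im_start|>", "<|im_end|>"):
            print(t["id"], t["content"], "special =", t.get("special"))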

Here's an example (ExUI) after successfully summarizing about 8k tokens of a long news article:

[screenshot]

So it's conceivable that what you're seeing is a prompt formatting and/or tokenization issue.

Heh, it's not endoftext. Although I did add that to the stopping strings (along with some of the ``` things it outputs). I also double-checked the prompt format, just to be sure it was right. It happens when you gradually build up the context rather than sending it all in a single go.

I used both textgen with HF sampling and tabbyAPI. When it really goes nuts, it outputs the whole token limit without an EOS or repeats words over and over again. One thing that's different is that I'm not using the DBRX system prompt but my own.

Rep penalty of any kind is what seems to really trigger it. Starts around 3k context, so after you've been chatting for a little bit. Tabby seems to be working better so I'll look for a reproducible way to break it.

There's a lot going on with the sampler there. I can only see some of the settings but it looks like you have both dynamic temperature and typical sampling enabled? I don't really know how those would play together.

High repetition penalty is usually a bad idea, although it varies from model to model what constitutes a high penalty. Some models just break down at 1.05, others seem to tolerate up to 1.2. It also depends on subtle details like whether the EOS token is retained in the tokenized context and therefore affected by the penalty.

It would be more helpful if you could reproduce the behavior with simpler settings, like simply temp = 1, top-P = 0.8 and everything else disabled, but in any case there isn't one set of sampling parameters that's good for all models.
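
For reference, that kind of stripped-down baseline can be set up in exllamav2 roughly like this (a sketch only; the values are the example numbers above, the model/generator setup is omitted, and the "disabled" values reflect my reading of the sampler defaults):

    from exllamav2.generator import ExLlamaV2Sampler

    # Minimal baseline: temperature 1.0, top-P 0.8, everything else effectively off.
    settings = ExLlamaV2Sampler.Settings()
    settings.temperature = 1.0
    settings.top_p = 0.8
    settings.top_k = 0                       # assumed to disable top-K filtering
    settings.token_repetition_penalty = 1.0  # 1.0 = no repetition penalty

    # output = generator.generate_simple(prompt, settings, max_new_tokens)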

I tried a few different sampling strategies with this model. The one in the video was temp=1, min_P=.01, typ_P=.95, smoothing=.20. I usually just use smoothing + curve at temp 1. Also tried dynamic temperature and standard textgen presets. It's nothing too ridiculous.

Changing or turning off sampling doesn't really do much to help. It's the first thing I tried. The model starts repeating phrases 10-20 messages in. Even low rep pens like 1.02-1.05 end up in the same place. The quality of the writing simply deteriorates over time. It gets less coherent and starts double replying, outputting stuff from the prompt, runaway replies.

On chat.lmsys.org it begins repeating phrases fairly quickly, same as local. Maybe it's only good for simple questions or summaries and unable to handle chatting. It's a shame too because it's trained on a lot of tokens and very fast.

A lot of it likely comes down to how the model is finetuned. Emergent properties are never guaranteed, and if the instruct tune didn't cover long conversations it could just be essentially undefined behavior past a certain point.

The model seems perfectly capable of summarizing 10-20k token texts, and it does fine on needle-in-haystack and reverse-needle-in-haystack tests on contexts of that length. So I don't think it's the length of the context that confuses it. But I did some more tests and I have noticed now that there is a tendency for the model to repeat itself after enough rounds of back-and-forth between the user and assistant roles. It doesn't become incoherent in my tests but it does seem to prefer just repeating phrases whenever those phrases can be forced into the reply, rather than exploring new possibilities or advancing the narrative.

I guess we could really use a benchmark that targets this, though I have no idea what that would look like.

It replied perfectly to a 10k-long conversation, so yea, it's not context alone. Hopefully someone finetunes it for chat, but given the size, I feel that's unlikely. There was another model/LoRA from v2ray, but it's tuned on Reddit posts from the ChatGPT sub. On miqu-based models I don't even use rep penalty at all anymore, so it was strange to run into this.

I thought I'd mention that as of the new updates I was able to load dbrx models of the smallest size with 2 GPUs... it nearly fills up all the graphics memory - but it loads and works. Thanks to @turboderp for the models and loader... it's amazing stuff.

@Ph0rk0z @turboderp The instruct model, even at 6.5bpw (with 8-bit heads), still shows loss of coherence on longer outputs. I had such high hopes for it... If anyone wants, I'll upload the quant to HF; maybe I'll try 7bpw first (can't quite fit 8 in 144 GB of VRAM).
btw, it likes the Llama-2 prompt format better than ChatML, in my limited testing.
On to Mixtral 8x22B.

I need to do more testing. I have 4, 6, and 8 bit quants running in oobabooga's textgen with deterministic presets, and it has been giving me great results in chat and instruct modes. I'm using 4 experts as well. I've noticed a quality difference between the 4 bit and 6/8 bit quants with regard to the snake game request: both 6 and 8 bit do it perfectly in one go, while 4 bit makes errors.

Yea, it seems there's nothing wrong with the quants; the model itself just poops itself on back-and-forths. I see it start doing the same on hosted instances. CR+ has been much better, especially using their guide for the system prompts. RIP dbrx.

Hey did you see this?!?! https://huggingface.co/databricks/dbrx-instruct/tree/main

Check their repo out again: 18 hours ago they uploaded a bunch of new files. I don't know if they will fix anything for folks, but I'm going to mess around with them and see if the model behaves any differently.

All I see them changing is the tokenizer. They removed stuff like im_start and added weirdness: "content": "<|fim_prefix|>",