mit-han-lab / TinyChatEngine

TinyChatEngine: On-Device LLM Inference Library

Home Page: https://mit-han-lab.github.io/TinyChatEngine/


The program crashes when given a long input context on Windows (CPU)

Laeglaur opened this issue · comments

commented

Hi, I found that the program crashes during the first forward pass when the input context is long.
I modified the code to print the length of the input tokens and used some corpus text to find the boundary at which the crash occurs.
I also noticed that the first turn prepends some instruction text, so I sent hello first to skip it.

When the length of the input tokens is > 126, the program crashes. I checked the crash position; it happens in the first loop of https://github.com/mit-han-lab/TinyChatEngine/blob/de720b46327ee3b8cbb20a069799ff2e69908a13/llm/src/nn_modules/non_cuda/LLaMAGenerate.cc#L75C38-L75C38
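
For reference, the print I added looks roughly like this. It is only a minimal sketch, not the exact TinyChatEngine code: I am assuming the encoded prompt is held in a std::vector<int> right before the first forward pass in LLaMAGenerate.cc, and the names below are illustrative.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical debug helper, called just before the generation loop starts.
// input_ids stands for the tokenized prompt (illustrative name, not the
// actual variable in LLaMAGenerate.cc).
static void debug_print_input_length(const std::vector<int>& input_ids) {
    printf("%zu\n", input_ids.size());  // crashes start once this exceeds 126
    fflush(stdout);
}
```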

Do you have any idea what might be causing it?

Here is the test log.

(TinyChatEngine) PS C:\Users\Documents\codes\TinyChatEngine\llm> .\chat.exe

TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: hello
ASSISTANT: 
25
Hello! How can I help you today? Is there something specific you would like to know or discuss?
Inference latency, Total time: 7.6 s, 343.8 ms/token, 2.9 token/s, 22 tokens

USER: Context: domains with a joint attention model. The only interaction in a CLIP model between the image and text domain is a single dot product in a learned joint embedding space. We are excited to see CLIP hybridized with this line of work. CLIP is trained on text paired with images on the internet. These image-text pairs are unltered and uncurated and result in CLIP models learning many social biases. This has been previously demonstrated for image caption models(Bhargava & Forsyth, 20
ASSISTANT: 
126
I understand that you're interested in the potential of hybridizing CLIP with joint attention models to improve its performance on tasks such as image-text matching and visual question answering. By incorporating a joint attention model into the CLIP architecture, we can enable the model to better understand the relationships 
between images and text, and generate more accurate and informative captions for images.
Inference latency, Total time: 35.0 s, 442.6 ms/token, 2.3 token/s, 79 tokens

USER: Context: domains with a joint attention model. The only interaction in a CLIP model between the image and text domain is a single dot product in a learned joint embedding space. We are excited to see CLIP hybridized with this line of work. CLIP is trained on text paired with images on the internet. These image-text pairs are unltered and uncurated and result in CLIP models learning many social biases. This has been previously demonstrated for image caption models(Bhargava & Forsyth, 201
ASSISTANT: 
127
commented

Share your HW configuration.

I think this might be memory related.
I am running TinyChatEngine on a Codespace with just 2 cores and 8 GB, and I get crashes (see the rough memory estimate after the log below).

TinyChatEngine by MIT HAN Lab: https://github.com/mit-han-lab/TinyChatEngine
Using model: LLaMA2_7B_chat
Using AWQ for 4bit quantization: https://github.com/mit-han-lab/llm-awq
Loading model... Finished!
USER: hello
ASSISTANT: 
Hello! How can I help you today? Is there something specific you would like to know or talk about?
Inference latency, Total time: 8.3 s, 360.9 ms/token, 2.8 token/s, 23 tokens
USER: What do you know about the Tensor Virtual Machine framework?
ASSISTANT: 
The Tensor Virtual Machine (TVM) is an open-source framework for building and deploying machine learning models. It was developed by Google and is designed to make it easier to train and deploy machine learning models on a variety of platforms, including cloud, edge, and embedded devices. TVM provides a high-level API that allows developers to write machine learning code in a simple and intuitive way, without worrying about the underlying hardware or software details. This makes it easier for developers to focus on building their models and deploying them in a variety of environments.
TVM supports a wide range of machine learning frameworks and libraries, including TensorFlow, PyTorch, and scikit-learn. It also provides tools for debugging and profiling models, as well as support for distributed training and deployment.
Overall, TVM isTerminated
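
For scale, here is a back-of-the-envelope estimate of how the KV cache grows with context length for a LLaMA2-7B-shaped model. It assumes a dense fp32 cache; the actual cache layout and precision in TinyChatEngine may differ, so treat the numbers only as an indication that longer prompts need noticeably more memory on top of the roughly 3.5 GB of 4-bit weights.

```cpp
#include <cstdio>

int main() {
    const long n_layers   = 32;    // LLaMA2-7B
    const long hidden_dim = 4096;  // 32 heads * 128 head_dim
    const long bytes_elem = 4;     // assuming an fp32 cache; halve for fp16
    for (long seq_len : {126L, 512L, 2048L}) {
        // K and V entries: one per layer, per token, per hidden dimension
        const long kv_bytes = 2 * n_layers * seq_len * hidden_dim * bytes_elem;
        printf("seq_len=%4ld -> KV cache ~ %.0f MiB\n",
               seq_len, kv_bytes / (1024.0 * 1024.0));
    }
    return 0;
}
```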

Running the chat on a local machine (WSL2, 12 GB allocated + 4 GB swap) has not produced a crash yet, but as the system starts swapping, the chat gets slower and slower.

Later, I will migrate the Codespace to 4 cores / 16 GB to check whether the crash still occurs (with the same prompt).