intel / intel-npu-acceleration-library

Intel® NPU Acceleration Library


Request to provide a RAG example?

ChenYuYeh opened this issue · comments

This is a cool project. I got it running well on my Meteor Lake system.

By the way, would you kindly provide a RAG (Retrieval-Augmented Generation) example that can reference external documents? Sooner or later is fine. Thanks.

Reference link: https://github.com/yas-sim/openvino-llm-chatbot-rag

I think it is a great idea! I was working on some splashy demos like LoRA fine-tuning, but this is much lower-hanging fruit. Given the nature of this library, I think we should aim for smooth LangChain integration.

Exactly, going with LangChain makes sense, as it is the most popular RAG solution! I hope I can help validate it soon!
https://python.langchain.com/docs/expression_language/cookbook/retrieval
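
For what it's worth, here is a minimal sketch of what such an integration could look like, following the retrieval cookbook linked above. This is only an illustration, not an official example: it assumes a LangChain 0.1-style install (langchain_core / langchain_community plus faiss-cpu), uses TinyLlama and a MiniLM embedding model purely as placeholders, and reuses the compile call from the llama.py example mentioned later in this thread.

import torch
import intel_npu_acceleration_library
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.llms import HuggingFacePipeline
from langchain_community.vectorstores import FAISS
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough

# LLM compiled for the NPU (same call as in the llama.py example)
model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)
llm = HuggingFacePipeline(
    pipeline=pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256)
)

# Tiny in-memory knowledge base; a real app would load and split documents
docs = ["The Intel NPU Acceleration Library offloads torch.nn.Linear layers to the NPU."]
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
retriever = FAISS.from_texts(docs, embeddings).as_retriever()

prompt = ChatPromptTemplate.from_template(
    "Answer the question using only this context:\n{context}\n\nQuestion: {question}"
)
chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("What does the library offload to the NPU?"))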

@alessandropalla
I tried to verify using the NPU as the device with this GitHub repo: https://github.com/yas-sim/openvino-llm-chatbot-rag. However, it produces the error log below; apparently dynamic shapes are not supported...

I tried both 'dolly2-3b' and 'TinyLlama-1.1B-Chat-v1.0', so it should not be a memory issue.

File "/opt/intel/openvino/python/openvino/runtime/ie_api.py", line 543, in compile_model
super().compile_model(model, device_name, {} if config is None else config),
RuntimeError: Exception from src/inference/src/core.cpp:113:
[ GENERAL_ERROR ] Exception from src/vpux_plugin/src/plugin.cpp:579:
get_shape was called on a descriptor::Tensor with dynamic shape

Therefore I checked the OpenVINO documentation for the NPU device. It mentions certain limitations:
https://docs.openvino.ai/2023.3/openvino_docs_OV_UG_supported_plugins_NPU.html

  • Currently, only the models with static shapes are supported on NPU. <---
  • Running the Alexnet model with NPU may result in a drop in accuracy. At this moment, the googlenet-v4 model is recommended for classification tasks.

Could you kindly verify whether this limitation is observed on your end? Thanks a lot.

output.log

CPU/iGPU works well.
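
For context on the static-shape limitation quoted above: the NPU plugin can only compile models whose input shapes are fully fixed, so an OpenVINO IR exported with a dynamic sequence length has to be reshaped before compile_model. Below is a minimal sketch using the standard openvino.runtime API; the IR path, input names, and sequence length are placeholders.

from openvino.runtime import Core

core = Core()
model = core.read_model("model.xml")  # placeholder path to an exported IR

# Pin every dynamic dimension (batch, sequence length) to a fixed value
# before targeting the NPU; input names here are illustrative only
model.reshape({"input_ids": [1, 256], "attention_mask": [1, 256]})

compiled = core.compile_model(model, "NPU")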

How did you enable NPU inference on that repo? I suggest editing line 59 of openvino-rag-server.py with the following (as in the llama.py example):

# Load the model with PyTorch and compile it for the NPU with int8 weights
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

You should also import the proper libraries:

from transformers import AutoModelForCausalLM
import intel_npu_acceleration_library

Thanks for your suggestions! I can use the NPU for RAG inference now, although both the CPU and the NPU are under extremely high load.
Do you know what the policy is for resource assignment between the CPU and the NPU within your library?

I'm trying to reverse engineer the script you provided. From it, it seems that the embedding model runs on the CPU while the LLM runs mixed on NPU/CPU. It really depends on the model used. By default, the library offloads torch.nn.Linear layers and a few others to the NPU, but we have ad-hoc optimizations for some networks. We want to offload torch.nn.functional.scaled_dot_product_attention to the NPU as well, and I'm working on that. Also, the more you offload at once the better, as explained very well here.
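
To get a rough feel for how much of a given model falls under that default policy, you can count the layers with plain PyTorch introspection. This is not a library API, just an illustration; the TinyLlama checkpoint is used here only because it was mentioned earlier in the thread.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TinyLlama/TinyLlama-1.1B-Chat-v1.0")

npu_candidates, cpu_layers = 0, 0
for name, module in model.named_modules():
    if isinstance(module, torch.nn.Linear):
        npu_candidates += 1  # offloaded to the NPU under the default policy
    elif len(list(module.children())) == 0:
        cpu_layers += 1      # other leaf layers (norms, embeddings, ...) stay on the CPU

print(f"nn.Linear layers (NPU candidates): {npu_candidates}")
print(f"other leaf layers (CPU): {cpu_layers}")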

The NNFactory class can be used to create custom graphs to offload to the NPU (example here for an MLP) and is the next logical step in this library's performance journey:

torch.compile -> fx graph -> subgraph extraction -> NPU
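
As an illustration of the first steps of that pipeline (the NNFactory side is not shown here), plain torch.fx can be used to trace a module and pick out the nodes an NPU backend could claim. The tiny MLP and the selection criterion are assumptions made for the sake of the example.

import torch
import torch.fx as fx

class TinyMLP(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(128, 256)
        self.fc2 = torch.nn.Linear(256, 128)

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))

# Trace the module into an FX graph (the "fx graph" step)
graph_module = fx.symbolic_trace(TinyMLP())

# Subgraph extraction (sketch): collect the nodes a backend could offload,
# here just the calls to nn.Linear submodules
npu_nodes = [
    node for node in graph_module.graph.nodes
    if node.op == "call_module"
    and isinstance(graph_module.get_submodule(node.target), torch.nn.Linear)
]
print([node.name for node in npu_nodes])  # ['fc1', 'fc2']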

External contributions are very welcome, by the way, if you want to implement an ad-hoc backend for the NPU.

I'm not sure what I can contribute to this project. Hopefully you can give me some more hints.
By the way, I verified multiple times that the LLM-powered RAG performs as CPU (13 words/s) > GPU (8 words/s) > NPU (3 words/s). I wonder whether it is appropriate to offload more work to the NPU; it does not seem to behave as you expected.
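
For reference, here is a minimal sketch of how such a throughput comparison can be reproduced. It measures tokens/s rather than words/s, the prompt and checkpoint are placeholders, and the compile call is the one from the llama.py example; commenting it out gives the plain CPU baseline.

import time
import torch
import intel_npu_acceleration_library
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, use_cache=True).eval()

# Comment out the next line to measure the plain CPU baseline
model = intel_npu_acceleration_library.compile(model, dtype=torch.int8)

inputs = tokenizer("Explain retrieval augmented generation in one paragraph.", return_tensors="pt")

start = time.perf_counter()
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
elapsed = time.perf_counter() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")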

Yes, that is not expected. I'll dig deeper.

Hi @alessandropalla
I would like to verify this library with the CPU and GPU as inference devices as well, for platforms without an NPU. How should the compile API be configured for that?
I also wonder whether it can support AMD platforms. Thanks.

Can you clarify which script you are using so I can debug the performance? Also, I'd like to know more about your use cases for heterogeneous compute so I can adapt the compile API to users' needs.

Regarding AMD support, it is not on our short-term roadmap, but I'm very sensitive to community needs, so if you have a strong use case I'd love to hear it.

Hi @alessandropalla,

Kindly ignore the title's request for RAG support. We should focus on LLM performance first.

Therefore there is no need for a new script or Python code; I recommend just using llama.py to benchmark performance across the xPU options. So far there is no parameter in the API to choose the device. Or is this API/library meant to use the NPU exclusively?

No hurry for AMD support. Thanks. :)

OK, I'll go ahead and close the issue then. I'll keep you posted on the next releases.