With the Chroma vector store and the bge-large-zh-v1.5 model, queries for certain words recall completely irrelevant chunks.

tanghaichen opened this issue a month ago · comments

yin commented a month ago

Search before asking

I searched the issues and found no similar issues.

Operating system information

Linux

Python version information

3.10

DB-GPT version

main

Related scenes

Installation Information

Device information

GPU 96G

Models information

bge-large-zh-v1.5

What happened

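I am using the bge-large-zh-v1.5 model with the Chroma vector store. When searching for certain words, the recalled chunks have high scores but are completely unrelated to the word. This happens for only one particular word; recall for the vast majority of other words is fairly accurate.
The documents currently include PDF, CSV, and Word files, about 6,000 chunks in total.
Example:
Word: "水资源" (water resources)
The word "水资源" appears directly in 20 documents (900 chunks). None of the other documents contain these three characters.
However, when querying "水资源", the recalled chunks are all unrelated to it.
So far I have not found this problem with any other word.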
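
To make the symptom easier to share as concrete bad cases, here is a minimal sketch that counts how many retrieved chunks literally contain the query term. The helper name `keyword_hit_report` and the sample chunks and scores are hypothetical, made up for illustration; they are not DB-GPT or Chroma APIs, and the real retrieval results would have to be passed in as `(text, score)` pairs.

```python
from typing import Dict, List, Tuple


def keyword_hit_report(query_term: str, retrieved: List[Tuple[str, float]]) -> Dict:
    """Summarize how many retrieved chunks literally contain the query term."""
    hits = [(text, score) for text, score in retrieved if query_term in text]
    misses = [(text, score) for text, score in retrieved if query_term not in text]
    return {
        "total": len(retrieved),
        "hits": len(hits),
        "hit_rate": len(hits) / len(retrieved) if retrieved else 0.0,
        "bad_cases": misses,  # chunks that scored well but lack the term
    }


# Made-up (text, score) pairs standing in for a real top-k result list.
retrieved = [
    ("年度财务报表摘要", 0.91),
    ("水资源管理条例第三章", 0.88),
    ("员工考勤制度", 0.87),
]

report = keyword_hit_report("水资源", retrieved)
print(report["hits"], report["total"])
for text, score in report["bad_cases"]:
    print(f"{score:.2f}  {text}")
```

The `bad_cases` list is the kind of evidence that can be pasted into the issue alongside the query.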

What you expected to happen

Normally, recall should draw from the chunks that actually contain the word.
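
One way to approximate this expectation is to restrict the candidates to chunks that contain the term before ranking by vector similarity. The sketch below is plain Python under stated assumptions: `keyword_first_search` is a hypothetical helper, and the two-dimensional vectors are toy values rather than bge-large-zh-v1.5 embeddings. It is not how DB-GPT retrieves today. Chroma's `collection.query` also accepts a `where_document={"$contains": ...}` filter, which may achieve a similar effect inside the store.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def keyword_first_search(
    query_term: str,
    query_vec: Sequence[float],
    chunks: List[Tuple[str, Sequence[float]]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank only chunks containing the term; fall back to all chunks if none do."""
    candidates = [c for c in chunks if query_term in c[0]] or list(chunks)
    scored = [(text, cosine(query_vec, vec)) for text, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy data: the unrelated chunk is deliberately closest to the query vector.
chunks = [
    ("水资源保护规划", [0.2, 0.9]),
    ("财务报表", [0.9, 0.1]),
    ("水资源利用现状", [0.6, 0.6]),
]
query_vec = [1.0, 0.0]

results = keyword_first_search("水资源", query_vec, chunks)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

With the toy data, a pure vector search would rank the unrelated chunk first, while the keyword-first variant returns only chunks containing "水资源". The trade-off is that exact-match filtering misses synonyms and paraphrases, so it suits short keyword queries better than free-form questions.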

How to reproduce

No known way to reproduce.

Additional context

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Aries-ckt · Answer 1 · Wed Jun 19 2024 22:10:54 GMT+0800 (China Standard Time)

What document types are you using, and could you show us some bad cases?

github-actions · Answer 2 · Sat Jul 20 2024 05:04:44 GMT+0800 (China Standard Time)

This issue has been marked as stale, because it has been over 30 days without any activity.

github-actions · Answer 3 · Sat Jul 27 2024 05:04:53 GMT+0800 (China Standard Time)

This issue has been closed, because it has been marked as stale and there has been no activity for over 7 days.