Using the Chroma vector store and the bge-large-zh-v1.5 model, queries for certain words recall completely irrelevant slices (chunks).
tanghaichen opened this issue · comments
Search before asking
- I have searched the issues and found no similar issues.
Operating system information
Linux
Python version information
3.10
DB-GPT version
main
Related scenes
- Chat Data
- Chat Excel
- Chat DB
- Chat Knowledge
- Model Management
- Dashboard
- Plugins
Installation Information
- AutoDL Image
- Other
Device information
GPU 96G
Models information
bge-large-zh-v1.5
What happened
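I am using the bge-large-zh-v1.5 model with the Chroma vector store. When searching for certain terms, the recalled chunks (slices) have high scores but are completely unrelated to the term. This happens for only one particular term; recall for the vast majority of other terms is fairly accurate.
The documents currently include PDF, CSV, and Word files, with about 6,000 chunks in total.
Example:
Term: “水资源” ("water resources")
There are 20 documents, 900 chunks, in which the term 水资源 appears verbatim. None of the other documents contain these three characters.
However, when I ask about 水资源, all of the recalled chunks are unrelated to it.
So far I have not found any other term with this problem.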
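One way to quantify the symptom is to measure how many of the recalled chunks literally contain the query term. The following is a minimal sketch, not DB-GPT code: `keyword_hit_rate` is a hypothetical helper, and the `(text, score)` tuples and sample strings are made-up stand-ins for whatever the retriever actually returns.

```python
from typing import List, Tuple


def keyword_hit_rate(chunks: List[Tuple[str, float]], term: str) -> float:
    """Return the fraction of recalled chunks whose text contains `term`."""
    if not chunks:
        return 0.0
    hits = sum(1 for text, _score in chunks if term in text)
    return hits / len(chunks)


# Made-up sample data for illustration only (text, similarity score).
recalled = [
    ("年度财务报表摘要", 0.91),        # unrelated chunk
    ("水资源管理办法第一章", 0.88),    # contains the query term
    ("员工考勤制度", 0.87),            # unrelated chunk
    ("设备采购流程", 0.85),            # unrelated chunk
]
rate = keyword_hit_rate(recalled, "水资源")
```

A hit rate near zero for 水资源, compared with a high rate for other terms, would confirm that the problem is specific to this query rather than to the index as a whole.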
What you expected to happen
Normally, recall should draw from the chunks in which the term literally appears.
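The expected behavior could be approximated by a keyword-first re-ranking step applied to the vector search results. This is only a sketch under assumptions: `keyword_first` is a hypothetical helper, the candidates are assumed to be `(text, score)` pairs where a higher score means more similar, and the sample data is invented.

```python
from typing import List, Tuple


def keyword_first(
    candidates: List[Tuple[str, float]], term: str, top_k: int
) -> List[Tuple[str, float]]:
    """Rank chunks that contain `term` ahead of the rest.

    Within each group, chunks are ordered by score, highest first.
    Assumes a higher score means a closer match.
    """
    ranked = sorted(candidates, key=lambda c: (term not in c[0], -c[1]))
    return ranked[:top_k]


# Invented candidates for illustration only.
candidates = [
    ("设备采购流程", 0.90),
    ("水资源利用现状", 0.50),
    ("区域水资源规划", 0.70),
]
top = keyword_first(candidates, "水资源", top_k=2)
```

Chroma's `collection.query` also accepts a `where_document={"$contains": ...}` filter that restricts results to documents containing a substring; whether DB-GPT exposes that option is not something this issue establishes.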
How to reproduce
No known way to reproduce.
Additional context
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
What document types are you using, and could you show us some bad cases?
This issue has been marked as stale, because it has been over 30 days without any activity.
This issue has been closed, because it has been marked as stale and there has been no activity for over 7 days.