infinilabs / analysis-pinyin

🛵 This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.

[7.10.1] Error when creating a document: startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=6,lastStartOffset=1 for field 'name'

bilinxing opened this issue · comments

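Highlight tags were landing in the wrong places, so I followed a suggestion found in the issues and set "ignore_pinyin_offset": false. With that setting most documents index fine and are highlighted correctly, but some documents, such as the one below, fail when they are created.
Create the index: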
PUT /pinyin_test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin": {
          "type": "pinyin",
          "ignore_pinyin_offset": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "pinyin"
      }
    }
  }
}

Create the document:
POST pinyin_test/_create/3
{
  "name": "cube5 2"
}

Error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=6,lastStartOffset=1 for field 'name'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=0,endOffset=6,lastStartOffset=1 for field 'name'"
  },
  "status" : 400
}
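
To make the message easier to read, here is a minimal Python sketch of the rule it describes: each token's start offset must be non-negative, its end offset must not be smaller than its start offset, and start offsets must never decrease from one token to the next. The function name `check_offsets` and the sample token stream are assumptions made up for illustration; the actual tokens the pinyin analyzer emits for "cube5 2" are not shown in this issue.

```python
def check_offsets(tokens):
    """Validate (start_offset, end_offset) pairs in the order they are emitted.

    Mirrors the three conditions named in the error message; this is an
    illustration, not the indexing code itself.
    """
    last_start = 0
    for start, end in tokens:
        if start < 0 or end < start or start < last_start:
            raise ValueError(
                "offsets must not go backwards "
                f"startOffset={start},endOffset={end},lastStartOffset={last_start}"
            )
        last_start = start


# Hypothetical stream: a token starting at offset 1 is followed by a token
# covering offsets 0..6, so the start offset moves backwards from 1 to 0.
bad_stream = [(1, 2), (0, 6)]

try:
    check_offsets(bad_stream)
except ValueError as exc:
    print(exc)
```

Note that overlapping tokens pass this check as long as their start offsets do not decrease, for example `[(0, 1), (0, 6), (2, 3)]`. The failure is about emission order, not overlap as such.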

“ignore_pinyin_offset after 6.0, offset is strictly constrained, overlapped tokens are not allowed, with this parameter, overlapped token will allowed by ignore offset, please note, all position related query or highlight will become incorrect, you should use multi fields and specify different settings for different query purpose. if you need offset, please set it to false. default: true.”

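Is this because, once ignore_pinyin_offset is turned off, some of the tokens overlap? That leaves an awkward trade-off: with it enabled, highlighting does not work; with it disabled, some documents cannot be created. Is there a way to have both?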

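After some digging, it turns out the ES error is caused by inaccurate token offsets when Chinese text is mixed with English letters and digits. Knowing the cause, there is a way around it: split the Chinese text with the ik tokenizer first, then convert the resulting tokens to pinyin with the pinyin filter. This lets document creation and highlighting work together, and it also avoids the overly fuzzy matching that happens when every pinyin token is a single character (你好 -> ni, hao matches anything containing the single characters ni or hao; after ik segmentation the pinyin is joined, 你好 -> nihao). The index looks like this: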
PUT /pinyin_test102
{
  "settings": {
    "analysis": {
      "analyzer": {
        "ik_pinyin": {
          "type": "custom",
          "tokenizer": "ik_smart",
          "filter": [
            "pinyin",
            "word_delimiter"
          ]
        }
      },
      "filter": {
        "pinyin": {
          "type": "pinyin",
          "ignore_pinyin_offset": false,
          "keep_none_chinese": false,
          "keep_none_chinese_together": false,
          "keep_full_pinyin": false,
          "keep_joined_full_pinyin": true
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ik_pinyin"
      }
    }
  }
}
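
The over-matching described above can be illustrated with a toy Python sketch. Everything in it is an assumption for demonstration: the four-entry `PINYIN` table stands in for a real dictionary, each document is treated as a single word that has already been segmented, and a document counts as a match when it shares at least one token with the query (any-term matching).

```python
# Toy pinyin table (hypothetical); a real analyzer uses a full dictionary.
PINYIN = {"你": "ni", "好": "hao", "泥": "ni", "号": "hao"}


def per_char_tokens(word):
    """One pinyin token per character, like keep_full_pinyin."""
    return {PINYIN[ch] for ch in word}


def joined_token(word):
    """A single joined pinyin token per word, like keep_joined_full_pinyin."""
    return {"".join(PINYIN[ch] for ch in word)}


def search(query, docs, tokenize):
    """Return the docs that share at least one token with the query."""
    query_tokens = tokenize(query)
    return [doc for doc in docs if query_tokens & tokenize(doc)]


docs = ["你好", "泥", "号"]

# Per-character tokens: "ni" and "hao" each match an unrelated document.
print(search("你好", docs, per_char_tokens))

# Joined tokens: only the document that produces "nihao" matches.
print(search("你好", docs, joined_token))
```

With per-character tokens all three toy documents match the query 你好, while with joined tokens only 你好 itself matches, which is the effect the ik_smart plus keep_joined_full_pinyin combination is aiming for.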