infinilabs / analysis-pinyin

🛵 This Pinyin Analysis plugin is used to do conversion between Chinese characters and Pinyin.

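Pinyin first-letter query problem: no results when the first letter of the second character's pinyin also appears in the final (vowel part) of the first character's pinyin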

Jiangtao976 opened this issue

{
  "settings": {
    "number_of_shards": 3,
    "number_of_replicas": 1,
    "default_pipeline": "biz_timestamp_pipeline",
    "analysis": {
      "analyzer": {
        "pinyin_analyzer": {
          "tokenizer": "my_pinyin"
        }
      },
      "tokenizer": {
        "my_pinyin": {
          "type": "pinyin",
          "keep_separate_first_letter": true,
          "keep_full_pinyin": true,
          "keep_joined_full_pinyin": false,
          "keep_original": true,
          "limit_first_letter_length": 16,
          "lowercase": true,
          "remove_duplicated_term": true,
          "ignore_pinyin_offset": false
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "vendorName": {
        "type": "text",
        "analyzer": "pinyin_analyzer",
        "search_analyzer": "pinyin_analyzer",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}

Example 1:
Chinese text: 刘德华阿里巴巴
Analysis result:
{
  "tokens": [
    {"token": "l", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "liu", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "刘德华阿里巴巴", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0},
    {"token": "ldhalbb", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0},
    {"token": "d", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "de", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "h", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "hua", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "a", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3},
    {"token": "li", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4},
    {"token": "b", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5},
    {"token": "ba", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5}
  ]
}

Query:
{
  "query": {
    "match_phrase": {
      "vendorName": {
        "query": "ldha"
      }
    }
  }
}

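As you can see, the analysis result includes the first letters ldha, but the query returns no results. The first letter of "阿" is a, and it seems the match fails because of the a in "华" (hua).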

Example 2:
Chinese text: 深圳健安医药有限公司
Analysis result:
{
  "tokens": [
    {"token": "s", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "shen", "start_offset": 0, "end_offset": 1, "type": "word", "position": 0},
    {"token": "深圳健安医药有限公司", "start_offset": 0, "end_offset": 10, "type": "word", "position": 0},
    {"token": "szjayyyxgs", "start_offset": 0, "end_offset": 10, "type": "word", "position": 0},
    {"token": "z", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "zhen", "start_offset": 1, "end_offset": 2, "type": "word", "position": 1},
    {"token": "j", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "jian", "start_offset": 2, "end_offset": 3, "type": "word", "position": 2},
    {"token": "a", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3},
    {"token": "an", "start_offset": 3, "end_offset": 4, "type": "word", "position": 3},
    {"token": "y", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4},
    {"token": "yi", "start_offset": 4, "end_offset": 5, "type": "word", "position": 4},
    {"token": "yao", "start_offset": 5, "end_offset": 6, "type": "word", "position": 5},
    {"token": "you", "start_offset": 6, "end_offset": 7, "type": "word", "position": 6},
    {"token": "x", "start_offset": 7, "end_offset": 8, "type": "word", "position": 7},
    {"token": "xian", "start_offset": 7, "end_offset": 8, "type": "word", "position": 7},
    {"token": "g", "start_offset": 8, "end_offset": 9, "type": "word", "position": 8},
    {"token": "gong", "start_offset": 8, "end_offset": 9, "type": "word", "position": 8},
    {"token": "si", "start_offset": 9, "end_offset": 10, "type": "word", "position": 9}
  ]
}

Query:
{
  "query": {
    "match_phrase": {
      "vendorName": {
        "query": "szja"
      }
    }
  }
}

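As you can see, the analysis result includes the first letters szja, but the query returns no results. The first letter of "安" is a, and it seems the match fails because of the a in "健" (jian).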

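Other Chinese text behaves the same way. For example, 深圳恩 cannot be found with sze either: the first letter e of 恩 seems to be affected by the e in 深 (shen).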

I have tried adjusting many parameters and still cannot solve this. Can anyone help?

Query:
{
  "query": {
    "match_phrase": {
      "vendorName": {
        "query": "ldha"
      }
    }
  }
}

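As you can see, the analysis result includes the first letters ldha, but the query returns no results. The first letter of "阿" is a, and it seems the match fails because of the a in "华" (hua).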

The analysis result does not contain ldha as a single term, so the phrase cannot match. If you search for liudehua instead, it will match.
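
To make this reply concrete, here is a minimal Python sketch that only inspects the terms quoted in this issue. It does not call Elasticsearch and makes no assumption about how the search analyzer splits the query string. The helper name `describe` and the two list names are illustrative; the terms themselves are copied from the two analysis results in the question.

```python
# Terms copied from the two analysis results quoted in this issue.
EXAMPLE_1_TERMS = [
    "l", "liu", "刘德华阿里巴巴", "ldhalbb", "d", "de",
    "h", "hua", "a", "li", "b", "ba",
]
EXAMPLE_2_TERMS = [
    "s", "shen", "深圳健安医药有限公司", "szjayyyxgs", "z", "zhen",
    "j", "jian", "a", "an", "y", "yi", "yao", "you",
    "x", "xian", "g", "gong", "si",
]


def describe(query, terms):
    """Report whether `query` is itself a term, and which terms only start with it."""
    return {
        "exact_term": query in terms,
        "terms_with_prefix": [
            t for t in terms if t != query and t.startswith(query)
        ],
    }


print(describe("ldha", EXAMPLE_1_TERMS))
print(describe("szja", EXAMPLE_2_TERMS))
```

In both cases the query string is not one of the listed terms; it is only the beginning of the joined first-letter term (`ldhalbb`, `szjayyyxgs`). Running the same query text through the `_analyze` API with `pinyin_analyzer` would show how the search side actually splits it; that output is not included in this thread.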