Elasticsearch 2.2.0 Analysis Series: Chinese Word Segmentation

Date: 2020-06-19  Category: Programmer Life

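Elasticsearch ships with many built-in analyzers, but the default ones do not handle Chinese well, so a separate plugin is needed. Two common choices are smartcn, which is based on ICTCLAS from the Chinese Academy of Sciences, and IKAnalyzer; both give decent results. At the time of writing, IKAnalyzer does not yet support the latest Elasticsearch 2.2.0. smartcn, by contrast, is an officially supported Chinese analyzer that handles Chinese or mixed Chinese-English text, and it works with 2.2.0. It does not support custom dictionaries, however, so treat it as something to test with first. A later section covers how to make IKAnalyzer work with the latest version.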

smartcn

Install the analyzer: plugin install analysis-smartcn

Uninstall: plugin remove analysis-smartcn

Test:

Request: POST :9200/_analyze/

{
"analyzer": "smartcn",
"text": "联想是全球最大的笔记本厂商"
}
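The same request can be sent from a script. The following is a minimal sketch using only the Python standard library. The helper names `build_analyze_request` and `analyze` are made up for illustration, and the base URL is an assumption, since the request line above does not name a host; substitute the address of your own node.

```python
import json
import urllib.request


def build_analyze_request(base_url, analyzer, text):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    # Same JSON body as in the article: an analyzer name plus the text to split.
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(base_url, analyzer, text):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(base_url, analyzer, text)
    with urllib.request.urlopen(request) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return [entry["token"] for entry in payload["tokens"]]
```

Assuming a node is reachable at `http://localhost:9200` (a hypothetical address), `analyze("http://localhost:9200", "smartcn", "联想是全球最大的笔记本厂商")` would return the token strings. Passing `"standard"` as the analyzer name runs the same text through the standard analyzer.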

Response:

{
"tokens": [
{
"token": "联想",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0
},
{
"token": "是",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1
},
{
"token": "全球",
"start_offset": 3,
"end_offset": 5,
"type": "word",
"position": 2
},
{
"token": "最",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 3
},
{
"token": "大",
"start_offset": 6,
"end_offset": 7,
"type": "word",
"position": 4
},
{
"token": "的",
"start_offset": 7,
"end_offset": 8,
"type": "word",
"position": 5
},
{
"token": "笔记本",
"start_offset": 8,
"end_offset": 11,
"type": "word",
"position": 6
},
{
"token": "厂商",
"start_offset": 11,
"end_offset": 13,
"type": "word",
"position": 7
}
]
}
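The `start_offset` and `end_offset` fields mark where each token sits in the input text. As a small illustrative check, the sketch below copies the triples from the response above and verifies that slicing the original string with those offsets gives back each token. The function name `offsets_match` is made up for this example. Python indexes strings by code point, which lines up with the reported offsets here because every character in the sample is in the Basic Multilingual Plane; that equivalence is an assumption that may not hold for other text.

```python
text = "联想是全球最大的笔记本厂商"

# (token, start_offset, end_offset) triples copied from the smartcn response.
tokens = [
    ("联想", 0, 2),
    ("是", 2, 3),
    ("全球", 3, 5),
    ("最", 5, 6),
    ("大", 6, 7),
    ("的", 7, 8),
    ("笔记本", 8, 11),
    ("厂商", 11, 13),
]


def offsets_match(text, tokens):
    """Return True if text[start:end] equals the token for every triple."""
    return all(text[start:end] == token for token, start, end in tokens)


print(offsets_match(text, tokens))
```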

For comparison, let's look at what the standard analyzer produces. In the request, replace smartcn with standard.

The response then looks like this:

{
"tokens": [
{
"token": "联",
"start_offset": 0,
"end_offset": 1,
"type": "<IDEOGRAPHIC>",
"position": 0
},
{
"token": "想",
"start_offset": 1,
"end_offset": 2,
"type": "<IDEOGRAPHIC>",
"position": 1
},
{
"token": "是",
"start_offset": 2,
"end_offset": 3,
"type": "<IDEOGRAPHIC>",
"position": 2
},
{
"token": "全",
"start_offset": 3,
"end_offset": 4,
"type": "<IDEOGRAPHIC>",
"position": 3
},
{
"token": "球",
"start_offset": 4,
"end_offset": 5,
"type": "<IDEOGRAPHIC>",
"position": 4
},
{
"token": "最",
"start_offset": 5,
"end_offset": 6,
"type": "<IDEOGRAPHIC>",
"position": 5
},
{
"token": "大",
"start_offset": 6,
"end_offset": 7,
"type": "<IDEOGRAPHIC>",
"position": 6
},
{
"token": "的",
"start_offset": 7,
"end_offset": 8,
"type": "<IDEOGRAPHIC>",
"position": 7
},
{
"token": "笔",
"start_offset": 8,
"end_offset": 9,
"type": "<IDEOGRAPHIC>",
"position": 8
},
{
"token": "记",
"start_offset": 9,
"end_offset": 10,
"type": "<IDEOGRAPHIC>",
"position": 9
},
{
"token": "本",
"start_offset": 10,
"end_offset": 11,
"type": "<IDEOGRAPHIC>",
"position": 10
},
{
"token": "厂",
"start_offset": 11,
"end_offset": 12,
"type": "<IDEOGRAPHIC>",
"position": 11
},
{
"token": "商",
"start_offset": 12,
"end_offset": 13,
"type": "<IDEOGRAPHIC>",
"position": 12
}
]
}

As you can see, the standard analyzer is basically unusable for Chinese: every single character becomes its own token.
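To put a number on the difference, here is a small sketch that compares the two token lists shown in the responses above. The helper `single_char_ratio` is a made-up name for this illustration, and the ratio is only a rough indicator, not a standard quality metric.

```python
# Token strings taken from the two responses in this article.
smartcn_tokens = ["联想", "是", "全球", "最", "大", "的", "笔记本", "厂商"]
standard_tokens = list("联想是全球最大的笔记本厂商")  # one token per character


def single_char_ratio(tokens):
    """Fraction of tokens that consist of exactly one character."""
    return sum(1 for token in tokens if len(token) == 1) / len(tokens)


print(single_char_ratio(smartcn_tokens))   # 0.5
print(single_char_ratio(standard_tokens))  # 1.0

# The word "笔记本" (laptop) exists as a token only in the smartcn output.
print("笔记本" in smartcn_tokens, "笔记本" in standard_tokens)  # True False
```

Half of the smartcn tokens are single characters, and those are genuine one-character words (是, 最, 大, 的). With the standard analyzer every token is a single character, so multi-character words such as 笔记本 never appear as tokens.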

This article is an original work by 赛克蓝德 (secisland). Please credit the author and the source when reposting.

IKAnalyzer support for version 2.2.0

When reposting, please cite the source: https://www.heiqu.com/c6c6b11859582739dc13adee467a56cf.html
