Elasticsearch in Practice (Part 4): IK Word Segmentation

Date: 2021-06-12  Category: Programmer Life

Environment: Elasticsearch 6.2.4 + Kibana 6.2.4 + ik 6.2.4

Elasticsearch can tokenize Chinese text out of the box.

Let's first look at what the built-in Chinese segmentation produces:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d '{"analyzer": "default","text": "今天天气真好"}'

The same request in the Kibana console:

GET /_analyze
{
  "analyzer": "default",
  "text": "今天天气真好"
}

Result:

{
  "tokens": [
    { "token": "今", "start_offset": 0, "end_offset": 1, "type": "<IDEOGRAPHIC>", "position": 0 },
    { "token": "天", "start_offset": 1, "end_offset": 2, "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "天", "start_offset": 2, "end_offset": 3, "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "气", "start_offset": 3, "end_offset": 4, "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "真", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "好", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 5 }
  ]
}

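As we can see, the text is split into single characters. In a real application this will not give the results we want. For log search, though, the built-in analyzer is sufficient.

analyzer=default actually invokes the standard analyzer.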
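One way to check the claim above is to send the same text through both analyzer names and compare the tokens. The following is a minimal sketch using only the Python standard library; the helper names `build_analyze_request` and `analyze` are made up for illustration, and the node address is assumed to be the same `localhost:9200` used in the curl example.

```python
import json
import urllib.request

ES_URL = "http://localhost:9200"  # assumption: local node, as in the curl example


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (without sending) a request for the _analyze API."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",  # _analyze accepts both GET and POST
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [token["token"] for token in payload["tokens"]]
```

With a node running, `analyze("default", "今天天气真好")` and `analyze("standard", "今天天气真好")` should return the same list if the statement above holds for your setup.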

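Next, we install the IK analysis plugin and use it for segmentation.

Installing IK

IK project page: https://github.com/medcl/elasticsearch-analysis-ik

First, note that the IK plugin version must match the Elasticsearch version exactly; otherwise the two are incompatible.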
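The version rule can be checked before installing anything. The sketch below assumes the release file follows the `elasticsearch-analysis-ik-X.Y.Z.zip` naming seen in this post; `plugin_version` and `is_compatible` are illustrative helper names, not part of IK or Elasticsearch.

```python
import re


def plugin_version(zip_name):
    """Extract 'X.Y.Z' from a name such as elasticsearch-analysis-ik-6.2.4.zip."""
    match = re.search(r"elasticsearch-analysis-ik-(\d+\.\d+\.\d+)\.zip$", zip_name)
    if match is None:
        raise ValueError(f"unrecognized IK release file name: {zip_name!r}")
    return match.group(1)


def is_compatible(es_version, zip_name):
    """True only when the plugin version equals the Elasticsearch version exactly."""
    return plugin_version(zip_name) == es_version


print(is_compatible("6.2.4", "elasticsearch-analysis-ik-6.2.4.zip"))
print(is_compatible("6.2.4", "elasticsearch-analysis-ik-6.3.0.zip"))
```

The Elasticsearch version string itself can be read from the `version.number` field returned by `GET /` on the node.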

Installation method 1:
Download the archive from https://github.com/medcl/elasticsearch-analysis-ik/releases, create an analysis-ik subdirectory under the ES plugins directory, and copy the contents of the archive into it. The plugins/analysis-ik/ directory should end up containing:

plugins/analysis-ik/
    commons-codec-1.9.jar
    commons-logging-1.2.jar
    elasticsearch-analysis-ik-6.2.4.jar
    httpclient-4.5.2.jar
    httpcore-4.4.4.jar
    plugin-descriptor.properties

Then restart Elasticsearch.

Installation method 2:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v6.2.4/elasticsearch-analysis-ik-6.2.4.zip

If you have already downloaded the archive, install it from the local file instead:

/usr/local/elk/elasticsearch-6.2.4/bin/elasticsearch-plugin install file:///tmp/elasticsearch-analysis-ik-6.2.4.zip

Then restart Elasticsearch.

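Tokenizing with IK

IK supports two segmentation modes:

ik_max_word: splits the text at the finest granularity, exhausting every possible combination

ik_smart: splits the text at the coarsest granularity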
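To make the difference between the two granularities concrete, here is a toy sketch. It is not IK's actual algorithm: it uses a hypothetical five-word dictionary, lists every dictionary word found in the text as a stand-in for the finest-grained mode, and uses greedy longest-match as a stand-in for the coarsest-grained mode.

```python
# Hypothetical mini-dictionary; IK's real dictionaries are far larger.
TOY_DICT = {"今天天气", "今天", "天天", "天气", "真好"}


def all_matches(text, words=TOY_DICT):
    """Finest granularity: every dictionary word found at every position."""
    spans = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):  # try longer words first
            if text[start:end] in words:
                spans.append((text[start:end], start, end))
    return spans


def longest_matches(text, words=TOY_DICT):
    """Coarsest granularity (greedy stand-in): longest word at each position, no overlaps."""
    spans, start = [], 0
    while start < len(text):
        for end in range(len(text), start, -1):
            if text[start:end] in words:
                spans.append((text[start:end], start, end))
                start = end
                break
        else:
            # No dictionary word starts here: emit the single character.
            spans.append((text[start], start, start + 1))
            start += 1
    return spans


print(all_matches("今天天气真好"))
print(longest_matches("今天天气真好"))
```

The first call yields overlapping spans (今天天气, 今天, 天天, 天气, 真好), while the second yields only the non-overlapping 今天天气 and 真好. Overlapping tokens help recall at index time, at the cost of a larger index.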

Next, let's test how IK's segmentation differs from the built-in analyzer:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "今天天气真好"}'

Result:

{
  "tokens": [
    { "token": "今天天气", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
    { "token": "真好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 1 }
  ]
}

Now try ik_max_word (the same request with "analyzer": "ik_max_word"):

{
  "tokens": [
    { "token": "今天天气", "start_offset": 0, "end_offset": 4, "type": "CN_WORD", "position": 0 },
    { "token": "今天", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1 },
    { "token": "天天", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 2 },
    { "token": "天气", "start_offset": 2, "end_offset": 4, "type": "CN_WORD", "position": 3 },
    { "token": "真好", "start_offset": 4, "end_offset": 6, "type": "CN_WORD", "position": 4 }
  ]
}

Setting the default analyzer in the mapping

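Example:

{
  "properties": {
    "content": {
      "type": "text",
      "analyzer": "ik_max_word",
      "search_analyzer": "ik_max_word"
    }
  }
}

Note: search_analyzer is set to the same value as analyzer here so that the same analyzer is used at search time and at index time, which ensures that the terms in the query have the same form as the terms in the inverted index. If search_analyzer is not set, it defaults to the value of analyzer. For details, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-analyzer.html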
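The properties fragment above has to be wrapped in a full request body when an index is created. The sketch below builds such a body for Elasticsearch 6.x, where mappings are nested under a type name. The type name `article` and the helper `build_index_body` are hypothetical, chosen only for illustration.

```python
import json


def build_index_body(field, analyzer, search_analyzer=None, doc_type="article"):
    """Build the body for creating an index on Elasticsearch 6.x.

    When search_analyzer is None the key is omitted, so Elasticsearch
    falls back to using `analyzer` at search time as well.
    """
    field_mapping = {"type": "text", "analyzer": analyzer}
    if search_analyzer is not None:
        field_mapping["search_analyzer"] = search_analyzer
    return {"mappings": {doc_type: {"properties": {field: field_mapping}}}}


body = build_index_body("content", "ik_max_word", "ik_max_word")
print(json.dumps(body, indent=2))
```

The printed JSON can be sent with a `PUT` request to the index URL (for example a hypothetical `http://localhost:9200/blog`) using the `Content-Type: application/json` header.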

Custom dictionary

We can also define our own dictionary for IK to use. For example:

curl -XGET "http://localhost:9200/_analyze" -H 'Content-Type: application/json' -d'{"analyzer": "ik_smart","text": "去朝阳公园"}'

Result:

{
  "tokens": [
    { "token": "去", "start_offset": 0, "end_offset": 1, "type": "CN_CHAR", "position": 0 },
    { "token": "朝阳", "start_offset": 1, "end_offset": 3, "type": "CN_WORD", "position": 1 },
    { "token": "公园", "start_offset": 3, "end_offset": 5, "type": "CN_WORD", "position": 2 }
  ]
}

We want 朝阳公园 (Chaoyang Park) to be treated as a single token. To get that, add the word to a custom dictionary.

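Creating your own dictionary takes only a few simple steps:
1. Add a my.dic file in the elasticsearch-6.2.4/config/analysis-ik/ directory:

$ touch my.dic
$ echo 朝阳公园 > my.dic
$ cat my.dic
朝阳公园

A .dic file is a dictionary file, which is just a plain text file with one word per line. Note that it must be UTF-8 encoded. Let's look at the bundled dictionary file:

$ head -n 5 main.dic
一一列举
一一对应
一一道来
一丁
一丁不识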
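When a dictionary has more than a handful of entries, generating the file from a script avoids encoding and line-break mistakes. The following is a minimal sketch of the format described above (one word per line, UTF-8). `write_dictionary` is an illustrative helper, and the demo writes into a temporary directory rather than the real config/analysis-ik/ directory.

```python
import tempfile
from pathlib import Path


def write_dictionary(path, words):
    """Write one word per line, UTF-8 encoded, skipping blanks and duplicates."""
    unique = []
    for word in words:
        word = word.strip()
        if word and word not in unique:
            unique.append(word)
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


# Demo in a throwaway directory; point `dic` at your own my.dic when ready.
with tempfile.TemporaryDirectory() as tmp:
    dic = Path(tmp) / "my.dic"
    write_dictionary(dic, ["朝阳公园", "朝阳公园", "  "])
    print(dic.read_text(encoding="utf-8"))
```

Deduplicating and stripping whitespace is a design choice of this sketch, not something the dictionary format requires.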

Please credit the source when reposting: https://www.heiqu.com/wppfwp.html
