Using the IK analyzer

- Testing the IK analyzer

IK provides two analysis algorithms: ik_smart and ik_max_word.
ik_smart produces the coarsest segmentation (the fewest tokens), while ik_max_word produces the finest-grained segmentation.
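The difference between the two modes can be illustrated with a toy dictionary-based segmenter. This is a simplified sketch, not IK's actual algorithm; the tiny DICT and both functions are invented for illustration:

```python
# Toy dictionary; real IK ships a large built-in Chinese dictionary.
DICT = {"黑马", "程序", "程序员"}

def smart_cut(text):
    """Greedy longest-match from left to right (ik_smart-like behavior)."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):  # try the longest candidate first
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                i = j
                break
        else:  # no dictionary hit: fall back to a single character
            tokens.append(text[i])
            i += 1
    return tokens

def max_word_cut(text):
    """Emit every dictionary word found anywhere in the text (ik_max_word-like)."""
    tokens = []
    covered = [False] * len(text)
    for i in range(len(text)):
        for j in range(i + 1, len(text) + 1):
            if text[i:j] in DICT:
                tokens.append(text[i:j])
                for k in range(i, j):
                    covered[k] = True
    # Characters not covered by any dictionary word become single-char tokens.
    for i, ch in enumerate(text):
        if not covered[i]:
            tokens.append(ch)
    return tokens

print(smart_cut("黑马程序员"))     # ['黑马', '程序员']
print(max_word_cut("黑马程序员"))  # ['黑马', '程序', '程序员']
```

Real IK differs in the details (for example, as shown below it also emits the trailing single character 员 and orders overlapping tokens longest-first); the sketch only conveys the greedy-vs-exhaustive idea.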

(1) Coarsest segmentation (ik_smart)
Test with Postman: send a POST request to http://127.0.0.1:9200/testindex/_analyze with the body:

{
    "analyzer": "ik_smart",
    "text": "黑马程序员"
}

The output is:

{
"tokens": [
    {
        "token": "黑马",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
    },
    {
        "token": "程序员",
        "start_offset": 2,
        "end_offset": 5,
        "type": "CN_WORD",
        "position": 1
    }
]
}
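In application code, a response like this is typically consumed by pulling out the token texts. A standard-library-only sketch, using the response above pasted as a raw string:

```python
import json

# The ik_smart response shown above, as a raw JSON string.
raw = '''
{
  "tokens": [
    {"token": "黑马", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 0},
    {"token": "程序员", "start_offset": 2, "end_offset": 5, "type": "CN_WORD", "position": 1}
  ]
}
'''

# Extract just the token texts from the "tokens" array.
tokens = [t["token"] for t in json.loads(raw)["tokens"]]
print(tokens)  # ['黑马', '程序员']
```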

(2) Finest-grained segmentation (ik_max_word)
Test with Postman: send a POST request to http://127.0.0.1:9200/testindex/_analyze with the body:

{
    "analyzer": "ik_max_word",
    "text": "黑马程序员"
}

The output is:

{
"tokens": [
    {
        "token": "黑马",
        "start_offset": 0,
        "end_offset": 2,
        "type": "CN_WORD",
        "position": 0
    },
    {
        "token": "程序员",
        "start_offset": 2,
        "end_offset": 5,
        "type": "CN_WORD",
        "position": 1
    },
    {
        "token": "程序",
        "start_offset": 2,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 2
    },
    {
        "token": "员",
        "start_offset": 4,
        "end_offset": 5,
        "type": "CN_CHAR",
        "position": 3
    }
]
}
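Note that start_offset and end_offset are character offsets into the original text, so every token can be recovered by slicing. A quick sanity check against the ik_max_word response above:

```python
text = "黑马程序员"

# (token, start_offset, end_offset) triples taken from the response above.
tokens = [("黑马", 0, 2), ("程序员", 2, 5), ("程序", 2, 4), ("员", 4, 5)]

for token, start, end in tokens:
    # Each slice of the original text matches the emitted token exactly.
    assert text[start:end] == token
```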

Custom dictionaries

Analyzing the text "传智播客" (Chuanzhi Boke, a company name) the same way produces:

{
"tokens": [
    {
        "token": "传",
        "start_offset": 0,
        "end_offset": 1,
        "type": "CN_CHAR",
        "position": 0
    },
    {
        "token": "智",
        "start_offset": 1,
        "end_offset": 2,
        "type": "CN_CHAR",
        "position": 1
    },
    {
        "token": "播",
        "start_offset": 2,
        "end_offset": 3,
        "type": "CN_CHAR",
        "position": 2
    },
    {
        "token": "客",
        "start_offset": 3,
        "end_offset": 4,
        "type": "CN_CHAR",
        "position": 3
    }
]
}

The default dictionary does not recognize "传智播客" as a single word. To make the analyzer treat it as one word, we need to set up a custom dictionary.
Steps:
(1) Go to the elasticsearch/plugins/ik/config directory.
(2) Create a file named my.dic containing the word 传智播客 (the file must be UTF-8 encoded, otherwise the content will be garbled).
(3) Edit the IKAnalyzer.cfg.xml configuration file in the same directory to register the custom dictionary:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionary here -->
    <entry key="ext_dict">my.dic</entry>
    <!-- Configure your own extension stop-word dictionary here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stop-word dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
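The my.dic file itself is just a plain UTF-8 text file with one word per line. Creating it could look like this (the path below is a placeholder for illustration; on a real install it would be elasticsearch/plugins/ik/config):

```shell
# Placeholder config directory; substitute your actual plugins/ik/config path.
IK_CONFIG="${IK_CONFIG:-/tmp/ik-config-demo}"
mkdir -p "$IK_CONFIG"

# One word per line, UTF-8 encoded (a wrong encoding garbles the entries).
printf '传智播客\n' > "$IK_CONFIG/my.dic"

cat "$IK_CONFIG/my.dic"
```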

Restart Elasticsearch and test the segmentation again in the browser; the output is now:

{
"tokens": [
    {
        "token": "传智播客",
        "start_offset": 0,
        "end_offset": 4,
        "type": "CN_WORD",
        "position": 0
    }
]
}
Last modification: December 3rd, 2019 at 05:26 pm
