tonglin0325's personal homepage

Elasticsearch Study Notes: Analyzers and Tokenization

1. Testing Elasticsearch analyzers

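Elasticsearch ships with several analyzers (reference: https://www.jianshu.com/p/d57935ba514b). The outputs below show how each one tokenizes this sample sentence:

Set the shape to semi-transparent by calling set_trans(5)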

(1) standard analyzer: the standard analyzer (this is the default)

set, the, shape, to, semi, transparent, by, calling, set_trans, 5

(2) simple analyzer: splits on any non-letter character and lowercases the result

set, the, shape, to, semi, transparent, by, calling, set, trans

(3) whitespace analyzer: splits on whitespace only. Case, underscores, and other characters are left unchanged

Set, the, shape, to, semi-transparent, by, calling, set_trans(5)

(4) language analyzers: language-specific analyzers, for example the english analyzer

set, shape, semi, transpar, call, set_tran, 5
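To make the difference between the simple and whitespace analyzers concrete, here is a rough Python approximation of the two. It is only a sketch: the function names are my own, the regular expression is an assumption about what "non-letter" means, and the real analyzers are implemented in Lucene, so edge cases may differ.

```python
import re

SAMPLE = "Set the shape to semi-transparent by calling set_trans(5)"


def simple_tokens(text):
    """Approximate the simple analyzer: split on non-letters, then lowercase."""
    # [^\W\d_] matches a word character that is neither a digit nor an underscore,
    # which is a reasonable stand-in for "any Unicode letter".
    return [m.group(0).lower() for m in re.finditer(r"[^\W\d_]+", text)]


def whitespace_tokens(text):
    """Approximate the whitespace analyzer: split on whitespace, change nothing else."""
    return text.split()


print(simple_tokens(SAMPLE))
print(whitespace_tokens(SAMPLE))
```

For the sample sentence, the two functions reproduce the token lists shown in (2) and (3) above.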

 

2. Setting the analyzer for an Elasticsearch index

The following request sets the default analyzer for every type in this index to simple:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {"type": "simple"}
      }
    }
  }
}
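The same request can be prepared from a script. The sketch below uses only the Python standard library and assumes a node at http://localhost:9200, as elsewhere in these notes. The helper names are hypothetical, and the request is built but not sent.

```python
import json
import urllib.request


def default_analyzer_settings(analyzer_type):
    """Build the index-creation body that makes `analyzer_type` the index default."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {"default": {"type": analyzer_type}}
            }
        }
    }


def create_index_request(base_url, index, analyzer_type):
    """Prepare (but do not send) the PUT request that creates the index."""
    body = json.dumps(default_analyzer_settings(analyzer_type)).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


req = create_index_request("http://localhost:9200", "my_index", "simple")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To actually create the index against a running node:
# urllib.request.urlopen(req)
```

Keeping the body in a function makes it easy to switch the default to another analyzer type by changing a single argument.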

 

The standard analyzer (standard)

Testing the analyzer on ES 5:

http://localhost:9200/_analyze?analyzer=standard&pretty=true&text=test测试

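Testing the analyzer on ES 6. ES 6 enforces the Content-Type header and no longer accepts the analyzer and text as query-string parameters, so they go into a JSON request body:

curl -H 'Content-Type: application/json' 'http://localhost:9200/_analyze?pretty=true' -d @data.json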

data.json

{
  "analyzer": "standard",
  "text": "test测试"
}
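Both request styles can also be built in a script, which takes care of encoding the Chinese text correctly. This is a minimal sketch using only the standard library. `BASE_URL` and the two function names are my own assumptions, and nothing is sent until `urlopen` is called.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:9200"  # assumed local node, as in the examples above


def es5_analyze_url(analyzer, text):
    """ES 5 style: analyzer and text travel in the query string."""
    query = urllib.parse.urlencode({"analyzer": analyzer, "pretty": "true", "text": text})
    return f"{BASE_URL}/_analyze?{query}"


def es6_analyze_request(analyzer, text):
    """ES 6 style: analyzer and text travel in a JSON body with a Content-Type header."""
    body = json.dumps({"analyzer": analyzer, "text": text}, ensure_ascii=False)
    return urllib.request.Request(
        url=f"{BASE_URL}/_analyze?pretty=true",
        data=body.encode("utf-8"),
        method="POST",
        headers={"Content-Type": "application/json"},
    )


print(es5_analyze_url("standard", "test测试"))
req = es6_analyze_request("standard", "test测试")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Sending either request needs a running node, for example:
# urllib.request.urlopen(req).read()
```

In the ES 5 URL the non-ASCII characters are percent-encoded as UTF-8, while in the ES 6 body they are sent as plain UTF-8 JSON.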

Both versions return the same tokens:

{
  "tokens" : [
    {
      "token" : "test",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "测",
      "start_offset" : 4,
      "end_offset" : 5,
      "type" : "<IDEOGRAPHIC>",
      "position" : 1
    },
    {
      "token" : "试",
      "start_offset" : 5,
      "end_offset" : 6,
      "type" : "<IDEOGRAPHIC>",
      "position" : 2
    }
  ]
}
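The start_offset and end_offset fields point back into the original text. A small sketch of how to read them: the response is pasted in as a string rather than fetched, the function name is hypothetical, and I assume the offsets line up with Python string indices, which holds for this sample text.

```python
import json

# The response shown above, trimmed to one token per line.
RESPONSE = """
{"tokens": [
  {"token": "test", "start_offset": 0, "end_offset": 4, "type": "<ALPHANUM>", "position": 0},
  {"token": "测", "start_offset": 4, "end_offset": 5, "type": "<IDEOGRAPHIC>", "position": 1},
  {"token": "试", "start_offset": 5, "end_offset": 6, "type": "<IDEOGRAPHIC>", "position": 2}
]}
"""


def token_spans(text, response_json):
    """Pair each token with the slice of `text` that its offsets point to."""
    tokens = json.loads(response_json)["tokens"]
    return [(t["token"], text[t["start_offset"]:t["end_offset"]]) for t in tokens]


for token, source in token_spans("test测试", RESPONSE):
    print(token, source, token == source)
```

Each Chinese character becomes its own token here, which is why the standard analyzer is usually a poor fit for Chinese text.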

The simple analyzer (simple)

http://localhost:9200/_analyze?analyzer=simple&pretty=true&text=test_测试

Result:

{
  "tokens" : [
    {
      "token" : "test",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "测试",
      "start_offset" : 5,
      "end_offset" : 7,
      "type" : "word",
      "position" : 1
    }
  ]
}

IK analyzers: ik_max_word and ik_smart

The plugin must be installed first:

https://github.com/medcl/elasticsearch-analysis-ik

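Download the zip package and install it with the elasticsearch-plugin install command. The plugin version should match the Elasticsearch version. My machine runs ES 5.6.10, so I installed 5.6.10:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v5.6.10/elasticsearch-analysis-ik-5.6.10.zip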

Then restart Elasticsearch and the plugin is ready.

Testing on ES 5:

http://localhost:9200/_analyze?analyzer=ik_max_word&pretty=true&text=test_tes_te测试

Testing on ES 6:

curl -H 'Content-Type: application/json' 'http://localhost:9200/_analyze?pretty=true' -d @data.json

data.json

{
  "analyzer": "ik_max_word",
  "text": "test_tes_te测试"
}

Result:

{
  "tokens" : [
    {
      "token" : "test_tes_te",
      "start_offset" : 0,
      "end_offset" : 11,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "test",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "tes",
      "start_offset" : 5,
      "end_offset" : 8,
      "type" : "ENGLISH",
      "position" : 2
    },
    {
      "token" : "te",
      "start_offset" : 9,
      "end_offset" : 11,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "测试",
      "start_offset" : 11,
      "end_offset" : 13,
      "type" : "CN_WORD",
      "position" : 4
    }
  ]
}
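One thing that stands out in this result is that ik_max_word returns overlapping tokens: the whole string test_tes_te and also the pieces inside it. The sketch below makes that visible by grouping tokens whose offsets fall inside another token's span. The token list is copied from the result above, and the function name is my own.

```python
# (token, start_offset, end_offset) triples copied from the ik_max_word result above.
TOKENS = [
    ("test_tes_te", 0, 11),
    ("test", 0, 4),
    ("tes", 5, 8),
    ("te", 9, 11),
    ("测试", 11, 13),
]


def contained_tokens(tokens):
    """Map each token to the other tokens whose span lies inside its own span."""
    result = {}
    for outer, o_start, o_end in tokens:
        inner = [
            tok
            for tok, start, end in tokens
            if (start, end) != (o_start, o_end) and o_start <= start and end <= o_end
        ]
        if inner:
            result[outer] = inner
    return result


print(contained_tokens(TOKENS))
```

Here only test_tes_te contains other tokens, so a search for either the full identifier or one of its parts can match the same document.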