Elasticsearch Query DSL Summary (2): All You Need to Understand the Match Query (2)

Date: 2021-05-15 | Category: Programmer Life

Let's verify this:

GET matchtest/_search
{
  "query": {
    "match": {
      "hobbies": {
        "query": "footba22",
        "fuzziness": "AUTO"
      }
    }
  }
}

GET matchtest/_search
{
  "query": {
    "match": {
      "name": {
        "query": "jiO",
        "fuzziness": "AUTO"
      }
    }
  }
}

GET matchtest/_search
{
  "query": {
    "match": {
      "name": {
        "query": "jOO",
        "fuzziness": "AUTO"
      }
    }
  }
}

prefix_length

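prefix_length is the number of leading characters that are not fuzzified, that is, they must match exactly. Most spelling mistakes occur towards the end of a word rather than at the beginning, so prefix_length reduces the number of candidate terms that have to be examined. Note that prefix_length only has an effect when used together with the fuzziness parameter.

For example, when searching hobbies for football with prefix_length set to 3, the query foatball no longer matches, because its third character differs from football.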

GET matchtest/_search
{
  "query": {
    "match": {
      "hobbies": {
        "query": "foatball",
        "fuzziness": "AUTO",
        "prefix_length": 3
      }
    }
  }
}
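To make the interaction between fuzziness and prefix_length concrete, here is a minimal Python sketch. It is not how Elasticsearch implements fuzzy matching: the function names are made up for illustration, it uses plain Levenshtein distance (Elasticsearch also counts a transposition as one edit by default), and it assumes the default AUTO thresholds (0 edits for terms of 1 to 2 characters, 1 edit for 3 to 5, 2 edits for longer terms).

```python
def auto_fuzziness(term: str) -> int:
    """Allowed edits under the assumed default AUTO thresholds (3 and 6)."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def fuzzy_match(query_term: str, indexed_term: str, prefix_length: int = 0) -> bool:
    """Hypothetical helper: the first prefix_length characters must be
    identical, and the whole terms must be within the AUTO edit distance."""
    if query_term[:prefix_length] != indexed_term[:prefix_length]:
        return False
    return levenshtein(query_term, indexed_term) <= auto_fuzziness(query_term)


if __name__ == "__main__":
    print(fuzzy_match("foatball", "football"))                   # no prefix constraint
    print(fuzzy_match("foatball", "football", prefix_length=3))  # "foa" vs "foo"
```

Under these assumptions foatball is one edit away from football and matches when no prefix is required, but is rejected once the first three characters have to be identical.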

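TODO: I have not yet worked out what max_expansions means in practice for the match query. The official documentation describes it as the maximum number of terms the query will expand to (default 50), but I have not verified how it behaves. If you know how this parameter is used, please leave me a comment. Many thanks!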

Zero terms Query

Let's start with an example. First create a mapping type zero_terms_query_test whose message field uses the stop analyzer. This analyzer removes all stop words at index time.

PUT matchtest1

PUT matchtest1/_mapping/zero_terms_query_test
{
  "properties": {
    "message": {
      "type": "text",
      "analyzer": "stop"
    }
  }
}

PUT matchtest1/zero_terms_query_test/1
{
  "message": "to be or not to be"
}

GET matchtest1/_search
{
  "query": {
    "match": {
      "message": {
        "query": "to be or not to be",
        "operator": "and",
        "zero_terms_query": "none"
      }
    }
  }
}

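The phrase to be or not to be in the message field consists entirely of stop words. Once they are filtered out nothing is left, so no tokens are produced, and a search for the phrase cannot find anything. You can confirm this with the _analyze API: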

POST _analyze
{
  "analyzer": "stop",
  "text": "to be or not to be"
}

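zero_terms_query exists to solve exactly this problem. Its default value is none, which means that a query reduced to zero terms by the analyzer (here, a query made up only of stop words on a field using the stop analyzer) returns no documents. If it is set to all, the query behaves like match_all and the document is found.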
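The behaviour can be imitated in a few lines of Python. This is only a sketch under stated assumptions, not Elasticsearch code: stop_analyze and match are hypothetical helpers, the stop-word set is assumed to be the default English list used by the stop analyzer, and matching is modelled as a simple "all terms present" check (operator and).

```python
import re

# Assumed default English stop-word list of the stop analyzer.
STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by", "for", "if",
    "in", "into", "is", "it", "no", "not", "of", "on", "or", "such", "that",
    "the", "their", "then", "there", "these", "they", "this", "to", "was",
    "will", "with",
}


def stop_analyze(text: str) -> list:
    """Lowercase, split on non-letters, then drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def match(query: str, docs: dict, zero_terms_query: str = "none") -> list:
    """Return ids of matching documents (operator "and" only)."""
    terms = stop_analyze(query)
    if not terms:
        # The analyzer removed every term: decide between "none" and "all".
        return list(docs) if zero_terms_query == "all" else []
    return [doc_id for doc_id, text in docs.items()
            if all(t in stop_analyze(text) for t in terms)]


DOCS = {1: "to be or not to be"}

if __name__ == "__main__":
    print(stop_analyze("to be or not to be"))
    print(match("to be or not to be", DOCS, zero_terms_query="none"))
    print(match("to be or not to be", DOCS, zero_terms_query="all"))
```

In this sketch the analyzed query is empty, so "none" yields no hits while "all" returns every document, which mirrors the match_all-like behaviour described above.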

GET matchtest1/_search
{
  "query": {
    "match": {
      "message": {
        "query": "to be or not to be",
        "operator": "and",
        "zero_terms_query": "all"
      }
    }
  }
}

Cutoff frequency

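The terms of a query string fall into two groups: low-frequency terms (more important) and high-frequency terms (less important). The stop words mentioned earlier are high-frequency terms: they occur often, but they usually say little about relevance. In practice we want documents to match the low-frequency, more important terms as well as possible.

To achieve this, set the cutoff_frequency parameter on the query. Suppose cutoff_frequency is set to 0.01. This means:

Any term that appears in more than 1% of the documents is treated as a high-frequency term

All other terms are treated as low-frequency terms

The high-frequency (less important) terms are moved into an optional subquery, so they only contribute to scoring and do not take part in matching. The low-frequency (more important) terms take part in both matching and scoring.

With this approach a stop-word file is no longer needed, because stop words are effectively generated dynamically: high-frequency terms are treated as stop words. The setting only makes sense for fields with many terms, such as email bodies or articles. When the text is very short, cutoff_frequency has little effect.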

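cutoff_frequency can be specified in two forms:

A fraction (for example 0.01) is a relative frequency, i.e. a share of all documents

A positive integer (for example 5) is an absolute number of documents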
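The classification can be sketched in Python as follows. This is an illustration under assumptions, not the Elasticsearch implementation: the helper names are invented, all documents are assumed to live on a single shard, a term counts as high-frequency when its document frequency is strictly greater than the threshold, and the case where every query term is high-frequency is not modelled.

```python
import math
import re


def tokenize(text: str) -> list:
    """Very rough stand-in for the standard analyzer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def doc_freq(term: str, docs: list) -> int:
    """Number of documents that contain the term at least once."""
    return sum(1 for text in docs if term in tokenize(text))


def split_by_cutoff(query: str, docs: list, cutoff: float):
    """Split query terms into (low_frequency, high_frequency).

    cutoff >= 1 is read as an absolute document count; a fraction is
    read as a share of the total number of documents.
    """
    threshold = cutoff if cutoff >= 1 else math.ceil(cutoff * len(docs))
    low, high = [], []
    for term in tokenize(query):
        (high if doc_freq(term, docs) > threshold else low).append(term)
    return low, high


def matches(text: str, low_terms: list) -> bool:
    """A document must contain at least one low-frequency term."""
    tokens = set(tokenize(text))
    return any(t in tokens for t in low_terms)


DOCS = [
    "to be or not to be to be or",
    "be key or abc",
    "or to be or to to be or",
]

if __name__ == "__main__":
    low, high = split_by_cutoff("to be key", DOCS, cutoff=2)
    print("low-frequency:", low, "high-frequency:", high)
    print([matches(text, low) for text in DOCS])
```

With an absolute cutoff of 2 on these three sample messages, "be" (present in all three) ends up in the high-frequency group, while "to" and "key" stay low-frequency, so a text that contains only "be" would not match in this sketch.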

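Here is an example. All 3 documents created below contain the word "be". Setting cutoff_frequency to 2 in the query makes "be" a high-frequency term (it appears in 3 documents, which is more than 2), so it only contributes to scoring and is ignored for matching.

The query text is "to be key". Because "be" is a high-frequency term, a document must contain "to" or "key" in order to match; a document whose only query term is "be" would not match. Note that, as written, all three sample documents contain either "to" or "key", document 3 included.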

PUT /matchtest2

PUT matchtest2/_mapping/cutoff_frequency_test
{
  "properties": {
    "message": {
      "type": "text"
    }
  }
}

PUT matchtest2/cutoff_frequency_test/1
{
  "message": "to be or not to be to be or"
}

PUT matchtest2/cutoff_frequency_test/2
{
  "message": "be key or abc"
}

PUT matchtest2/cutoff_frequency_test/3
{
  "message": "or to be or to to be or"
}

GET matchtest2/_search
{
  "query": {
    "match": {
      "message": {
        "query": "to be key",
        "cutoff_frequency": 2
      }
    }
  }
}

synonyms

Please credit the source when reposting: https://www.heiqu.com/wpwwfy.html
