Elasticsearch7.6学习笔记1 Getting start with Elasticsearch (3)

返回值:

{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 26, "sum_other_doc_count" : 928, "buckets" : [ { "key" : "MD", "doc_count" : 28 }, { "key" : "ID", "doc_count" : 23 }, { "key" : "TX", "doc_count" : 21 } ] } } }

hits命中查询条件的记录,因为设置了size=0, 返回[]. total是本次查询命中了1000条记录

aggregations 是聚合指标结果

group_by_state 是我们查询中命名的变量名

doc_count_error_upper_bound 没有在这次聚合中返回、但是可能存在的潜在聚合结果.键名有「上界」的意思,也就是表示在预估的最坏情况下沒有被算进最终结果的值,当然doc_count_error_upper_bound的值越大,最终数据不准确的可能性越大,能确定的是,它的值为 0 表示数据完全正确,但是它不为 0,不代表这次聚合的数据是错误的.

sum_other_doc_count 聚合中没有统计到的文档数

值得注意的是, top3是否是准确的呢. 我们看到doc_count_error_upper_bound是有错误数量的, 即统计结果很可能不准确, 并且得到的top3分别是28,23,21. 我们再来添加另个查询参数来比较结果:

GET /bank/_search { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "size": 3, "shard_size": 60 } } } } ----------------------------------------- "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 915, "buckets" : [ { "key" : "TX", "doc_count" : 30 }, { "key" : "MD", "doc_count" : 28 }, { "key" : "ID", "doc_count" : 27 } ] } }

shard_size 表示每个分片计算的数量. 因为agg聚合运算是每个分片计算出一个结果,然后最后聚合计算最终结果. 数据在分片分布不均衡, 每个分片的topN并不是一样的, 就有可能最终聚合结果少算了一部分. 从而导致doc_count_error_upper_bound不为0. es默认shard_size的值是size*1.5+10, size=3对应就是14.5, 验证shar_size=14.5时返回值确实和不传一样. 而设置为60时, error终于为0了, 即, 可以保证这个3个绝对是最多的top3. 也就是说, 聚合运算要设置shard_size尽可能大, 比如size的20倍.

按省份统计人数并计算平均薪酬

我们想要查看每个省的平均薪酬, sql可能是

select state, avg(balance) AS average_balance, count(*) AS group_by_state from tbl_bank group by state limit 3

在es可以这样查询:

GET /bank/_search { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "size": 3, "shard_size": 60 }, "aggs": { "average_balance": { "avg": { "field": "balance" } }, "sum_balance": { "sum": { "field": "balance" } } } } } }

第二个aggs是计算每个state的聚合指标

average_balance 自定义的变量名称, 值为相同state的balance avg运算

sum_balance 自定义的变量名称, 值为相同state的balancesum运算

结果如下:

{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 915, "buckets" : [ { "key" : "TX", "doc_count" : 30, "sum_balance" : { "value" : 782199.0 }, "average_balance" : { "value" : 26073.3 } }, { "key" : "MD", "doc_count" : 28, "sum_balance" : { "value" : 732523.0 }, "average_balance" : { "value" : 26161.535714285714 } }, { "key" : "ID", "doc_count" : 27, "sum_balance" : { "value" : 657957.0 }, "average_balance" : { "value" : 24368.777777777777 } } ] } } } 按省份统计人数并按照平均薪酬排序

agg terms默认排序是count降序, 如果我们想用其他方式, sql可能是这样:

select state, avg(balance) AS average_balance, count(*) AS group_by_state from tbl_bank group by state order by average_balance limit 3

对应es可以这样查询:

GET /bank/_search { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "order": { "average_balance": "desc" }, "size": 3 }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } } }

返回结果的top3就不是之前的啦:

"aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : -1, "sum_other_doc_count" : 983, "buckets" : [ { "key" : "DE", "doc_count" : 2, "average_balance" : { "value" : 39040.5 } }, { "key" : "RI", "doc_count" : 5, "average_balance" : { "value" : 36035.4 } }, { "key" : "NE", "doc_count" : 10, "average_balance" : { "value" : 35648.8 } } ] } } 参考

中文社区:https://elasticsearch.cn/

es官方文档: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html

es官方文档: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.html

terms 聚合计算不准确: https://www.dongwm.com/post/elasticsearch-terms-agg-is-not-accurate/

内容版权声明:除非注明,否则皆为本站原创文章。

转载注明出处:https://www.heiqu.com/wpspyf.html