返回值:
{ "took" : 5, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 26, "sum_other_doc_count" : 928, "buckets" : [ { "key" : "MD", "doc_count" : 28 }, { "key" : "ID", "doc_count" : 23 }, { "key" : "TX", "doc_count" : 21 } ] } } }hits命中查询条件的记录,因为设置了size=0, 返回[]. total是本次查询命中了1000条记录
aggregations 是聚合指标结果
group_by_state 是我们查询中命名的变量名
doc_count_error_upper_bound 没有在这次聚合中返回、但是可能存在的潜在聚合结果.键名有「上界」的意思,也就是表示在预估的最坏情况下沒有被算进最终结果的值,当然doc_count_error_upper_bound的值越大,最终数据不准确的可能性越大,能确定的是,它的值为 0 表示数据完全正确,但是它不为 0,不代表这次聚合的数据是错误的.
sum_other_doc_count 聚合中没有统计到的文档数
值得注意的是, top3是否是准确的呢. 我们看到doc_count_error_upper_bound是有错误数量的, 即统计结果很可能不准确, 并且得到的top3分别是28,23,21. 我们再来添加另个查询参数来比较结果:
GET /bank/_search { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "size": 3, "shard_size": 60 } } } } ----------------------------------------- "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 915, "buckets" : [ { "key" : "TX", "doc_count" : 30 }, { "key" : "MD", "doc_count" : 28 }, { "key" : "ID", "doc_count" : 27 } ] } }shard_size 表示每个分片计算的数量. 因为agg聚合运算是每个分片计算出一个结果,然后最后聚合计算最终结果. 数据在分片分布不均衡, 每个分片的topN并不是一样的, 就有可能最终聚合结果少算了一部分. 从而导致doc_count_error_upper_bound不为0. es默认shard_size的值是size*1.5+10, size=3对应就是14.5, 验证shar_size=14.5时返回值确实和不传一样. 而设置为60时, error终于为0了, 即, 可以保证这个3个绝对是最多的top3. 也就是说, 聚合运算要设置shard_size尽可能大, 比如size的20倍.
按省份统计人数并计算平均薪酬我们想要查看每个省的平均薪酬, sql可能是
select state, avg(balance) AS average_balance, count(*) AS group_by_state from tbl_bank group by state limit 3在es可以这样查询:
GET /bank/_search { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "size": 3, "shard_size": 60 }, "aggs": { "average_balance": { "avg": { "field": "balance" } }, "sum_balance": { "sum": { "field": "balance" } } } } } }第二个aggs是计算每个state的聚合指标
average_balance 自定义的变量名称, 值为相同state的balance avg运算
sum_balance 自定义的变量名称, 值为相同state的balancesum运算
结果如下:
{ "took" : 12, "timed_out" : false, "_shards" : { "total" : 3, "successful" : 3, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1000, "relation" : "eq" }, "max_score" : null, "hits" : [ ] }, "aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : 0, "sum_other_doc_count" : 915, "buckets" : [ { "key" : "TX", "doc_count" : 30, "sum_balance" : { "value" : 782199.0 }, "average_balance" : { "value" : 26073.3 } }, { "key" : "MD", "doc_count" : 28, "sum_balance" : { "value" : 732523.0 }, "average_balance" : { "value" : 26161.535714285714 } }, { "key" : "ID", "doc_count" : 27, "sum_balance" : { "value" : 657957.0 }, "average_balance" : { "value" : 24368.777777777777 } } ] } } } 按省份统计人数并按照平均薪酬排序agg terms默认排序是count降序, 如果我们想用其他方式, sql可能是这样:
select state, avg(balance) AS average_balance, count(*) AS group_by_state from tbl_bank group by state order by average_balance limit 3对应es可以这样查询:
GET /bank/_search { "size": 0, "aggs": { "group_by_state": { "terms": { "field": "state.keyword", "order": { "average_balance": "desc" }, "size": 3 }, "aggs": { "average_balance": { "avg": { "field": "balance" } } } } } }返回结果的top3就不是之前的啦:
"aggregations" : { "group_by_state" : { "doc_count_error_upper_bound" : -1, "sum_other_doc_count" : 983, "buckets" : [ { "key" : "DE", "doc_count" : 2, "average_balance" : { "value" : 39040.5 } }, { "key" : "RI", "doc_count" : 5, "average_balance" : { "value" : 36035.4 } }, { "key" : "NE", "doc_count" : 10, "average_balance" : { "value" : 35648.8 } } ] } } 参考中文社区:https://elasticsearch.cn/
es官方文档: https://www.elastic.co/guide/en/elasticsearch/reference/current/documents-indices.html
es官方文档: https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-index.html
terms 聚合计算不准确: https://www.dongwm.com/post/elasticsearch-terms-agg-is-not-accurate/