Return the log with the latest timestamp for each group sharing a unique ID

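Problem description:

We have a set of logs in Elasticsearch, organized in groups. Each group contains 1-7 logs that share a unique ID (named transactionId), and every log within a group has a unique timestamp (eventTimestamp).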

For example:

{ 
    "transactionId": "id111", 
    "eventTimestamp": "1505864112047", 
    "otherfieldA": "fieldAvalue", 
    "otherfieldB": "fieldBvalue" 
} 

{ 
    "transactionId": "id111", 
    "eventTimestamp": "1505864112051", 
    "otherfieldA": "fieldAvalue", 
    "otherfieldB": "fieldBvalue" 
} 

{ 
    "transactionId": "id222", 
    "eventTimestamp": "1505863719467", 
    "otherfieldA": "fieldAvalue", 
    "otherfieldB": "fieldBvalue" 
} 

{ 
    "transactionId": "id222", 
    "eventTimestamp": "1505863719478", 
    "otherfieldA": "fieldAvalue", 
    "otherfieldB": "fieldBvalue" 
} 

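I need to write a query that returns, for every transactionId, the log with the latest timestamp, restricted to a certain date range.

Continuing with my simple example, the query should return these logs: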

{ 
    "transactionId": "id111", 
    "eventTimestamp": "1505864112051", 
    "otherfieldA": "fieldAvalue", 
    "otherfieldB": "fieldBvalue" 
} 

{ 
    "transactionId": "id222", 
    "eventTimestamp": "1505863719478", 
    "otherfieldA": "fieldAvalue", 
    "otherfieldB": "fieldBvalue" 
} 

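Any ideas on how to build this query?

You cannot get the desired result with a query alone, but you can with a combination of a terms aggregation and a nested top hits aggregation.

The terms aggregation is responsible for building buckets, where all items with the same term end up in the same bucket. Applied to transactionId, this produces your groups. The top hits aggregation is a metric aggregation that can be configured to return the top x hits of a bucket according to a given sort order. This lets you retrieve the log event with the largest timestamp in each bucket.

Assuming the default mapping for your sample data (where strings are indexed as thekey (text) and thekey.keyword (non-analyzed text)), this query: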
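To make the bucket-then-pick idea concrete, here is a minimal pure-Python sketch of what the two aggregations compute together. It does not talk to Elasticsearch at all; the function name `latest_per_transaction` is a hypothetical helper, and the input is the sample data from the question with the extra fields left out.

```python
def latest_per_transaction(logs):
    """Group logs by transactionId and keep the one with the largest eventTimestamp."""
    latest = {}
    for log in logs:
        key = log["transactionId"]  # plays the role of the terms bucket key
        current = latest.get(key)
        # The sample data stores timestamps as strings, so compare them as integers.
        if current is None or int(log["eventTimestamp"]) > int(current["eventTimestamp"]):
            latest[key] = log  # plays the role of top_hits with size 1, sorted descending
    return latest


logs = [
    {"transactionId": "id111", "eventTimestamp": "1505864112047"},
    {"transactionId": "id111", "eventTimestamp": "1505864112051"},
    {"transactionId": "id222", "eventTimestamp": "1505863719467"},
    {"transactionId": "id222", "eventTimestamp": "1505863719478"},
]
result = latest_per_transaction(logs)
```

In Elasticsearch the same work is done inside the cluster, so only one document per bucket travels back to the client.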

GET so-logs/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "range": {
            "eventTimestamp.keyword": {
              "gte": 1500000000000,
              "lte": 1507000000000
            }
          }
        }
      ]
    }
  },
  "aggs": {
    "by_transaction_id": {
      "terms": {
        "field": "transactionId.keyword",
        "size": 10
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "sort": [
              {
                "eventTimestamp.keyword": {
                  "order": "desc"
                }
              }
            ]
          }
        }
      }
    }
  }
}

produces the following output:

{
  "took": 7,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 4,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "by_transaction_id": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "id111",
          "doc_count": 2,
          "latest": {
            "hits": {
              "total": 2,
              "max_score": null,
              "hits": [
                {
                  "_index": "so-logs",
                  "_type": "entry",
                  "_id": "AV6z9Yj4QYbhNp_FoXa1",
                  "_score": null,
                  "_source": {
                    "transactionId": "id111",
                    "eventTimestamp": "1505864112051",
                    "otherfieldA": "fieldAvalue",
                    "otherfieldB": "fieldBvalue"
                  },
                  "sort": [
                    "1505864112051"
                  ]
                }
              ]
            }
          }
        },
        {
          "key": "id222",
          "doc_count": 2,
          "latest": {
            "hits": {
              "total": 2,
              "max_score": null,
              "hits": [
                {
                  "_index": "so-logs",
                  "_type": "entry",
                  "_id": "AV6z9ZlOQYbhNp_FoXa4",
                  "_score": null,
                  "_source": {
                    "transactionId": "id222",
                    "eventTimestamp": "1505863719478",
                    "otherfieldA": "fieldAvalue",
                    "otherfieldB": "fieldBvalue"
                  },
                  "sort": [
                    "1505863719478"
                  ]
                }
              ]
            }
          }
        }
      ]
    }
  }
}

Here you can find the expected result inside the aggregations, under by_transaction_id.latest, which follows the aggregation names defined in the query.
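If the response is consumed from client code, the nesting can be unpacked with a few lines. The following is only a sketch: `extract_latest_sources` is a hypothetical helper, it assumes the response has already been parsed into a Python dict, and the sample dict is a trimmed copy of the output above that keeps only the fields the helper reads.

```python
def extract_latest_sources(response, terms_agg="by_transaction_id", hits_agg="latest"):
    """Return the _source of the top hit in each bucket of a terms/top_hits response."""
    buckets = response["aggregations"][terms_agg]["buckets"]
    sources = []
    for bucket in buckets:
        hits = bucket[hits_agg]["hits"]["hits"]
        if hits:  # a bucket normally holds at least one hit, but guard anyway
            sources.append(hits[0]["_source"])
    return sources


# Trimmed copy of the response: only the keys read by the helper are kept.
sample_response = {
    "aggregations": {
        "by_transaction_id": {
            "buckets": [
                {
                    "key": "id111",
                    "latest": {"hits": {"hits": [
                        {"_source": {"transactionId": "id111",
                                     "eventTimestamp": "1505864112051"}}
                    ]}},
                },
                {
                    "key": "id222",
                    "latest": {"hits": {"hits": [
                        {"_source": {"transactionId": "id222",
                                     "eventTimestamp": "1505863719478"}}
                    ]}},
                },
            ]
        }
    }
}
latest_logs = extract_latest_sources(sample_response)
```

The aggregation names are parameters so the helper keeps working if the names in the query are changed.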

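Note that the terms aggregation limits the number of buckets it returns, and from a performance point of view, setting this to > 10,000 is probably not a smart idea. For details, see the section on size of the terms aggregation. If you need to handle a large number of distinct transaction IDs, I suggest redundantly storing the "top" entry per transaction ID.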
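One possible reading of the redundant-storage suggestion, sketched under assumptions: keep a second store keyed by transactionId and overwrite an entry only when a newer log arrives. In the sketch below a plain dict stands in for that store (in practice it might be a separate index that uses transactionId as the document ID), and `update_latest` is a hypothetical helper, not an Elasticsearch API.

```python
def update_latest(latest_store, log):
    """Overwrite the stored entry for a transaction only if the new log is newer."""
    key = log["transactionId"]
    current = latest_store.get(key)
    if current is None or int(log["eventTimestamp"]) > int(current["eventTimestamp"]):
        latest_store[key] = log
        return True   # the store was updated
    return False      # an equally new or newer entry was already stored


# A dict stands in for the redundant "top entry" store (an assumption for illustration).
latest_store = {}
incoming = [
    {"transactionId": "id111", "eventTimestamp": "1505864112051"},
    {"transactionId": "id111", "eventTimestamp": "1505864112047"},  # arrives late, is older
    {"transactionId": "id222", "eventTimestamp": "1505863719467"},
    {"transactionId": "id222", "eventTimestamp": "1505863719478"},
]
for log in incoming:
    update_latest(latest_store, log)
```

With such a store, fetching the latest log per transaction becomes an ordinary query with a range filter, with no bucket limit to worry about. The trade-off is the extra write on every incoming log.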

Also, you should switch the eventTimestamp field to date to get better performance and a wider set of query possibilities.
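A small illustration of why the string-typed timestamp is fragile, plus a sketch of what the mapping change could look like. The mapping fragment is written as a Python dict and is an assumption about how you might declare the field; `epoch_millis` is an Elasticsearch date format for millisecond timestamps. The comparison part is plain Python and only mimics how keyword values are ordered.

```python
# Hypothetical mapping fragment: a date field that accepts millisecond epoch values.
event_timestamp_mapping = {
    "properties": {
        "eventTimestamp": {"type": "date", "format": "epoch_millis"}
    }
}

# Keyword values are ordered as strings, character by character. That happens to
# work while every timestamp has 13 digits, but breaks for values of other lengths.
shorter = "999999999999"   # 12 digits, a moment in 2001
longer = "1505864112051"   # 13 digits, a moment in 2017

string_order_says_shorter_is_later = shorter > longer            # compares "9" with "1"
numeric_order_says_shorter_is_later = int(shorter) > int(longer)
```

The sample data only contains 13-digit values, so the keyword-based range and sort in the query give the right answer here. A date field removes that hidden dependency on string length.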