Elasticsearch aggregation on email domains
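Question:
I'm new to Elasticsearch, and I'm trying to count the distinct occurrences of a substring of a field.
I have email recipients as part of a mail-log index, and I want to count how many mails belong to each distinct domain in the index.
So, for example, if my index contains 3 mail logs from the addresses [email protected], [email protected], and [email protected], I'd like to see 2 mails from the b.com domain and 1 mail from the e.com domain.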
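Answer:
You need a pattern_capture token filter that captures only what comes after the @. Also, so as not to interfere with the original analysis of the text, I suggest adding a sub-field to your original email field and using only that sub-field for this specific aggregation: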
PUT /test
{
  "settings": {
    "analysis": {
      "filter": {
        "email_domains": {
          "type": "pattern_capture",
          "preserve_original": false,
          "patterns": [
            "@(.+)"
          ]
        }
      },
      "analyzer": {
        "email": {
          "tokenizer": "uax_url_email",
          "filter": [
            "email_domains",
            "lowercase",
            "unique"
          ]
        }
      }
    }
  },
  "mappings": {
    "emails": {
      "properties": {
        "email": {
          "type": "string",
          "fields": {
            "domain": {
              "type": "string",
              "analyzer": "email"
            }
          }
        }
      }
    }
  }
}
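To get a feel for what the email analyzer ends up indexing in email.domain, here is a rough Python sketch of the same chain: split the field into email tokens, keep only the capture group of @(.+), lowercase it, and drop duplicates. This is only an approximation. The EMAIL_TOKEN regex is a simplified stand-in for the uax_url_email tokenizer, the function name domain_tokens is made up for illustration, and the addresses in the usage line are hypothetical.

```python
import re

# Same pattern as the "email_domains" filter in the index settings.
DOMAIN_PATTERN = re.compile(r"@(.+)")

# Simplified stand-in for the uax_url_email tokenizer (an assumption of this
# sketch): anything shaped like local@domain is treated as a single token.
EMAIL_TOKEN = re.compile(r"[^\s,;<>]+@[^\s,;<>]+")


def domain_tokens(field_value):
    """Approximate the tokens the 'email' analyzer would emit for a field value."""
    tokens = []
    for email in EMAIL_TOKEN.findall(field_value):
        # EMAIL_TOKEN guarantees an "@" followed by at least one character,
        # so the capture pattern always matches here.
        domain = DOMAIN_PATTERN.search(email).group(1)
        domain = domain.lower()        # "lowercase" filter
        if domain not in tokens:       # "unique" filter
            tokens.append(domain)
    return tokens


# Hypothetical addresses, not the ones from the question.
print(domain_tokens("Alice@Example.com, bob@example.com, carol@example.org"))
```

On a real cluster, the _analyze API run against the test index with the email analyzer is the authoritative way to check the emitted tokens.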
Try it with some test data:
POST /test/emails/_bulk
{"index":{"_id":"1"}}
{"email": "[email protected]"}
{"index":{"_id":"2"}}
{"email": "[email protected], [email protected]"}
{"index":{"_id":"3"}}
{"email": "[email protected]"}
{"index":{"_id":"4"}}
{"email": "[email protected]"}
{"index":{"_id":"5"}}
{"email": "[email protected]"}
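If you would rather generate that bulk body from code, a minimal sketch could look like the following. It only builds the newline-delimited JSON payload (one action line followed by one document line per document) and does not send anything. The helper name build_bulk_body and the addresses are hypothetical.

```python
import json


def build_bulk_body(docs):
    """Build an NDJSON _bulk body from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps({"index": {"_id": doc_id}}))  # action line
        lines.append(json.dumps(source))                       # document line
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical addresses, not the ones from the original post.
body = build_bulk_body([
    ("1", {"email": "alice@example.com"}),
    ("2", {"email": "bob@example.com, carol@example.org"}),
])
print(body, end="")
```

The resulting string can then be posted to /test/emails/_bulk with whatever HTTP client you already use.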
For your specific use case, a simple terms aggregation like the following should do:
GET /test/emails/_search
{
  "size": 0,
  "aggs": {
    "by_domain": {
      "terms": {
        "field": "email.domain",
        "size": 10
      }
    }
  }
}
The result looks like this:
"aggregations": {
  "by_domain": {
    "doc_count_error_upper_bound": 0,
    "sum_other_doc_count": 0,
    "buckets": [
      {
        "key": "outlook.com",
        "doc_count": 3
      },
      {
        "key": "gmail.com",
        "doc_count": 2
      },
      {
        "key": "yahoo.com",
        "doc_count": 1
      }
    ]
  }
}
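Note that doc_count counts documents, not address occurrences: a document listing addresses from two different domains lands in two buckets, which is why the counts above add up to 6 for 5 documents. The sketch below mimics that counting in plain Python. The function terms_buckets and the example.* domains are made up for illustration, and breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def terms_buckets(docs_tokens, size=10):
    """Count, per term, how many documents contain it (like a terms aggregation)."""
    counts = Counter()
    for tokens in docs_tokens:
        counts.update(set(tokens))  # each document counts once per distinct term
    # Highest doc_count first; ties broken alphabetically (sketch assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [{"key": key, "doc_count": count} for key, count in ordered[:size]]


# Hypothetical per-document domain tokens with the same shape as the test data:
# five documents, one of which lists addresses from two domains.
docs_tokens = [
    ["example.com"],
    ["example.com", "example.org"],
    ["example.net"],
    ["example.com"],
    ["example.org"],
]
buckets = terms_buckets(docs_tokens)
print(buckets)
```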