Group By Aggregation - database

I have searched for similar questions and have not found a question that allows me to group the count of an attribute with another attribute.
I have a SQL Query in this format:
SELECT weight, COUNT(age)
FROM patient_table
GROUP BY weight
ORDER BY COUNT(age) DESC;
In an Elasticsearch index I have this information:
{"height": 170, "weight": 200, "age": 18}, {"height": 180, "weight": 250, "age": 25},...
I want to translate this SQL query into an Elasticsearch query. That is, I want the count of age grouped by weight, and then to return the top value.
I do not know how to pass a SELECT statement to an Elasticsearch query, but I have figured out how to use a terms aggregation to group by weight. I assume that I can just grab the top value after the buckets have been ordered, and that this will be the answer I need.
This has been my attempt thus far:
GET /patient_table/_search
{
"aggs": {
"patient": {
"terms": {"field": "weight.keyword"},
"order": {"_count": "desc"}
}
}
}
EDIT: YD9's solution works, but is there any way to create a max sub-aggregation to obtain the max value of the previous aggregation? I ask because when I try to create a sub-aggregation after the count, I get an incorrect value of null for the max value. This is my attempt:
{
"size": 0,
"aggs": {
"weigth": {
"terms": {
"field": "weight.keyword",
"size": 10,
"order": {
"age_count_by_weight": "desc"
}
},
"aggs": {
"age_count_by_weight": {
"value_count": {"field": "age"}
}
}
}
},
"aggs": {
"max_age": {
"max": {"field": "age"}
}
}
}
Any help would be appreciated. Thank you in advance.

If you want to group by weight and order by the number of distinct ages for each weight, the following query should work:
{
"size": 0,
"aggs": {
"weight": {
"terms": {
"field": "weight.keyword",
"size": 10,
"order": {
"age_count_by_weight": "desc"
}
},
"aggs": {
"age_count_by_weight": {
"cardinality": {
"field": "age"
}
}
}
}
}
}
Edit: The cardinality aggregation counts each distinct age only once. If you want to count the total number of age values instead (one per document here), this query should work:
{
"size": 0,
"aggs": {
"weight": {
"terms": {
"field": "weight.keyword",
"size": 10,
"order": {
"age_count_by_weight": "desc"
}
},
"aggs": {
"age_count_by_weight": {
"value_count": {
"field": "age"
}
}
}
}
}
}
Edit 2: To get the highest age count across the weight buckets, you can use the max_bucket pipeline aggregation as a sibling of the terms aggregation. This is a sample query:
{
"size": 0,
"aggs": {
"weight": {
"terms": {
"field": "weight.keyword",
"size": 10,
"order": {
"age_count_by_weight": "desc"
}
},
"aggs": {
"age_count_by_weight": {
"value_count": {
"field": "age"
}
}
}
},
"max_age": {
"max_bucket": {
"buckets_path": "weight>age_count_by_weight"
}
}
}
}
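To make the difference between the three queries above concrete, here is a minimal Python sketch that mimics them on a handful of in-memory documents. The sample documents and the helper name group_by_weight are made up for illustration; the sketch only imitates the aggregation logic and does not talk to Elasticsearch. Note also that set() gives an exact distinct count, whereas the real cardinality aggregation is approximate.

```python
from collections import defaultdict

# Hypothetical sample documents in the shape shown in the question.
docs = [
    {"height": 170, "weight": 200, "age": 18},
    {"height": 180, "weight": 250, "age": 25},
    {"height": 175, "weight": 200, "age": 18},
    {"height": 165, "weight": 200, "age": 30},
    {"height": 182, "weight": 250, "age": 25},
]


def group_by_weight(docs):
    """Return {weight: (value_count, cardinality)} for the age field."""
    ages = defaultdict(list)
    for doc in docs:
        if "age" in doc:  # documents without an age contribute nothing
            ages[doc["weight"]].append(doc["age"])
    # len(a) mirrors value_count, len(set(a)) mirrors cardinality.
    return {w: (len(a), len(set(a))) for w, a in ages.items()}


stats = group_by_weight(docs)

# Order buckets by value_count descending, like
# "order": {"age_count_by_weight": "desc"} in the terms aggregation.
ordered = sorted(stats.items(), key=lambda kv: kv[1][0], reverse=True)

# The first bucket carries the value that max_bucket would report.
top_weight, (top_count, _) = ordered[0]
print(ordered)
print(top_weight, top_count)
```

Weight 200 has three age values but only two distinct ages, so the two orderings can disagree once the data gets larger.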

Related

find overlapping dates within mongoDB array objects

I have a MongoDB document collection with multiple arrays that looks like this :
{
"_id": "1235847",
"LineItems": [
{
"StartDate": ISODate("2017-07-31T00:00:00.000+00:00"),
"EndDate": ISODate("2017-09-19T00:00:00.000+00:00"),
"Amount": {"$numberDecimal": "0.00"}
},
{
"StartDate": ISODate("2022-03-20T00:00:00.000+00:00"),
"EndDate": ISODate("2022-10-21T00:00:00.000+00:00"),
"Amount": {"$numberDecimal": "6.38"}
},
{
"StartDate": ISODate("2022-09-20T00:00:00.000+00:00"),
"EndDate": ISODate("9999-12-31T00:00:00.000+00:00"),
"Amount": {"$numberDecimal": "6.17"}
}
]
}
Is there a simple way to find documents where a line item's StartDate overlaps a previous line item's StartDate/EndDate range?
A StartDate cannot be before a previous EndDate within the array.
A StartDate or EndDate cannot fall between a previous StartDate and EndDate within the array.
The stage below works, but I don't want to hardcode the array indexes in order to find all such documents.
{
$match: {
$expr: {
$gt: [
'LineItems.3.EndDate',
'LineItems.2.StartDate'
]
}
}
}
Here's one way you could find docs where a "StartDate" is earlier than the immediately preceding "EndDate". It relies on "$getField", which requires MongoDB 5.0 or later.
db.collection.find({
"$expr": {
"$getField": {
"field": "overlapped",
"input": {
"$reduce": {
"input": {"$slice": ["$LineItems", 1, {"$size": "$LineItems"}]},
"initialValue": {
"overlapped": false,
"prevEnd": {"$first": "$LineItems.EndDate"}
},
"in": {
"overlapped": {
"$or": [
"$$value.overlapped",
{"$lt": ["$$this.StartDate", "$$value.prevEnd"]}
]
},
"prevEnd": "$$this.EndDate"
}
}
}
}
}
})
Try it on mongoplayground.net.
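The $reduce expression above can be hard to read, so here is a rough Python equivalent of the same logic, offered as a sketch rather than as something to run against MongoDB. The function name has_overlap is hypothetical, and the sample data copies the dates from the question. Like the query, it only compares each StartDate with the EndDate of the element directly before it, so it assumes the array is already in chronological order.

```python
from datetime import datetime


def has_overlap(line_items):
    """True if any StartDate is earlier than the immediately preceding EndDate."""
    if not line_items:
        return False
    overlapped = False
    prev_end = line_items[0]["EndDate"]  # plays the role of initialValue.prevEnd
    for item in line_items[1:]:  # like $slice, skip the first element
        overlapped = overlapped or item["StartDate"] < prev_end
        prev_end = item["EndDate"]  # carry the current EndDate forward
    return overlapped


doc = {
    "LineItems": [
        {"StartDate": datetime(2017, 7, 31), "EndDate": datetime(2017, 9, 19)},
        {"StartDate": datetime(2022, 3, 20), "EndDate": datetime(2022, 10, 21)},
        {"StartDate": datetime(2022, 9, 20), "EndDate": datetime(9999, 12, 31)},
    ]
}

# The third item starts on 2022-09-20, before the second item ends on 2022-10-21.
print(has_overlap(doc["LineItems"]))
```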

elasticsearch aggregates some values in a single field

I have some raw data
{
{
"id":1,
"message":"intercept_log,UDP,0.0.0.0,68,255.255.255.255,67"
},
{
"id":2,
"message":"intercept_log,TCP,172.22.96.4,52085,239.255.255.250,3702,1:"
},
{
"id":3,
"message":"intercept_log,UDP,1.0.0.0,68,255.255.255.255,67"
},
{
"id":4,
"message":"intercept_log,TCP,173.22.96.4,52085,239.255.255.250,3702,1:"
}
}
Requirement
I want to group this data by the protocol value inside the message (the second comma-separated part).
The output should look like this:
{
{
"GroupValue":"TCP",
"DocCount":"2"
},
{
"GroupValue":"UDP",
"DocCount":"2"
}
}
Attempt
I tried the following query, but it failed:
GET systemevent*/_search
{
"size": 0,
"aggs": {
"tags": {
"terms": {
"field": "message.keyword",
"include": " intercept_log[,,](.*?)[,,].*?"
}
}
},
"track_total_hits": true
}
I am now trying to use pipelines to meet this need.
"aggs" seems to group only by whole field values.
Does anyone have a better idea?
Link
Terms aggregation
Update
My scenario is a little special. I collect logs from many different servers and then import them into Elasticsearch, so the message fields differ a lot from one another. Using a script directly for the grouping statistics leads to failed or inaccurate grouping. I tried to filter the data by a condition first and then group with a script (comment code 1), but that code does not produce the correct groups.
Here is some more detail on my scenario:
Our team uses Elasticsearch to analyze server logs: rsyslog forwards the data to a central server, and Logstash then filters and extracts the data into Elasticsearch. The resulting documents have a field called message, whose value is the detailed log line. We need to count the documents whose message contains certain values.
comment code 1
POST systemevent*/_search
{
"size": 0,
"query": {
"bool": {
"must": [
{
"match_phrase": {
"message": {
"query": "intercept_log"
}
}
}
]
}
},
"aggs": {
"protocol": {
"terms": {
"script": "def values = /,/.split(doc['message.keyword'].value); return values.length > 1 ? values[1] : 'N/A'",
"size": 10
}
}
},
"track_total_hits": true
}
comment code 2
POST test2/_search
{
"size": 0,
"aggs": {
"protocol": {
"terms": {
"script": "def values = /.*,.*/.matcher( doc['host.keyword'].value ); if( name.matches() ) {return values.group(1) } else { return 'N/A' }",
"size": 10
}
}
}
}
The easiest way to solve this is by leveraging scripts in the terms aggregation. The script would simply split on commas and take the second value.
POST systemevent*/_search
{
"size": 0,
"aggs": {
"protocol": {
"terms": {
"script": "def values = /,/.split(doc['message.keyword'].value); return values.length > 1 ? values[1] : 'N/A';",
"size": 10
}
}
}
}
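Alternatively, use a regex. Because matches() has to consume the whole string, the pattern ends with .* to allow trailing text:
POST test2/_search
{
"size": 0,
"aggs": {
"protocol": {
"terms": {
"script": "def m = /.*proto='(.*?)'.*/.matcher(doc['message.keyword'].value ); if( m.matches() ) { return m.group(1) } else { return 'N/A' }"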
}
}
}
}
The results would look like
"buckets" : [
{
"key" : "TCP",
"doc_count" : 2
},
{
"key" : "UDP",
"doc_count" : 2
}
]
A better and more efficient way would be to split the message field into new fields using an ingest pipeline or Logstash.
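For reference, the split-and-count logic of the first script above can be sketched in plain Python. This is only an illustration of what the terms aggregation computes; the function name protocol_of is hypothetical and the messages are the four from the question. The same extraction step is what you would move into an ingest pipeline or Logstash filter so that the protocol is stored as its own field.

```python
from collections import Counter

# The four sample messages from the question.
messages = [
    "intercept_log,UDP,0.0.0.0,68,255.255.255.255,67",
    "intercept_log,TCP,172.22.96.4,52085,239.255.255.250,3702,1:",
    "intercept_log,UDP,1.0.0.0,68,255.255.255.255,67",
    "intercept_log,TCP,173.22.96.4,52085,239.255.255.250,3702,1:",
]


def protocol_of(message):
    """Return the second comma-separated value, or 'N/A' if there is none."""
    values = message.split(",")
    return values[1] if len(values) > 1 else "N/A"


# One bucket per distinct protocol, with its document count.
buckets = Counter(protocol_of(m) for m in messages)
print(dict(buckets))
```

Messages from other sources that do not follow this comma layout would land in unexpected buckets, which is why filtering on intercept_log first (as in comment code 1) matters.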

Multiple match_phrase conditions with another bool in a single ElasticSearch query?

I am trying to write an Elasticsearch query that searches a text field ("body") and returns items that match at least one of two multi-word phrases I provide (e.g. "stack overflow" OR "the stackoverflow"). I would also like the query to return only results that occur after a given timestamp, with the results ordered by time.
My current solution is below. I believe the MUST is working correctly (gte a timestamp), but the BOOL + SHOULD with two match_phrases is not correct. I am getting the following error:
Unexpected character ('{' (code 123)): was expecting double-quote to start field name
Which I think is because I have two match_phrases in there?
The ES mapping and the details of the ES API I am using are here.
{"query":
{"bool":
{"should":
[{"match_phrase":
{"body":"a+phrase"}
},
{"match_phrase":
{"body":"another+phrase"}
}
]
},
{"bool":
{"must":
[{"range":
{"created_at:
{"gte":"thispage"}
}
}
]}
}
},"size":10000,
"sort":"created_at"
}
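I think you were missing a closing " after created_at. In addition, the second bool object sits directly inside the outer bool without a field name, which is what the parser error points at. Nesting the should clause inside must, as below, fixes both: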
{
"query": {
"bool": {
"must": [
{
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
{
"bool": {
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
]
}
}
]
}
},
"size": 10,
"sort": "created_at"
}
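Also, you are allowed to have both must and should as properties of the same bool object, so this is also worth trying. Because should clauses become optional once a must clause is present, set minimum_should_match to 1 so that at least one of the phrases still has to match.
{
"query": {
"bool": {
"must": {
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
],
"minimum_should_match": 1
}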
},
"size": 10,
"sort": "created_at"
}
On a side note, Postman or any JSON formatter/validator would really help in determining where the error is.
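If no formatter is at hand, Python's standard json module can serve as a quick validator. The sketch below is one possible way to do it; the helper name find_json_error is made up, and the broken string reproduces only the missing-quote fragment of the original query, not the full request.

```python
import json


def find_json_error(text):
    """Return None if text is valid JSON, else (line, column, message)."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno are 1-based positions of the first problem found.
        return err.lineno, err.colno, err.msg
    return None


# Fragment with the closing quote after created_at missing, as in the question.
broken = '{"range": {"created_at: {"gte": "1534004694"}}}'
# The same fragment with the quote restored.
fixed = '{"range": {"created_at": {"gte": "1534004694"}}}'

print(find_json_error(broken))
print(find_json_error(fixed))
```

The reported position is where the parser gave up, which is usually at or shortly after the actual mistake.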

Query only for numbers in nested array

I am trying to get the average of a key in a nested array inside a document, but I am not sure how to accomplish this.
Here is what my document looks like:
{
"_id": {
"$oid": "XXXXXXXXXXXXXXXXX"
},
"data": {
"type": "PlayerRoundData",
"playerId": "XXXXXXXXXXXXX",
"groupId": "XXXXXXXXXXXXXX",
"holeScores": [
{
"type": "RoundHoleData",
"points": 2
},
{
"type": "RoundHoleData",
"points": 13
},
{
"type": "RoundHoleData",
"points": 3
},
{
"type": "RoundHoleData",
"points": 1
},
{
"type": "RoundHoleData",
"points": 21
}
]
}
}
Now, the tricky part is that I only want the average of points for holeScores[0] across all documents with this playerId and this groupId.
Actually, the best solution would be to collect all documents with this playerId and groupId and create a new array with the averages of holeScores[0], holeScores[1], holeScores[2]... But if I can only get one array index at a time, that would be OK too :-)
Here is what I am thinking, but I am not quite sure how to put it together:
var allScores = dbCollection('scores').aggregate(
{$match: {"data.groupId": groupId, "playerId": playerId}},
{$group: {
_id: playerId,
rounds: { $sum: 1 }
result: { $sum: "$data.scoreTotals.points" }
}}
);
Really hoping for help with this issue and thanks in advance :-)
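You can use $unwind with includeArrayIndex to get each element's index, and then use $group to group by that index:
dbCollection('scores').aggregate([
// Keep only the rounds of this player in this group.
{
$match: { "data.playerId": "XXXXXXXXXXXXX", "data.groupId": "XXXXXXXXXXXXXX" }
},
// Emit one document per hole and record its array position in "index".
{
$unwind: {
path: "$data.holeScores",
includeArrayIndex: "index"
}
},
// Average the points of each hole position across all rounds.
{
$group: {
_id: "$index",
playerId: { $first: "$data.playerId" },
avg: { $avg: "$data.holeScores.points" }
}
}
])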
You can try the aggregation below, which averages the points of holeScores[0] across the matched documents:
db.collection.aggregate([
{ "$match": { "data.groupId": groupId, "data.playerId": playerId }},
{ "$group": {
"_id": null,
"result": {
"$avg": {
"$arrayElemAt": [
"$data.holeScores.points",
0
]
}
}
}}
])
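To check the expected numbers by hand, the per-index averaging can be sketched in plain Python. This is only an illustration under made-up data: the function name average_by_hole and the two sample rounds are hypothetical, and the documents are trimmed to the fields that matter.

```python
from collections import defaultdict


def average_by_hole(rounds):
    """Average the points at each holeScores index across all rounds."""
    totals = defaultdict(lambda: [0, 0])  # index -> [sum of points, count]
    for r in rounds:
        # enumerate() plays the role of $unwind with includeArrayIndex.
        for index, hole in enumerate(r["data"]["holeScores"]):
            totals[index][0] += hole["points"]
            totals[index][1] += 1
    # One average per hole position, in hole order.
    return [totals[i][0] / totals[i][1] for i in sorted(totals)]


# Two hypothetical rounds for the same player and group.
rounds = [
    {"data": {"holeScores": [{"points": 2}, {"points": 13}, {"points": 3}]}},
    {"data": {"holeScores": [{"points": 4}, {"points": 7}, {"points": 1}]}},
]

print(average_by_hole(rounds))  # [3.0, 10.0, 2.0]
```

The first element of the result corresponds to what the $arrayElemAt query returns for index 0.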

Elasticsearch: sort by max value in array

Let's say I have 2 documents:
{
"id": "1234",
"things": [
{
"datetime": "2016-01-01T12:00:00+03:00"
},
{
"datetime": "2016-01-06T12:00:00+03:00"
},
{
"datetime": "2100-01-01T12:00:00+03:00"
}
]
}
and
{
"id": "5678",
"things": [
{
"datetime": "2016-01-03T12:00:00+03:00"
},
{
"datetime": "2100-01-06T12:00:00+03:00"
}
]
}
things.datetime is mapped as { "type": "date", "format": "date_time_no_millis" }.
I want to sort these documents based on the latest things.datetime value that is not in the future.
I.e. sorted by simply the max things.datetime would use the dates 2100-01-01T12:00:00+03:00 and 2100-01-06T12:00:00+03:00. I want the sorting to be based on the values 2016-01-06T12:00:00+03:00 and 2016-01-03T12:00:00+03:00.
How can I achieve this, using ElasticSearch 2.x?
I've tried:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max"
}
}
But that doesn't seem to sort even by the 2100 dates.
I also tried to use nested_filter like so:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max",
"nested_filter": {
"range": {
"things.datetime": { "lte": "now" }
}
}
}
}
But it doesn't work as I'd expect.
Also the "sort" value in the response is a negative number. So for a document with dates:
"2015-10-24T05:50:00+03:00",
"2015-10-26T22:05:48+02:00",
"2015-10-24T08:05:43+03:00"
gets a negative sort value:
"sort": [
-9223372036854775808
]
The correct way to achieve this seems to be:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max",
"nested_path": "things",
"nested_filter": {
"range": {
"things.datetime": { "lte": "now" }
}
}
}
}
When no dates are left after the nested_filter is applied, the sort value becomes the smallest possible long (-9223372036854775808), which places such documents last in a descending sort.
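The intended ordering can be sketched in Python to make the expected result explicit. This is not how Elasticsearch implements the sort; the function names, the fixed value of now, and the third document (which has only future dates) are assumptions added for illustration.

```python
from datetime import datetime, timezone


def latest_past(doc, now):
    """Latest things.datetime that is not in the future, or None."""
    dates = [datetime.fromisoformat(t["datetime"]) for t in doc["things"]]
    past = [d for d in dates if d <= now]  # the nested_filter step
    return max(past) if past else None     # "mode": "max"


def sort_docs(docs, now):
    """Sort descending by latest past date; docs without one go last."""
    def key(doc):
        value = latest_past(doc, now)
        # The leading flag keeps documents with no past date at the end.
        return (value is not None, value or now)
    return sorted(docs, key=key, reverse=True)


docs = [
    {"id": "5678", "things": [{"datetime": "2016-01-03T12:00:00+03:00"},
                              {"datetime": "2100-01-06T12:00:00+03:00"}]},
    {"id": "1234", "things": [{"datetime": "2016-01-01T12:00:00+03:00"},
                              {"datetime": "2016-01-06T12:00:00+03:00"},
                              {"datetime": "2100-01-01T12:00:00+03:00"}]},
    {"id": "9999", "things": [{"datetime": "2100-02-01T12:00:00+03:00"}]},
]

# A fixed "now" keeps the example deterministic (an assumption for this sketch).
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print([d["id"] for d in sort_docs(docs, now)])  # ['1234', '5678', '9999']
```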
