Elasticsearch query array field across documents

I want to query an array field in Elasticsearch. The field contains one or more GPU node numbers that were allocated to a job. Several people may share the same GPU node, so the same node can show up in different documents at the same time. I want to get the total number of distinct nodes that were in use at a specific time.
Say I have three rows of data that fall in the same time interval. I want to plot a histogram showing that three nodes were occupied in that period. Can I achieve this in Kibana?
Example :
[3]
[3,4,5]
[4,5]
I am expecting an output of 3 since there were only 3 distinct nodes used.
Thanks in advance

You can accomplish this using a combination of a date histogram aggregation along with either a terms aggregation (if the exact number of nodes is important) or a cardinality aggregation (if you can accept some inaccuracy at higher cardinalities).
Full example:
# Start with a clean slate
DELETE test-index
# Create the index
PUT test-index
{
  "mappings": {
    "event": {
      "properties": {
        "nodes": {
          "type": "integer"
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}
# Index a few events (using the rows from your question)
POST test-index/event/_bulk
{"index":{}}
{"timestamp": "2018-06-10T00:00:00Z", "nodes":[3]}
{"index":{}}
{"timestamp": "2018-06-10T00:01:00Z", "nodes":[3,4,5]}
{"index":{}}
{"timestamp": "2018-06-10T00:02:00Z", "nodes":[4,5]}
# STRATEGY 1: Cardinality aggregation (scalable, but potentially inaccurate)
POST test-index/event/_search
{
  "size": 0,
  "aggs": {
    "active_nodes_histo": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "hour"
      },
      "aggs": {
        "active_nodes": {
          "cardinality": {
            "field": "nodes"
          }
        }
      }
    }
  }
}
# STRATEGY 2: Terms aggregation (exact, but potentially much more expensive)
POST test-index/event/_search
{
  "size": 0,
  "aggs": {
    "active_nodes_histo": {
      "date_histogram": {
        "field": "timestamp",
        "interval": "hour"
      },
      "aggs": {
        "active_nodes": {
          "terms": {
            "field": "nodes",
            "size": 10
          }
        }
      }
    }
  }
}
Notes:
Terms vs. cardinality aggregation: Use the cardinality agg unless you need to know WHICH nodes are in use. It is significantly more scalable, and until the cardinality reaches the thousands (the default precision_threshold is 3000), you are unlikely to see any inaccuracy.
Date histogram interval: Adjust the interval to whatever makes sense for you. If you run the example above, you'll see only one histogram bucket; if you change hour to minute, you'll see the histogram build itself out with more data points.
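As a sanity check on what both strategies compute, here is a minimal pure-Python sketch (no Elasticsearch involved) that buckets the three sample events by a time interval and counts the distinct node values per bucket. The function name distinct_nodes_per_bucket and the strftime-based bucketing are illustrative assumptions, not part of any Elasticsearch API. The exact set-based count corresponds to the number of term buckets the terms strategy returns (as long as "size" is large enough), and it is the number the cardinality strategy approximates.

```python
from collections import defaultdict
from datetime import datetime

# The three events indexed in the example above.
EVENTS = [
    {"timestamp": "2018-06-10T00:00:00Z", "nodes": [3]},
    {"timestamp": "2018-06-10T00:01:00Z", "nodes": [3, 4, 5]},
    {"timestamp": "2018-06-10T00:02:00Z", "nodes": [4, 5]},
]

# strftime patterns that truncate a timestamp to the start of its bucket.
BUCKET_FORMATS = {
    "hour": "%Y-%m-%dT%H:00:00Z",
    "minute": "%Y-%m-%dT%H:%M:00Z",
}


def distinct_nodes_per_bucket(events, interval):
    """Return {bucket_start: number of distinct nodes seen in that bucket}."""
    fmt = BUCKET_FORMATS[interval]
    buckets = defaultdict(set)
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        # Every value of the array field counts towards the bucket's set.
        buckets[ts.strftime(fmt)].update(event["nodes"])
    return {key: len(nodes) for key, nodes in sorted(buckets.items())}


print(distinct_nodes_per_bucket(EVENTS, "hour"))
print(distinct_nodes_per_bucket(EVENTS, "minute"))
```

With the hour interval all three events land in one bucket containing nodes 3, 4 and 5; with the minute interval each event gets its own bucket.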

Related

Delimit records by recurring value

I have documents that contain an object array. Within that array are pulses in a dataset. For example:
samples: [{"time":1224960,"flow":0,"temp":null},{"time":1224970,"flow":0,"temp":null},
{"time":1224980,"flow":23,"temp":null},{"time":1224990,"flow":44,"temp":null},
{"time":1225000,"flow":66,"temp":null},{"time":1225010,"flow":0,"temp":null},
{"time":1225020,"flow":650,"temp":null},{"time":1225030,"flow":40,"temp":null},
{"time":1225040,"flow":60,"temp":null},{"time":1225050,"flow":0,"temp":null},
{"time":1225060,"flow":0,"temp":null},{"time":1225070,"flow":0,"temp":null},
{"time":1225080,"flow":0,"temp":null},{"time":1225090,"flow":0,"temp":null},
{"time":1225100,"flow":0,"temp":null},{"time":1225110,"flow":67,"temp":null},
{"time":1225120,"flow":23,"temp":null},{"time":1225130,"flow":0,"temp":null},
{"time":1225140,"flow":0,"temp":null},{"time":1225150,"flow":0,"temp":null}]
I would like to construct an aggregation pipeline that acts on each run of consecutive 'samples.flow' values above zero. In other words, the sample pulses are delimited by one or more zero flow values. I can use an $unwind stage to flatten the data, but I'm at a loss as to how to subsequently group each pulse. I have no objections to this being a multistep process, but I'd rather not have to loop through it in code on the client side. The data will comprise fields from a number of documents and could total hundreds of thousands of entries.
From the example above I'd like to be able to extract:
[{"time":1224980,"total_flow":133,"temp":null},
{"time":1225020,"total_flow":750,"temp":null},
{"time":1225110,"total_flow":90,"temp":null}]
or variations thereof.
If you do not need specific values in the time field, you can use this pipeline with $bucketAuto (it assumes each sample is already a top-level document with time, flow and temp fields):
[
  {
    "$bucketAuto": {
      "groupBy": "$time",
      "buckets": 3,
      "output": {
        "total_flow": {
          "$sum": "$flow"
        },
        "temp": {
          "$first": "$temp"
        },
        "time": {
          "$min": "$time"
        }
      }
    }
  },
  {
    "$project": {
      "_id": 0
    }
  }
]
If you need specific values for time, then you will have to use $bucket and give it a boundaries argument with precalculated lower bounds. I think this should do the job.
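For comparison, the grouping rule the question describes (runs of positive flow, delimited by zeros) can be written down as a short reference implementation. This is only a sketch in Python to pin down the expected result, not a server-side pipeline: $bucketAuto distributes documents into roughly equal-sized buckets rather than splitting at zero-flow boundaries, so its output is worth checking against something like this. The function name extract_pulses is an assumption for illustration, and the temp field is left out because it is null throughout the sample.

```python
from itertools import groupby

# (time, flow) pairs copied from the samples array in the question.
SAMPLES = [
    (1224960, 0), (1224970, 0), (1224980, 23), (1224990, 44),
    (1225000, 66), (1225010, 0), (1225020, 650), (1225030, 40),
    (1225040, 60), (1225050, 0), (1225060, 0), (1225070, 0),
    (1225080, 0), (1225090, 0), (1225100, 0), (1225110, 67),
    (1225120, 23), (1225130, 0), (1225140, 0), (1225150, 0),
]


def extract_pulses(samples):
    """Sum each run of consecutive samples whose flow is above zero."""
    pulses = []
    # groupby splits the sequence into runs that share the same key,
    # here "is the flow positive?", so runs of zeros act as delimiters.
    for is_pulse, run in groupby(samples, key=lambda s: s[1] > 0):
        if not is_pulse:
            continue
        run = list(run)
        pulses.append({
            "time": run[0][0],  # time of the first sample in the pulse
            "total_flow": sum(flow for _, flow in run),
        })
    return pulses


for pulse in extract_pulses(SAMPLES):
    print(pulse)
```

The input must be sorted by time before it is passed in, since groupby only merges adjacent items.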

MongoDB filter a sub-array of Objects

The document structure has a round collection, which has an array of holes Objects embedded within it, with each hole played/scored entered.
The structure looks like this (there are more fields, but this summarises):
{
"_id": {
"$oid": "60701a691c071256e4f0d0d6"
},
"schema": {
"$numberDecimal": "1.0"
},
"playerName": "T Woods",
"comp": {
"id": {
"$oid": "607019361c071256e4f0d0d5"
},
"name": "US Open",
"tees": "Pro Tees",
"roundNo": {
"$numberInt": "1"
},
"scoringMethod": "Stableford"
},
"holes": [
{
"holeNo": {
"$numberInt": "1"
},
"holePar": {
"$numberInt": "4"
},
"holeSI": {
"$numberInt": "3"
},
"holeGross": {
"$numberInt": "4"
},
"holeStrokes": {
"$numberInt": "1"
},
"holeNett": {
"$numberInt": "3"
},
"holeGrossPoints": {
"$numberInt": "2"
},
"holeNettPoints": {
"$numberInt": "3"
}
}
]
}
In the Atlas web UI, it shows as (note there are 9 holes in this particular round of golf - limited to 3 for brevity):
I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf (i.e. a birdie on par 3 or better).
Being new to MongoDB, and NoSQL constructs, I am stuck with this. Reading around the aggregation pipeline framework, I have tried to break down the stages I will need as:
1. Filter by the comp.id and comp.roundNo
2. Filter this result on any hole within the holes array of Objects
Maybe I have approached this wrong, and should filter or structure this pipeline differently?
So far, using the Atlas web UI, I can apply these filters individually as:
{
"comp.id": ObjectId("607019361c071256e4f0d0d5"),
"comp.roundNo": 2
}
And:
{ "holes.0.holeGross": 2 }
But I have 2 problems:
1. In the second filter query, I have hard-coded the array index to get this value. I would need to search across all the sub-elements of every document that matches this comp.id && comp.roundNo.
2. How do I combine these? I presume this is where the aggregation comes in, as well as enumerating across the whole array (as above).
I note in particular that it is the extra ".0." part of the second query that I am not seeing in various other online postings trying to do the same thing. Is my data structure incorrect? Do I need the [0]...[17] Objects for an 18-hole round of golf?
"I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf"
If that is the goal, a simple $lte search inside the holes array like the following would do:
db.collection.find({ "holes.holeGross": { $lte: 2 } })
You simply leave the array index (such as 0) out of the property path, and the condition is then tested against every element of the array.
https://mongoplayground.net/p/KhZLnj9mJe5
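The question also asks how to combine this with the comp.id and comp.roundNo filter. In MongoDB, several fields in one filter document are implicitly ANDed, so the three conditions can go into a single find() filter without an aggregation pipeline. The sketch below spells out that combined filter as a Python dict and emulates its matching semantics in plain Python so it runs without a database. The player names and hole scores are made up for illustration, the matches helper is hypothetical, and comp.id is a plain string here, whereas a real query would pass an ObjectId.

```python
COMP_ID = "607019361c071256e4f0d0d5"  # would be an ObjectId in a real query

# The combined filter; fields listed together are ANDed by MongoDB.
COMBINED_FILTER = {
    "comp.id": COMP_ID,
    "comp.roundNo": 1,
    "holes.holeGross": {"$lte": 2},
}

# Hypothetical rounds, reduced to the fields the filter touches.
ROUNDS = [
    {"playerName": "Player A", "comp": {"id": COMP_ID, "roundNo": 1},
     "holes": [{"holeNo": 1, "holeGross": 4}, {"holeNo": 2, "holeGross": 2}]},
    {"playerName": "Player B", "comp": {"id": COMP_ID, "roundNo": 1},
     "holes": [{"holeNo": 1, "holeGross": 4}, {"holeNo": 2, "holeGross": 3}]},
    {"playerName": "Player C", "comp": {"id": COMP_ID, "roundNo": 2},
     "holes": [{"holeNo": 1, "holeGross": 2}]},
]


def matches(round_doc, flt):
    """Emulate COMBINED_FILTER for a single round document."""
    comp = round_doc["comp"]
    max_gross = flt["holes.holeGross"]["$lte"]
    return (
        comp["id"] == flt["comp.id"]
        and comp["roundNo"] == flt["comp.roundNo"]
        # A path without an index matches if ANY array element qualifies.
        and any(hole["holeGross"] <= max_gross for hole in round_doc["holes"])
    )


players = [r["playerName"] for r in ROUNDS if matches(r, COMBINED_FILTER)]
print(players)
```

Player C is excluded because the round number differs, and Player B because no hole has a gross score of 2 or less.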

How to get the total word count per document in SOLR?

I would like to retrieve some summary statistics from the text documents I have indexed in Solr. In particular, the word count per document.
For example, I have the following three documents indexed:
{
"id":"1",
"text":["This is the text in document 1"]},
{
"id":"2",
"text":["some text in document 2"]},
{
"id":"3",
"text":["and document 3"]}
I would like to get the total number of words per each individual document:
"1",7,
"2",5,
"3",3,
What query can I use to get such a result?
I am new to Solr and I am aware that I can use facets to get the count of the individual words over all documents using something like:
http://localhost:8983/solr/corename/select?q=*&facet=true&facet.field=text&facet.mincount=1
But how to get the total word count per document is not clear to me.
I appreciate your help!
If you do a faceted search over id with an inner facet over text, the inner facet count will give the number of words in the document with that id. The text field's type must be text_general or an equivalent tokenized type.
If you only want to count "distinct" words per document id, it is actually much easier (message below stands for the tokenized text field; for the documents in the question that would presumably be text):
{
  "query": "*:*",
  "facet": {
    "document": {
      "type": "terms",
      "field": "id",
      "facet": {
        "wordCount": "unique(message)"
      }
    }
  }
}
This gives the distinct word count per document. The following gives all words and their counts per document, but it's up to you to sum them to get the total (it is also an expensive call):
{
  "query": "*:*",
  "facet": {
    "document": {
      "type": "terms",
      "field": "id",
      "facet": {
        "wordCount": {
          "type": "terms",
          "field": "message",
          "limit": -1
        }
      }
    }
  }
}
@MatsLindth's comment is something to consider too: Solr and you might not agree on what a "word" is. The tokenizer is configurable up to a point, but depending on your needs that might not be very easy.
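To make the difference between the total and the distinct word count concrete, here is a small Python sketch over the three documents from the question. It assumes a deliberately naive tokenizer (lowercase, split on whitespace); a Solr analyzer such as the one behind text_general may split or drop tokens differently, so the numbers are only indicative. The function name word_counts is made up for this example.

```python
# The three documents from the question; "text" is a multi-valued field.
DOCS = [
    {"id": "1", "text": ["This is the text in document 1"]},
    {"id": "2", "text": ["some text in document 2"]},
    {"id": "3", "text": ["and document 3"]},
]


def word_counts(doc):
    """Return (total, distinct) token counts using naive tokenization."""
    tokens = []
    for value in doc["text"]:
        # Assumption: lowercase + whitespace split stands in for the analyzer.
        tokens.extend(value.lower().split())
    return len(tokens), len(set(tokens))


for doc in DOCS:
    total, distinct = word_counts(doc)
    print(doc["id"], total, distinct)
```

For these short documents the total and distinct counts coincide because no word repeats within a document; the two diverge as soon as a word occurs twice.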

Cosmos DB SQL query in single embed document

I am working with Cosmos DB and I want to write a SQL query that returns multiple documents combined into one single document, with the related documents embedded.
To elaborate, imagine you have the following two document types in one container. The OrderId of the Order document is referenced in the OrderDetail documents.
1. Order
{
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name": "ABC",
"Type": "Order",
"DeptName": "ABC",
"TotalAmount": 100.05
}
2. OrderDetail
{
"OrderDetailId": "689bdc38-9849-4a11-b856-53f8628b76c9",
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Type": "OrderDetail",
"ItemNo": 202,
"Quantity": 10,
"UnitPrice": 10.05
},
{
"OrderDetailId": "789bdc38-9849-4a11-b856-53f8628b76c9",
"OrderId": "31d4c08b-ee59-4ede-b801-3cacaea38808",
"Type": "OrderDetail",
"ItemNo": 200,
"Quantity": 11,
"UnitPrice": 15.05
}
I want to write a query that returns all OrderDetail entries in one array, based on the reference OrderId="31d4c08b-ee59-4ede-b801-3cacaea38808".
The output should look like this:
{
"OrderId":"31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name":"ABC",
"Type":"Order",
"OrderDetail":[
{
"OrderDetailId":"689bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":202,
"Quantity":10,
"UnitPrice":10.05
},
{
"OrderDetailId":"789bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":200,
"Quantity":11,
"UnitPrice":15.05
}
]
}
I have no idea how to query Cosmos DB to get the above result.
The output you want is what a join in a relational database would produce. Cosmos DB is a non-relational database, and its JOIN only works within a single document, so to my knowledge no single SQL query can produce the above output directly.
I suggest executing two queries. One produces:
{"OrderId":"31d4c08b-ee59-4ede-b801-3cacaea38808",
"Name":"ABC",
"Type":"Order"}
The other one produces:
"OrderDetail":[
{
"OrderDetailId":"689bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":202,
"Quantity":10,
"UnitPrice":10.05
},
{
"OrderDetailId":"789bdc38-9849-4a11-b856-53f8628b76c9",
"Type":"OrderDetail",
"ItemNo":200,
"Quantity":11,
"UnitPrice":15.05
}
]
Then combine them. You could also do this inside a stored procedure.
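The combine step itself is small. Here is a minimal Python sketch, assuming the two result sets have already been fetched (for instance with one query filtering on Type = "Order" and one on Type = "OrderDetail" for the same OrderId). The function name embed_details is hypothetical, and the literals simply repeat the documents from the question.

```python
import json

ORDER_ID = "31d4c08b-ee59-4ede-b801-3cacaea38808"

# Result of the first query: the Order document.
ORDER = {"OrderId": ORDER_ID, "Name": "ABC", "Type": "Order",
         "DeptName": "ABC", "TotalAmount": 100.05}

# Result of the second query: the OrderDetail documents.
DETAILS = [
    {"OrderDetailId": "689bdc38-9849-4a11-b856-53f8628b76c9",
     "OrderId": ORDER_ID, "Type": "OrderDetail",
     "ItemNo": 202, "Quantity": 10, "UnitPrice": 10.05},
    {"OrderDetailId": "789bdc38-9849-4a11-b856-53f8628b76c9",
     "OrderId": ORDER_ID, "Type": "OrderDetail",
     "ItemNo": 200, "Quantity": 11, "UnitPrice": 15.05},
]


def embed_details(order, details):
    """Combine one Order document with its OrderDetail documents."""
    # Keep only the Order fields shown in the desired output.
    combined = {key: order[key] for key in ("OrderId", "Name", "Type")}
    combined["OrderDetail"] = [
        # Drop the redundant OrderId from each embedded detail.
        {k: v for k, v in detail.items() if k != "OrderId"}
        for detail in details
        if detail["OrderId"] == order["OrderId"]
    ]
    return combined


print(json.dumps(embed_details(ORDER, DETAILS), indent=2))
```

The same logic could live in application code or, as suggested, in a stored procedure; only the merge is shown here.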

Why do multiple nested objects appear in this ElasticSearch query?

In Case 4 of this page, the query searches for all chairs less than 70 units in height:
curl localhost:9200/example/product/_search -d '{
"query": {
"filtered": {
"query": {
"match": {
"name": "chair"
}
},
"filter": {
"numeric_range": {
"size.height": {
"lt": 70
}
}
}
}
}
}'
Result:
"hits": [
{
"_id": "0",
"_source": {
"product": "chair",
"size": [
{
"width": 50,
"height": 50,
"depth": 50
},
{
"width": 75,
"height": 75,
"depth": 75
}
]
}
}
]
1) Why is the ID 0 for both chair sizes?
2) Why does the response show dimensions for the other chair that is 75 units in height?
1) The writer wanted to show a 1-to-N relation. In this case there are two types of chairs in his repository: a chair with dimensions of 50 and a chair with dimensions of 75. Both of them are still chairs, and the ID of the chair document is 0.
2) By default ES doesn't return partial results; it returns whole documents. Here we have one chair document with a size array that holds two objects: one for the 50 dimension and one for the 75 dimension. The supplied query either selects the whole document or it doesn't.
Translated into English, the query says: bring me all the documents that have the value "chair" in the product field and at least one size.height value lower than 70.
Even though the writer of the article is knowledgeable, I must say I don't like this kind of article, which tries to draw a direct line from the SQL world to a NoSQL implementation. If it were that easy, some big company would have written an automatic script that converts your SQL schemas to various NoSQL formats. To model your data correctly in NoSQL you must understand your products, the factors that should influence your decision, the use case and the data. There is no universal solution that will tell you: if you did it this way in an RDBMS, do it like this in ES.
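The whole-document behaviour from point 2 can be mimicked in a few lines of Python. This is only an illustration of the matching rule as described above (a document matches if any of its size.height values qualifies, and is then returned in full); the search function is a made-up stand-in, not how Elasticsearch is implemented.

```python
# The single chair document from the result above.
DOCS = [
    {
        "_id": "0",
        "_source": {
            "product": "chair",
            "size": [
                {"width": 50, "height": 50, "depth": 50},
                {"width": 75, "height": 75, "depth": 75},
            ],
        },
    }
]


def search(docs, product, max_height):
    """Return whole documents with any size.height below max_height."""
    return [
        doc
        for doc in docs
        if doc["_source"]["product"] == product
        # One qualifying array element is enough to select the document.
        and any(s["height"] < max_height for s in doc["_source"]["size"])
    ]


hits = search(DOCS, "chair", 70)
print([size["height"] for size in hits[0]["_source"]["size"]])
```

The 50-unit entry makes the document match, and the 75-unit entry comes along because the hit is the document, not the matching array element.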

Resources