Elasticsearch constant score sort - database

I have a pretty simple elasticsearch query where I filter some items by category. It's a constant score query, something like this:
"query": {
"constant_score": {
"filter": {
"term": {
"category": "[category-id]"
}
}
}
}
The problem is that, with no score to sort the results by, they don't always come back in the same order. This is an issue because it messes up my pagination.
For example, I request the first 5 items and receive, say, the following ids: [4, 7, 8, 10, 3]. I then request the next 5 items to display the next page, but I may get some items repeated, like this: [12, 15, 7, 13, 9].
The problem is that all my fields are string fields, and I wouldn't want to sort by any of them. The sort order is not important, it's just important to keep the same order every time.
Any ideas? Thanks!

Try this:
GET _search
{
"query": {
"bool": {
"filter": {
"term": {
"category": "[category-id]"
}
}
}
}
}
Since this is what used to be known as a filtered query, no scores are calculated and the score field will have a value of 0.
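A note on the pagination side: a score of 0 for every hit still leaves the order of ties undefined, so repeatable pages usually need an explicit sort on a field that is unique per document. Below is a minimal sketch of building such a request body in Python. The field name `item_id` and the helper `build_page_request` are hypothetical; the sketch assumes the index has some unique, sortable (for example keyword) field.

```python
def build_page_request(category_id, page, page_size=5, tie_breaker="item_id"):
    """Build a search body for one page of a category, sorted on a unique field.

    Assumption: `tie_breaker` names a unique, sortable field in the index
    ("item_id" is a placeholder, not a field from the question).
    """
    return {
        "from": page * page_size,   # offset of the first hit on this page
        "size": page_size,          # number of hits per page
        "query": {
            "bool": {
                "filter": {"term": {"category": category_id}}
            }
        },
        # An explicit sort makes the order independent of the (constant) score.
        "sort": [{tie_breaker: "asc"}],
    }


first_page = build_page_request("[category-id]", page=0)
second_page = build_page_request("[category-id]", page=1)
```

Because both pages share the same sort clause, the second page starts exactly where the first one ended, provided the index does not change between requests.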

Related

Delimit records by recurring value

I have documents that contain an object array. Within that array are pulses in a dataset. For example:
samples: [{"time":1224960,"flow":0,"temp":null},{"time":1224970,"flow":0,"temp":null},
{"time":1224980,"flow":23,"temp":null},{"time":1224990,"flow":44,"temp":null},
{"time":1225000,"flow":66,"temp":null},{"time":1225010,"flow":0,"temp":null},
{"time":1225020,"flow":650,"temp":null},{"time":1225030,"flow":40,"temp":null},
{"time":1225040,"flow":60,"temp":null},{"time":1225050,"flow":0,"temp":null},
{"time":1225060,"flow":0,"temp":null},{"time":1225070,"flow":0,"temp":null},
{"time":1225080,"flow":0,"temp":null},{"time":1225090,"flow":0,"temp":null},
{"time":1225100,"flow":0,"temp":null},{"time":1225110,"flow":67,"temp":null},
{"time":1225120,"flow":23,"temp":null},{"time":1225130,"flow":0,"temp":null},
{"time":1225140,"flow":0,"temp":null},{"time":1225150,"flow":0,"temp":null}]
I would like to construct an aggregate pipeline to act on each collection of consecutive 'samples.flow' values above zero. As in, the sample pulses are delimited by one or more zero flow values. I can use an $unwind stage to flatten the data but I'm at a loss as to how to subsequently group each pulse. I have no objections to this being a multistep process. But I'd rather not have to loop through it in code on the client side. The data will comprise fields from a number of documents and could total in the hundreds of thousands of entries.
From the example above I'd like to be able to extract:
[{"time":1224980,"total_flow":133,"temp":null},
{"time":1225020,"total_flow":750,"temp":null},
{"time":1225110,"total_flow":90,"temp":null}]
or variations thereof.
If you are not looking for specific values to be on the time field, then you can use this pipeline with $bucketAuto.
[
{
"$bucketAuto": {
"groupBy": "$time",
"buckets": 3,
"output": {
total_flow: {
$sum: "$flow"
},
temp: {
$first: "$temp"
},
time: {
"$min": "$time"
}
}
}
},
{
"$project": {
"_id": 0
}
}
]
If you are looking for specific values for time, then you will need to use $bucket and provide it a boundaries argument with precalculated lower bounds. I think this should do the job.
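For reference, the grouping the question describes (runs of consecutive samples with flow above zero, delimited by zero-flow samples) can be sketched in plain Python. This is only an illustration of the intended logic, not a server-side pipeline; the function name `extract_pulses` is made up for the example, and the sample list is a shortened copy of the data in the question.

```python
from itertools import groupby


def extract_pulses(samples):
    """Sum each run of consecutive samples whose flow is above zero."""
    pulses = []
    # groupby splits the list into alternating runs of zero and non-zero flow.
    for is_pulse, run in groupby(samples, key=lambda s: s["flow"] > 0):
        if not is_pulse:
            continue  # zero-flow runs only act as delimiters
        run = list(run)
        pulses.append({
            "time": run[0]["time"],                      # time of the first sample in the pulse
            "total_flow": sum(s["flow"] for s in run),
            "temp": run[0]["temp"],
        })
    return pulses


samples = [
    {"time": 1224960, "flow": 0, "temp": None},
    {"time": 1224970, "flow": 0, "temp": None},
    {"time": 1224980, "flow": 23, "temp": None},
    {"time": 1224990, "flow": 44, "temp": None},
    {"time": 1225000, "flow": 66, "temp": None},
    {"time": 1225010, "flow": 0, "temp": None},
    {"time": 1225020, "flow": 650, "temp": None},
    {"time": 1225030, "flow": 40, "temp": None},
    {"time": 1225040, "flow": 60, "temp": None},
    {"time": 1225050, "flow": 0, "temp": None},
]
pulses = extract_pulses(samples)
```

Note that the samples must already be ordered by time for the runs to be meaningful.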

MongoDB filter a sub-array of Objects

The document structure has a round collection, which has an array of holes Objects embedded within it, with each hole played/scored entered.
The structure looks like this (there are more fields, but this summarises):
{
"_id": {
"$oid": "60701a691c071256e4f0d0d6"
},
"schema": {
"$numberDecimal": "1.0"
},
"playerName": "T Woods",
"comp": {
"id": {
"$oid": "607019361c071256e4f0d0d5"
},
"name": "US Open",
"tees": "Pro Tees",
"roundNo": {
"$numberInt": "1"
},
"scoringMethod": "Stableford"
},
"holes": [
{
"holeNo": {
"$numberInt": "1"
},
"holePar": {
"$numberInt": "4"
},
"holeSI": {
"$numberInt": "3"
},
"holeGross": {
"$numberInt": "4"
},
"holeStrokes": {
"$numberInt": "1"
},
"holeNett": {
"$numberInt": "3"
},
"holeGrossPoints": {
"$numberInt": "2"
},
"holeNettPoints": {
"$numberInt": "3"
}
}
]
}
In the Atlas web UI, it shows as (note there are 9 holes in this particular round of golf - limited to 3 for brevity):
I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf (i.e. a birdie on par 3 or better).
Being new to MongoDB, and NoSQL constructs, I am stuck with this. Reading around the aggregation pipeline framework, I have tried to break down the stages I will need as:
1. Filter by the comp.id and comp.roundNo
2. Filter this result with any hole within the holes array of Objects
Maybe I have approached this wrong, and should filter or structure this pipeline differently?
So far, using the Atlas web UI, I can apply these filters individually as:
{
"comp.id": ObjectId("607019361c071256e4f0d0d5"),
"comp.roundNo": 2
}
And:
{ "holes.0.holeGross": 2 }
But I have 2 problems:
1. In the second filter query, I have hard-coded the array index to get this value. I would need to search across all the sub-elements of every document that matches this comp.id && comp.roundNo
2. How do I combine these? I presume this is where the aggregation comes in, as well as enumerating across the whole array (as above).
I note in particular it is the extra ".0." part of the second query that I am not seeing from various other online postings trying to do the same thing. Is my data structure incorrect? Do I need the [0]...[17] Objects for an 18-hole round of golf?
"I would like to find the players who have a holeGross of 2, or less, somewhere in their round of golf"
If that is the goal, a simple $lte search inside the holes array like the following would do:
db.collection.find({ "holes.holeGross": { $lte: 2 } })
To search every element of the array, simply leave the array index (such as 0) out of the property path.
https://mongoplayground.net/p/KhZLnj9mJe5
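On combining the two filters from the question: conditions placed in the same filter document are ANDed together, so the competition fields and the array condition can go into one query. The sketch below builds such a filter as a Python dict and adds a plain-Python check that mirrors the "any element matches" behaviour. The helper names are hypothetical, and in a real driver call comp.id would be an ObjectId rather than a string.

```python
def build_round_filter(comp_id, round_no, max_gross=2):
    """Filter document: one competition round with at least one low-gross hole."""
    return {
        "comp.id": comp_id,            # assumption: pass an ObjectId in real use
        "comp.roundNo": round_no,
        # No array index in the path, so every element of holes is checked.
        "holes.holeGross": {"$lte": max_gross},
    }


def has_low_gross(round_doc, max_gross=2):
    """Plain-Python equivalent of {"holes.holeGross": {"$lte": max_gross}}."""
    return any(
        hole.get("holeGross") is not None and hole["holeGross"] <= max_gross
        for hole in round_doc.get("holes", [])
    )


query = build_round_filter("607019361c071256e4f0d0d5", 2)
```

The resulting dict has the same shape as the filter you would type into the Atlas UI or pass to find.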

fetch the array element from nested mongo object [duplicate]

Is it possible to wildcard the key in a query? For instance, given the following record, I'd like to do a .find({'a.*': 4})
This was discussed here https://jira.mongodb.org/browse/SERVER-267 but it looks like it's not been resolved.
{
'a': {
'b': [1, 2],
'c': [3, 4]
}
}
As asked, this is not possible. The server issue you linked to is still under "issues we're not sure of".
MongoDB has some intelligence surrounding the use of arrays, and I think that's part of the complexity surrounding such a feature.
Take the following query db.foo.find({ 'a.b' : 4 } ). This query will match the following documents.
{ a: { b: 4 } }
{ a: [ { b: 4 } ] }
So what does "wildcard" do here? db.foo.find( { a.* : 4 } ) Does it match the first document? What about the second?
Moreover, what does this mean semantically? As you've described, the query is effectively "find documents where any field in that document has a value of 4". That's a little unusual.
Is there a specific semantic that you're trying to achieve? Maybe a change in the document structure will get you the query you want.
I came across this question because I faced the same issue. The accepted answer explains why this is not supported, but it does not really solve the issue itself.
I ended up with a solution that makes the wildcard usage proposed here redundant, and I'm sharing it here in case someone finds this post some day.
Why I wanted to use wildcards in my MongoDB queries?
In my case, I needed this "feature" in order to be able to find a match inside a dictionary (just as the question's code demonstrates).
What are the alternatives?
Use a reversed map (very similar to how DNS works) and simply use it. So, in our case we can use something similar to this:
{
"a": {
"map": {
"b": [1, 2, 3],
"c": [3, 4]
},
"reverse-map": {
"1": [ "b" ],
"2": [ "b" ],
"3": [ "b", "c" ],
"4": [ "c" ]
}
}
}
I know it takes more memory, and insert / update operations must keep the two maps in sync, and yet it solves the problem. Now, instead of making an imaginary query like
db.foo.find( { a.map.* : 4 } )
I can make an actual query
db.foo.find( { "a.reverse-map.4" : { $exists: true } } )
Which will return all items that have a specific value (in our example 4)
Again, this approach takes more memory, and you need to manage indexes properly if you want good performance (read the docs), but it works well for my use case. I hope this helps someone else someday as well.
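Keeping the two maps in sync is easiest if the reverse map is always derived from the forward map before writing the document. A minimal sketch of that derivation in Python follows; the function name `build_reverse_map` is hypothetical, and the values are turned into strings because they become field names in the document.

```python
from collections import defaultdict


def build_reverse_map(forward):
    """Derive {"value": [keys...]} from {"key": [values...]}."""
    reverse = defaultdict(list)
    for key in sorted(forward):          # sorted for a deterministic key order
        for value in forward[key]:
            # Field names must be strings, so the numeric value is converted.
            reverse[str(value)].append(key)
    return dict(reverse)


forward_map = {"b": [1, 2, 3], "c": [3, 4]}
document = {
    "a": {
        "map": forward_map,
        "reverse-map": build_reverse_map(forward_map),
    }
}
```

Rebuilding the reverse map on every write is the simplest option; incremental updates are possible but easier to get wrong.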
Starting from MongoDB v3.4.4, you can use $objectToArray to convert a into an array of k-v pairs for querying.
db.collection.aggregate([
{
"$addFields": {
"a": {
"$objectToArray": "$a"
}
}
},
{
$match: {
"a.v": 4
}
},
{
"$addFields": {
// cosmetics to revert back to original structure
"a": {
"$arrayToObject": "$a"
}
}
}
])
Here is the Mongo playground for your reference.
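To make the intermediate shape concrete, here is a small Python emulation of what the pipeline does to one document. It is only an illustration of the k/v transformation and the match on "a.v", not driver code, and the helper names are made up for the example.

```python
def object_to_array(obj):
    """Mimic $objectToArray: {"b": [1, 2]} becomes [{"k": "b", "v": [1, 2]}]."""
    return [{"k": key, "v": value} for key, value in obj.items()]


def any_value_matches(doc, field, target):
    """True if any value under doc[field] equals target or is a list containing it."""
    for pair in object_to_array(doc.get(field, {})):
        value = pair["v"]
        if value == target or (isinstance(value, list) and target in value):
            return True
    return False


doc = {"a": {"b": [1, 2], "c": [3, 4]}}
pairs = object_to_array(doc["a"])
```

The check on both scalars and lists mirrors how an equality match treats array-valued fields.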

Elasticsearch query array field across documents

I want to query an array field in Elasticsearch. The field contains one or several GPU node numbers that were allocated to a job. Different people may be using the same node at the same time, since a GPU node can be shared. I want to get the total number of distinct nodes that were used at a specific time.
Say I have three rows of data which fall in the same time interval. I want to plot a histogram showing that there are three nodes occupied in that period. Can I achieve this in Kibana?
Example :
[3]
[3,4,5]
[4,5]
I am expecting an output of 3 since there were only 3 distinct nodes used.
Thanks in advance
You can accomplish this using a combination of a date histogram aggregation along with either a terms aggregation (if the exact number of nodes is important) or a cardinality aggregation (if you can accept some inaccuracy at higher cardinalities).
Full example:
# Start with a clean slate
DELETE test-index
# Create the index
PUT test-index
{
"mappings": {
"event": {
"properties": {
"nodes": {
"type": "integer"
},
"timestamp": {
"type": "date"
}
}
}
}
}
# Index a few events (using the rows from your question)
POST test-index/event/_bulk
{"index":{}}
{"timestamp": "2018-06-10T00:00:00Z", "nodes":[3]}
{"index":{}}
{"timestamp": "2018-06-10T00:01:00Z", "nodes":[3,4,5]}
{"index":{}}
{"timestamp": "2018-06-10T00:02:00Z", "nodes":[4,5]}
# STRATEGY 1: Cardinality aggregation (scalable, but potentially inaccurate)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"cardinality": {
"field": "nodes"
}
}
}
}
}
}
# STRATEGY 2: Terms aggregation (exact, but potentially much more expensive)
POST test-index/event/_search
{
"size": 0,
"aggs": {
"active_nodes_histo": {
"date_histogram": {
"field": "timestamp",
"interval": "hour"
},
"aggs": {
"active_nodes": {
"terms": {
"field": "nodes",
"size": 10
}
}
}
}
}
}
Notes:
Terms vs. cardinality aggregation: Use the cardinality agg unless you need to know WHICH nodes are in use. It is significantly more scalable, and until you get into cardinality of 1000s, you likely won't see any inaccuracy.
Date histogram interval: You can play with the interval until it makes sense for your data. If you run through the example above, you'll only see one histogram bucket; however, if you change hour to minute, you'll see the histogram build itself out with more data points.
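As a cross-check of what the aggregation should return, the same computation (bucket by hour, then count distinct node numbers per bucket) can be sketched in plain Python. This is an exact count, comparable to what the cardinality aggregation approximates; the function name is hypothetical and the events are the three rows indexed above.

```python
from collections import defaultdict
from datetime import datetime


def distinct_nodes_per_hour(events):
    """Count distinct node numbers in each one-hour bucket."""
    buckets = defaultdict(set)
    for event in events:
        ts = datetime.strptime(event["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
        hour = ts.replace(minute=0, second=0)   # truncate to the start of the hour
        buckets[hour].update(event["nodes"])    # a set keeps each node only once
    return {hour.isoformat(): len(nodes) for hour, nodes in sorted(buckets.items())}


events = [
    {"timestamp": "2018-06-10T00:00:00Z", "nodes": [3]},
    {"timestamp": "2018-06-10T00:01:00Z", "nodes": [3, 4, 5]},
    {"timestamp": "2018-06-10T00:02:00Z", "nodes": [4, 5]},
]
counts = distinct_nodes_per_hour(events)
```

All three events fall into the same hour, so the single bucket holds the three distinct nodes from the question.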

Multiple sorting in ArangoDB

My webapp needs to display several sorted lists of document attributes in a graph. These are hours, cycles, and age.
I have an AQL query that beautifully traverses the graph and gets me all the data my app needs in 2 ms. I'm very impressed! But I need it sorted for each graph. The query currently returns an array of json objects that contain all three of the attributes and the id for which they apply. Awesome. The query also very easily sorts on one of the attributes.
My problem is: I need to have a sorted list of all three, and would prefer not to query the database three times since the data is all in the same documents my traversal returned.
I would like to return three sorted arrays of json objects: one containing hours and the id, one containing cycles and the id, and one containing age and the id. This way, my graphs can easily display all three graphs without client-side sorting.
HTTP requests themselves are time consuming although the database is very fast, which is why I'd like to pull all three at once, as the data itself is small.
My current query is a simple graph traversal:
for v, e, p in outbound startNode graph 'myGraph'
filters & definitions...
sort v.hours desc
return {"hours": v.hours, "cycles": v.cycles, "age": v.age, "id": v.id}
Is there an easy way I can tell Arango to return me this structure?
{
[
{
"id": 47,
"hours": 123
},
{
"id": 23,
"hours": 105
}...
],
[
{
"id": 47,
"cycles": 18
},
{
"id": 23,
"cycles": 5
}...
],
[
{
"id": 47,
"age": 4.2
},
{
"id": 23,
"age": 0.9
}
]
}
Although the traversal is fast, I would prefer if I didn't have to re-traverse the graph three times to do it, if possible.
My solution:
let data = (for v, e, p in outbound startNode graph 'myGraph'
filters & definitions...
return {"hours": v.hours, "cycles": v.cycles, "age": v.age, "id": v.id})
let byHours = (for thing in data
sort thing.hours desc
return {"hours": thing.hours, "id": thing.id})
let byCycles = (for thing in data
sort thing.cycles desc
return {"cycles": thing.cycles, "id": thing.id})
let byAge = (for thing in data
sort thing.age desc
return {"age": thing.age, "id": thing.id})
return {"hours": byHours, "cycles": byCycles, "age": byAge}
I'm not sure how this compares against your solution performance-wise, but the most obvious solution would be to traverse once and then create three sorted results like this:
LET nodes = (
FOR v, e, p IN OUTBOUND startNode GRAPH 'myGraph'
FILTER ...
RETURN v
)
RETURN {
hours: (
FOR n IN nodes
SORT n.hours DESC
RETURN KEEP(n, ['hours', 'id'])
),
cycles: (
FOR n IN nodes
SORT n.cycles DESC
RETURN KEEP(n, ['cycles', 'id'])
),
age: (
FOR n IN nodes
SORT n.age DESC
RETURN KEEP(n, ['age', 'id'])
)
}
This would traverse the graph only once but sort the result three times.
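If the single-query approach ever becomes awkward, the same "fetch once, sort three times" idea is also cheap on the application side. Below is a minimal Python sketch under the assumption that the traversal result arrives as a list of dicts; the function name `sorted_views` and the third row are hypothetical, while the first two rows reuse values from the question.

```python
def sorted_views(rows, attributes=("hours", "cycles", "age")):
    """Return one descending-sorted list of {id, attribute} per attribute."""
    return {
        attr: [
            {"id": row["id"], attr: row[attr]}
            for row in sorted(rows, key=lambda r: r[attr], reverse=True)
        ]
        for attr in attributes
    }


rows = [
    {"id": 47, "hours": 123, "cycles": 18, "age": 4.2},
    {"id": 23, "hours": 105, "cycles": 5, "age": 0.9},
    {"id": 8, "hours": 110, "cycles": 30, "age": 2.0},   # hypothetical extra row
]
views = sorted_views(rows)
```

Each list is sorted independently, so the same id can appear at a different position in each view.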

Resources