ElasticSearch sort array size incoherent results

I am trying to sort by array size in ElasticSearch 7.1.
I indexed the following data without creating any custom mapping:
{
"myarray": [{
"field": {
"value": "test"
}
}]
}
When I look at the mapping, it is giving me:
{
"properties": {
"myarray": {
"properties": {
"field": {
"properties": {
"value": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
Now I want to query the index and sort by the highest number of elements in myarray. I have tried doing:
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value') ? doc['myarray.field.value'].values.size() : 0"
}
}
}
which gives me an error like "Fielddata is disabled on text fields by default. [...] Alternatively use a keyword field instead." So I tried with:
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value.keyword') ? doc['myarray.field.value.keyword'].values.size() : 0"
}
}
}
which gives me the error "Illegal list shortcut value [values]." So then I tried removing the values keyword:
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value.keyword') ? doc['myarray.field.value.keyword'].size() : 0"
}
}
}
and it works, however I have some results that are sorted nicely and suddenly an element that should be at the top appears in the middle.
Is that because it is sorting by the length of the value as a string and not the length of myarray?

This is because a field mapped as text does not support sorting. To sort, you must use a field mapped as keyword (here, the myarray.field.value.keyword sub-field).
For more information and syntax, please refer to: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-sort.html
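A possible explanation for the misplaced results, offered as a sketch rather than a confirmed diagnosis: doc values for a keyword field hold each document's values de-duplicated, and values longer than ignore_above are not indexed, so doc['myarray.field.value.keyword'].size() counts distinct indexed values rather than the elements of myarray. The JavaScript below only simulates that counting outside Elasticsearch; the function name keywordDocValuesCount and the sample document are hypothetical.

```javascript
// Simulates what size() on a keyword doc-values field would count:
// distinct string values that are not longer than ignore_above.
// This is an illustration, not Elasticsearch code.
function keywordDocValuesCount(doc, ignoreAbove = 256) {
  const values = doc.myarray
    .map((element) => element.field && element.field.value)
    .filter((value) => typeof value === "string" && value.length <= ignoreAbove);
  return new Set(values).size;
}

// Hypothetical document: three array elements, but only two distinct values.
const exampleDoc = {
  myarray: [
    { field: { value: "test" } },
    { field: { value: "test" } },
    { field: { value: "other" } },
  ],
};

console.log(exampleDoc.myarray.length);         // number of array elements
console.log(keywordDocValuesCount(exampleDoc)); // number of distinct indexed values
```

If the two numbers differ for some documents, a script based on size() would rank those documents lower than their real array length suggests. Storing the array length in its own numeric field at index time is one way to avoid that.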


Multikey partial index not used with elemMatch

Consider the following document format which has an array field tasks holding embedded documents
{
"foo": "bar",
"tasks": [
{
"status": "sleep",
"id": "1"
},
{
"status": "active",
"id": "2"
}
]
}
There exists a partial index on key tasks.id
{
"v": 2,
"unique": true,
"key": {
"tasks.id": 1
},
"name": "tasks.id_1",
"partialFilterExpression": {
"tasks.id": {
"$exists": true
}
},
"ns": "zardb.quxcollection"
}
The following $elemMatch query with multiple conditions on the same array element
db.quxcollection.find(
{
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
does not seem to use the index
"winningPlan": {
"stage": "COLLSCAN",
"filter": {
"tasks": {
"$elemMatch": {
"$and": [{
"id": {
"$eq": "1"
}
},
{
"status": {
"$not": {
"$eq": "active"
}
}
}
]
}
}
},
"direction": "forward"
}
How can I make the above query use the index? The index does seem to be used via dot notation
db.quxcollection.find({"tasks.id": "1"})
however I need the same array element to match multiple conditions which includes the status field, and the following does not seem to be equivalent to the above $elemMatch based query
db.quxcollection.find({
"tasks.id": "1",
"tasks.status": { "$nin": ["active"] }
})
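To illustrate why the dot-notation query is not equivalent to the $elemMatch query, here is a minimal sketch in plain JavaScript that mimics the two matching rules outside MongoDB. The function names matchesElemMatch and matchesDotNotation are hypothetical, and the logic is a simplification of the server's behaviour for this one document shape.

```javascript
// The sample document from the question.
const sampleDoc = {
  foo: "bar",
  tasks: [
    { status: "sleep", id: "1" },
    { status: "active", id: "2" },
  ],
};

// $elemMatch: a single array element must satisfy every condition.
function matchesElemMatch(doc, id, excludedStatuses) {
  return doc.tasks.some(
    (task) => task.id === id && !excludedStatuses.includes(task.status)
  );
}

// Dot notation: each condition is evaluated against the whole array.
// "tasks.id": id            -> some element has that id
// "tasks.status": { $nin }  -> no element has an excluded status
function matchesDotNotation(doc, id, excludedStatuses) {
  const someElementHasId = doc.tasks.some((task) => task.id === id);
  const noElementExcluded = doc.tasks.every(
    (task) => !excludedStatuses.includes(task.status)
  );
  return someElementHasId && noElementExcluded;
}

console.log(matchesElemMatch(sampleDoc, "1", ["active"]));
console.log(matchesDotNotation(sampleDoc, "1", ["active"]));
```

For the sample document the first check succeeds (task "1" is asleep) while the second fails (another task is active), which is the difference the question describes.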
Partial indexes work by using the path as a key. With $elemMatch, the path does not appear explicitly in the query. If you check with .explain("allPlansExecution"), the index is not even considered by the query planner.
To benefit from the index you can specify the path in the query:
db.quxcollection.find(
{
"tasks.id": "1",
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
It duplicates part of the $elemMatch condition, so the index is used to find all documents containing a task with the specific id, and the full $elemMatch condition is then applied as a filter at the fetch stage. I must admit the query doesn't look nice, so maybe add some comments to the code explaining it.
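Since the id has to be written twice, one way to keep the two copies in sync is to build the filter in a small helper. This is only a sketch; the function name buildTaskFilter is hypothetical, and the resulting object is what would be passed to db.quxcollection.find().

```javascript
// Builds the filter with the id duplicated on purpose:
// "tasks.id" lets the planner consider the partial index,
// while $elemMatch keeps both conditions on the same array element.
function buildTaskFilter(id, excludedStatuses) {
  return {
    "tasks.id": id,
    tasks: {
      $elemMatch: {
        id: { $eq: id },
        status: { $nin: excludedStatuses },
      },
    },
  };
}

const filter = buildTaskFilter("1", ["active"]);
console.log(JSON.stringify(filter, null, 2));
```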

Logic app- how to retrieve json data from dynamic property name

Here's my JSON. I want to retrieve the JSON content from "Property - Dynamic content", where the "Dynamic content" part might vary for every JSON request. How do I filter this by a dynamic name?
{
"Attributes":
{
"Property1": {
"Data1": {
"Value": "50"
}
},
"Property2": {
"Data2": {
"Value": "50"
}
},
"Property - Dynamic content": {
"Data3": {
"Value": "50"
},
"Data4": {
"Value": "50"
}
}
}
}
For your requirement, please refer to my logic app below:
1. I initialized a variable and stored the same JSON as yours in it to simulate your situation.
2. Then I used "Parse JSON".
Please note that the schema of "Parse JSON" is:
{
"properties": {
"Attributes": {
"properties": {
"Property - Dynamic content": {
"type": [
"object",
"array"
]
},
"Property1": {
"properties": {
"Data1": {
"properties": {
"Value": {
"type": "string"
}
},
"type": "object"
}
},
"type": "object"
},
"Property2": {
"properties": {
"Data2": {
"properties": {
"Value": {
"type": "string"
}
},
"type": "object"
}
},
"type": "object"
}
},
"type": "object"
}
},
"type": "object"
}
Please pay attention to the type of Property - Dynamic content in the schema above. Because the content of Property - Dynamic content may be either an object or an array, I set both "object" and "array" as its type.
3. Then I initialized a variable named "result" to get the value you want.
Because the schema uses both "object" and "array" for Property - Dynamic content, you may not find it in the "Dynamic content" selection. You can enter its value as an expression instead. The whole expression is: body('Parse_JSON')?['Attributes']?['Property - Dynamic content']
I was able to get what I needed using inline code (JavaScript). If anyone else is looking for the same, here it is. This returns the JSON from the Property - Dynamic content element.
var data = Object.keys(workflowContext.trigger.outputs.body.Attributes);
// Pick the dynamic key. Assumes it is the only key starting with 'Property - ';
// a plain includes('Property') would also match Property1 and Property2.
var key = data.find(s => s.startsWith('Property - '));
return workflowContext.trigger.outputs.body.Attributes[key];
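If the dynamic name has no stable prefix, an alternative is to pick the key that is not one of the known static ones. The sketch below runs outside Logic Apps; the function name getDynamicProperty and the knownKeys default are assumptions based on the sample JSON in the question.

```javascript
// Returns the value of the first key in `attributes` that is not a known static key.
// knownKeys is an assumption taken from the sample payload.
function getDynamicProperty(attributes, knownKeys = ["Property1", "Property2"]) {
  const dynamicKey = Object.keys(attributes).find(
    (key) => !knownKeys.includes(key)
  );
  return dynamicKey === undefined ? undefined : attributes[dynamicKey];
}

// The sample payload from the question.
const payload = {
  Attributes: {
    Property1: { Data1: { Value: "50" } },
    Property2: { Data2: { Value: "50" } },
    "Property - Dynamic content": {
      Data3: { Value: "50" },
      Data4: { Value: "50" },
    },
  },
};

console.log(JSON.stringify(getDynamicProperty(payload.Attributes)));
```

Inside an inline code action, the same function body could be applied to workflowContext.trigger.outputs.body.Attributes.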

Multiple match_phrase conditions with another bool in a single ElasticSearch query?

I am trying to write an Elasticsearch query that searches a text field ("body") and returns items that match at least one of two multi-word phrases I provide (i.e. "stack overflow" OR "the stackoverflow"). I would also like the query to only provide results that occur after a given timestamp, with the results ordered by time.
My current solution is below. I believe the MUST is working correctly (gte a timestamp), but the BOOL + SHOULD with two match_phrases is not correct. I am getting the following error:
Unexpected character ('{' (code 123)): was expecting double-quote to start field name
Which I think is because I have two match_phrases in there?
This is the ES mapping, and the details of the ES API I am using are here.
{"query":
{"bool":
{"should":
[{"match_phrase":
{"body":"a+phrase"}
},
{"match_phrase":
{"body":"another+phrase"}
}
]
},
{"bool":
{"must":
[{"range":
{"created_at:
{"gte":"thispage"}
}
}
]}
}
},"size":10000,
"sort":"created_at"
}
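You were missing a closing " after created_at. In addition, the two bool objects sit side by side inside "query" without a field name, which is what triggers the "was expecting double-quote to start field name" error. Nesting the should clause inside the must array fixes both: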
{
"query": {
"bool": {
"must": [
{
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
{
"bool": {
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
]
}
}
]
}
},
"size": 10,
"sort": "created_at"
}
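Also, must and should can both be properties of the same bool object, so this is also worth trying. Note that when a bool query contains a must clause, the should clauses become optional by default, so "minimum_should_match": 1 is needed to still require at least one of the phrases.
{
"query": {
"bool": {
"must": {
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
],
"minimum_should_match": 1
}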
},
"size": 10,
"sort": "created_at"
}
On a side note, Postman or any JSON formatter/validator would really help in determining where the error is.
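Another way to avoid hand-written JSON syntax errors is to build the request body as an object and serialize it. The following is a minimal sketch; the function name buildPhraseQuery is hypothetical, and the field names body and created_at come from the question. It sets minimum_should_match to 1 so that at least one phrase must match alongside the range condition.

```javascript
// Builds the search body: at least one phrase must match,
// and created_at must be on or after `since`.
function buildPhraseQuery(phrases, since, size = 10) {
  return {
    query: {
      bool: {
        must: [{ range: { created_at: { gte: since } } }],
        should: phrases.map((phrase) => ({ match_phrase: { body: phrase } })),
        minimum_should_match: 1,
      },
    },
    size: size,
    sort: "created_at",
  };
}

const body = buildPhraseQuery(["stack overflow", "the stackoverflow"], "1534004694");
// JSON.stringify always produces syntactically valid JSON.
console.log(JSON.stringify(body, null, 2));
```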

Remove elements/objects From Array in ElasticSearch Followed by Matching Query

I'm having issues trying to remove elements/objects from an array in elasticsearch.
This is the mapping for the index:
{
"example1": {
"mappings": {
"doc": {
"properties": {
"locations": {
"type": "geo_point"
},
"postDate": {
"type": "date"
},
"status": {
"type": "long"
},
"user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
And this is an example document.
{
"_index": "example1",
"_type": "doc",
"_id": "8036",
"_score": 1,
"_source": {
"user": "kimchy8036",
"postDate": "2009-11-15T13:12:00",
"locations": [
[
72.79887719999999,
21.193036000000003
],
[
-1.8262150000000001,
51.178881999999994
]
]
}
}
Using the query below, I can add multiple locations.
POST /example1/_update_by_query
{
"query": {
"match": {
"_id": "3"
}
},
"script": {
"lang": "painless",
"inline": "ctx._source.locations.add(params.newsupp)",
"params": {
"newsupp": [
-74.00,
41.12121
]
}
}
}
But I'm not able to remove array objects from locations. I have tried the query below but it's not working.
POST example1/doc/3/_update
{
"script": {
"lang": "painless",
"inline": "ctx._source.locations.remove(params.tag)",
"params": {
"tag": [
-74.00,
41.12121
]
}
}
}
Kindly let me know what I am doing wrong here. I am using Elasticsearch version 5.5.2.
In Painless scripts, the list's remove() method, when given an integer, removes by index, not by value.
Here's a working example that removes array elements by value in an Elasticsearch script:
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
if (ctx._source[params.array_attribute] != null) {
for (int i=ctx._source[params.array_attribute].length-1; i>=0; i--) {
if (ctx._source[params.array_attribute][i] == params.value_to_remove) {
ctx._source[params.array_attribute].remove(i);
}
}
}
""",
"params": {
"array_attribute": "<NAME_OF_ARRAY_PROPERTY_TO_REMOVE_VALUE_IN>",
"value_to_remove": "<VALUE_TO_REMOVE_FROM_ARRAY>"
}
}
}
You might want to simplify the script if it only needs to remove values from one specific array attribute. For example, removing "green" from a document's color_list array:
_doc/001 = {
"color_list": ["red", "blue", "green"]
}
Script to remove "green":
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
for (int i=ctx._source.color_list.length-1; i>=0; i--) {
if (ctx._source.color_list[i] == params.color_to_remove) {
ctx._source.color_list.remove(i);
}
}
""",
"params": {
"color_to_remove": "green"
}
}
}
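The backwards loop in the scripts above matters because removing an element shifts every later element one position down; iterating from the end keeps the remaining indexes valid. The same idea can be sketched in plain JavaScript, outside Elasticsearch. The function name removeByValue is hypothetical, and the element-wise comparison for coordinate pairs is an assumption about how a [lon, lat] entry would be matched.

```javascript
// Compares scalars directly and arrays element by element.
function sameValue(a, b) {
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((item, i) => item === b[i]);
  }
  return a === b;
}

// Removes every occurrence of `target` from `list`, in place.
// Iterating from the end means a removal never shifts an unvisited element.
function removeByValue(list, target) {
  for (let i = list.length - 1; i >= 0; i--) {
    if (sameValue(list[i], target)) {
      list.splice(i, 1);
    }
  }
  return list;
}

const colors = removeByValue(["red", "green", "green", "blue"], "green");
const locations = removeByValue(
  [[72.79887719999999, 21.193036000000003], [-74.0, 41.12121]],
  [-74.0, 41.12121]
);
console.log(colors, locations);
```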
Unlike add(), remove() takes the index of the element and removes it.
Your ctx._source.locations in Painless is an ArrayList. It has List's remove() method:
E remove(int index)
Removes the element at the specified position in this list (optional operation). ...
See Painless API - List for other methods.
See this answer for example code.
"script" : {
"lang":"painless",
"inline":"ctx._source.locations.remove(params.tag)",
"params":{
"tag":indexToRemove
}
}
With ctx._source.locations.add(elt) you add an element by value; with ctx._source.locations.remove(indexToRemove) you remove the element at the given index of the array.

Elasticsearch: sort by max value in array

Let's say I have 2 documents:
{
"id": "1234",
"things": [
{
"datetime": "2016-01-01T12:00:00+03:00"
},
{
"datetime": "2016-01-06T12:00:00+03:00"
},
{
"datetime": "2100-01-01T12:00:00+03:00"
}
]
}
and
{
"id": "5678",
"things": [
{
"datetime": "2016-01-03T12:00:00+03:00"
},
{
"datetime": "2100-01-06T12:00:00+03:00"
}
]
}
things.datetime is mapped as { "type": "date", "format": "date_time_no_millis" }.
I want to sort these documents based on the latest things.datetime value that is not in the future.
I.e. sorted by simply the max things.datetime would use the dates 2100-01-01T12:00:00+03:00 and 2100-01-06T12:00:00+03:00. I want the sorting to be based on the values 2016-01-06T12:00:00+03:00 and 2016-01-03T12:00:00+03:00.
How can I achieve this, using ElasticSearch 2.x?
I've tried:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max"
}
}
But that doesn't seem to sort even by the 2100 dates.
I also tried to use nested_filter like so:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max",
"nested_filter": {
"range": {
"things.datetime": { "lte": "now" }
}
}
}
}
But it doesn't work as I'd expect.
Also the "sort" value in the response is a negative number. So for a document with dates:
"2015-10-24T05:50:00+03:00",
"2015-10-26T22:05:48+02:00",
"2015-10-24T08:05:43+03:00"
gets a negative sort value:
"sort": [
-9223372036854775808
]
The correct way to achieve this seems to be:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max",
"nested_path": "things",
"nested_filter": {
"range": {
"things.datetime": { "lte": "now" }
}
}
}
}
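This requires things to be mapped as a nested type. When no dates are left after the nested_filter is applied, the sort value becomes -9223372036854775808 (the minimum 64-bit long value), which places such documents last in descending order.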
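The intended ordering can be checked outside Elasticsearch with a small simulation. This is a sketch only: the function names sortKey and sortByLatestPast are hypothetical, the current time is passed in explicitly, and negative infinity stands in for the minimum long value used when no date passes the filter.

```javascript
// Sort key: the latest things.datetime that is not in the future,
// or -Infinity when every date is in the future (stand-in for Long.MIN_VALUE).
function sortKey(doc, nowMs) {
  const past = doc.things
    .map((thing) => Date.parse(thing.datetime))
    .filter((ms) => ms <= nowMs);
  return past.length > 0 ? Math.max(...past) : Number.NEGATIVE_INFINITY;
}

// Descending order by sort key; equal keys keep their relative order.
function sortByLatestPast(docs, nowMs) {
  return [...docs].sort((a, b) => {
    const keyA = sortKey(a, nowMs);
    const keyB = sortKey(b, nowMs);
    if (keyA === keyB) return 0;
    return keyB > keyA ? 1 : -1;
  });
}

// The two documents from the question.
const docA = {
  id: "1234",
  things: [
    { datetime: "2016-01-01T12:00:00+03:00" },
    { datetime: "2016-01-06T12:00:00+03:00" },
    { datetime: "2100-01-01T12:00:00+03:00" },
  ],
};
const docB = {
  id: "5678",
  things: [
    { datetime: "2016-01-03T12:00:00+03:00" },
    { datetime: "2100-01-06T12:00:00+03:00" },
  ],
};

const ordered = sortByLatestPast([docB, docA], Date.now());
console.log(ordered.map((doc) => doc.id));
```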
