Remove elements/objects From Array in ElasticSearch Followed by Matching Query - arrays

I'm having issues trying to remove elements/objects from an array in elasticsearch.
This is the mapping for the index:
{
"example1": {
"mappings": {
"doc": {
"properties": {
"locations": {
"type": "geo_point"
},
"postDate": {
"type": "date"
},
"status": {
"type": "long"
},
"user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
And this is an example document.
{
"_index": "example1",
"_type": "doc",
"_id": "8036",
"_score": 1,
"_source": {
"user": "kimchy8036",
"postDate": "2009-11-15T13:12:00",
"locations": [
[
72.79887719999999,
21.193036000000003
],
[
-1.8262150000000001,
51.178881999999994
]
]
}
}
Using the query below, I can add multiple locations.
POST /example1/_update_by_query
{
"query": {
"match": {
"_id": "3"
}
},
"script": {
"lang": "painless",
"inline": "ctx._source.locations.add(params.newsupp)",
"params": {
"newsupp": [
-74.00,
41.12121
]
}
}
}
But I'm not able to remove array objects from locations. I have tried the query below but it's not working.
POST example1/doc/3/_update
{
"script": {
"lang": "painless",
"inline": "ctx._source.locations.remove(params.tag)",
"params": {
"tag": [
-74.00,
41.12121
]
}
}
}
Kindly let me know where I am going wrong here. I am using Elasticsearch version 5.5.2.

In Painless scripts, the remove() method removes by index, not by value.
Here's a working example that removes array elements by value in an Elasticsearch script:
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
if (ctx._source[params.array_attribute] != null) {
for (int i=ctx._source[params.array_attribute].length-1; i>=0; i--) {
if (ctx._source[params.array_attribute][i] == params.value_to_remove) {
ctx._source[params.array_attribute].remove(i);
}
}
}
""",
"params": {
"array_attribute": "<NAME_OF_ARRAY_PROPERTY_TO_REMOVE_VALUE_IN>",
"value_to_remove": "<VALUE_TO_REMOVE_FROM_ARRAY>",
}
}
}
You might want to simplify the script if it only needs to remove values from one specific array attribute. For example, removing "green" from a document's color_list array:
_doc/001 = {
"color_list": ["red", "blue", "green"]
}
Script to remove "green":
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
for (int i=ctx._source.color_list.length-1; i>=0; i--) {
if (ctx._source.color_list[i] == params.color_to_remove) {
ctx._source.color_list.remove(i);
}
}
""",
"params": {
"color_to_remove": "green"
}
}
}

Unlike add(), remove() takes the index of the element and removes it.
Your ctx._source.locations in Painless is an ArrayList, so it has List's remove() method:
E remove(int index)
Removes the element at the specified position in this list (optional operation). ...
See Painless API - List for other methods.
See this answer for example code.

"script" : {
"lang":"painless",
"inline":"ctx._source.locations.remove(params.tag)",
"params":{
"tag":indexToRemove
}
}
Whereas ctx._source.locations.add(elt) appends an element, ctx._source.locations.remove(indexToRemove) removes the element at the given index in the array.
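If you need to remove by value in the original question's setup, one option is to look up the element's index first and then remove it. Below is a minimal sketch against the question's example1 index (an assumption here is that params.tag is exactly equal, element by element, to the stored location; otherwise indexOf returns -1 and nothing is removed):
POST example1/doc/3/_update
{
"script": {
"lang": "painless",
"inline": "int i = ctx._source.locations.indexOf(params.tag); if (i >= 0) { ctx._source.locations.remove(i); }",
"params": {
"tag": [
-74.00,
41.12121
]
}
}
}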

Related

Multikey partial index not used with elemMatch

Consider the following document format which has an array field tasks holding embedded documents
{
"foo": "bar",
"tasks": [
{
"status": "sleep",
"id": "1"
},
{
"status": "active",
"id": "2"
}
]
}
There exists a partial index on key tasks.id
{
"v": 2,
"unique": true,
"key": {
"tasks.id": 1
},
"name": "tasks.id_1",
"partialFilterExpression": {
"tasks.id": {
"$exists": true
}
},
"ns": "zardb.quxcollection"
}
The following $elemMatch query with multiple conditions on the same array element
db.quxcollection.find(
{
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
does not seem to use the index
"winningPlan": {
"stage": "COLLSCAN",
"filter": {
"tasks": {
"$elemMatch": {
"$and": [{
"id": {
"$eq": "1"
}
},
{
"status": {
"$not": {
"$eq": "active"
}
}
}
]
}
}
},
"direction": "forward"
}
How can I make the above query use the index? The index does seem to be used via dot notation
db.quxcollection.find({"tasks.id": "1"})
however I need the same array element to match multiple conditions which includes the status field, and the following does not seem to be equivalent to the above $elemMatch based query
db.quxcollection.find({
"tasks.id": "1",
"tasks.status": { "$nin": ["active"] }
})
Partial indexes are keyed on the field path, and with $elemMatch the path does not appear explicitly in the query. If you check it with .explain("allPlansExecution"), the index is not even considered by the query planner.
To benefit from the index you can specify the path in the query:
db.quxcollection.find(
{
"tasks.id": "1",
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
This duplicates part of the $elemMatch condition, so the index is used to fetch all documents containing a task with the given id, and documents with "active" tasks are then filtered out at the fetch stage. I must admit the query doesn't look nice, so maybe add some comments to the code with explanations.

ElasticSearch sort array size incoherent results

I am trying to sort by array size in ElasticSearch 7.1.
I indexed the following data without creating any custom mapping:
{
"myarray": [{
"field": {
"value": "test"
}
}]
}
When I look at the mapping, it is giving me:
{
"properties": {
"myarray": {
"properties": {
"field": {
"properties": {
"value": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
Now I want to query the index and sort by the highest number of elements in myarray. I have tried doing:
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value') ? doc['myarray.field.value'].values.size() : 0"
}
}
}
which gives me an error like Fielddata is disabled on text fields by default. [...] Alternatively use a keyword field instead. So I tried with:
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value.keyword') ? doc['myarray.field.value.keyword'].values.size() : 0"
}
}
}
which gives me the error Illegal list shortcut value [values]. So then I tried (removing the values keyword):
{
"sort": {
"_script": {
"type": "number",
"order": "desc",
"script": "doc.containsKey('myarray.field.value.keyword') ? doc['myarray.field.value.keyword'].size() : 0"
}
}
}
and it works; however, some results are sorted nicely and then suddenly an element that should be at the top appears in the middle.
Is that because it is sorting by the length of the value as a string and not the length of myarray?
This is because the text type does not support sorting; to sort on this field you must map the array field with the keyword type.
For more info and syntax, please refer to: https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-request-sort.html
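To illustrate (this mapping is my own sketch, mirroring the dynamic mapping shown in the question; the index name my_index is hypothetical), an explicit mapping that stores the value as a keyword, which the script sort can then read through doc values:
PUT my_index
{
"mappings": {
"properties": {
"myarray": {
"properties": {
"field": {
"properties": {
"value": {
"type": "keyword"
}
}
}
}
}
}
}
}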

Multiple match_phrase conditions with another bool in a single ElasticSearch query?

I am trying to conduct an Elasticsearch query that searches a text field ("body") and returns items that match at least one of two multi-word phrases I provide (i.e. "stack overflow" OR "the stackoverflow"). I would also like the query to only return results that occur after a given timestamp, with the results ordered by time.
My current solution is below. I believe the MUST is working correctly (gte a timestamp), but the BOOL + SHOULD with two match_phrases is not correct. I am getting the following error:
Unexpected character ('{' (code 123)): was expecting double-quote to start field name
Which I think is because I have two match_phrases in there?
This is the ES mapping, and the details of the ES API I am using are here.
{"query":
{"bool":
{"should":
[{"match_phrase":
{"body":"a+phrase"}
},
{"match_phrase":
{"body":"another+phrase"}
}
]
},
{"bool":
{"must":
[{"range":
{"created_at:
{"gte":"thispage"}
}
}
]}
}
},"size":10000,
"sort":"created_at"
}
I think you were just missing a single " after created_at.
{
"query": {
"bool": {
"must": [
{
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
{
"bool": {
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
]
}
}
]
}
},
"size": 10,
"sort": "created_at"
}
Also, you are allowed to have both must and should as properties of a bool object, so this is also worth trying.
{
"query": {
"bool": {
"must": {
"range": {
"created_at": {
"gte": "1534004694"
}
}
},
"should": [
{
"match_phrase": {
"body": "a+phrase"
}
},
{
"match_phrase": {
"body": "another+phrase"
}
}
]
}
},
"size": 10,
"sort": "created_at"
}
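One caveat worth adding (not from the original answer): when a bool query also contains a must clause, its should clauses only affect scoring unless minimum_should_match is set, so if at least one of the phrases must match, state that explicitly:
{
  "query": {
    "bool": {
      "must": {
        "range": {
          "created_at": {
            "gte": "1534004694"
          }
        }
      },
      "should": [
        { "match_phrase": { "body": "a+phrase" } },
        { "match_phrase": { "body": "another+phrase" } }
      ],
      "minimum_should_match": 1
    }
  },
  "size": 10,
  "sort": "created_at"
}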
On a side note, Postman or any JSON formatter/validator would really help in determining where the error is.

ElasticSearch Painless script: How to iterate in an array of Nested Objects

I am trying to create a script using the script_score of the function_score.
I have several documents whose rankings field is type="nested".
The mapping for the field is:
"rankings": {
"type": "nested",
"properties": {
"rank1": {
"type": "long"
},
"rank2": {
"type": "float"
},
"subject": {
"type": "text"
}
}
}
A sample document is:
"rankings": [
{
"rank1": 1051,
"rank2": 78.5,
"subject": "s1"
},
{
"rank1": 45,
"rank2": 34.7,
"subject": "s2"
}]
What I want to achieve is to iterate over the nested objects of rankings. Actually, I need to use e.g. a for loop in order to find a particular subject and use its rank1 and rank2 to compute something.
So far, I use something like this but it does not seem to work (throwing a Compile error):
"function_score": {
"script_score": {
"script": {
"lang": "painless",
"inline":
"sum = 0;"
"for (item in doc['rankings_cug']) {"
"sum = sum + doc['rankings_cug.rank1'].value;"
"}"
}
}
}
I have also tried the following options:
a for loop using : instead of in: for (item : doc['rankings']), with no success.
a for loop using in but trying to iterate over a specific element of the object, i.e. rank1: for (item in doc['rankings.rank1'].values), which actually compiles but seems to find a zero-length array of rank1.
I have read that the _source element is the one that can return JSON-like objects, but as far as I found out it is not supported in search queries.
Can you please give me some ideas of how to proceed with that?
Thanks a lot.
You can access _source via params._source. This one will work:
PUT /rankings/result/1?refresh
{
"rankings": [
{
"rank1": 1051,
"rank2": 78.5,
"subject": "s1"
},
{
"rank1": 45,
"rank2": 34.7,
"subject": "s2"
}
]
}
POST rankings/_search
{
"query": {
"match": {
"_id": "1"
}
},
"script_fields": {
"script_score": {
"script": {
"lang": "painless",
"inline": "double sum = 0.0; for (item in params._source.rankings) { sum += item.rank2; } return sum;"
}
}
}
}
DELETE rankings
Unfortunately, Elasticsearch scripting in general does not support the ability to access nested documents in this way (including Painless). Perhaps consider a different structure for your mappings, where rankings are stored in multi-valued fields, if you need to be able to iterate across them in such a way. Ultimately, the nested data will need to be denormalized and put into the parent documents to be able to get scores in the way described here.
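To make that concrete, here is a minimal sketch of one possible denormalized shape (the rankings_flat index and the rank1_all field are my own illustration, not from the question): the values needed for scoring are copied into a plain multi-valued field on the parent document at index time, where a script_score can reach them through doc values:
PUT rankings_flat/result/1?refresh
{
"rank1_all": [1051, 45],
"rankings": [
{ "rank1": 1051, "rank2": 78.5, "subject": "s1" },
{ "rank1": 45, "rank2": 34.7, "subject": "s2" }
]
}
POST rankings_flat/_search
{
"query": {
"function_score": {
"query": { "match_all": {} },
"script_score": {
"script": {
"lang": "painless",
"inline": "long sum = 0; for (def v : doc['rank1_all']) { sum = sum + v; } return sum;"
}
}
}
}
}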
For nested objects in an array, I iterated over the items and it worked.
Following is my sample data in elasticsearch index:
{
"_index": "activity_index",
"_type": "log",
"_id": "AVjx0UTvgHp45Y_tQP6z",
"_version": 4,
"found": true,
"_source": {
"updated": "2016-12-11T22:56:13.548641",
"task_log": [
{
"week_end_date": "2016-12-11",
"log_hours": 16,
"week_start_date": "2016-12-05"
},
{
"week_start_date": "2016-03-21",
"log_hours": 0,
"week_end_date": "2016-03-27"
},
{
"week_start_date": "2016-04-24",
"log_hours": 0,
"week_end_date": "2016-04-30"
}
],
"created": "2016-12-11T22:56:13.548635",
"userid": 895,
"misc": {
},
"current": false,
"taskid": 1023829
}
}
Here is the "Painless" script to iterate over nested objects:
{
"script": {
"lang": "painless",
"inline":
"boolean contains(def x, def y) {
for (item in x) {
if (item['week_start_date'] == y){
return true
}
}
return false
}
if (!contains(ctx._source.task_log, params.start_time_param)) {
ctx._source.task_log.add(params.week_object)
}",
"params": {
"start_time_param": "2016-04-24",
"week_object": {
"week_start_date": "2016-04-24",
"week_end_date": "2016-04-30",
"log_hours": 0
}
}
}
}
Used the above script for the update call: /activity_index/log/AVjx0UTvgHp45Y_tQP6z/_update
In the script, I created a function called 'contains' with two arguments and then called it.
The old Groovy style ctx._source.task_log.contains() will not work, since ES 5.x stores nested objects in separate documents. Hope this helps!

Get object hit when searching array in Elasticsearch

I am trying to get an object out of a JSON array that is stored in elasticsearch. The layout is like this:
[
object{}
object{}
object{}
]
What I need is, when I do a search and it hits on one of these objects, to get the specific object it matched. Currently, using the Java API, I am searching with:
QueryBuilder qb = QueryBuilders.boolQuery()
.should(QueryBuilders.matchQuery("text", "pottery").boost(5)
.minimumShouldMatch("1"));
SearchResponse response = client.prepareSearch("stuff")
.setTypes("things")
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.setQuery(qb)
.setPostFilter(filter)//.setHighlighterQuery(qb)
.addField("places.numbers")
.addField("name")
.addField("city")
.setFrom(0).setSize(60).setExplain(true)
.execute()
.actionGet();
But this will just return the whole object that I hit, or, when I tell it to return the field "places.numbers", it will only return the first object in the "places" array, not the one that was matched in the query.
Thank you for any help!
There are a couple of ways to handle this. I would probably do it with a nested type and inner hits, given what you've shown in your question, but it could also probably be done with the parent/child relationship.
Here is an example with nested docs. I set up a simple index like this:
PUT /test_index
{
"mappings": {
"parent_doc": {
"properties": {
"parent_name": {
"type": "string"
},
"nested_docs": {
"type": "nested",
"properties": {
"nested_name": {
"type": "string"
}
}
}
}
}
}
}
Then added a couple of simple documents:
POST /test_index/parent_doc/_bulk
{"index":{"_id":1}}
{"parent_name":"p1","nested_docs":[{"nested_name":"n1"},{"nested_name":"n2"}]}
{"index":{"_id":2}}
{"parent_name":"p2","nested_docs":[{"nested_name":"n3"},{"nested_name":"n4"}]}
And now I can search like this, using "inner_hits":
POST /test_index/_search
{
"query": {
"nested": {
"path": "nested_docs",
"query": {
"match": {
"nested_docs.nested_name": "n3"
}
},
"inner_hits" : {}
}
}
}
which returns:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.098612,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "2",
"_score": 2.098612,
"_source": {
"parent_name": "p2",
"nested_docs": [
{
"nested_name": "n3"
},
{
"nested_name": "n4"
}
]
},
"inner_hits": {
"nested_docs": {
"hits": {
"total": 1,
"max_score": 2.098612,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "2",
"_nested": {
"field": "nested_docs",
"offset": 0
},
"_score": 2.098612,
"_source": {
"nested_name": "n3"
}
}
]
}
}
}
}
]
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/ef7debf436fec2a10097ba2106d5ff30ff8d7c77
