Get object hit when searching array in Elasticsearch

I am trying to get an object out of a JSON array that is stored in elasticsearch. The layout is like this:
[
object{}
object{}
object{}
]
What I need is this: when a search hits one of these objects, I want to get back the specific object that matched. Currently, using the Java API, I am searching with:
QueryBuilder qb = QueryBuilders.boolQuery()
.should(QueryBuilders.matchQuery("text", "pottery").boost(5)
.minimumShouldMatch("1"));
SearchResponse response = client.prepareSearch("stuff")
.setTypes("things")
.setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
.setQuery(qb)
.setPostFilter(filter)//.setHighlighterQuery(qb)
.addField("places.numbers")
.addField("name")
.addField("city")
.setFrom(0).setSize(60).setExplain(true)
.execute()
.actionGet();
But this just returns the whole document that was hit, or, when I tell it to return the field "places.numbers", only the first object in the "places" array, not the one that matched the query.
Thank you for any help!

There are a couple of ways to handle this. I would probably do it with a nested type and inner hits, given what you've shown in your question, but it could also probably be done with the parent/child relationship.
Here is an example with nested docs. I set up a simple index like this:
PUT /test_index
{
"mappings": {
"parent_doc": {
"properties": {
"parent_name": {
"type": "string"
},
"nested_docs": {
"type": "nested",
"properties": {
"nested_name": {
"type": "string"
}
}
}
}
}
}
}
Then added a couple of simple documents:
POST /test_index/parent_doc/_bulk
{"index":{"_id":1}}
{"parent_name":"p1","nested_docs":[{"nested_name":"n1"},{"nested_name":"n2"}]}
{"index":{"_id":2}}
{"parent_name":"p2","nested_docs":[{"nested_name":"n3"},{"nested_name":"n4"}]}
And now I can search like this, using "inner_hits":
POST /test_index/_search
{
"query": {
"nested": {
"path": "nested_docs",
"query": {
"match": {
"nested_docs.nested_name": "n3"
}
},
"inner_hits" : {}
}
}
}
which returns:
{
"took": 4,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 2.098612,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "2",
"_score": 2.098612,
"_source": {
"parent_name": "p2",
"nested_docs": [
{
"nested_name": "n3"
},
{
"nested_name": "n4"
}
]
},
"inner_hits": {
"nested_docs": {
"hits": {
"total": 1,
"max_score": 2.098612,
"hits": [
{
"_index": "test_index",
"_type": "parent_doc",
"_id": "2",
"_nested": {
"field": "nested_docs",
"offset": 0
},
"_score": 2.098612,
"_source": {
"nested_name": "n3"
}
}
]
}
}
}
}
]
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/ef7debf436fec2a10097ba2106d5ff30ff8d7c77
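Reading the matched object back out of that response comes down to walking hits.hits[].inner_hits.<path>.hits.hits. A minimal JavaScript sketch over the parsed JSON; the helper name matchedNestedObjects is made up for illustration and is not part of any client library:

```javascript
// Collect the nested objects that actually matched, per parent document.
// `response` is a parsed search response; `path` is the nested field name.
function matchedNestedObjects(response, path) {
  const results = [];
  for (const hit of response.hits.hits) {
    const inner = hit.inner_hits && hit.inner_hits[path];
    if (!inner) continue; // parent came back without inner hits for this path
    for (const innerHit of inner.hits.hits) {
      results.push({
        parentId: hit._id,
        offset: innerHit._nested.offset, // position inside the parent's array
        source: innerHit._source,        // the matched nested object itself
      });
    }
  }
  return results;
}

// Trimmed copy of the response shown above.
const sampleResponse = {
  hits: {
    hits: [
      {
        _id: "2",
        _source: {
          parent_name: "p2",
          nested_docs: [{ nested_name: "n3" }, { nested_name: "n4" }],
        },
        inner_hits: {
          nested_docs: {
            hits: {
              hits: [
                {
                  _id: "2",
                  _nested: { field: "nested_docs", offset: 0 },
                  _source: { nested_name: "n3" },
                },
              ],
            },
          },
        },
      },
    ],
  },
};

console.log(matchedNestedObjects(sampleResponse, "nested_docs"));
```

For the sample above this yields a single entry for parent "2" whose source is the n3 object, rather than the whole nested_docs array.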

Related

useQuery's onCompleted being called with cached value

Hopefully I can articulate this question clearly without too much code as it's difficult to extract the pieces from my codebase.
I was observing odd behavior yesterday with useQuery that I can't seem to understand. I think I understand Apollo's cache pretty well but this particular behavior doesn't make sense to me. I have a query that looks something like this:
query {
reservations {
priceBreakdown {
sections {
id
name
total
}
}
}
}
The schema is something like:
type Query {
reservations: [Reservation]
}
type Reservation {
priceBreakdown: PriceBreakdown
}
type PriceBreakdown {
sections: [Section]
}
type Section {
id: String
name: String
total: Float
}
That id on Section is not a proper ID and, in fact, is not unique. It's just a string and all PriceBreakdowns have a list of Sections that contain the same ID. I've pointed this out to the backend folks and it's being fixed but I realize this causes incorrect caching with Apollo since there will be collisions w.r.t. __typename and id. My confusion comes from how onCompleted is called. I noticed when doing
const { data } = useQuery(myQuery, {
onCompleted: console.log
})
that when the network call returns, all PriceBreakdowns are unique and correct, as they should be. But when onCompleted is called with what I thought would be that same API data, it's different and seems to reflect the cached values. In case that's confusing, here are the two results. First is straight from the API and second is the log from onCompleted:
// api results
"data": [
{
"id": "92267",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$60.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$51.00",
"id": "HOST_TOTAL"
}
]
}
},
{
"id": "92266",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$30.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$25.50",
"id": "HOST_TOTAL"
}
]
}
}
]
// onCompleted log
"data": [
{
"id": "92267",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$60.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$51.00",
"id": "HOST_TOTAL"
}
]
}
},
{
"id": "92266",
"price_breakdown": {
"sections": [
{
"name": "Reservation",
"total": "$60.00",
"id": "RESERVATION"
},
{
"name": "Promotions and Fees",
"total": null,
"id": "PROMOTIONS_AND_FEES"
},
{
"name": "Total",
"total": "$51.00",
"id": "HOST_TOTAL"
}
]
}
}
]
As you can see, in the onCompleted log, the Sections that had the same ID as Sections from the previous record are duplicated, suggesting Apollo is rebuilding the payload from cache and calling onCompleted with that. Is that what's happening? If I set the fetchPolicy to no-cache, the results are correct, but of course that's just a patch for the problem. I want to better understand Apollo because I thought I understood and now I see something unintuitive. I wouldn't have expected onCompleted to be called with something built from the cache. Thanks in advance.
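The collision described above can be reproduced with a toy model of an id-keyed normalized store. This is only a sketch and not Apollo's actual code: it assumes entries are keyed by `Section:<id>`, and the function names are hypothetical. Which copy survives depends on write order in this toy; the point is only that both reservations end up reading the same entries.

```javascript
// Toy normalized store: every Section is stored once under "Section:<id>".
function writeReservations(store, reservations) {
  return reservations.map((r) => ({
    id: r.id,
    sectionRefs: r.sections.map((s) => {
      const key = `Section:${s.id}`;
      store.set(key, { ...s }); // same key: a later write replaces the earlier one
      return key;
    }),
  }));
}

// Rebuild the payload by following the stored references.
function readReservations(store, normalized) {
  return normalized.map((r) => ({
    id: r.id,
    sections: r.sectionRefs.map((key) => store.get(key)),
  }));
}

const store = new Map();
const apiData = [
  {
    id: "92267",
    sections: [
      { id: "RESERVATION", total: "$60.00" },
      { id: "HOST_TOTAL", total: "$51.00" },
    ],
  },
  {
    id: "92266",
    sections: [
      { id: "RESERVATION", total: "$30.00" },
      { id: "HOST_TOTAL", total: "$25.50" },
    ],
  },
];

const rebuilt = readReservations(store, writeReservations(store, apiData));
console.log(JSON.stringify(rebuilt, null, 2));
```

Although the input holds four distinct sections, the store ends up with two entries, so both rebuilt reservations show identical totals.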

Multikey partial index not used with elemMatch

Consider the following document format which has an array field tasks holding embedded documents
{
"foo": "bar",
"tasks": [
{
"status": "sleep",
"id": "1"
},
{
"status": "active",
"id": "2"
}
]
}
There exists a partial index on key tasks.id
{
"v": 2,
"unique": true,
"key": {
"tasks.id": 1
},
"name": "tasks.id_1",
"partialFilterExpression": {
"tasks.id": {
"$exists": true
}
},
"ns": "zardb.quxcollection"
}
The following $elemMatch query with multiple conditions on the same array element
db.quxcollection.find(
{
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
does not seem to use the index
"winningPlan": {
"stage": "COLLSCAN",
"filter": {
"tasks": {
"$elemMatch": {
"$and": [{
"id": {
"$eq": "1"
}
},
{
"status": {
"$not": {
"$eq": "active"
}
}
}
]
}
}
},
"direction": "forward"
}
How can I make the above query use the index? The index does seem to be used via dot notation
db.quxcollection.find({"tasks.id": "1"})
however I need the same array element to match multiple conditions which includes the status field, and the following does not seem to be equivalent to the above $elemMatch based query
db.quxcollection.find({
"tasks.id": "1",
"tasks.status": { "$nin": ["active"] }
})
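The difference between the two queries can be spelled out with plain JavaScript predicates over the sample document. This is a sketch of the matching semantics as I understand them, not MongoDB code, and the function names are illustrative:

```javascript
const doc = {
  foo: "bar",
  tasks: [
    { status: "sleep", id: "1" },
    { status: "active", id: "2" },
  ],
};

// $elemMatch: one single array element must satisfy every condition.
const elemMatch = (d) =>
  d.tasks.some((t) => t.id === "1" && t.status !== "active");

// Dot notation: each condition is evaluated against the array as a whole.
// "tasks.id": "1" needs any element with id "1"; $nin on "tasks.status"
// needs that no element at all has status "active".
const dotNotation = (d) =>
  d.tasks.some((t) => t.id === "1") &&
  d.tasks.every((t) => t.status !== "active");

console.log(elemMatch(doc), dotNotation(doc)); // true false
```

The sample document matches the $elemMatch form because task "1" is asleep, but fails the dot-notation form because a different task is active.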

Partial indexes are matched by field path: the indexed path (here tasks.id) acts as the key the planner looks for in the query. With $elemMatch alone, that path does not appear explicitly in the query. If you check with .explain("allPlansExecution"), the index is not even considered by the query planner.
To benefit from the index you can specify the path in the query:
db.quxcollection.find(
{
"tasks.id": "1",
"tasks": {
"$elemMatch": {
"id": {
"$eq": "1"
},
"status": {
"$nin": ["active"]
}
}
}
}).explain()
This duplicates part of the $elemMatch condition: the index is used to find the documents containing a task with the given id, and the $elemMatch is then applied as a filter at the fetch stage, so only documents where that same task is not "active" remain. I must admit the query doesn't look nice, so maybe add a comment in the code explaining why.
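The two stages can be pictured roughly like this. It is a sketch in plain JavaScript, with a Map standing in for the unique index on tasks.id; buildTaskIdIndex and findTask are hypothetical names, not MongoDB APIs:

```javascript
// Stand-in for the unique index on "tasks.id": task id -> owning document.
function buildTaskIdIndex(collection) {
  const index = new Map();
  for (const doc of collection) {
    for (const task of doc.tasks) index.set(task.id, doc);
  }
  return index;
}

function findTask(index, id, excludedStatuses) {
  // Stage 1, "index scan": jump straight to the document holding this task id.
  const candidate = index.get(id);
  if (!candidate) return [];
  // Stage 2, "fetch filter": the $elemMatch condition on that same element.
  const matches = candidate.tasks.some(
    (t) => t.id === id && !excludedStatuses.includes(t.status)
  );
  return matches ? [candidate] : [];
}

const collection = [
  {
    foo: "bar",
    tasks: [
      { status: "sleep", id: "1" },
      { status: "active", id: "2" },
    ],
  },
];
const taskIdIndex = buildTaskIdIndex(collection);

console.log(findTask(taskIdIndex, "1", ["active"]).length); // 1
console.log(findTask(taskIdIndex, "2", ["active"]).length); // 0
```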

Remove elements/objects From Array in ElasticSearch Followed by Matching Query

I'm having issues trying to remove elements/objects from an array in elasticsearch.
This is the mapping for the index:
{
"example1": {
"mappings": {
"doc": {
"properties": {
"locations": {
"type": "geo_point"
},
"postDate": {
"type": "date"
},
"status": {
"type": "long"
},
"user": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
And this is an example document.
{
"_index": "example1",
"_type": "doc",
"_id": "8036",
"_score": 1,
"_source": {
"user": "kimchy8036",
"postDate": "2009-11-15T13:12:00",
"locations": [
[
72.79887719999999,
21.193036000000003
],
[
-1.8262150000000001,
51.178881999999994
]
]
}
}
Using the query below, I can add multiple locations.
POST /example1/_update_by_query
{
"query": {
"match": {
"_id": "3"
}
},
"script": {
"lang": "painless",
"inline": "ctx._source.locations.add(params.newsupp)",
"params": {
"newsupp": [
-74.00,
41.12121
]
}
}
}
But I'm not able to remove array objects from locations. I have tried the query below but it's not working.
POST example1/doc/3/_update
{
"script": {
"lang": "painless",
"inline": "ctx._source.locations.remove(params.tag)",
"params": {
"tag": [
-74.00,
41.12121
]
}
}
}
Kindly let me know what I am doing wrong here. I am using Elasticsearch version 5.5.2.

In Painless scripts, a list's remove() method removes by index, not by value.
Here's a working example that removes array elements by value in Elasticsearch script:
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
if (ctx._source[params.array_attribute] != null) {
for (int i=ctx._source[params.array_attribute].length-1; i>=0; i--) {
if (ctx._source[params.array_attribute][i] == params.value_to_remove) {
ctx._source[params.array_attribute].remove(i);
}
}
}
""",
"params": {
"array_attribute": "<NAME_OF_ARRAY_PROPERTY_TO_REMOVE_VALUE_IN>",
"value_to_remove": "<VALUE_TO_REMOVE_FROM_ARRAY>"
}
}
}
You might want to simplify the script if it only ever needs to remove values from one specific array attribute. For example, removing "green" from a document's color_list array:
_doc/001 = {
"color_list": ["red", "blue", "green"]
}
Script to remove "green":
POST objects/_update_by_query
{
"query": {
... // use regular ES query to remove only in relevant documents
},
"script": {
"source": """
for (int i=ctx._source.color_list.length-1; i>=0; i--) {
if (ctx._source.color_list[i] == params.color_to_remove) {
ctx._source.color_list.remove(i);
}
}
""",
"params": {
"color_to_remove": "green"
}
}
}
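The backwards loop in those scripts matters: removing an element shifts everything after it, so walking from the end never skips an unchecked element. The same logic in plain JavaScript, outside Elasticsearch, looks roughly like this. The helper names are illustrative, and sameValue is only needed because JavaScript compares arrays by reference:

```javascript
// Elementwise equality so that [lon, lat] pairs compare by content.
function sameValue(a, b) {
  if (Array.isArray(a) && Array.isArray(b)) {
    return a.length === b.length && a.every((x, i) => sameValue(x, b[i]));
  }
  return a === b;
}

// Walk backwards so that removing index i never shifts an element
// that has not been checked yet.
function removeByValue(list, value) {
  for (let i = list.length - 1; i >= 0; i--) {
    if (sameValue(list[i], value)) list.splice(i, 1);
  }
  return list;
}

console.log(removeByValue(["red", "blue", "green"], "green"));
// [ 'red', 'blue' ]

const locations = [
  [72.79887719999999, 21.193036000000003],
  [-74.0, 41.12121],
];
console.log(removeByValue(locations, [-74.0, 41.12121]).length); // 1
```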

Unlike add(), remove() takes the index of the element and removes it.
Your ctx._source.locations in painless is an ArrayList. It has List's remove() method:
E remove(int index)
Removes the element at the specified position in this list (optional operation). ...
See Painless API - List for other methods.
See this answer for example code.

"script" : {
"lang":"painless",
"inline":"ctx._source.locations.remove(params.tag)",
"params":{
"tag":indexToRemove
}
}
Whereas ctx._source.locations.add(elt) adds the element itself, ctx._source.locations.remove(indexToRemove) removes the element at the given index in the array.

Elasticsearch - value in array filter

I want to filter out all documents which contain a specific value in an array field. I.e. the value is an element of that array field.
To be specific: I want to select all documents whose names field contains test-name; see the example below.
So when I do an empty search with
curl -XGET localhost:9200/test-index/_search
the result is
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 50,
"max_score": 1,
"hits": [
{
"_index": "test-index",
"_type": "test",
"_id": "34873ae4-f394-42ec-b2fc-41736e053c69",
"_score": 1,
"_source": {
"names": [
"test-name"
],
"age": 100,
...
}
},
...
}
}
But in case of a more specific query
curl -XPOST localhost:9200/test-index/_search -d '{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": {
"term": {
"names": "test-name"
}
}
}
}
}'
I don't get any results
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
There are some questions similar to this one. Although, I cannot get any of the answers to work for me.
System specs: Elasticsearch 5.1.1, Ubuntu 16.04
EDIT
curl -XGET localhost:9200/test-index
...
"names": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
},
...

That's because the names field is analyzed and test-name gets indexed as two tokens, test and name.
Searching for the term test-name will hence not yield anything. If you use match instead, you'll get the document.
If you want to check for the exact value test-name (i.e. the two tokens one after another), then you need to change your names field to a keyword type instead of text.
UPDATE
According to your mapping, the names field is analyzed, so you need to use the names.keyword field instead. It will work like this:
curl -XPOST localhost:9200/test-index/_search -d '{
"query": {
"bool": {
"must": {
"match_all": {}
},
"filter": {
"term": {
"names.keyword": "test-name"
}
}
}
}
}'
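To see why the term filter behaves differently on the two fields, here is a rough JavaScript approximation. The analyze function only mimics the lowercasing and splitting described above; the real analyzer has more rules, and all names here are illustrative:

```javascript
// Rough stand-in for analysis of a text field: lowercase the value and
// split on anything that is not a letter or digit.
const analyze = (value) =>
  value.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

const textTokens = analyze("test-name"); // what the `names` field indexes
const keywordTokens = ["test-name"];     // `names.keyword` keeps the value whole

// A term query looks up the exact token and does not analyze its input.
const termMatches = (tokens, term) => tokens.includes(term);

console.log(textTokens);                              // [ 'test', 'name' ]
console.log(termMatches(textTokens, "test-name"));    // false
console.log(termMatches(keywordTokens, "test-name")); // true
```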

Elasticsearch: sort by max value in array

Let's say I have 2 documents:
{
"id": "1234",
"things": [
{
"datetime": "2016-01-01T12:00:00+03:00"
},
{
"datetime": "2016-01-06T12:00:00+03:00"
},
{
"datetime": "2100-01-01T12:00:00+03:00"
}
]
}
and
{
"id": "5678",
"things": [
{
"datetime": "2016-01-03T12:00:00+03:00"
},
{
"datetime": "2100-01-06T12:00:00+03:00"
}
]
}
things.datetime is mapped as { "type": "date", "format": "date_time_no_millis" }.
I want to sort these documents based on the latest things.datetime value that is not in the future.
I.e. sorting simply by the max things.datetime would use the dates 2100-01-01T12:00:00+03:00 and 2100-01-06T12:00:00+03:00. I want the sorting to be based on the values 2016-01-06T12:00:00+03:00 and 2016-01-03T12:00:00+03:00 instead.
How can I achieve this, using ElasticSearch 2.x?
I've tried:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max"
}
}
But that doesn't seem to sort even by the 2100 dates.
I also tried to use nested_filter like so:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max",
"nested_filter": {
"range": {
"things.datetime": { "lte": "now" }
}
}
}
}
But it doesn't work as I'd expect.
Also the "sort" value in the response is a negative number. So for a document with dates:
"2015-10-24T05:50:00+03:00",
"2015-10-26T22:05:48+02:00",
"2015-10-24T08:05:43+03:00"
gets a negative sort value:
"sort": [
-9223372036854775808
]

The correct way to achieve this seems to be:
"sort": {
"things.datetime": {
"order": "desc",
"mode": "max",
"nested_path": "things",
"nested_filter": {
"range": {
"things.datetime": { "lte": "now" }
}
}
}
}
When no dates are left after the nested_filter, the sort value falls back to the smallest possible long (the -9223372036854775808 seen in the question), so that those documents come last in descending order.
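The ordering this sort is meant to produce can be checked against a small in-memory version. This is a JavaScript sketch of the intended logic only, not of how Elasticsearch computes it; sortKey and sortByLatestPast are hypothetical names, `now` is passed in explicitly to keep the example deterministic, and -Infinity stands in for the minimum-long fallback:

```javascript
// Sort key: the latest datetime that is not after `now` (epoch millis),
// or -Infinity when every datetime lies in the future.
function sortKey(doc, now) {
  const past = doc.things
    .map((t) => Date.parse(t.datetime))
    .filter((ms) => ms <= now);
  return past.length ? Math.max(...past) : -Infinity;
}

// Descending order by sort key; compares instead of subtracting so that
// two -Infinity keys do not produce NaN.
function sortByLatestPast(docs, now) {
  return [...docs].sort((a, b) => {
    const ka = sortKey(a, now);
    const kb = sortKey(b, now);
    return ka === kb ? 0 : kb > ka ? 1 : -1;
  });
}

const docs = [
  {
    id: "5678",
    things: [
      { datetime: "2016-01-03T12:00:00+03:00" },
      { datetime: "2100-01-06T12:00:00+03:00" },
    ],
  },
  {
    id: "1234",
    things: [
      { datetime: "2016-01-01T12:00:00+03:00" },
      { datetime: "2016-01-06T12:00:00+03:00" },
      { datetime: "2100-01-01T12:00:00+03:00" },
    ],
  },
];

const now = Date.parse("2020-01-01T00:00:00Z");
console.log(sortByLatestPast(docs, now).map((d) => d.id)); // [ '1234', '5678' ]
```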