ElasticSearch: search multiple elements in an array of objects

I'm on Elastic Search 6.8.22
I have multiple users and each one has multiple papers ("valid" or not):
{"name":"Amy",
"papers":[
{"type":"idcard", "country":"fr", "valid":"no"},
{"type":"idcard", "country":"us", "valid":"yes"}
]}
{"name":"Brittany",
"papers":[
{"type":"idcard", "country":"fr", "valid":"no"},
{"type":"idcard", "country":"us", "valid":"no"}
]}
{"name":"Chloe",
"papers":[
{"type":"idcard", "country":"fr", "valid":"yes"},
{"type":"idcard", "country":"us", "valid":"no"}
]}
I'm trying to find only the users that have a paper which is both "valid" and for country "fr":
{"query": {
"bool": {
"filter": [
{"match":{"papers.valid": "yes"}},
{"match":{"papers.country": "fr"}}
]}}}
It returns Chloe, which is fine (she has a paper which is both "valid" and "fr").
But it also returns Amy, because she has one "valid" paper and another one which is "fr".
As far as I understand, this is because ES doesn't really keep the objects of an array separate: it flattens everything into per-field arrays of values.
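For example (as far as I understand from the docs on arrays of objects), Amy's document is effectively indexed as flat per-field arrays, so which value belongs to which paper is lost:
{"name":"Amy",
"papers.type":["idcard","idcard"],
"papers.country":["fr","us"],
"papers.valid":["no","yes"]}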
I've tried using "combined term queries" from this link, but I guess it only works for arrays of "primitives" (not complex objects).
I've seen that I can transform arrays into nested objects to do what I need, but it seems to be overcomplicated and would slow down the queries (because of hidden joins).
My question is:
Is there any way I can search for documents that have, in their array of objects, one object that matches multiple criteria at the same time?
(Originally, I wanted a query that checks whether every paper in the array matches the criteria, e.g. all papers of type "idcard" must be "valid", but that seems impossible.)

You need to define papers as a nested field in the mapping; then you can run a nested query on it:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html
So if, for example, your mapping is this:
{
"mappings": {
"properties": {
"name": {
"type": "keyword"
},
"papers": {
"type": "nested",
"properties": {
"type": {
"type": "keyword"
},
"country": {
"type": "keyword"
},
"valid": {
"type": "keyword"
}
}
}
}
}
}
then this query will work:
{
"query": {
"nested": {
"path": "papers",
"query": {
"bool": {
"filter": [
{
"term": {
"papers.valid": "yes"
}
},
{
"term": {
"papers.country": "fr"
}
}
]
}
}
}
}
}
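Optionally, if you also want to see which paper matched, you can add inner_hits to the nested query (a minimal sketch of the same query):
{
  "query": {
    "nested": {
      "path": "papers",
      "query": {
        "bool": {
          "filter": [
            { "term": { "papers.valid": "yes" } },
            { "term": { "papers.country": "fr" } }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}
Each matching user will then carry an inner_hits section listing the papers objects that satisfied both conditions.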

Related

ElasticSearch Painless script: How to iterate over an array of Nested Objects

I am trying to create a script using script_score inside a function_score query.
I have several documents whose rankings field is type="nested".
The mapping for the field is:
"rankings": {
"type": "nested",
"properties": {
"rank1": {
"type": "long"
},
"rank2": {
"type": "float"
},
"subject": {
"type": "text"
}
}
}
A sample document is:
"rankings": [
{
"rank1": 1051,
"rank2": 78.5,
"subject": "s1"
},
{
"rank1": 45,
"rank2": 34.7,
"subject": "s2"
}]
What I want to achieve is to iterate over the nested objects of rankings; I need e.g. a for loop in order to find a particular subject and use its rank1 and rank2 to compute something.
So far I use something like this, but it does not seem to work (it throws a compile error):
"function_score": {
"script_score": {
"script": {
"lang": "painless",
"inline":
"sum = 0;"
"for (item in doc['rankings_cug']) {"
"sum = sum + doc['rankings_cug.rank1'].value;"
"}"
}
}
}
I have also tried the following options:
a for loop using : instead of in, i.e. for (item : doc['rankings']), with no success;
a for loop using in but iterating over a specific element of the object, i.e. rank1: for (item in doc['rankings.rank1'].values), which actually compiles but seems to find a zero-length array for rank1.
I have read that the _source element is the one that can return JSON-like objects, but as far as I can tell it is not supported in search queries.
Can you please give me some ideas of how to proceed with that?
Thanks a lot.
You can access _source via params._source. This one will work:
PUT /rankings/result/1?refresh
{
"rankings": [
{
"rank1": 1051,
"rank2": 78.5,
"subject": "s1"
},
{
"rank1": 45,
"rank2": 34.7,
"subject": "s2"
}
]
}
POST rankings/_search
{
"query": {
"match": {
"_id": "1"
}
},
"script_fields": {
"script_score": {
"script": {
"lang": "painless",
"inline": "double sum = 0.0; for (item in params._source.rankings) { sum += item.rank2; } return sum;"
}
}
}
}
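For the sample document indexed above, the hit in the response should then contain something like this (only the relevant part shown; 78.5 + 34.7 = 113.2):
"fields": {
  "script_score": [
    113.2
  ]
}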
DELETE rankings
Unfortunately, ElasticSearch scripting in general does not support accessing nested documents in this way (including Painless). Perhaps consider a different structure for your mappings, where rankings are stored in multi-valued fields, if you need to be able to iterate across them in such a way. Ultimately, the nested data will need to be de-normalized and put into the parent documents to be able to get scores in the way described here.
For nested objects in an array, I iterated over the items and it worked.
Following is my sample data in elasticsearch index:
{
"_index": "activity_index",
"_type": "log",
"_id": "AVjx0UTvgHp45Y_tQP6z",
"_version": 4,
"found": true,
"_source": {
"updated": "2016-12-11T22:56:13.548641",
"task_log": [
{
"week_end_date": "2016-12-11",
"log_hours": 16,
"week_start_date": "2016-12-05"
},
{
"week_start_date": "2016-03-21",
"log_hours": 0,
"week_end_date": "2016-03-27"
},
{
"week_start_date": "2016-04-24",
"log_hours": 0,
"week_end_date": "2016-04-30"
}
],
"created": "2016-12-11T22:56:13.548635",
"userid": 895,
"misc": {
},
"current": false,
"taskid": 1023829
}
}
Here is the "Painless" script to iterate over nested objects:
{
"script": {
"lang": "painless",
"inline":
"boolean contains(def x, def y) {
for (item in x) {
if (item['week_start_date'] == y){
return true
}
}
return false
}
if(!contains(ctx._source.task_log, params.start_time_param) {
ctx._source.task_log.add(params.week_object)
}",
"params": {
"start_time_param": "2016-04-24",
"week_object": {
"week_start_date": "2016-04-24",
"week_end_date": "2016-04-30",
"log_hours": 0
}
}
}
}
I used the above script for an update: POST /activity_index/log/AVjx0UTvgHp45Y_tQP6z/_update
In the script, I created a function called 'contains' with two arguments and then called it.
The old Groovy style ctx._source.task_log.contains() will not work, since ES 5.x stores nested objects in separate documents. Hope this helps!

Cloudant find Query with $and and $or elements

I'm using the following JSON to find results in a Cloudant database:
{
"selector": {
"$and": [
{
"type": {
"$eq": "sensor"
}
},
{
"v": {
"$eq": 2355
}
},
{
"$or": [
{
"p": "#401000103"
},
{
"p": "#401000114"
}
]
},
{
"t_max": {
"$gte": 1459554894
}
},
{
"t_min": {
"$lte": 1459509591
}
}
]
},
"fields": [
"_id",
"p"
],
"limit": 200
}
If I run this against my Cloudant database I get the following error:
{
"error": "unknown_error",
"reason": "function_clause",
"ref": 3379914628
}
If I remove one of the $or elements (e.g. ,{"p":"#401000114"}), I get results for the query.
I also get results if I keep only the other element instead.
But when I want to use both elements together I get the error above.
Can anybody tell me what the error reason function_clause means?
error_reason: function_clause means there was a problem on the server; you should probably reach out to Cloudant support and see if they can help you with your issue.
I had contact with Cloudant support. This is their answer:
The issue affects Cloudant generally.
It affects both multi-tenant and dedicated clusters.
They are working on a solution.
A workaround: if the array to which the $or operator applies has two elements, you can get the correct result by repeating one of the items in the array.
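Applied to the query above, the workaround simply repeats one of the two items in the $or array; a sketch of that part of the selector:
{
  "$or": [
    { "p": "#401000103" },
    { "p": "#401000114" },
    { "p": "#401000114" }
  ]
}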

"There is no index available for this selector" despite the fact I made one

In my data, I have two fields that I want to use as an index together. They are sensorid (any string) and timestamp (yyyy-mm-dd hh:mm:ss).
So I made an index for these two using the Cloudant index generator. This was created successfully and it appears as a design document.
{
"index": {
"fields": [
{
"name": "sensorid",
"type": "string"
},
{
"name": "timestamp",
"type": "string"
}
]
},
"type": "text"
}
However, when I try to make the following query to find all documents with a timestamp newer than some value, I am told there is no index available for the selector:
{
"selector": {
"timestamp": {
"$gt": "2015-10-13 16:00:00"
}
},
"fields": [
"_id",
"_rev"
],
"sort": [
{
"_id": "asc"
}
]
}
What have I done wrong?
It seems to me like Cloudant Query only allows sorting on fields that are part of the selector.
Therefore your selector should include the _id field and look like:
"selector":{
"_id":{
"$gt":0
},
"timestamp":{
"$gt":"2015-10-13 16:00:00"
}
}
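Put together with the rest of your original request, the full query body would then look something like this (a sketch, assuming the same index as above):
{
  "selector": {
    "_id": {
      "$gt": 0
    },
    "timestamp": {
      "$gt": "2015-10-13 16:00:00"
    }
  },
  "fields": [
    "_id",
    "_rev"
  ],
  "sort": [
    {
      "_id": "asc"
    }
  ]
}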
I hope this works for you!

Elasticsearch not returning hits for multi-valued field

I am using Elasticsearch with no modifications whatsoever, meaning the mappings, norms, and analyzed/not_analyzed settings are all the default config. I have a very small data set of two items for experimentation purposes. The items have several fields but I query only on one, which is a multi-valued/array-of-strings field. The doc looks like this:
{
"_index": "index_profile",
"_type": "items",
"_id": "ega",
"_version": 1,
"found": true,
"_source": {
"clicked": [
"ega"
],
"profile_topics": [
"Twitter",
"Entertainment",
"ESPN",
"Comedy",
"University of Rhode Island",
"Humor",
"Basketball",
"Sports",
"Movies",
"SnapChat",
"Celebrities",
"Rite Aid",
"Education",
"Television",
"Country Music",
"Seattle",
"Beer",
"Hip Hop",
"Actors",
"David Cameron",
... // other topics
],
"id": "ega"
}
}
A sample query is:
GET /index_profile/items/_search
{
"size": 10,
"query": {
"bool": {
"should": [{
"terms": {
"profile_topics": [
"Basketball"
]
}
}]
}
}
}
Again, there are only two items, and the one listed should match the query because the profile_topics field contains the "Basketball" term. The other item does not match. I only get a result if I ask for clicked = ega in the should.
With Solr I would probably specify that the fields are multi-valued string arrays with no norms and no analyzer, so the profile_topics values are not stemmed or tokenized, since each value should be treated as a single token (even with the spaces). Not sure this would solve the problem, but it is how I treat similar data in Solr.
I assume I have run afoul of some norm/analyzer/TF-IDF issue; if so, how do I solve this so that even with two items the query will return ega? If possible I'd like to solve this index- or type-wide rather than per field.
Basketball (with a capital B) in a terms query will not be analyzed; it will be looked up in the Elasticsearch index exactly as written.
You say you have the defaults. If so, indexing Basketball under the profile_topics field means that the actual term in the index will be basketball (with a lowercase b), which is the result of the standard analyzer. So either set profile_topics as not_analyzed, or search for basketball and not Basketball.
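For example, with the default analyzed mapping, this variant of your query (lowercase term, everything else unchanged) should already return the ega document:
GET /index_profile/items/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "profile_topics": [
              "basketball"
            ]
          }
        }
      ]
    }
  }
}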
Read this about terms.
Regarding setting all the fields to not_analyzed, you could do that with a dynamic template. With a template you can also do what Logstash does: define a .raw subfield for each string field, where only this subfield is not_analyzed. The original/parent field still holds the analyzed version of the same text, in case you want to use the analyzed field in the future.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
"template": "your_indices_name-*",
"mappings": {
"_default_": {
"_all": {
"enabled": true,
"omit_norms": true
},
"dynamic_templates": [
{
"string_fields": {
"match": "*",
"match_mapping_type": "string",
"mapping": {
"type": "string",
"index": "analyzed",
"omit_norms": true,
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
]
}
}
}
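With a template like this in place (and the data re-indexed so the subfield exists), an exact, unanalyzed match would then go against the .raw subfield; a sketch:
GET /index_profile/items/_search
{
  "query": {
    "terms": {
      "profile_topics.raw": [
        "Basketball"
      ]
    }
  }
}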

Elasticsearch array score

I have documents that have an array field, like:
doc1
{
"item_type":"bag",
"color":["red","blue","green","orange"]
}
doc2
{
"item_type":"shirt",
"color":["red"]
}
When I do a multi_match search like:
{
"query": {
"multi_match": {
"query": "red bag",
"type": "cross_fields",
"fields": ["item_type","color"]
}
}
}
doc2 gets a much higher score. I understand that the field with fewer items gets a higher score, and it gets worse if I have more colors in doc1.
So is there a way I can ask Elasticsearch to score an array field the same no matter how many items it contains?
If you do not want to account for field length (fieldNorm) during scoring, you can disable norms for the field in the mapping.
For example, the mapping for the example above would be:
{
"properties": {
"item_type": {
"type": "string"
},
"color": {
"type": "string",
"norms": {
"enabled": false
}
}
}
}
This article from the Elasticsearch Definitive Guide gives a good insight into field-length norms.
