Aggregating array of values in elasticsearch - arrays

I need to aggregate an array as follows
Two document examples:
{
"_index": "log",
"_type": "travels",
"_id": "tnQsGy4lS0K6uT3Hwzzo-g",
"_score": 1,
"_source": {
"state": "saopaulo",
"date": "2014-10-30T17",
"traveler": "patrick",
"registry": "123123",
"cities": {
"saopaulo": 1,
"riodejaneiro": 2,
"total": 2
},
"reasons": [
"Entrega de encomenda"
],
"from": [
"CompraRapida"
]
}
},
{
"_index": "log",
"_type": "travels",
"_id": "tnQsGy4lS0K6uT3Hwzzo-g",
"_score": 1,
"_source": {
"state": "saopaulo",
"date": "2014-10-31T17",
"traveler": "patrick",
"registry": "123123",
"cities": {
"saopaulo": 1,
"curitiba": 1,
"total": 2
},
"reasons": [
"Entrega de encomenda"
],
"from": [
"CompraRapida"
]
}
},
I want to aggregate the cities array, to find out all the cities the traveler has gone to. I want something like this:
{
"traveler":{
"name":"patrick"
},
"cities":{
"saopaulo":2,
"riodejaneiro":2,
"curitiba":1,
"total":3
}
}
Where the total is the length of the cities array minus 1. I tried the terms aggregation and the sum, but couldn't output the desired output.
Changes in the document structure can be made, so if anything like that would help me, I'd be pleased to know.

in the document posted above "cities" is not a json array , it is a json object.
If changing the document structure is a possibility I would change cities in the document to be an array of object
example document:
cities : [
{
"name" :"saopaulo"
"visit_count" :"2",
},
{
"name" :"riodejaneiro"
"visit_count" :"1",
}
]
You would then need to set cities to be of type nested in the index mapping
"mappings": {
"<type_name>": {
"properties": {
"cities": {
"type": "nested",
"properties": {
"city": {
"type": "string"
},
"count": {
"type": "integer"
},
"value": {
"type": "long"
}
}
},
"date": {
"type": "date",
"format": "dateOptionalTime"
},
"registry": {
"type": "string"
},
"state": {
"type": "string"
},
"traveler": {
"type": "string"
}
}
}
}
After which you could use nested aggregation to get the city count per user.
The query would look something on these lines :
{
"query": {
"match": {
"traveler": "patrick"
}
},
"aggregations": {
"city_travelled": {
"nested": {
"path": "cities"
},
"aggs": {
"citycount": {
"cardinality": {
"field": "cities.city"
}
}
}
}
}
}

Related

MongoDB How to search in nested objects?

How i can search for the value "20044" in all fields "Barcode" and just in field "Barcode" in a nested object without specifying an absolute path e.g. "Item.Item.Item.Barcode" in MongoDB?
My current solutions:
Search in all text fields, not only in "Barcode" fields
find({$text: {$search: '20044'}})
Search in one specifying absolute path and not in all "Barcode" fields
find({'Item.Item.Item.Barcode': '20044'})
This is my databse object:
{
"_id": {
"$oid": "633d7cc238d7f8dafeace6f5"
},
"Number": "2",
"Item": [
{
"Type": "FrameElement",
"Item": [
{
"Type": "Frame",
"Barcode": "20011"
},
{
"Type": "Frame",
"Barcode": "20012"
},
{
"Type": "SashElement",
"Item": [
{
"Type": "Sash",
"Barcode": "20021"
},
{
"Type": "Sash",
"Barcode": "20022"
},
{
"Type": "GlassBarElement",
"Item": [
{
"Type": "GlassBar",
"Barcode": "20031"
},
{
"Type": "GlassBar",
"Barcode": "20032"
}
]
}
]
},
{
"Type": "Glass",
"Barcode": "20016"
},
{
"Type": "GlassBarElement",
"Item": [
{
"Type": "GlassBar",
"Barcode": "20043"
},
{
"Type": "GlassBar",
"Barcode": "20044"
}
]
}
]
}
]
}

ElasticSearch sort by field in NestedObject At First Index Of Array

I am trying to sort a field inside the first object of an array in the following docs
each docs has an array i want to retrieve the docs sorted by they first objects by there city name lets name that in the following result I want to have first the third documents because the name of the city its start by "L" ('london') then the second "M" ('Moscow') then the third "N" ('NYC')
the structure is a record that:
has an array
the array contains an object (called 'address')
the object has a field (called 'city')
i want to sort the docs by the first address.cities
get hello/_mapping
{
"hello": {
"mappings": {
"jack": {
"properties": {
"houses": {
"type": "nested",
"properties": {
"address": {
"properties": {
"city": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
}
}
}
Thos are the document that i indexed
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "hello",
"_type": "jack",
"_id": "2",
"_score": 1,
"_source": {
"houses": [
{
"address": {
"city": "moscow"
}
},
{
"address": {
"city": "belgrade"
}
},
{
"address": {
"city": "Sacramento"
}
}
]
}
},
{
"_index": "hello",
"_type": "jack",
"_id": "1",
"_score": 1,
"_source": {
"houses": [
{
"address": {
"city": "NYC"
}
},
{
"address": {
"city": "PARIS"
}
},
{
"address": {
"city": "TLV"
}
}
]
}
},
{
"_index": "hello",
"_type": "jack",
"_id": "3",
"_score": 1,
"_source": {
"houses": [
{
"address": {
"city": "London"
}
}
]
}
}
]
}
}
Try this (of course, add some test inside the script if field could be empty. Note it could be pretty slow, because elastic wont have this value indexed. Add a main address field would be faster (and really faster) for sure and would be the good way to do it.
{
"sort" : {
"_script" : {
"script" : "params._source.houses[0].address.city",
"type" : "string",
"order" : "asc"
}
}
}
You have to use _source instead of doc[yourfield] because you dont know in witch order elastic store your array.
EDIT: test if field exist
{
"query": {
"nested": {
"path": "houses",
"query": {
"bool": {
"must": [
{
"exists": {
"field": "houses.address"
}
}
]
}
}
}
},
"sort": {
"_script": {
"script" : "params._source.houses[0].address.city",
"type": "string",
"order": "asc"
}
}
}

ElasticSearch-Kibana : filter array by key

I have data with one parameter which is an array. I know that objects in array are not well supported in Kibana, however I would like to know if there is a way to filter that array with only one value for the key. I mean :
This is a json for exemple :
{
"_index": "index",
"_type": "data",
"_id": "8",
"_version": 2,
"_score": 1,
"_source": {
"envelope": {
"version": "0.0.1",
"submitter": "VF12RBU1D53087510",
"MetaData": {
"SpecificMetaData": [
{
"key": "key1",
"value": "94"
},
{
"key": "key2",
"value": "0"
}
]
}
}
}
}
And I would like to only have the data which contains key1 in my SpecificMetaData array in order to plot them. For now, when I plot SpecificMetaData.value it takes all the values of the array (value of key1 and key2) and doesn't propose SpecificMetaData.value1 and SpecificMetaData.value2.
If you need more information, tell me. Thank you.
you may need to map your data to mappings so as SpecificMetaData should act as nested_type and inner_hits of nested filter can supply you with objects which have key1.
PUT envelope_index
{
"mappings": {
"document_type": {
"properties": {
"envelope": {
"type": "object",
"properties": {
"version": {
"type": "text"
},
"submitter": {
"type": "text"
},
"MetaData": {
"type": "object",
"properties": {
"SpecificMetaData": {
"type": "nested"
}
}
}
}
}
}
}
}
}
POST envelope_index/document_type
{
"envelope": {
"version": "0.0.1",
"submitter": "VF12RBU1D53087510",
"MetaData": {
"SpecificMetaData": [{
"key": "key1",
"value": "94"
},
{
"key": "key2",
"value": "0"
}
]
}
}
}
POST envelope_index/_search
{
"query": {
"bool": {
"must": [
{
"nested": {
"inner_hits": {},
"path": "envelope.MetaData.SpecificMetaData",
"query": {
"bool": {
"must": [
{
"term": {
"envelope.MetaData.SpecificMetaData.key": {
"value": "key1"
}
}
}
]
}
}
}
}
]
}
}
}

How to score by max relevance match in array elements in ElasticSearch?

I have an autocomplete analyser for a field("keywords"). This field is an array of strings. When I query with a search string I want to show first the documents where a single element of the array keywords matches best. The problem is that if a part of the string matches with more elements of the array "keywords", then this document appears before another that has less but better matches. For example, if I have a query with the word "gas station" the returning documents' keywords are these:
"hits": [
{
"_index": "locali_v3",
"_type": "categories",
"_id": "5810767ddc536a03b4761acd",
"_score": 3.1974547,
"_source": {
"keywords": [
"Radio Station",
"Radio Station"
]
}
},
{
"_index": "locali_v3",
"_type": "categories",
"_id": "581076d8dc536a03b4761cc3",
"_score": 3.0407648,
"_source": {
"keywords": [
"Stationery Store",
"Stationery Store"
]
}
},
{
"_index": "locali_v3",
"_type": "categories",
"_id": "5810767ddc536a03b4761ace",
"_score": 2.903595,
"_source": {
"keywords": [
"TV Station",
"TV Station"
]
}
},
{
"_index": "locali_v3",
"_type": "categories",
"_id": "581076cddc536a03b4761c87",
"_score": 2.517158,
"_source": {
"keywords": [
"Praktoreio Ugrwn Kausimwn/Gkaraz",
"Praktoreio Ygrwn Kaysimwn/Gkaraz",
"Praktoreio Ugron Kausimon/Gkaraz",
"Praktoreio Ygron Kaysimon/Gkaraz",
"Πρακτορείο Υγρών Καυσίμων/Γκαράζ",
"Gas Station"
]
}
}
The "Gas Station" is fourth, although it has the best single element matching. Is there a way to tell ElasticSearch that I do not care about how many times "gas" or "station" appears in keywords? I want the max element of the array keywords match as the score factor.
My settings are:
{
"locali": {
"settings": {
"index": {
"creation_date": "1480937810266",
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"keywords": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"char_filter": [
"my_char_filter"
],
"type": "custom",
"tokenizer": "standard"
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"ί => ι",
"Ί => Ι",
"ή => η",
"Ή => Η",
"ύ => υ",
"Ύ => Υ",
"ά => α",
"Ά => Α",
"έ => ε",
"Έ => Ε",
"ό => ο",
"Ό => Ο",
"ώ => ω",
"Ώ => Ω",
"ϊ => ι",
"ϋ => υ",
"ΐ => ι",
"ΰ => υ"
]
}
}
},
"number_of_shards": "1",
"number_of_replicas": "1",
"uuid": "TJjOt9L9QE2HrsUFHM6zJg",
"version": {
"created": "2040099"
}
}
}
}
}
And the mappings:
{
"locali": {
"mappings": {
"places": {
"properties": {
"formattedCategories": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"keywords": {
"type": "string",
"analyzer": "keywords"
},
"loc": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"location": {
"properties": {
"formattedAddress": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"locality": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"neighbourhood": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
}
}
},
"name": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"rating": {
"properties": {
"rating": {
"type": "long"
}
}
},
"seenDetails": {
"type": "long"
},
"verified": {
"type": "long"
}
}
},
"regions": {
"properties": {
"keywords": {
"type": "string",
"analyzer": "keywords"
},
"loc": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"name": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"type": {
"type": "long"
},
"weight": {
"type": "long"
}
}
},
"categories": {
"properties": {
"keywords": {
"type": "string",
"analyzer": "keywords"
},
"name": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"weight": {
"type": "long"
}
}
}
}
}
}
Can you post your query here that you are trying here as well.
I tried your example with the following query
{
"query": {"match": {
"keywords": "gas station"
}
}
}
And i got your desired result.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.081366636,
"hits": [
{
"_index": "stack",
"_type": "type",
"_id": "AVjP6QnpdNp-z_ybGd-L",
"_score": 0.081366636,
"_source": {
"keywords": [
"Praktoreio Ugrwn Kausimwn/Gkaraz",
"Praktoreio Ygrwn Kaysimwn/Gkaraz",
"Praktoreio Ugron Kausimon/Gkaraz",
"Praktoreio Ygron Kaysimon/Gkaraz",
"Πρακτορείο Υγρών Καυσίμων/Γκαράζ",
"Gas Station"
]
}
},
{
"_index": "stack",
"_type": "type",
"_id": "AVjP5-u5dNp-z_ybGd-I",
"_score": 0.03182549,
"_source": {
"keywords": [
"Radio Station",
"Radio Station"
]
}
},
{
"_index": "stack",
"_type": "type",
"_id": "AVjP6KiKdNp-z_ybGd-K",
"_score": 0.03182549,
"_source": {
"keywords": [
"TV Station",
"TV Station"
]
}
}
]
}
}
Try this query to see if you are getting desired result. Also you can reply with your mappings, query and ES version if this does't work for you.
Hope this solves your problem. Thanks

ElasticSearch: Retrieve string concatenation, or partial array

I have many indexed documents such as this one:
{
"_index":"myindex",
"_type":"somedata",
"_id":"31d3255d-67b4-40e6-b9d4-637383eb72ad",
"_version":1,
"_score":1,
"_source":{
"otherID":"b4c95332-daed-49ae-99fe-c32482696d1c",
"data":[
{
"data":"d2454d41-a74e-43af-b3b0-0febeaf67a99",
"iD":"9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
},
{
"data":"some text",
"iD":"c554b8ce-c873-4fef-b306-ec65d2f40394"
},
{
"data":"5256983c-ef69-4363-9787-97074297c646",
"iD":"8c90e2be-6042-4450-b0fd-0732900f8f65"
},
{
"data":"other text",
"iD":"8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
},
{
"data":"3",
"iD":"c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
}
],
"iD":"31d3255d-67b4-40e6-b9d4-637383eb72ad"
}
}
It's type mapping is configured this way:
{
"somedata":{
"dynamic_templates":[
{
"defaultIDs":{
"match_pattern":"regex",
"mapping":{
"index":"not_analyzed",
"type":"string"
},
"match":".*(id|ID|iD)"
}
}
],
"properties":{
"otherID":{
"index":"not_analyzed",
"type":"string"
},
"data":{
"properties":{
"data":{
"type":"string"
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
}
}
I wish to be able to retrieve a string concatenation of data based on it's ID.
For example, given the id c554b8ce-c873-4fef-b306-ec65d2f40394, and the id 8d8f8a61-02d6-4d3e-9912-9ebb5d213c15, I would like to retrieve some text other text.
These IDs repeat in other documents of the same type with different data.
If this is not possible (which I suspect this is the case), I would like to at least retrieve a partial array containing my requested data.
Those arrays can become large (and so is the number of documents) and I would only need one or two elements from each hit.
If both my requests are not possible, how would you suggest changing my mappings in order to facilitate my needs?
Thanks in advance, Jonathan.
I have found a way to do exactly what I needed without changing my data structure.
(I actually did end up changing my data structure, but for reasons of space and efficiency).
All you have to do is enjoy the groovy goodness ElasticSearch has to offer:
{
"query" : { "term" : { "otherID" : "b4c95332-daed-49ae-99fe-c32482696d1c" } },
"script_fields" : { "requestedFields" : { "script" : "_source.data.findAll({ it.iD == 'c554b8ce-c873-4fef-b306-ec65d2f40394' || it.iD == '8d8f8a61-02d6-4d3e-9912-9ebb5d213c15'}) data.join(' ') " } }
}
Just goes to show how strong ElasticSearch really is.
I cannot help you with the field concatenation (maybe it's possible with scripting but I'm not experienced enough with it. I would assume a new field would have to be generated, etc.) but how to only retrieve the partial data.
It requires at least ES 1.5 because it uses inner_hits and you need to change the mapping.
I added type and include_in_parent to your data type:
DELETE somedata
PUT somedata
PUT somedata/sometype/_mapping
{
"sometype":{
"dynamic_templates":[
{
"defaultIDs":{
"match_pattern":"regex",
"mapping":{
"index":"not_analyzed",
"type":"string"
},
"match":".*(id|ID|iD)"
}
}
],
"properties":{
"otherID":{
"index":"not_analyzed",
"type":"string"
},
"data":{
"type": "nested",
"include_in_parent": true,
"properties":{
"data":{
"type":"string"
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
}
}
Now indexing your document:
PUT somedata/sometype/1
{
"otherID":"b4c95332-daed-49ae-99fe-c32482696d1c",
"data":[
{
"data":"d2454d41-a74e-43af-b3b0-0febeaf67a99",
"iD":"9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
},
{
"data":"some text",
"iD":"c554b8ce-c873-4fef-b306-ec65d2f40394"
},
{
"data":"5256983c-ef69-4363-9787-97074297c646",
"iD":"8c90e2be-6042-4450-b0fd-0732900f8f65"
},
{
"data":"other text",
"iD":"8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
},
{
"data":"3",
"iD":"c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
}
],
"iD":"31d3255d-67b4-40e6-b9d4-637383eb72ad"
}
And here's how you can match and retrieve with inner_hits:
POST somedata/sometype/_search
{
"query": {
"nested": {
"path": "data",
"query": {
"bool": {
"should": [
{
"term": {
"data.iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
}
},
{
"term": {
"data.iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
}
}
]
}
},
"inner_hits": {}
}
}
}
In the result now look at this path: hits.hits[0].inner_hits.data.hits.hits[0]._source.data; it only contains your two requested matches:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5986179,
"hits": [
{
"_index": "somedata",
"_type": "sometype",
"_id": "1",
"_score": 0.5986179,
"_source": {
"otherID": "b4c95332-daed-49ae-99fe-c32482696d1c",
"data": [
{
"data": "d2454d41-a74e-43af-b3b0-0febeaf67a99",
"iD": "9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
},
{
"data": "some text",
"iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
},
{
"data": "5256983c-ef69-4363-9787-97074297c646",
"iD": "8c90e2be-6042-4450-b0fd-0732900f8f65"
},
{
"data": "other text",
"iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
},
{
"data": "3",
"iD": "c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
}
],
"iD": "31d3255d-67b4-40e6-b9d4-637383eb72ad"
},
"inner_hits": {
"data": {
"hits": {
"total": 2,
"max_score": 0.5986179,
"hits": [
{
"_index": "somedata",
"_type": "sometype",
"_id": "1",
"_nested": {
"field": "data",
"offset": 3
},
"_score": 0.5986179,
"_source": {
"data": "other text",
"iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
}
},
{
"_index": "somedata",
"_type": "sometype",
"_id": "1",
"_nested": {
"field": "data",
"offset": 1
},
"_score": 0.5986179,
"_source": {
"data": "some text",
"iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
}
}
]
}
}
}
}
]
}
}
Now, inner_hits is fairly new and the documentation also states:
Warning: This functionality is experimental and may be changed or removed completely in a future release.
YMMV.
Another thing to watch out: the inner_hits are sorted by score. In your original document they're in an array which is ordered but that information is lost in the actual result. If you require to have them in the same order in the inner_hits, I think you need to add a separate field for sorting (could just be the array index...) and sort the inner_hits by it.

Resources