Elasticsearch terms aggregation by strings in an array

How can I write an Elasticsearch terms aggregation that splits the buckets by the entire term rather than by individual tokens? For example, I would like to aggregate by state, but the following returns new, york, jersey and california as individual buckets instead of the expected New York, New Jersey and California:
curl -XPOST "http://localhost:9200/my_index/_search" -d'
{
"aggs" : {
"states" : {
"terms" : {
"field" : "states",
"size": 10
}
}
}
}'
My use case is like the one described here
https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html
with just one difference:
the city field is an array in my case.
Example object:
{
"states": ["New York", "New Jersey", "California"]
}
It seems that the proposed solution (mapping the field as not_analyzed) does not work for arrays.
My mapping:
{
"properties": {
"states": {
"type":"object",
"fields": {
"raw": {
"type":"object",
"index":"not_analyzed"
}
}
}
}
}
I have tried to replace "object" by "string" but this is not working either.

I think all you're missing is "states.raw" in your aggregation (note that, since no analyzer is specified, the "states" field is analyzed with the standard analyzer; the sub-field "raw" is "not_analyzed"). Though your mapping might bear looking at as well. When I tried your mapping against ES 2.0 I got some errors, but this worked:
PUT /test_index
{
"mappings": {
"doc": {
"properties": {
"states": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
}
Then I added a couple of docs:
POST /test_index/doc/_bulk
{"index":{"_id":1}}
{"states":["New York","New Jersey","California"]}
{"index":{"_id":2}}
{"states":["New York","North Carolina","North Dakota"]}
And this query seems to do what you want:
POST /test_index/_search
{
"size": 0,
"aggs" : {
"states" : {
"terms" : {
"field" : "states.raw",
"size": 10
}
}
}
}
returning:
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0,
"hits": []
},
"aggregations": {
"states": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "New York",
"doc_count": 2
},
{
"key": "California",
"doc_count": 1
},
{
"key": "New Jersey",
"doc_count": 1
},
{
"key": "North Carolina",
"doc_count": 1
},
{
"key": "North Dakota",
"doc_count": 1
}
]
}
}
}
Here's the code I used to test it:
http://sense.qbox.io/gist/31851c3cfee8c1896eb4b53bc1ddd39ae87b173e
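To illustrate why the two fields bucket differently, here is a minimal JavaScript sketch that imitates the counting outside Elasticsearch. The analyze function is a rough stand-in for the standard analyzer (an assumption: the real analyzer does more than lowercase and split), and termsBuckets is a hypothetical helper, not an Elasticsearch API.

```javascript
// Two documents shaped like the ones indexed above.
const docs = [
  { states: ["New York", "New Jersey", "California"] },
  { states: ["New York", "North Carolina", "North Dakota"] },
];

// Rough stand-in for the standard analyzer: lowercase, split on non-letters.
// Assumption: the real analyzer does more than this.
const analyze = (value) => value.toLowerCase().split(/[^a-z]+/).filter(Boolean);

// Count, for each term, how many documents contain it (like doc_count).
function termsBuckets(documents, toTerms) {
  const counts = new Map();
  for (const doc of documents) {
    const terms = new Set(doc.states.flatMap(toTerms)); // one count per document
    for (const term of terms) {
      counts.set(term, (counts.get(term) || 0) + 1);
    }
  }
  return Object.fromEntries(counts);
}

// "states" behaves like the analyzed field, "states.raw" like the untouched values.
const analyzedBuckets = termsBuckets(docs, analyze);
const rawBuckets = termsBuckets(docs, (value) => [value]);

console.log(analyzedBuckets);
console.log(rawBuckets);
```

The raw variant keeps each array element as a single term, which is why aggregating on states.raw yields whole state names even though the field holds an array.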

Related

Creating 1-1 mapping of Array results

I am trying to use data coming back from an ElasticSearch query, and get it into an Array in jsonata in a certain format.
Essentially, I need my result set to be like this:
{
"userName": [
"david2#david2.com",
"david2#david2.com",
"david2#david2.com",
"david2#david2.com"
],
"label": [
"Dealer",
"Inquiry",
"DD Test Skill1",
"_11DavidTest"
],
"value": [
3,
5,
2,
1
]
}
However, what I am getting is this:
{
"userName": "david2#david2.com",
"label": [
"Dealer",
"Inquiry",
"DD Test Skill1",
"_11DavidTest"
],
"value": [
3,
5,
2,
1
]
}
I am using the following to map the results:
(
$data := $map(data.hits.hits._source.item."Prod::User", function($v) {
{
"userName": $v.userName,
"label": $v.userSkillLevels.skill.name,
"value": $v.userSkillLevels.level
}
});
)
And my overall dataset returned from ElasticSearch is as follows:
{
"data": {
"took": 3,
"timed_out": false,
"_shards": {
"total": 15,
"successful": 15,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 2,
"relation": "eq"
},
"max_score": 1.002851,
"hits": [
{
"_index": "items_latest_production1_user",
"_type": "_doc",
"_id": "63d000766f67d40a73073d5d_f6144acf2b3ff31209ef9f6d461cd849",
"_score": 1.002851,
"_source": {
"item": {
"Prod::User": {
"userSkillLevels": [
{
"level": 3,
"skill": {
"name": "Dealer"
}
},
{
"level": 5,
"skill": {
"name": "Inquiry"
}
},
{
"level": 2,
"skill": {
"name": "DD Test Skill1"
}
},
{
"level": 1,
"skill": {
"name": "_11DavidTest"
}
}
],
"userName": "david2#david2.com"
}
}
}
}
]
}
}
}
I can see that each user that comes back from Elastic has all the skills/levels in an array associated with one userName.
I need to have the same number of userNames as there are skills, and I am struggling to get it just right.
Thoughts? Appreciate any help, I'm sure I'm overlooking something simple.
Actually, need this format:
[
{
"userName" : "david2#david2.com",
"label" : "Dealer",
"value" : 3
},
{
"userName" : "david2#david2.com",
"label" : "Inquiry",
"value" : 5
},
{
"userName" : "david2#david2.com",
"label" : "DD Test Skill1",
"value" : 2
},
{
"userName" : "david2#david2.com",
"label" : "_11DavidTest",
"value" : 1
}
]
One way you could solve this is to create the userName array yourself and make sure it has the same length as userSkillLevels.skill. Here's how you can do this:
(
$data := $map(data.hits.hits._source.item."Prod::User", function($v, $i, $a) {
{
"userName": $map([1..$count($v.userSkillLevels.skill)], function() {$v.userName}),
"label": $v.userSkillLevels.skill.name,
"value": $v.userSkillLevels.level
}
});
)
Feel free to check out this response in Stedi JSONata Playground: https://stedi.link/PVwGacu
For the updated format from your question ("Actually, need this format"), this solution can produce it using the parent operator (%):
data.hits.hits._source.item."Prod::User".userSkillLevels.{
"userName": %.userName,
"label": skill.name,
"value": level
}
Check it out on the playground: https://stedi.link/gexB3Cb
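For comparison, the same one-row-per-skill reshaping can be sketched in plain JavaScript. This is only an illustration of what the expression does, not part of the JSONata solution; the response object is a trimmed-down copy of the data from the question.

```javascript
// Trimmed-down copy of the Elasticsearch response from the question.
const response = {
  data: {
    hits: {
      hits: [
        {
          _source: {
            item: {
              "Prod::User": {
                userName: "david2#david2.com",
                userSkillLevels: [
                  { level: 3, skill: { name: "Dealer" } },
                  { level: 5, skill: { name: "Inquiry" } },
                ],
              },
            },
          },
        },
      ],
    },
  },
};

// Emit one object per skill, repeating the parent's userName on each row.
const rows = response.data.hits.hits.flatMap((hit) => {
  const user = hit._source.item["Prod::User"];
  return user.userSkillLevels.map((entry) => ({
    userName: user.userName,
    label: entry.skill.name,
    value: entry.level,
  }));
});

console.log(rows);
```

The closure over user plays the role of the % parent reference: each inner element still has access to the object that contains it.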

Cypress: intercept a nested array and object and assert its values

I am trying to assert values in a request body that I have intercepted with Cypress.
The values I need to assert are "name": "NewName" and "title": "STUB1-Draft", which you can see in the request body example that I have attached.
My test script in Cypress:
it.only('Check the requestbody', function () {
cy.intercept('PUT', '**/api/assessmenttest/**', req => {
req.reply({ statusCode: 200 });
}).as('NewSectionAndItem');
cy.wait('@NewSectionAndItem')
.its('request.body.test')
.its('testParts')
.its('testSections')
.its('name')
.should('include', 'NewName');
//cy.wait('@NewSectionAndItem').its('request.body.testParts').expect(arr_obj[1].name).to.equal('NewName')
});
The request body looks like the following:
{
"structureStatistics": {
"testPartCount": 1,
"testSectionCount": 6,
"itemCount": 23
},
"name": "BIMMA",
"title": "OTAP",
"correctionInstructionsUrl": "C:Stub/FakePath/For/Cypress",
"correctionInstructionAppendices": [],
"testParts": [
{
"testSections": [
{
"order": 1,
"name": "Tekst 1 Looking for the one? ",
"itemReferences": [
{
"itemId": "55eb5a28-24d8-4705-b465-8e1454f73ac8",
"weight": 11,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "55eb5a28-24d8-4705-b465-8e1454f73ac8",
"title": "H-E-T1-1"
}
}
],
"id": "5c3eef2d-1094-4b9e-84c1-f184956f87fa"
},
{
"id": "ffaebc93-0bf6-4f75-944a-f61345a7be90",
"name": "NewName",
"itemReferences": [
{
"itemId": "58a29037-c92c-48f6-a7c3-a2f94e288992",
"weight": 0,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "58a29037-c92c-48f6-a7c3-a2f94e288992",
"title": "STUB1-Draft",
"state": "Draft"
}
}
]
},
{
"order": 2,
"name": "Tekst 2 The fruit Iron Ox bears",
"itemReferences": [
{
"itemId": "abfc0811-26c7-4d9d-b3cc-0c920e5af259",
"weight": 2,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "abfc0811-26c7-4d9d-b3cc-0c920e5af259",
"title": "H-E-T2-2"
}
},
{
"itemId": "3cfda5e0-0d64-44ef-8a4d-21f37484c024",
"weight": 12,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "3cfda5e0-0d64-44ef-8a4d-21f37484c024",
"title": "H-E-T2-3"
}
},
{
"itemId": "19ba8a53-9755-4beb-8f69-edd107b80230",
"weight": 1,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "19ba8a53-9755-4beb-8f69-edd107b80230",
"title": "H-E-T2-4"
}
},
{
"itemId": "3f5b7b81-df1f-4f01-8165-cb2226d9044d",
"weight": 1,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "3f5b7b81-df1f-4f01-8165-cb2226d9044d",
"title": "H-E-T2-5"
}
}
],
"id": "00f7455e-6d7d-4311-80cd-eff45c83ef2c"
},
{
"order": 3,
"name": "Tekst 3 How to live like a tramp",
"itemReferences": [
{
"itemId": "7e2d568c-4cde-4500-9c6b-c09f246155e4",
"weight": 1,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "7e2d568c-4cde-4500-9c6b-c09f246155e4",
"title": "H-E-T3-6"
}
},
{
"itemId": "87a5bf1c-451a-40b8-802a-53ee842cafcd",
"weight": 1,
"neutralisationType": "NoNeutralisation",
"itemSummary": {
"id": "87a5bf1c-451a-40b8-802a-53ee842cafcd",
"title": "H-E-T3-7"
}
}
],
"id": "390ecc2e-6715-4898-aaea-158e790525a2"
}
],
"navigationMode": "Linear",
"submissionMode": "Individual",
"id": "a546a67c-ac39-4e81-bf03-beb482c920a0"
}
],
"metadataToBePublished": [
"be63002c-dcf8-449f-a0ae-6ba50d4e2712",
"4d70239e-7a6e-47c3-b157-462d6c8c5edc"
],
"created": "2022-07-08T09:00:00+00:00",
"modified": "2022-09-21T23:55:58.6532451+02:00",
"createdBy": {
"id": "a45ea6db-bf04-427d-9354-7081b7592a3d",
"fullName": "Manual Construction"
},
"lastModifiedBy": {
"id": "129a584c-a677-4d9f-b289-019d1815064f",
"fullName": "OZKAN"
},
"id": "300eea01-ee10-4bd9-9356-8aaa933e949c"
}
I could not figure out how to assert nested arrays and values without using deep.equal on the complete request. Thank you indeed!
It basically just looks like a couple of things:
- there is an extra test property in the test that's not there in the request
- testParts and testSections are arrays (square brackets instead of curly brackets), so you need an index for them
Generally I think this would work:
cy.wait('@NewSectionAndItem')
.its('request.body')
.its('testParts.0')
.its('testSections.1')
.its('name')
.should('include', 'NewName');
The problem is identifying the correct array indexes. testParts has only one element, so 0 is the only index option there.
To find the testSection index, create a dummy .json file in VS Code (or other editor) paste in the json from above and use the collapse toggles on the left to easily see which index contains the value you seek.
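If hard-coding the index feels brittle, the section can also be located by its name. The following is a minimal sketch in plain JavaScript on a shortened copy of the request body; the variable names are illustrative, and inside a Cypress test the same logic could presumably run in a .then() callback on the intercepted body.

```javascript
// Shortened copy of the request body from the question.
const body = {
  testParts: [
    {
      testSections: [
        { order: 1, name: "Tekst 1 Looking for the one? ", itemReferences: [] },
        {
          name: "NewName",
          itemReferences: [
            { itemSummary: { title: "STUB1-Draft", state: "Draft" } },
          ],
        },
      ],
    },
  ],
};

// Flatten the sections of every test part, then search by name instead of by index.
const sections = body.testParts.flatMap((part) => part.testSections);
const sectionIndex = sections.findIndex((section) => section.name === "NewName");
const section = sections[sectionIndex];

// Titles live one level deeper, under itemReferences[].itemSummary.
const titles = section.itemReferences.map((ref) => ref.itemSummary.title);

console.log(sectionIndex, titles);
```

Searching by name keeps the assertion valid even if the order of the sections in the request changes.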
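You can use the cy-spok plugin to make a spok assertion that easily checks a nested property.
Example use:
const spok = require('cy-spok')
// later in your test
cy.wait('@NewSectionAndItem')
.its('request.body')
.should(spok({
// Schematic shape only: in the request above there is no test wrapper,
// testParts and testSections are arrays, and title sits under
// itemReferences[].itemSummary, so adapt the nesting to match.
test: {
testParts: {
testSections: {
name: 'NewName',
title: 'STUB1-Draft'
}
}
}
}));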

ElasticSearch sort by field in NestedObject At First Index Of Array

I am trying to sort documents by a field inside the first object of an array.
Each document has an array, and I want to retrieve the documents sorted by the city name of their first object. In the result below, I want document 3 first because its city name starts with "L" ('London'), then document 2, "M" ('moscow'), then document 1, "N" ('NYC').
The structure is a record that:
- has an array
- the array contains an object (called 'address')
- the object has a field (called 'city')
I want to sort the documents by the first address.city.
GET hello/_mapping
{
"hello": {
"mappings": {
"jack": {
"properties": {
"houses": {
"type": "nested",
"properties": {
"address": {
"properties": {
"city": {
"type": "text",
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
}
}
}
}
}
}
}
These are the documents that I indexed:
{
"took": 0,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 1,
"hits": [
{
"_index": "hello",
"_type": "jack",
"_id": "2",
"_score": 1,
"_source": {
"houses": [
{
"address": {
"city": "moscow"
}
},
{
"address": {
"city": "belgrade"
}
},
{
"address": {
"city": "Sacramento"
}
}
]
}
},
{
"_index": "hello",
"_type": "jack",
"_id": "1",
"_score": 1,
"_source": {
"houses": [
{
"address": {
"city": "NYC"
}
},
{
"address": {
"city": "PARIS"
}
},
{
"address": {
"city": "TLV"
}
}
]
}
},
{
"_index": "hello",
"_type": "jack",
"_id": "3",
"_score": 1,
"_source": {
"houses": [
{
"address": {
"city": "London"
}
}
]
}
}
]
}
}
Try this (of course, add some test inside the script if the field could be empty). Note that it could be pretty slow, because Elasticsearch won't have this value indexed. Adding a main address field would certainly be much faster and would be the right way to do it.
{
"sort" : {
"_script" : {
"script" : "params._source.houses[0].address.city",
"type" : "string",
"order" : "asc"
}
}
}
You have to use _source instead of doc[yourfield] because you don't know in which order Elasticsearch stores your array.
EDIT: test whether the field exists:
{
"query": {
"nested": {
"path": "houses",
"query": {
"bool": {
"must": [
{
"exists": {
"field": "houses.address"
}
}
]
}
}
}
},
"sort": {
"_script": {
"script" : "params._source.houses[0].address.city",
"type": "string",
"order": "asc"
}
}
}
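The intended ordering can be sketched client-side in JavaScript on the three documents from the question. This is only an illustration, not the Elasticsearch script itself; firstCity and sortByFirstCity are hypothetical helpers. It also shows one thing to check: with a plain string comparison, uppercase letters sort before lowercase ones, so 'NYC' lands before 'moscow' unless the values are normalized (whether the script sort behaves the same way is an assumption to verify).

```javascript
// The three documents from the question, reduced to the fields that matter.
const docs = [
  { _id: "2", houses: [{ address: { city: "moscow" } }, { address: { city: "belgrade" } }, { address: { city: "Sacramento" } }] },
  { _id: "1", houses: [{ address: { city: "NYC" } }, { address: { city: "PARIS" } }, { address: { city: "TLV" } }] },
  { _id: "3", houses: [{ address: { city: "London" } }] },
];

// City of the first array element, or null when anything on the path is missing.
const firstCity = (doc) => doc.houses?.[0]?.address?.city ?? null;

// Plain code-unit comparison, so the result does not depend on the locale.
const compare = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

function sortByFirstCity(documents, normalize = (s) => s) {
  return [...documents].sort((x, y) => {
    const a = firstCity(x);
    const b = firstCity(y);
    // Documents without a first city go last.
    if (a === null || b === null) return (a === null) - (b === null);
    return compare(normalize(a), normalize(b));
  });
}

const caseSensitiveOrder = sortByFirstCity(docs).map((d) => d._id);
const caseInsensitiveOrder = sortByFirstCity(docs, (s) => s.toLowerCase()).map((d) => d._id);

console.log(caseSensitiveOrder);   // London, NYC, moscow
console.log(caseInsensitiveOrder); // London, moscow, NYC
```

Only the case-insensitive variant produces the London, moscow, NYC order asked for in the question.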

ElasticSearch: Retrieve string concatenation, or partial array

I have many indexed documents such as this one:
{
"_index":"myindex",
"_type":"somedata",
"_id":"31d3255d-67b4-40e6-b9d4-637383eb72ad",
"_version":1,
"_score":1,
"_source":{
"otherID":"b4c95332-daed-49ae-99fe-c32482696d1c",
"data":[
{
"data":"d2454d41-a74e-43af-b3b0-0febeaf67a99",
"iD":"9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
},
{
"data":"some text",
"iD":"c554b8ce-c873-4fef-b306-ec65d2f40394"
},
{
"data":"5256983c-ef69-4363-9787-97074297c646",
"iD":"8c90e2be-6042-4450-b0fd-0732900f8f65"
},
{
"data":"other text",
"iD":"8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
},
{
"data":"3",
"iD":"c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
}
],
"iD":"31d3255d-67b4-40e6-b9d4-637383eb72ad"
}
}
Its type mapping is configured this way:
{
"somedata":{
"dynamic_templates":[
{
"defaultIDs":{
"match_pattern":"regex",
"mapping":{
"index":"not_analyzed",
"type":"string"
},
"match":".*(id|ID|iD)"
}
}
],
"properties":{
"otherID":{
"index":"not_analyzed",
"type":"string"
},
"data":{
"properties":{
"data":{
"type":"string"
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
}
}
I wish to be able to retrieve a string concatenation of data based on its ID.
For example, given the id c554b8ce-c873-4fef-b306-ec65d2f40394, and the id 8d8f8a61-02d6-4d3e-9912-9ebb5d213c15, I would like to retrieve some text other text.
These IDs repeat in other documents of the same type with different data.
If this is not possible (which I suspect is the case), I would like to at least retrieve a partial array containing my requested data.
Those arrays can become large (and so can the number of documents), and I would only need one or two elements from each hit.
If both my requests are not possible, how would you suggest changing my mappings in order to facilitate my needs?
Thanks in advance, Jonathan.
I have found a way to do exactly what I needed without changing my data structure.
(I actually did end up changing my data structure, but for reasons of space and efficiency).
All you have to do is enjoy the groovy goodness ElasticSearch has to offer:
{
"query" : { "term" : { "otherID" : "b4c95332-daed-49ae-99fe-c32482696d1c" } },
"script_fields" : { "requestedFields" : { "script" : "_source.data.findAll({ it.iD == 'c554b8ce-c873-4fef-b306-ec65d2f40394' || it.iD == '8d8f8a61-02d6-4d3e-9912-9ebb5d213c15' }).data.join(' ')" } }
}
Just goes to show how strong ElasticSearch really is.
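For readers less familiar with Groovy, the filter-then-join logic of that script can be sketched in JavaScript. This is an equivalent outside Elasticsearch for illustration only; source stands in for the document's _source and wantedIds is an illustrative name.

```javascript
// Stand-in for the _source of the document from the question.
const source = {
  data: [
    { data: "d2454d41-a74e-43af-b3b0-0febeaf67a99", iD: "9362f2eb-9bd7-4924-8b0e-77c27bb0aa56" },
    { data: "some text", iD: "c554b8ce-c873-4fef-b306-ec65d2f40394" },
    { data: "5256983c-ef69-4363-9787-97074297c646", iD: "8c90e2be-6042-4450-b0fd-0732900f8f65" },
    { data: "other text", iD: "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15" },
    { data: "3", iD: "c880bfdf-eb4b-4c80-9871-fd44e06b2ed2" },
  ],
};

// The two IDs whose data should be concatenated.
const wantedIds = new Set([
  "c554b8ce-c873-4fef-b306-ec65d2f40394",
  "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15",
]);

// Keep only the requested entries (findAll), take their data, join with spaces.
const requestedFields = source.data
  .filter((entry) => wantedIds.has(entry.iD))
  .map((entry) => entry.data)
  .join(" ");

console.log(requestedFields); // "some text other text"
```

Because the filter walks the array in its stored order, the concatenation follows the order of the entries in the document, not the order in which the IDs were requested.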
I cannot help you with the field concatenation (maybe it's possible with scripting, but I'm not experienced enough with it; I would assume a new field would have to be generated, etc.), but I can show how to retrieve only the partial data.
It requires at least ES 1.5 because it uses inner_hits and you need to change the mapping.
I added type and include_in_parent to your data type:
DELETE somedata
PUT somedata
PUT somedata/sometype/_mapping
{
"sometype":{
"dynamic_templates":[
{
"defaultIDs":{
"match_pattern":"regex",
"mapping":{
"index":"not_analyzed",
"type":"string"
},
"match":".*(id|ID|iD)"
}
}
],
"properties":{
"otherID":{
"index":"not_analyzed",
"type":"string"
},
"data":{
"type": "nested",
"include_in_parent": true,
"properties":{
"data":{
"type":"string"
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
},
"iD":{
"index":"not_analyzed",
"type":"string"
}
}
}
}
Now indexing your document:
PUT somedata/sometype/1
{
"otherID":"b4c95332-daed-49ae-99fe-c32482696d1c",
"data":[
{
"data":"d2454d41-a74e-43af-b3b0-0febeaf67a99",
"iD":"9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
},
{
"data":"some text",
"iD":"c554b8ce-c873-4fef-b306-ec65d2f40394"
},
{
"data":"5256983c-ef69-4363-9787-97074297c646",
"iD":"8c90e2be-6042-4450-b0fd-0732900f8f65"
},
{
"data":"other text",
"iD":"8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
},
{
"data":"3",
"iD":"c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
}
],
"iD":"31d3255d-67b4-40e6-b9d4-637383eb72ad"
}
And here's how you can match and retrieve with inner_hits:
POST somedata/sometype/_search
{
"query": {
"nested": {
"path": "data",
"query": {
"bool": {
"should": [
{
"term": {
"data.iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
}
},
{
"term": {
"data.iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
}
}
]
}
},
"inner_hits": {}
}
}
}
In the result, look at the path hits.hits[0].inner_hits.data.hits.hits: it contains only your two requested matches (for example, hits.hits[0].inner_hits.data.hits.hits[0]._source.data):
{
"took": 1,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.5986179,
"hits": [
{
"_index": "somedata",
"_type": "sometype",
"_id": "1",
"_score": 0.5986179,
"_source": {
"otherID": "b4c95332-daed-49ae-99fe-c32482696d1c",
"data": [
{
"data": "d2454d41-a74e-43af-b3b0-0febeaf67a99",
"iD": "9362f2eb-9bd7-4924-8b0e-77c27bb0aa56"
},
{
"data": "some text",
"iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
},
{
"data": "5256983c-ef69-4363-9787-97074297c646",
"iD": "8c90e2be-6042-4450-b0fd-0732900f8f65"
},
{
"data": "other text",
"iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
},
{
"data": "3",
"iD": "c880bfdf-eb4b-4c80-9871-fd44e06b2ed2"
}
],
"iD": "31d3255d-67b4-40e6-b9d4-637383eb72ad"
},
"inner_hits": {
"data": {
"hits": {
"total": 2,
"max_score": 0.5986179,
"hits": [
{
"_index": "somedata",
"_type": "sometype",
"_id": "1",
"_nested": {
"field": "data",
"offset": 3
},
"_score": 0.5986179,
"_source": {
"data": "other text",
"iD": "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15"
}
},
{
"_index": "somedata",
"_type": "sometype",
"_id": "1",
"_nested": {
"field": "data",
"offset": 1
},
"_score": 0.5986179,
"_source": {
"data": "some text",
"iD": "c554b8ce-c873-4fef-b306-ec65d2f40394"
}
}
]
}
}
}
}
]
}
}
Now, inner_hits is fairly new and the documentation also states:
Warning: This functionality is experimental and may be changed or removed completely in a future release.
YMMV.
Another thing to watch out for: the inner_hits are sorted by score. In your original document they're in an array, which is ordered, but that information is lost in the actual result. If you need them in the same order in the inner_hits, I think you need to add a separate field for sorting (could just be the array index...) and sort the inner_hits by it.
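If re-ordering on the client is acceptable, the _nested.offset value that appears in each inner hit of the response above can be used to restore the array order. A minimal JavaScript sketch, with the inner hits reduced to the relevant fields:

```javascript
// Inner hits as returned above: sorted by score, not by array position.
const innerHits = [
  {
    _nested: { field: "data", offset: 3 },
    _source: { data: "other text", iD: "8d8f8a61-02d6-4d3e-9912-9ebb5d213c15" },
  },
  {
    _nested: { field: "data", offset: 1 },
    _source: { data: "some text", iD: "c554b8ce-c873-4fef-b306-ec65d2f40394" },
  },
];

// Sort a copy by the position each element had in the original array.
const inArrayOrder = [...innerHits].sort(
  (a, b) => a._nested.offset - b._nested.offset
);

// With the order restored, the concatenation from the question is a simple join.
const concatenated = inArrayOrder.map((hit) => hit._source.data).join(" ");

console.log(concatenated); // "some text other text"
```

This avoids changing the mapping again, at the cost of doing the sort after the response arrives.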

Aggregating array of values in elasticsearch

I need to aggregate an array as follows
Two document examples:
{
"_index": "log",
"_type": "travels",
"_id": "tnQsGy4lS0K6uT3Hwzzo-g",
"_score": 1,
"_source": {
"state": "saopaulo",
"date": "2014-10-30T17",
"traveler": "patrick",
"registry": "123123",
"cities": {
"saopaulo": 1,
"riodejaneiro": 2,
"total": 2
},
"reasons": [
"Entrega de encomenda"
],
"from": [
"CompraRapida"
]
}
},
{
"_index": "log",
"_type": "travels",
"_id": "tnQsGy4lS0K6uT3Hwzzo-g",
"_score": 1,
"_source": {
"state": "saopaulo",
"date": "2014-10-31T17",
"traveler": "patrick",
"registry": "123123",
"cities": {
"saopaulo": 1,
"curitiba": 1,
"total": 2
},
"reasons": [
"Entrega de encomenda"
],
"from": [
"CompraRapida"
]
}
},
I want to aggregate the cities array, to find out all the cities the traveler has gone to. I want something like this:
{
"traveler":{
"name":"patrick"
},
"cities":{
"saopaulo":2,
"riodejaneiro":2,
"curitiba":1,
"total":3
}
}
Where the total is the length of the cities array minus 1. I tried the terms aggregation and the sum aggregation, but couldn't produce the desired output.
Changes in the document structure can be made, so if anything like that would help me, I'd be pleased to know.
In the document posted above, "cities" is not a JSON array; it is a JSON object.
If changing the document structure is a possibility, I would change cities in the document to be an array of objects.
Example document:
"cities" : [
{
"city" : "saopaulo",
"count" : 2
},
{
"city" : "riodejaneiro",
"count" : 1
}
]
You would then need to set cities to be of type nested in the index mapping:
"mappings": {
"<type_name>": {
"properties": {
"cities": {
"type": "nested",
"properties": {
"city": {
"type": "string"
},
"count": {
"type": "integer"
},
"value": {
"type": "long"
}
}
},
"date": {
"type": "date",
"format": "dateOptionalTime"
},
"registry": {
"type": "string"
},
"state": {
"type": "string"
},
"traveler": {
"type": "string"
}
}
}
}
After that, you could use a nested aggregation to get the city count per user.
The query would look something along these lines:
{
"query": {
"match": {
"traveler": "patrick"
}
},
"aggregations": {
"city_travelled": {
"nested": {
"path": "cities"
},
"aggs": {
"citycount": {
"cardinality": {
"field": "cities.city"
}
}
}
}
}
}
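To make the target result concrete, here is a small JavaScript sketch that computes the summary from the question on the client, using the two documents in their original shape. It is not the nested aggregation itself, only an illustration of the expected numbers; summarize is a hypothetical helper.

```javascript
// The two travel documents from the question, reduced to the relevant fields.
const travels = [
  { traveler: "patrick", cities: { saopaulo: 1, riodejaneiro: 2, total: 2 } },
  { traveler: "patrick", cities: { saopaulo: 1, curitiba: 1, total: 2 } },
];

// Sum the visits per city for one traveler and count the distinct cities.
function summarize(documents, traveler) {
  const visits = {};
  for (const doc of documents) {
    if (doc.traveler !== traveler) continue;
    for (const [city, count] of Object.entries(doc.cities)) {
      if (city === "total") continue; // per-document total is not a city
      visits[city] = (visits[city] || 0) + count;
    }
  }
  const distinctCities = Object.keys(visits).length;
  return { traveler: { name: traveler }, cities: { ...visits, total: distinctCities } };
}

const summary = summarize(travels, "patrick");
console.log(summary);
```

The distinct-city count here corresponds to what the cardinality sub-aggregation returns, while the per-city sums would need an additional terms and sum aggregation on the nested fields.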

Resources