Is there a way to group by a child document field and include the parent fields in the result?
Imagine you have
[
{
"id": "p1",
"name": "parent 1",
"_childDocuments_": [
{
"id": "p1c1",
"name": "child1_of_parent1",
"color": "red"
},
{
"id": "p1c2",
"name": "child2_of_parent1",
"color": "yellow"
}
]
},
{
"id": "p2",
"name": "parent 2",
"_childDocuments_": [
{
"id": "p2c1",
"name": "child1_of_parent2",
"color": "yellow"
}
]
}
]
in a collection.
Now a query
/select?group=true&group.field=color&group.limit=10
returns
{
"responseHeader":{
"params":{
"group.limit":"10",
"group.field":"color",
"group":"true"
}
},
"grouped":{
"color":{
"matches":3,
"groups":[
{
"groupValue":"red",
"doclist":{"numFound":1,"docs":[
{
"id":"p1c1",
"name":"child1_of_parent1"
}
]
}
},
{
"groupValue":"yellow",
"doclist":{"numFound":2,"docs":[
{
"id":"p1c2",
"name":"child2_of_parent1"
},
{
"id":"p2c1",
"name":"child1_of_parent2"
}
]
}
}
]
}
}
}
But I need a result that contains their parent fields as well, something like
{
"responseHeader":{
"params":{
"group.limit":"10",
"group.field":"color",
"group":"true"
}
},
"grouped":{
"color":{
"matches":3,
"groups":[
{
"groupValue":"red",
"doclist":{"numFound":1,"docs":[
{
"id":"p1c1",
"name":"child1_of_parent1",
"parent":{
"id": "p1",
"name": "parent 1",
}
}
]
}
},
{
"groupValue":"yellow",
"doclist":{"numFound":2,"docs":[
{
"id":"p1c2",
"name":"child2_of_parent1",
"parent":{
"id": "p1",
"name": "parent 1",
}
},
{
"id":"p2c1",
"name":"child1_of_parent2",
"parent":{
"id": "p2",
"name": "parent 2",
}
}
]
}
}
]
}
}
}
I'm coming from relational databases, where this can easily be done. Hopefully there's a way in Solr as well. I'm using Solr 8.7.0.
One solution I found is the [subquery] document transformer. It satisfies the requirements, but the performance is what you would expect when you carry a relational "join" over to a document store.
Reworking the data model into a flat structure would definitely be a better idea.
First I had to add a field to each child doc holding the parent id (the default "_root_" field doesn't work):
[
{
"id": "p1",
"name": "parent 1",
"_childDocuments_": [
{
"id": "p1c1",
"name": "child1_of_parent1",
"color": "red",
"parent_id": "p1"
},
{
"id": "p1c2",
"name": "child2_of_parent1",
"color": "yellow",
"parent_id": "p1"
}
]
},
{
"id": "p2",
"name": "parent 2",
"_childDocuments_": [
{
"id": "p2c1",
"name": "child1_of_parent2",
"color": "yellow",
"parent_id": "p2"
}
]
}
]
Now I can query
/select?group=true&group.field=color&group.limit=10&fl=*%2Cparent%3A%5Bsubquery%5D&parent.q=%7B%21terms+f%3Did+v%3D%24row.parent_id%7D
and it returns
{
"responseHeader":{
"params":{
"group.limit":"10",
"group.field":"color",
"group":"true",
"fl":"*,parent:[subquery]",
"parent.q":"{!terms f=id v=$row.parent_id}"
}
},
"grouped":{
"color":{
"matches":3,
"groups":[
{
"groupValue":"red",
"doclist":{"numFound":1,"docs":[
{
"id":"p1c1",
"name":"child1_of_parent1",
"parent":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id": "p1",
"name": "parent 1"
}
]}
}
]
}
},
{
"groupValue":"yellow",
"doclist":{"numFound":2,"docs":[
{
"id":"p1c2",
"name":"child2_of_parent1",
"parent":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id": "p1",
"name": "parent 1"
}
]}
},
{
"id":"p2c1",
"name":"child1_of_parent2",
"parent":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"id": "p2",
"name": "parent 2"
}
]}
}
]
}
}
]
}
}
}
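For anyone who would rather not URL-encode that request by hand, here is a minimal Python sketch that builds the same parameters. The host and collection name ("http://localhost:8983/solr/mycollection") and the function name are placeholders I made up; only the parameters themselves come from the query above. Note that urlencode also escapes the "*", which decodes to the same value on the server side.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_grouped_subquery_url(base_url, group_field="color", limit=10):
    """Build a /select URL that groups child docs and attaches each child's parent."""
    params = {
        "group": "true",
        "group.field": group_field,
        "group.limit": str(limit),
        # "parent" is the name of the output field filled by the [subquery] transformer.
        "fl": "*,parent:[subquery]",
        # For every returned child row, look up the doc whose id equals row.parent_id.
        "parent.q": "{!terms f=id v=$row.parent_id}",
    }
    return base_url + "/select?" + urlencode(params)


# Hypothetical host and collection, for illustration only.
url = build_grouped_subquery_url("http://localhost:8983/solr/mycollection")
print(url)

# Decode the query string again to check that the parameters round-trip.
decoded = parse_qs(urlsplit(url).query)
print(decoded["parent.q"][0])
```

The round-trip through parse_qs is only there to confirm that the encoded string still carries the exact local-params expression.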
Feel free to comment if this reminds you of a better idea.
How to calculate the sum of confident_score for every individual vendor?
Data stored in the DB:
[
{
"_id": "61cab38891152daf9387c0c7",
"name": "dummy",
"company_email": "abc#mailinator.com",
"brief_msg": "Cillum sed est prae",
"similar_case_ids": [],
"answer_id": [
"61cab38891152daf9387c0c9"
],
"pros_cons": [
{
"vendor_name": "xyzlab",
"score": [
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor F",
"confident_score": 80,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf0f"
},
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor FFF",
"confident_score": 40,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf10"
}
]
},
{
"vendor_name": "abclab",
"score": [
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor B",
"confident_score": 50,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf16"
},
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor BB",
"confident_score": 60,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf17"
}
]
}
]
}
]
The query for fetching the document with the matching id and grouping its objects by vendor_name:
aggregate([
{
$match: { _id: id }
},
{
$unwind: {
path: '$pros_cons'
}
},
{
$group: {
_id: '$pros_cons'
}
},
])
After running the query I get this:
[
{
"_id": {
"vendor_name": "abclab",
"score": [
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor B",
"confident_score": 50,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf16"
},
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor BB",
"confident_score": 60,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf17"
}
],
}
},
{
"_id": {
"vendor_name": "xyzlab",
"score": [
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor F",
"confident_score": 80,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf0f"
},
{
"question_id": "61c5b47198b2c5bbf9f6471c",
"title": "Vendor FFF",
"confident_score": 40,
"text": "text1",
"_id": "61cac505caeeeb3cec78bf10"
}
],
}
}
]
I need to calculate the sum of confident_score for each vendor individually: TOTAL=110 for vendor_name abclab and TOTAL=120 for vendor_name xyzlab.
Required output:
[
{
"vendor_name": "abclab",
"totalScore": 110,
"count" : 2
},
{
"vendor_name": "xyzlab",
"totalScore": 120,
"count" : 2
}
]
$match - Filter documents by _id.
$unwind - Deconstruct the pros_cons array into one document per element.
$project - Shape the output documents, using $reduce to build the totalScore field by summing confident_score over each element of the pros_cons.score array.
db.collection.aggregate([
{
$match: {
_id: "61cab38891152daf9387c0c7"
}
},
{
$unwind: {
path: "$pros_cons"
}
},
{
$project: {
_id: 0,
vendor_name: "$pros_cons.vendor_name",
totalScore: {
$reduce: {
input: "$pros_cons.score",
initialValue: 0,
in: {
$sum: [
"$$value",
"$$this.confident_score"
]
}
}
}
}
}
])
Sample Demo on Mongo Playground
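To sanity-check the numbers without a database, here is a small pure-Python sketch that mirrors the $unwind and $reduce steps on a trimmed copy of the sample document. It also fills in the count field from the required output, which the pipeline above does not produce. The function name vendor_totals and the dict project_with_count are my own; the latter is an untested suggestion that assumes $sum and $size behave in $project as I understand them.

```python
def vendor_totals(doc):
    """Return one row per pros_cons entry with its summed confident_score."""
    rows = []
    for entry in doc["pros_cons"]:  # plays the role of $unwind
        scores = [s["confident_score"] for s in entry["score"]]
        rows.append({
            "vendor_name": entry["vendor_name"],
            "totalScore": sum(scores),  # same arithmetic as the $reduce stage
            "count": len(scores),       # number of score entries per vendor
        })
    return rows


# Trimmed copy of the stored document: only the fields the calculation needs.
sample = {
    "_id": "61cab38891152daf9387c0c7",
    "pros_cons": [
        {"vendor_name": "xyzlab",
         "score": [{"confident_score": 80}, {"confident_score": 40}]},
        {"vendor_name": "abclab",
         "score": [{"confident_score": 50}, {"confident_score": 60}]},
    ],
}

totals = vendor_totals(sample)
print(totals)

# Assumption: a $project stage that could replace the one above to add "count".
project_with_count = {
    "$project": {
        "_id": 0,
        "vendor_name": "$pros_cons.vendor_name",
        "totalScore": {"$sum": "$pros_cons.score.confident_score"},
        "count": {"$size": "$pros_cons.score"},
    }
}
```

The rows come out in the order of the pros_cons array, so add a $sort stage (or sort the list) if the order in the required output matters.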
Given the following search index document schema:
{
"value": [
{
"Id": "abc",
"Name": "Some name",
"Tags": [
{
"Id": "123",
"Name": "Tag123"
},
{
"Id": "456",
"Name": "Tag456"
}
]
},
{
"Id": "xyz",
"Name": "Some name",
"Tags": [
{
"Id": "123",
"Name": "Tag123"
},
{
"Id": "789",
"Name": "Tag789"
}
]
}
]
}
What is the correct syntax for an OData query that returns all records in which any Tags/Id is contained in an input list?
The closest I have got is:
Tags/any(object: object/Id search.in ('123,456,789'))
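For comparison, a hedged sketch of how such a filter string could be assembled in Python. It assumes that search.in is called as a function, search.in(variable, valueList, delimiters), with the lambda variable's field as the first argument; that is my reading of the OData filter syntax, not something confirmed in this thread. The helper name tags_any_in is made up.

```python
def tags_any_in(ids, collection="Tags", key="Id"):
    """Build an OData filter matching records where any collection item has key in ids."""
    for value in ids:
        # Keep the sketch simple: refuse values that would need escaping.
        if "," in value or "'" in value:
            raise ValueError(f"unsupported character in id: {value!r}")
    joined = ",".join(ids)
    # "t" is the lambda range variable; ',' is passed as the explicit delimiter.
    return f"{collection}/any(t: search.in(t/{key}, '{joined}', ','))"


filter_expr = tags_any_in(["123", "456", "789"])
print(filter_expr)
```

The resulting string would be passed as the $filter parameter of the search request.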
I have an autocomplete analyzer for a field ("keywords"). This field is an array of strings. When I query with a search string, I want the documents where a single element of the keywords array matches best to come first. The problem is that if part of the string matches several elements of the array, that document ranks above another one with fewer but better matches. For example, for the query "gas station" the keywords of the returned documents are:
"hits": [
{
"_index": "locali_v3",
"_type": "categories",
"_id": "5810767ddc536a03b4761acd",
"_score": 3.1974547,
"_source": {
"keywords": [
"Radio Station",
"Radio Station"
]
}
},
{
"_index": "locali_v3",
"_type": "categories",
"_id": "581076d8dc536a03b4761cc3",
"_score": 3.0407648,
"_source": {
"keywords": [
"Stationery Store",
"Stationery Store"
]
}
},
{
"_index": "locali_v3",
"_type": "categories",
"_id": "5810767ddc536a03b4761ace",
"_score": 2.903595,
"_source": {
"keywords": [
"TV Station",
"TV Station"
]
}
},
{
"_index": "locali_v3",
"_type": "categories",
"_id": "581076cddc536a03b4761c87",
"_score": 2.517158,
"_source": {
"keywords": [
"Praktoreio Ugrwn Kausimwn/Gkaraz",
"Praktoreio Ygrwn Kaysimwn/Gkaraz",
"Praktoreio Ugron Kausimon/Gkaraz",
"Praktoreio Ygron Kaysimon/Gkaraz",
"Πρακτορείο Υγρών Καυσίμων/Γκαράζ",
"Gas Station"
]
}
}
The "Gas Station" document is fourth, although it has the best single-element match. Is there a way to tell Elasticsearch that I do not care how many times "gas" or "station" appears in keywords? I want the best-matching element of the keywords array to determine the score.
My settings are:
{
"locali": {
"settings": {
"index": {
"creation_date": "1480937810266",
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": "1",
"max_gram": "20"
}
},
"analyzer": {
"keywords": {
"filter": [
"lowercase",
"autocomplete_filter"
],
"char_filter": [
"my_char_filter"
],
"type": "custom",
"tokenizer": "standard"
}
},
"char_filter": {
"my_char_filter": {
"type": "mapping",
"mappings": [
"ί => ι",
"Ί => Ι",
"ή => η",
"Ή => Η",
"ύ => υ",
"Ύ => Υ",
"ά => α",
"Ά => Α",
"έ => ε",
"Έ => Ε",
"ό => ο",
"Ό => Ο",
"ώ => ω",
"Ώ => Ω",
"ϊ => ι",
"ϋ => υ",
"ΐ => ι",
"ΰ => υ"
]
}
}
},
"number_of_shards": "1",
"number_of_replicas": "1",
"uuid": "TJjOt9L9QE2HrsUFHM6zJg",
"version": {
"created": "2040099"
}
}
}
}
}
And the mappings:
{
"locali": {
"mappings": {
"places": {
"properties": {
"formattedCategories": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"keywords": {
"type": "string",
"analyzer": "keywords"
},
"loc": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"location": {
"properties": {
"formattedAddress": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"locality": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"neighbourhood": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
}
}
},
"name": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"rating": {
"properties": {
"rating": {
"type": "long"
}
}
},
"seenDetails": {
"type": "long"
},
"verified": {
"type": "long"
}
}
},
"regions": {
"properties": {
"keywords": {
"type": "string",
"analyzer": "keywords"
},
"loc": {
"properties": {
"coordinates": {
"type": "geo_point"
}
}
},
"name": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"type": {
"type": "long"
},
"weight": {
"type": "long"
}
}
},
"categories": {
"properties": {
"keywords": {
"type": "string",
"analyzer": "keywords"
},
"name": {
"properties": {
"english": {
"type": "string"
},
"greek": {
"type": "string"
}
}
},
"weight": {
"type": "long"
}
}
}
}
}
}
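To make the settings above more concrete, here is a rough Python approximation of what the "keywords" analyzer emits: lowercase tokens expanded into edge n-grams of length 1 to 20. The regex tokenizer is only a stand-in for the standard tokenizer and the Greek char_filter is left out, so treat this as an illustration rather than the real analysis chain. Since the mapping sets only "analyzer", my assumption is that the query text goes through the same expansion.

```python
import re


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text):
    """Approximate the analyzer: split into word tokens, lowercase, edge n-gram."""
    tokens = re.findall(r"\w+", text.lower())
    return [gram for token in tokens for gram in edge_ngrams(token)]


query_terms = set(analyze("gas station"))
for keyword in ("Gas Station", "Radio Station", "Stationery Store"):
    shared = query_terms & set(analyze(keyword))
    print(keyword, sorted(shared))
```

Every keyword containing a word that starts with "s" shares some grams with the query, so repeated entries such as "Radio Station" twice add up across the array.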
Can you also post the query you are trying?
I tried your example with the following query:
{
"query": {"match": {
"keywords": "gas station"
}
}
}
And I got your desired result.
{
"took": 2,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.081366636,
"hits": [
{
"_index": "stack",
"_type": "type",
"_id": "AVjP6QnpdNp-z_ybGd-L",
"_score": 0.081366636,
"_source": {
"keywords": [
"Praktoreio Ugrwn Kausimwn/Gkaraz",
"Praktoreio Ygrwn Kaysimwn/Gkaraz",
"Praktoreio Ugron Kausimon/Gkaraz",
"Praktoreio Ygron Kaysimon/Gkaraz",
"Πρακτορείο Υγρών Καυσίμων/Γκαράζ",
"Gas Station"
]
}
},
{
"_index": "stack",
"_type": "type",
"_id": "AVjP5-u5dNp-z_ybGd-I",
"_score": 0.03182549,
"_source": {
"keywords": [
"Radio Station",
"Radio Station"
]
}
},
{
"_index": "stack",
"_type": "type",
"_id": "AVjP6KiKdNp-z_ybGd-K",
"_score": 0.03182549,
"_source": {
"keywords": [
"TV Station",
"TV Station"
]
}
}
]
}
}
Try this query and see whether you get the desired result. If it doesn't work for you, reply with your mappings, query and ES version.
Hope this solves your problem. Thanks
I have users indexed with categories as follows
{
"id": 1,
"name": "John",
"categories": [
{
"id": 1,
"name": "Category 1"
},
{
"id": 2,
"name": "Category 2"
}
]
},
{
"id": 2,
"name": "Mark",
"categories": [
{
"id": 1,
"name": "Category 1"
}
]
}
And I'm trying to get all the documents with Category 1 or Category 2 with
{
"filter": {
"bool": {
"must": [
{
"terms": { "user.categories.id": [1, 2] }
}
]
}
}
}
But it only returns the first document, the one that has both categories. What am I doing wrong?
As I understand it, terms matches when at least one of the values is contained in the field, so for user 1
user.categories.id: [1, 2]
user 2
user.categories.id: [1]
Category id 1 is contained in both documents.
The best way to handle this is probably with a nested filter. You'll have to specify the "nested" type in your mapping, though.
I can set up an index like this:
PUT /test_index
{
"settings": {
"number_of_shards": 1,
"number_of_replicas": 0
},
"mappings": {
"doc": {
"properties": {
"categories": {
"type": "nested",
"properties": {
"id": {
"type": "long"
},
"name": {
"type": "string"
}
}
},
"id": {
"type": "long"
},
"name": {
"type": "string"
}
}
}
}
}
then add some docs:
PUT /test_index/doc/1
{
"id": 1,
"name": "John",
"categories": [
{ "id": 1, "name": "Category 1" },
{ "id": 2, "name": "Category 2" }
]
}
PUT /test_index/doc/2
{
"id": 2,
"name": "Mark",
"categories": [
{ "id": 1, "name": "Category 1" }
]
}
PUT /test_index/doc/3
{
"id": 3,
"name": "Bill",
"categories": [
{ "id": 3, "name": "Category 3" },
{ "id": 4, "name": "Category 4" }
]
}
Now I can use a nested terms filter like this:
POST /test_index/doc/_search
{
"query": {
"constant_score": {
"filter": {
"nested": {
"path": "categories",
"filter": {
"terms": {
"categories.id": [1, 2]
}
}
}
},
"boost": 1.2
}
}
}
...
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "test_index",
"_type": "doc",
"_id": "1",
"_score": 1,
"_source": {
"id": 1,
"name": "John",
"categories": [
{
"id": 1,
"name": "Category 1"
},
{
"id": 2,
"name": "Category 2"
}
]
}
},
{
"_index": "test_index",
"_type": "doc",
"_id": "2",
"_score": 1,
"_source": {
"id": 2,
"name": "Mark",
"categories": [
{
"id": 1,
"name": "Category 1"
}
]
}
}
]
}
}
Here is the code I used:
http://sense.qbox.io/gist/668aefe910643b52a3a10d40aca67104491668fc
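If it helps to see the intended semantics outside Elasticsearch, here is a tiny pure-Python sketch of what the nested terms filter selects: a document matches when at least one of its nested category objects has an id in the given list. The function name matches_nested_terms is made up; the data is the three documents indexed above.

```python
docs = [
    {"id": 1, "name": "John", "categories": [
        {"id": 1, "name": "Category 1"}, {"id": 2, "name": "Category 2"}]},
    {"id": 2, "name": "Mark", "categories": [
        {"id": 1, "name": "Category 1"}]},
    {"id": 3, "name": "Bill", "categories": [
        {"id": 3, "name": "Category 3"}, {"id": 4, "name": "Category 4"}]},
]


def matches_nested_terms(doc, path, field, wanted):
    """True if any nested object under path has field equal to one of wanted."""
    return any(child.get(field) in wanted for child in doc.get(path, []))


matched_ids = [d["id"] for d in docs
               if matches_nested_terms(d, "categories", "id", {1, 2})]
print(matched_ids)
```

John and Mark match because each has at least one category with id 1 or 2; Bill has neither.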
I need to aggregate an array as follows
Two document examples:
{
"_index": "log",
"_type": "travels",
"_id": "tnQsGy4lS0K6uT3Hwzzo-g",
"_score": 1,
"_source": {
"state": "saopaulo",
"date": "2014-10-30T17",
"traveler": "patrick",
"registry": "123123",
"cities": {
"saopaulo": 1,
"riodejaneiro": 2,
"total": 2
},
"reasons": [
"Entrega de encomenda"
],
"from": [
"CompraRapida"
]
}
},
{
"_index": "log",
"_type": "travels",
"_id": "tnQsGy4lS0K6uT3Hwzzo-g",
"_score": 1,
"_source": {
"state": "saopaulo",
"date": "2014-10-31T17",
"traveler": "patrick",
"registry": "123123",
"cities": {
"saopaulo": 1,
"curitiba": 1,
"total": 2
},
"reasons": [
"Entrega de encomenda"
],
"from": [
"CompraRapida"
]
}
},
I want to aggregate the cities array, to find out all the cities the traveler has gone to. I want something like this:
{
"traveler":{
"name":"patrick"
},
"cities":{
"saopaulo":2,
"riodejaneiro":2,
"curitiba":1,
"total":3
}
}
Here the total is the length of the cities array minus 1. I tried the terms and sum aggregations, but couldn't produce the desired output.
Changes in the document structure can be made, so if anything like that would help me, I'd be pleased to know.
In the document posted above, "cities" is not a JSON array; it is a JSON object.
If changing the document structure is an option, I would change cities into an array of objects.
Example document:
"cities": [
{
"city": "saopaulo",
"count": 2
},
{
"city": "riodejaneiro",
"count": 1
}
]
You would then need to set cities to be of type nested in the index mapping:
"mappings": {
"<type_name>": {
"properties": {
"cities": {
"type": "nested",
"properties": {
"city": {
"type": "string"
},
"count": {
"type": "integer"
},
"value": {
"type": "long"
}
}
},
"date": {
"type": "date",
"format": "dateOptionalTime"
},
"registry": {
"type": "string"
},
"state": {
"type": "string"
},
"traveler": {
"type": "string"
}
}
}
}
After that you could use a nested aggregation to get the city count per user.
The query would look something along these lines:
{
"query": {
"match": {
"traveler": "patrick"
}
},
"aggregations": {
"city_travelled": {
"nested": {
"path": "cities"
},
"aggs": {
"citycount": {
"cardinality": {
"field": "cities.city"
}
}
}
}
}
}
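As a cross-check of what the restructured documents should yield, here is a pure-Python sketch that computes both the per-city visit counts and the number of distinct cities for one traveler. The field names city and count follow the mapping above; the sample values are taken from the two documents in the question, and the function name city_summary is made up. In Elasticsearch the per-city sums would probably need a terms aggregation on cities.city with a sum sub-aggregation on cities.count, which is my assumption and not part of the query above.

```python
from collections import Counter

# The two travel documents from the question, with cities as an array of objects.
travels = [
    {"traveler": "patrick", "cities": [
        {"city": "saopaulo", "count": 1}, {"city": "riodejaneiro", "count": 2}]},
    {"traveler": "patrick", "cities": [
        {"city": "saopaulo", "count": 1}, {"city": "curitiba", "count": 1}]},
]


def city_summary(docs, traveler):
    """Sum visits per city for one traveler and count the distinct cities."""
    visits = Counter()
    for doc in docs:
        if doc["traveler"] != traveler:
            continue
        for entry in doc["cities"]:
            visits[entry["city"]] += entry["count"]
    # "total" corresponds to the cardinality aggregation: distinct city names.
    return {"cities": dict(visits), "total": len(visits)}


summary = city_summary(travels, "patrick")
print(summary)
```

The total here is the distinct-city count, which is what the cardinality aggregation approximates on large data sets.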