Elasticsearch is Aggregating by "Partial Term" instead of "Entire Term"

I'm currently trying to do something fancy in elasticsearch...and it ALMOST works.
Use case: I have to limit the number of results per a certain field to (x) results.
Example: In a result set of restaurants I only want to return two locations per restaurant name. If I search Mexican Food, then I should get (x) Taco Bell hits, (x) Del Taco Hits and (x) El Torito Hits.
The Problem: My aggregation is currently only matching partials of the term.
For instance: if I try to aggregate on company_name, it creates one bucket for taco and another bucket for bell, so Taco Bell can show up in two buckets, resulting in (x) * 2 results for that company.
I find it hard to believe that this is the desired behavior. Is there a way to aggregate by the entire search term?
Here's my current aggregation JSON:
"aggs": {
  "by_company": {
    "terms": {
      "field": "company_name"
    },
    "aggs": {
      "first_hit": {
        "top_hits": { "size": 1, "from": 0 }
      }
    }
  }
}
Your help, as always, is greatly appreciated!

Yes. If your "company_name" is just a regular string with the standard analyzer, or whatever analyzer you are using for "company_name" splits the name into tokens, then that explains it. ES stores terms (tokens), not whole words or the entire text, unless you tell it to.
Assuming your current analyzer for that field does exactly what I described above, you need another field - let's call it "raw" - that mirrors your company_name field but stores the company name as-is.
This is what I mean:
{
  "mappings": {
    "test": {
      "properties": {
        ...,
        "company_name": {
          "type": "multi_field",
          "fields": {
            "company_name": {
              "type": "string" # and whatever else you currently have in your mapping for `company_name`
            },
            "raw": {
              "type": "string",
              "index": "not_analyzed"
            }
          }
        }
      }
    }
  }
}
And in your query, you'll do it like this:
"aggs": {
  "by_company": {
    "terms": {
      "field": "company_name.raw"
    },
    "aggs": {
      "first_hit": {
        "top_hits": { "size": 1, "from": 0 }
      }
    }
  }
}
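Note that multi_field and not_analyzed are legacy syntax. On Elasticsearch 5.x and later, the same pattern is expressed as a text field with a keyword sub-field - a sketch of the equivalent mapping, assuming the same field names:

```json
"company_name": {
  "type": "text",
  "fields": {
    "raw": { "type": "keyword" }
  }
}
```

The terms aggregation then targets company_name.raw exactly as above.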

Related

Elasticsearch match query

I'm searching for some text in a field, but the problem is that whenever two documents contain all the search tokens, the document that repeats the search tokens more often scores higher, instead of the document that is shorter.
My Elasticsearch index contains names of foods, and I want to search for a food in it.
The document structure is like this:
{"text": "NAME OF FOOD"}
Now I have two documents like
1: {"text": "Apple Syrup Apple Apple Syrup Apple Smoczyk's"}
2: {"text": "Apple Apple"}
If I search using this query
{
  "query": {
    "match": {
      "text": {
        "query": "Apple"
      }
    }
  }
}
The first document comes first because it contains more occurrences of Apple.
That is not my expected result: the second document should score higher, because it contains Apple and its length is shorter than the first one.
Elasticsearch scoring gives weight to term frequency and field length. In general shorter fields score higher, but term frequency can offset that.
You can use the unique token filter to de-duplicate the tokens of the text. This way multiple occurrences of the same token will not affect the scoring.
Mapping
{
  "mappings": {
    "properties": {
      "text": {
        "type": "text",
        "analyzer": "my_analyzer"
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "unique",
            "lowercase"
          ]
        }
      }
    }
  }
}
Analyze
GET index29/_analyze
{
  "text": "Apple Apple",
  "analyzer": "my_analyzer"
}
Result
{
  "tokens": [
    {
      "token": "apple",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    }
  ]
}
Only single token is generated even though apple appears twice.
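Conceptually, the unique + lowercase filter chain behaves like this sketch (the tokenizer here is a simplified stand-in for the real standard tokenizer):

```python
def analyze(text):
    # simplified stand-in for the standard tokenizer
    tokens = text.split()
    # "unique" filter: drop repeated tokens, keeping the first occurrence
    seen, unique = set(), []
    for t in tokens:
        if t not in seen:
            seen.add(t)
            unique.append(t)
    # "lowercase" runs after "unique" in the mapping's filter list, so
    # case variants ("Apple" vs "apple") are merged only if they match
    # exactly before lowercasing
    return [t.lower() for t in unique]

print(analyze("Apple Apple"))  # ['apple']
```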

Filter All using Elasticsearch

Let's say I have a User table with fields like name, address, age, etc. There are more than 1000 records in this table, so I use Elasticsearch to retrieve the data one page at a time, 20 records per page.
Now say I want to search for the text "Alexia": is there any record containing Alexia? The special thing is that I want to search this text across all fields in the table.
Does the search text match the name field, the age, the address, or any other? If it does, those records should be returned. We are not going to pass any specific field to the Elasticsearch query. And if more than 20 records match the text, pagination should still work.
Any idea how to write such a query, or how to connect Elasticsearch?
Yes, you can do that with a query_string query:
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gte": **currentTime** -------> This could be the current time, or age, or any property you'd like to do a range query on
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
To get only 20 records you can pass size as 20, and for pagination you can use a range query to fetch the next set of results:
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "query": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gt": 1589570610732 ------------> From previous response
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
You can do the same with a match query as well. If you specify _all in a match query, it will search all fields:
{
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gte": **currentTime**
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
When you are using Elasticsearch to provide search functionality behind a search box, you should avoid query_string, because it throws an error on invalid syntax, whereas other queries return an empty result. You can read about this in the query_string documentation.
_all is deprecated since ES 6.0, so if you are using ES 6.x onwards you can use copy_to to copy the values of all fields into a single field and then search on that single field. You can refer to the copy_to documentation.
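A copy_to mapping for the question's fields might look like this sketch (the all_fields name is just an example, and age is mapped as text here only so it can be copied into the catch-all field):

```json
{
  "mappings": {
    "properties": {
      "name":       { "type": "text", "copy_to": "all_fields" },
      "address":    { "type": "text", "copy_to": "all_fields" },
      "age":        { "type": "text", "copy_to": "all_fields" },
      "all_fields": { "type": "text" }
    }
  }
}
```

A match query on all_fields then searches all three source fields at once.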
For pagination you can make use of the from and size parameters: size says how many documents you want to retrieve, and from says from which hit to start.
Query :
{
  "from": <current-count>,
  "size": 20,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "_all": "Alexia"
          }
        },
        {
          "range": {
            "dateField": {
              "gte": **currentTime**
            }
          }
        }
      ]
    }
  },
  "sort": [
    {
      "dateField": {
        "order": "desc"
      }
    }
  ]
}
You increment the from value on each iteration by the number of documents already retrieved. For example, in the first iteration set from to 0; in the next iteration set it to 20 (since the first iteration returned the first 20 hits and you now want the documents after them). You can refer to this.
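The from/size arithmetic is just an offset calculation - a small sketch:

```python
def page_offset(page, size=20):
    # "from" for a 0-based page number: skip the hits already shown
    return page * size

print(page_offset(0), page_offset(1), page_offset(2))  # 0 20 40
```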

Have result that match condition first

Given a very simple document:
`concert_name` - String containing the name of the concert
`city` - City ID of the concert
`band` - Band ID
`relevance` - An integer that indicate how important the concert is
I want to get all concerts in a specific city, but I want those for a specific band first (sorted by relevance), and then all the others sorted by relevance.
So I can have query like:
Give me all concerts in Milan and return first those for Pearl Jam
How can I do this in Elastica 1.X ?
EDIT 1
I think this can be done by sorting on multiple fields and using a script; you would have to enable dynamic scripting. I assign a value of 10 to the band you want to match and 0 to the others. Try something like this:
{
  "query": {
    "match": {
      "city": "milan"
    }
  },
  "sort": [
    {
      "_script": {
        "script": "if(doc['band'].value == 'pearl') {10} else {0}",
        "type": "number",
        "order": "desc"
      }
    },
    {
      "relevance": {
        "order": "desc"
      }
    }
  ]
}
I am assuming higher number means more important concert. I have tested this on ES 1.7
Does this help?
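The two-level sort the script expresses (band match first, then relevance, both descending) is equivalent to this sketch with hypothetical concert records:

```python
# hypothetical records mirroring the question's fields
concerts = [
    {"concert_name": "c1", "band": "metallica", "relevance": 9},
    {"concert_name": "c2", "band": "pearl", "relevance": 3},
    {"concert_name": "c3", "band": "pearl", "relevance": 7},
    {"concert_name": "c4", "band": "radiohead", "relevance": 5},
]
# band match sorts first (True > False), then relevance, both descending
ordered = sorted(concerts,
                 key=lambda c: (c["band"] == "pearl", c["relevance"]),
                 reverse=True)
print([c["concert_name"] for c in ordered])  # ['c3', 'c2', 'c1', 'c4']
```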

Elasticsearch not returning hits for multi-valued field

I am using Elasticsearch with no modifications whatsoever. This means the mappings, norms, and analyzed/not_analyzed is all default config. I have a very small data set of two items for experimentation purposes. The items have several fields but I query only on one, which is a multi-valued/array of strings field. The doc looks like this:
{
  "_index": "index_profile",
  "_type": "items",
  "_id": "ega",
  "_version": 1,
  "found": true,
  "_source": {
    "clicked": [
      "ega"
    ],
    "profile_topics": [
      "Twitter",
      "Entertainment",
      "ESPN",
      "Comedy",
      "University of Rhode Island",
      "Humor",
      "Basketball",
      "Sports",
      "Movies",
      "SnapChat",
      "Celebrities",
      "Rite Aid",
      "Education",
      "Television",
      "Country Music",
      "Seattle",
      "Beer",
      "Hip Hop",
      "Actors",
      "David Cameron",
      ... // other topics
    ],
    "id": "ega"
  }
}
A sample query is:
GET /index_profile/items/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [
        {
          "terms": {
            "profile_topics": [
              "Basketball"
            ]
          }
        }
      ]
    }
  }
}
Again, there are only two items, and the one listed should match the query because its profile_topics field contains the term "Basketball". The other item does not match. I only get a result if I ask for clicked = ega in the should clause.
With Solr I would probably specify that the fields are multi-valued string arrays with no norms and no analyzer, so profile_topics values are not stemmed or tokenized, since each value should be treated as a single token (even the spaces). I'm not sure this would solve the problem, but it is how I treat similar data in Solr.
I assume I have run afoul of some norms/analyzer/TF-IDF issue; if so, how do I solve it so that even with two items the query returns ega? If possible I'd like to solve this index- or type-wide rather than per field.
"Basketball" (with a capital B) in a terms query will not be analyzed, which means it will be looked up in the Elasticsearch index exactly as written.
You say you have the defaults. If so, indexing Basketball in the profile_topics field means the actual term in the index is basketball (with a lowercase b), the output of the standard analyzer. So either set profile_topics to not_analyzed, or search for basketball instead of Basketball.
Read this about terms.
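For the second option (matching the analyzed term), the only change to the question's query is lowercasing the value - a sketch:

```json
{
  "query": {
    "bool": {
      "should": [
        { "terms": { "profile_topics": [ "basketball" ] } }
      ]
    }
  }
}
```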
Regarding setting all the fields to not_analyzed, you could do that with a dynamic template. With a template you can also do what Logstash does: define a .raw subfield for each string field, and only that subfield is not_analyzed. The original/parent field still holds the analyzed version of the same text, in case you need the analyzed form in the future.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
  "template": "your_indices_name-*",
  "mappings": {
    "_default_": {
      "_all": {
        "enabled": true,
        "omit_norms": true
      },
      "dynamic_templates": [
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "analyzed",
              "omit_norms": true,
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                }
              }
            }
          }
        }
      ]
    }
  }
}

elastic search array score

I have documents with an array field, like:
doc1
{
  "item_type": "bag",
  "color": ["red", "blue", "green", "orange"]
}
doc2
{
  "item_type": "shirt",
  "color": ["red"]
}
When I do a multi_match search like
{
  "query": {
    "multi_match": {
      "query": "red bag",
      "type": "cross_fields",
      "fields": ["item_type", "color"]
    }
  }
}
doc2 gets a much higher score. I understand that a field with fewer items scores higher, and it gets worse the more colors doc1 has.
So is there a way to make Elasticsearch score an array field the same no matter how many items it contains?
If you do not want field length (fieldNorm) to be taken into account during scoring, you can disable norms for the field in the mapping.
For the above example, the mapping would be:
{
  "properties": {
    "item_type": {
      "type": "string"
    },
    "color": {
      "type": "string",
      "norms": {
        "enabled": false
      }
    }
  }
}
This article from elasticsearch definitive guide gives a good insight into field-length-norms.
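On Elasticsearch 5.x and later, the norms object was flattened to a boolean, so the equivalent mapping would be (a sketch, with text replacing the legacy string type):

```json
{
  "properties": {
    "item_type": { "type": "text" },
    "color":     { "type": "text", "norms": false }
  }
}
```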