Elasticsearch: Interaction of Highlights with Synonym Filter - solr

We have an analyzer which includes a synonym filter, defined as follows:
synonym_filter:
    type: synonym
    synonyms_path: synonyms.txt
    ignore_case: true
    expand: true
    format: solr
In the synonym file we have a synonym defined as follows:
dawdle,waste time
Then in our data we have an entity with a name field "dawdle company".
Because of the synonym filter this gets analyzed to something like:
1 -dawdle- 2 -company- 3
1 -wasted- 2 -time- 3
With "time" and "company" in the same position. Then, when performing a search for "wasted time", we get a hit on this entity. We would like the highlight to be "dawdle", since that is the equivalent synonym, but it seems Elasticsearch sees this as two hits, since it matched "wasted" and "time", and it returns two highlights: "dawdle" and "company".
Is there a recommended way to solve this kind of issue, where an unexpected word is returned in the highlights because it occupies the same position as a search term that was inserted because of a synonym?

@SergeyS, the situation both you and @user2430530 have is described in this section of the documentation.
The suggestion there is to define a single term for each series of synonyms, so that you don't get back that mix-up of terms highlighted in the result.
Something like this:
"analysis": {
"analyzer": {
"synonym": {
"tokenizer": "whitespace",
"filter": [
"synonym"
]
}
},
"filter": {
"synonym": {
"type": "synonym",
"synonyms": [
"dawdle, waste time=>waste_time"
]
}
}
}
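For completeness, a rough sketch of a search request with highlighting that exercises this analyzer (the index name my_index is a placeholder; the field name text matches the highlight output below, and the synonym analyzer is assumed to be applied to that field):
GET /my_index/_search
{
    "query": {
        "match": { "text": "waste time" }
    },
    "highlight": {
        "fields": { "text": {} }
    }
}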
Then you'll get the desired result from ES:
"highlight": {
"text": [
"some <em>dawdle</em> company"
]
}

Related

Differences between Suggesters and NGram

I've built an index with a Custom Analyzer
"analyzers": [
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "ingram",
"tokenizer": "whitespace",
"tokenFilters": [ "lowercase", "NGramTokenFilter" ],
"charFilters": []
}
],
"tokenFilters": [
{
"#odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"name": "NGramTokenFilter",
"minGram": 3,
"maxGram": 8
}
],
I came upon Suggesters and was wondering what the pros/cons are between these two approaches.
Basically, I'm building a JavaScript autocomplete text box. I need to do partial text search inside the search text (i.e. search=ell would match on "Hello World").
Azure Search offers two features to enable this depending on the experience you want to give to your users:
- Suggestions: https://learn.microsoft.com/en-us/rest/api/searchservice/suggestions
- Autocomplete: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
Suggestions will return a list of matching documents even with incomplete query terms, and you are right that it can be reproduced with a custom analyzer that uses ngrams. It's just a simpler way to accomplish that (since we took care of setting up the analyzer for you).
Autocomplete is very similar, but instead of returning matching documents, it will simply return a list of completed "terms" that match the incomplete term in your query. This will make sure terms are not duplicated in the autocomplete list (which can happen when using the suggestions API, since as I mentioned above, suggestions return matching documents, rather than a list of terms).
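As a rough illustration, the two REST calls look like this (the service, index, suggester, and api-version values are placeholders, and both APIs require a suggester to be defined in the index schema; the ngram-analyzer approach from the question would instead use an ordinary search query):
GET https://[service].search.windows.net/indexes/[index]/docs/suggest?api-version=[api-version]&search=ell&suggesterName=[suggester]
GET https://[service].search.windows.net/indexes/[index]/docs/autocomplete?api-version=[api-version]&search=ell&suggesterName=[suggester]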

How to perform a full-text search in Vespa?

I am trying to do a full-text search on a field of some documents, and I was looking for your advice on how to do so. I first tried this type of request:
GET http://localhost:8080/search/?query=lord+of+the+rings
But it only returned documents where the field was an exact match and contained no information other than the given string, so I tried the equivalent in YQL:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text CONTAINS "lord of the rings";
And I got the exact same results. But on reading further in the documentation I came across the MATCHES operator, and it indeed gives me the results I seem to be looking for, with this kind of request:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text MATCHES "lord of the rings";
Though, for reasons I don't understand, some requests of this type gave me a timeout error like this:
{
    "root": {
        "id": "toplevel",
        "relevance": 1,
        "fields": {
            "totalCount": 0
        },
        "errors": [
            {
                "code": 12,
                "summary": "Timed out",
                "source": "site",
                "message": "Timeout while waiting for sc0.num0"
            }
        ]
    }
}
So I worked around this issue by setting a timeout larger than the default:
GET http://localhost:8080/search/?yql=SELECT * FROM site WHERE text MATCHES "lord of the rings";&timeout=20000
My question is: am I doing full-text search the right way, and how could I improve it?
EDIT: Here is the corresponding search definition:
search site {
    document site {
        field text type string {
            stemming: none
            normalizing: none
            indexing: attribute
        }
        field title type string {
            stemming: none
            normalizing: none
            indexing: attribute
        }
    }
    fieldset default {
        fields: title, text
    }
    rank-profile post inherits default {
        rank-type text: about
        rank-type title: about
        first-phase {
            expression: nativeRank(title, text)
        }
    }
}
What does your search definition file look like? I suspect you have put your text content in an "attribute" field, which defaults to "word match" semantics. You probably want "text match" semantics, which means you'll need to put your content in an "index" type field.
https://docs.vespa.ai/documentation/reference/search-definitions-reference.html#match
The "MATCHES" operator you are using interprets your input as a regular expression, which is powerful, but slow as it applies the regular expression on all attributes (further optimizations to something like https://swtch.com/~rsc/regexp/regexp4.html are possible but not currently implemented).

How can you retrieve a full nested document in Solr?

In my instance of Solr 4.10.3 I would like to index JSON documents with a nested structure.
Example:
{
    "id": "myDoc",
    "title": "myTitle",
    "nestedDoc": {
        "name": "test name",
        "nestedAttribute": {
            "attr1": "attr1Val"
        }
    }
}
I am able to store it correctly through the admin interface:
/solr/#/mySchema/documents
and I'm also able to search and retrieve the document.
The problem I'm facing is that when I get the response document from my Solr search, I cannot see the nested attributes. I only see:
{
    "id": "myDoc",
    "title": "myTitle"
}
Is there a way to include ALL the nested fields in the returned documents?
I tried with : "fl=[child parentFilter=title:myTitle]" but it's not working (ChildDocTransformerFactory from:https://cwiki.apache.org/confluence/display/solr/Transforming+Result+Documents). Is that the right way to do it or is there any other way?
I'm using: Solr 4.10.3!!!!!!
To get the whole nested structure returned, you indeed need to use the ChildDocTransformerFactory. However, you first need to index your documents properly.
If you just pass your structure as it is, Solr will index the nested objects as separate documents and won't know that they're actually connected. If you want to be able to correctly query nested documents, you'll have to pre-process your data structure as described in this post, or try using (and modifying as needed) a pre-processing script. Unfortunately, up to and including the latest Solr 6.0, there's no nice and smooth solution for indexing and returning nested document structures, so everything is done through workarounds.
Particularly in your case, you'll need to transform your document structure into this:
{
    "type": "parentDoc",
    "id": "myDoc",
    "title": "myTitle",
    "_childDocuments_": [
        {
            "type": "nestedDoc",
            "name": "test name",
            "_childDocuments_": [
                {
                    "type": "nestedAttribute",
                    "attr1": "attr1Val"
                }
            ]
        }
    ]
}
Then, the following ChildDocTransformerFactory query will return all subdocuments (by the way, although it is documented as available since Solr 4.9, I've actually only seen it work in Solr 5.3, so you need to test):
q=title:myTitle&fl=*,[child parentFilter=type:parentDoc limit=50]
Note that although it returns all nested documents, the returned document structure will be flattened (alas!), i.e., you'll get:
{
    "type": "parentDoc",
    "id": "myDoc",
    "title": "myTitle",
    "_childDocuments_": [
        {
            "type": "nestedDoc",
            "name": "test name"
        },
        {
            "type": "nestedAttribute",
            "attr1": "attr1Val"
        }
    ]
}
Probably not really what you expected, but... this is Solr's unfortunate behavior, which will hopefully be fixed in a future release.
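For reference, a rough sketch of how the transformed structure might be posted to Solr's JSON update handler (the collection name collection1 and the child document id are placeholders; child documents generally need their own unique ids, and support for the _childDocuments_ key in JSON updates should be verified against your Solr version):
curl 'http://localhost:8983/solr/collection1/update?commit=true' \
    -H 'Content-Type: application/json' \
    --data-binary '[
        {
            "type": "parentDoc",
            "id": "myDoc",
            "title": "myTitle",
            "_childDocuments_": [
                { "id": "myDoc-child1", "type": "nestedDoc", "name": "test name" }
            ]
        }
    ]'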
You can put
q={!parent which=}
and, in the fl field, fl=*,[child parentFilter=title:myTitle].
It will give you all the parent fields and the child fields of title:myTitle.

Fuzzy search in Elasticsearch different than match with fuzziness in a bool query

I'm trying to figure out why the following queries produce vastly different results. I'm told a fuzzy query is almost never a good idea, per this document (Found-fuzzy), so I'm trying to use a match query with a fuzziness parameter instead. The two produce extremely different results, and I'm not sure what the best way of doing this is.
My example is a movie title containing 'batman'. The user, however, types 'bat man' (with a space). It makes sense that a fuzzy query should find 'batman'. It would also find other variations like 'spider man', but for now that's OK, I guess. (Not really, but...)
So the fuzzy search is actually returning more relevant results than the match query below. Any ideas?
--fuzzy:
{
    "query": {
        "bool": {
            "should": [
                {
                    "fuzzy": {
                        "title": {
                            "value": "bat man",
                            "boost": 4
                        }
                    }
                }
            ],
            "minimum_number_should_match": 1
        }
    }
}
--match:
{
    "query": {
        "bool": {
            "should": [
                {
                    "match": {
                        "title": {
                            "query": "bat man",
                            "boost": 4
                        }
                    }
                }
            ],
            "minimum_number_should_match": 1
        }
    }
}
EDIT
I'm adding examples of what gets returned.
First, nothing gets returned by the match query, even with a high fuzziness value added (fuzziness: 5).
But I do get several 'batman'-related titles with the fuzzy query, such as 'batman' or 'batman returns'.
This gets even stranger when I run the fuzzy search on more fields: if I search my 'starring' field (which contains lists of actors) in addition to the title field, I get 'jason bateman' as well as the title 'batman'.
{
    "_index": "store24",
    "_type": "searchdata",
    "_id": "081227987909",
    "_score": 4.600759,
    "fields": {
        "title": [
            "Batman"
        ]
    }
},
{
    "_index": "store24",
    "_type": "searchdata",
    "_id": "883929053353",
    "_score": 4.1418676,
    "fields": {
        "title": [
            "Batman Forever"
        ]
    }
},
{
    "_index": "store24",
    "_type": "searchdata",
    "_id": "883929331789",
    "_score": 3.5298011,
    "fields": {
        "title": [
            "Batman Returns"
        ]
    }
}
BEST SO FAR (STILL NOT GREAT)
What I've found works best so far is to combine both queries. This seems redundant, but I can't yet make one work like the other. So this seems to be better:
"should": [
{
"fuzzy": {
"title": {
"boost": 6.0,
"min_similarity": 1.0,
"value": "batman"
}
}
},
{
"match": {
"title": {
"query": "batman",
"boost": 6.0
,"fuzziness": 1
}
}
}
]
Elasticsearch analyzes documents and converts them into terms, which are what is actually searched (not the documents themselves). The key difference between the two query types is how they treat the query text: the match query runs it through the field's analyzer, while the fuzzy query is a term-level query that does not analyze it at all. So consider the example:
A match query for 'bat man' is analyzed into the two terms 'bat' and 'man', each of which is too many edits away from the indexed term 'batman' for the fuzziness parameter to bridge, so nothing matches. The fuzzy query instead takes 'bat man' as a single, unanalyzed term, and that term is within one edit of both 'batman' and 'bateman', which is why Jason Bateman showed up once the starring field was included.
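You can see what the match query will actually search for with the _analyze API (a sketch; the JSON-body form shown here is for newer Elasticsearch versions, older versions take the same values as query-string parameters, and the standard analyzer is an assumption, so substitute whatever the title field actually uses):
GET /_analyze
{
    "analyzer": "standard",
    "text": "bat man"
}
This returns the tokens "bat" and "man", confirming that the match query never sees the single term "bat man".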
More detailed information on how text fields are analyzed when searching can be found at http://exploringelasticsearch.com/searching_data.html#sec-searching-analysis:
When a search is performed on an analyzed field, the query itself is analyzed, matching it up to the documents which are analyzed when added to the database. Reducing words to these short tokens normalizes the text allowing for fast efficient lookups. Whether you're searching for "rollerblading" in any form, internally we're just looking for "rollerblad".

Solr, adding a record via JSON with a multi-value field and boosted values

I'm pretty new to Solr. I'm trying to add a multi-value field with a boost value defined for each value, all via JSON. In other words, I'd like this to work:
[
    {
        "id": "ID1000",
        "tag": [
            { "boost": 1, "value": "A test value" },
            { "boost": 2, "value": "A boosted value" }
        ]
    }
]
I know how to do that in XML (multiple <field name='tag' boost='...'> elements), but the JSON code above doesn't work; the server says "Error parsing JSON field value. Unexpected OBJECT_START". Is this a Solr limitation or a bug?
PS: I fixed the originally missing ']', and that's not the problem.
EDIT: It seems the way to go should be payloads (http://wiki.apache.org/solr/Payloads), but I couldn't make them work in Solr (I followed this: http://sujitpal.blogspot.co.uk/2011/01/payloads-with-solr.html). Leaving the question open to see if someone can help further.
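For contrast, this is roughly what the XML update format mentioned above accepts (a sketch using the question's field names; note that the first answer below points out that an index-time boost on a multiValued field ends up applying to all of its values):
<add>
    <doc>
        <field name="id">ID1000</field>
        <field name="tag" boost="1.0">A test value</field>
        <field name="tag" boost="2.0">A boosted value</field>
    </doc>
</add>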
Found the following sentence in the Solr Relevancy FAQ, in the Query Elevation Component section:
An Index-time boost on a value of a multiValued field applies to all values for that field.
I do not think adding an individual boost to each value in the multiValued field is going to work. I know that the XML allows it, but I would guess that it only applies the boost value from the last value applied to the field.
So, based on that, I would change the JSON to the following and see if that works.
[
    {
        "id": "ID1000",
        "tag": {
            "boost": 2,
            "value": [ "A test value", "A boosted value" ]
        }
    }
]
The JSON seems to be invalid; it is missing a closing ]:
[
    {
        "id": "ID1000",
        "tag": [
            {
                "boost": 1,
                "value": "A test value"
            },
            {
                "boost": 2,
                "value": "A boosted value"
            }
        ]
    }
]
You hit an edge case. You can have a boost on a single value, and you can have an array of values, but not one inside the other (from my reading of the Solr 4.1 source code).
That might be something to create as an enhancement request.
If you are generating that JSON by hand, you can try:
"tag": { "boost": 1, "value": "A test value" },
"tag": { "boost": 2, "value": "A boosted value" }
I believe Solr will merge the values then. But if you are generating it via a framework, the framework will most likely disallow or override duplicate object property names (tag here).
The error has nothing to do with boosting.
I get the same error with a very simple JSON doc.
No luck solving it.
See Solr errors when trying to parse a collection: Error parsing JSON field value. Unexpected OBJECT_START
I hit the same error message. Actually, the error message was misleading: the real underlying error was that two of the fields required by schema.xml in the Solr configuration were missing from the JSON payload.
An error message along the lines of "required fields are missing in the document" would have been more helpful here. You might want to check whether some required fields are missing from your JSON payload.
