Solr highlighting gives field/snippets with ANY term, instead of those that satisfy the query fully - solr

I'm using Solr 5.x, standard highlighter, and i'm getting snippets which matches even one of the search terms only, even if i indicate q.op=AND.
I need ONLY the fields and snippets that matches ALL the terms (unless i say q.op=OR or just omit it), i.e. the field/snippet must satisfy the query. Solr does return the field/snippet that has all the terms, but also return many others.
I'm using hl.fl=*, to get the only fields having the terms, and searching against the default field ('text' containing full doc). Need to use * since i have multiple dynamic fields. Most fields are 'text_general' type (for search and HL), and some are 'string' type for faceting.
If its not possible for snippets to have all the terms, i MUST get only the fields that satisfy the query fully (since the question is more talking about matching all the terms, but the search query can become arbitrarily complex, so the fields/snippets should match the query).
Also, next is to get snippets highlighted with proximity based search/terms. What should i do/use for this? The fields coming in highlighting in this scenario should also satisfy the proximity query (unlike i get a field that contain any term, without regard to proximity constrains and other query terms etc)
Thanks for your help.

I've also encountered the same problem with highlighting. In my case, the query like
(foo AND bar) OR eggs
highlighted eggs and foo despite bar was not present in the document. I didn't manage to come up with proper solution, however I devised a dirty workaround.
I use the following query:
id:highlighted_document_id AND text:(my_original_query)
with debugQuery set to true. Then I parse explain text for highlighted_document_id. The text contains the terms from the query, which have contributed to the score. The terms, which should not be highlighted, are not present in the explanation.
The Python regex expressions I use to extract the terms (valid for Solr 5.2.1):
term_regex = re.compile(r'weight\(text:(.+) in')
wildcard_term_regex = re.compile(r'text:(.+), product')
then I simply search the markings in the highlighted text and remove them if the term doesn't match against any of the term in term_regex and wildcard_term_regex.
The solution is probably pretty limited, but works for me.

Related

How to perform an exact search in Solr

I implementing Solr search using an API. When I call it using the parameters as, "Chillout Lounge", it returns me the collection which are same/similar to the string "Chillout Lounge".
But when I search for "Chillout Lounge Box", it returns me results which don't have any of these three words.(in the DB there are values which have these 3 values, but they are not returned.)
According to me, Solr uses Fuzzy search, but when it is done it should return me some values, which will have at least one these value.
Or what could be the possible changes I should to my schema.XML, such that is would give me proper values.
First of all - "Fuzzy search" is a feature you'll have to ask for (by using ~ in standard Lucene query syntax).
If you're talking about regular searches, you can use q.op to select which operator to use. q.op=AND will make sure that all the terms match, while q.op=OR will make any document that contain at least one of the terms be returned. As long as you aren't using fq for this, the documents that match more terms should be scored higher (as the score will add up across multiple terms), and thus, be shown higher in the result set.
You can use the debug query feature in the web interface to see scores for each term for a document, and find out why the document was returned at all. If the document doesn't match any terms, it shouldn't be returned, unless you're asking for all documents to be returned.
Be aware that the analyzer chain defined for the field you're searching might affect what's considered a match and not.
You'll have to add a proper example to get a more detailed answer.

No result return by Solr when query contains word that is not in the collection

I am trying to set up Solr but encountered the problem mentioned in the title. I just downloaded Solr and used the built-in example. When I used a query with words occurred in the example documents, such as "ipod". Solr worked properly. However, when I added some words that are not in these documents, such as "what". Solr does not return anything. For me, it is weird since the relevance scores should be computed to query terms separately and added up. Non-existing query term should not affect the ranking (even though the coord norm is affected, thus the scores of documents will change).
Could anyone tell me what might be the issue? Thanks.
There are several ways of configuring how you want this behavior. I'll assume that you're using the edismax query handler for these examples, although some of these also apply to the standard lucene query parser.
The reason for not always wanting "ipod what" to retrieve the same subset sa "ipod" is that you'll get a poor result set and user experience for terms that are more general than "ipod" (i.e. searching for "microsoft windows" will not be perceived as a good search result if you're showing only general hits for anything about windows - it's usually better to say "we didn't find anything" in those cases). It all depends on your use case.
First, you can do it yourself, by applying either AND or OR between terms to get the exact kind of matching you're looking for.
You can use q.op to configure wether each term should be AND-ed together (all required) or OR-ed together (any one is sufficient). This overrides the (now deprecated) value from <solrQueryParser defaultOperator=".."/> in schema.xml.
For (e)dismax, there's the mm parameter, which allows you do more specific, but in a general way, handling of how you want matches to be performed. mm allows you to say "at least 50% of the terms should match" or "if there's only two terms, both should match, but any over that should be optional" or "match everything up to four, and 75% after that".

Haystack/Solr boosting results if the query is found in a specific field

We're having issues with non relevant results being returned as the highest results in our search and we're trying to improve that behavior, but not really sure how.
We have SearchIndex with about a dozen fields. The document=True field is a template backed field that we have placed the majority of the content into. Some of the stuff found in there is much less relevant than other stuff, even if it's still useful.
To give a concrete example: if a user searches for "red rose", we want to return red roses as the top results...even better if lower results are just roses or just red, or even are described as being "rose red" in color.
The issue is our document=True field has a ton of items that are described as being "rose red". Worse the actual red roses don't have "red" and "rose" particularly close to each other as those values would come from disparate fields. As a result we get the top few hundred results that are completely irrelevant.
What we would like to do is either:
A. Search the primary document and then search each of our other fields and boost (but not hard filter) accordingly. If the term "rose" appears in one of the items names and "red" appears as one of it's attribute values than that result should have a higher score. This gives us the optimal results in theory sorted by relevancy.
B. Search all fields at once and boost if the value is any of the "boosted" fields.
It seems like using field boost should be the answer, but we can't figure out how to express it since filtering based on a field is a harsh exclude and we want it to only impact the relevance scoring.
The result of both of these is effectively the same. We just can't figure out how to do either of them with Haystack. Or if we'd have to fall back to raw queries how to write a solr query that accomplishes this.
I can give you some pointers, as I did not get the exact use case :-
You can check on Solr edismax query parser to configure:-
Fields you want to search on - Mainly to select the results
Variable boost on fields for relevancy - To determine the importance on fields
Variable boost for different words combination e.g. single words, phrase match, shingle match with slop to determine relevancy
Provide additional boost on other fields
This will help you to filter the results and order them accordingly as per the field and word combination matches

Terms Prevalence in SolR searches

Is there a way to specify a set of terms that are more important when performing a search?
For example, in the following question:
"This morning my printer ran out of paper"
Terms such as "printer" or "paper" are far more important than the rest, and I don't know if there is a way to list these terms to indicate that, in the global knowledge, they'd have more weight than the rest of words.
For specific documents you can use QueryElevationComponent, which uses special XML file in which you place your specific terms for which you want specific doc ids.
Not exactly what you need, I know.
And regarding your comment about users not caring what's underneath, you control the final query. Or, in the worst case, you can modify it after you receive it at Solr server side.
Similar: Lucene term boosting with sunspot-rails
When you build the query you can define what are the values and how much these fields have weight on the search.
This can be done in many ways:
Setting the boost
The boost can be set by using "^ "
Using plus operator
If you define + operator in your query, if there is a exact result for that filed value it is shown in the result.
For a better understanding of solr, it is best to get familiar with lucene query syntax. Refer to this link to get more info.

Solr Index appears to be valid - but returns no results

Solr newbie here.
I have created a Solr index and write a whole bunch of docs into it. I can see
from the Solr admin page that the docs exist and the schema is fine as well.
But when I perform a search using a test keyword I do not get any results back.
On entering * : *
into the query (in Solr admin page) I get all the results.
However, when I enter any other query (e.g. a term or phrase) I get no results.
I have verified that the field being queried is Indexed and contains the values I am searching for.
So I am confused what I am doing wrong.
Probably you don't have a <defaultSearchField> correctly set up. See this question.
Another possibility: your field is of type string instead of text. String fields, in contrast to text fields, are not analyzed, but stored and indexed verbatim.
I had the same issue with a new setup of Solr 8. The accepted answer is not valid anymore, because the <defaultSearchField> configuration will be deprecated.
As I found no answer to why Solr does not return results from any fields despite being indexed, I consulted the query documentation. What I found is the DisMax query parser:
The DisMax query parser is designed to process simple phrases (without complex syntax) entered by users and to search for individual terms across several fields using different weighting (boosts) based on the significance of each field. Additional options enable users to influence the score based on rules specific to each use case (independent of user input).
In contrast, the default Lucene parser only speaks about searching one field. So I gave DisMax a try and it worked very well!
Query example:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video
You can also specify which fields to search exactly to prevent unwanted side effects. Multiple fields are separated by spaces which translate to + in URLs:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features+text
Last but not least, give the fields a weight:
http://localhost:8983/solr/techproducts/select?defType=dismax&q=video&qf=features^20.0+text^0.3
If you are using pysolr like I do, you can add those parameters to your search request like this:
results = solr.search('search term', **{
'defType': 'dismax',
'qf': 'features text'
})
In my case the problem was the format of the query. It seems that my setup, by default, was looking and an exact match to the entire value of the field. So, in order to get results if I was searching for the sit I had to query *sit*, i.e. use wildcards to get the expected result.
With solr 4, I had to solve this as per Mauricio's answer by defining type="text_en" to the field.
With solr 6, use text_general.

Resources