Solr exact match field boosting - solr

I have this requirement: if the query text match exactly with a particular field value (the title field) the result must be first or al least be boosted.
So I need to boost the results with the exact match.
My solution is to create the title as an untokenized field, so it'll match only exactly, and boost this the title with an edismax query.
Is there any othere way?
How can I index a field untokenized? So without tokenize on spaces?

Use a KeywordTokenizer - this will index the field as a single value, but still allow you to attach filters - for example to lowercase the text before storing the token.
If you don't want to perform lowercasing either, you can use a string (StrField) field - a string field will only give a hit if the value is exactly the same.
This is usually what you'll do to give exact hits a larger boost than other hits - and you can use the qf parameter to dismax (which you probably are already) to give this list. Use copyField to index the content into separate fields with different definitions.

Related

Searching with String-style within Text fields in Solar

I have a field containing short texts (a few tokens). I index it as Text rather than String because I need to search within the text.
However, I need to search with the String-style (matching the entire field).
For example, if a field is Google Search Engine. I currently find the row by searching "search engine". While preserving this behavior, I need another option to catch the row only if the search term is "google search engine".
I believe it is possible by regex, but it should be slow.
I wonder if there is a standard way to do so or if I need to add another field of the same content but with the String type.
Use multiple fields - the definition of the second field will differ based on whether you want the search to be case sensitive or not. If you're OK with having a case sensitive field (i.e. "Google" and "google" are different terms), then string is the correct choice.
If you want the field to be case insensitive, use a TextField with a KeywordTokenizer (which keeps the input as a single, large token) with a LowercaseFilter attached (which lowercases the content).
You can then search both fields by using qf - query fields - with the edismax/dismax query parses and score them differently. If you only need explicit searching (you choose whether you want to match the whole string or just words in it yourself), using the field name in the regular way would work.
Use a copyField instruction to index the same content into both fields without changing your indexing pipeline. You'll need to reindex your core / collection for the new field to get any values.
And no, you can't do this with a regex, since the regex is applied against the tokens. You already have the tokens split up into smaller parts, so /foo bar/ doesn't have a foo bar token to match against, just foo and bar - neither match the regex.

Solr Text field and String field - different search behaviour

I am working on Solr 4+.
I have several fields into my solr schema with different solr field types.
Does the search on text field and string field differs?
Because I am trying to search on string field (which is a copy field of few facet fields) which does not work as expected. The destination string field is indexed and stored both.
However, when I change destination field which a text field (only indexed), it works fine.
Can you suggest why this happens? What is exactly the difference between text and string fields in solr in respect to searches?
TextFields usually have a tokenizer and text analysis attached, meaning that the indexed content is broken into separate tokens where there is no need for an exact match - each word / token can be matched separately to decide if the whole document should be included in the response.
StrFields cannot have any tokenization or analysis / filters applied, and will only give results for exact matches. If you need a StrField with analysis or filters applied, you can implement this using a TextField and a KeywordTokenizer.
A general text field that has reasonable, generic cross-language defaults: it tokenizes with StandardTokenizer, removes stop words from case-insensitive "stopwords.txt" (empty by default), and down cases. At query time only, it also applies synonyms.
The StrField type is not analyzed, but indexed/stored verbatim.

Solr docs must match one field

I have two fields
text field .. All important fields like category, product name, brand are copied into it.
attributes field .. All attributes are copied into this field.
I have a single search query e.g. "50 mm diameter drill"
I want to search this string in both fields. I am assuming that this will match all products that have drill in the text field.
I want to narrow down the result in case any attributes that match any of 50 mm diameter.
And in case none matches in the attributes field I want to return all documents that match text field.
Edit: I dont want any docs which don't match text field.
I only want that if search is matched to attributes field, and docs are found we return only those docs.
If not found we return all docs which match text field
This is getting a bit tricky and a lot of things depend on your field processing requirements.
You will need to use a combination of field weighting, to rank attributes field higher and edismax minimum match mm
Minimum match allows you to configure how many terms in the query must be hit in order for it to display results. This helps weed out documents that only hit on one term in one field.
Lastly, if you really want to have your own logic in here, you can prepend field with + to make it mandatory. For example +attributes:drill will only return items that have drill in the attributes field.
Whether "drill" will match depends on how your fields are processed, but probably, yes. The easiest way to do this is to not limit by "if not matched here, do this ..", but to score matches in the attributes field higher. You can do this by using qf (if using (e)dismax) together with their weights, such as attributes^20 text which will score any match in attributes 20 times more than a match in text. Any search matching documents with the correct term in attributes will then be scored higher than those just matching in text.
You can also do something similar in the q parameter, where you can weight each term separately: text:drill OR attributes:drill^20.

Solr DisMax query equivalent

I am trying to set up elevate handler in SOLR 3.5.0 and I need the equivalent of the below query in dismax format which defines different boost values on the same field based on the match type(exact match gets 200 whereas wildcard match gets 100).
q=name:(foo*^100.0 OR foo^200.0)
This is one way to solve this problem.
Keep a text field with only WhiteSpaceTokenizer (and maybe LowerCaseFilter depending on your case-sensitivity needs). Use this field for the exact match. Let's call this field name_ws.
Instead of using a wild-card query on name_ws, use a text-type copy field with EdgeNGramTokenizer in your analyzer chain, which will output tokens like:
food -> f, fo, foo, food
Let's call this field name_edge.
Then you can issue this dismax query:
q=foo&defType=dismax&qf=name_ws^200+name_edge^100
(Add debugQuery=on to verify if the scoring works the way you want.)

Different indexing and search strategies on same field without doubling index size?

For a phrase search, we want to bring up results only if there's an exact match (without ignoring stopwords). If it's a non-phrase search, we are fine displaying results even if the root form of the word matches etc.
We currently pass our data through standardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter. Due to this when user wants to search for "password management", search brings up results containing "password manager".
If I remove StemFilter, then I will not be able to match for the root form of the word for non-phrase queries. I was thinking if I should index the same data as part of two fields in document.
For the first field (to be used for phrase searches), following tokenizers/filters will be used:
StandardTokenizer, LowerCaseFilter
For the second field (Non-phrase searches)
StandardTokenizer, StopFilter, PorterStemFilter, LowerCaseFilter
Now, based on whether it's a phrase search or not, I need to rewrite user's query to search in the appropriate field.
Is this the right way to address this issue? Is there any other way to achieve this without doubling index size?
let's say user's query is
summary:"Furthermore, we should also fix this"
Internally this will be translated to
summary_field1:"Furthermore, we should also fix this"
If user's query is
summary:(Furthermore, we should also fix this)
Internally this will be translated to
+summary_field2:furthermor +summary_field2:we +summary_field2:should +summary_field2:also +summary_field2:fix
both summary_field1 and summary_field2 index the same data. summary_field1 passes through only StandardTokenizer and LowerCaseFilter, whereas summary_field2 passes through StandardTokenizer, StopFilter, PorterStemFilter and LowerCaseFilter.
Please let me know if I'm missing something here.
By defining two different fields you can search for exact matches.
By using boosts you can also bring results in one query. For example:
(firstField:"password management")^5 OR (secondField:"pasword management")^1

Resources