Solr and searching phrases with double quotes - solr

I have an ecommerce site where I am implementing Solr (using the Solarium library) and there are product names and descriptions that contain double quotes (usually standing for inches). Before I started to grasp the analyzer and tokenizer portion of Solr, I simply assigned the datatype of text_en_splitting to fields that would contain this data. If someone searches for the phrase - blue 1" binder - the double quote is being removed and the first 10 results being returned are not necessarily binders. The results returned seem to be matching the word blue and the number 1 (they aren't binders). Looking through the analysis of the query in Solr admin, I see the double quotes are getting removed from the WordDelimiterFilterFactory. I like WordDelimiterFilterFactory for other reasons (like dealing with the phrase post-it note) so I'm trying to find a happy medium here. Is there a better way to both index and query fields that contain double quotes that should be kept in place when performing searches (because they actually mean something)?

What I ended up doing was adding a replacement filter before the word delimiter and used the word inch.
<filter class="solr.PatternReplaceFilterFactory" pattern='(\d)"' replacement='$1 inch' replace="all"/>

Solr Query Parsers (such as DisMax) use a call to
SolrPluginUtils.stripUnbalancedQuotes(userQuery))
to remove unbalanced quotes. Balanced quotes are for phrase queries.
So you should really design your own query parser.
You may also consider replacing quotes to feet at the front end, before query comes to Solr.

Related

substring match in solr query

I have a requirment where I have to match a substring in a query .
e.g if the field has value :
PREFIXabcSUFFIX
I have to create a query which matches abc. I always know the length of the prefix.
I can not use EdgeNgram and Ngram because of the space constraints.(As they will create more indexes.)
So i need to do this on query time and not on index time. Using a wildcard as prefix something like *abc* will have high impact on performance .
Since I will know the length of the prefix I am hoping to have some way where I can do something like ....abc* where dots represents the exact length of the prefix so that the query is not as bad as searching for the whole index as in the case of wild card query (*abc*).
Is this possible in solr ? Thanks for your time .
Solr version : 4.10
Sure, Wildcard syntax is documented here, you could search something like ????abc*. You could also use a regex query.
However, the performance benefit from this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
You could use the RegularExpressionPatternTokenizer for this. For the sample below I guessed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would become abcSUFFIX. This way you may search for abc*
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.

Can changes in synonyms.txt file take effect without reindex?

We are using Sunspot-solr 4.0 when I update synonyms file it does not change anything in search. Do I really need to re-index after making changes in synonyms.txt or there is any other trick to update synonyms file that I am missing?
That depends on when you're expanding the synonyms. If you're expanding at query time, the updates will be visible without any reindexing, but if you're expanding at index time (which is the recommended way), you'll have to reindex to get the new synonyms included in the index.
The reasoning behind recommending expansion at index time compared to query time is described in the old wiki:
This is because there are two potential issues that can arrise at query time:
The Lucene QueryParser tokenizes on white space before giving any text to the Analyzer, so if a person searches for the words sea biscit the analyzer will be given the words "sea" and "biscit" seperately, and will not know that they match a synonym.
Phrase searching (ie: "sea biscit") will cause the QueryParser to pass the entire string to the analyzer, but if the SynonymFilter is configured to expand the synonyms, then when the QueryParser gets the resulting list of tokens back from the Analyzer, it will construct a MultiPhraseQuery that will not have the desired effect. This is because of the limited mechanism available for the Analyzer to indicate that two terms occupy the same position: there is no way to indicate that a "phrase" occupies the same position as a term. For our example the resulting MultiPhraseQuery would be "(sea | sea | seabiscuit) (biscuit | biscit)" which would not match the simple case of "seabiscuit" occuring in a document
Even when you aren't worried about multi-word synonyms, idf differences still make index time synonyms a good idea. Consider the following scenario:
An index with a "text" field, which at query time uses the SynonymFilter with the synonym TV, Televesion and expand="true"
Many thousands of documents containing the term "text:TV"
A few hundred documents containing the term "text:Television"
A query for text:TV will expand into (text:TV text:Television) and the lower docFreq for text:Television will give the documents that match "Television" a much higher score then docs that match "TV" comparably -- which may be somewhat counter intuitive to the client. Index time expansion (or reduction) will result in the same idf for all documents regardless of which term the original text contained.
There's an really detailed explanation of what's actually happening behind the scenes available in Better synonym handling in Solr.
As long as you're aware of these issues and the trade-off, doing query time synonyms could work fine - but you'll have to test it against your queries and what you expect the results to be - and be aware of the pitfalls.

Multi-word synonym search in Solr

I'm trying to use a synonym filter to search for a phrase.
peter=> spider man, spiderman, Mary Jane, .....
I use the default configuration. When I put these synonyms into synonym.txt and restart Solr it seems to work only partially: It starts to search for "spider", "man", "spiderman", "Mary" and "Jane" but what I want to search for are the meaningful combinations - like "spider man", "Mary Jane" and "spiderman".
Yes sadly this is a well known problem due to how the Solr query parser breaks up on whitespace before analyzing. So instead of seeing "spider" before "man" in the token stream, you instead simply see each word on its own. Just "spider" with nothing before/after and just "man" with nothing before/after.
This is because most Solr query forms see a space as basically an "OR". Search for "spider OR man" instead of looking at the full text, analyzing it to generate synonyms, then generating a query from that.
For more background, there's this blog post
There's a large number of solutions to this problem, including the following:
hon-lucene-synonyms. This plugin runs an analyzer before generating an edismax query over multiple fields. It's a bit of a blackbox, and I've found it can generate some complex query forms that generate weird performance and relevance bugs.
Lucidwork's autophrase query parser By selectively autophrasing, this plugin lets you specify key phrases (spider man) that should not be broken into OR queries and can have synonym expansion applied
OpenSource Connection's Match query parser. Searches a single field using a query-specified analyzer run before the field is searched. Also searches multi-word synonyms as phrases. My favorite, but disclaimer: I'm the author :)
Rene Kriegler's Querqy -- Querqy is a Solr plugin for query preprocessing rules. These rules can identify your key phrases and rewrite the query to non-multiterm form.
Roll your own: Learn to write your own query parser plugin and handle the problem however you want.
My usually strategy for this kind of problem is to use the synonym filter not to expand a search to include all of the possible synonyms, but to normalize to a single form. I do this both in my index and query field analysis.
For example, with this line in my fieldType/analyzer block in schema.xml:
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
(Note the expand="false")
...and this line in my synonyms.txt:
spiderman, spider man, Mary Jane => peter
This way I make sure that any of these four values will be indexed and searched as "peter". For example, if the source document mentions "The Amazing Spider Man" it will be indexed as "The Amazing peter". When a user searches for "Mary Jane" it will search for "peter" instead, so it will match.
The important thing here is that because "Mary" is not one of the comma-separated synonyms, it won't be changed if it appears without "Jane" following. So searching for "Mary is amazing" will actually search for "Mary is amazing", and it will not match the document.
One of the important details, is that I choose a normalized form (e.g. "peter") that is only one word. I could organize it this way:
peter, spiderman, spider man => Mary Jane
but because Mary Jane is two words, it may (depending on other features of my search), match the two words separately as well as together. By choosing a single word form to normalize into, I make sure that my tokenizer won't try to break it up.
It's a known limitation within Solr / Lucene. Essentially you would have to provide an alternative form of tokenization so that specific space delimited words (i.e. phrases) are treated as single words.
One way of achieving this is to do this client side - i.e. in your application that is calling Solr, when indexing, keep a list of synonym phrases and find / replace those phrase values with an alternative (for example removing the spaces or replacing it with a delimiter that isn't treated as a token boundary).
E.g. if you have "Hello There" as a phrase you want to use in a synonym, then replace it with "HelloThere" when indexing.
Now in your synonyms.txt file you can have (for example):
Hi HelloThere Wotcha => Hello
Similarly when you search, replace any incidences of "Hello There" in the query string with HelloThere and then they will be matched as a synonym of Hello.
Alternatively, you could use the AutoPhraseTokenFilter that LucidWorks created, available on github. This works by maintaining a token stream so that it can work out if a combination of two or more sequential tokens matches one of the synonym phrases, and if it doesn't, it throws away the first token as not matching the phrase. I'm not sure how much overhead this adds, but it seems a good approach - would be nice to have by default in Solr as part of the SynonymFilter.

Questions about SQL Server Full-Text Search Syntax

I'm trying to work out some of the finer details of the Full-Text Search syntax. I have the basics down but have run up against the following questions.
Near the top of this page, it shows cases where double quotes are required and includes the example "oatmeal". Why does this need double quotes? Is this a typo?
Near the bottom of the same page, it states that any instances of AND, AND NOT, OR, or OR NOT should be wrapped in quotes in some cases. Why? Aren't these stop words anyway?
This page provides a number of examples that include quoted and unquoted search terms. What is the difference between CONTAINS(Column, 'term') and CONTAINS(Column, '"term"')?
1) Full text search uses double quotes as a text delimiter and its good practise to use them especially if you are likely to have more complex search terms. E.g. You can put phrases or words in double quotes.
"Oatmeal"
"Hot Oatmeal"
2) I think the logic in the "and and not or or not" section is for this is to find the boolean terms and wrap the content either side in quotes.
e.g. OatMeal or Hot Oatmeal
translate to
'"Oatmeal" or "hot oatmeal"'
rather than
'oatmeal "or" hot oatmeal'
3) It should be contains(column, 'terms') containstable would use the 3 term query containstable(table, column, 'terms') I have done some testing on a database here with 30k documents in it contains(,'"dentist"') and contains(,'dentist') returned the same 652 rows. Containstable returned them with the same ranking.

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming, because my first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory. .
This will create indexes for ngrams or parts of words. Documents, with a min ngram size of 5 and max ngram size of 8, would index: Docum Docume Document Documents
There is a bit of a tradeoff for index size and time. One of the Solr books quotes as a rough guide: Indexing takes 10 times longer Uses 5 times more disk space Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries. As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution

Resources