substring match in solr query - solr

I have a requirement to match a substring in a query.
E.g. if the field has the value:
PREFIXabcSUFFIX
I have to create a query that matches abc. I always know the length of the prefix.
I cannot use EdgeNGram or NGram because of space constraints (they would make the index larger).
So I need to do this at query time and not at index time. Using a leading wildcard, something like *abc*, will have a high impact on performance.
Since I know the length of the prefix, I am hoping there is a way to do something like ....abc*, where the dots represent the exact length of the prefix, so that the query is not as expensive as searching the whole index, as happens with the wildcard query (*abc*).
Is this possible in Solr? Thanks for your time.
Solr version : 4.10

Sure. Wildcard syntax is documented here; you could search for something like ????abc*, with one ? per prefix character. You could also use a regex query.
However, the performance benefit of this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
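As a rough sketch of how such a query string could be assembled on the client, assuming the term contains no characters that need escaping in Solr query syntax: the helper names below are made up for illustration, and fnmatchcase is only used as a local sanity check of the pattern shape, not as a model of how Solr evaluates it.

```python
from fnmatch import fnmatchcase


def fixed_prefix_wildcard(term: str, prefix_len: int) -> str:
    """One '?' per prefix character, then the term, then a trailing '*'."""
    return "?" * prefix_len + term + "*"


def fixed_prefix_regex(term: str, prefix_len: int) -> str:
    """The same idea as a regex query, wrapped in slashes."""
    return "/.{%d}%s.*/" % (prefix_len, term)


# "PREFIX" is 6 characters long in the example from the question.
pattern = fixed_prefix_wildcard("abc", 6)
regex_query = fixed_prefix_regex("abc", 6)

# For plain alphanumeric input, fnmatchcase treats '?' and '*' as
# "exactly one character" and "any run of characters".
matches = fnmatchcase("PREFIXabcSUFFIX", pattern)
no_match = fnmatchcase("PREabcSUFFIX", pattern)
```

Either string would then be sent as the value for the field being searched.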

You could use the RegularExpressionPatternTokenizer for this. For the sample below I assumed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would be indexed as abcSUFFIX, so you can search for abc* without a leading wildcard.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
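To check what the pattern extracts before reindexing, the same expression can be tried locally. This is only a sketch: it uses Python's re module rather than Java's regex engine (the two agree for a pattern this simple), and the function name is made up.

```python
import re

PREFIX_LEN = 6  # same assumption as in the analyzer above
pattern = re.compile(r".{%d}(.+)" % PREFIX_LEN)


def index_token(value: str):
    """Return capture group 1 (the text after the prefix), or None if the value is too short."""
    match = pattern.fullmatch(value)
    return match.group(1) if match else None


token = index_token("PREFIXabcSUFFIX")
too_short = index_token("short")
```

A value no longer than the prefix produces no token at all, so such documents would not be findable through this field.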

Related

Priority for compounded words in Solr

I am trying to improve my search results using Solr.
Currently I am working on compound words. So far I am getting results for the compound word and its parts, but the problem is that there is no priority/weight between the results.
I would like the results relating to the original compound word to have a higher weight/priority than those for its parts.
Is there a way to do this with Solr?
As an example, the search word might be "støvsuger". Currently I am getting equal results for "støvsuger", "støv" and "suger". What I would like is for "støvsuger" to be weighted higher than "støv" and "suger".
This is what I am currently doing:
<filter class="solr.DictionaryCompoundWordTokenFilterFactory" minSubwordSize="4" dictionary="lang/ordbog.txt" onlyLongestMatch="true" maxSubwordSize="15" minWordSize="7"/>
The current querystring looks like this:
{0}Portal1_{1}_{2}/select?defType=edismax
&fl=id,title,shortDescription,htmlContent,kbId,score
&mm={3}
&q={4}
&qf=_priorityKeywords^60 title^80 portalTitle^60 shortDescription^50 htmlContent^20
&pf=_priorityKeywords^60 title^100 portalTitle^60 shortDescription~10^50 htmlContent~10^20
&rows=500
&wt=json
&tie=0.1
You can ignore {0}, {1} and {2}; {3} is the number of search words and {4} is the search word/term.
Have one field with the content without the compound word token filter and one field with the compound word token filter. Boost hits in the field without the compound word token filter more than hits in the one with (I'll just assume we're talking about a category here, but it'd work the same for any text referring to vacuum cleaners in your case):
qf=category_without_compounds^5 category_with_compounds
.. will give a 5x weight to any hits in the field without the expanded compound words. You can also give an even larger boost to exact hits (where the search query matches a category or title exactly, for example).
You define a copyField instruction in your schema / collection configuration to copy the same content into both fields automagically.
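A minimal sketch of building the request parameters for this on the client side. The two field names are the placeholder names used in the qf example above, and the boost of 5 is just the example value; build_params is a made-up helper.

```python
from urllib.parse import urlencode, parse_qs


def build_params(term: str, boost: int = 5) -> str:
    """URL-encode edismax parameters that favour the field without compound splitting."""
    params = {
        "defType": "edismax",
        "q": term,
        # Hits in the unsplit field count `boost` times as much as hits in the split field.
        "qf": f"category_without_compounds^{boost} category_with_compounds",
    }
    return urlencode(params)


query_string = build_params("støvsuger")
decoded = parse_qs(query_string)  # round-trip to check the encoding
```

The resulting string would be appended to the /select URL in place of the q and qf parameters shown in the question.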

How to configure Solr to use synonyms with KeywordTokenizerFactory

Synonym, e.g.: "AAA" => "AVANT AT ALJUNIED"
If I search for AAA*BBB,
I want to get AVANT AT ALJUNIEDBBB.
I was using StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative position of the search words.
I also tried StandardTokenizerFactory and other filters such as WordDelimiterFilterFactory to split the words on *. That doesn't work.
You can't - synonyms work with tokens, and KeywordTokenizer keeps the whole string as a single token. So you can't expand just one part of the string when indexing if you're using KT.
In addition, the SynonymFilter isn't MultiTermAware, so it's not invoked at query time when doing a wildcard search - so you can't expand synonyms for parts of the string there, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr or, if the number of replacements is small, having filters do pattern replacements inside the strings when indexing, so that both versions are indexed.
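The preprocessing option could look roughly like this on the client. It is a sketch under assumptions: the synonym table is the single example from the question, abbreviations are taken to be runs of letters and digits, and the result would still need whitespace escaped before being used as a wildcard term against a keyword-tokenized field.

```python
import re

# Example mapping taken from the question; a real table would be loaded from elsewhere.
SYNONYMS = {"AAA": "AVANT AT ALJUNIED"}


def expand_synonyms(query: str, synonyms: dict) -> str:
    """Replace whole alphanumeric runs that appear in the synonym table."""
    def replace(match):
        word = match.group(0)
        return synonyms.get(word, word)

    # Matching whole runs means "AAA" inside "AAAB" is left untouched.
    return re.sub(r"[A-Za-z0-9]+", replace, query)


expanded = expand_synonyms("AAA*BBB", SYNONYMS)
untouched = expand_synonyms("AAAB*", SYNONYMS)
```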

issue in searching uppercase string with wildcard

I am using Solr search. My search field contains both diamond and Diamond.
When I search for Diamond or diamond, I get the correct results. But when I search with a wildcard, I get results for diamond* and no results for Diamond*, although I have applied <filter class="solr.LowerCaseFilterFactory"/>.
Could you please suggest what the issue might be?
"Unlike other types of Lucene queries, Wildcard, Prefix, and Fuzzy queries are not passed through the Analyzer, which is the component that performs operations such as stemming and lowercasing"
http://wiki.apache.org/lucene-java/LuceneFAQ#Are_Wildcard.2C_Prefix.2C_and_Fuzzy_queries_case_sensitive.3F
The page behind this link also describes the workaround for this problem.
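One common way to deal with this, assuming the index analyzer lowercases everything, is to lowercase wildcard terms yourself before sending the query. This is a sketch of that idea and not necessarily the workaround described in the FAQ; the function name is made up, and it only handles simple space-separated queries.

```python
def lowercase_wildcard_terms(query: str) -> str:
    """Lowercase only the terms that contain '*' or '?', leaving field names and operators alone."""
    result = []
    for token in query.split(" "):
        # Split off an optional "field:" part so the field name keeps its case.
        field, sep, value = token.rpartition(":")
        if "*" in value or "?" in value:
            value = value.lower()
        result.append(field + sep + value)
    return " ".join(result)


simple = lowercase_wildcard_terms("Diamond*")
with_operator = lowercase_wildcard_terms("Diamond* AND Ring")
with_field = lowercase_wildcard_terms("Name:Diamond*")
```

Terms without wildcards are left alone because they still go through the analyzer on the Solr side.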

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based on searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
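The effect can be reproduced with a toy model. The stemmer below is a stand-in with a made-up suffix rule, not any real Solr stemmer; it exists only so that both words map to "gatorad" as in the example.

```python
def toy_stem(word: str) -> str:
    """Hypothetical stemmer: strips one trailing 'es', 'e' or 's'."""
    word = word.lower()
    for suffix in ("es", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


# Index time: only the stems end up in the term index.
source_words = ["gatorade", "gatorades"]
term_index = sorted({toy_stem(w) for w in source_words})


def expand_prefix(prefix: str, terms: list) -> list:
    """Wildcard expansion: the prefix is NOT stemmed, it is compared to the indexed terms as-is."""
    return [t for t in terms if t.startswith(prefix.lower())]


# A plain query is stemmed before lookup, so it finds the stem.
plain_hits = [t for t in term_index if t == toy_stem("gatorade")]
# The wildcard query looks for terms starting with "gatorade" and finds none.
wildcard_hits = expand_prefix("gatorade", term_index)
```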
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
So what I was looking for is to turn the search term 'gatorade' into 'gatorade OR gatorade*', which will give me all the matches I'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
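Constructing that query on the client is straightforward. A small sketch, assuming the input is a handful of plain words with no special characters; requiring every word (one + clause per word) is an assumption of this example, not something the query above prescribes.

```python
def term_or_prefix(user_input: str) -> str:
    """Turn each word into a required clause matching the word itself or anything starting with it."""
    clauses = [f"+({word} {word}*)" for word in user_input.split()]
    return " ".join(clauses)


single = term_or_prefix("gatorade")
double = term_or_prefix("blue gatorade")
```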
Another alternative is to use n-grams via token filter factories, specifically the EdgeNGramFilterFactory.
This indexes n-grams, i.e. parts of words. With a min n-gram size of 5 and a max n-gram size of 8, the word Documents would be indexed as: Docum, Docume, Documen, Document.
There is a bit of a tradeoff in index size and indexing time. One of the Solr books quotes as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard characters in your queries: you aren't doing a wildcard search, you are matching a search term against n-grams (parts of words).
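To see which terms an edge n-gram filter would produce for a given word and size range, here is a small sketch of the idea. It mimics front-anchored n-grams only and is not the filter's actual implementation.

```python
def edge_ngrams(token: str, min_gram: int, max_gram: int) -> list:
    """Return the leading substrings of `token` with lengths min_gram..max_gram."""
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


grams = edge_ngrams("Documents", 5, 8)
none_for_short_word = edge_ngrams("Doc", 5, 8)
```

Words shorter than the minimum size produce no grams at all, so the minimum should be chosen with the shortest expected search term in mind.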
My guess is that the missing matches are "Gatorade" (with a capital 'G'), and that you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them.
See this article about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read, wildcards only matched words with additional characters after the search term: "Gatorade*" would match Gatorades but not Gatorade itself. It appears that an update in Solr 3.6 takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution

How to do partial beginning matches in Solr?

I'm trying to search for partial beginning matches on a big list of lastnames. So Wein* should find Weinberg, Weinkamm etc.
I could do this by creating a special field, and adding
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" preserveOriginal="1"/>
to its type specification in schema.xml. When I add the line above only to the indexing analyzer and leave it out of the query analyzer, I can then simply search for special_field:Wein and get the expected results.
Now I see that solr also has a *-syntax. What's the connection between EdgeNGramFilterFactory and the *-syntax?
Am I doing things correctly or is there a better, more regular way?
Thanks!
Or just do a simple wild card match:
name:Pe*
I don't recommend the Wein* query. It is implemented internally as a PrefixQuery, which rewrites the original query to include all terms that have the prefix "Wein". Depending on how large your index is (I mean how many terms it contains), this query rewriting can be a bottleneck.
The EdgeNGramFilter at index time is a better approach. This solution will use more space, but queries will be processed much faster.
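The tradeoff between the two approaches can be sketched with a toy in-memory index. This only models the idea (scan the terms at query time versus store every prefix at index time); the names and the data are invented, and real Lucene term dictionaries are far more efficient than a Python dict scan.

```python
from collections import defaultdict

names = {1: "Weinberg", 2: "Weinkamm", 3: "Meyer"}

# Query-time approach: one term per name, prefix queries scan the terms.
terms = {name.lower(): doc_id for doc_id, name in names.items()}


def prefix_query(prefix: str) -> list:
    """Visit every term and keep the ones that start with the prefix."""
    prefix = prefix.lower()
    return sorted(doc for term, doc in terms.items() if term.startswith(prefix))


# Index-time approach: every leading substring becomes its own term.
gram_index = defaultdict(set)
for doc_id, name in names.items():
    lowered = name.lower()
    for n in range(1, len(lowered) + 1):
        gram_index[lowered[:n]].add(doc_id)


def gram_lookup(term: str) -> list:
    """A single exact lookup, no scanning and no wildcard."""
    return sorted(gram_index.get(term.lower(), set()))


scan_hits = prefix_query("Wein")
lookup_hits = gram_lookup("Wein")
```

Both return the same documents; the n-gram index simply holds many more terms, which is the extra space mentioned above.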
Note: I also asked this question in the Lucene forum where I got a good answer:
http://lucene.472066.n3.nabble.com/How-to-do-partial-beginning-matches-td781147.html
