I have a requirment where I have to match a substring in a query .
e.g if the field has value :
PREFIXabcSUFFIX
I have to create a query which matches abc. I always know the length of the prefix.
I can not use EdgeNgram and Ngram because of the space constraints.(As they will create more indexes.)
So i need to do this on query time and not on index time. Using a wildcard as prefix something like *abc* will have high impact on performance .
Since I will know the length of the prefix I am hoping to have some way where I can do something like ....abc* where dots represents the exact length of the prefix so that the query is not as bad as searching for the whole index as in the case of wild card query (*abc*).
Is this possible in solr ? Thanks for your time .
Solr version : 4.10
Sure, Wildcard syntax is documented here, you could search something like ????abc*. You could also use a regex query.
However, the performance benefit from this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
You could use the RegularExpressionPatternTokenizer for this. For the sample below I guessed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would become abcSUFFIX. This way you may search for abc*
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
I am a beginner in Solr. In my project, NGramFilterFactory and EdgeNGramFilterFactory, both are being used for a field. My understanding as per the document is EdgeNGramFilterFactory is used for "starts with" query while NGramFilterFactory is suitable for "contains" query.
I indexed a small dataset for both combinations (one in which I used only NGramFilterFactory and in another I used both NGramFilterFactory and EdgeNGramFilterFactory) but I did not see any difference in the output.
If my understanding is correct, in a way EdgeNGramFilterFactory is a subset of NGramFilterFactory. If this is true then is there any benefit of using both types of filters on the same field?
You should not be using both filters on the same field, they will completely mess up your matching. If you need to match in a middle of a token, you use NGrams. If you only need to match from the start, you use EdgeNGrams. Never both together.
How do you set up partial (substring) fuzzy match in Solr 4.2.1?
For example, if you have a list of US cities indexed, I would like a search term "Alber" to match "Alburquerque".
I have tried using the NGramFilterFactory on the <fieldType> and rebuilt the index but queries do not return results as expected - they still work as if I had just done the standard text_general defaults. Exact matches work, and explicit fuzzy searches would work given sufficient similarity (for example "Alberquerque~" with one misspelling would work.)
I did go to the analyzer tool in the Solr admin and saw that my ngrams were indeed being generated.
Is there something i'm missing from the query side?
Or should I take a different approach altogether?
And can this work with dismax? (Multiple fields indexed like this with different weights)
Thanks!
I want to do a phrase search in solr without analyzers being applied to it
eg - If I search for "DelhiDareDevil" (i.e - with inverted commas)it should search the exact text and not apply any analyzers or tokenizers on this field
However if i search for DelhiDareDevil it should use tokenizers and analyzers and split it to something like this delhi dare devil
any help would be appreciated
Not sure of any Out of the Box approach, you can copy field the content to a new field with no analysis.
And, You can define your own query parser and have the query being searched on the field with no analysis.
I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming, because my first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is based by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory. .
This will create indexes for ngrams or parts of words. Documents, with a min ngram size of 5 and max ngram size of 8, would index: Docum Docume Document Documents
There is a bit of a tradeoff for index size and time. One of the Solr books quotes as a rough guide: Indexing takes 10 times longer Uses 5 times more disk space Creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit wildcard character in your queries. As you aren't doing a wildcard search, you are matching a search term on ngrams(parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution