How to do partial beginning matches in Solr?

I'm trying to search for partial beginning matches on a big list of last names, so Wein* should find Weinberg, Weinkamm, etc.
I could do this by creating a special field, and adding
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" preserveOriginal="1"/>
to its type specification in schema.xml. When I add the line above only to the index analyzer and leave it out of the query analyzer, I can then search with just special_field:Wein and get the expected results.
Now I see that Solr also has a * syntax. What's the connection between EdgeNGramFilterFactory and the * syntax?
Am I doing things correctly or is there a better, more regular way?
Thanks!

Or just do a simple wildcard match:
name:Pe*

I don't recommend the Wein* query. It is implemented internally as a PrefixQuery, which rewrites the original query to include all terms that have the prefix "Wein". Depending on how large your index is (that is, how many terms it contains), this query rewriting can become a bottleneck.
The EdgeNGramFilter at index time is a better approach. This solution will use more space, but queries will be processed much faster.
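To make the trade-off concrete, here is a minimal Python sketch of what edge n-gramming at index time amounts to. It is only an illustration, not Solr's implementation: the helper names edge_ngrams and build_index are made up for this example, and the gram sizes mirror the minGramSize/maxGramSize values from the question.

```python
from collections import defaultdict

def edge_ngrams(term, min_gram=1, max_gram=50):
    """Return the leading substrings of term with lengths min_gram..max_gram."""
    longest = min(len(term), max_gram)
    return [term[:n] for n in range(min_gram, longest + 1)]

def build_index(names, **kwargs):
    """Map every edge n-gram to the set of names that produced it."""
    index = defaultdict(set)
    for name in names:
        for gram in edge_ngrams(name, **kwargs):
            index[gram].add(name)
    return index

names = ["Weinberg", "Weinkamm", "Weber", "Meier"]
index = build_index(names)

# The query side does no n-gramming: the typed prefix is looked up as one term.
matches = sorted(index.get("Wein", set()))
print(matches)  # ['Weinberg', 'Weinkamm']
```

The index holds one entry per prefix of every name, which is where the extra space goes. In exchange, a prefix search becomes a single term lookup instead of an expansion over all terms that share the prefix.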

Note: I also asked this question in the Lucene forum where I got a good answer:
http://lucene.472066.n3.nabble.com/How-to-do-partial-beginning-matches-td781147.html

Related

substring match in solr query

I have a requirement where I have to match a substring in a query.
For example, if the field has the value:
PREFIXabcSUFFIX
I have to create a query which matches abc. I always know the length of the prefix.
I cannot use EdgeNGram or NGram filters because of space constraints (they would create more index entries).
So I need to do this at query time and not at index time. Using a leading wildcard, something like *abc*, will have a high impact on performance.
Since I know the length of the prefix, I am hoping there is some way to do something like ....abc*, where the dots represent the exact length of the prefix, so that the query is not as bad as searching the whole index, as in the case of the wildcard query (*abc*).
Is this possible in Solr? Thanks for your time.
Solr version: 4.10

Sure. With the documented wildcard syntax, where ? matches exactly one character, you could search for something like ????abc*. You could also use a regex query.
However, the performance benefit from this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
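As a rough illustration of the difference between the two patterns, the Python sketch below uses fnmatch, whose ? and * behave like the wildcards discussed here. The term list and the scan helper are invented for the example, and a prefix length of 6 is assumed. Note that both patterns are still tested against every term.

```python
from fnmatch import fnmatchcase

# Hypothetical indexed terms, for illustration only.
terms = ["PREFIXabcSUFFIX", "PREFIXxyzSUFFIX", "abcPREFIXSUFFIX", "PREabcFIXSUFFIX"]

def scan(terms, pattern):
    """Test every term against the pattern; no term can be skipped up front."""
    return [t for t in terms if fnmatchcase(t, pattern)]

anywhere = scan(terms, "*abc*")          # abc at any position
fixed_offset = scan(terms, "??????abc*")  # abc right after exactly 6 characters

print(anywhere)
print(fixed_offset)
```

The fixed-offset pattern is more selective in what it matches, but it does not reduce the number of terms that have to be examined.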

You could use the Regular Expression Pattern Tokenizer (solr.PatternTokenizerFactory) for this. For the sample below I guessed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would be indexed as abcSUFFIX, so you can search for abc*.
<analyzer>
  <tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
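The effect of that pattern can be checked outside Solr. The following Python sketch applies the same regular expression with the re module; pattern_tokenize is a made-up helper that imitates the "extract group 1 of each match" mode, and the prefix length of 6 is the same guess as in the answer.

```python
import re

PREFIX_LENGTH = 6  # assumed fixed prefix length, as guessed in the answer
PATTERN = re.compile(r".{%d}(.+)" % PREFIX_LENGTH)

def pattern_tokenize(text, pattern=PATTERN, group=1):
    """Return the requested group of every successive match as a token."""
    return [m.group(group) for m in pattern.finditer(text)]

tokens = pattern_tokenize("PREFIXabcSUFFIX")
print(tokens)  # ['abcSUFFIX']

# With the prefix stripped at index time, abc* is an ordinary prefix match.
print([t for t in tokens if t.startswith("abc")])  # ['abcSUFFIX']
```

Values shorter than the assumed prefix produce no token at all with this pattern, so they would not be searchable through this field.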

Solr: Localities & solr.ICUCollationField usage?

I'm learning Solr and have become confused trying to figure out ICUCollation: what it does, what it is for, and how to use it. I haven't found any good explanation of this online. The docs appear to say that I need to use ICUCollation and imply that it does magical things for me, but they do not seem to explain exactly why or what, or how it integrates with anything else.
Say I have a text field in French and I want stopwords removed, accents, punctuation and case ignored and stemming... how does ICUCollation come into this? Do I set solr.ICUCollationField and locale='fr' and it will do everything else automatically? Or do I set solr.ICUCollationField and then tokenizer and filters on this in addition? Or do I not use solr.ICUCollationField at all because that's for something completely different? And if so, then what?

Collation is the organisation of written information into an order. ICUCollationField (the API documentation also provides a good description) is meant to let you provide locale-aware sorting, as the sort order is defined by cultural norms and specific language properties. This is useful to allow different sorting based on those rules, such as the difference between Norwegian and Swedish, where a Swede would order Å before Æ/Ä and Ø/Ö, while a Norwegian would order them Æ/Ä, Ø/Ö and then Å.
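The Swedish/Norwegian difference can be sketched in a few lines of Python. This is a hand-rolled stand-in that ranks letters by their position in each alphabet, not ICU collation itself; the collation_key helper and the word lists are invented for the example.

```python
SWEDISH = "abcdefghijklmnopqrstuvwxyzåäö"
NORWEGIAN = "abcdefghijklmnopqrstuvwxyzæøå"

def collation_key(alphabet):
    """Build a sort key that ranks letters by their position in alphabet."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    # Letters outside the alphabet sort last; a real collator has rules for them.
    return lambda word: [rank.get(ch, len(alphabet)) for ch in word.lower()]

words_sv = ["Öl", "Ära", "Ål", "Zebra"]
words_no = ["Øl", "Ære", "Ål", "Zebra"]

swedish_order = sorted(words_sv, key=collation_key(SWEDISH))
norwegian_order = sorted(words_no, key=collation_key(NORWEGIAN))
codepoint_order = sorted(words_sv)  # plain code point order, for comparison

print(swedish_order)    # ['Zebra', 'Ål', 'Ära', 'Öl']
print(norwegian_order)  # ['Zebra', 'Ære', 'Øl', 'Ål']
print(codepoint_order)
```

The same letter Å lands in a different place depending on the locale, and neither result matches plain code point order, which is the gap a collation field is meant to close.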
Since you usually don't want to sort by a tokenized field (exception: KeywordTokenizer) or a multivalued field, these fields are usually not processed any more than allowing for the sorting / collation to be performed.
There is a case to be made for collation filters for searching as well, as search in practice is just comparison. This means that if you're aiming to search for two words that would be identical when compared in the locale provided, it would be a hit. The tokens indexed will not make any sense when inspected, but as long as the values are reduced to the same token both when indexing and searching, it would work. There's an example of this on the wiki under UnicodeCollation.
Collation does not handle stopwords (StopFilterFactory), accents (ICUFoldingFilterFactory), punctuation, case (this depends on the locale: if the locale used for sorting is case aware, collation does not ignore case; use LowerCaseFilterFactory or ICUNormalizer2FilterFactory) or stemming (SnowballPorterFilterFactory). Have a look at the suggested filters for those. Most filters and tokenizers in Solr do very specific tasks, and avoid doing "everything and the kitchen sink" in one single filter.
You normally have two or more fields for one text input if you want to do different things like:
search: text analysis
sort: language sensitive / case insensitive sorting
facet: string
For search use something like:
<fieldType name="textFR" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
For sorting use:
<fieldType name="textSortFR" class="solr.ICUCollationField"
           locale="fr"
           strength="primary"/>
or simply:
<fieldType name="textSort" class="solr.ICUCollationField"
           locale=""
           strength="primary"/>
(Use this if you have to support many languages; it should work well enough in most cases.)
Do make use of the Analysis UI in the Solr Admin: open the analysis view for your index, select the field type (e.g. your sort field), add a representative input value in the left text area and a test value in the right field (for sorting, this right-side value is less interesting, as the sort field is not used for matching).
The output will show you whether:
accents are removed
elisions are removed
lower casing is applied
etc.
For example, if you see that elisions are not removed (l'atelier does not become atelier) but you would like to discard them for sorting, you would have to add the elision filter (see the example for the search field type above).
https://cwiki.apache.org/confluence/display/solr/Language+Analysis

Solr - removing special characters

A pretty basic question, but can anyone tell me how to remove special characters from documents while indexing in Solr? I went through the Solr wiki but couldn't find anything relevant. I saw a few tokenizers like WhitespaceTokenizerFactory and StandardTokenizerFactory. I am using WhitespaceTokenizerFactory in my schema.xml but it doesn't seem to serve the purpose. I am still able to query using "*" and "-" etc.

Consider using the standard tokenizer:
<tokenizer class="solr.StandardTokenizerFactory"/>
It should remove the characters you have mentioned.
Once the words have been tokenized, you may apply further processing, such as splitting on case changes and numerics, using the WordDelimiterFilterFactory for better matching.
Also, Solr's analysis page is very useful almost any time you deal with schema configuration. It gives you a lot of valuable feedback.
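The reason a whitespace-only tokenizer leaves "*" and "-" searchable can be shown with a rough Python approximation. Splitting on whitespace stands in for the whitespace tokenizer, and the regex \w+ is only a crude stand-in for the standard tokenizer, whose real rules are more detailed; the sample text is made up.

```python
import re

text = "state-of-the-art *search* engine"

# Whitespace splitting keeps punctuation attached to the tokens.
whitespace_tokens = text.split()

# Keeping only runs of word characters drops "*" and "-" entirely.
word_tokens = re.findall(r"\w+", text)

print(whitespace_tokens)  # ['state-of-the-art', '*search*', 'engine']
print(word_tokens)        # ['state', 'of', 'the', 'art', 'search', 'engine']
```

The analysis page mentioned above shows the real token stream for a given field type, so it is the place to confirm what the configured tokenizer actually emits.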

Solr british and american spelling

A search for 'globalization' only returns results for 'globalization' and doesn't include any results for 'globalisation', and vice versa.
I'm looking into the solr.HunspellStemFilterFactory filter (available in Solr 3.5).
<filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic,en_US.dic" affix="en_GB.aff,en_US.aff" ignoreCase="true" />
Before upgrading from Solr 3.4 to 3.6.1 I was wondering if Hunspell filter is the way to go?
Thanks

If stemming doesn't solve this for you, you could always use a SynonymFilterFactory to normalize both spellings into one. I guess a dictionary containing US/UK spelling variations wouldn't be hard to come by.
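The normalization idea can be sketched in Python as follows. The spelling table is a tiny hypothetical sample, and normalize is a made-up helper that plays the role a synonym filter would play when applied at both index and query time.

```python
# Hypothetical UK -> US spelling table; a real setup would load a fuller list.
UK_TO_US = {
    "globalisation": "globalization",
    "colour": "color",
    "organise": "organize",
}

def normalize(tokens):
    """Lowercase each token and replace it with its canonical spelling if listed."""
    return [UK_TO_US.get(t.lower(), t.lower()) for t in tokens]

indexed = normalize("The globalisation debate".split())
query = normalize(["Globalization"])

print(indexed)              # ['the', 'globalization', 'debate']
print(query[0] in indexed)  # True
```

Because both sides are reduced to the same spelling, a search for either variant finds documents containing the other.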

Solr Search Issue

We are storing a large number of tweets and blog feeds in Solr.
Now if the user searches for Twitter mentions like #rohit, records which just contain the word rohit are also being returned, even if we do an exact match "#rohit". I understand this happens because of the use of WordDelimiterFilterFactory, which splits on special characters:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
How can I force Solr not to return records without the "#"? I don't want to remove the WordDelimiterFilterFactory, since splitOnCaseChange and stemEnglishPossessive are helpful. Hope I am being clear.
Regards,
Rohit

If you set preserveOriginal="1", this problem should be fixed. If not, your tokenizer might be stripping the #, so you have to choose another one, such as solr.WhitespaceTokenizerFactory.
What I would do is create a new fieldType with preserveOriginal="1" in it. Then you can create a copyField into a field of the old fieldType. That way you will end up with two different versions of the field that can both be searched, because sometimes you will want to search without the '#' as well. Then, if somebody searches with special characters like the '#', have them search the preserved-original field; otherwise search the default field as normal.
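A small Python sketch of both ideas follows. It only imitates the behaviour described above and is not the filter's real algorithm; word_delimiter and pick_field are made-up helpers, and the field names text_original and text are hypothetical.

```python
import re

def word_delimiter(token, preserve_original=False):
    """Split token on non-alphanumeric characters, optionally keeping the original."""
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts

def pick_field(user_query, exact_field="text_original", default_field="text"):
    """Route queries containing special characters to the preserved-original field."""
    return exact_field if re.search(r"[^0-9A-Za-z\s]", user_query) else default_field

print(word_delimiter("#rohit"))                          # ['rohit']
print(word_delimiter("#rohit", preserve_original=True))  # ['#rohit', 'rohit']
print(pick_field("#rohit"))  # text_original
print(pick_field("rohit"))   # text
```

Without the original token, "#rohit" and "rohit" are indistinguishable in the index. Keeping it lets an exact query for "#rohit" match only documents that contained the hash.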
