Fuzzy search with distance 1 does not work for other languages in Solr

I have documents with fields name_en, name_de, name_fr etc., and the words cutter in English and mutter in German. If I fuzzy-search with name_en:cuter~1 (with only one t) it works fine, but if I search for name_de:muter~1 it just does not return any result.
However, it works with fuzzy distance 2: name_de:muter~2 works correctly and returns mutter. The languages have different analyzers in schema.xml, so that is likely where the difference lies. But it is still not clear why distance 1 does not work for German.
Here is the config for German:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ManagedStopFilterFactory" managed="de" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ShingleFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.GermanStemFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
Could someone explain why the distance is 2 and not 1? As far as I can see, the distance between mutter and muter is 1, not 2.

This happens because mutter is stemmed by the German stemmer and gets indexed as mutt, whereas cutter appears to be left untouched by most English stemmers (tested with the Porter and Snowball/Porter2 algorithms, known to be among the most aggressive):
The edit distance for cuter to match cutter is 1.
The edit distance for muter to match mutt is 2.
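These distances can be verified with a quick sketch of plain Levenshtein distance (the class and method names here are just for illustration; Lucene's fuzzy matching actually uses a Damerau-Levenshtein automaton, which also counts transpositions, but the distances agree for these pairs):

```java
public class FuzzyDistance {
    // Classic dynamic-programming Levenshtein distance:
    // minimum number of single-character insertions, deletions,
    // and substitutions needed to turn a into b.
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + cost);    // substitution
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("cuter", "cutter")); // 1: matches the English index term
        System.out.println(levenshtein("muter", "mutter")); // 1: the distance the question expects
        System.out.println(levenshtein("muter", "mutt"));   // 2: the distance against the stemmed term
    }
}
```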
In order to make the fuzzy search work as expected, you need to preserve the original (unstemmed) tokens in the analysis chain so that they get indexed too and thus can be matched properly by the distance algorithm at query time.
A simple solution is to use the KeywordRepeatFilterFactory, placed before the stemmer, so that the unstemmed tokens are preserved and indexed at the same position as the stemmed one. Otherwise you would have to use a specific field type.
You might also have the same kind of issues with wildcard queries, for the same reason, and the solutions would be the same.
N.B. I noticed you are using a shingle filter. It's important to place the keyword repeater after the shingle filter, so that repeated unigrams can be stemmed and repeated shingles removed by the duplicate filter; otherwise, shingles would be made of repeated keywords.
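Applied to the analyzer from the question, the chain might look like this (a sketch, not tested against your schema):

```xml
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ManagedStopFilterFactory" managed="de" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ShingleFilterFactory"/>
<!-- repeat each token so the unstemmed original gets indexed too -->
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.GermanStemFilterFactory" />
<!-- drops the duplicate when stemming did not change the token -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
```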

Related

Apache Solr Query Building

I am a newbie to Apache Solr. I am trying to figure out the tokenizer, filter, and query parameters for the following query, but haven't been able to figure out if it's possible yet (still reading through all the documentation):
I have two fields - title and description. We want to do a search where:
1. Matches from title have more relevance than from description.
2. Complete word matches take precedence over all others (for query kit, kit takes precedence over kitchen).
3. An index entry that begins with the query field takes preference over one that just contains the field (for query goo, good takes precedence over Magoo).
Is this even possible? If so, how do I do this?
Weighting between fields isn't an issue that tokenizers or filters are concerned with - their job is to take some input text, split it into tokens (tokenizers) and then run it through a sequence of processing steps (filters).
The edismax and dismax query parsers have a parameter named qf that lets you give a list of fields to query, with a separate weight for each one, allowing you to tune exactly how much weight each field gets. qf=title^5 description would weigh a hit in the title field five times higher than a hit in description, all else being equal (but the hits usually aren't identical, since you're not indexing the same content into both fields).
That's also why scoring isn't an exact science: if you want a particular relevancy profile (i.e. different words hit giving different scores), you'll have to tweak these weights to fit the ranking you're looking for. Appending debugQuery=true to a query is very helpful when you're tweaking scoring, since it shows you exactly how much each term contributes to the final score for a document.
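As an example (the host and the products collection name are hypothetical), such a request could look like:

```
http://localhost:8983/solr/products/select?defType=edismax&q=blue+dress&qf=title%5E5+description&debugQuery=true
```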
Your first criterion, title vs. description, is solved by having a TextField with a StandardTokenizer and a lowercasing filter (and, depending on what you're looking for, optionally stemming, synonyms, etc.).
You'll also (probably) want a lowercase filter in the examples given below, but I've omitted it to keep them compact.
Your second case is solved by having a second field type that has an EdgeNGramFilter, and then having two new fields - title_edge and description_edge that uses this field type.
Both this and the NGramFilter example below use the type="index" attribute, since it usually only makes sense to expand ngrams when indexing. Otherwise, any two words starting with (or, for the NGramFilter, containing) identical letters would match.
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="40" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
The third criterion is solved by having a third set of fields, title_ngram and description_ngram, with an NGramFilter in their analysis chain:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
Be aware that an NGramFilter will generate a lot of tokens, requiring more storage and making searches process more tokens when generating a match. This may or may not be relevant for your use case.
That being said, matching inner terms in words, especially very short strings, has its downsides. It might give results where the user isn't able to understand why the document matched, as the match might be tiny (a single letter while typing a query) somewhere in the text. Someone searching for just "c" to find something about the programming language will get every hit containing a word with a c in it (but if you've boosted your fields properly, the exact hit should luckily be at the top).

Solr search with wrong spelling

I have integrated Solr with my eCommerce web application. I am indexing the product title and many other product fields into Solr. I have indexed BLÅBÆRSOMMEREN as a product title/name, and I have added an EdgeNGram filter for the title field as well. Because of the EdgeNGram filter, I get results when I search for any of the tokens, and because of spell check, I also get results when I search for a misspelling like BLÅBÆRISOMMEREN. But if I search for BLÅBÆRI, I do not get any result, as there is no such token.
I want the results to include products that match BLÅBÆR, because that token does exist. The same goes for any other misspelled search.
How can I achieve this? Any help will be appreciated!
Thanks.
It sounds like you may have Solr's tokenization configured differently for indexing and querying.
So, in your example the following terms may appear in the index:
B
BL
BLÅ
BLÅB
BLÅBÆ
BLÅBÆR
BLÅBÆRS
However as your query terms are not being processed into ngrams, you are only searching for
BLÅBÆRI
which does not appear within your indexed terms.
This is a common practice when using ngrams, however it sounds like in your use-case you want to return partial matches within your results.
Check your Solr schema to make sure that you have a matching EdgeNGram filter configured for query-time as you do for index-time, e.g.
<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
</fieldType>
Make sure you're sorting by score though, as this strategy will most likely give you many false-positives!
For misspelled words you can use a fuzzy query (allowing matches on index terms with an edit distance of ~1 or ~2 from the query term).
Using your example, BLÅBÆRISOMMEREN is edit distance 1 (one character difference) from your indexed term.
Therefore the query q=title:BLÅBÆRISOMMEREN~1 will match your title term, but BLÅBÆRI will not (without the ngram approach from the previous answer).
You can also investigate Solr's Suggester component if you're trying to build auto-suggest, as it can also handle fuzzy suggestions (BLÅBÆRI -> BLÅBÆRSOMMEREN) and typically responds faster than a traditional query.
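As a starting point (the suggester name, field, and field type here are placeholders you would adapt to your schema), a fuzzy suggester in solrconfig.xml might look like:

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<!-- FuzzyLookupFactory tolerates misspellings in the typed prefix -->
<str name="name">titleSuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title</str>
<str name="suggestAnalyzerFieldType">text_general</str>
</lst>
</searchComponent>
```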

Solr Queries: Single Terms versus Phrases

In our search based on Solr, we have started by using phrases.
For example, when the user types
blue dress
then the Solr query will be
title:"blue dress" OR description:"blue dress"
We now want to remove stop words. Using the default StopFilterFactory, the query
the blue dress
will match documents containing "blue dress" or "the blue dress".
However, when typing
blue the dress
then it does not match documents containing "blue dress".
I am starting to wonder if we shouldn't instead only search using single terms. That is, convert the above user search into
title:the OR title:blue OR title:dress OR description:the OR description:blue OR description:dress
I am a bit reluctant to do this, though, as it seems to duplicate the work of the StandardTokenizerFactory.
Here is my schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
</fieldType>
The title and the description fields are both of type text_general.
Is searching on single terms the standard way of searching in Solr? Are we exposing ourselves to problems by tokenising the words before calling Solr (performance issues, maybe)?
Maybe thinking in term of single terms vs. phrases is just wrong and we should leave it to the user to decide?
What you are stumbling over is the fact that the stop word filter prevents stopwords from being indexed, but their positions are indexed nevertheless. Something like a placeholder is stored in the index where the stopword occurs.
So when you put this into your index
the blue dress
it will be indexed as
* blue dress
The same happens when you hand in the phrase
"blue the dress"
as a query. It will be treated as
"blue * dress"
Now Solr compares these two fragments, and they do not match, as the * is at the wrong position.
Prior to Solr 4.4 this used to be tackled by setting enablePositionIncrements="true" on the StopFilterFactory, as described by Pascal Dimassimo. Apparently a refactoring broke that option on the StopFilterFactory, as discussed on SO and in Solr's Jira.
Update
When reading through the reference documentation of the Extended Dis Max Query Parser I found this
The stopwords Parameter
A Boolean parameter indicating if the StopFilterFactory configured in the query analyzer should be respected when parsing the query: if it is false, then the StopFilterFactory in the query analyzer is ignored.
I will check if this helps with the problem.
Although the initial approach might work if the query were split into multiple title:term statements, this is prone to errors (the tokens might be split in the wrong places) and also duplicates, probably badly, the work done by the built-in tokenizer.
The right approach is to maintain the initial query as-is and rely on the Solr configuration to handle it properly. This makes sense, but the difficulty was that I wanted to specify the fields in which I wanted to search. And it turns out that there is no way to do that using the default query parser, which is the one known as LuceneQParserPlugin (confusingly, there is a parameter called fl, for Field List, which is used for specifying the returned fields, not the fields to search in).
To be complete, it must be mentioned that it is possible to simulate the list of fields to search in by using the copyField configuration in schema.xml. I do not find this very elegant or flexible enough.
The elegant solution is to use the ExtendedDisMax query parser, aka edismax. With it, we can maintain the query as is, and fully leverage the configuration in the schema. In our case, it looks like this:
SolrQuery solrQuery = new SolrQuery();
solrQuery.set("defType", "edismax");
solrQuery.set("q", query); // ie. "blue the dress"
solrQuery.set("qf", "description title");
According to this page:
(e)Dismax generally makes the best first choice query parser for user facing Solr applications
It would have helped if this had indeed been the default choice.

Transitioning from a stemmed to an unstemmed field in Solr

I am working with Solr (3.x) and need to transition a field from a stemmed to an unstemmed version.
Is there a stemming filter that will index both the exact text and the stemmed text (so I can match on both in the near term), or am I forced to copy to a new field and then transition to that new field?
from http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
A repeated question is "how can I have the original term contribute more to the score than the stemmed version"? In Solr 4.3, the KeywordRepeatFilterFactory has been added to assist this functionality. This filter emits two tokens for each input token, one of them is marked with the Keyword attribute. Stemmers that respect keyword attributes will pass through the token so marked without change. So the effect of this filter would be to index both the original word and the stemmed version. The 4 stemmers listed above all respect the keyword attribute.
For terms that are not changed by stemming, this will result in duplicate, identical tokens in the document. This can be alleviated by adding the RemoveDuplicatesTokenFilterFactory.
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
This will match both your exact term and the stemmed version. For an exact term, though, the score will be higher, since both the unstemmed and stemmed versions match and their scores are added.
We used this before, but then moved on to creating two fields (exactly as in Arun's comment), stemmed and unstemmed, searching in both simultaneously and applying boosts as we need them. This gives us more control over what we are doing.
It's just another option; see what suits you.
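The two-field setup described above could be sketched like this in schema.xml (field and type names are illustrative, and the field types would need to be defined with and without a stemmer); the unstemmed field can then be boosted at query time, e.g. qf=title_exact^5 title_stemmed:

```xml
<!-- one stemmed and one unstemmed field, both fed from the same source field -->
<field name="title_stemmed" type="text_stemmed" indexed="true" stored="false"/>
<field name="title_exact" type="text_unstemmed" indexed="true" stored="false"/>
<copyField source="title" dest="title_stemmed"/>
<copyField source="title" dest="title_exact"/>
```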

Solr query/field analyzer

I am a total beginner with Solr and have a problem with unwanted characters getting into query results. For example, when I search for "foo bar" I get content with "'foo' bar" etc. I just want exact matches. As far as I know this can be set up in the schema.xml file.
My content field type:
<fieldtype name="textNoStem" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<filter class="solr.LowerCaseFilterFactory" />
<tokenizer class="solr.KeywordTokenizerFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldtype>
Please let me know if you know the solution.
Kind Regards.
For both analyzers, the first line should be the tokenizer. The tokenizer is used to split the text into smaller units (words, most of the time). For your need, the WhitespaceTokenizerFactory is probably the right choice.
If you want absolutely exact matches, you do not need any filter after the tokenizer. But if you do not want searches to be case sensitive, you need to add a LowerCaseFilterFactory.
Notice that you have two analyzers: one of type 'index' and the other of type 'query'. As the names imply, the first one is used when indexing content, while the other is used when you run queries. A rule that is almost always good to follow is to have the same set of tokenizers/filters for both analyzers.
If you just want exact matches use the KeywordTokenizerFactory instead of the StandardTokenizerFactory at query time.
I guess you don't get any results because the tokenizing was done differently on the data that is already indexed.
As Pascal said, the WhitespaceTokenizer is the right choice in your case. Use it at both index and query time, and check the results after indexing some data, not on the previously indexed data.
I suggest using the Analysis page to see the results without actually indexing; it's quite useful. Make changes in the schema, refresh the core, go to the Analysis page, and look at the verbose output to get a step-by-step analysis.