With Solr, I am trying to highlight some text using the hl.formatter option with hl.simple.pre/post.
My problem is that the hl.simple.pre/post tags sometimes don't appear in the highlighting results, and I don't understand why.
For example, I call this URL:
http://localhost:8080/solr/Employees/select?q=lastName:anthan&fl=lastName&wt=json&indent=true&hl=true&hl.fl=lastName&hl.simple.pre=<em>&hl.simple.post=</em>
I get:
..."highlighting": {
"NB0094418": {
"lastName": [
"Yogan<em>anthan</em>" => OK
]
},
"NB0104046": {
"lastName": [
"Vijayakanthan" => KO, I want Vijayak<em>anthan</em>
]
},
"NB0144981": {
"lastName": [
"Parmananthan" => KO, I want Parman<em>anthan</em>
]
},...
Does anyone have an idea why I get this behavior?
My configuration:
schema.xml
<fieldType name="nameType" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="50" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
</analyzer>
</fieldType>
...
<fields>
<field name="lastName" type="nameType" indexed="true" stored="true" required="true" />
</fields>
solrconfig.xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
</requestHandler>
...
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<fragmenter name="regex" class="solr.highlight.RegexFragmenter">
<lst name="defaults">
<int name="hl.fragsize">70</int>
<float name="hl.regex.slop">0.5</float>
<str name="hl.regex.pattern">[-\w ,/\n\"']{20,200}</str>
</lst>
</fragmenter>
<formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
<encoder name="html" default="true" class="solr.highlight.HtmlEncoder" />
<fragListBuilder name="simple" default="true" class="solr.highlight.SimpleFragListBuilder" />
<fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder" />
<fragmentsBuilder name="default" default="true" class="solr.highlight.ScoreOrderFragmentsBuilder">
</fragmentsBuilder>
<fragmentsBuilder name="colored" class="solr.highlight.ScoreOrderFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[
<b style="background:yellow">,<b style="background:lawgreen">,
<b style="background:aquamarine">,<b style="background:magenta">,
<b style="background:palegreen">,<b style="background:coral">,
<b style="background:wheat">,<b style="background:khaki">,
<b style="background:lime">,<b style="background:deepskyblue">]]></str>
<str name="hl.tag.post"><![CDATA[</b>]]></str>
</lst>
</fragmentsBuilder>
</highlighting>
</searchComponent>
I was dealing with a very similar problem until yesterday. I tried many different solutions iteratively, so some details of what I ended up with may not be necessary, but I'll describe what I eventually got working. Short answer: I think the highlighter is failing to find the term position information it needs on longer fields.
Firstly, the symptoms I was seeing: sometimes the search term highlight would show up, and sometimes the entire field would show up in the highlighting section, but without the highlight information. The pattern ended up being based on both the length of the field and the length of the search term. I found that the longer the field (actually, the token that was ngrammed), the shorter the search term that could be highlighted successfully. It wasn't 1-to-1, though:
For a field with 11 or fewer characters, highlighting worked fine in all cases.
If the field had 12 characters, no ngram longer than 9 characters would be highlighted.
For a field with 15 characters, ngrams longer than 7 characters would not be highlighted.
For fields longer than 18 characters, ngrams longer than 6 characters would not be highlighted.
For fields longer than 21 characters, ngrams longer than 5 characters would not be highlighted.
For fields longer than 24 characters, no more than 4 characters would be highlighted.
(It looks like, from the examples you have above, that the specific sizes you are seeing are not exactly the same, but I do notice that the names in the documents where the highlighting did not work were longer than the one where it did.)
So, here's what ended up working:
I switched from using WhitespaceTokenizer and NGramFilterFactory to using NGramTokenizerFactory instead. (You are already using this; I'll say more later about a difficulty this raised for me.) This wasn't sufficient to solve the problem, though, because the term positions still weren't being stored.
I started using the FastVectorHighlighter. This forced some changes in how my schema fields were indexed (including storing the term vectors, positions, and offsets), and I also had to change my pre- and post-indicator tag configuration from hl.simple.pre to hl.tag.pre (and similarly for *post).
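In condensed form (the field and type names here are just placeholders; my full working configuration is under "End result" below), the schema change amounted to something like:
<field name="my_ngram_field" type="any_token_ngram" indexed="true" stored="true"
termVectors="true" termPositions="true" termOffsets="true"/>
plus hl.useFastVectorHighlighter=true and hl.tag.pre / hl.tag.post in the request parameters.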
Once I had made these changes, the highlighting started working consistently. This had the side effect, though, of removing the behavior I had been getting from the WhitespaceTokenizer: if I had a field that contained the phrase "this is a test", I ended up with ngrams that included "s is a", "a tes", etc., and I really just wanted the ngrams of the individual words, not of the whole phrase. There is a note in the NGramTokenizer JavaDocs that you can override NGramTokenizer.isTokenChar() to provide pre-tokenizing, but I couldn't find an example of this on the web, so I'll include one below.
End result:
WhitespaceSplittingNGramTokenizer.java:
package info.jwismar.solr.plugin;

import java.io.Reader;

import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.util.Version;

public class WhitespaceSplittingNGramTokenizer extends NGramTokenizer {

    public WhitespaceSplittingNGramTokenizer(Version version, Reader input, int minGram, int maxGram) {
        super(version, input, minGram, maxGram);
    }

    public WhitespaceSplittingNGramTokenizer(Version version, AttributeFactory factory, Reader input, int minGram,
            int maxGram) {
        super(version, factory, input, minGram, maxGram);
    }

    public WhitespaceSplittingNGramTokenizer(Version version, Reader input) {
        super(version, input);
    }

    @Override
    protected boolean isTokenChar(int chr) {
        // Treat whitespace as a token boundary, so ngrams are generated per word
        // rather than across the whole field value.
        return !Character.isWhitespace(chr);
    }
}
WhitespaceSplittingNGramTokenizerFactory.java:
package info.jwismar.solr.plugin;

import java.io.Reader;
import java.util.Map;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;

public class WhitespaceSplittingNGramTokenizerFactory extends TokenizerFactory {

    private final int maxGramSize;
    private final int minGramSize;

    /** Creates a new WhitespaceSplittingNGramTokenizerFactory. */
    public WhitespaceSplittingNGramTokenizerFactory(Map<String, String> args) {
        super(args);
        minGramSize = getInt(args, "minGramSize", NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE);
        maxGramSize = getInt(args, "maxGramSize", NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);
        if (!args.isEmpty()) {
            throw new IllegalArgumentException("Unknown parameters: " + args);
        }
    }

    @Override
    public Tokenizer create(AttributeFactory factory, Reader reader) {
        return new WhitespaceSplittingNGramTokenizer(luceneMatchVersion, factory, reader, minGramSize, maxGramSize);
    }
}
These need to be packaged up into a .jar and installed someplace where Solr can find it. One option is to add a lib directive in solrconfig.xml to tell Solr where to look. (I called mine solr-ngram-plugin.jar and installed it in /opt/solr-ngram-plugin/.)
Inside solrconfig.xml:
<lib path="/opt/solr-ngram-plugin/solr-ngram-plugin.jar" />
schema.xml (field type definition):
<fieldType name="any_token_ngram" class="solr.TextField">
<analyzer type="index">
<tokenizer class="info.jwismar.solr.plugin.WhitespaceSplittingNGramTokenizerFactory" maxGramSize="30" minGramSize="2"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="^(.{30})(.*)?" replacement="$1" replace="all" />
</analyzer>
</fieldType>
schema.xml (field definitions):
<fields>
<field name="property_address_full" type="string" indexed="false" stored="true" />
<field name="property_address_full_any_ngram" type="any_token_ngram" indexed="true"
stored="true" omitNorms="true" termVectors="true" termPositions="true"
termOffsets="true"/>
</fields>
<copyField source="property_address_full" dest="property_address_full_any_ngram" />
solrconfig.xml (request handler; you can pass these parameters on the normal select URL instead, if you prefer):
<!-- request handler to return typeahead suggestions -->
<requestHandler name="/suggest" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="defType">edismax</str>
<str name="rows">10</str>
<str name="mm">2</str>
<str name="fl">*,score</str>
<str name="qf">
property_address_full^100.0
property_address_full_any_ngram^10.0
</str>
<str name="sort">score desc</str>
<str name="hl">true</str>
<str name="hl.fl">property_address_full_any_ngram</str>
<str name="hl.tag.pre">|-></str>
<str name="hl.tag.post"><-|</str>
<str name="hl.fragsize">1000</str>
<str name="hl.mergeContinuous">true</str>
<str name="hl.useFastVectorHighlighter">true</str>
</lst>
</requestHandler>
If you are asking why the hl.tag.pre and hl.tag.post values defined in your configuration are not appearing in the results of the sample query you gave, and the <em> and </em> pre/post tags are showing instead...
This is because you are specifying the pre/post tag parameters in the query string (at request time), so they override the default settings you have defined for the highlight searchComponent in your solrconfig.xml file.
Either remove those query string parameters, or move hl.tag.pre and hl.tag.post into a <lst name="invariants"> in your configuration to force them to override any request-time parameters.
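For example, a sketch putting them on the standard request handler from your solrconfig.xml (the tag values here are only illustrative):
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
<!-- invariants always win over parameters passed on the request URL -->
<lst name="invariants">
<str name="hl.tag.pre"><![CDATA[<em>]]></str>
<str name="hl.tag.post"><![CDATA[</em>]]></str>
</lst>
</requestHandler>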
Here is an overview of the various Configuration Settings
Related
I am new to Solr and am trying to provide partial word matching with Solr 8.8.1, but partial words are giving no results. I have combed the blogs without luck trying to fix this.
For example, the text of the document contains the word longer. Index analysis gives lon, long, longe, longer. If I query longer using alltext_en:longer, I get a match. However, if I query (for example) longe using alltext_en:longe, I get no match. explainOther returns 0.0 = No matching clauses.
It seems that I am missing something obvious, since this is not a complex phrase query.
Apologies in advance if I have missed any needed details - I will update the question if you tell me what else you need to know.
Here are the relevant field specs from my managed-schema:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<field name="alltext_en" type="text_en" multiValued="true" indexed="true" stored="true"/>
<copyField source="*_txt_en" dest="alltext_en"/>
Here is the relevant part of solrconfig.xml:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- Query settings -->
<str name="defType">edismax</str>
<str name="q">*:*</str>
<str name="q.alt">*:*</str>
<str name="rows">50</str>
<str name="fl">*,score,[explain]</str>
<str name="ps">10</str>
<!-- Highlighting defaults -->
<str name="hl">on</str>
<str name="hl.fl">_text_</str>
<str name="hl.preserveMulti">true</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre"><span class="artica-snippet"></str>
<str name="hl.simple.post"></span></str>
<!-- Spell checking defaults -->
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.alternativeTermCount">2</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.maxCollations">3</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
That stemming filter will modify the tokens in ways you don't predict - and since it only runs on the token you try to match against the ngrammed tokens when querying, the token might not be what you expect. If you're generating ngrams, stemming filters should usually be removed. I'd also remove the possessive filter. (Also, a small note - try to avoid using * when formatting text, since it's hard to tell whether you used it in the query or the formatting is an error; instead use a backtick to indicate that the text is a code keyword/query.) – MatsLindh
That answered it - I removed the stemmer from the index step and everything was fine. Brilliant, thank you, @MatsLindh!
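For anyone landing here later, a sketch of what the adjusted index analyzer from the question might look like with the stemmer removed (and, per the comment, the possessive filter dropped as well); the rest of the chain is unchanged:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
</analyzer>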
Background
I have a Solr spellchecker configured like the following in schema.xml:
<fieldType name="spell_field" class="solr.TextField">
<analyzer type="index">
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords.txt" />
<filter class="solr.LengthFilterFactory" min="3" max="255" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" />
<tokenizer class="solr.WhitespaceTokenizerFactory" />
</analyzer>
<analyzer type="query">
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords.txt" />
<filter class="solr.LengthFilterFactory" min="3" max="255" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" />
<tokenizer class="solr.WhitespaceTokenizerFactory" />
</analyzer>
</fieldType>
which is used for:
<field name="spellcheck" type="spell_field" indexed="true" stored="false" multiValued="true" />
and like the following in solrconfig.xml:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">dflt</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.count">10</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.maxCollations">3</str>
<str name="spellcheck.collateMaxCollectDocs">1</str>
<str name="spellcheck.maxCollationTries">2</str>
</lst>
<arr name="last-components">
<str>suggest</str>
</arr>
</requestHandler>
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<str name="queryAnalyzerFieldType">spellcheck</str>
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="field">spellcheck</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<int name="minPrefix">1</int>
<int name="minQueryLength">3</int>
<int name="maxEdits">2</int>
<int name="maxInspections">3</int>
<int name="minQueryLength">3</int>
<float name="maxQueryFrequency">0.01</float>
<float name="thresholdTokenFrequency">.00001</float>
<float name="accuracy">0.5</float>
</lst>
</searchComponent>
The problem
Solr will sometimes return spelling suggestions with special characters in them as the first suggestion. This is a problem because my application uses the first suggestion to rebuild the query.
For example, if I search on "VOLTAGER", the first spelling suggestion Solr produces is "voltage:", so the rebuilt query looks like myField:voltage:. Then, after the query is sent, Solr's logger displays the following warning: SpellCheckCollator: Exception trying to re-query to check if a spell check possibility would return any hits.
The underlying Exception is a parse error because myField:voltage: is not a valid query.
"VOLTAGER" also returns a plain "voltage", but further down the suggestion list, and my requirements state I must grab the first spelling correction from the list.
Ideally, in the above example, "VOLTAGER" would only return "voltage".
What I've Tried
I tried adding the following line to the index and query analyzer in the spell_field field type:
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9])" replacement=""/>
This did remove all special characters from the spellchecker, but it had the nasty side effect of also sharply reducing the amount of results returned from the spellchecker. For example, "VOLTAGER" no longer returns anything. Neither does "circut", which normally returns "circuit".
Currently, I have the following line in the Java application that connects to Solr:
correctedTerms = correctedTerms.replaceAll("[^A-Za-z0-9]", "");
It works by making sure whatever is returned has no special characters, but I would much rather configure Solr's spellchecker to stop returning corrections with special characters in the first place.
In summary
I'm trying to get Solr's spellchecker to stop returning special characters in its suggestions. Basically I just want letters returned. How do I achieve what I want?
In my original question, I was apparently a bit confused about who was causing what errors and where. The ultimate problem was that Solr was automatically testing collations with terms that had illegal characters appended to them (usually the : character). The special characters weren't coming from collation, however; they were simply returned by the spellchecker, and even if I removed all special characters from my analyzed fields, the spellchecker would continue to return some suggestions with the : character appended.
The way I solved this problem was to just remove the collator itself. So now my spellcheck config looks like this:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">dflt</str>
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.count">10</str>
</lst>
<arr name="last-components">
<str>suggest</str>
</arr>
</requestHandler>
and I still have the following in my code when retrieving suggestions from the Suggestion Map:
correctedTerms = correctedTerms.replaceAll("[^A-Za-z0-9]", "");
Annoying, but at least now Solr isn't throwing a bunch of exceptions every time the collator fails, and my code provides a safety net to make sure nothing illegal makes it down to Solr.
The downside is I now have to do collations myself and, unlike Solr, I can't really guarantee any one collation will produce results. That said, my requirements aren't very heavy duty for the spellchecker, so while this behavior is undesirable, it's not unacceptable.
If anybody has had this problem and solved it without removing the collator, I would be very interested to hear about it.
I am trying to test the spellchecking functionality of Solr 4.7.2 using solr.DirectSolrSpellChecker (where you don't need to build a dedicated index).
I have a field named "title" in my index; I used a copyField definition to create a field named "title_spell" to be queried for the spellcheck (title_spell is correctly filled). However, in the Solr admin console, I always get empty suggestions.
For example: I have a Solr document with the title "A B automobile"; in the admin console I check spellcheck and enter "atuomobile" in the spellcheck.q input field. I expect to get at least something like "A B automobile" or "automobile", but the spellcheck suggestions remain empty...
My configuration:
schema.xml (only relevant part copied):
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="de_DE/synonyms.txt" ignoreCase="true"
expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="title_spell" type="textSpell" indexed="true" stored="true" multiValued="false"/>
solrconfig.xml (only relevant part copied):
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">title_spell</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<str name="distanceMeasure">internal</str>
<float name="accuracy">0.5</float>
<int name="maxEdits">2</int>
<int name="minPrefix">1</int>
<int name="maxInspections">5</int>
<int name="minQueryLength">4</int>
<float name="maxQueryFrequency">0.01</float>
<float name="thresholdTokenFrequency">.01</float>
</lst>
</searchComponent>
...
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
</lst>
<!-- Attempt to also include the online date in the weighting... -->
<lst name="appends">
<str name="bf">recip(ms(NOW/MONTH,sort_date___d_i_s),3.16e-11,50,1)</str>
<!--<str name="qf">title___td_i_s_gcopy^1e-11</str>-->
<str name="qf">title___td_i_s_gcopy^21</str>
<str name="q.op">AND</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
What did I miss? Thanks for your answers!
How large is your index? For a small index (think less than a few million docs), you're going to have to tune accuracy, maxQueryFrequency, and thresholdTokenFrequency. (Actually, it would probably be worth doing this on larger indices as well.)
For example, my 1.5 million doc index uses the following for these settings:
<float name="maxQueryFrequency">0.01</float>
<float name="thresholdTokenFrequency">.00001</float>
<float name="accuracy">0.5</float>
accuracy tells Solr how accurate a result needs to be before it's considered worth returning as a suggestion.
maxQueryFrequency tells Solr the maximum fraction of documents a query term may appear in before Solr treats it as correctly spelled and stops offering suggestions for it.
thresholdTokenFrequency tells Solr what percentage of documents a term must appear in before it is added to the dictionary and considered worth returning as a suggestion. For example, with thresholdTokenFrequency set to .00001 on a 1.5 million document index, a term has to appear in roughly 15 documents to make it into the dictionary.
If you plan to use spellchecking on phrases of more than one word, you may need to add a ShingleFilter to your title_spell field.
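For instance, a sketch of a title_spell index analyzer with a ShingleFilter added so that multi-word phrases make it into the dictionary (the shingle sizes here are illustrative):
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
</analyzer>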
Another thing you might try is setting your queryAnalyzerFieldType to title_spell.
Can you please try editing your requestHandler declaration:
<requestHandler name="/standard" class="solr.SearchHandler" default="true">
and the query URL as:
http://localhost:8080/solr/service/standard?q=<term>&qf=title_spell
First experiment with short terms and learn how it behaves. One problem here is that it will only return terms starting with the same prefix as the query term. You can use FuzzyLookupFactory, which will do fuzzy matching and return fuzzy results. For more information, check the Solr Suggester wiki.
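A minimal sketch of a Suggester using FuzzyLookupFactory, assuming the title_spell field and textSpell type from your schema (the other names and values are illustrative):
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">fuzzy</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title_spell</str>
<str name="suggestAnalyzerFieldType">textSpell</str>
<str name="buildOnStartup">true</str>
</lst>
</searchComponent>
Reference it from a request handler with <str name="suggest.dictionary">fuzzy</str> and query it with suggest=true&suggest.q=<term>.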
Hello everyone, I am using Solr 4.10 and I am not getting the results I expect. I want to get autocomplete suggestions using multiple fields, namely discountCatName, discountSubName, and vendorName. I have created a multi-valued field "suggestions" using copyField and am using that field for searching in the suggester configuration.
Note: discountSubName and discountCatName are themselves multi-valued fields; vendorName is a string.
This is a suggestion field data from one of my document:
"suggestions": [
"Budget Car Rental",
"Car Rentals",
"Business Deals",
"Auto",
"Travel",
"Car Rentals" ]
If I type "car", I get "Budget Car Rental" in my suggestions but not "Car Rentals". Below are my configurations; let me know if I need to change the tokenizer and filters. Any help with this would be appreciated.
Below are the code blocks for the scenario explained above:
the suggestion field, fieldType, searchComponent, and request handler, respectively, that I am using for autocomplete suggestions
<!--suggestion field -->
<field name="suggestions" type="suggestType" indexed="true" stored="true" multiValued="true"/>
<copyField source="discountCatName" dest="suggestions"/>
<copyField source="discountSubName" dest="suggestions"/>
<copyField source="vendorName" dest="suggestions"/>
<!--suggest fieldType -->
<fieldType name="suggestType" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9]" replacement=" " />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
<!--suggest searchComponent configuration -->
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">analyzing</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="suggestAnalyzerFieldType">suggestType</str>
<str name="blenderType">linear</str>
<str name="minPrefixChars">1</str>
<str name="doHighlight">false</str>
<str name="weightField">score</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">suggestions</str>
<str name="buildOnStartup">true</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<!--suggest request handler -->
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">analyzing</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
By debugging the Solr 4.10 source code, I just discovered that there is a bug in the DocumentDictionaryFactory lookup: for a multi-valued field it only looks at the first string and then stops suggesting from that document, hence I am not getting the expected output from my configuration above.
I created a separate indexed field for each of the fields I want to search, like catName0...catName10 and subName0...subName10, then created a suggestion dictionary for each field, and finally parsed the responses from all the suggestion dictionaries, merged them, and sorted them based on weight and highlight position.
A lengthy approach, but there was no other way since Solr 4.10 was required.
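In outline, the configuration looked something like this sketch, with one dictionary per field (the field and dictionary names are placeholders; the lookup and analyzer settings mirror the suggester above):
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">catName0</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">catName0</str>
<str name="suggestAnalyzerFieldType">suggestType</str>
<str name="blenderType">linear</str>
<str name="minPrefixChars">1</str>
<str name="buildOnCommit">true</str>
</lst>
<!-- ...repeat a <lst name="suggester"> block for catName1..catName10, subName0..subName10 and vendorName... -->
</searchComponent>
The client then issues a suggest request per dictionary (or passes several suggest.dictionary parameters), merges the responses, and sorts them by weight and highlight position.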
I've followed the Solr wiki article for the Suggester almost to a T (http://wiki.apache.org/solr/Suggester). I have the following XML in my solrconfig.xml:
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">description</str>
<float name="threshold">0.05</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
However, when I run the following query (or something similar):
../suggest/?q=barbequ
I only get the following result xml back:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">78</int>
</lst>
<lst name="spellcheck">
<lst name="suggestions"/>
</lst>
</response>
As you can see, this isn't very helpful. Any suggestions to help resolve this?
A couple of things I can think of that might cause this problem:
The source field ("description") is incorrect - ensure that this is indeed the field that seeds terms for your spell checker. It could even be that the field is in a different case (e.g. "Description" instead of "description").
The source field in your schema.xml is not set up correctly or is being processed by filters that cause the source dictionary to be invalid. I use a separate field to seed the dictionary, and use <copyField /> to copy relevant other fields to it.
The term "barbeque" doesn't appear in at least 5% of records (you've indicated this requirement by including <float name="threshold">0.05</float>) and is therefore not included in the lookup dictionary.
In SpellCheckComponent, the <str name="spellcheck.onlyMorePopular">true</str> setting means that only terms that would produce more results are returned as suggestions. According to the Suggester documentation this has a different function (sorting suggestions by weight), but it might be worth switching it to false to see if it is causing the issue.
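If you want to experiment with the last two points, a sketch of the adjusted settings (the values here are only illustrative):
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">description</str>
<!-- lower the threshold so rarer terms make it into the dictionary -->
<float name="threshold">0.005</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
and, in the /suggest handler defaults:
<str name="spellcheck.onlyMorePopular">false</str>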
Relevant parts of my schema.xml:
<schema>
<types>
<!-- Field type specifically for spell checking -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StandardFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StandardFilterFactory" />
</analyzer>
</fieldType>
</types>
<fields>
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true" />
</fields>
<!-- Copy fields which are used to seed the spell checker -->
<copyField source="name" dest="spell" />
<copyField source="description" dest="spell" />
</schema>
Could the problem be that you're querying /suggest instead of /spell?
../suggest/?q=barbequ
In my setup, this is the string I pass in:
/solr/spell?q=barbequ&spellcheck=true&spellcheck.collate=true
And the first time you do a spellcheck you need to include
&spellcheck.build=true
I'm running on Solr 4, by the way. So perhaps /suggest is an entirely different endpoint that does something else. If so, I apologize.
Please check whether the term parameters are set in the schema.xml, like:
<field name="TEXT" type="text_en" indexed="true" stored="true" multiValued="true"
termVectors="true"
termPositions="true"
termOffsets="true"/>
...then restart Solr and reindex.