Solr highlighting - terms with umlaut not found/not highlighted - solr

I am playing with version 7.2 of Solr. I've uploaded a nice collection of texts in the German language and am trying to query and highlight them.
If I fire this query with highlighting:
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.snippets=3&wt=xml&rows=1
I get a nice text back:
<response>
<lst name="responseHeader">
<bool name="zkConnected">true</bool>
<int name="status">0</int>
<int name="QTime">10</int>
<lst name="params">
<str name="hl.snippets">3</str>
<str name="q">trans:Zeit</str>
<str name="hl">true</str>
<str name="hl.q">Kundigung</str>
<str name="hl.fl">trans</str>
<str name="rows">1</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="418" start="0" maxScore="1.6969817">
<doc>
<str name="id">x</str>
<str name="trans">... Zeit ...</str>
<date name="t">2018-03-01T14:32:29.400Z</date>
<int name="l">2305</int>
<long name="_version_">1594374122229465088</long>
</doc>
</result>
<lst name="highlighting">
<lst name="x">
<arr name="trans">
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
</arr>
</lst>
</lst>
</response>
However, if I supply Kündigung as the highlight text, I get no answers, as the text/query parser replaced all the ü characters with u.
I have a feeling that I need to supply the correct qparser. How should I specify it? It seems to me that the collection was built and queried with the default LuceneQParser. How can I supply this parser in the URL above?
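One thing worth ruling out first, independent of the query parser, is whether the ü even reaches Solr intact: it must be percent-encoded as UTF-8 in the URL. A small sketch using Python's standard library (the parameter values are the ones from the question):

```python
from urllib.parse import quote, urlencode

# "ü" must arrive at Solr as the UTF-8 percent-encoding %C3%BC.
term = "Kündigung"
print(quote(term))  # K%C3%BCndigung

# urlencode() builds the whole query string with correct escaping:
params = urlencode({
    "q": "trans:Zeit",
    "hl": "true",
    "hl.fl": "trans",
    "hl.q": term,
    "hl.snippets": "3",
    "wt": "xml",
    "rows": "1",
})
print(params)
```

If the client sends the umlaut in a legacy encoding such as cp1252 (%FC) instead, Solr will not see the character you typed.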
UPDATE:
http://localhost:8983/solr/trans/schema/fields/trans returns
{
"responseHeader":{
"status":0,
"QTime":0},
"field":{
"name":"trans",
"type":"text_de",
"indexed":true,
"stored":true}}
Update 2: I've looked at the managed-schema of my Solr collection's configuration and found the following:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
The way I interpret this information is that since separate query and index sections are omitted, the analyzer above applies to both query and index time. Which... does not show any misconfiguration issue similar to the one in answer 2 below...
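Whether or not the chain is misconfigured, the GermanNormalizationFilterFactory in it folds umlauts at analysis time. A rough Python approximation (the real filter handles more cases, e.g. "ae"/"oe"/"ue" digraphs; this is only a sketch) shows why the indexed tokens contain no ü, so both spellings should map to the same token:

```python
# Simplified stand-in for Solr's GermanNormalizationFilterFactory; the real
# filter handles additional cases. LowerCaseFilterFactory runs first in the
# chain, hence the lower() call.
def normalize_de(token: str) -> str:
    table = str.maketrans({"ä": "a", "ö": "o", "ü": "u", "ß": "ss"})
    return token.lower().translate(table)

print(normalize_de("Kündigung"))  # kundigung
print(normalize_de("Kundigung"))  # kundigung
```

Both spellings should therefore be findable; the question is why the umlauted form is not analyzed the same way at highlight time.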
I remembered, though, adding the field trans with type text_de:
curl -X POST -H 'Content-type:application/json' --data-binary '{
"add-field":{
"name":"trans",
"type":"text_de",
"stored":true,
"indexed":true}
}' http://localhost:8983/solr/trans/schema
I've deleted all the documents using
curl http://localhost:8983/solr/trans/update?commit=true -d "<delete><query>*:*</query></delete>"
and then reinserting them again:
curl -X POST http://localhost:8983/solr/trans/update?commit=true -H "Content-Type: application/json" -d @all.json
Is this the correct way to "rebuild" the index in Solr?
UPDATE 3: The charset settings of the standard Java installation were not set to UTF-8:
C:\tmp>java -classpath . Hello
Cp1252
Cp1252
windows-1252
C:\tmp>cat Hello.java
public class Hello {
public static void main(String args[]) throws Exception{
// not cross-platform safe
System.out.println(System.getProperty("file.encoding"));
// jdk1.4
System.out.println(
new java.io.OutputStreamWriter(
new java.io.ByteArrayOutputStream()).getEncoding()
);
// jdk1.5
System.out.println(java.nio.charset.Charset.defaultCharset().name());
}
}
UPDATE 4: Restarted the solr with UTF8 settings:
bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 8983 -s example/cloud/node1/solr
bin\solr.cmd start -Dfile.encoding=UTF8 -c -p 7574 -s example/cloud/node2/solr -z localhost:9983
Checked the JVM settings:
http://localhost:8983/solr/#/~java-properties
file.encoding UTF8
file.encoding.pkg sun.io
reinserted the docs. No change: http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=Kundigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml gives:
<lst name="highlighting">
<lst name="32e42caa-313d-45ed-8095-52f2dd6861a1">
<arr name="trans">
<str> ... <em>Kündigung</em> ...</str>
<str> ... <em>Kündigung</em> ...</str>
</arr>
</lst>
</lst>
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=K%C3%BCndigung&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml gives:
<lst name="highlighting">
<lst name="32e42caa-313d-45ed-8095-52f2dd6861a1"/>
</lst>
uchardet all.json (file -bi all.json) reports UTF-8
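The same sanity check can be scripted: if the bulk file decodes cleanly as UTF-8 and still contains the umlaut, the data on disk is fine and the problem lies elsewhere. A sketch (file names here are illustrative; the question's file is all.json):

```python
from pathlib import Path

def is_utf8(path: str) -> bool:
    """Return True if the file's bytes decode cleanly as UTF-8."""
    try:
        Path(path).read_bytes().decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

# Demo with throwaway files:
Path("ok.json").write_bytes('[{"trans": "... Kündigung ..."}]'.encode("utf-8"))
Path("bad.json").write_bytes("Kündigung".encode("cp1252"))  # legacy encoding
print(is_utf8("ok.json"))   # True
print(is_utf8("bad.json"))  # False
```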
Running from the Ubuntu subsystem under Windows:
$ export LC_ALL='en_US.UTF-8'
$ export LC_CTYPE='en_US.UTF-8'
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
"query" : "trans:Kündigung",
"limit" : "1", params: {"hl.q":"Kündigung"}
}'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":21,
"params":{
"hl":"true",
"fl":"id",
"json":"\n{\n \"query\" : \"trans:Kündigung\",\n \"limit\" : \"1\", params: {\"hl.q\":\"Kündigung\"}\n}",
"hl.fl":"trans"}},
"response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
{
"id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
},
"highlighting":{
"b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{}}}
$ curl -H "Content-Type: application/json" http://localhost:8983/solr/trans/query?hl=true\&hl.fl=trans\&fl=id -d '
{
"query" : "trans:Kündigung",
"limit" : "1", params: {"hl.q":"Kundigung"}
}'
{
"responseHeader":{
"zkConnected":true,
"status":0,
"QTime":18,
"params":{
"hl":"true",
"fl":"id",
"json":"\n{\n \"query\" : \"trans:Kündigung\",\n \"limit\" : \"1\", params: {\"hl.q\":\"Kundigung\"}\n}",
"hl.fl":"trans"}},
"response":{"numFound":124,"start":0,"maxScore":4.3724422,"docs":[
{
"id":"b952b811-3711-4bb1-ae3d-e8c8725dcfe7"}]
},
"highlighting":{
"b952b811-3711-4bb1-ae3d-e8c8725dcfe7":{
"trans":[" ... <em>Kündigung</em> ..."]}}}
UPDATE 5: Without supplying hl.q (http://localhost:8983/solr/trans/select?q=trans:Kundigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml or http://localhost:8983/solr/trans/select?q=trans:K%C3%BCndigung&hl=true&hl.fl=trans&hl.qparser=lucene&hl.snippets=3&rows=1&wt=xml):
<lst name="highlighting">
<lst name="b952b811-3711-4bb1-ae3d-e8c8725dcfe7">
<arr name="trans">
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
<str> ... <em>Kündigung</em> ... </str>
</arr>
</lst>
</lst>
In this case, highlighting took its terms from the query itself and did a superb job.

Could be a problem with your JVM's encoding. What about -Dfile.encoding=UTF8? Check LC_ALL and LC_CTYPE too; they should be UTF-8.
What field type is the trans field? I have even indexed German text with text_en and have no problems with umlauts in highlighting or search, and I use the LuceneQParser too.
What does the response look like when you query via the Solr Admin UI (http://localhost:8983/solr/#/trans/query) with the hl checkbox activated?

Check your analyzer chain too. I get the same behaviour as you described when I misconfigure the chain this way:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>
The GermanNormalizationFilterFactory and GermanLightStemFilterFactory both replace umlauts.

What you need to specify is the field for which the highlighting is done. Just as in q=trans:Zeit, where you specified trans as the field, you need to specify hl.q as hl.q=trans:Kündigung. Your request then becomes:
http://localhost:8983/solr/trans/select?q=trans:Zeit&hl=true&hl.fl=trans&hl.q=trans:Kündigung&hl.snippets=3&wt=xml&rows=1
This answer was humbly assembled from the help of David Smiley, Stefan Matheis, and Erick Erickson of the Solr community and support; it is posted on their behalf.

Related

Solr 8.8 - trouble matching partial words with eDisMax and EdgeNGramFilter

I am new to Solr and trying to provide partial word matching with Solr 8.8.1, but partial words give no results. I have combed the blogs without luck trying to fix this.
For example, the text of a document contains the word longer. Index analysis gives lon, long, longe, longer. If I query longer using alltext_en:longer, I get a match. However, if I query (for example) longe using alltext_en:longe, I get no match; explainOther returns 0.0 = No matching clauses.
It seems that I am missing something obvious, since this is not a complex phrase query.
Apologies in advance if I have missed any needed details; I will update the question if you tell me what else you need to know.
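The index-side expansion described above can be sketched as follows (gram sizes taken from the schema below; this only illustrates EdgeNGramFilterFactory's output for one token, it is not Solr code):

```python
# What EdgeNGramFilterFactory(minGramSize=3, maxGramSize=15) emits for a
# single token at index time: every prefix between the two sizes.
def edge_ngrams(token: str, min_gram: int = 3, max_gram: int = 15) -> list:
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

print(edge_ngrams("longer"))  # ['lon', 'long', 'longe', 'longer']
```

The query side has no n-gram filter, so the literal query token must equal one of these grams; if an earlier query-time filter (such as the stemmer) rewrites "longe", the lookup fails.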
Here are the relevant field specs from my managed-schema:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<field name="alltext_en" type="text_en" multiValued="true" indexed="true" stored="true"/>
<copyField source="*_txt_en" dest="alltext_en"/>
Here is the relevant part of solrconfig.xml:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- Query settings -->
<str name="defType">edismax</str>
<str name="q">*:*</str>
<str name="q.alt">*:*</str>
<str name="rows">50</str>
<str name="fl">*,score,[explain]</str>
<str name="ps">10</str>
<!-- Highlighting defaults -->
<str name="hl">on</str>
<str name="hl.fl">_text_</str>
<str name="hl.preserveMulti">true</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre"><![CDATA[<span class="artica-snippet">]]></str>
<str name="hl.simple.post"><![CDATA[</span>]]></str>
<!-- Spell checking defaults -->
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.alternativeTermCount">2</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.maxCollations">3</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
That stemming filter will modify the tokens in ways you may not predict, and since it only runs on the query token that you try to match against the ngrammed tokens, the token might not be what you expect. If you're generating ngrams, stemming filters should usually be removed. I'd also remove the possessive filter. (Also, a small note: try to avoid using * for emphasis, since it's hard to tell whether you used it when querying or whether the formatting is an error; instead use backticks to indicate that the text is a code keyword or query.) – MatsLindh
That answered it: I removed the stemmer from the index step and everything was fine. Brilliant, thank you, @MatsLindh!

Spellcheck Solr: solr.DirectSolrSpellChecker config

I am trying to test the spellchecking functionality of Solr 4.7.2 using solr.DirectSolrSpellChecker (where you don't need to build a dedicated index).
I have a field named "title" in my index; I used a copyField definition to create a field named "title_spell" to be queried for the spellcheck (title_spell is correctly filled). However, in the Solr admin console, I always get empty suggestions.
For example: I have a Solr document with the title "A B automobile". In the admin console I tick the spellcheck checkbox and enter "atuomobile" under the spellcheck.q input field. I expect to get at least something like "A B automobile" or "automobile", but the spellcheck suggestion remains empty...
My configuration:
schema.xml (only relevant part copied):
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="de_DE/synonyms.txt" ignoreCase="true"
expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
...
<field name="title_spell" type="textSpell" indexed="true" stored="true" multiValued="false"/>
solr.xml (only relevant part copied):
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">textSpell</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">title_spell</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<str name="distanceMeasure">internal</str>
<float name="accuracy">0.5</float>
<int name="maxEdits">2</int>
<int name="minPrefix">1</int>
<int name="maxInspections">5</int>
<int name="minQueryLength">4</int>
<float name="maxQueryFrequency">0.01</float>
<float name="thresholdTokenFrequency">.01</float>
</lst>
</searchComponent>
...
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
</lst>
<!--Versuch, das online datum mit in die Gewichtung zu nehmen...-->
<lst name="appends">
<str name="bf">recip(ms(NOW/MONTH,sort_date___d_i_s),3.16e-11,50,1)</str>
<!--<str name="qf">title___td_i_s_gcopy^1e-11</str>-->
<str name="qf">title___td_i_s_gcopy^21</str>
<str name="q.op">AND</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
What did I miss? Thanks for your answers!
How large is your index? For a small index (think less than a few million docs), you're going to have to tune accuracy, maxQueryFrequency, and thresholdTokenFrequency. (Actually, it would probably be worth doing this on larger indices as well.)
For example, my 1.5 million doc index uses the following for these settings:
<float name="maxQueryFrequency">0.01</float>
<float name="thresholdTokenFrequency">.00001</float>
<float name="accuracy">0.5</float>
accuracy tells Solr how accurate a result needs to be before it's considered worth returning as a suggestion.
maxQueryFrequency tells Solr how frequently the term needs to occur in the index before it can be considered worth returning as a suggestion.
thresholdTokenFrequency tells Solr what percentage of documents the term must be included in before it's considered worth returning as a suggestion.
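As an illustration (plain Python, not Solr internals; the document counts and the similarity measure are made up for the sketch), a candidate suggestion has to clear all three gates at once:

```python
from difflib import SequenceMatcher

def passes(candidate, query, doc_freq, query_freq, num_docs,
           accuracy=0.5, max_query_frequency=0.01,
           threshold_token_frequency=0.00001):
    # accuracy: the candidate must be similar enough to the query term
    similar_enough = SequenceMatcher(None, query, candidate).ratio() >= accuracy
    # maxQueryFrequency: the query term itself must be rare in the index
    query_is_rare = query_freq / num_docs <= max_query_frequency
    # thresholdTokenFrequency: the candidate must be common enough
    candidate_is_common = doc_freq / num_docs >= threshold_token_frequency
    return similar_enough and query_is_rare and candidate_is_common

# "automobile" occurs in 200 of 1.5M docs; the typo "atuomobile" in none:
print(passes("automobile", "atuomobile",
             doc_freq=200, query_freq=0, num_docs=1_500_000))  # True
```

If the typo itself is frequent in the index, or the candidate is too rare to clear thresholdTokenFrequency, no suggestion comes back even though the strings are similar.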
If you plan to use spellchecking on multiple phrases, you may need to add a ShingleFilter to your title_spell field.
Another thing you might try is setting your queryAnalyzerFieldType to title_spell.
Can you please try editing your requestHandler declaration:
<requestHandler name="/standard" class="solr.SearchHandler" default="true">
and query url as:
http://localhost:8080/solr/service/standard?q=<term>&qf=title_spell
First experiment with small terms and learn how it behaves. One problem here is that it will only return terms starting with the same query term. You can use FuzzyLookupFactory, which will match and return fuzzy results. For more information, check the Solr Suggester wiki.

Solr 4.10 - Suggester is not working with multi-valued field

Hello everyone. I am using Solr 4.10 and am not getting the results I expect. I want auto-complete suggestions drawn from multiple fields, namely discountCatName, discountSubName, and vendorName. I have created a multi-valued field "suggestions" using copyField and use that field for searching in the suggester configuration.
Note: discountSubName and discountCatName are themselves multi-valued fields; vendorName is a string.
This is a suggestion field data from one of my document:
"suggestions": [
"Budget Car Rental",
"Car Rentals",
"Business Deals",
"Auto",
"Travel",
"Car Rentals" ]
If I type "car", I get "Budget Car Rental" among my suggestions but not "Car Rentals"; below are my configurations. Let me know if I need to change the tokenizer and filters. Any help with this would be appreciated.
Below are the code blocks for the scenario explained above: the suggestion field, fieldType, searchComponent, and request handler, respectively, which I am using for auto-complete suggestions.
<!--suggestion field -->
<field name="suggestions" type="suggestType" indexed="true" stored="true" multiValued="true"/>
<copyField source="discountCatName" dest="suggestions"/>
<copyField source="discountSubName" dest="suggestions"/>
<copyField source="vendorName" dest="suggestions"/>
<!--suggest fieldType -->
<fieldType name="suggestType" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[^a-zA-Z0-9]" replacement=" " />
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
<!--suggest searchComponent configuration -->
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">analyzing</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="suggestAnalyzerFieldType">suggestType</str>
<str name="blenderType">linear</str>
<str name="minPrefixChars">1</str>
<str name="doHighlight">false</str>
<str name="weightField">score</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">suggestions</str>
<str name="buildOnStartup">true</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<!--suggest request handler -->
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">analyzing</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
By debugging the Solr 4.10 source code, I just discovered a bug in the DocumentDictionaryFactory lookup: for a multi-valued field it always looks only at the first string and then stops suggesting from that document, hence I was not getting the expected output from my configuration above.
I created a separate indexed field for every value I want to search, like catName0...catName10 and subName0...subName10, then created a suggestion dictionary for each field, and finally parsed the responses from all the dictionaries, merged them, and sorted by weight and highlight position.
A lengthy approach, but there was no other way, as Solr 4.10 was required.
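The effect of the described lookup bug can be reproduced with a toy model (plain Python; this mimics the reported behavior, it is not the actual Solr code). With the sample document above, a search for "car" can only ever surface the first value:

```python
# One document whose multi-valued "suggestions" field holds several values.
docs = [["Budget Car Rental", "Car Rentals", "Business Deals",
         "Auto", "Travel", "Car Rentals"]]

def first_value_only_lookup(prefix, documents):
    """Mimics the bug: only the FIRST value per document is inspected."""
    hits = []
    for values in documents:
        first = values[0]                  # remaining values are ignored
        if prefix.lower() in first.lower():
            hits.append(first)
    return hits

print(first_value_only_lookup("car", docs))  # ['Budget Car Rental']
```

"Car Rentals" never appears, matching the behavior observed in the question.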

solr : highlighting : hl.simple.pre/post doesn't appear sometime

With Solr, I am trying to highlight some text using the hl.formatter option with hl.simple.pre/post.
My problem is that the hl.simple.pre/post markup sometimes doesn't appear in the highlighting results, and I don't understand why.
For example, I call this URL:
http://localhost:8080/solr/Employees/select?q=lastName:anthan&fl=lastName&wt=json&indent=true&hl=true&hl.fl=lastName&hl.simple.pre=<em>&hl.simple.post=</em>
I get :
..."highlighting": {
"NB0094418": {
"lastName": [
"Yogan<em>anthan</em>" => OK
]
},
"NB0104046": {
"lastName": [
"Vijayakanthan" => KO, I want Vijayak<em>anthan</em>
]
},
"NB0144981": {
"lastName": [
"Parmananthan" => KO, I want Parman<em>anthan</em>
]
},...
Does someone have an idea why I get this behavior?
My configuration :
schema.xml
<fieldType name="nameType" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="50" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.ASCIIFoldingFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z])" replacement="" replace="all" />
</analyzer>
</fieldType>
...
<fields>
<field name="lastName" type="nameType" indexed="true" stored="true" required="true" />
</fields>
solrconfig.xml
<requestHandler name="standard" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
</lst>
</requestHandler>
...
<searchComponent class="solr.HighlightComponent" name="highlight">
<highlighting>
<fragmenter name="gap" default="true" class="solr.highlight.GapFragmenter">
<lst name="defaults">
<int name="hl.fragsize">100</int>
</lst>
</fragmenter>
<fragmenter name="regex" class="solr.highlight.RegexFragmenter">
<lst name="defaults">
<int name="hl.fragsize">70</int>
<float name="hl.regex.slop">0.5</float>
<str name="hl.regex.pattern">[-\w ,/\n\"&apos;]{20,200}</str>
</lst>
</fragmenter>
<formatter name="html" default="true" class="solr.highlight.HtmlFormatter">
<lst name="defaults">
<str name="hl.simple.pre"><![CDATA[<em>]]></str>
<str name="hl.simple.post"><![CDATA[</em>]]></str>
</lst>
</formatter>
<encoder name="html" default="true" class="solr.highlight.HtmlEncoder" />
<fragListBuilder name="simple" default="true" class="solr.highlight.SimpleFragListBuilder" />
<fragListBuilder name="single" class="solr.highlight.SingleFragListBuilder" />
<fragmentsBuilder name="default" default="true" class="solr.highlight.ScoreOrderFragmentsBuilder">
</fragmentsBuilder>
<fragmentsBuilder name="colored" class="solr.highlight.ScoreOrderFragmentsBuilder">
<lst name="defaults">
<str name="hl.tag.pre"><![CDATA[
<b style="background:yellow">,<b style="background:lawgreen">,
<b style="background:aquamarine">,<b style="background:magenta">,
<b style="background:palegreen">,<b style="background:coral">,
<b style="background:wheat">,<b style="background:khaki">,
<b style="background:lime">,<b style="background:deepskyblue">]]></str>
<str name="hl.tag.post"><![CDATA[</b>]]></str>
</lst>
</fragmentsBuilder>
</highlighting>
</searchComponent>
I was dealing with a very similar problem until yesterday. I tried many different solutions, iteratively, so some of the details I ended up with may not be necessary, but I'll describe what I eventually got working. Short answer: I think the highlighter is failing to find the term position information it needs on longer fields.
Firstly, the symptoms I was seeing: sometimes the search term highlight would show up, and sometimes the entire field would show up in the highlighting section, but without the highlight information. The pattern ended up being based on both the length of the field, and the length of the search term. I found that the longer the field (actually, the token that was ngrammed), the shorter the search term that could be highlighted successfully. It wasn't 1-to-1, though. I found that for a field with 11 or fewer characters, highlighting worked fine in all cases. If the field had 12 characters, no ngram longer than 9 characters would be highlighted. For a field with 15 characters, ngrams longer than 7 characters would not be highlighted. For fields longer than 18 characters, ngrams longer than 6 characters would not be highlighted. And for fields longer than 21 characters, ngrams longer than 5 aren't highlighted, and fields longer than 24 characters wouldn't highlight more than 4 characters. (It looks like, from the examples you have above, that the specific sizes you are seeing are not exactly the same, but I do notice that the names in the documents where the highlighting did not work were longer than the one where it did.)
So, here's what ended up working:
I switched from using WhitespaceTokenizer and NGramFilterFactory to using NGramTokenizerFactory instead. (You are already using this, and I'll have more later on a difficulty this raised for me.) This wasn't sufficient to solve the problem, though, because the term positions still weren't being stored.
I started using the FastVectorHighlighter. This forced some changes in how my schema fields were indexed (including storing the term vectors, positions and offsets), and I also had to change my pre- and post-indicator tag configuration from hl.simple.pre to hl.tag.pre (and similarly for the post variants).
Once I had made these changes, the highlighting started working consistently. This had the side-effect, though, of removing the behavior I had been getting from the WhitespaceTokenizer. If I had a field that contained the phrase "this is a test" I was ending up with ngrams that included "s is a", "a tes", etc., and I really just wanted the ngrams of the individual words, not of the whole phrase. There is a note in the NGramTokenizer JavaDocs that you can override NGramTokenizer.isTokenChar() to provide pre-tokenizing, but I couldn't find an example of this on the web. I'll include one below.
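Before the Java plugin itself, here is a toy sketch (Python, purely illustrative; the gram sizes are arbitrary) of why the isTokenChar() override matters: ngramming the whole phrase yields cross-word grams, while pre-splitting on whitespace does not.

```python
def ngrams(text: str, min_gram: int, max_gram: int) -> list:
    """All substrings of text with lengths between min_gram and max_gram."""
    return [text[i:i + n]
            for n in range(min_gram, max_gram + 1)
            for i in range(len(text) - n + 1)]

phrase = "this is a test"

# Ngramming the whole phrase produces cross-word grams such as "s is a":
print("s is a" in ngrams(phrase, 2, 10))   # True

# Pre-splitting on whitespace (what the isTokenChar override achieves)
# keeps the grams inside individual words:
per_word = [g for word in phrase.split() for g in ngrams(word, 2, 10)]
print("s is a" in per_word)                # False
print("tes" in per_word)                   # True
```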
End result:
WhitespaceSplittingNGramTokenizer.java:
package info.jwismar.solr.plugin;
import java.io.Reader;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
import org.apache.lucene.util.Version;
public class WhitespaceSplittingNGramTokenizer extends NGramTokenizer {
public WhitespaceSplittingNGramTokenizer(Version version, Reader input, int minGram, int maxGram) {
super(version, input, minGram, maxGram);
}
public WhitespaceSplittingNGramTokenizer(Version version, AttributeFactory factory, Reader input, int minGram,
int maxGram) {
super(version, factory, input, minGram, maxGram);
}
public WhitespaceSplittingNGramTokenizer(Version version, Reader input) {
super(version, input);
}
@Override
protected boolean isTokenChar(int chr) {
return !Character.isWhitespace(chr);
}
}
WhitespaceSplittingNGramTokenizerFactory.java:
package info.jwismar.solr.plugin;
import java.io.Reader;
import java.util.Map;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ngram.NGramTokenizer;
import org.apache.lucene.analysis.util.TokenizerFactory;
import org.apache.lucene.util.AttributeSource.AttributeFactory;
public class WhitespaceSplittingNGramTokenizerFactory extends TokenizerFactory {
private final int maxGramSize;
private final int minGramSize;
/** Creates a new WhitespaceSplittingNGramTokenizer */
public WhitespaceSplittingNGramTokenizerFactory(Map<String, String> args) {
super(args);
minGramSize = getInt(args, "minGramSize", NGramTokenizer.DEFAULT_MIN_NGRAM_SIZE);
maxGramSize = getInt(args, "maxGramSize", NGramTokenizer.DEFAULT_MAX_NGRAM_SIZE);
if (!args.isEmpty()) {
throw new IllegalArgumentException("Unknown parameters: " + args);
}
}
@Override
public Tokenizer create(AttributeFactory factory, Reader reader) {
return new WhitespaceSplittingNGramTokenizer(luceneMatchVersion, factory, reader, minGramSize, maxGramSize);
}
}
These need to be packaged into a .jar and installed somewhere Solr can find them. One option is to add a lib directive in solrconfig.xml to tell Solr where to look. (I called mine solr-ngram-plugin.jar and installed it in /opt/solr-ngram-plugin/.)
Inside solrconfig.xml:
<lib path="/opt/solr-ngram-plugin/solr-ngram-plugin.jar" />
schema.xml (field type definition):
<fieldType name="any_token_ngram" class="solr.TextField">
<analyzer type="index">
<tokenizer class="info.jwismar.solr.plugin.WhitespaceSplittingNGramTokenizerFactory" maxGramSize="30" minGramSize="2"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="^(.{30})(.*)?" replacement="$1" replace="all" />
</analyzer>
</fieldType>
schema.xml (field definitions):
<fields>
<field name="property_address_full" type="string" indexed="false" stored="true" />
<field name="property_address_full_any_ngram" type="any_token_ngram" indexed="true"
stored="true" omitNorms="true" termVectors="true" termPositions="true"
termOffsets="true"/>
</fields>
<copyField source="property_address_full" dest="property_address_full_any_ngram" />
solrconfig.xml (request handler; you can pass these parameters in the normal select URL instead, if you prefer):
<!-- request handler to return typeahead suggestions -->
<requestHandler name="/suggest" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="defType">edismax</str>
<str name="rows">10</str>
<str name="mm">2</str>
<str name="fl">*,score</str>
<str name="qf">
property_address_full^100.0
property_address_full_any_ngram^10.0
</str>
<str name="sort">score desc</str>
<str name="hl">true</str>
<str name="hl.fl">property_address_full_any_ngram</str>
<str name="hl.tag.pre">|-></str>
<str name="hl.tag.post"><-|</str>
<str name="hl.fragsize">1000</str>
<str name="hl.mergeContinuous">true</str>
<str name="hl.useFastVectorHighlighter">true</str>
</lst>
</requestHandler>
If you are asking why the hl.tag.pre and hl.tag.post values defined in your configuration are not appearing in the sample query results, and the <em> and </em> pre/post tags show up instead...
This is because you are specifying the pre/post tag parameters in the query string (at request time), so they override the default settings you have defined for the highlight searchComponent in your solrconfig.xml file.
Either remove those query-string parameters, or set hl.tag.pre and hl.tag.post in a <lst name="invariants"> section of the searchComponent configuration to force them to override any request-time parameters.
Here is an overview of the various Configuration Settings

Solr Spell Check

I am working with Solr spell check and have it up and running. However, for certain misspellings it does not give the expected result.
Correct word: Cancer
Incorrect spellings: Cacner, cacnar, cancar, cancre, cancere.
I am not getting "cancer" as the suggestion for "cacner"; instead it shows "inner", which, although it sounds more like "cacner", is not the correct suggestion. And for "cacnar" I am again getting "pulmonary" as a suggestion.
Is there any way of configuring it to display "cancer" instead of the other results?
Alternatively, is there any score for the suggestions that can be checked before showing them to the user?
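For what it's worth, plain edit distance already prefers the right word here: "cancer" is 2 edits from "cacner" while "inner" is 3, so the odd suggestions point at frequency/threshold configuration rather than string similarity. A quick check with a standard dynamic-programming Levenshtein distance (not Solr's internal measure):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic DP edit distance (insert/delete/substitute, unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("cacner", "cancer"))  # 2
print(levenshtein("cacner", "inner"))   # 3
```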
As per request here is the configuration :
The field used for dictionary (in schema.xml):
<copyField source="procname" dest="dtextspell" />
<field name = "dtextspell" stored="false" type="text_small" multiValued="true" indexed="true"/>
Definition of "text_small" (again in schema.xml) :
<fieldType name="text_small" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
<analyzer type ="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StandardFilterFactory"/>
</analyzer>
</fieldType>
In solrconfig.xml :
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">text_small</str>
<lst name="spellchecker">
<str name="name">default</str>
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="field">dtextspell</str>
<float name="thresholdTokenFrequency">.0001</float>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="field">name</str>
<str name="buildOnCommit">true</str>
</lst></searchComponent>
Attached it to the select request handler like this :
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="spellcheck.count">10</str>
<str name="df">text</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr> </requestHandler>
To build the spell check :
http://localhost:8080/solr/select?q=*:*&spellcheck=true&spellcheck.build=true
To search for term :
http://localhost:8080/solr/select?q=procname:%22cacner%22&spellcheck=true&defType=edismax
The response XML :
<lst name="spellcheck"><lst name="suggestions">
<lst name="cacner">
<int name="numFound">1</int>
<int name="startOffset">10</int>
<int name="endOffset">16</int>
<arr name="suggestion">
<str>inner</str>
<!-- remaining closing tags omitted -->
Hope it helps!
Sounds like you've not rebuilt the spellchecker's index recently. Request a manual rebuild by making a query with spellcheck=true&spellcheck.build=true appended to the query string (do NOT do this on every request, as the build process can take some time). You should also make sure that you're building the spellchecker's index from the correct field.
You can also configure the spellchecker component to rebuild the index on every commit or on every optimize, by adding:
<str name="buildOnCommit">true</str>
or
<str name="buildOnOptimize">true</str>
to your spellchecker configuration.
