Solr / Lucene query highlighting in Arabic

I'm working with Solr 4.1 and I want to highlight an Arabic query, but it doesn't work correctly. It finds the word to be highlighted, but when it adds the highlight tag (for example <em>), it can't find the right index to insert the tag at. For example, it creates something like this for the query pizza:
<str>i eat<em> pizz</em>a every weekend</str>
It works correctly for English; the example above is only meant to illustrate the problem.
Here is an Arabic example for the query علی:
<str>أَخْبَرَنِي الرَّئِیسُ الْعَفِیفُ أَبُو الْبَقَاءِ هِبَةُ اللَّه‌ِ بْنُ نَمَا بْن<em>ِ عَلِي</em>ِّ بْ</str>
which I expect to be:
<str>أَخْبَرَنِي الرَّئِیسُ الْعَفِیفُ أَبُو الْبَقَاءِ هِبَةُ اللَّه‌ِ بْنُ نَمَا بْنِ <em>عَلِيِّ</em> بْ</str>
Note that I use the following field type definition:
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="searchEng.solr.ar.CharFilter" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- for any non-arabic -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" enablePositionIncrements="true"/>
    <!-- normalizes ﻯ to ﻱ, etc -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
The first charFilter just normalizes some Arabic characters.


Solr not returning the exact element

I'm using Solr 7.7.3.
I have an element with the label "alpha-ravi",
and when I search in Solr for label:"alpha", it returns the element with the label "alpha-ravi".
According to the Solr documentation, it should not return this element.
Can anyone explain this behavior?
If you want exact results only (i.e., return documents with "alpha-ravi" only when the user types exactly "alpha-ravi"), I would suggest the Keyword tokenizer (solr.KeywordTokenizerFactory). This tokenizer treats the entire "alpha-ravi" as a single token, so it will not return partial results for a match on "alpha" or "ravi".
For example, in your schema.xml file you could add something like this (configure the filter chains as you need):
<fieldType name="single_token_string" class="solr.TextField" sortMissingLast="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Then you can use this fieldType in the same schema.xml (referencing the field type we just defined):
<field name="myField" type="single_token_string" indexed="true" stored="true" />
The default text field types in Solr use the StandardTokenizer, which splits "alpha-ravi" at the hyphen into multiple tokens (and therefore matches "alpha" and "ravi").
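The difference between the two tokenizers can be illustrated with a small Python simulation. This is only a rough sketch, not Solr code: standard_like_tokens, keyword_tokens and matches are hypothetical helpers, the regex split is a simplification of what StandardTokenizer does, and the ASCII folding and duplicate-removal filters are left out.

```python
import re


def standard_like_tokens(text):
    # Rough stand-in for StandardTokenizer + LowerCaseFilter:
    # split on anything that is not a word character, then lowercase.
    return [t.lower() for t in re.findall(r"\w+", text)]


def keyword_tokens(text):
    # Stand-in for KeywordTokenizer + TrimFilter + LowerCaseFilter:
    # the whole input becomes a single token.
    return [text.strip().lower()]


def matches(index_tokens, query_tokens):
    # A simple term query matches when any query token was indexed.
    return any(q in index_tokens for q in query_tokens)


label = "alpha-ravi"
# Tokenized field: the query "alpha" finds the document.
print(matches(standard_like_tokens(label), standard_like_tokens("alpha")))
# Single-token field: "alpha" no longer matches, the full label does.
print(matches(keyword_tokens(label), keyword_tokens("alpha")))
print(matches(keyword_tokens(label), keyword_tokens("Alpha-Ravi")))
```

The first call prints True because the hyphen splits the label into two indexed tokens; with the single-token field only the complete label matches.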
As an alternative, you could also run a phrase query. The phrase is still analyzed, but its tokens must appear next to each other and in the same order. Possibly something like http://localhost:8983/solr/...fq=label:"alpha-ravi"
Hope that helps. All the best!

How to get an "ends with" search in Solr 4.8.1?

I have a document, indexed on Solr, which contains this field:
{
  "manufacturerSkuEndsWith": [
    "DU351118DR0"
  ]
}
My goal is to get an "ends with" search on the manufacturerSkuEndsWith field. For example, the following queries should match the value above: DR0, 8DR0, 18DR0, 118DR0... but these queries should NOT match: DU35, 118DR, 118...
My problem is that the query 118 matches that document, even though DU351118DR0 does not end with 118.
My Solr and Lucene version is 4.8.1. I found out that in this version the side="back" option of the EdgeNGramTokenizer is no longer supported (LUCENE-3907). That thread suggests using a ReverseStringFilter to get behaviour similar to an EdgeNGramTokenizer with side="back", so this is how I configured the manufacturerSkuEndsWith field in my schema.xml:
<field indexed="true" multiValued="true" name="manufacturerSkuEndsWith" stored="true" type="smccTextReversedNGram"/>
<copyField dest="manufacturerSkuEndsWith" source="ManufacturerSku"/>
<fieldType class="solr.TextField" name="smccTextReversedNGram" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" maxGramSize="10" minGramSize="3"/>
    <filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
</fieldType>
But this configuration does not perform an "ends with" search.
How can I get that type of search instead?
You're using the NGramTokenizer and not the EdgeNGramFilter as shown in the examples. The NGramTokenizer generates tokens from inside the string as well, not just from the edge.
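To see why the query 118 matches with the configuration from the question, here is a small Python simulation of that analysis chain. It is a sketch under simplifying assumptions, not Solr code: ngrams, index_tokens and query_token are hypothetical helpers, and the synonym and stop filters are ignored.

```python
def ngrams(text, min_gram, max_gram):
    # Every substring with a length between min_gram and max_gram,
    # which is roughly what NGramTokenizer emits.
    return {
        text[i:i + n]
        for n in range(min_gram, max_gram + 1)
        for i in range(len(text) - n + 1)
    }


def index_tokens(value):
    # NGramTokenizer(3..10) -> LowerCaseFilter -> ReverseStringFilter
    return {gram.lower()[::-1] for gram in ngrams(value, 3, 10)}


def query_token(query):
    # KeywordTokenizer -> LowerCaseFilter -> ReverseStringFilter
    return query.lower()[::-1]


tokens = index_tokens("DU351118DR0")
for q in ["DR0", "118", "DU35"]:
    # All three are substrings of the SKU, so all three match.
    print(q, query_token(q) in tokens)
```

Because every inner substring is indexed, reversing the tokens afterwards changes nothing about which queries match: the field behaves like a "contains" search.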
To get the behavior you're looking for, you need a KeywordTokenizer (which keeps the input as a single token), followed by the ReverseStringFilter to reverse it, and then the EdgeNGramFilter to generate strings from the start of the now reversed string:
foo -> oof -> o, oo, oof
You can then either run these through the reversed string filter again to get the "correct" versions indexed:
-> o, oo, foo
... or you can do as you've done in your field, and reverse the input string instead:
foo -> oof -> matches the oof token
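The chain described above can be simulated in a few lines of Python. This is a minimal sketch, not Solr code: edge_ngrams, index_tokens, query_token and ends_with_match are hypothetical helpers, and the gram sizes 3 and 25 are assumptions chosen so that the whole SKU from the question fits.

```python
def edge_ngrams(token, min_gram, max_gram):
    # Prefixes of the token, like EdgeNGramFilter produces.
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_tokens(value, min_gram=3, max_gram=25):
    # KeywordTokenizer -> LowerCaseFilter -> ReverseStringFilter -> EdgeNGramFilter
    reversed_value = value.lower()[::-1]
    return set(edge_ngrams(reversed_value, min_gram, max_gram))


def query_token(query):
    # KeywordTokenizer -> LowerCaseFilter -> ReverseStringFilter
    return query.lower()[::-1]


def ends_with_match(value, query):
    # The reversed query must equal one of the reversed prefixes.
    return query_token(query) in index_tokens(value)


sku = "DU351118DR0"
for q in ["DR0", "8DR0", "118DR0", "DU35", "118DR", "118"]:
    print(q, ends_with_match(sku, q))
```

With these assumptions the first three queries match and the last three do not, which is the behaviour asked for in the question.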

Autocomplete in Solr with Case-insensitive feature

I have been trying out the autocomplete feature in Solr 4.7.1 using the Suggester. I have configured it to display phrase suggestions as well. The problem is that if I type "game", I get suggestions such as "game" or phrases containing "game".
But if I type "Game", no suggestion is displayed at all. How can I make the suggestions case-insensitive?
I have configured the field type in schema.xml like this:
<fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="4"
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
What worked for me was tweaking the code in the Velocity template head.vm. I changed the terms.prefix entry to:
'terms.prefix': function() { return $("#q").val().toLowerCase();},
This solved my issue, since I am using the Terms component for suggestions.
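The same idea works for any client that calls the Terms component: the value of terms.prefix is compared against the indexed terms as they are, so it has to be lowercased by the caller when the index chain lowercases. A minimal Python sketch follows; terms_suggest_params is a hypothetical helper and the field name text_auto is an assumption borrowed from the field type name above.

```python
from urllib.parse import urlencode


def terms_suggest_params(user_input, field="text_auto"):
    # Lowercase the prefix on the client, because the indexed terms
    # were lowercased by LowerCaseFilterFactory at index time.
    return {
        "terms": "true",
        "terms.fl": field,
        "terms.prefix": user_input.lower(),
        "wt": "json",
    }


# Build the query string for a Terms component request.
print(urlencode(terms_suggest_params("Game")))
```

Typing "Game" and typing "game" now produce the same request, so both return the same suggestions.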
I tried the same schema in the Solr Admin Analysis view, where you can provide index and query values to see how the tokens are matched.
With your schema on my local Solr instance, it seems to work fine, i.e., Game and game are considered equal and matched.
I would urge you to post the request query and the Suggester configuration (if you are using one).

Solr doesn't find all Chinese characters

I want to use Solr for a page in Chinese. It works fine, but I can't find some of the characters.
I use the SmartChineseSentenceTokenizerFactory in my schema.xml like this:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>
</fieldType>
I've also tried the CJKTokenizerFactory; the result was even worse.
On an example page I have the following text (copied from Chinese Wikipedia):
就必須參加 國中教育會考
It's indexed in Solr and I can search for every character except 教.
This character means something like teach, instruct, teaching, religion, so it's a normal word.
That's just one example where a single character cannot be found.
I'm having a similar issue, and I believe it's because Smart Chinese uses a dictionary to segment the text into words instead of indexing single characters. I can search for 教育 or 教授 without a problem, but 教 on its own produces nothing. So I have two searches on our site: one uses Solr and the other is a simple search against the text, and I give users directions on the site explaining how each search works.
What was your ultimate solution?
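The effect described in this answer can be shown with a toy word segmenter in Python. This is only an illustration: segment is a hypothetical longest-match helper and TOY_DICTIONARY is an invented word list, whereas the real Smart Chinese analyzer uses its own dictionary and a statistical model.

```python
def segment(text, dictionary, max_len=4):
    # Greedy longest-match segmentation: at each position take the
    # longest dictionary word, or fall back to a single character.
    tokens = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if n == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += n
                break
    return tokens


# Invented word list for this example only.
TOY_DICTIONARY = {"必須", "參加", "國中", "教育", "會考"}

tokens = segment("就必須參加 國中教育會考", TOY_DICTIONARY)
print(tokens)
# The word is indexed as one token, the single character is not.
print("教育" in tokens, "教" in tokens)
```

Once 教育 is indexed as a single token, a query for 教 alone has no indexed term to match, which is consistent with what both posters observe.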

Solr preserve whitespace search

Below is my fieldType. I want to preserve the whitespace during search:
<fieldType name="searchterm" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="250" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
For example, with the input "alpha beta", a search for either "alpha" or "beta" matches. How do I make sure that a search term like "alpha eta" does not match? Searches for "eta" and "pha" should still match, but "alpha eta" should not.
Would be nice to know what kind of application needs such a search :-).
You can do the following:
If your search term has no spaces, use your existing field searchterm.
To handle search queries that contain spaces, create a new copyField (say, newsearchterm) that uses EdgeNGramFilterFactory instead of NGramFilterFactory.
For newsearchterm, the analysis will go like this:
alpha beta ==> alp, alph, alpha, bet, beta
So a search for newsearchterm:(alpha AND eta) won't match alpha beta.
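The two-field approach can be sketched as a Python simulation. This is not Solr code: ngrams, edge_ngrams, index_terms and search are hypothetical helpers, and the routing by "does the query contain a space" is assumed to happen in the application before it picks the field to query.

```python
def ngrams(token, min_gram=3, max_gram=250):
    # All substrings of the token (NGramFilterFactory on searchterm).
    longest = min(max_gram, len(token))
    return {
        token[i:i + n]
        for n in range(min_gram, longest + 1)
        for i in range(len(token) - n + 1)
    }


def edge_ngrams(token, min_gram=3, max_gram=250):
    # Prefixes only (EdgeNGramFilterFactory on newsearchterm).
    longest = min(max_gram, len(token))
    return {token[:n] for n in range(min_gram, longest + 1)}


def index_terms(text, gram_fn):
    # WhitespaceTokenizer -> LowerCaseFilter -> gram filter
    terms = set()
    for token in text.lower().split():
        terms |= gram_fn(token)
    return terms


def search(text, query):
    words = query.lower().split()
    if len(words) == 1:
        # No spaces: query the existing n-gram field.
        return words[0] in index_terms(text, ngrams)
    # Spaces: query the edge n-gram field, all words required (AND).
    edge_terms = index_terms(text, edge_ngrams)
    return all(word in edge_terms for word in words)


for q in ["alpha", "eta", "pha", "alpha beta", "alpha eta"]:
    print(q, search("alpha beta", q))
```

Only "alpha eta" is rejected here. One consequence of this design is that a multi-word query made of prefixes, such as "alph bet", still matches.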
