Solr doesn't find all Chinese characters

I want to use Solr for a page in Chinese. It works fine, but I can't find some of the characters.
I use the SmartChineseSentenceTokenizerFactory in my schema.xml like this:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.SmartChineseSentenceTokenizerFactory"/>
    <filter class="solr.SmartChineseWordTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>
</fieldType>
I've also tried the CJKTokenizerFactory, but the results were even worse.
On an example page I have the following text (copied from Chinese Wikipedia):
就必須參加 國中教育會考
It's indexed in Solr, and I can search for every character except 教.
This character means something like teach, instruct, teaching, or religion, so it's a normal word.
That's just one example of a single character that cannot be found.

I'm having a similar issue, and I believe it's because Smart Chinese uses a dictionary that looks for multi-character words instead of single characters. I can search for 教育 or 教授 without a problem, but 教 on its own produces nothing. So I have two searches on our site: one uses Solr and the other is a simple search against the text, and I give users directions on the site explaining how each search works.
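If single-character matches matter, one commonly used alternative to dictionary-based segmentation is to index overlapping bigrams together with unigrams. The field type below is a minimal sketch of that idea, not the solution either poster ended up with: the name text_cjk_unigram is made up for illustration, and it assumes a Solr version in which solr.CJKBigramFilterFactory supports the outputUnigrams attribute. With unigrams emitted, 教 should be indexed as its own token alongside 教育, at the cost of a larger index and noisier matches.

```xml
<!-- Hypothetical field type: bigrams plus unigrams for CJK text -->
<fieldType name="text_cjk_unigram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer emits each Han character as a separate token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Folds full-width ASCII and half-width Katakana variants to their common forms -->
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Forms overlapping bigrams; outputUnigrams="true" also keeps the single characters -->
    <filter class="solr.CJKBigramFilterFactory" outputUnigrams="true"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr Admin UI can be used to check that 教 appears as a token for the sample text before reindexing.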
What was your ultimate solution?

Related

No response with query string containing whitespace in SOLR autocomplete

I am trying to use the Solr autocomplete feature. Basically, once a user has typed 3 characters, I want to show results for every further character typed. The Solr version is 6.5.1. Below is the configuration I am using.
<fieldType name="searchFieldType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I have a sample index with the following field values:
"ta", "taj", "tajbacd", "tajabcd", "taj cbad", "taj abcd", "taj bcad", "taj abcd cbad", "taj abcd abcd", "taj abcd bacd", "abcd taj", "abcd ta", "random string"
When I search for "taj", I get the expected results. But if I search for "taj " or "taj ab", Solr does not return any results. Can you help me here? I tried the Analysis screen, which shows that the n-gram is found.
So, I read your question too fast... my bad.
Can you show us the requests you are using to verify this? Both the one that works and the one that doesn't.
By the way, one thing you can already fix: if you only send 3 characters or more, you can change minGramSize="1" to minGramSize="3".
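Applied to the field type from the question, that suggestion would look roughly like the sketch below. Only minGramSize changes. This shrinks the index, but on its own it is not expected to explain the missing results for queries containing a space, and values shorter than three characters (such as "ta") would likely no longer produce any grams.

```xml
<fieldType name="searchFieldType" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- KeywordTokenizer keeps the whole value, spaces included, as one token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Start at 3 because the UI only queries after three typed characters -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="50"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the index-time chain changes, existing documents would need to be reindexed for the new gram sizes to take effect.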
You can simply use a wildcard (partial) match in this case:
q={!complexphrase inOrder=true}YourField:"taj ab*"

Solr suggest exact match

I am trying to make Solr return exact matches in suggestions. For example:
spellcheck.q=tota does return total in results but
spellcheck.q=total does not return total in results.
I am using this field for suggestions:
<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
Any idea how to make Solr return exact matches in suggestions?
You are using the SpellCheck component, which, as the name indicates, is meant for spellchecking. It returns suggestions for how the entry should be spelled. When the word is spelled correctly (which amounts to an exact match), it returns nothing, which is why you don't see the word in the list.
Since Solr 4.7 there is a new Suggester component (solr.SuggestComponent), which is designed for autosuggestion and should yield the results you expect.
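A minimal solrconfig.xml sketch of that component follows, modeled on the Suggester documentation. The suggester name (mySuggester), the source field (name_suggest), and the analyzer field type (text_general) are placeholders rather than names from the question, and the lookup and dictionary implementations are just one reasonable choice; check which ones your Solr version ships with.

```xml
<!-- Hypothetical suggester definition; adjust the names to your schema -->
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">mySuggester</str>
    <str name="lookupImpl">AnalyzingLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <!-- Placeholder field; it must be stored to serve as a suggestion source -->
    <str name="field">name_suggest</str>
    <!-- Field type whose analyzer is applied to the typed prefix -->
    <str name="suggestAnalyzerFieldType">text_general</str>
  </lst>
</searchComponent>

<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">mySuggester</str>
    <str name="suggest.count">10</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>
```

After indexing, the suggester dictionary has to be built before it returns anything, for example with a request such as /suggest?suggest.build=true&suggest.q=total.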
Can you try this:
<fieldType name="textSpellShingle" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="front"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="50" side="back"/>
  </analyzer>
</fieldType>
As mentioned on this wiki page: https://cwiki.apache.org/confluence/display/solr/Suggester
To be used as the basis for a suggestion, the field must be stored.
Make sure your field is stored.
Your field isn't stored, so it is returning the data crunched by your indexer.
Your problem arises because you used the old suggest component, which is based on the spellcheck component (I suppose you used a Solr version before 5).
With the old spellcheck/suggest, if the word matches, it is not returned in the response!
Test with solr.SuggestComponent (if present in your version).
See: https://cwiki.apache.org/confluence/display/solr/Suggester

Solr substring search yields all indexed results

To do a substring search, I have added a new fieldType - "Text" with NgramFilter.
It works fine, but it has the following downside.
Example:
name = ['Apple','Samy','And','a']
When I search for name:a, all of the above items are returned. Even when the search changes to "App", all of the above items are returned. How can I fix this?
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" />
  </analyzer>
</fieldType>
As you can see in the analysis, both the indexed value and the query value get processed by the EdgeNGramFilter, meaning that a query will match anything that shares one of its edge n-grams. Add a simpler analyzer chain for querying the field, and you should be good to go.
The example from the Wiki should be usable by just copying and pasting it:
<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
  </analyzer>
</fieldType>
My initial guess was that since you don't provide two separate definitions, Solr uses the same chain for both indexing and querying. Your analysis output confirms that suspicion. Try adding an analyzer with type="query" to get a specific chain for querying the field (you do not want EdgeNGram in both places).
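Applied to the field type from the question, that advice might look like the following sketch. The index analyzer is unchanged; the query analyzer is the added part and deliberately omits the EdgeNGram filter, so the typed text is matched as-is against the indexed grams. Whether to also add a LowerCaseFilterFactory to both chains is a separate choice not shown here.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Index-time only: store every leading prefix of each token -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gram filter here: the query term must equal one of the indexed prefixes -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```

Because only the query chain changes, existing documents should not need to be reindexed.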

Autocomplete in Solr with Case-insensitive feature

I have been trying out the autocomplete feature in Solr 4.7.1 using the Suggester. I have configured it to display phrase suggestions as well. The problem is that if I type "game", I get "game" or phrases containing "game" as suggestions.
But if I type "Game", no suggestion is displayed at all. How can I make the suggestions case-insensitive?
I have configured in schema.xml fields like this:
<fieldType name="text_auto" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
            minShingleSize="2"
            maxShingleSize="4"
            outputUnigrams="true"
            outputUnigramsIfNoShingles="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
What worked for me was tweaking some code in the Velocity file head.vm. I changed it to 'terms.prefix': function() { return $("#q").val().toLowerCase();},
which solved my issue, as I am using the Terms component for suggestions.
I tried the same schema in the Solr Admin Analysis view. You can provide the index and query values there to see how the tokens are matched.
I tried your schema in my local Solr instance, and it seems to work fine, i.e. Game and game are considered equal and are matched.
I would urge you to post the request query and/or the Suggester configuration (if you are using it).

Solr / Lucene query highlighting in Arabic

I'm working with Solr 4.1 and I want to highlight an Arabic query, but it doesn't work correctly. It finds the word to be highlighted correctly, but when it adds the highlight tag (for example <em>) it can't find the right index at which to insert the tag. For example, it creates something like this for the query pizza:
<str>i eat<em> pizz</em>a every weekend</str>
It works correctly for English; this is just to explain what I mean.
Here is an Arabic example for the query علی:
<str>أَخْبَرَنِي الرَّئِیسُ الْعَفِیفُ أَبُو الْبَقَاءِ هِبَةُ اللَّه‌ِ بْنُ نَمَا بْن<em>ِ عَلِي</em>ِّ بْ</str>
which I expect to be:
<str>أَخْبَرَنِي الرَّئِیسُ الْعَفِیفُ أَبُو الْبَقَاءِ هِبَةُ اللَّه‌ِ بْنُ نَمَا بْنِ <em>عَلِيِّ</em> بْ</str>
Note that I use the following field definition:
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="searchEng.solr.ar.CharFilter" />
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- for any non-arabic -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt" enablePositionIncrements="true"/>
    <!-- normalizes ﻯ to ﻱ, etc -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>
The charFilter at the top just normalizes some Arabic characters.
