Is Solr SuggestComponent able to return shingles instead of whole field values?

I use Solr 5.0.0 and want to build an autocomplete feature that generates suggestions from the word n-grams (shingles) of my documents.
The problem is that a suggest query only returns the complete values of the search field, which can be extremely long.
CURRENT PROBLEM:
Input:"so"
Suggestions:
"......extremly long text son long text continuing......"
"......next long text solar next text continuing......"
GOAL:
Input: "so"
Suggestions with shingles:
"son"
"solar"
"solar test"
etc.
solrconfig.xml:
<searchComponent name="suggest" class="solr.SuggestComponent"
enable="${solr.suggester.enabled:true}" >
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title_and_description_suggest</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">autocomplete</str>
<str name="queryAnalyzerFieldType">autocomplete</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
schema.xml:
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" outputUnigramsIfNoShingles="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
I want to return at most 3 words per autocomplete suggestion. Is this possible with the SuggestComponent, or how else would you do it? No matter what I try, I always receive the complete field value of the matching documents.
Is that the expected behaviour, or did I do something wrong?
Many thanks in advance

In schema.xml define fieldType as follows:
<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
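As a rough illustration of what the index analyzer above emits, the following Python sketch approximates whitespace tokenization, lowercasing and shingling. It is not Solr's implementation: the function name and the sample text are made up for this example, and it assumes outputUnigrams keeps its default of true.

```python
def shingles(text, min_size=2, max_size=5, output_unigrams=True):
    """Approximate WhitespaceTokenizer + LowerCaseFilter + ShingleFilter."""
    tokens = text.lower().split()
    out = []
    for i in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[i])
        # Emit every word n-gram of the allowed sizes that starts at token i.
        for n in range(min_size, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out


print(shingles("Solar test panel"))
```

Each of these terms becomes a separate indexed value, which is what lets a prefix lookup return "solar" and "solar test" instead of the whole field.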
In schema.xml define your field as follows:
<field name="example_field" type="text_autocomplete" indexed="true" stored="true"/>
Write your query as follows:
query?q=*:*&
rows=0&
facet=true&
facet.field=example_field&
facet.limit=-1&
wt=json&
indent=true&
facet.prefix=so
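In the facet.prefix parameter, specify the prefix for which you want suggestions ('so' in this example). The prefix is compared against the indexed terms as-is, so it should be lowercase, because the index analyzer lowercases the shingles. If you need fewer than 5 words per suggestion, reduce maxShingleSize in the fieldType definition accordingly. Results come back in decreasing order of frequency when facet.sort=count is in effect; note that with facet.limit=-1 Solr defaults to index (lexicographic) order, so set facet.sort=count explicitly if you want the most frequent suggestions first.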
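The facet request can also be assembled and read programmatically. The following is a minimal Python sketch, not part of the original answer: the base URL, the core name mycore, the helper names and the sample response are assumptions. It relies on Solr's default JSON layout for facet fields, a flat list of alternating term and count.

```python
from urllib.parse import urlencode


def build_suggest_url(prefix, base="http://localhost:8983/solr/mycore/select"):
    """Build a facet.prefix request (base URL and core name are assumed)."""
    params = {
        "q": "*:*",
        "rows": 0,
        "facet": "true",
        "facet.field": "example_field",
        "facet.prefix": prefix.lower(),  # indexed shingles are lowercased
        "facet.sort": "count",           # most frequent suggestions first
        "facet.limit": 10,
        "wt": "json",
    }
    return base + "?" + urlencode(params)


def parse_suggestions(response, field="example_field"):
    """Pair up the flat [term, count, term, count, ...] facet list."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Hand-written sample in the shape of a Solr JSON response (not real data).
sample = {"facet_counts": {"facet_fields": {
    "example_field": ["solar", 4, "solar test", 2, "son", 1]}}}

print(build_suggest_url("So"))
print(parse_suggestions(sample))
```

Limiting the facet to 10 entries keeps the response small for an autocomplete box; the original request uses facet.limit=-1 to return every matching term.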

Related

Why solr keywordtokenizerfactory field query response is taking so long

I have defined the field type below in Solr's conf\managed-schema. My core is very large: ~30 million documents, ~20 GB of indexed data, ~12 fields, and the JVM memory allocation is -Xms10g -Xmx10g.
<field name="Address" type="text_keyword" default="" multiValued="false" indexed="true" stored="true"/>
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And a /update request handler in conf\solrconfig.xml:
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
</requestHandler>
The dedupe chain, also in conf\solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<str name="fields">Address,Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8,Field9,Field10,Field10</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Here is my search field query example:
http://localhost:8983/solr/mycore/select?indent=true&q.op=OR&q=*:*&fq=((Address:(*premier*)) OR (Field1:(*premier*)) OR (Field2:(*premier*)) OR (Field3:(*premier*)) OR (Field4:(*premier*)) OR (Field5:(*premier*)) OR (Field6:(*premier*)) OR (Field7:(*premier*)) OR (Field8:(*premier*)) OR (Field9:(*premier*)) OR (Field10:(*premier*)) OR (Field11:(*premier*)))
AND ((Address:(*1\:34*)) OR (Field1:(*1\:34*)) OR (Field2:(*1\:34*)) OR (Field3:(*1\:34*)) OR (Field4:(*1\:34*)) OR (Field5:(*1\:34*)) OR (Field6:(*1\:34*)) OR (Field7:(*1\:34*)) OR (Field8:(*1\:34*)) OR (Field9:(*1\:34*)) OR (Field10:(*1\:34*)) OR (Field11:(*1\:34*)))
&rows=10&start=0&wt=json
The special characters are escaped with \, and I am trying to filter the records the way a LIKE clause does in an RDBMS. The query above works, but the response takes very long. What can I do to speed it up?

Solr term search not searching all values from multifields value

I have the following Solr field:
<field name="AllTitles" type="text_general" indexed="true" stored="false" multiValued="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Example values for AllTitles:
AllTitles: [ "Anything", "wuhan coronavirus", "anything" ]
AllTitles: [ "wuhan coronavirus", "anything", "anything" ]
A search only seems to match when the term is in the first value; if the matching term is at any other position in the multi-valued field, it is not found.
For example, when I search
q="wuhan coronavirus"
I get 2 results. When I search with the field name "AllTitles"
q=AllTitles:"wuhan coronavirus"
I correctly get 7 results.
Can anybody help me identify the issue?
First, check in your solrconfig.xml which field is defined as the default field ("df"). In the example below it is "text".
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
</requestHandler>
Second, in schema.xml or managed-schema, whichever you are using, make sure "AllTitles" is copied to "text", like this:
<copyField source="AllTitles" dest="text"/>
Before changing anything, you can test this by passing "AllTitles" as the "df" parameter when you query, as raghu777 has mentioned.
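The quick test with an explicit default field can be sketched as follows. This is a hedged Python illustration, not part of the original answer; the helper name is hypothetical and it only builds the query string, it does not contact Solr.

```python
from urllib.parse import urlencode, parse_qs


def select_params(phrase, default_field=None):
    """Build a phrase query, optionally overriding the default field (df)."""
    params = {"q": f'"{phrase}"'}
    if default_field:
        # Unqualified terms in q are searched in this field instead of the
        # df configured in the request handler defaults.
        params["df"] = default_field
    return urlencode(params)


qs = select_params("wuhan coronavirus", default_field="AllTitles")
print(qs)
print(parse_qs(qs))  # decoded view of the same parameters
```

If the result count with df=AllTitles matches the count for q=AllTitles:"wuhan coronavirus", the missing copyField into the default field is the likely cause.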

Solr Spellcheck for Multi Word Phrases

I have a problem with solr spellcheck suggestions for multi word phrases. With the query for 'red chillies'
q=red+chillies&wt=xml&indent=true&spellcheck=true&spellcheck.extendedResults=true&spellcheck.collate=true
I get
<lst name="suggestions">
<lst name="chillies">
<int name="numFound">2</int>
<int name="startOffset">4</int>
<int name="endOffset">12</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst><str name="word">chiller</str><int name="freq">4</int></lst>
<lst><str name="word">challis</str><int name="freq">2</int></lst>
</arr>
</lst>
<bool name="correctlySpelled">false</bool>
<str name="collation">red chiller</str>
</lst>
The problem is that even though 'chiller' has 4 results in the index, 'red chiller' has none, so we end up suggesting a phrase with 0 results.
What can I do to make spellcheck work on the whole phrase only? I tried using KeywordTokenizerFactory in the query analyzer:
<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
And I also tried adding
<str name="sp.query.extendedResults">false</str>
within
<lst name="spellchecker">
in solrconfig.xml.
But neither seems to make a difference.
What would be the best way to make spellcheck only return collations that have results for the whole phrase? Thanks!
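The real issue here is that you need to specify spellcheck.collateParam.q.op=AND and also (optionally) spellcheck.collateParam.mm=100%.
These parameters override q.op and mm for the test queries Solr runs to verify a collation, so a collation only passes when every term matches. Collations are only tested against the index when spellcheck.maxCollationTries is greater than 0.
You can read more about this in the Solr documentation on spell checking.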
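Put together, the request parameters could look like the following Python sketch. It is an illustration under assumptions, not the answerer's exact request: the helper name and the value of 5 for spellcheck.maxCollationTries are made up, and the snippet only builds the query string.

```python
from urllib.parse import urlencode


def spellcheck_params(query, max_tries=5):
    """Build spellcheck parameters that only accept fully matching collations."""
    params = [
        ("q", query),
        ("spellcheck", "true"),
        ("spellcheck.collate", "true"),
        # Collations are only verified against the index when this is > 0.
        ("spellcheck.maxCollationTries", max_tries),
        # Overrides applied to the collation test queries only.
        ("spellcheck.collateParam.q.op", "AND"),
        ("spellcheck.collateParam.mm", "100%"),
    ]
    return urlencode(params)


print(spellcheck_params("red chillies"))
```

With these settings a candidate such as 'red chiller' should be dropped if no document matches both terms.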

Solr russian spellcheck

I am using Solr spellcheck for the Russian language. When you type with Cyrillic characters everything works, but it does not work when you type with Latin characters.
I want spellcheck to correct the input both when typing with Cyrillic characters and when typing with Latin characters, and in both cases to correct to Cyrillic text.
For example, when you type:
телевидениеее or televidenieee
It should correct to:
телевидение
schema.xml:
<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<filter class="solr.LengthFilterFactory" min="3" max="256" />
</analyzer>
</fieldType>
solrconfig.xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spellcheck</str>
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="buildOnCommit">true</str>
<str name="buildOnOptimize">true</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="accuracy">0.75</str>
</lst>
<lst name="spellchecker">
<str name="name">wordbreak</str>
<str name="field">spellcheck</str>
<str name="classname">solr.WordBreakSolrSpellChecker</str>
<str name="combineWords">false</str>
<str name="breakWords">true</str>
<int name="maxChanges">1</int>
</lst>
</searchComponent>
Thanks for help
It can be achieved with ICUTransformFilterFactory, which will (un)transliterate the input query each time.
Here is an example of how to enable this functionality:
Enable the ICU analyzers (lucene-analyzers-icu-*.jar, icu4j-*.jar):
Those libraries can be found in the contrib/analysis-extras folder of the Solr distribution from the official site (they are also available via Maven).
In solrconfig.xml add something like the following to load them (you can also use a single lib dir with all the jars you need; this example uses the default locations relative to the example/solr/collection1/conf folder of the official distribution):
<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
Split the spell_text analyzer into two separate analyzers, one for index and one for query.
Add solr.ICUTransformFilterFactory to the query analyzer with the following id: Any-Cyrillic; NFD; [^\p{Alnum}] Remove
<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<filter class="solr.LengthFilterFactory" min="3" max="256" />
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<filter class="solr.LengthFilterFactory" min="3" max="256" />
<filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
</analyzer>
</fieldType>
For background on the ICUTransformFilterFactory id (Any-Cyrillic; NFD; [^\p{Alnum}] Remove), see the related Stack Overflow question and the official guide.
The configuration described above works on my local machine in the same way for Russian transliterations and Russian words.

Boost result by specified search term on top

I'm using Apache Solr 3.1 with Drupal.
How can I boost results that match the exact term entered in the search field to the top?
For example, if the user enters continuing, Solr shows the document containing Continuity first and the one containing continuing below it. I want the document with continuing to rank above the one with Continuity.
http://localhost:8983/solr/select/?q=continuing&qf=title&fl=title%20score&bq=title:continuing^10.0
It seems you have a stemmer in the filter chain, due to which continuing and Continuity are mapped to the same root and treated as equal.
You may want to check which stemmer you are using and pick one that fits your needs. The default Porter stemmer is very aggressive, and you may want a less aggressive option.
Solr does not currently boost the exact match higher than other terms that reduce to the same root.
One option would be to have two fields in your schema:
a stemmed version (title) and a non-stemmed version (title_non_stemmed, without the stemming filter).
Example:
schema.xml:
<!-- Without Porter Stemmer -->
<fieldType name="text_non_stemmed" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<!-- With Porter Stemmer -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<field name="title" type="text" indexed="true" stored="true" termVectors="false" omitNorms="false"/>
<field name="title_non_stemmed" type="text_non_stemmed" indexed="true" stored="true" termVectors="false" omitNorms="false"/>
<copyField source="title" dest="title_non_stemmed"/>
You can then weight the fields.
solrconfig.xml (modify the default request handler):
<requestHandler name="search" class="solr.SearchHandler" default="true">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="defType">dismax</str>
<str name="qf">
title_non_stemmed^1 title^0.8
</str>
<str name="q.alt">*:*</str>
<str name="rows">10</str>
<str name="fl">*,score</str>
</lst>
</requestHandler>
This way an exact match scores higher than a match on the stemmed form only and appears higher in the results.
URL:
http://localhost:8983/solr/select/?q=continuing
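If you prefer not to change the handler defaults, the same field weights can be passed per request. The following Python sketch is an illustration only: the helper name is hypothetical, the field names and boosts are taken from the configuration above, and the snippet builds the query string without contacting Solr.

```python
from urllib.parse import urlencode, parse_qs


def dismax_query(term, weights):
    """Build dismax parameters with per-field boosts, e.g. title^0.8."""
    qf = " ".join(f"{field}^{boost}" for field, boost in weights.items())
    return urlencode({
        "q": term,
        "defType": "dismax",
        "qf": qf,
        "fl": "title,score",
    })


qs = dismax_query("continuing", {"title_non_stemmed": 1, "title": 0.8})
print(qs)
print(parse_qs(qs)["qf"])  # decoded qf value
```

Requesting the score in fl makes it easy to check that the document with the exact term ranks above the stemmed-only match.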
