Need exact matches in solr search results - solr

I am running into a small problem with exact search (a query enclosed in double quotes).
{
"responseHeader": {
"status": 0,
"QTime": 1,
"params": {
"q": "\"sale\"",
"indent": "true",
"fl": "displayValue, categoryName, approved, averageRating, lastOneWeekCount, connectorName, score",
"wt": "json",
"_": "1579279511471"
}
},
"response": {
"numFound": 918,
"start": 0,
"maxScore": 11.044312,
"docs": [
{
"displayValue": "Net Sales Vs Contribution Margin",
"categoryName": "Sales Analytics (B07)",
"connectorName": "New BOBJ",
"lastOneWeekCount": 3,
"approved": "yes",
"averageRating": 4,
"score": 11.044312
},
The "sale" query above matches the term "Sales" in the indexed data, which is not an exact match. This happens because of the EdgeNGramFilterFactory in the text field type I defined (which uses the whitespace tokenizer).
I have incrementally resolved other search issues with the current select request handler, and now I need to solve this exact-match problem. Here are my solrconfig details.
<lst name="defaults">
<str name="exact">false</str>
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="defType">edismax</str>
<str name="qf">
displayValue^20 description^5 connectorName_txt zenDescription_txt^5 zenBusinessOwner_txt^2
categoryName^8 reportOwner^2 reportDetailsNameColumn^5
</str>
<str name="pf2">
displayValue^20 description^5 connectorName_txt zenDescription_txt^5 zenBusinessOwner_txt^2
categoryName^8 reportOwner^2 reportDetailsNameColumn^5
</str>
<str name="pf3">
displayValue^20 description^5 connectorName_txt zenDescription_txt^5 zenBusinessOwner_txt^2
categoryName^8 reportOwner^2 reportDetailsNameColumn^5
</str>
<str name="tie">1</str>
<str name="mm">100%</str>
<int name="ps2">3</int>
<int name="ps3">9</int>
<int name="qs">0</int>
<str name="df">text</str>
<str name="q.alt">*:*</str>
<str name="sort">score desc, averageRating desc, lastOneWeekCount desc</str>
<str name="bq">
query({!boost b=20}approved:"yes")
</str>
</lst>
<lst name="appends">
<str name="fq">{!switch case.false='*:*' case.true='text_ex:$q' v=$exact}</str>
</lst>
</requestHandler>
In the config above, I tried to solve the exact search problem by adding a switch query parser as an extra filter query (an approach I found online). Ideally, exact search would apply whenever the user's query is wrapped in double quotes; as a first step, I wanted to enable it when the user passes exact=true. But I am stuck, because I am not getting any results.
Can someone please help?
P.S. I am attaching the schema definition as well. Please check.
<fieldType name="text_ws" class="solr.TextField" omitNorms="false">
<analyzer type="index" omitTermFreqAndPositions="false">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
<fieldType name="text_exact" class="solr.TextField" omitNorms="false">
<analyzer type="index" omitTermFreqAndPositions="false">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0"
catenateWords="0" catenateNumbers="0" preserveOriginal="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII.txt"/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0"
catenateWords="0" catenateNumbers="0" preserveOriginal="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>

Using double quotes does not mean exact. It only lets you make phrase queries, where the terms have to appear one after another. Solr (Lucene) searches against the tokens you've generated.
Use a field with a definition that does not change the tokens (i.e. no ngrams, no stemming, etc.). If you want to match the whole field exactly (but case insensitively), use a KeywordTokenizer with a LowerCaseFilter. If you want case-sensitive, exact hits for the whole field, use a string field.
If you want exact matches against each term, use a tokenizer with the behavior you're after, and pick filters to normalize case (i.e. to make it case insensitive) or not. You then decide which field to query based on whether the user is asking for an exact search or not.
You're going to have to determine how "foo" bar should behave and how "foo bar" baz should behave as well.
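To see why a quoted "sale" still hits "Sales", it can help to simulate the two analysis chains outside Solr. The following is only a plain-Python approximation, not Solr code: it mimics the index-time chain of text_ws (whitespace split, lowercase, edge n-grams of 1 to 15 characters) next to a keyword-plus-lowercase chain. All function names are made up for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return the leading substrings of a token, like an edge n-gram filter."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_text_ws(value):
    """Approximate the text_ws index analyzer: whitespace split, lowercase, edge n-grams."""
    terms = set()
    for token in value.split():
        terms.update(edge_ngrams(token.lower()))
    return terms


def index_keyword_lowercase(value):
    """Approximate KeywordTokenizer + LowerCaseFilter: the whole value is a single term."""
    return {value.lower()}


def query_terms(query):
    """Approximate the text_ws query analyzer: whitespace split and lowercase only."""
    return [token.lower() for token in query.split()]


doc = "Net Sales Vs Contribution Margin"

# "sale" is a prefix of "sales", so it exists in the index as an edge n-gram.
ngram_hit = all(term in index_text_ws(doc) for term in query_terms("sale"))

# The keyword-style field only holds the whole lowercased value.
exact_hit = "sale" in index_keyword_lowercase(doc)
full_hit = doc.lower() in index_keyword_lowercase(doc)

print(ngram_hit, exact_hit, full_hit)
```

The n-gram field matches because the index itself contains "sale" as a term; quoting the query does not change which terms were indexed.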

Related

Why solr keywordtokenizerfactory field query response is taking so long

I have defined the field type in Solr's conf\managed-schema as shown below. My core is very large: ~30 million documents, ~20 GB of indexed data, ~12 fields, and the JVM memory allocation is -Xms10g -Xmx10g.
<field name="Address" type="text_keyword" default="" multiValued="false" indexed="true" stored="true"/>
<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And a /update request handler in conf\solrconfig.xml:
<requestHandler name="/update" class="solr.UpdateRequestHandler" >
<lst name="defaults">
<str name="update.chain">dedupe</str>
</lst>
The update chain it refers to, also in conf\solrconfig.xml:
<updateRequestProcessorChain name="dedupe">
<processor class="solr.processor.SignatureUpdateProcessorFactory">
<bool name="enabled">true</bool>
<str name="signatureField">id</str>
<str name="fields">Address,Field1,Field2,Field3,Field4,Field5,Field6,Field7,Field8,Field9,Field10,Field10</str>
<str name="signatureClass">solr.processor.Lookup3Signature</str>
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
Here is my search field query example:
http://localhost:8983/solr/mycore/select?indent=true&q.op=OR&q=*:*&fq=((Address:(*premier*)) OR (Field1:(*premier*)) OR (Field2:(*premier*)) OR (Field3:(*premier*)) OR (Field4:(*premier*)) OR (Field5:(*premier*)) OR (Field6:(*premier*)) OR (Field7:(*premier*)) OR (Field8:(*premier*)) OR (Field9:(*premier*)) OR (Field10:(*premier*)) OR (Field11:(*premier*)))
AND ((Address:(*1\:34*)) OR (Field1:(*1\:34*)) OR (Field2:(*1\:34*)) OR (Field3:(*1\:34*)) OR (Field4:(*1\:34*)) OR (Field5:(*1\:34*)) OR (Field6:(*1\:34*)) OR (Field7:(*1\:34*)) OR (Field8:(*1\:34*)) OR (Field9:(*1\:34*)) OR (Field10:(*1\:34*)) OR (Field11:(*1\:34*)))
&rows=10&start=0&wt=json
The special characters are escaped with \ and I am trying to filter the records the way a LIKE clause does in an RDBMS. The above query works fine, but the response takes very long. What can I do to speed it up?
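For reference, a filter like the one above could be assembled programmatically along these lines. This is a sketch only: it reproduces the shape of the query, does nothing about its speed, and the helper names are hypothetical. The set of escaped characters follows the special characters of the standard Lucene query syntax; whitespace inside a term is not handled.

```python
from urllib.parse import urlencode

# Characters with special meaning in the standard Lucene/Solr query syntax.
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')


def escape_term(term):
    """Backslash-escape query syntax characters in a user-supplied term."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)


def contains_clause(fields, term):
    """Build one OR group that looks for *term* in any of the given fields."""
    pattern = "*" + escape_term(term) + "*"
    return "(" + " OR ".join(f"({field}:({pattern}))" for field in fields) + ")"


def build_fq(fields, terms):
    """AND together one OR group per search term."""
    return " AND ".join(contains_clause(fields, term) for term in terms)


fields = ["Address"] + [f"Field{i}" for i in range(1, 12)]
fq = build_fq(fields, ["premier", "1:34"])

# URL-encode the parameters instead of pasting the raw filter into the URL.
query_string = urlencode(
    {"q": "*:*", "q.op": "OR", "fq": fq, "rows": 10, "start": 0, "wt": "json"}
)
print(query_string)
```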

Is Solr SuggestComponent able to return shingles instead of whole field values?

I use Solr 5.0.0 and want to build an autocomplete feature that generates suggestions from the word n-grams (shingles) of my documents.
The problem is that a suggest query only returns complete values of the search field, which can be extremely long.
CURRENT PROBLEM:
Input:"so"
Suggestions:
"......extremly long text son long text continuing......"
"......next long text solar next text continuing......"
GOAL:
Input: "so"
Suggestions with shingles:
"son"
"solar"
"solar test"
etc
<searchComponent name="suggest" class="solr.SuggestComponent"
enable="${solr.suggester.enabled:true}" >
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">title_and_description_suggest</str>
<str name="weightField">price</str>
<str name="suggestAnalyzerFieldType">autocomplete</str>
<str name="queryAnalyzerFieldType">autocomplete</str>
<str name="buildOnCommit">true</str>
</lst>
schema.xml:
<fieldType name="autocomplete" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true" outputUnigramsIfNoShingles="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
I want to return at most 3 words as an autocomplete term. Is this possible with the SuggestComponent, or how would you do it? No matter what I try, I always receive the complete field value of matching documents.
Is that expected behaviour, or did I do something wrong?
Many thanks in advance
In schema.xml define fieldType as follows:
<fieldType name="text_autocomplete" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="5"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
In schema.xml define your field as follows:
<field name="example_field" type="text_autocomplete" indexed="true" stored="true"/>
Write your query as follows:
query?q=*&
rows=0&
facet=true&
facet.field=example_field&
facet.limit=-1&
wt=json&
indent=true&
facet.prefix=so
In the facet.prefix parameter, specify the term for which you want suggestions ('so' in this example). If you need fewer than 5 words in a suggestion, reduce maxShingleSize in the fieldType definition accordingly. To get the results in decreasing order of their frequency of occurrence, sort the facet by count (facet.sort=count); note that with facet.limit=-1, as in the query above, Solr's default facet order is index order rather than count.
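To make the idea concrete, here is a rough plain-Python imitation of what the shingled field plus a prefix facet does. It is not Solr code: the sample documents and function names are invented, it assumes the shingle filter also emits single words (its default), and it sorts by document count the way a count-sorted facet would.

```python
from collections import Counter


def shingles(tokens, min_size=2, max_size=5, output_unigrams=True):
    """Return word n-grams of min_size..max_size words, plus single words if requested."""
    out = list(tokens) if output_unigrams else []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(" ".join(tokens[start:start + size]))
    return out


def index_terms(value):
    """Approximate the index analyzer: whitespace split, lowercase, shingles."""
    return shingles(value.lower().split())


def prefix_facet(docs, prefix):
    """Count, per indexed term starting with prefix, how many documents contain it."""
    counts = Counter()
    for value in docs:
        # A facet counts documents, so each term counts once per document.
        for term in set(index_terms(value)):
            if term.startswith(prefix):
                counts[term] += 1
    # Highest count first, ties broken alphabetically for a stable result.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


docs = [
    "my son likes solar test kits",
    "solar test results",
    "the solar panel",
]

for term, count in prefix_facet(docs, "so"):
    print(count, term)
```

Short suggestions such as "solar" and "solar test" surface first because they occur in the most documents; the long tail of 4- and 5-word shingles is what a smaller maxShingleSize would cut off.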

Solr Spellcheck for Multi Word Phrases

I have a problem with solr spellcheck suggestions for multi word phrases. With the query for 'red chillies'
q=red+chillies&wt=xml&indent=true&spellcheck=true&spellcheck.extendedResults=true&spellcheck.collate=true
I get
<lst name="suggestions">
<lst name="chillies">
<int name="numFound">2</int>
<int name="startOffset">4</int>
<int name="endOffset">12</int>
<int name="origFreq">0</int>
<arr name="suggestion">
<lst><str name="word">chiller</str><int name="freq">4</int></lst>
<lst><str name="word">challis</str><int name="freq">2</int></lst>
</arr>
</lst>
<bool name="correctlySpelled">false</bool>
<str name="collation">red chiller</str>
</lst>
The problem is, even though 'chiller' has 4 results in the index, 'red chiller' has none. So we end up suggesting a phrase with 0 results.
What can I do to make spellcheck work on the whole phrase only? I tried using KeywordTokenizerFactory in query:
<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
And I also tried adding
<str name="sp.query.extendedResults">false</str>
within
<lst name="spellchecker">
in solrconfig.xml.
But neither seems to make a difference.
What would be the best way to make spellcheck only give collations that have results for the whole phrase? Thanks!
The real issue here is that you need to specify spellcheck.collateParam.q.op=AND and also (optionally) spellcheck.collateParam.mm=100%.
These parameters are applied to the queries Solr runs to test each collation, so a collation only counts as having hits when all of its terms match.
You can read more about this in the Solr docs.
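As a sketch, the request from the question with these parameters added could be built like this. The helper name is hypothetical. One assumption that is not part of the answer above: as far as I know, Solr only tests collations against the index when spellcheck.maxCollationTries is greater than 0, so the example sets it explicitly; check this against the documentation for your Solr version.

```python
from urllib.parse import parse_qs, urlencode


def spellcheck_params(user_query, max_tries=10):
    """Request parameters asking for collations that match on all terms."""
    return {
        "q": user_query,
        "wt": "xml",
        "indent": "true",
        "spellcheck": "true",
        "spellcheck.extendedResults": "true",
        "spellcheck.collate": "true",
        # Assumption: collation testing is only active when this is > 0.
        "spellcheck.maxCollationTries": str(max_tries),
        # Applied only to the internal queries that test each collation.
        "spellcheck.collateParam.q.op": "AND",
        "spellcheck.collateParam.mm": "100%",
    }


# urlencode takes care of the space in the query and the % in "100%".
query_string = urlencode(spellcheck_params("red chillies"))
print(query_string)
```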

Solr russian spellcheck

I am using Solr spellcheck for the Russian language. When you type with Cyrillic characters, everything is OK, but it doesn't work when you type with Latin characters.
I want spellcheck to correct input whether it is typed with Cyrillic or with Latin characters, and to always correct it to text in Cyrillic characters.
For example, when you type:
телевидениеее or televidenieee
It should correct to:
телевидение
schema.xml:
<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<filter class="solr.LengthFilterFactory" min="3" max="256" />
</analyzer>
</fieldType>
solrconfig.xml
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">spellcheck</str>
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="buildOnCommit">true</str>
<str name="buildOnOptimize">true</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="accuracy">0.75</str>
</lst>
<lst name="spellchecker">
<str name="name">wordbreak</str>
<str name="field">spellcheck</str>
<str name="classname">solr.WordBreakSolrSpellChecker</str>
<str name="combineWords">false</str>
<str name="breakWords">true</str>
<int name="maxChanges">1</int>
</lst>
</searchComponent>
Thanks for help
It can be achieved with ICUTransformFilterFactory, which will (un)transliterate the input query each time.
Here is an example of how one can enable this functionality:
Enable the icu4j analyzers (lucene-analyzers-icu-*.jar, icu4j-*.jar):
Those libraries can be found in the contrib/analysis-extras folder of the Solr distribution from the official site (they are also available via Maven).
In solrconfig.xml, add something like the following to enable them (there can be a single lib dir with all the jars that you need; this example just uses the default location relative to the example/solr/collection1/conf folder of the official distribution):
<lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
<lib dir="../../../contrib/analysis-extras/lucene-libs" regex=".*\.jar" />
Split spell_text field analyzers into two separate list for index and query.
Add solr.ICUTransformFilterFactory as query analyzer with the following id Any-Cyrillic; NFD; [^\p{Alnum}] Remove:
<fieldType name="spell_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<filter class="solr.LengthFilterFactory" min="3" max="256" />
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[,.;:]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement=""/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="true"/>
<filter class="solr.LengthFilterFactory" min="3" max="256" />
<filter class="solr.ICUTransformFilterFactory" id="Any-Cyrillic; NFD; [^\p{Alnum}] Remove" />
</analyzer>
</fieldType>
Regarding the ICUTransformFilterFactory id - Any-Cyrillic; NFD; [^\p{Alnum}] Remove:
Related stackoverflow question
Official guide
The configuration described above is working on my local machine the same way for russian transliterations and russian words

XML response does not contain the text surrounding the word that I searched for

I have indexed a PDF in Solr, and when I query for the text BOEHRINGER, my XML response is as follows:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="q">text:BOEHRINGER</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="author">cjessen</str>
<arr name="content_type">
<str>application/pdf</str>
</arr>
<str name="id">2</str>
<date name="last_modified">2012-05-07T17:09:32Z</date>
</doc>
</result>
</response>
How do I get the contents as well as the file name returned as part of the XML response? What field should be added to schema.xml so that the XML response shows the text from the PDF surrounding the word that I searched for (BOEHRINGER)?
Check for the field mapping attributes.
The content of the file is usually mapped to the text field, which is not stored by default.
Check the ExtractingRequestHandler configuration: by default the file contents are mapped with fmap.content=text, which can be overridden.
If you just want to see the content with the query terms highlighted, you can use Solr's highlighting feature.
For the title of the document, you would either need to pass the title when you index the document or there should be an inbuilt file name field provided by Tika as a metadata field which you can use.
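A highlighting request for this case might look roughly like the sketch below. The helper name and the default values are made up for illustration; hl, hl.fl, hl.snippets and hl.fragsize are standard Solr highlighting parameters. It assumes the extracted content lives in a stored field named text, since highlighting needs the stored value to build snippets from.

```python
from urllib.parse import urlencode


def highlight_params(term, field="text", snippets=3, fragsize=100):
    """Parameters asking Solr for snippets of `field` around the matched term."""
    return {
        "q": f"{field}:{term}",
        "fl": "id,author",
        "wt": "xml",
        "hl": "true",                  # switch highlighting on
        "hl.fl": field,                # field to take snippets from (must be stored)
        "hl.snippets": str(snippets),  # maximum number of snippets per document
        "hl.fragsize": str(fragsize),  # approximate snippet length in characters
    }


params = highlight_params("BOEHRINGER")
query_string = urlencode(params)
print(query_string)
```

The snippets come back in a separate highlighting section of the response, keyed by document id, rather than inside each doc element.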
<!-- Solr Cell Update Request Handler
http://wiki.apache.org/solr/ExtractingRequestHandler
-->
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<!-- All the main content goes into "text"... if you need to return
the extracted text or do highlighting, use a stored field. -->
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
This is my solrconfig.xml file. All the fields in the schema.xml file have indexed and stored set to true. I am still trying to get the matching text, together with the words around it, as part of my response. If sanjay was searched, then I want part of my response to be "Sanjay is 6 ft tall" and also "sanjay is a good boy", assuming both sentences exist in the file that was indexed.
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="1"
catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"
protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" catenateWords="0"
catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldtype>
And the field is <field name="text" type="text_general" indexed="true" stored="true" multiValued="true"/>
