Multi Phrase Query does not find my documents

As part of an upgrade plan from version 6.2.1, I'm setting up a new Solr (7.6.0).
Surprisingly, one of our simple tests failed - inserting a document with some text and then trying to search for it.
The text that was inserted is:
I will think about it.
Request handler configuration:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">20</int>
<str name="df">text_en</str>
<str name="hl.fragsize">100000</str>
<str name="hl.maxAnalyzedChars">100000</str>
</lst>
</requestHandler>
This is how the field is configured:
<field name="text_en" type="text_en" indexed="true" stored="true" multiValued="true" />
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
</analyzer>
</fieldType>
Both "will" and "it" appear in the stopwords_en.txt file.
According to the Analysis screen in the Admin Web App, these are the final tokens that are inserted into the index:
text:     i  i_will  will_think  think  about  about_it
position: 1  1       2           3      4      4
Searching for text_en:"I will think about it" doesn't find the document.
The strange thing is that the query "I will think think about it" does work...
Using debugQuery, I noticed a difference compared to our current version.
6.2.1 is using MultiPhraseQuery
7.6.0 is using SpanNearQuery
In version 6.2.1:
"rawquerystring":"text_en:\"I will think about it\"",
"querystring":"text_en:\"I will think about it\"",
"parsedquery":"MultiPhraseQuery(text_en:\"(i i_will) will_think think (about about_it)\")", ...
In 7.6.0 (btw, also in 7.5.0):
"rawquerystring":"text_en:\"I will think about it\"",
"querystring":"text_en:\"I will think about it\"",
"parsedquery":"SpanNearQuery(spanNear([spanOr([spanNear([text_en:i, text_en:will_think], 0, true), spanNear([text_en:i_will, text_en:think], 0, true)]), text_en:about_it], 0, true))", ...

I've found the culprit.
Sharing it to support future googlers.
There was a mistake in the field configuration in the schema.xml file.
In the "query" analyzer, "CommonGramsFilterFactory" should have been "CommonGramsQueryFilterFactory".
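For reference, here is a sketch of what the corrected "query" analyzer might look like. It assumes the rest of the filter chain stays exactly as posted in the question; only the CommonGrams filter class changes.

```xml
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Query-side variant: returns single words only when they are not part of a bigram -->
  <filter class="solr.CommonGramsQueryFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
  <filter class="solr.EnglishPossessiveFilterFactory"/>
  <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
</analyzer>
```

The "index" analyzer keeps CommonGramsFilterFactory. After reloading the core, re-running the phrase through the Analysis screen and debugQuery should show whether the query tokens now line up with the indexed ones.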

autocomplete with ngrams generates duplicates

I am writing an autocomplete feature in Solr. Ideally, autocomplete would:
display suggestions if the target occurs in any of the words, but prefer an exact match over a KeywordTokenizerFactory edge n-gram match, and a KeywordTokenizerFactory edge n-gram match over a StandardTokenizer (or UAX29URLEmailTokenizerFactory) edge n-gram match;
serve the document along with the suggestion;
show unique suggestions only.
This is my attempt at autocompleting:
<field name="category" type="string" indexed="true" stored="true" docValues="true"/>
<field name="categoryAutocompleteExactEdge" type="autocomplete_exact_edge" indexed="true" stored="false"/>
<field name="categoryAutocompleteTermsEdge" type="autocomplete_terms_edge" indexed="true" stored="false"/>
<copyField source="category" dest="categoryAutocompleteExactEdge"/>
<copyField source="category" dest="categoryAutocompleteTermsEdge"/>
<fieldType name="autocomplete_exact_edge" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="autocomplete_terms_edge" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="30" minGramSize="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.UAX29URLEmailTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
</analyzer>
</fieldType>
handler:
<requestHandler name="/suggest_category" class="org.apache.solr.handler.component.SearchHandler">
<lst name="defaults">
<str name="wt">json</str>
<str name="defType">edismax</str>
<str name="rows">5</str>
<str name="fl">category</str>
<str name="qf">category^30 categoryAutocompleteExactEdge^10 categoryAutocompleteTermsEdge</str>
</lst>
</requestHandler>
I think the above handles the ordering of suggestions in accordance with the first requirement. It also lets you fetch the document data along with the suggestion by changing fl. The problem I have is duplicated suggestions.
If there are many documents with category:"GASTROENTEROLOGIST", then it is possible that category: "GASTRO APPOINTMENT" is never served. If faceting is enabled and rows is set to 0, the qf ordering is lost.
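For concreteness, the faceting variant mentioned above might be configured roughly as follows. This is only a sketch: the handler name /suggest_category_facet is hypothetical, and the qf is reused from the handler above. Each category comes back once, but facet values are ordered by count (or by index order), not by the qf boosts, which is why the ranking is lost.

```xml
<!-- Hypothetical handler name; not part of the original configuration -->
<requestHandler name="/suggest_category_facet" class="org.apache.solr.handler.component.SearchHandler">
  <lst name="defaults">
    <str name="wt">json</str>
    <str name="defType">edismax</str>
    <!-- rows=0: no documents are returned, only facet counts -->
    <str name="rows">0</str>
    <str name="qf">category^30 categoryAutocompleteExactEdge^10 categoryAutocompleteTermsEdge</str>
    <str name="facet">true</str>
    <str name="facet.field">category</str>
    <str name="facet.mincount">1</str>
    <str name="facet.limit">5</str>
  </lst>
</requestHandler>
```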
I am looking for an all-in-one solution, but it appears to me that serving unique suggestions and also displaying document data are mutually exclusive. For example, if I move the categories to a new core, then the suggestion duplication problem vanishes, because I can force uniqueness. But lookups to the new core can't display additional document info.
This is my first time creating autocomplete functionality and I am not exactly sure how to tackle it. It would be really helpful if someone experienced could explain the best strategies for handling autocompletion. Is creating a new core for every field with autosuggestion the way to go?

Extracting date from TextField and sorting with it in Solr

I have a schema like this in Solr:
<?xml version="1.0" encoding="UTF-8"?>
<schema name="dovecot-fts" version="3.0">
<fieldType name="ytext" class="solr.TextField" autoGeneratePhraseQueries="true" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="20"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterGraphFilterFactory" catenateNumbers="1" generateNumberParts="1" splitOnCaseChange="1" generateWordParts="1" splitOnNumerics="1" catenateAll="1" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<field name="hdr" type="ytext" indexed="true" stored="true"/>
<uniqueKey>id</uniqueKey>
</schema>
And a record looks like this:
{
"id": "339/9821f61c4fa04b62fa030s002df11e39/user#example.com",
"hdr": "...................Some header information...................Date: Fri, 23 Sep 2022 15:24:43 +0300...................Some other header information..................."
}
Now I want an additional field containing the date from the hdr field so that I can sort results by it. To achieve this, I tried copying the hdr field and manipulating it with PatternTokenizerFactory, but I can't get it to work.
<fieldType name="ts" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="Date:\s(\w{3},\s\d{2}\s\w{3,4}\s\d{4}\s\d{2}:\d{2}:\d{2}\s\+\d{4})" group="1"/>
</analyzer>
</fieldType>
<field name="hdr" type="ytext" indexed="true" stored="true"/>
<field name="received" type="ts" indexed="true" stored="false" required="false"/>
<copyField source="hdr" dest="received"/>
So I'm waiting for your help, thanks.

In your schema, you are creating a copy field named received and copying the content of hdr into it. The received field has a specific text analysis but this doesn't change the value of the field. It only changes the way it is indexed. Moreover, you cannot sort documents using a text field.
To achieve your goal you need to transform the value of hdr and copy it into another field of type StrField or a date type. It is important to use a non-text type if you want to sort using that field. You can use a custom Update Request Processor Chain: https://solr.apache.org/guide/solr/latest/configuration-guide/update-request-processors.html.
You must add your custom update chain to solrconfig.xml. Your best bet is to use the following request processors:
solr.CloneFieldUpdateProcessorFactory
solr.RegexReplaceProcessorFactory
This is an example that you can use as a starting point:
<updateRequestProcessorChain name="extract_hdr_date">
<processor class="solr.CloneFieldUpdateProcessorFactory">
<str name="source">hdr</str>
<str name="dest">hdr_date</str>
</processor>
<processor class="solr.RegexReplaceProcessorFactory">
....
</processor>
<processor class="solr.LogUpdateProcessorFactory" />
<processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
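As an illustration only, one way such a chain could be completed is sketched below. Everything beyond the starting point above is an assumption, not the answerer's configuration: the regular expression (adapted from the pattern in the question), the use of literalReplacement=false so that $1 is treated as a group reference, and the extra solr.ParseDateFieldUpdateProcessorFactory step that turns the extracted string into a real date so that sorting is chronological rather than alphabetical.

```xml
<updateRequestProcessorChain name="extract_hdr_date">
  <!-- Copy the raw header into a working field -->
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">hdr</str>
    <str name="dest">hdr_date</str>
  </processor>
  <!-- Assumption: keep only the value of the "Date:" header.
       (?s) lets "." span line breaks; literalReplacement=false enables the $1 group reference. -->
  <processor class="solr.RegexReplaceProcessorFactory">
    <str name="fieldName">hdr_date</str>
    <str name="pattern">(?s)^.*?Date:\s(\w{3},\s\d{1,2}\s\w{3,4}\s\d{4}\s\d{2}:\d{2}:\d{2}\s[+-]\d{4}).*$</str>
    <str name="replacement">$1</str>
    <bool name="literalReplacement">false</bool>
  </processor>
  <!-- Assumption: parse e.g. "Fri, 23 Sep 2022 15:24:43 +0300" into a date value -->
  <processor class="solr.ParseDateFieldUpdateProcessorFactory">
    <str name="fieldName">hdr_date</str>
    <str name="locale">en_US</str>
    <arr name="format">
      <str>EEE, d MMM yyyy HH:mm:ss Z</str>
    </arr>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```

The chain would be selected by passing update.chain=extract_hdr_date on the update request. Note that if a record has no Date: header, the pattern does not match and the whole header stays in hdr_date, so such records would need separate handling.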
Given that you want to use that field for sorting, you might find it useful to enable docValues on hdr_date: https://solr.apache.org/guide/solr/latest/indexing-guide/docvalues.html
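A minimal sketch of the matching schema entries, assuming hdr_date ends up holding a parsed date. The type name pdate is just a common convention and may already exist in your schema.

```xml
<!-- Assumption: a point-based date type; older schemas may use solr.TrieDateField instead -->
<fieldType name="pdate" class="solr.DatePointField" docValues="true"/>
<!-- docValues="true" keeps sorting on this field efficient -->
<field name="hdr_date" type="pdate" indexed="true" stored="true" docValues="true"/>
```

Results could then be ordered with a request parameter such as sort=hdr_date desc.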

Solr term search not searching all values from multifields value

I have this Solr field:
<field name="AllTitles" type="text_general" indexed="true" stored="false" multiValued="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Examples of the values entered for AllTitles:
AllTitles: [ "Anything", "wuhan coronavirus", "anything" ]
AllTitles: [ "wuhan coronavirus", "anything", "anything" ]
The search only matches on the first value; if the matching term is in any value other than the first, the document is not found.
For example, when I search
q="wuhan coronavirus"
I get 2 results. When I search using field name "AllTitles"
q=AllTitles:"wuhan coronavirus"
I get 7 results correctly.
Can anybody help me identify the issue?

First, in your solrconfig.xml, check which field has been defined as "df". In the example below it is "text".
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">text</str>
</lst>
</requestHandler>
Second, in the schema.xml or managed-schema, whichever you are using, make sure you have copied "AllTitles" to "text". Like this,
<copyField source="AllTitles" dest="text"/>
Before doing all of this, you might as well test it by passing "AllTitles" as the "df" parameter when you query, as raghu777 has mentioned.
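Put together, the schema side of this suggestion might look like the sketch below. It assumes the default field is named text and reuses the text_general type from the question; adjust both to whatever your schema actually defines.

```xml
<!-- Assumption: catch-all default field; multiValued because several source values are copied in -->
<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<!-- Every value of AllTitles is also indexed into the default field -->
<copyField source="AllTitles" dest="text"/>
```

A copyField only takes effect for documents indexed after the change, so existing documents have to be reindexed. The quick test without touching the schema is a request such as q="wuhan coronavirus"&df=AllTitles.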

Migrated from Solr4 to Solr6, field queries return no results

I migrated my site to a completely new instance with new versions of everything (LAMP on Ubuntu). When I do a search on the old site (Solr 4) using a query such as
authortext:fred
I get lots of results. With the new site (Solr 6), I get zero results (note the data is the same). However, if I use a very specific query such as
authortext:Fred Smith FredSmith
Then I get the same results on both Solr 4 and Solr 6. In other words, the Solr 6 implementation only supports exact strings for field searches. Note that this particular field is defined in the schema as "text". The schemas are (mostly) identical in both implementations, with the new version having more stuff in it. I'm running these queries using the Solr admin.
Here's the relevant section of schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"
catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<field name="authortext" type="text" indexed="true" stored="true" />
On the old Solr 4, I get errors if I surround fred with wildcards *fred*, but it works if I escape them \*fred\*. On the new Solr 6, it actually returns all the same results as the bare "fred" on the old Solr 4 site, BUT, it's still case sensitive - have to use *Fred*. So, it seems I could alter my php code to add wildcards to the field string that gets passed to Solr (and clip the 1st char), but that is a horrible hack. I'm presuming that there is some mystery new config that has an unpleasant default value (this phenomenon got me many times setting up the new instance)?
Debug output for Solr6:
"responseHeader":{
"status":0,
"QTime":0,
"params":{
"q":"authortext:fred",
"indent":"on",
"wt":"json",
"debugQuery":"on",
"_":"1507235204701"}},
"response":{"numFound":0,"start":0,"docs":[]
},
"debug":{
"rawquerystring":"authortext:fred",
"querystring":"authortext:fred",
"parsedquery":"authortext:fred",
"parsedquery_toString":"authortext:fred",
"explain":{},
"QParser":"LuceneQParser",
(I omitted the timing section since it's just a bunch of 0.0 sec)
Debug from Solr4:
[Lots of response elements omitted]
<lst name="debug">
<str name="rawquerystring">authortext:fred</str>
<str name="querystring">authortext:fred</str>
<str name="parsedquery">authortext:fred</str>
<str name="parsedquery_toString">authortext:fred</str>
<lst name="explain">
<str name="68665">
4.998859 = (MATCH) fieldWeight(authortext:fred in 45574), product of: 1.4142135 = tf(termFreq(authortext:fred)=2) 8.079376 = idf(docFreq=131, numDocs=110520) 0.4375 = fieldNorm(field=authortext, doc=45574)
</str>
....many more "MATCHES" omitted

solr facet search truncate words

I have a Solr instance configured for French content. Search is fine, but when I activate facet search, words are truncated in a peculiar way.
Every trailing e disappears: for example, automobil instead of automobile, montagn instead of montagne, styl instead of style, homm instead of homme, etc.
<lst name="keywords">
<int name="automobil">1</int>
<int name="citroen">1</int>
<int name="minist">0</int>
<int name="polit">0</int>
<int name="pric">0</int>
<int name="shinawatr">0</int>
<int name="thailand">0</int>
</lst>
Here is the query: q=fulltextfield:champpions&facet=true&facet.field=keywords
The keywords content:
<arr name="keywords">
<str>Ski</str>
<str>sport</str>
<str>Free style</str>
<str>automobile</str>
<str>Rallye</str>
<str>Citroen</str>
<str>montagne</str>
</arr>
Here is the schema used:
<fieldtype name="text_fr" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" />
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
</fieldtype>
The field definition:
If somebody has an idea about this issue...
Thanks for your answer.
regards
Jerome longet

Generally, if you want to use a field as a facet, it should be stored as a string.
You're faceting on a tokenized and filtered field, so the individual values are the processed words in your keywords field.
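A minimal sketch of that approach, assuming you want to keep the analyzed keywords field for searching and add a separate untokenized copy for faceting. The field name keywords_facet is hypothetical.

```xml
<!-- StrField is indexed verbatim: no tokenizing, stemming or lowercasing -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<!-- Hypothetical field name; holds the raw keyword values for faceting -->
<field name="keywords_facet" type="string" indexed="true" stored="false" multiValued="true"/>
<copyField source="keywords" dest="keywords_facet"/>
```

After reindexing, faceting with facet.field=keywords_facet would return whole values such as "automobile" or "Free style" instead of stemmed tokens.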

Everything said above is correct; I just want to add one thing on facets. The facet values are the indexed terms, not the stored ones. One recommendation for facets is to use a string type. This is often a good choice. But sometimes you would like to do some processing on your facet terms. In that case, you can use a text type, but treat the input only lightly. In any case, avoid your choices above of stemming (SnowballPorter) or WordDelimiter.
A good choice to start with is KeywordTokenizerFactory; you could use PatternReplace to clean up your terms and input, and apply a TrimFilter at the end. Don't lowercase if your users are going to see the terms.
An example: my input consists of alphabetic language codes. The first PatternReplace cleans up non-alphabetic characters, and the second corrects an input mistake:
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])"
replacement=""
replace="all" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="fer|xxx"
replacement="und"
replace="all" />
<filter class="solr.LengthFilterFactory" min="3" max="3" />
</analyzer>
Have fun with solr
Oliver