SOLR wildcard search not returning results - solr

I have a schema definition as follows:
<fieldType name="textSuggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<filter class="solr.PatternReplaceFilterFactory" pattern="([,]+)" replacement=" " replace="all"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<filter class="solr.PatternReplaceFilterFactory" pattern="([,]+)" replacement=" " replace="all"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And some data in the format:
17,WALKINGTON,AVENUE,,MARGARET RIVER,WA
If I search for 17 walkington, it shows the above in the results. How can I make sure that if I search for 17 walk, the above shows up in the search results? I have tried appending * at the end of the search query, but can't get it to work. Any suggestions?

To get partial-word matches, add an n-gram filter to the index analyzer.
Factory class: solr.NGramFilterFactory
Its arguments are:
minGramSize: (integer, default 1) The minimum n-gram size, must be > 0.
maxGramSize: (integer, default 2) The maximum n-gram size, must be >= minGramSize.
For example, you can define the field type for your field like this:
<fieldType name="textSuggest" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<filter class="solr.PatternReplaceFilterFactory" pattern="([,]+)" replacement=" " replace="all"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="10"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<filter class="solr.PatternReplaceFilterFactory" pattern="([,]+)" replacement=" " replace="all"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Note: n-grams produce a large number of tokens, so the index can grow large if you have a huge data set.
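If only prefix matches are needed (as with 17 walk matching WALKINGTON), an edge n-gram filter may be a lighter alternative: it indexes only the grams anchored at the start of each token, which should keep the index smaller than full n-grams. The following is a minimal sketch rather than a tested configuration. The field type name textSuggestPrefix and the gram sizes are assumptions chosen for illustration, and the comma-replacement filter from the question is left out for brevity.

```xml
<!-- Hypothetical field type: prefix matching via edge n-grams on the index side only -->
<fieldType name="textSuggestPrefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- A token such as "walkington" is indexed as w, wa, wal, walk, and so on -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gram filter here: the typed prefix is matched as a whole token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

As with any analyzer change on the index side, existing documents need to be reindexed before the new grams become searchable.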

Related

Synonyms and Stop-words in Solr

I am on Solr 7.4. I have two fields in my schema ("title_en" and "body_en") that are of the delivered field type "text_en", which is defined as:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I have a synonyms file which has the following definition
usa,united states of america
If I search a query term "USA" or "United States of America" using the EDisMax query parser, then I see the following:
"parsedquery":"+DisjunctionMaxQuery(((body_en:usa) | (title_en:usa)))"
It seems as if the synonyms are being ignored if they include a stop-word (in this case ‘of’ in the synonym ‘united states of america’).
I would have expected something like this:
"parsedquery":"+DisjunctionMaxQuery(((((+body_en:unit +body_en:state +body_en:america) body_en:usa))
| (((+title_en:unit +title_en:state +title_en:america) title_en:usa))))"
Is there some conflict I'm running into between Stop-words and Synonyms - perhaps based on the order of query parsing operations? Any suggestions as to how to overcome this?

solr search result issue (searching for shirt returning t-shirt)

I'm using Solr 6.3.0 to search products. My problem is that when I search for "mens shirt", the results also include "mens t-shirt". I don't want "mens t-shirt" in the results. What should I do?
Field details are given below.
<field name="product_name" type="text_general" indexed="true" stored="true" />
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
</analyzer>
</fieldType>
Thanks
abhay
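The StandardTokenizer splits on - as well, which is why "t-shirt" produces a shirt token and matches 'shirt'. For this specific case, you could replace StandardTokenizerFactory with a tokenizer that keeps hyphenated words together.
ClassicTokenizerFactory is one candidate, but check its behavior first: according to the Lucene documentation it keeps a hyphenated token whole only when the token contains a number, so t-shirt may still be split. WhitespaceTokenizerFactory splits only on whitespace, so t-shirt stays a single token and will not match shirt. That said, there may be other cases where you will miss StandardTokenizerFactory.
Look at the docs for tokenizers, experiment a bit, and then decide.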
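As a starting point for such experiments, here is a sketch of a field type that uses WhitespaceTokenizerFactory, which splits only on whitespace and so keeps t-shirt as one token. The field type name text_ws is an assumption, and the filter list is copied from the question with the synonym filter left out for brevity. Punctuation attached to a word (a trailing comma, for example) stays in the token with this tokenizer, so further filters may be needed.

```xml
<!-- Hypothetical field type: whitespace-only tokenization keeps "t-shirt" whole -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer without a type attribute is used for both indexing and querying -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis screen in the Solr Admin UI shows the tokens each step produces for a sample value such as "mens t-shirt", which makes it easy to compare tokenizers before reindexing.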

Prevent solr from using specific synonyms in query

I'm trying to get documents containing the word "złoto" (gold).
My query looks like this:
"querystring":"content:złoto"
"parsedquery":"SynonymQuery(Synonym(content:złoto content:złoty))"
"złoty" is a synonym for "złoto" (an inflection, to be more specific), but it is also a synonym for "zł" (the currency). The word "zł" is far more common in the indexed content, so when I try to get docs with "złoto" (gold), I get more results with "zł" (which is not what I'm looking for).
I have the word "zł" in my stopwords file, and my field definition looks like this:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.MorfologikFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.MorfologikFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Is there a way to get Solr to look only for specific synonyms of a given word? For example:
"złoto" => ["złota", "złoty"] but not "zł" (which is a synonym for "złoty")
I'm using solr 6.2.0.

Getting most likely documents of the query using phonetic filter in solr

I am using Solr for spell checking / query correction. I have added solr.PhoneticFilterFactory and solr.NGramFilterFactory to the fieldType to perform spell checking.
It works, but the problem is that I get a large number of documents for a query. I need only the most likely words/documents, in other words the ones nearest to the query.
Snippet of schema.xml:
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="1000" />
<filter class="solr.LowerCaseFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<filter class="solr.TrimFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PhoneticFilterFactory" encoder="DoubleMetaphone" inject="true"/>
</analyzer>
</fieldType>
Example :
For the query "piece", I am getting around 780 for numFound (number of documents). I need to reduce this count to only the most likely documents.

Case sensitive and case insensitive search in solr

I indexed data in Solr using the following field type configuration, with which I can perform only case-insensitive search. E.g., typing text:Abc or abc gives the same result.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.StandardFilterFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
But now my requirement has changed. If I search for Abc, it should return only results matching Abc, not abc, and the reverse should also work.
Is this possible with the current configuration? If not, what configuration should I use?
Please advise.
Remove the LowerCaseFilterFactory from both the index and query analyzers, then reindex. The tokens will no longer be converted to lowercase, so searches become case sensitive and Abc matches only Abc.
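If both behaviors are needed side by side, one common approach is to keep the existing case-insensitive field and add a second, case-sensitive copy of it. Below is a minimal sketch, assuming the source field is named text (as in the text:Abc query in the question). The names text_case_sensitive and text_exact are made up for illustration, and StandardFilterFactory is omitted for brevity.

```xml
<!-- Hypothetical case-sensitive field type: same tokenizer, no LowerCaseFilterFactory -->
<fieldType name="text_case_sensitive" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ClassicTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>

<!-- Hypothetical field that receives a case-preserving copy of "text" at index time -->
<field name="text_exact" type="text_case_sensitive" indexed="true" stored="false"/>
<copyField source="text" dest="text_exact"/>
```

With this layout, a query on text_exact:Abc would match only Abc, while text:abc keeps its case-insensitive behavior. The documents have to be reindexed for the copied field to be populated.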
