Search for partial words using Solr - solr

I'm trying to search for a partial word using Solr, but I can't get it to work.
I'm using this in my schema.xml file.
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" splitOnNumerics="1" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
</analyzer>
</fieldType>
Searching for die h won't work, but die hard returns some results.
I've reindexed the database after the above configuration was added.
Here is the url and output when searching for die hard. The debugger is turned on.
Here is the url and output when searching for die h. The debugger is turned on.
I'm using Solr 3.3. Here is the rest of the schema.xml file.

The query you've shared is searching the "title_text" field, but the schema you posted above defines the "text" field. Assuming this was just an oversight, and the title_text field is defined as in your post, I think a probable issue is that the NGramTokenizer is configured with minGramSize="3", and you are expecting to match using a single-character token.
You could try changing minGramSize to 1, but this will inevitably lead to some very inefficient indexes; and I wonder whether you really are keen on having "e" match every movie with an e in the title?

Related

Ignore special characters

I have the following field within my SOLR configure:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Within the field I could be storing:
Spiderman, Spider-man, Spider man
What I would like is for someone who searches for spiderman to get all 3 options and ideally someone who searches spider-man to get all 3 options. Apart from amending the content when it is indexed is there another way to effectively ignore special characters but not necessarily split on them?
One of the possible solutions, especially if the number of delimeter character is small is to replace them via solr.PatternReplaceFilterFactory like this:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If keyword tokenizer is a bad option, since it will preserve one token (which could be okay for a field like title), you could either create your own tokenizer, which will split title only on needed symbols or add additional filters like ngram to allow partial match on the title field.
I know this is an old post, but the correct answer here is you should add "Spiderman, Spider-man, Spider man" to your synonyms.txt file and restart solr. If this still doesn't work, make sure your schema uses the SynonymGraphFilterFactory analyzer. What you've described here is synonyms.

How can I have a one-way synonym in Solr?

I am trying to implement one way synonym or one way thesaurus(as in Endeca) in Solr. Where I search for camcorder I get result for camera also but not vice versa. I tried adding following in Synonyms.txt but seems to be not working as it is giving weird results:
camcorder => camera
And my schema.xml is:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
I think what you were looking for is:
camcorder => camera, camcorder
If you don't include camcorder on the right side, camcorder won't return any results if you search for "camcorder".
Since you're only expanding synonyms when you're indexing (where you have the SynonymFilter defined), camcorder will be changed to camera for each document on the way in. When you don't have the same expansion taking place when querying, Solr will still search for camcorder (as there is no SynonymFilter defined for the query analysis chain). There is no camcorder token in the index, so there will be no hit.
You'll have to expand synonyms when querying as well as when indexing to achieve what you want with one-way synonyms.

Some characters breaks phrase search in text field

I have a text field, which contains titles of tv-series or movies. In several cases I want to perform a phrase query on what I'd say a pretty normal text field. This works fine for most phrase terms, but in some reproducable cases it doesn't, but simply returns nothing. It seems to be related to some "special" characters, but not all special characters I'd assume are affected.
Title:("Mission: Impossible") works
Title:("Disney A.N.T.") doesn't work
Title:("Stephen King's Shining") doesn't work
Title:("Irgendwie L. A.") works
After trying several other titles I'd assume, that it is somehow related to dot . and apostroph ' and maybe other I don't know yet. I have no idea, where to look know
relevant schema.xml
<fieldType name="title" class="solr.TextField" sortMissingLast="true"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
generateWordParts="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="0" catenateAll="0" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Your Question is about phrases on a field where the analyzer of type "index" contains a solr.WordDelimiterFilterFactory but in type "query" it does not.
MatsLindh told us, the first step is to open the analysis screen.
In this case the position value is important.
With your attributes in solr.WordDelimiterFilterFactory the token "King's" is converted to "king's" "king" "kings" "s" and the last "s" is on !second! position.
This does not explain
solr.StandardTokenizerFactory
So if you are search for the phrase "Stephen King's Shining" without solr.WordDelimiterFilterFactory the token "Shining" is on position three but if you are indexing with solr.WordDelimiterFilterFactory the token "Shining" is on position four, so only "Stephen King's Shining"~2 (with Slop) will match, but not "Stephen King's Shining".
This does not explain your problem with "Disney A.N.T.". But be aware that solr.StandardTokenizerFactory would remove the last dot, and solr.WhitespaceTokenizerFactory does not.

Solr difficulty getting fuzzy search with multiple terms

Suppose someone's name is Alessia Keeling. I'm having difficulty getting the following queries to work
q=Alessia Keeling returns a result
q=Alessia returns a result
q=Alessia Keel returns a result
however,
q=Alessia Keeli and q=Alessia Keelin returns no results
I've tried quite a few things here in my schema.xml file, but I don't have much METHOD to my MADNESS.
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20" side="front"/>
</analyzer>
</fieldType>
Solr Admin Analyzer shows that it will match both "Alessia" and various forms of "Keeling", but Sunspot is still returning no results.
Edit 1
Here is console testing
(byebug) Sunspot.commit
(byebug) Sunspot.index
(byebug) User.search {|q| q.fulltext "Alessia Keeling" }.hits
[#<Sunspot::Search::Hit:User 4>]
(byebug) User.search {|q| q.fulltext "Alessia Keelin" }.hits
[]
Edit 2
I was finally able to get somewhere. I looked in some of my log files and noticed that the call my app was making to solr was using the query string
"http://localhost:8981/solr/select?fq=type%3AUser&q=Eli+Donnelly+I&fl=%2A+score&qf=email+first_name_text+last_name_text+username_text+name_text+description_text&defType=dismax&start=0&rows=30&debugQuery=true
This printed out some useful information, most useful being "parsedQuery" I was able to see that another field was conflicting. I have another field that handles emails and in this latter case where my query string was "Eli Donnely I", the sole letter token "I" was breaking the query because of the email field. Adding a length filter fixed it.
I've tried this from the Solr side with an example core, and the match is returned with the NGram filter while indexing. You might want to check the server side logs to see that you're actually reindexing, at least.
The field definition is as follows:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReversedWildcardFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="20" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
data.json:
[{"id": 1, "text": "Alessia Keeling"}, {"id": 2, "text": "Alessia Fubar"}]
Populate it with data:
curl http://localhost:8983/solr/collection1/update\?commit\=true --data-binary #data.json -H 'Content-type:application/json'
Searching:
GET http://localhost:8983/solr/collection1/select\?q\=alessia%20keelin\&q.op\=AND
[..]
<result name="response" numFound="1" start="0"><doc><int name="id">1</int><str name="text">Alessia Keeling</str><long name="_version_">1473248002863792128</long></doc> </result>
.. which returns the promised document, while keeping the non-matching document out of the result.
As mentioned in the comment, you need to switch the EdgeNGramFilterFactory to the index instead of the query.
For me, this has apparently worked. Edit the file schema.xml as below:
<?xml version="1.0" encoding="UTF-8"?>
<schema name="sunspot" version="1.0">
... (other stuff)
<solrQueryParser defaultOperator="AND|OR"/>
... (other stuff)
</schema>
Before, I have the defaultOperator only configured as AND, after I changed it, searches were getting more flexible.
Also, I'd suggest giving a look at this page.

Searching for Solr Stop words

On of my solr fields is configured in the following manned,
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
This works in cases where i don't want stemming, but now there is another use case which is causing a problem, people are beginning to seach for the following combinations,
The Ivy : In this case results with just ivy is being returned, when the expected result would be with The. I understand that this is because of the stop word but is the way to achieve this. For example if they search for "the ivy" within quotes than this should work.
(Mom & Me) OR ("mom and me"): In this case also & is dropped or results including both mom and me in some part of the statement is returned.
I am ok if only new data behaves in the right way but wouldnt be able to reindex. Also, would changing the schema.xml file trigger a full replication?
Regards,
Ayush
You are using the white space tokenizer.
So "The Ivy" is slitted into 2 words.
You could use an less agressive tokenize an followed by the WordDelimiterFilterFactory in order to activate the protected="protwords.txt" options, where you can set "the ivy" as an protected word so that solr will not tokenize that.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Resources