I am trying to implement a one-way synonym, or one-way thesaurus (as in Endeca), in Solr: when I search for camcorder I should also get results for camera, but not vice versa. I tried adding the following to synonyms.txt, but it does not seem to work and gives weird results:
camcorder => camera
And my schema.xml is:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
I think what you were looking for is:
camcorder => camera, camcorder
If you don't include camcorder on the right side, a search for "camcorder" won't return any results.
Since you're only expanding synonyms when you're indexing (where you have the SynonymFilter defined), camcorder will be changed to camera for each document on the way in. When you don't have the same expansion taking place when querying, Solr will still search for camcorder (as there is no SynonymFilter defined for the query analysis chain). There is no camcorder token in the index, so there will be no hit.
You'll have to expand synonyms when querying as well as when indexing to achieve what you want with one-way synonyms.
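As a rough sketch of what that could look like, the query analyzer from the question can be given the same synonym filter as the index analyzer. This assumes synonyms.txt contains the rule camcorder => camera, camcorder; placing the filter after the stop filter simply mirrors the index chain and is an assumption, not something the question specifies.

```xml
<!-- Sketch: query-time chain with the synonym filter added.
     Assumes synonyms.txt contains: camcorder => camera, camcorder -->
<analyzer type="query">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ClassicFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- Same synonym file as the index chain, so both sides map camcorder identically -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
```

With this in place, a query for camcorder is analyzed into camera and camcorder, while a query for camera is left unchanged. If synonyms.txt changes, existing documents need to be re-indexed for the index-time mapping to take effect.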
I am using Solr 8.3 and trying to pass a synonym file in WordNet format, such as:
s(300880586,1,'augmented',s,1,0).
s(300880765,1,'enhanced',s,1,0).
s(300881030,1,'hyperbolic',s,1,2).
s(300881030,2,'inflated',s,1,1).
In the managed-schema file, I have configured the Synonym Graph Filter as follows:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="wn_s.pl" format="wordnet" ignoreCase="true"/>
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="wn_s.pl" format="wordnet" expand="true" ignoreCase="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
However, it did not work. I may have missed some configuration, or there may be an issue with the format.
When I converted the file to the Solr synonym format, it worked.
I want to use the WordNet format, though, so if anyone can help me understand the mistake I am making here, it would be helpful.
I'm having unexpected results with the solr.SimplePatternSplitTokenizerFactory. The pattern used is taken from an example in the Solr documentation, and I do not understand where I made a mistake or why it does not work as expected.
If we take the example input "operative", the analyzer shows that during indexing the input is split into the tokens "ope", "a" and "ive"; that is, the tokenizer splits at the characters "r" and "t" rather than at the expected whitespace characters (CR, TAB). Just to be sure, I also tried using more than one backslash in the pattern (e.g. \t and \\t), but this did not change how the input is tokenized during indexing.
What am I missing?
The Solr version used is 7.5.0.
The definition of the field type in the schema is as follows:
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Update: I found this post on the "Solr - User" mailing list archive:
http://lucene.472066.n3.nabble.com/Solr-Reference-Guide-issue-for-simplified-tokenizers-td4385540.html
It seems the documentation (or its example) is incorrect. The following usage of the tokenizer works as intended:
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xD;&#xA;]+"/>
Can someone help me with a highlighting issue? When I search for 'cars', it highlights 'car' and 'cars' (the expected behavior), but also every word that starts with 'car', for example 'cards', 'carriers', etc.
The user requirement is that we don't highlight everything that merely starts with 'car'. Here is my schema.xml:
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[({.,\[\]})]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1" catenateAll="1" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
The problem is that when you're indexing cards with an EdgeNGramFilter, you get the tokens c, ca, car, card and cards. When you're then searching for cars and you have the same EdgeNGramFilter for the field, you'll search for any document matching any of the tokens c, ca, car, and cars.
The solution is to either drop the EdgeNGramFilter when indexing (so that you don't get a hit for c, ca or car), or use a different field for highlighting (with hl.fl) that only has standard tokenization / whitespace tokenization applied, possibly together with a stemmer (I'd go with solr.EnglishMinimalStemFilterFactory to only remove plural indicators).
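A sketch of the second option, assuming hypothetical names (text_hl, content, content_hl) that do not come from the question: a separate field without edge n-grams is filled through copyField and used only for highlighting.

```xml
<!-- Hypothetical field type: whitespace tokens, lowercased, plurals removed; no edge n-grams -->
<fieldType name="text_hl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The field used for highlighting has to be stored; "content" is an assumed source field -->
<field name="content_hl" type="text_hl" indexed="true" stored="true"/>
<copyField source="content" dest="content_hl"/>
```

The request would then ask for highlighting on that field, for example with hl=true&hl.fl=content_hl. Whether the query terms from the main field line up with the tokens of the highlighting field depends on the rest of the chain, so this is worth checking in the analysis screen first.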
I have used a CharFilterFactory in my schema.xml for the fieldType text_general, so that queries for cafe and café return the same results. It works correctly. Here's the relevant part of my schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
...however, the analysis tool within the admin user interface of solr seems to suggest that the tokenizers and filters that come after the charfilterfactory aren't doing anything. That's because if I analyse text_general with any field values for index and query, after MCF (MappingCharFilter), the output for ST, SF and LCF are all empty (grrr - a screen dump would be useful here, but I'm not allowed to post one because my 'reputation' isn't high enough apparently). Is that expected behaviour? Could someone please explain that analysis tool output please?
I am using solr, set up at localhost:8983
I am basically using the out of the box example.
I have entered one document with a name "Car", and another with a name "Cars".
If I visit either:
http://localhost:8983/solr/select?q=Car
or
http://localhost:8983/solr/select?q=Cars
I would expect to get both documents. At the moment, I don't.
In the fields tag of "schema.xml", the entry for "name" is:
"text_general" has the following "analyzers" (without the stemmers):
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
I tried to add a stemmer to each analyzer. I tried:
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
Doing so makes it such that searching for "Cars" will find "Car", but I can never find "Cars".
Should it be possible to find "Cars"?
Any help would be greatly appreciated. Thank you.
It is possible; just add the Porter filter at the end (after LowerCaseFilterFactory):
<filter class="solr.SnowballPorterFilterFactory" language="English" />
Read more:
Snowball docs with example of use in analyser
Solr LanguageAnalysis
The English (Porter2) stemming algorithm
Unless there is a special need, I would not split the analyser into separate index-time and query-time chains. Your query-time analyser looks perfectly fine to use in both cases.
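Put together, a minimal sketch of that suggestion could look like the following. It reuses the query-time chain from the question with the stemmer appended; the positionIncrementGap value is an assumption, since the question does not show the fieldType element itself.

```xml
<!-- Sketch: one analyzer used at both index and query time, with the stemmer last -->
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Runs on both sides, so "Car" and "Cars" end up as the same token -->
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>
```

Because the index-time chain changes, existing documents have to be re-indexed before both "Car" and "Cars" match.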
I found that changing from text_general to text_en in the schema.xml fields took care of this plurality problem.
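For reference, such a change might look like the line below. The indexed and stored attributes are assumptions, since the original field entry is not shown in the question. In the stock example schema, text_en includes a stemming filter, which would explain why plurals start to match.

```xml
<!-- Sketch: switch the "name" field from text_general to text_en (attributes assumed) -->
<field name="name" type="text_en" indexed="true" stored="true"/>
```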