Challenge with hyphens/dashes in Solr/Lucene

I'm trying to get Solr to extract only the second, 7-digit portion of a ticket number formatted like n-nnnnnnn.
Originally I hoped to keep the full ticket number together. According to the documentation, hyphenated tokens containing digits should be kept together, but after hammering away at this problem for some time and looking at the code, I don't think that's the case: Solr always generates two terms. So rather than getting large numbers of matches on the first n- digit, I'm thinking I can get better query results from just the second portion. Substituting an A for the dash:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b\d[A](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all"
maxBlockChars="20000"/>
will parse 1A1234567 fine
But
-\b" replacement="$1" replace="all"
maxBlockChars="20000"/>
will not parse 1-1234567
So it looks like the problem is just with the hyphen. I've tried an escaped hyphen, [-], \u002D, \x{45}, and \x045, all without success.
I've tried putting char filters around it:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\b\d[-](\d\d\d\d\d\d\d)\b" replacement="$1" replace="all" maxBlockChars="20000"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping2.txt"/>
with mappings:
"-" => "z"
and then
"z" => "-"
It looks like the hyphen is eaten up in the Flex tokenization and isn't even available to the char filter.
Has anyone had more success with hyphen/dash in Solr/Lucene? Thanks

If your Solr is using a recent Lucene (3.x or later, I think), you will want to use a ClassicAnalyzer rather than a StandardAnalyzer, since StandardAnalyzer now always treats hyphens as delimiters.
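In schema.xml terms, that means swapping the tokenizer. A minimal sketch (the field type name here is just illustrative):
<fieldType name="text_classic" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- ClassicTokenizer keeps the pre-3.1 StandardTokenizer behaviour, which does not split hyphenated tokens containing digits -->
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>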

Related

How to use Solr MinHashQParser

Currently I'm trying to integrate Jaccard similarity search using MinHash, and I stumbled upon Solr 8.11's MinHash Query Parser. The docs say:
The queries measure Jaccard similarity between the query string and MinHash fields
How do I implement it correctly?
As the docs say, I added a <fieldType> and a <field> like so:
<field name="min_hash_analysed" type="text_min_hash" multiValued="false" indexed="true" stored="false" />
<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
<filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
</fieldType>
I tried saving some text to that new min_hash_analysed field and then querying very similar text using the query provided in the docs.
{!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}Very similar text to already saved document text
I was hoping to get back all documents with a similarity score higher than sim="0.5", but no matter what, I get "numFound":0.
Surely I'm doing something wrong. How should I correctly integrate Solr's MinHash Query Parser?
According to the response, it seems you're sending {!min_hash field=..} directly as a query parameter, not as a Solr query given by the q= parameter.
q={!min_hash ..}query text here
.. would be the correct syntax in the URL (and apply URL escaping as required).
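For example, posting the query through curl (the URL and collection name here are placeholders) lets curl take care of the escaping:
curl http://localhost:8983/solr/mycollection/select \
--data-urlencode 'q={!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}Very similar text to already saved document text'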

"Is-a" relationship in Solr Synoyms

I have two texts:
Text-1
Our new Android Smartphone
Text-2
Our new iPhone Smartphone
I would like to tell Solr that "android" is a "smartphone".
Expected results:
If the user searches for "Android" only the first text should be found.
If the user searches for "Smartphone" both texts should be found.
If I use "equal synonyms" (SolrSynonymParser) (during indexing), then the term "Smartphone" would get expanded to "Smartphone, Android, iPhone" in both texts.
In addition to MatsLindh's comment: to add a unidirectional synonym like "android => android,smartphone", you should also consider adding the synonym filter only at index time, not at both index and query time.
For example:
<analyzer type="index">
...
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
...
</analyzer>
This way, both "android" and "smartphone" will be indexed for any occurrence of "android".
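For completeness, the corresponding synonyms.txt entry, as a minimal sketch of the one-way mapping above:
# applied by the index-time analyzer only
android => android, smartphone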

Solr synonym graph filter not working after other filter

I'm trying to convert 15.6" searches to 15.6 inch. The idea was to first replace 15.6" with 15.6 " and then match the " with the synonym rule " => inch.
I created the type definition:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern='^([0-9]+([,.][0-9]+)?)(")$' replacement="$1 $3" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" />
</analyzer>
</fieldType>
but it's not working! If I input 15.6" I get 15.6 ", but when I input 15.6 " I get what I want: 15.6 inch.
Why doesn't it work? Am I missing something?
EDIT:
Screenshot from the Solr Analysis page (showing the resulting token stream):
The issue is that 15.6 " is still a single token after your pattern replace filter: just creating a token with a space in it will not split it.
You can see that it's still kept as a single token because there is no | on the line (which is what separates tokens).
Add a Word Delimiter Filter after it (from your analysis chain it seems you already have one, it's just not included in your question), or, better, do the replacement in a PatternReplaceCharFilterFactory before the tokenizer gets to split the input into separate tokens:
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern='^([0-9]+([,.][0-9]+)?)(")$' replacement="$1 $3" />
<tokenizer ...>
You might have to massage the pattern a bit (i.e. lose the ^ and $, which aren't respected by Solr anyway, iirc) depending on your input, since it will now be applied to the whole input string; make sure that "Macbook 15.6" 256GB" is matched appropriately.
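Put together, a minimal sketch of the field type with the replacement moved into a char filter (dropping the anchors is an assumption on my part; verify the pattern against your real inputs):
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<!-- runs on the raw input, before tokenization, so 15.6" becomes 15.6 " -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern='([0-9]+([,.][0-9]+)?)(")' replacement="$1 $3" />
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" />
</analyzer>
</fieldType>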

Expanding Solr search: "volcano" to match "volcanic"

I have websolr set up on my rails app running on heroku. I just noticed that the search for "volcano" did not return all the results I would have expected. Specifically, it did not return a result which included both "volcanic" and "stratovolcanoes".
How do I need to modify the solr configuration to address this?
This is the relevant section from my schema.xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
</fieldType>
Addition: I don't think this is relevant, but just in case:
My Rails Photo.rb model is set up like this:
searchable do
text :caption, :stored => true
text :category do
category.breadcrumb
end
integer :user_id
integer :category_id
string :caption
string :rights
end
Caption and category are the two text fields I'm searching on. Caption is free-form text, whereas Category is a text string like "Earth Science > Volcanoes"
This is my synonyms config that shows in websolr (I added the last line):
#some test synonym mappings unlikely to appear in real input text
aaa => aaaa
bbb => bbbb1 bbbb2
ccc => cccc1,cccc2
a\=>a => b\=>b
a\,a => b\,b
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
volcano => volcanic,stratovolcanoes
I believe this is caused by the introduction of the SnowballPorterFilterFactory.
Including this in your analyzer chain causes Solr to apply stemming to your terms; in this case, Porter stemming.
If you do not need stemming, you could remove that filter.
If you do not get the desired results for specific cases with stemming, you could add a solr.SynonymFilterFactory filter as described here:
<fieldtype name="syn" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
</analyzer>
</fieldtype>
You will then be able to maintain a synonym file:
volcano => volcanic, stratovolcanoes
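For reference, a sketch of where such a filter could sit in your existing "text" field type (placing it before the stemmer, so synonyms are matched on unstemmed tokens, is my assumption):
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<!-- synonyms applied before stemming -->
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
<filter class="solr.SnowballPorterFilterFactory" language="English"/>
</analyzer>
</fieldType>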

Chaining a Solr HTMLStripCharFilter with a Lucene Analyzer

I want to index HTML text with the FrenchAnalyzer, so I need to strip the HTML before analyzing it.
I want to highlight keywords after searching, so a solution like this one doesn't work because I need to preserve character position information.
I found the Solr HTMLStripCharFilter class, which looks perfect, but I am not able to chain it with the FrenchAnalyzer.
I tried to rewrite the FrenchAnalyzer, but I don't know how to use HTMLStripCharFilter, and it doesn't work as a standard Lucene filter.
I am using Lucene 3.5.0 without Solr.
In your Analyzer subclass try to override initReader. You may want to add a stripHtml boolean param to your Analyzer's constructor and then use this conditional inside initReader.
/**
 * Override this if you want to add a CharFilter chain.
 */
@Override
protected Reader initReader(Reader reader) {
    if (stripHtml) {
        // Wrap the reader so HTML is stripped before tokenization; the char filter
        // corrects character offsets, which keeps highlighting positions intact.
        return new HTMLStripCharFilter(CharReader.get(reader));
    } else {
        return reader;
    }
}
What about trying something like this:
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
Read more:
HTMLStripCharFilterFactory
SnowballPorterFilterFactory
Solr LanguageAnalysis - French
