Pattern Tokenizer Factory doesn't work properly - solr

I'm trying to parse an input line using PatternTokenizerFactory.
According to the docs:
https://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.html
My schema looks like:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.PatternTokenizerFactory" pattern="("bbb": ")([[a-zA-Z ]+)" group="2"/>
  </analyzer>
</fieldType>
So, this pattern should work: https://regex101.com/r/9Ep6qO/6
With this schema I'm trying to extract the value of one particular part ('bbb') of the "test" field. As I understand it, I should then be able to find the document by searching in Solr for "test":"Acc Hs"
But the search only works with this construction: "test":"'bbb': 'Acc Hs'"
My workaround was to split the input and then apply a filter:
<tokenizer class="solr.PatternTokenizerFactory" pattern="(.*\"bbb\": \")" />
<filter class="solr.PatternCaptureGroupFilterFactory"
        pattern="(^[a-zA-Z ]+)"
        preserve_original="false"/>
So, could you explain why the first option isn't working? (It made no difference when I set, e.g., group="1".)
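To make the group behaviour easier to reason about, here is a minimal sketch that emulates "emit capture group N of every match" with plain java.util.regex. It is not the Lucene PatternTokenizer itself, and the simplified pattern and the sample inputs are assumptions, not the exact ones from the question. Under those assumptions, it shows that text without the "bbb": prefix produces no tokens at all, which would matter if the same analyzer is also applied to the query text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Emulates group-based tokenization with java.util.regex only.
// This is an approximation for illustration, not the Lucene class.
class GroupTokenSketch {

    // Collects the given capture group from every match found in the input.
    static List<String> tokens(String regex, int group, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            String g = m.group(group);
            if (g != null && !g.isEmpty()) {
                out.add(g);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical, simplified pattern (single character class).
        String regex = "(\"bbb\": \")([a-zA-Z ]+)";
        // Value as it might be indexed: the prefix is present, so group 2 is emitted.
        System.out.println(tokens(regex, 2, "\"bbb\": \"Acc Hs\""));
        // Bare query text: the pattern never matches, so no tokens are produced.
        System.out.println(tokens(regex, 2, "Acc Hs"));
    }
}
```

If that is what happens here, declaring separate index-time and query-time analyzers would be one thing to try; the Analysis screen in the Solr admin UI shows which tokens each side actually produces.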

Related

How to use Solr MinHashQParser

Currently I'm trying to integrate Jaccard similarity search using MinHash, and I stumbled upon Solr 8.11's MinHash Query Parser. The docs say:
The queries measure Jaccard similarity between the query string and MinHash fields
How do I implement it correctly?
As the docs say, I added a <fieldType> and a <field> like so:
<field name="min_hash_analysed" type="text_min_hash" multiValued="false" indexed="true" stored="false" />
<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
    <filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
  </analyzer>
</fieldType>
I tried saving some text to the new min_hash_analysed field and then querying with very similar text, using the query given in the docs:
{!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}Very similar text to already saved document text
I was hoping to get back all documents that have a higher similarity score than sim="0.5", but no matter what, I get "numFound":0
Surely I'm doing something wrong. How should I correctly integrate Solr's MinHash Query Parser?
According to the response, it seems you're sending {!min_hash field..} directly as a query parameter, not as a Solr query given by the q= parameter.
q={!min_hash ..}query text here
.. would be the correct syntax in the URL (and apply URL escaping as required).
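As a rough illustration of the escaping step, the sketch below builds the q= parameter with java.net.URLEncoder. The host, port, and collection name are placeholders, and the class and method names are made up for this example; only the local-params string comes from the question.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds a URL-escaped q= parameter for a {!min_hash ...} query.
class MinHashQueryUrlSketch {

    // Prepends the local-params prefix to the query text and URL-encodes the whole value.
    static String qParam(String field, String sim, String tp, String text) {
        String query = "{!min_hash field=\"" + field + "\" sim=\"" + sim
                + "\" tp=\"" + tp + "\"}" + text;
        return "q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder endpoint; adjust host, port and collection to your setup.
        String base = "http://localhost:8983/solr/mycollection/select?";
        System.out.println(base + qParam("min_hash_analysed", "0.5", "0.5",
                "Very similar text to already saved document text"));
    }
}
```

Most HTTP clients, including SolrJ, do this encoding for you when the query is passed as a parameter value rather than concatenated into the URL by hand.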

Solr synonym graph filter not working after other filter

I'm trying to convert 15.6" searches to 15.6 inch. The idea was to first replace 15.6" with 15.6 " and then match the " with the synonym rule " => inch.
I created the type definition:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.PatternReplaceFilterFactory" pattern='^([0-9]+([,.][0-9]+)?)(")$' replacement="$1 $3" />
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" />
  </analyzer>
</fieldType>
but it's not working! If I input 15.6" I get 15.6 ", but when I input 15.6 " I get what I want - 15.6 inch.
Why doesn't it work? Am I missing something?
EDIT:
Solr Analysis:
The issue is that 15.6 " is still a single token after your pattern replace filter - just creating a token with a space in it will not split it.
You can see that it's still kept as a single token as there is no | on the line (which separates the tokens).
Add a Word Delimiter Filter after it (it seems from your analysis chain that you already have one, it's just not included in your question), or, better, do the replacement in a PatternReplaceCharFilterFactory before the tokenizer gets the task to split the input into separate tokens:
<analyzer>
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern='^([0-9]+([,.][0-9]+)?)(")$' replacement="$1 $3" />
  <tokenizer ...>
You might have to massage the pattern a bit (i.e. lose the ^ and $, which aren't respected by Solr anyway, iirc) depending on your input, since it will now be applied to the whole input string. Make sure that "Macbook 15.6" 256GB" is matched appropriately.
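The ordering difference can be sketched with plain string operations. This is only an emulation using java.util.regex and whitespace splitting, not Solr's analysis chain, and the unanchored pattern in the second method is an assumption about how the pattern might be adapted.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Compares "replace inside each token" with "replace in the raw text, then tokenize".
class ReplaceOrderSketch {

    // Token-filter order: split on whitespace first, then rewrite each token.
    // A space inserted into a token does not create a new token.
    static List<String> filterAfterTokenizer(String input) {
        List<String> out = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            out.add(token.replaceAll("^([0-9]+([,.][0-9]+)?)(\")$", "$1 $3"));
        }
        return out;
    }

    // Char-filter order: rewrite the raw text first (pattern without ^ and $),
    // then split on whitespace, so the inserted space becomes a token boundary.
    static List<String> charFilterBeforeTokenizer(String input) {
        String rewritten = input.replaceAll("([0-9]+([,.][0-9]+)?)(\")", "$1 $3");
        return Arrays.asList(rewritten.trim().split("\\s+"));
    }

    public static void main(String[] args) {
        String input = "Macbook 15.6\" 256GB";
        System.out.println(filterAfterTokenizer(input));      // three tokens
        System.out.println(charFilterBeforeTokenizer(input)); // four tokens
    }
}
```

In the second case the quote character ends up as a token of its own, which is what a synonym rule such as " => inch needs in order to match.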

Partial search on Solr in the end of the string

I'm using Solr with partial search with this configuration:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" />
So if I've indexed a word like "mountain" and I search for "mount", I find the content.
But is there a way to perform a partial search in the middle or at the end of a string?
I would like to search "ountain" and match "mountain".
Or to search "ounta" and match "mountain".
Thanks
Sure, use an NGramFilterFactory:
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="25"/>
Of course, the size of your index will increase...
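To see why this matches "ountain" and why the index grows, here is a small sketch that generates both kinds of grams for one term. It only emulates the idea with substrings; the real Lucene filters may emit the grams in a different order.

```java
import java.util.ArrayList;
import java.util.List;

// Generates edge n-grams (prefixes only) and full n-grams (all substrings) of a term.
class NGramSketch {

    // Prefixes of the term with lengths from min to max (capped at the term length).
    static List<String> edgeNGrams(String term, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, term.length()); len++) {
            grams.add(term.substring(0, len));
        }
        return grams;
    }

    // All substrings of the term with lengths from min to max (capped at the term length).
    static List<String> nGrams(String term, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= Math.min(max, term.length()); len++) {
            for (int start = 0; start + len <= term.length(); start++) {
                grams.add(term.substring(start, start + len));
            }
        }
        return grams;
    }

    public static void main(String[] args) {
        List<String> edge = edgeNGrams("mountain", 2, 25);
        List<String> all = nGrams("mountain", 2, 25);
        System.out.println(edge.size() + " edge grams, contains ountain: " + edge.contains("ountain"));
        System.out.println(all.size() + " n-grams, contains ountain: " + all.contains("ountain"));
    }
}
```

For the eight-letter term "mountain" with minGramSize="2", this yields 7 edge grams versus 28 n-grams, which is where the extra index size comes from.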

Expanding Solr search: "volcano" to match "volcanic"

I have websolr set up on my Rails app running on Heroku. I just noticed that a search for "volcano" did not return all the results I would have expected. Specifically, it did not return a result which included both "volcanic" and "stratovolcanoes".
How do I need to modify the solr configuration to address this?
This is the relevant section from my schema.xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
Addition: I don't think this is relevant, but just in case:
My Rails Photo.rb model is setup like this:
searchable do
  text :caption, :stored => true
  text :category do
    category.breadcrumb
  end
  integer :user_id
  integer :category_id
  string :caption
  string :rights
end
Caption and category are the two text fields I'm searching on. Caption is free-form text, whereas Category is a text string like "Earth Science > Volcanoes"
This is my synonyms config that shows in websolr (I added the last line):
#some test synonym mappings unlikely to appear in real input text
aaa => aaaa
bbb => bbbb1 bbbb2
ccc => cccc1,cccc2
a\=>a => b\=>b
a\,a => b\,b
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
volcano => volcanic,stratovolcanoes
I believe this is caused by the SnowballPorterFilterFactory.
Including it in your analyzer chain causes Solr to apply stemming to your terms; in this case, Solr does Porter stemming.
If you do not need stemming, you could remove that filter.
If stemming does not give the desired results for specific cases, you could add a solr.SynonymFilterFactory filter, for example:
<fieldtype name="syn" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
  </analyzer>
</fieldtype>
You will then be able to maintain a synonym file:
volcano => volcanic, stratovolcanoes
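As a rough sketch of what an explicit "=>" rule does to a token stream, the code below parses one rule line and applies it to a list of tokens. It is a simplification written for this example (single-word left-hand side, no token positions, no case folding), not Solr's synonym parser.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified emulation of an explicit synonym mapping such as "a => b, c".
class SynonymRuleSketch {

    // Parses one explicit rule of the form "lhs => rhs1, rhs2".
    static Map<String, List<String>> parseExplicit(String line) {
        String[] sides = line.split("=>");
        List<String> targets = new ArrayList<>();
        for (String target : sides[1].split(",")) {
            targets.add(target.trim());
        }
        Map<String, List<String>> rules = new HashMap<>();
        rules.put(sides[0].trim(), targets);
        return rules;
    }

    // Replaces every token that has a rule with its targets; other tokens pass through.
    static List<String> apply(Map<String, List<String>> rules, List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.addAll(rules.getOrDefault(token, List.of(token)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> rules = parseExplicit("volcano => volcanic, stratovolcanoes");
        System.out.println(apply(rules, List.of("volcano", "eruption")));
    }
}
```

Note that with an explicit mapping the left-hand term is replaced rather than kept, so "volcano" itself would need to appear on the right-hand side as well if it should still match.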

Chaining a Solr HTMLStripCharFilter with a Lucene Analyzer

I want to index HTML text with the FrenchAnalyzer, so I need to strip the HTML before analyzing it.
I want to highlight keywords after searching, so a solution like this one doesn't work because I need to preserve character position information.
I found the Solr HTMLStripCharFilter class, which looks perfect, but I am not able to chain it with the FrenchAnalyzer.
I tried to rewrite the FrenchAnalyzer, but I don't know how to use HTMLStripCharFilter, and it doesn't work as a standard Lucene filter.
I am using Lucene 3.5.0 without Solr.
In your Analyzer subclass try to override initReader. You may want to add a stripHtml boolean param to your Analyzer's constructor and then use this conditional inside initReader.
/**
 * Override this if you want to add a CharFilter chain.
 */
@Override
protected Reader initReader(Reader reader) {
    if (stripHtml) {
        return new HTMLStripCharFilter(CharReader.get(reader));
    } else {
        return reader;
    }
}
What about trying something like this:
<analyzer>
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.ElisionFilterFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
Read more:
HTMLStripCharFilterFactory
SnowballPorterFilterFactory
Solr LanguageAnalysis - French

Resources