How to use Solr MinHashQParser

I'm trying to integrate Jaccard similarity search using MinHash, and I stumbled upon Solr 8.11's MinHash Query Parser. The docs say:
The queries measure Jaccard similarity between the query string and MinHash fields
How do I implement it correctly?
As the docs describe, I added a <fieldType> and a <field> like so:
<field name="min_hash_analysed" type="text_min_hash" multiValued="false" indexed="true" stored="false" />
<fieldType name="text_min_hash" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="5" outputUnigrams="false" outputUnigramsIfNoShingles="false" maxShingleSize="5" tokenSeparator=" "/>
<filter class="org.apache.lucene.analysis.minhash.MinHashFilterFactory" bucketCount="512" hashSetSize="1" hashCount="1"/>
</analyzer>
</fieldType>
I tried saving some text to that new min_hash_analysed field and then querying very similar text using the query given in the docs:
{!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}Very similar text to already saved document text
I was hoping to get back all documents with a similarity score higher than sim="0.5", but no matter what I try I get "numFound":0.
Surely I'm doing something wrong. How do I correctly integrate Solr's MinHash Query Parser?

Judging by the response, it seems you're sending {!min_hash field..} directly as a query parameter, not as a Solr query given by the q= parameter.
q={!min_hash ..}query text here
.. would be the correct syntax in the URL (and apply URL escaping as required).
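For example, with curl (the core name mycore below is a placeholder for your actual core or collection), the request could look like this; --data-urlencode handles the URL escaping:
curl "http://localhost:8983/solr/mycore/select" --data-urlencode 'q={!min_hash field="min_hash_analysed" sim="0.5" tp="0.5"}Very similar text to already saved document text'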

Related

solr fuzzy search with edit distance above 1

Environment: java version "11.0.12" 2021-07-20 LTS, solr-8.9.0
I have the following field declaration for my Solr index:
<field name="Field1" type="string" multiValued="false" indexed="false" stored="true"/>
<field name="author" type="text_general" multiValued="false" indexed="true" stored="true"/>
<field name="Field2" type="string" multiValued="false" indexed="false" stored="true"/>
Field type:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The Solr core was created with the command: ./solr create -c fuzzyCore
The .csv file used to index the data is https://drive.google.com/file/d/1z684x2GKsSQWGAdyi6O4uKit4a96iiuh/view
I understand that "Lucene supports fuzzy searches based on the Levenshtein Distance, or Edit Distance, algorithm. To do a fuzzy search, the tilde, "~", symbol is used at the end of a single-word term. The ~ operator is used to run fuzzy searches: add ~ after a single term, and optionally specify the edit distance after it, as below."
FIELD_NAME:TERM_1~Edit_Distance
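For example, to match terms within an edit distance of 2 of beaeb:
author:beaeb~2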
Since KeywordTokenizer keeps the whole input as a single token and I want each word to be searchable, StandardTokenizer is used.
The request looks like this:
curl "http://localhost:8983/solr/fuzzyCore/select" --data-urlencode "q=author:beaeb~' AND Field1:(w1 x)" --data-urlencode "rows=20"
{
"responseHeader":{
"status":0,
"QTime":14,
"params":{
"q":"author:beaeb~' AND Field1:(w1 x)",
"rows":"20"}},
"response":{"numFound":12,"start":0,"numFoundExact":true,"docs":[
{
"Field1":"x",
"author":"bbaeb",
"Field2":"o",
"id":"f8fbb58d-9e0d-47b2-aa3c-e3920e25a7d1",
"_version_":1746912583192936455},
{
"Field1":"x",
"author":"beabe",
"Field2":"p",
"id":"7d73e7ba-8455-4eb4-818f-1e19b1d35a22",
"_version_":1746912583244316680},
{
"Field1":"x",
"author":"baeeb",
"Field2":"n",
"id":"b4e86fc3-7ecc-407b-b638-88d167a66934",
"_version_":1746912583292551181},
{
"Field1":"x",
"author":"beaea",
"Field2":"o",
"id":"131ad4de-eaa2-47b8-b58b-e690316eed1c",
"_version_":1746912583314571267},
{
"Field1":"x",
"author":"bbaeb",
"Field2":"q",
"id":"d034e66c-a302-4b24-a186-5a2bafecab40",
"_version_":1746912583392165900},
{
"Field1":"x",
"author":"beacb",
"Field2":"n",
"id":"c0ab3e48-2b2d-438d-8cc2-1acfcf6efde8",
"_version_":1746912583490732036},
{
"Field1":"x",
"author":"aeabe",
"Field2":"m",
"id":"4472ec5d-eace-446f-b1d6-c8911be24368",
"_version_":1746912583266336776},
{
"Field1":"x",
"author":"baeab",
"Field2":"q",
"id":"b4c24da3-9199-4eba-a8a3-e30fc17d9167",
"_version_":1746912583274725377},
{
"Field1":"x",
"author":"aeaea",
"Field2":"n",
"id":"bb17bc26-e392-4fed-ae46-bbdd40af0ac0",
"_version_":1746912583294648329},
{
"Field1":"x",
"author":"aeceb",
"Field2":"p",
"id":"5e5cfe21-ff19-464f-8adf-8b5888c418e4",
"_version_":1746912583296745472},
{
"Field1":"x",
"author":"baeab",
"Field2":"p",
"id":"54a3c8e6-137d-47c3-9192-a5ed1904dc55",
"_version_":1746912583357562889},
{
"Field1":"x",
"author":"aeeeb",
"Field2":"m",
"id":"200694a0-6248-49fd-8182-dac79657e045",
"_version_":1746912583385874444}]
}}
The above request does not return the document with author:'bebbeb', although a document with author:'bebbeb' and Field1:w1 is present in the data. This can be verified with the following two commands:
curl "http://localhost:8983/solr/fuzzyCore/select" --data-urlencode "q=author:beaeb~' AND Field1:w1"
{
"responseHeader":{
"status":0,
"QTime":4,
"params":{
"q":"author:beaeb~' AND Field1:w1"}},
"response":{"numFound":0,"start":0,"numFoundExact":true,"docs":[]
}}
while the output of the following command is:
curl "http://localhost:8983/solr/fuzzyCore/select" --data-urlencode "q=Field1:w1"
{
"responseHeader":{
"status":0,
"QTime":1,
"params":{
"q":"Field1:w1"}},
"response":{"numFound":1,"start":0,"numFoundExact":true,"docs":[
{
"Field1":"w1",
"author":"bebbeb",
"Field2":"p",
"id":"4356dff2-ab93-4bab-a4dc-1797db38240c",
"_version_":1746912583504363523}]
}}
I've tried to post everything needed to understand my problem. Any ideas? Why is author:'bebbeb' not returned for the input beaeb~?
After debugging Lucene we discovered that there is a parameter called maxExpansions, set to 50 by default, which could be extended up to 1024.
However, looking at the Solr code, we can see that the FuzzyQuery constructor is only called twice and always uses the default maxExpansions value (for performance reasons). This means fuzzy searches take at most the 50 most similar terms and discard the others; that's why, when many documents are indexed and most of the terms are similar (as in your case), some documents may not be returned.
A Solr open-source contribution would be needed to expose this parameter and make the use of this feature more flexible (allowing different values to be set).
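For reference, here is a minimal sketch of the Lucene constructor involved (field and term taken from this question); the maxExpansions argument, 1024 below, is exactly the value Solr's query parser does not let you set, so this only illustrates direct Lucene usage:
import org.apache.lucene.index.Term;
import org.apache.lucene.search.FuzzyQuery;

// term, maxEdits=2, prefixLength=0, maxExpansions=1024, transpositions=true
FuzzyQuery query = new FuzzyQuery(new Term("author", "beaeb"), 2, 0, 1024, true);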

Solr tokenizer does not do anything

I want to tokenize one solr string field "content" to another field "tokenized".
So e.g.:
{
"content":"Hello World this is a Test",
"tokenized":["hello", "world", "this", ...]
}
For that I use
<field name="content" type="string" indexed="true" stored="true"/>
<field name="tokenized" type="customType" indexed="true" stored="true"/>
<copyField source="content" dest="tokenized"/>
and the custom field type
<fieldType name="customType" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
My understanding was that upon committing, all content is tokenized with the specified tokenizer and then put, as a list of tokens, into the tokenized field. However, the tokenized field only contains the original content in a list, e.g.:
{
"content":"Hello World this is a Test",
"tokenized":["Hello World this is a Test"]
}
Is there some global configuration I need to make to get tokenizers to work?
Tokens are only stored internally in Lucene and Solr. They do not change the stored text that gets returned to you in any way; the text is stored verbatim, i.e. the text you sent in is what gets returned to you.
The tokens generated in the background and stored in the index affect how you can search against the content you've stored and how it's processed; they do not affect the display value of the field.
You can use the Analysis page under Solr's admin page to see exactly how text for a field gets processed into tokens before being stored in the index.
The reason for this is that you're usually interested in returning the actual text to the user; making the tokenized and processed values visible doesn't really make sense for a document that gets returned to a human.
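If you want the tokens programmatically rather than in the admin UI, the same analysis chain is exposed over HTTP through Solr's built-in field analysis handler; a minimal example (the core name mycore is a placeholder), which returns the token stream after each stage of the analyzer:
curl "http://localhost:8983/solr/mycore/analysis/field" --data-urlencode "analysis.fieldtype=customType" --data-urlencode "analysis.fieldvalue=Hello World this is a Test"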

Solr synonym graph filter not working after other filter

I'm trying to convert 15.6" searches to 15.6 inch. The idea was to first replace 15.6" with 15.6 " and then match the " with the synonym rule " => inch.
I created the type definition:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern='^([0-9]+([,.][0-9]+)?)(")$' replacement="$1 $3" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" />
</analyzer>
</fieldType>
but it's not working! If I input 15.6" I get 15.6 ", but when I input 15.6 " I get what I want: 15.6 inch.
Why doesn't it work? Am I missing something?
EDIT:
Solr Analysis: (screenshot of the admin UI Analysis page omitted)
The issue is that 15.6 " is still a single token after your pattern replace filter - just creating a token with a space in it will not split it.
You can see that it's still kept as a single token as there is no | on the line (which separates the tokens).
Add a Word Delimiter Filter after it (it seems from your analysis chain that you already have one; it's just not included in your question), or, better, do the replacement in a PatternReplaceCharFilterFactory before the tokenizer splits the input into separate tokens:
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern='^([0-9]+([,.][0-9]+)?)(")$' replacement="$1 $3" />
<tokenizer ...>
You might have to massage the pattern a bit (i.e. lose the ^ and $, which aren't respected by Solr anyway, iirc) depending on your input, since it'll now be applied to the whole input string; make sure that "Macbook 15.6" 256GB" is matched appropriately.
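Put together, the field type might look like the sketch below; the relaxed pattern (no anchors) and the synonyms file name are assumptions you should verify against your data on the Analysis page:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- insert a space before a trailing " while the input is still one string -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern='([0-9]+([,.][0-9]+)?)(")' replacement="$1 $3"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- synonyms.txt contains the rule: " => inch -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
  </analyzer>
</fieldType>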

Pattern Tokenizer Factory doesn't work properly

I'm trying to parse an input line using PatternTokenizerFactory.
According to the doc:
https://lucene.apache.org/core/4_1_0/analyzers-common/org/apache/lucene/analysis/pattern/PatternTokenizerFactory.html
My schema looks like:
<fieldType name="text_ptn" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern="("bbb": ")([[a-zA-Z ]+)" group="2"/>
</analyzer>
</fieldType>
So, this pattern should work: https://regex101.com/r/9Ep6qO/6
According to the schema, I'm trying to get the value from a particular part of the "test" field ('bbb'). As I understand it, I should now be able to search for a doc just by writing "test":"Acc Hs" in Solr.
But I can only search using a construction like "test":"'bbb': 'Acc Hs'".
My solution was to split this input and then use the filter:
<tokenizer class="solr.PatternTokenizerFactory" pattern="(.*\"bbb\": \")" />
<filter class="solr.PatternCaptureGroupFilterFactory"
pattern="(^[a-zA-Z ]+)"
preserve_original="false"/>
So, could you explain why the first option isn't working? (There was no difference when I put e.g. group="1".)

Expanding Solr search: "volcano" to match "volcanic"

I have websolr set up on my Rails app running on Heroku. I just noticed that a search for "volcano" did not return all the results I would have expected. Specifically, it did not return a result which included both "volcanic" and "stratovolcanoes".
How do I need to modify the solr configuration to address this?
This is the relevant section from my schema.xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
</analyzer>
</fieldType>
Addition: I don't think this is relevant, but just in case:
My Rails Photo.rb model is set up like this:
searchable do
  text :caption, :stored => true
  text :category do
    category.breadcrumb
  end
  integer :user_id
  integer :category_id
  string :caption
  string :rights
end
Caption and category are the two text fields I'm searching on. Caption is free-form text, whereas Category is a text string like "Earth Science > Volcanoes"
This is my synonyms config that shows in websolr (I added the last line):
#some test synonym mappings unlikely to appear in real input text
aaa => aaaa
bbb => bbbb1 bbbb2
ccc => cccc1,cccc2
a\=>a => b\=>b
a\,a => b\,b
fooaaa,baraaa,bazaaa
# Some synonym groups specific to this example
GB,gib,gigabyte,gigabytes
MB,mib,megabyte,megabytes
Television, Televisions, TV, TVs
#notice we use "gib" instead of "GiB" so any WordDelimiterFilter coming
#after us won't split it into two words.
# Synonym mappings can be used for spelling correction too
pixima => pixma
volcano => volcanic,stratovolcanoes
I believe this is caused by the introduction of SnowballPorterFilterFactory.
Including this in your analyzer list causes Solr to apply stemming to your terms; in this case, Porter stemming in particular.
If you do not need stemming, you could remove that filter.
If you do not get the desired results for specific cases with stemming, you could add a solr.SynonymFilterFactory filter as described here:
<fieldtype name="syn" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="syn.txt" ignoreCase="true" expand="false"/>
</analyzer>
</fieldtype>
You will then be able to maintain a synonym file:
volcano => volcanic, stratovolcanoes
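One caveat: with the explicit a => b mapping format, the left-hand term is replaced by the right-hand terms at analysis time, so if documents containing "volcano" itself should still match, include it on the right-hand side as well, e.g.:
volcano => volcano, volcanic, stratovolcanoes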
