ES "minimum_should_match" query how to transform to vespa query? - vespa

We have decided to migrate from ES to Vespa. How do we transform an ES "minimum_should_match" query into a Vespa query?
e.g. ES query:
"query_string": {
  "minimum_should_match": "75%",
  "fields": [],
  "query": "may is test"
}
The above query means that of the three query terms, at least two must match.
I tried to transform this query using Vespa's "or" operator, but the 75% requirement can't be expressed that way.

Thank you for your interest in Vespa,
It's a rather odd text matching feature, as dropping a highly significant term just because some other non-significant terms matched is IMHO questionable. For example, for the query "what is text ranking", requiring 2/3 terms to match makes it acceptable to match only 'what' and 'is', while the significant part is discarded. I would rather look hard at using the weakAnd query operator: https://docs.vespa.ai/documentation/using-wand-with-vespa.html
There is no direct replacement but you can drop documents in the first-phase ranking function using the rank-score-drop-limit.
https://docs.vespa.ai/documentation/reference/schema-reference.html#rank-score-drop-limit
https://docs.vespa.ai/documentation/reference/rank-features.html
rank-profile odd-ranking {
    first-phase {
        rank-score-drop-limit: -5
        expression: if(matchCount(text)/queryTermCount < 0.75, -10, bm25(text))
    }
}
In this case, if matchCount(text)/queryTermCount is below 0.75, the document is assigned a first-phase rank score of -10 and will be dropped from the result set; otherwise it uses a bm25 score over the text field.
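The drop logic of that expression can be sketched in plain Python (a minimal illustration, not Vespa code; the threshold and scores mirror the profile above):

```python
def first_phase_score(match_count: int, query_term_count: int, bm25_score: float) -> float:
    # Mirrors: if(matchCount(text)/queryTermCount < 0.75, -10, bm25(text))
    if match_count / query_term_count < 0.75:
        return -10.0  # below rank-score-drop-limit (-5), so the hit is dropped
    return bm25_score

# A 3-term query where only 2 terms match: 2/3 < 0.75, so the document is dropped
print(first_phase_score(2, 3, 1.8))  # -10.0
print(first_phase_score(3, 3, 1.8))  # 1.8
```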
But again, for text matching look at weakAnd instead of this; it will focus on the significant query terms.

Related

Solr Search for term with under certain price

I am working on an e-commerce app whose search is powered by Solr. How can I accomplish something like: 'DSLR under $600' gives me products that match DSLR and have a price under $600?
Lucene and Solr query parser support less than or equal to (<=):
?q=name:DSLR AND price:[* TO 600]
Note: this assumes your field is named "name".
This would give you the products priced at 600 dollars or less.
Square brackets [ & ] denote an inclusive range query that matches values including the upper and lower bound.
Curly brackets { & } denote an exclusive range query that matches values between the upper and lower bounds, but excluding the upper and lower bounds themselves.
For more details, please refer to Range Searches.
You can use
protocol://{solr-url}/solr/{collection-name}/select?
fq=name:DSLR
&fq=price:[0 TO 600]
&q=*:*
&wt=json
In my understanding, using fq over q is recommended here.
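As a sketch, the same request URL can be assembled with Python's standard library (the host, port, and collection name products are assumptions, not from the question):

```python
from urllib.parse import urlencode

params = {
    "q": "*:*",
    "fq": ["name:DSLR", "price:[0 TO 600]"],  # each fq is a separate filter query
    "wt": "json",
}
# doseq=True emits one fq= parameter per list entry
url = "http://localhost:8983/solr/products/select?" + urlencode(params, doseq=True)
print(url)
```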

Combine solr's document score with a static, indexed score in solr 7.x

I have people indexed into solr based on structured documents. For simplicity's sake, let's say they have the following schema
{
personName: text,
games :[ { gamerScore: int, game: text } ]
}
An example of the above would be
{
personName: john,
games: [
{ gamerScore: 80, game: Zelda },
{ gamerScore: 20, game: Space Invader },
{ gamerScore: 60, game: Tetris},
]
}
'gamerScore' is a value between 1 and 100 indicating how good the person is at the specified game.
Relevance matching in solr is all done through the Text field 'game'. However, I want my final result list to be a combination of relevance to the query as provided by solr and my own gamerScore. Namely, I need to re-rank the results based on the following formula:
personFinalScore = (0.8 * solrScore) + (0.2 * gamerScore)
What I am trying to achieve is the combination of two different scores in a weighted manner in Solr. This question was asked a long time ago, and I was wondering if there is something in Solr v7.x that can tackle this.
I can change the schema around if a solution requires it.
In effect, your formula can be simplified to weighting gamerScore by 0.25: the absolute value of the score is irrelevant, only how much the gamerScore field affects the score of the document.
The dismax based handlers supports bf:
The bf parameter specifies functions (with optional boosts) that will
be used to construct FunctionQueries which will be added to the user’s
main query as optional clauses that will influence the score.
Since bf is an additive boost, you can use bf=product(gamerScore,0.25) to make the gamerScore count for 20% of the total score.
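The simplification can be checked numerically; a small Python sketch (the document names and scores are made up for illustration):

```python
# Ranking by 0.8*solrScore + 0.2*gamerScore gives the same ordering as
# ranking by solrScore + 0.25*gamerScore (divide both weights by 0.8).
docs = [("john", 2.0, 80), ("mary", 2.5, 20), ("anna", 1.0, 100)]  # (name, solrScore, gamerScore)

by_original = sorted(docs, key=lambda d: 0.8 * d[1] + 0.2 * d[2], reverse=True)
by_boosted = sorted(docs, key=lambda d: d[1] + 0.25 * d[2], reverse=True)

print([d[0] for d in by_original] == [d[0] for d in by_boosted])  # True
```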

Solr 7 - How to do Full Text Search w/ Geo Spatial Search

How to do Full Text combined w/ Geo Spatial in Solr 7?
In regards to this: https://lucene.apache.org/solr/guide/7_2/spatial-search.html
I have to do queries that COMBINE full text w/ geo spatial. For example:
box AND full text or spatial sort AND full text.
I was not able to figure out a good query string example that produces this desired result. I would like this as a pure query string rather than some Java method, as I'm consuming this from tech other than Java. Solr is very deep and confusing, and I know I must read more, but I found no good examples for this anywhere online.
desired query string example
[solr]/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
So in that case, would sort by distance yet also filter by some full text "some text" value.
If this cannot be done in Solr, I believe it is possible in Elasticsearch, but both are built on top of Lucene, so it seems it should work in both if it works in one; feel free to supply an answer for Elasticsearch as well.
example returned
{
"responseHeader": {
"status": 0,
"QTime": 2,
"params": {
"q": "Panini",
"pt": "34.04506799999999,-118.260849",
"d": "10000",
"{!geofilt}": "",
"fq": "{!bbox sfield=point}",
"sort": "geodist() asc",
"sfield": "point"
}
},
"response": {
"numFound": 0,
"start": 0,
"docs": []
}
}
Docs do contain the phrase 'Panini', but none were returned. Could this be due to the default way full text is handled in Solr 7? The query uses the same point where the term 'Panini' appears, and the field point is of type org.apache.solr.schema.PointType.
UPDATE
I ended up abandoning Solr for Elasticsearch. Solr is just very annoying in its strange ways compared with the very easy to use Elasticsearch, where things just work as you expect without having to dig into quirks.
I adapted my answer to the solr 7.2.1 example:
Start solr by: ./bin/solr start -e techproducts
I've also visualized the data in google maps:
https://www.google.com/maps/d/u/0/viewer?ll=42.00542239270033%2C-89.81213734375001&hl=en&hl=en&z=4&mid=16gaLvuWdE9TsnhcbK-BMu5DVYMzR9Vir
You need these query parameters:
Bound by Box Filter:
fq={!bbox}
The geo filter query parser bbox needs further parameters:
Solr field: sfield=store
Point to search/sort from: pt=36.35,-97.51
Distance for filter: d=1200
Sort:
sort=geodist() asc
Fulltext query:
q=some+text
Full example queries for solr example data:
Simple:
http://localhost:8983/solr/techproducts/select?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod&fl=name,store
UI:
http://localhost:8983/solr/techproducts/browse?fq={!bbox}&sort=geodist()%20asc&sfield=store&pt=36.35,-97.51&d=1200&q=ipod
The result is as expected:
Apple 60 GB iPod
Belkin Power Cord for iPod
Filtered out by distance: iPod & iPod Mini USB 2.0 Cable
Hints
The field store must be of type location.
You may need to URL-encode the special characters:
e.g. fq=%7B%21bbox%20sfield%3DgeoLocation%7D
In your case, you have to combine the full-text search scoring with the spatial distance.
So if your query looks like this:
/select?fq={!bbox sfield=point}&pt=34.04506799999999,-118.260849&d=10000&sort=geodist() asc&{!geofilt}&sfield=point&q=Panini
You should either remove the sort parameter or set it to score desc. That way you sort by the score given by the full-text search query.
To take the spatial part into consideration you need to include a boosting function in your query. In the majority of cases, the closer the document is to the point of interest the better, so you would probably want a boosting function of the form X/distance. The X can be as simple as 1, and the function itself can be more complicated. To do that with the dismax query parser you would use the bf parameter, like bf=div(1,geodist()).
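The additive effect of bf=div(1,geodist()) can be sketched in Python (the scores and distances are made-up illustrations):

```python
def final_score(text_score: float, distance: float) -> float:
    # In dismax, bf adds the boost function's value to the full-text score
    return text_score + 1.0 / distance

# Two docs with equal text relevance: the closer one scores higher
near = final_score(2.0, 2.0)   # 2.5
far = final_score(2.0, 10.0)
print(near > far)  # True
```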
Try that out, it should work, but of course will need some adjustments.

Solr5 search not displaying results based on score

I am implementing Solr search, but the results are not ordered by score as I expect. Let's say I use the search keywords .net ios; it returns results ranked by score. I have a field that holds the following data
KeySkills:Android, ios, Phonegap, ios
KeySkills:.net, .net, .net, MVC, HTML, CSS
Here, when I search for .net ios, the document with .net, .net, .net, MVC, HTML, CSS should come first and score higher because it contains .net three times, but I am getting the reverse result.
Is there any setting in the Solr config file or in schema.xml to achieve this, or how can I sort the results based on the number of occurrences of the search string? Please help me solve this.
Following is the result i get
{
"responseHeader": {
"status": 0,
"QTime": 0,
"params": {
"indent": "true",
"q": ".net ios",
"_": "1434345788751",
"wt": "json"
}
},
"response": {
"numFound": 2,
"start": 0,
"docs": [
{
"KeySkills": "Android, ios, Phonegap, ios",
"_version_": 1504020323727573000,
"score": 0.47567564
},
{
"KeySkills": "net, net, net, MVC, HTML, CSS",
"_version_": 1504020323675144200,
"score": 0.4726259
}
]
}
}
As you can see in Lucene's docs, the score is not estimated from the number of matching terms alone:
score(q,d) = coord(q,d) · queryNorm(q) · ∑( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
Where:
tf(t in d) correlates to the term's frequency, defined as the number of times term t appears in the currently scored document d.
idf(t) stands for Inverse Document Frequency. This value correlates to the inverse of docFreq (the number of documents in which the term t appears), meaning rarer terms give a higher contribution to the total score.
coord(q,d) is a score factor based on how many of the query terms are found in the specified document.
t.getBoost() is a search-time boost of term t in the query q, as specified in the query text.
norm(t,d) encapsulates a few (indexing-time) boost and length factors: the field boost and the lengthNorm, computed when the document is added to the index according to the number of tokens of this field in the document, so that shorter fields contribute more to the score.
When a document is added to the index, all the above factors are multiplied. If the document has multiple fields with the same name, all their boosts are multiplied together:
norm(t,d) = lengthNorm · ∏ f.boost()
So, here I guess that "KeySkills": "Android, ios, Phonegap, ios" ranks before your other document because it contains fewer words than the other one.
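A rough Python sketch of the lengthNorm effect (simplified from the classic Lucene TF-IDF formulas; the exact constants and encoding in Lucene differ):

```python
import math

def tf_weight(tf: int) -> float:
    return math.sqrt(tf)  # classic Lucene term-frequency weight

def length_norm(num_tokens: int) -> float:
    return 1.0 / math.sqrt(num_tokens)  # shorter fields score higher

# Same term frequency, different field lengths: the shorter field wins
short_field = tf_weight(2) * length_norm(4)
long_field = tf_weight(2) * length_norm(16)
print(short_field > long_field)  # True
```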
To check that, you can use this awesome tool, which is explain.solr.pl.

Solr TF vs All Terms match

I have observed that Solr/Lucene gives too much weight to matching all the query terms over the tf of a particular query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B is getting a much higher score because it contains all three terms of the query, but just once each, whereas Document A is getting a very low score even though it contains one term a large number of times.
Can I create a query in such a manner that if Lucene finds a match for "red jacket" it does not consider it as match for "red" and "jacket" individually ?
I would recommend using a DisjunctionMaxQuery. In raw Lucene, this would look something like:
Query dismax = new DisjunctionMaxQuery(0);
dismax.add(parser.parse("red"));
dismax.add(parser.parse("jacket"));
dismax.add(parser.parse("red jacket"));
The dismax query will score using the maximum score among its subqueries, rather than the product of the scores of its subqueries.
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
Tf-idf is what search engines normally do, but not always what you want; in particular, it is not what you want if you want to ignore repeated keywords.
Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequent a word is in a text. idf (inverse document frequency) measures how unique a word is among all documents in the search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
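The worked example above can be reproduced directly (using a base-10 log, as in the example):

```python
import math

tf = 3 / 100                          # "cat" appears 3 times in a 100-word text
idf = math.log10(10_000_000 / 1_000)  # the word appears in 1,000 of 10M docs -> 4.0
tfidf = tf * idf
print(tfidf)  # 0.12
```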
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.
