Solr Relevance Search boost

Solr Relevance Search boost - solr

I am using Solr 5.0.0, I have one question in relevance boost:
If I search for laptop table like words, is there any way to boost results search word before the words like by with or without etc.
I used this query:
? defType = dismax
& q = foo bar
& bq = (*:* -by)^999
But, this will boost negatively those documents having the word by or with etc. How can i avoid this problem?
For example, if I search for laptop table then by the above query the result DGB Cooling Laptop Table by GDB won't boost.
I just need to give a boost to the search words before certain words like by, with, etc.
Is it possible?

In your example you want ...laptop table by... results to score higher than laptop table results without by. And you want ...by laptop table... to be omitted entirely.
Let's call them:
Q1: ...laptop table by... (let's boost by 2 for the exercise)
Q2: ...laptop table... (let's not boost at all)
Q3: ...by laptop table... (we want to omit this)
So, your query in the abstract is: (Q1^2 OR Q2) NOT Q3
The dismax parser may be obscuring your goals by doing too much on your behalf. At least for this piece, you should consider using the Standard (Lucene) Query Parser:
`q=("laptop table by"^2 OR "laptop table") NOT "by laptop table"`
If you want to allow for slop (i.e. extra words between the query and 'by' or 'with') and preserve the order of terms as above, you should look into the ComplexPhraseQueryParser (Solr 4.8+) Yonik Seeley has a nice post about this.
Then you could do something like this:
`q=({!complexphrase inOrder=true}"laptop table by"~2^2 OR "laptop table") NOT {!complexphrase inOrder=true}"by laptop table"~2`

Related

Avoiding keyword stuffing in SOLR

I'm looking for a way to limit the effect (or eliminate it) of "keyword stuffing" in SOLR. (We're currently running a SOLR 6.2.0 server).
I've tried setting omitTermFreqAndPositions="true", but when I do that, some queries throw phrase query errors (specifically queries with search terms such as G1966B - likely due to word splitting and such). I could go down the road of disabling the word splitting and try to avoid the phrase query errors, but this is simply going to mess up more things than I'm trying to fix.
Does anyone have any suggestions on how to limit the affect of multiple keyword matches in a single field?
Example: If we have a description field with something like this:
BrandX 1200 Series G1924B LC/MSD SL XBC System.
This BrandX 1200 Series G1924B ( G 1924 B , G1924 B , G 1924B ) LC/MSD SL XBC >System is in excellent condition.
When someone does a search for "G1924B" I would like to avoid scoring this document higher just because it happens to have G1924B (or a variation of that) in there several times.
In theory someone could repeat the keyword many times in their description to try to trick the system into ranking their search results higher.
Any suggestions?
Thanks!

This happens to appear as a more frequent requirement than initially thought.
If you remove both term freq and positions, you lose phrase search capability.
I would recommend to write a custom similarity that ignores TF ( Term Frequency).
At the moment the default BM25 take TF in consideration.
You can just pick that class and adjust the similarity calculus to consider TF as a constant.
e.g.
org.apache.lucene.search.similarities.BM25Similarity.BM25DocScorer#score
[1] org.apache.lucene.search.similarities.BM25Similarity

Indexing and searching words and word-parts

I just indexed a bunch of text data from our products DB. My goal is evaluating Apache Solr for production use.
This is a document example:
{
"shape":"Geometric",
"color":"MATTE BLACK",
"gender":"unisex",
"model":"CLUBMASTER RX 5154",
"sales":10,
"lens":"rugged",
"material":"plastic",
"brand":"Ray-Ban"
}
The most important thing in our search app is fuzzy matching, because inaccurate search terms are very frequent.
So, I'm a little disappointed with results found by Solr.
For example:
clubmaster -> many results
club master -> no results
Why?!
ray ban -> many results
rayban -> no results
I also tried putting ~1 or even ~2 after my term, with no luck!
All fields are indexed '*_txt_en' predefined field.

You can't just run a serious production setup without customizing schema/solrconfig to fit your specific needs. From what I can guess, you would get the results you want by:
copy your text fields into different versions with different analysis, for example:
one as a string type, hard to match
one field that is using EdgeNgram to match prefixes.
another with WordDelimiterFilterFactory to match ray-ban/rayban
...
using edismax as the query parser
in edismax, there are many things to tweak in it. But the most important is: search on all the fields above, but weight then in different way, the less analysis, the more weight

Solr negative boost

I'm looking into the possibility of de-boosting a set of documents during
query time. In my application, when I search for e.g. "preferences", I want
to de-boost content tagged with ContentGroup:"Developer" or in other words,
push those content back in the order. Here's the catch. I've the following
weights on query fields and boost query on source
qf=text^6 title^15 IndexTerm^8
As you can see, title has a higher weight.
Now, a bunch of content tagged with ContentGroup:"Developer" consists of a
title like "Preferences.material" or "Preferences Property" or
"Preferences.graphics". The boost on title pushes these documents at the
top.
What I'm looking is to see if there's a way to deboost all documents that are
tagged with ContentGroup:"Developer" irrespective of the term occurrence is
text or title. I tried something like, but didn't make any difference.
Source:simplecontent^10 Source:Help^20 (-ContentGroup-local:("Developer"))^99
I'm using edismax query parser.
Any pointers will be appreciated.
Thanks,
Shamik

You're onto something with your last attempt, but you have to start with *:*, so that you actually have something to subtract the documents from. The resulting set of documents (those not matching your query) can then be boosted.
From the Solr Relevancy FAQ
How do I give a negative (or very low) boost to documents that match a query?
True negative boosts are not supported, but you can use a very "low" numeric boost value on query clauses. In general the problem that confuses people is that a "low" boost is still a boost, it can only improve the score of documents that match. For example, if you want to find all docs matching "foo" or "bar" but penalize the scores of documents matching "xxx" you might be tempted to try...
q = foo^100 bar^100 xxx^0.00001 # NOT WHAT YOU WANT
...but this will still help a document matching all three clauses score higher then a document matching only the first two. One way to fake a "negative boost" is to give a large boost to everything that does not match. For example...
q = foo^100 bar^100 (*:* -xxx)^999
NOTE: When using (e)dismax, people sometimes expect that specifying a pure negative query with a large boost in the "bq" param will work (since Solr automatically makes top level purely negative positive queries by adding an implicit ":" -- but this doesn't work with "bq", because of how queries specified via "bq" are added directly to the main query. You need to be explicit...
?defType=dismax&q=foo bar&bq=(*:* -xxx)^999

How do I create a Solr query that returns results even if one field in my query has no matches?

Suppose I want to create a recommendation system to suggest people you should connect with based off of certain attributes that I know about you and attributes I have about other people that are stored in a Solr index. Is it possible to query the index with a list of attributes (along with boosts for each attribute) and have Solr return scored results even if some of my fields return no matches? The way that I understand that Solr works is that if one of your fields doesn't contain a match in any documents found in your index, you get zero results for the entire query (even if other fields in the query matched) - is that right? What I would hope is that I could query the index and get a list of results back in order of a score given based on how many (and which) fields matched to something, even if some fields have no matches, for example:
Say that there are 2 people documents stored in the index as follows (figuratively):
Person 1:
Industry: Manufacturing
City: Oakland
Person 2:
Industry: Manufacturing
City: San Jose
And say that I perform a pseudo-Solr query that basically says "Search for everyone whose industry is equal to manufacturing and whose city is equal to Oakland". What I would like is to receive both results back in the result set, even though one of the "Persons" does not reside in Oakland. I just want that person to come back as a result with a lower score than Person1. Is this possible? What might a solr query look like to handle this? Assume that I have many more than 2 attributes for each person (so saying that I can use "And" and "Or" in my solr query isn't really feasible.. or is it?) Thanks in advance for your helpful input! (PS I'm using Solr 3.6)

You mention using the AND operator, which is likely your problem.
The default behavior of Lucene, and Solr, query syntax is exactly what you are asking for. A query like:
industry:manufacturing city:oakland
Will match either, with scoring preference on those that match both. See the lucene query syntax documentation

You can use the bq parameter (boost query) does not affect matching, but affects the scores only.
http://localhost:8983/solr/persons/select?q=industry:manufacturing&bq=City:Oakland^2
play with the boosting factor at the end to get the correct balance between matching score, and boosting score.

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!

I am not sure if I understand properly what you want to achieve but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you would want both parts to be boosted by the same scale (I am not sure if this isn't the default behaviour of Solr) you would want to use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.

To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be a OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of pseudo-field calculated using the query() function to report how any set of results score on some other query/queries. So if for example the original document set was the top-N from query_A, and then those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting-field …&fl=bscore:query({!dismax v="query_B"})&…. Then the document's scores against query_B would be included in the output (as bscore).
Finally, the result-grouping functionality can be used both collect the top-N for one query and scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Solr Relevance Search boost - solr

Related

Avoiding keyword stuffing in SOLR

Indexing and searching words and word-parts

Solr negative boost

How do I create a Solr query that returns results even if one field in my query has no matches?

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

Categories

Resources