I am using Solr to do a fuzzy search (e.g., foo~2 bar~2). Highlighting allows me to highlight matching document fragments from the resultset.
For example:
Result 1: <em>food</em> <em> bars</em>
Result 2: mars <em>bar</em>
and so on.
For each highlighted fragment in a document, I need to figure out which query terms the fragment matched against, along with the offsets of those terms in the query. Something like:
Result 1: {<em>food</em> MATCHED_AGAINST foo QUERY_OFFSET 0,2} {<em> bars</em> MATCHED_AGAINST bar QUERY_OFFSET 3,5}
Result 2: mars {<em>bar</em> MATCHED_AGAINST bar QUERY_OFFSET 3,5}
Is there a way to do this in Solr?
One possibility would be to customize the Highlighter so that it produces the needed information. The idea is simple: there is a method
org.apache.lucene.search.highlight.Highlighter#getBestTextFragments
Inside this method you have low-level access to the QueryScorer, which holds several useful attributes:
private Set<String> foundTerms;
private Map<String,WeightedSpanTerm> fieldWeightedSpanTerms;
private Query query;
I'm fairly sure that using this information you should be able to produce the needed output.
One hack I could figure out is to use a different (unique) boost factor for each term in the query, and then retrieve the boost factors for each matched term from the debug score output to deduce which term each score came from.
For example, we can query with foo~2^3.0 bar~2^2.0 (boost matches against foo by 3.0 and matches against bar by 2.0). From the debug score output, check the boost factors:
Result 1: food bars: score <total score 1> = food * 3.0 * <other scoring terms> + bars * 2.0 * <other scoring terms>
Result 2: mars bar: score <total score 2> = bar * 2.0 * <other scoring terms>
From this it is clear that food matched with a boost factor of 3.0, while bars and bar matched with a boost factor of 2.0. By maintaining a lookup dictionary of which term had which boost to begin with, it is easy to figure out which terms matched.
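The lookup-dictionary step can be sketched in a few lines of Java; the boost values and the parsed-out debug factors below are illustrative assumptions, since the exact explain output varies by Solr version:

```java
import java.util.HashMap;
import java.util.Map;

public class BoostDeduction {
    public static void main(String[] args) {
        // Unique boost assigned per query term: foo~2^3.0 bar~2^2.0
        Map<Double, String> boostToTerm = new HashMap<>();
        boostToTerm.put(3.0, "foo");
        boostToTerm.put(2.0, "bar");

        // Boost factors parsed out of the debug score explanation for Result 1
        double[] observedBoosts = {3.0, 2.0};
        for (double b : observedBoosts) {
            System.out.println("boost " + b + " -> matched query term " + boostToTerm.get(b));
        }
    }
}
```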
Two factors to consider:
If the boost factor is 1.0, the Solr debug score does not print it.
Solr might incorporate its own boost factor for a term based on fuzzy matching, TF-IDF, etc. In that case, the boost factor that shows up will not match the boost we supplied in the query. For this reason, we need to execute our query twice: once without any boosting (to learn the default boost of every term), and once with boosting (to see how much it has changed).
Hope this helps someone.
Related
I'm looking into the possibility of de-boosting a set of documents at query
time. In my application, when I search for e.g. "preferences", I want to
de-boost content tagged with ContentGroup:"Developer", or in other words,
push that content down in the ranking. Here's the catch. I have the following
weights on query fields and a boost query on source:
qf=text^6 title^15 IndexTerm^8
As you can see, title has a higher weight.
Now, a bunch of content tagged with ContentGroup:"Developer" consists of a
title like "Preferences.material" or "Preferences Property" or
"Preferences.graphics". The boost on title pushes these documents to the
top.
What I'm looking for is a way to de-boost all documents tagged with
ContentGroup:"Developer", irrespective of whether the term occurs in text or
title. I tried something like the following, but it didn't make any difference:
Source:simplecontent^10 Source:Help^20 (-ContentGroup-local:("Developer"))^99
I'm using edismax query parser.
Any pointers will be appreciated.
Thanks,
Shamik
You're onto something with your last attempt, but you have to start with *:*, so that you actually have something to subtract the documents from. The resulting set of documents (those not matching your query) can then be boosted.
From the Solr Relevancy FAQ
How do I give a negative (or very low) boost to documents that match a query?
True negative boosts are not supported, but you can use a very "low" numeric boost value on query clauses. In general the problem that confuses people is that a "low" boost is still a boost, it can only improve the score of documents that match. For example, if you want to find all docs matching "foo" or "bar" but penalize the scores of documents matching "xxx" you might be tempted to try...
q = foo^100 bar^100 xxx^0.00001 # NOT WHAT YOU WANT
...but this will still help a document matching all three clauses score higher than a document matching only the first two. One way to fake a "negative boost" is to give a large boost to everything that does not match. For example...
q = foo^100 bar^100 (*:* -xxx)^999
NOTE: When using (e)dismax, people sometimes expect that specifying a pure negative query with a large boost in the "bq" param will work (since Solr automatically makes top-level purely negative queries positive by adding an implicit "*:*") -- but this doesn't work with "bq", because queries specified via "bq" are added directly to the main query. You need to be explicit...
?defType=dismax&q=foo bar&bq=(*:* -xxx)^999
I am confused about the qf and tie parameters in eDisMax
According to the document:
The qf parameter is used to specify which fields to search, while tie specifies how much all the other fields (except the highest-scoring field) affect the total score.
My confusion is: since we already specify which field to search (suppose we specify only one field), why are other fields still able to affect the total score? (I guess this must be a misunderstanding of how edismax works, but that is exactly my confusion.)
Or does it mean that each time, edismax calculates scores across all fields and applies tie to the final score (even if we only specify one field)?
No, the tie parameter is not about extra fields being searched. Let me explain the basics of what eDisMax does: when it works against multiple fields, it does not sum the scores across fields (as a boolean query does, for example); instead it chooses the maximum.
E.g. if we have fields A and B, and the score for field A is 3.0 and for B 5.0, then eDisMax takes the score 5.0, completely ignoring the other score.
The tie param lets you configure how much the final score of the query is influenced by the scores of the lower-scoring fields compared to the highest-scoring field.
So, if tie = 0.1, the final score in the previous example will be 5.0 + 0.1 * 3.0 = 5.3
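The arithmetic can be checked directly; this is a sketch of the dismax tie-breaker formula, not Solr's actual implementation:

```java
public class TieBreaker {
    public static void main(String[] args) {
        double tie = 0.1;
        double[] fieldScores = {3.0, 5.0}; // per-field scores for the same document

        double max = Double.NEGATIVE_INFINITY, sum = 0.0;
        for (double s : fieldScores) {
            sum += s;
            max = Math.max(max, s);
        }
        // dismax: highest-scoring field, plus tie times all the other field scores
        double finalScore = max + tie * (sum - max);
        System.out.println(finalScore); // prints 5.3
    }
}
```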
More information about tie param: https://wiki.apache.org/solr/ExtendedDisMax#tie_.28Tie_breaker.29
I have observed that Solr/Lucene gives too much weight to matching all the query terms, compared with the tf of a particular query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B gets a much higher score because it contains all three terms of the query, though each just once, whereas Document A gets a very low score even though it contains one term a large number of times.
Can I create a query in such a manner that if Lucene finds a match for "red jacket" it does not consider it as match for "red" and "jacket" individually ?
I would recommend using a DisjunctionMaxQuery. In raw Lucene (using the older, mutable API), this would look something like:
DisjunctionMaxQuery dismax = new DisjunctionMaxQuery(0.0f); // 0 = tie-breaker multiplier
dismax.add(parser.parse("red"));
dismax.add(parser.parse("jacket"));
dismax.add(parser.parse("red jacket"));
The dismax query will score using the maximum score among its subqueries, rather than the sum of the scores of its subqueries.
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
Tf-idf is what search engines normally do, but not always what you want. In particular, it is not what you want if you want to ignore repeated keywords.
Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequent a word is in a text. idf (inverse document frequency) measures how rare a word is among all documents that you have in a search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
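The arithmetic from the example above, spelled out (base-10 log is assumed to match the example; Lucene's actual Similarity implementations use different formulas):

```java
public class TfIdfExample {
    public static void main(String[] args) {
        double tf = 3.0 / 100.0;                          // "cat" appears 3 times in a 100-word text
        double idf = Math.log10(10_000_000.0 / 1_000.0);  // 10M docs, "cat" in 1,000 of them -> 4.0
        double tfidf = tf * idf;
        System.out.println(tfidf); // prints 0.12
    }
}
```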
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.
I am using a WildcardQuery on documents and I see that the result documents all have a score of 0.5. I read that queries like WildcardQuery do not affect the scoring of documents, so now I am wondering what causes the score of 0.5.
I am using this simple query:
WildcardQuery wq = new WildcardQuery(new Term("field_name", "book"));
WildcardQuery certainly does affect scoring. It uses a constant-score rewrite method (CONSTANT_SCORE_AUTO_REWRITE_DEFAULT), which may be what you are referring to. That means each field match produced by the WildcardQuery contributes an equal boost to the score. There is, however, none of the typical Similarity logic (tf-idf, for instance) applied to the WildcardQuery's matches.
Using Sphinx I can rank documents any way I want:
SELECT *
FROM someIndex
WHERE MATCH('foo bar')
OPTION ranker=expr('<any rank expression>')
How can I achieve the same behavior with Solr? Is {!boost q=<some_boost_expression>} the only way? For example, I need documents with more words to have a higher score:
A: foo bar blah blah blah
B: foo bar
I need A to be more relevant for the foo bar query. Right now B has a higher score.
You can apply boost functions (the bf parameter) to customize your scoring in a more complex way than a simple query-term boost. This is available in the DisMax query parser and, as you might expect, is further extended in the Extended DisMax query parser.
The norm is where you would normally expect to find information about the length of the field, although it is combined with any field-level boost, and your logic (weighting the longer field more heavily) is the reverse of the default scoring. That will make supporting field boosts together with that logic difficult, unless you create a custom Similarity. Norms, by the way, are stored at index time, not calculated at query time, if you decide to take that route.
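One workaround, if a custom Similarity is too heavy, is to index the document length yourself and boost on it; word_count here is a hypothetical field you would populate at index time:

```
q=foo bar&defType=edismax&bf=log(sum(word_count,1))
```

log and sum are standard Solr function queries; the +1 guards against log(0) for empty documents, and the logarithm keeps very long documents from dominating the score.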