Solr TF vs All Terms match

I have observed that Solr/Lucene gives much more weight to matching all the query terms than to the term frequency (tf) of any single query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B gets a much higher score because it contains all three terms of the query, even though each occurs only once, whereas Document A gets a very low score even though it contains one term a large number of times.
Can I create a query in such a manner that if Lucene finds a match for "red jacket", it does not also count it as a match for "red" and "jacket" individually?

I would recommend using a DisjunctionMaxQuery. In raw Lucene (using the older API, where subqueries are added with add()), this would look something like:
// parser is an already configured QueryParser; 0.0f is the tie-breaker multiplier
DisjunctionMaxQuery dismax = new DisjunctionMaxQuery(0.0f);
dismax.add(parser.parse("red"));
dismax.add(parser.parse("jacket"));
dismax.add(parser.parse("\"red jacket\""));
The dismax query will score using the maximum score among its subqueries, rather than the sum of the scores of its subqueries.
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
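To make the difference concrete, here is a minimal plain-Java sketch (not Lucene code) that compares summing clause scores with the disjunction-max rule of "best clause plus tie-breaker times the rest". The class name and the per-clause scores are invented for illustration; real scores depend on the index and the similarity in use.

```java
import java.util.List;

// Plain-Java illustration only; none of this is Lucene API.
final class DisMaxSketch {
    // Sum combination: every matching clause adds to the score.
    static double sumScore(List<Double> clauseScores) {
        double sum = 0.0;
        for (double s : clauseScores) {
            sum += s;
        }
        return sum;
    }

    // Disjunction-max combination: best clause, plus tie * (all other clauses).
    static double disMaxScore(List<Double> clauseScores, double tie) {
        double max = 0.0;
        double sum = 0.0;
        for (double s : clauseScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tie * (sum - max);
    }

    public static void main(String[] args) {
        // Hypothetical clause scores for: "red", "jacket", "red jacket".
        List<Double> docA = List.of(0.0, 3.0, 0.0); // only "jacket", high tf
        List<Double> docB = List.of(1.0, 1.0, 1.5); // all three clauses, once each
        System.out.println("sum:    A=" + sumScore(docA) + " B=" + sumScore(docB));
        System.out.println("dismax: A=" + disMaxScore(docA, 0.0) + " B=" + disMaxScore(docB, 0.0));
    }
}
```

With these made-up numbers, Document B wins under summation, while Document A wins under dismax with a tie-breaker of 0, because only its single best clause counts.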

Tf-idf is what search engines normally do but not what you always want. It is not what you want if you want to ignore repeated key words.
Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequent a word is in a text. idf (inverse document frequency) measures how rare a word is across all the documents in the search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
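The arithmetic in the cat example can be reproduced with a few lines of Java. This is only a sketch of the textbook formula quoted above (base-10 logarithm, raw frequency ratio); Lucene's similarity implementations use their own variants, so do not expect these numbers to match Solr's debug output. The class and method names are made up for the example.

```java
// Textbook tf-idf as described in the paragraph above; not Lucene's formula.
final class TfIdfSketch {
    // Fraction of the words in the document that are the term.
    static double tf(int termCount, int docLength) {
        return (double) termCount / docLength;
    }

    // Base-10 log of (all documents / documents containing the term).
    static double idf(long totalDocs, long docsWithTerm) {
        return Math.log10((double) totalDocs / docsWithTerm);
    }

    static double tfIdf(int termCount, int docLength, long totalDocs, long docsWithTerm) {
        return tf(termCount, docLength) * idf(totalDocs, docsWithTerm);
    }

    public static void main(String[] args) {
        // "cat" appears 3 times in a 100-word text; 1,000 of 10,000,000 docs contain it.
        System.out.println("tf     = " + tf(3, 100));
        System.out.println("idf    = " + idf(10_000_000L, 1_000L));
        System.out.println("tf-idf = " + tfIdf(3, 100, 10_000_000L, 1_000L));
    }
}
```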
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.

Related

Solr search relevancy

I use Solr and I am having trouble with result scores. For example,
I have these docs with one field (called "content" here):
content = car
content = cars
content = carable awesome
content = awful for carable
I run a search query with these params:
{
"mm":"1",
"q":"car",
"tie":"0.1",
"defType":"dismax",
"fl":"*, score"
}
I expect to see results like this:
car: 5 score
cars: 4.8 score
carable awesome: 3
awful for carable: 3
The word without the "s" should rank higher, but I get strange results. How can I boost an exact match (like "car")?

This happens because the field type you're using for the field has a stemming filter (or an ngramfilter) attached (which makes cars and car generate hits against each other). You can't boost "exact hits" inside such a field, since for Lucene they are the same value. What's stored in the index is the same for both car and cars - the latter is processed down to car as well.
To implement this and get exact hits higher, you add a second field without that filter present that only tokenizes (splits) your content on whitespace and lowercases the token. That way you have a field where cars and car are stored as different tokens, and tokens won't contribute to the score if they're not being matched.
You can use qf in Solr to tell Solr which fields you want to search against, and you can give a boost at the same time - so in your case you'd have qf=exact_field^10 text_field where hits in exact_field would be valued ten times higher than hits in the regular field (the exact boost values will depend on your use case and how you want the query profile to behave).
You can also use the different boost arguments (bq and boost) to apply boosts outside of your regular query (i.e. add a query to bq that replicates your original query), but the previous suggestion will probably work just fine.
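As a rough illustration of the qf suggestion, the sketch below assembles the request parameters in plain Java and URL-encodes them. It assumes Java 10 or later (for URLEncoder.encode with a Charset). The field names exact_field and text_field come from the answer above; the class, the method names, and the boost of 10 are placeholders, and the schema itself still has to define both fields.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper; it only builds a query string and does not talk to Solr.
final class QfParamsSketch {
    // Parameters for a dismax query that favours hits in the unstemmed field.
    static Map<String, String> exactFirstParams(String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", userQuery);
        params.put("defType", "dismax");
        // exact_field: whitespace-tokenized and lowercased only (assumed to exist).
        params.put("qf", "exact_field^10 text_field");
        return params;
    }

    // Joins the parameters as key=value pairs with URL-encoded values.
    static String buildQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(e.getKey() + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println("select?" + buildQueryString(exactFirstParams("car")));
    }
}
```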

Solr Highlight matching query terms

I am using Solr to do a fuzzy search (e.g., foo~2 bar~2). Highlighting allows me to highlight matching document fragments from the resultset.
For example:
Result 1: <em>food</em> <em> bars</em>
Result 2: mars <em>bar</em>
and so on.
For each match highlighted in the document, I need to figure out which query term the fragment matched against, along with the offsets of that query term in the query. Something like:
Result 1: {<em>food</em> MATCHED_AGAINST foo QUERY_OFFSET 0,2} {<em> bars</em> MATCHED_AGAINST bar QUERY_OFFSET 3,5}
Result 2: mars {<em>bar</em> MATCHED_AGAINST bar QUERY_OFFSET 3,5}
Is there a way to do this in Solr?

One possibility would be to write a custom Highlighter that produces the needed information. The idea is simple. You have the method
org.apache.lucene.search.highlight.Highlighter#getBestTextFragments
and in this method you have low-level access to the QueryScorer, which holds several useful attributes such as
private Set<String> foundTerms;
private Map<String,WeightedSpanTerm> fieldWeightedSpanTerms;
private Query query;
I'm fairly sure that, using this information, you should be able to produce the needed output.

One hack I could figure out is to use different (unique) boost factors for each term in the query, and then retrieving boost factors for each matched term from the debug score so as to deduce which term that score came from.
For example, we can query with foo~2^3.0 bar~2^2.0 (boost scores from matches against foo by 3.0 and scores from matches against bar by 2.0). From the debug score output, check the boost factors:
Result 1: food bars: score <total score 1> = food * 3.0 * <other scoring terms> + bars * 2.0 * <other scoring terms>
Result 2: mars bar: score <total score 2> = bar * 2.0 * <other scoring terms>
From which it is clear that food matched with boost factor of 3.0, and bars as well as bar matched with boost factor of 2.0. Maintaining a lookup dictionary for which term had what boost to begin with, it is easy to figure out which terms matched.
Two factors to consider:
If the boost factor is 1.0, solr debug score does not print it.
Solr might incorporate some default boost factor for the term based on fuzzy matching, TF-IDF, etc. In this case, the boost factor that shows up will not match against the boosts we supplied in the query. For this reason, we need to execute our query twice - once without any boosting (to understand default boosting for every term), and once with boosting (to see how much it has changed now).
Hope this helps someone.
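The bookkeeping side of this hack can be sketched in plain Java: assign each query term a unique boost, build the query string, and later map a boost seen in the debug output back to its term. Everything here (class name, method names, tolerance) is hypothetical, and extracting the boost from Solr's debug output is deliberately left out because that format varies between versions.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for the unique-boost trick; it does not parse Solr output.
final class BoostLookupSketch {
    // Builds e.g. "foo~2^3.0 bar~2^2.0" from term -> boost entries.
    static String buildQuery(Map<String, Double> boosts, int maxEdits) {
        StringJoiner query = new StringJoiner(" ");
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            query.add(e.getKey() + "~" + maxEdits + "^" + e.getValue());
        }
        return query.toString();
    }

    // Returns the query term whose boost equals the observed one, or null if none does.
    static String termForBoost(Map<String, Double> boosts, double observedBoost) {
        for (Map.Entry<String, Double> e : boosts.entrySet()) {
            if (Math.abs(e.getValue() - observedBoost) < 1e-9) {
                return e.getKey();
            }
        }
        return null;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new LinkedHashMap<>();
        boosts.put("foo", 3.0); // avoid 1.0: the debug output does not print it
        boosts.put("bar", 2.0);
        System.out.println(buildQuery(boosts, 2));
        System.out.println(termForBoost(boosts, 2.0));
    }
}
```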

Solr negative boost

I'm looking into the possibility of de-boosting a set of documents during
query time. In my application, when I search for e.g. "preferences", I want
to de-boost content tagged with ContentGroup:"Developer" or in other words,
push those content back in the order. Here's the catch. I've the following
weights on query fields and boost query on source
qf=text^6 title^15 IndexTerm^8
As you can see, title has a higher weight.
Now, a bunch of content tagged with ContentGroup:"Developer" consists of a
title like "Preferences.material" or "Preferences Property" or
"Preferences.graphics". The boost on title pushes these documents at the
top.
What I'm looking for is a way to de-boost all documents that are
tagged with ContentGroup:"Developer", irrespective of whether the term occurs in
text or title. I tried something like the following, but it didn't make any difference.
Source:simplecontent^10 Source:Help^20 (-ContentGroup-local:("Developer"))^99
I'm using edismax query parser.
Any pointers will be appreciated.
Thanks,
Shamik

You're onto something with your last attempt, but you have to start with *:*, so that you actually have something to subtract the documents from. The resulting set of documents (those not matching your query) can then be boosted.
From the Solr Relevancy FAQ
How do I give a negative (or very low) boost to documents that match a query?
True negative boosts are not supported, but you can use a very "low" numeric boost value on query clauses. In general the problem that confuses people is that a "low" boost is still a boost, it can only improve the score of documents that match. For example, if you want to find all docs matching "foo" or "bar" but penalize the scores of documents matching "xxx" you might be tempted to try...
q = foo^100 bar^100 xxx^0.00001 # NOT WHAT YOU WANT
...but this will still help a document matching all three clauses score higher than a document matching only the first two. One way to fake a "negative boost" is to give a large boost to everything that does not match. For example...
q = foo^100 bar^100 (*:* -xxx)^999
NOTE: When using (e)dismax, people sometimes expect that specifying a pure negative query with a large boost in the "bq" param will work (since Solr automatically makes top level purely negative queries positive by adding an implicit "*:*"), but this doesn't work with "bq", because of how queries specified via "bq" are added directly to the main query. You need to be explicit...
?defType=dismax&q=foo bar&bq=(*:* -xxx)^999
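Applied to the question, the same pattern would boost everything that is not tagged as Developer content. The tiny helper below only formats such a clause; its name and the boost of 99 are illustrative, and the field name ContentGroup is taken from the question (adjust it if the indexed field is actually ContentGroup-local).

```java
// Hypothetical formatter for a "boost everything that does NOT match" clause.
final class NegativeBoostSketch {
    // Produces (*:* -field:"value")^boost, suitable for the bq parameter.
    static String boostNonMatching(String field, String value, int boost) {
        return "(*:* -" + field + ":\"" + value + "\")^" + boost;
    }

    public static void main(String[] args) {
        System.out.println("bq=" + boostNonMatching("ContentGroup", "Developer", 99));
    }
}
```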

If WildcardQuery doesn't affect the scoring of documents, why does it return 0.5 constantly?

I am using a WildcardQuery on documents and I see that all of the result documents have a score of 0.5. I read that queries like WildcardQuery do not affect the scoring of documents, and now I am wondering why the score is 0.5.
I am using this simple query:
WildcardQuery wq = new WildcardQuery(new Term("filed_name", "book"));

WildcardQuery certainly does affect scoring. It uses CONSTANT_SCORE_AUTO_REWRITE, which may be what you are referring to. That means that every document matching the WildcardQuery gets the same constant contribution to its score from that match. There is, however, none of the typical Similarity logic (tf-idf, for instance) applied to the WildcardQuery's matches.

Solr Fuzzy search in multiValued field with max distance between terms

Hello stackOverflowers
I have a Solr document collection with a field called
names_txt. This is a multiValued="true" field.
It contains the names of all the persons associated with a document.
I want to be able to do a fuzzy search and at the same time limit the number of terms between the two matching terms.
The query
names_txt:("markus foss"~2)
will return all documents that contain the terms markus and foss with at most 2 terms between them.
But when I search in a fuzzy way AND also want to specify the max number of terms between the matches, I can't get the syntax right.
The query:
names_txt:(markus~0.7 foss~0.7)
This does work, but returns false positives, since it will return a document with "markus something" in one value and "foss somethingElse" in another.
What I would like to write is:
(markus~0.7 foss~0.7)~2
but this syntax is illegal in Solr.
Anyone out there have a solution for my problem?

Since a single query clause in Solr can carry either a word distance constraint or a fuzzy search constraint, but not both, we will need two clauses for this:
names_txt:("markus foss"~2) AND names_txt:(markus~0.7 foss~0.7)
Note that quantifying fuzziness by a float number is deprecated. Internally, Lucene converts the float number to an int between 0 and 2 anyway, so we should use this integer (Damerau-Levenshtein) edit distance right from the beginning in our search terms. So my final proposal is:
names_txt:("markus foss"~2) AND names_txt:(markus~1 foss~1)
(For those who are interested: The deprecated, somewhat quirky function that converts the similarity float to an edit distance int can be found at the end of this code file.)
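For reference, the float-to-edits conversion can be sketched as below. This is written from memory of the floatToEdits helper in Lucene 4.x's FuzzyQuery and should be treated as an assumption to check against the actual source; the class name and the MAX_EDITS constant are stand-ins.

```java
// Sketch modeled on a recollection of Lucene 4.x FuzzyQuery.floatToEdits; verify before relying on it.
final class FuzzyEditsSketch {
    // Largest edit distance the fuzzy automaton supports.
    static final int MAX_EDITS = 2;

    // Converts a legacy similarity (0..1) or a raw edit count (>= 1) to an edit distance.
    static int floatToEdits(float minimumSimilarity, int termLength) {
        if (minimumSimilarity >= 1f) {
            // Values >= 1 are already edit counts; cap them at the maximum.
            return (int) Math.min(minimumSimilarity, MAX_EDITS);
        } else if (minimumSimilarity == 0.0f) {
            return 0;
        } else {
            // Allowed edits scale with term length, truncated and capped.
            return Math.min((int) ((1D - minimumSimilarity) * termLength), MAX_EDITS);
        }
    }

    public static void main(String[] args) {
        System.out.println("markus~0.7 -> " + floatToEdits(0.7f, "markus".length()));
        System.out.println("foss~0.7   -> " + floatToEdits(0.7f, "foss".length()));
    }
}
```

Under this sketch both markus~0.7 and foss~0.7 come out as an edit distance of 1, which is consistent with the ~1 used in the final query above.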

I think you could do that using SpanQuery. The issue is that the usual query parsers in Solr don't support span queries. Look at this article, which mentions the parsers that do support spans: Surround, Xml-Query-Parser and Qsol. But check the status of each in the current Solr version.