Boosting documents in Solr based on the vote count - solr

I have a field in my schema which holds the number of votes a document has. How can I boost documents based on that number?
Something like the one which has the maximum number has a boost of 10, the one with the smallest number has 0.5 and in between the values get calculated automatically.
What I do now is this, but it doesn't give the desired results:
recip(rord(vote_count),1,1000,1000)^10.0
Thanks.

i tend to build my indexes using raw lucene, in which case it is extremely easy,
doc.setBoost(boost_val);

I'm just starting on this and it looks like either a linear boost or log based boost will help most: i.e. log(votecount)^10 (don't forget ^10 means boost times 10, not to the tenth power.

Related

Apache Solr's bizarre search relevancy rankings

I'm using Apache Solr for conducting search queries on some of my computer's internal documents (stored in a database). I'm getting really bizarre results for search queries ordered by descending relevancy. For example, I have 5 words in my search query. The most relevant of 4 results, is a document containing only 2 of those words multiple times. The only document containing all the words is dead last. If I change the words around in just the right way, then I see a better ranking order with the right article as the most relevant. How do I go about fixing this? In my view, the document containing all 5 of the words, should rank higher than a document that has only two of those words (stated more frequently).
What Solr did is a correct algorithm called TF-IDF.
So, in your case, order could be explained by this formula.
One of the possible solutions is to ignore TF-IDF score and count one hit in the document as one, than simply document with 5 matches will get score 5, 4 matches will get 4, etc. Constant Score query could do the trick:
Constant score queries are created with ^=, which
sets the entire clause to the specified score for any documents
matching that clause. This is desirable when you only care about
matches for a particular clause and don't want other relevancy factors
such as term frequency (the number of times the term appears in the
field) or inverse document frequency (a measure across the whole index
for how rare a term is in a field).
Possible example of the query:
text:Julian^=1 text:Cribb^=1 text:EPA^=1 text:peak^=1 text:oil^=1
Another solution which will require some scripting will be something like this, at first you need a query where you will ask everything contains exactly 5 elements, e.g. +Julian +Cribb +EPA +peak +oil, then you will do the same for combination of 4 elements out of 5, if I'm not mistaken it will require additional 5 queries and back forth, until you check everything till 1 mandatory clause. Then you will have full results, and you only need to normalise results or just concatenate them, if you decided that 5-matched docs always better than 4-matched docs. Cons of this solution - a lot of queries, need to run them programmatically, some script would help, normalisation isn't obvious. Pros - you will keep both TF-IDF and the idea of matched terms.

How can I sort facets by their tf-idf score, rather than popularity?

For a specific facet field of our Solr documents, it would make way more sense to be able to sort facets by their relative "interesting-ness" i.e. their tf-idf score, rather than by popularity. This would make it easy to automatically get rid of unwanted common English words, as both their TF and DF would be high.
When a query is made, TF should be calculated, using all the documents that participate in teh results list.
I assume that the only problem with this approach would be when no query is made, resp., when one searches for ":". Then, no term will prevail over the others in terms of interestingness. Please, correct me if I am wrong here.
Anyway,is this possible? What other relative measurements of "interesting-ness" would you suggest?
facet.sort
This param determines the ordering of the facet field constraints.
count - sort the constraints by count (highest count first) index - to
return the constraints sorted in their index order (lexicographic by
indexed term). For terms in the ascii range, this will be
alphabetically sorted. The default is count if facet.limit is greater
than 0, index otherwise.
Prior to Solr1.4, one needed to use true instead of count and false
instead of index.
This parameter can be specified on a per field basis.
It looks like you couldn't do it out of the box without some serious changes on client side or in Solr.
This is a very interesting idea and I have been searching around for some time to find a solution. Anything new in this area?
I assume that for facets with a limited number of possible values, an interestingness-score can be computed on the client side: For a given result set based on a filter, we can exclude this filter for the facet using the local params-syntax (!tag & !ex) Local Params - On the client side, we can than compute relative compared to the complete index (or another subpart of a filter). This would probably not work for result sets build by a query-parameter.
However, for an indexed text-field with many potential values, such as a fulltext-field, one would have to retrieve df-counts for all terms. I imagine this could be done efficiently using the terms component and probably should be cached on the client-side / in memory to increase efficiency. This appears to be a cumbersome method, however, and doesn't give the flexibility to exclude only certain filters.
For these cases, it would probably be better to implement this within solr as a new option for facet.sort, because the information needed is easily available at the time facet counts are computed.
There has been a discussion about this way back in 2009.
Currently, with the larger flexibility of facet.json, e.g. sorting on stats-facets (e.g. avg(price)) of another field, I guess this could be implemented as an additional sort-option. At least for facets of type term, the result-count (df for current result-set) only needs to be divided by the df of that term for the index (docfreq). If the current result-set is the complete index, facets should be sorted by count.
I will probably implement a workaround in the client for fields with a fixed and rather small vocabulary, e.g. based on a second, cashed query on the complete index. However, for term-fields and similar this might not scale.

Solr: character proximity ranks misspellings higher because of inverse document frequency

I'm using character proximity to allow for some misspellings, for example:
text:manager~1
This allows both 'manager' and 'managre' to be matched. The problem is, the misspellings are always ranked higher than the proper spelling because there are fewer of those in the index. For example, let's say I have 3 documents as follows:
1) text:manager
2) text:manager
3) text:managre
Then the character proximity query above will give an inverse document frequency (idf) of 1.7 to 'managre' and 1.2 to 'manager', effectively ranking the misspelled 'managre' higher. From a technical perspective, this makes sense (there are fewer occurances of 'managre' than 'manager'), but in reality, this doesn't make sense. Is there a way to get Solr to set the idf of misspelled words to match that of the correct spelling?
Short answers is No. Long answer is you have good options here, You need to solve this in a different way.
To begin with take the power of query time boosting. So you can query something like:
text:manager^1.2 OR text:manager~1^0.8
Here you are saying my user is smart so i will give higher boost to user query, but just incase I will give it's variance a bit lower boost. You need to do a boolean query of exact match with higher boost with a Boolean OR query of fuzzy query so that exact matches ranks higher. Do not worry about extra work for solr. It is built for very complex Lucene query trees. Using a combination of queries to get expected relevancy ranking is common practice.
TF , IDF and solr's in built relevancy ranking arbitrary and framing query with boosts, boolean queries, and context based filters is where power and flexibility of solr exists.

what this `^` mean here in solr

I am confuse her but i want to clear my doubt. I think it is stupid question but i want to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
what this ^ mean here .
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This is giving a boost (making it more important) to the value NeXT versus next in this query. From the wiki page you linked to "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on Boosting please see the Boosting Ranking Terms section in your the Solr Relevancy Cookbook. This Slide Deck about Boosting from the Lucene Revolution Conference earlier this year, also contains good information on how boosting works and how to apply it to various scenarios.
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher/more relevant in this query because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of highest boost value), any documents matching term:Next will be scored lower than term:NeXT, but higher than term:next.

Explanation of solr max score

When I choose to view the score field in solr results I see the score assigned by solr to every document returned and a maxscore value that is the score of the topmost returned document.
I need to know is there a cut-off to the solr score or not. I mean if the maxscore is 6.89343 or 2.34365, so does this mean that it is 6.89343 of 10 as the final score? or how can I decide that I'm close to the most correct result.
If possible, I need a simple explanation of the scoring algorithm used by solr.
The maxscore is the scoring of the topmost document in the search results.
There is no cutoff for the maxscore, and depends upon the scoring calculations and normalization done by Lucene/Solr.
The topmost document would have the maxscore, while you would get an idea from the scores of the documents below it, as to how off they are from the topmost.
For Scoring explaination you can check link
If it is indeed a z-score from a normal distribution then you can calculate the CDF (as it appears here ). The CDF will give you a bounded score from 0 to 1. Its hard for me to interpret what the CDF really means in this case given the un-normalized score is calculated in several steps, but you can sort of think of it as the probability that you got the right answer as long as your collection is well populated with the relevant material.

Resources