In Apache Solr there is an omitTermFreqAndPositions property, and there is an omitPositions property. Is there a built in way to omit term frequency but preserve term positions when a field's score is calculated, or is it otherwise simple to do so?
No, not unless you use a custom similarity class. These are field specific from Solr 4.x, so you can have a custom similarity for one field if you don't want term frequency to contribute to the score by returning 1.0f for the termfreq regardless of how many times the term occurs in the field.
Related
We're having some relevance issues with Solr results. In this particular example we have product A showing up above product B. Product A's title contains the search term. Product B's title also contains the search term along with its Description and Category Name. So logically, Product B should be more relevant and appear above Product A, but it does not.
The schema is configured to take all of these extra fields into account. After analyzing the debug info of the query with ...&debugQuery=true&debug.explain.structured=trueit appears that both products have achieved the same score. Looking further, I can see these extra fields having scores calculated, but for some reason, the parser only takes the maximum of these scores instead of the sum which causes it to be the same:
Is there a reason that Solr behaves this way? Is there any way to change this behavior to use the sum instead of the max? (Just like in the parent element in the images)
You can control how the score is calculated using the tie parameter, provided that you are using Dismax/eDismax query parser.
Solr documentation explains it very well :
tie (Tie Breaker) parameter :
The tie parameter specifies a float value (which should be something
much less than 1) to use as tiebreaker in DisMax queries.
When a term from the user’s input is tested against multiple fields,
more than one field may match. If so, each field will generate a
different score based on how common that word is in that field (for
each document relative to all other documents).
The tie parameter lets
you control how much the final score of the query will be influenced
by the scores of the lower scoring fields compared to the highest
scoring field.
A value of "0.0" - the default - makes the query a pure "disjunction
max query": that is, only the maximum scoring subquery contributes to
the final score.
A value of "1.0" makes the query a pure "disjunction
sum query" where it doesn’t matter what the maximum scoring sub query
is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
Learning how to use Lucene!
I have an index in Lucene which is configured to store term vectors.
I also have a set of documents I have already constructed custom term vectors for (for an unrelated purpose) not using Lucene.
Is there a way to insert them directly into the Lucene inverted index in lieu of the original contents of the documents?
I imagine one way to do this would be to generate bogus text using the term vector with the appropriate number of term occurrences and then to feed the bogus text as the contents of the document. This seems silly because ultimate Lucene will have to convert the bogus text back into a term vector in order to index.
I'm not entirely sure what you want to do with these term vectors ultimately (score? just retrieve?) but here's one strategy I might advocate for.
Instead of focusing on faking out the text attribute of term vectors, consider looking into payloads which attach arbitrary metadata to each token. During analysis, text is converted to tokens. This includes emitting a number of attributes about each token. There's standard attributes like position, term character offsets, and the term string itself. ALL of these can be part of the uninverted term vector. Another attribute is the payload which is arbitrary metadata you can attach to a term.
You can store any token attribute uninverted as a "term vector" including payloads, which you can access at scoring time.
To do this you need to
Configure your field to store term vectors, including term vectors with payload
Customize analysis to emit payloads that correspond to your terms. You can read more here
Use an IndexReader.getTermVector to pull back Terms. From that you can get a TermsEnum. You can then use that to get a DocsAndPositionEnum which has an accessor for the current payload
If you want to use this in scoring, consider a custom query or custom score query
We are consolidating all our collected content on a record in a single content field, which is the main source for SOLR. The problem is that for some records the content field has only 100K characters, for others 10M or more.
As a result, a search on any term will push 10M character records to the top of the result list.
We would like to limit/counterbalance that by introducing something like "relative term frequency" eg the number of occurrences divided by total number of words in the content field.
Since we don't know what terms people will search on, (I think) we cannot calculate this at indexing time.
Any suggestions/ideas on how to do this?
You can start with the Custom Similarity class.
This would allow you to modify the above parameters and scoring factors.
You need to check the tf (term frequency) method and customized it.
The Custom Similarity class can be refereed from the Schema.xml file.
Check the lucene DefaultSimilarity class for reference which is the actual implementation.
Also check Changing Similarity
Is it possible to do a custom multiplication of the Score returned by Solr? We have a factor in the range of 1.00-1.30 based on our own formula and I wish to just multiply the "final" Solr score with this - without having it normalized.
I've tried using various boosts in DisMax, but none of them produce the desired result, because 1) custom value is added (not multiplied) to the score and 2) they are normalized (queryNorm) before addition.
I found a way to do this. Using the Extended DisMax query parser, introduced in 3.1, it offers all the same features as the normal DisMax, but with a few useful enhancements.
The one I needed was the boost parameter. It acts the same way as the bf parameter from DisMax, but instead of adding a normalized value to the score, it multiplies the boost into the score (without any normalization).
For more info, see the Solr Wiki on ExtendedDisMax
I am confuse her but i want to clear my doubt. I think it is stupid question but i want to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
what this ^ mean here .
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This is giving a boost (making it more important) to the value NeXT versus next in this query. From the wiki page you linked to "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on Boosting please see the Boosting Ranking Terms section in your the Solr Relevancy Cookbook. This Slide Deck about Boosting from the Lucene Revolution Conference earlier this year, also contains good information on how boosting works and how to apply it to various scenarios.
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher/more relevant in this query because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of highest boost value), any documents matching term:Next will be scored lower than term:NeXT, but higher than term:next.