I am now using solr autocomplete and search functions, I want to use the popularity of searched terms in ranking the autocomplete suggestions.
For example, if 'usb' was searched 10 times last week, and 'user' was searched 100 time last week, when typing 'us', user should be ranked higher than usb.
Is there any way to fulfill this requirement? Thanks
In short, you need to use a Index time boosting to boost the value of a 'search queries index' - and periodically refresh it.
Lets you there is an index of all searched queries. That index can be created with an index time boost for each doc as a function of number of times searched. The boost factor could just be the number of times searched. https://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts
Eg. search queries - 'foo', 'foo', 'bar' , 'bar', 'bar' , 'abcd' will be added to a new index with 'foo' having a boost of 2,'bar' with 3, 'abcd' with 1.
You can do dynamic search on this index (starting with :) - and adding typed ahead characters to the query. The document score reflects the index time boost.
Eg. : will return docs with highest score first. After user types an 'f' - the term 'f*' returns 'foo' above others because of its higher index time boost.
I don't know 'terms component' behaves here. Its score is based on term frequency - not based on index time boost.
As you accumulate more search requests, you have to re-index with updated boost values that reflect the newer search counts.
Eg. if there is a new search for 'bar' - then you reindex 'bar' document with a boost of 4.
Related
As I understand it, for Multi Value fields Solr boosts scores based on a few things.
Specifically scoring shorter field lengths higher than longer ones (even if the search string is nearer the beginning).
The scoring factors I found in the above link:
termFreq: how often a term appears in the document
idf: how often the term appears across the index
fieldNorm: importance of the term, depending on index-time boosting and field length
However I would like to boost values in a multi value field where the value is nearer the start of the list. For example.
When searching for a document with herceptin PRODUCT 1 should rank higher than PRODUCT 2 - except PRODUCT 2 socres higher due to it's shorter field length.
PRODUCT 1
"herceptin",
"succinimidyl",
"radiolabeling",
"labeling",
"stability",
"discovery",
"potent",
"cb2 agonists",
"agonists",
"linkers",
"yield",
"esters",
"agent",
"syntheses",
"elimination",
"ligands",
"analogue",
"chemistry",
"functionality",
"formation",
"proteins",
"product",
"oxidizing",
"agonist",
"conjugated",
"receptor",
"activity",
"model".
PRODUCT 2
"trastuzumab",
"breast",
"cancer",
"patients",
"breast cancer",
"treatment",
"growth",
"antibody",
"receptor",
"human",
"clinical",
"chemotherapy",
"herceptin",
"combination",
"results".
Any ideas on how I could achieve this?
Thanks
i use solr and i have a trouble with result score. For example
i have such docs with one field (for example "content"):
content = car
content = cars
content = carable awesome
content = awful for carable
And i make search query with such params ":{
"mm":"1",
"q":"car",
"tie":"0.1",
"defType":"dismax",
"fl":"*, score",}
i expect to see the result like this:
car: 5 score
cars: 4.8 score
carable awesome: 3
awful for carable: 3
Word without "s" should be highter, but i have strange things. How i can boost absolute match (like a car)
This happens because the field type you're using for the field has a stemming filter (or an ngramfilter) attached (which makes cars and car generate hits against each other). You can't boost "exact hits" inside such a field, since for Lucene they are the same value. What's stored in the index is the same for both car and cars - the latter is processed down to car as well.
To implement this and get exact hits higher, you add a second field without that filter present that only tokenizes (splits) your content on whitespace and lowercases the token. That way you have a field where cars and car are stored as different tokens, and tokens won't contribute to the score if they're not being matched.
You can use qf in Solr to tell Solr which fields you want to search against, and you can give a boost at the same time - so in your case you'd have qf=exact_field^10 text_field where hits in exact_field would be valued ten times higher than hits in the regular field (the exact boost values will depend on your use case and how you want the query profile to behave).
You can also use the different boost arguments (bq and boost) to apply boosts outside of your regular query (i.e. add a query to bq that replicates your original query), but the previous suggestion will probably work just fine.
Suppose I want to create a recommendation system to suggest people you should connect with based off of certain attributes that I know about you and attributes I have about other people that are stored in a Solr index. Is it possible to query the index with a list of attributes (along with boosts for each attribute) and have Solr return scored results even if some of my fields return no matches? The way that I understand that Solr works is that if one of your fields doesn't contain a match in any documents found in your index, you get zero results for the entire query (even if other fields in the query matched) - is that right? What I would hope is that I could query the index and get a list of results back in order of a score given based on how many (and which) fields matched to something, even if some fields have no matches, for example:
Say that there are 2 people documents stored in the index as follows (figuratively):
Person 1:
Industry: Manufacturing
City: Oakland
Person 2:
Industry: Manufacturing
City: San Jose
And say that I perform a pseudo-Solr query that basically says "Search for everyone whose industry is equal to manufacturing and whose city is equal to Oakland". What I would like is to receive both results back in the result set, even though one of the "Persons" does not reside in Oakland. I just want that person to come back as a result with a lower score than Person1. Is this possible? What might a solr query look like to handle this? Assume that I have many more than 2 attributes for each person (so saying that I can use "And" and "Or" in my solr query isn't really feasible.. or is it?) Thanks in advance for your helpful input! (PS I'm using Solr 3.6)
You mention using the AND operator, which is likely your problem.
The default behavior of Lucene, and Solr, query syntax is exactly what you are asking for. A query like:
industry:manufacturing city:oakland
Will match either, with scoring preference on those that match both. See the lucene query syntax documentation
You can use the bq parameter (boost query) does not affect matching, but affects the scores only.
http://localhost:8983/solr/persons/select?q=industry:manufacturing&bq=City:Oakland^2
play with the boosting factor at the end to get the correct balance between matching score, and boosting score.
I have observed that Solr/Lucene gives too much weightage to matching all the query terms over tf of a particular query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B is getting much higher score as its containing all the three terms of the query but just once whereas Document A is getting very low score even though it contains one term large number of times.
Can I create a query in such a manner that if Lucene finds a match for "red jacket" it does not consider it as match for "red" and "jacket" individually ?
I would recommend using a DisjunctionMaxQuery. In raw Lucene, this would look something like:
Query dismax = new DisjunctionMaxQuery(0);
dismax.add(parser.parse("red"));
dismax.add(parser.parse("junction"));
dismax.add(parser.parse("red jacket"));
The dismax query will score using the maximum score among it's subqueries, rather than the product of the scores of it's subqueries.
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
Tf-idf is what search engines normally do but not what you always want. It is not what you want if you want to ignore repeated key words.
Tf-idf is calculated as the product of to factors: tf x idf. tf (term frequency) is how frequent a word is in a text. idf (inverse document frequency) means how unique a word is among all documents that you have in a search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.
I've got a text field that can potentially have multiple values.
doc 1:
field a:"X Y"
doc 2:
field a:"X"
I want to be able to do :
a:X^5
And have both doc 1 and 2 get an identical score.
I've been messing around with all the field options, but I always end up with doc 2 getting double the score of doc 1.
I've tried setting multiValued="true", but get the same result.
Is there someway that I can set my search or the field definition so that it will boost just based upon the existence of the search term and not be effected by the rest of the field's contents.
Disable norms by setting omitNorms=true in your schema and reindex - it should disable the length normalization for the field and give you the desired results.
For more details of what omitNorms does, see this.
The field a of doc 2 has only one term as compared to doc 1 which has two.
Solr DefaultSimilartiy implementation takes into account the length norm, number of terms in the field, for the fields when calculating the score.
LenghtNorm is 1.0 / Math.sqrt(numTerms)
LengthNorm allows you to make shorter documents score higher.
You can provide your own implementation of Similarity class which doesn't take into account the lengthNorm.
Check computeNorm method implementation.
You can turn of the Norms using omitNorms=false.
Norms allow for index time boosts and field length normalization. This allows you to add boosts to fields at index time and makes shorter documents score higher.
So you would lose both of the above if you use it.