Boost rare results in solr - solr

In a collection there are several different categories of documents. I want the highest ranked search results to be the documents from categories where, for the specific query, there are fewest matching documents.
Concrete example
Let the categories be "foo", "bar", and "baz". If I were to search for "Fred", faceted by category, I would get back the following counts:
foo: 17
bar: 1
baz: 201312
I want to construct a search and/or configure the index such that the one match from the "bar" category would be top of the search results, the 17 "foo" matches would be next, and finally the "baz" matches.
One way I think I could do this would be first to do a faceted search to get the count of matching documents in each category, and then do a second search with boosts based on the category counts - something along the lines of bq=category:bar^10000&bq=category:foo^100; the boosts of 10000 and 100 would obviously be derived from the facet counts and inserted into the query.
I would like to know if something roughly equivalent to this could be achieved in a more efficient way using only a single query, i.e. avoiding the need for a pre-query to fetch the facet counts.
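For reference, the two-query approach described above could be sketched in Python as follows. This is only an illustration: it assumes the facet counts from the first query arrive as an alternating value/count list, and the inverse-frequency boost formula and both helper names are hypothetical, not part of Solr.

```python
def pairs_from_flat(flat):
    """Turn an alternating [value, count, value, count, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


def build_boost_queries(counts, field="category"):
    """Return bq strings whose boost grows as the facet count shrinks."""
    matching = {value: n for value, n in counts.items() if n > 0}
    if not matching:
        return []
    largest = max(matching.values())
    boosts = []
    # Rarest category first, so the list order mirrors the desired ranking.
    for value, n in sorted(matching.items(), key=lambda item: item[1]):
        boost = round(largest / n, 2)  # hypothetical formula: inverse frequency
        # Values are quoted; values containing double quotes would need escaping.
        boosts.append(f'{field}:"{value}"^{boost}')
    return boosts


# Facet counts from the example search for "Fred".
counts = pairs_from_flat(["foo", 17, "bar", 1, "baz", 201312])
bq = build_boost_queries(counts)
print(bq)
```

Each string in bq would be sent as a separate bq parameter on the second query. This still costs two round trips, so it does not answer the single-query part of the question.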

Related

Solr search relevancy

I use Solr and I am having trouble with result scores. For example,
I have these docs with one field (for example "content"):
content = car
content = cars
content = carable awesome
content = awful for carable
And I run a search query with these params:
{
"mm":"1",
"q":"car",
"tie":"0.1",
"defType":"dismax",
"fl":"*, score"
}
I expect to see results like this:
car: 5 score
cars: 4.8 score
carable awesome: 3
awful for carable: 3
The word without "s" should rank higher, but I get strange results. How can I boost an exact match (like "car")?
This happens because the field type you're using has a stemming filter (or an n-gram filter) attached, which makes cars and car match each other. You can't boost "exact hits" inside such a field, since for Lucene they are the same value: what's stored in the index is identical for both car and cars, because the latter is reduced to car as well.
To implement this and get exact hits higher, you add a second field without that filter present that only tokenizes (splits) your content on whitespace and lowercases the token. That way you have a field where cars and car are stored as different tokens, and tokens won't contribute to the score if they're not being matched.
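To illustrate why the second field helps, here is a toy Python sketch of the two analysis chains. Both functions are hypothetical stand-ins, not Solr's analyzers: the "stemmer" only strips a trailing "s", which is far cruder than a real stemming filter.

```python
def stemmed_tokens(text):
    """Toy stand-in for a stemming field: lowercase, split, strip a trailing 's'."""
    tokens = text.lower().split()
    return [t[:-1] if t.endswith("s") and len(t) > 1 else t for t in tokens]


def exact_tokens(text):
    """Toy stand-in for the exact field: lowercase and split on whitespace only."""
    return text.lower().split()


query = "car"
for content in ["car", "cars"]:
    # In the stemmed field both documents match; in the exact field only "car" does.
    print(content, query in stemmed_tokens(content), query in exact_tokens(content))
```

Because "cars" only matches in the stemmed field, a document containing exactly "car" picks up score from both fields and can be ranked higher.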
You can use qf in Solr to tell Solr which fields you want to search against, and you can give a boost at the same time - so in your case you'd have qf=exact_field^10 text_field where hits in exact_field would be valued ten times higher than hits in the regular field (the exact boost values will depend on your use case and how you want the query profile to behave).
You can also use the different boost arguments (bq and boost) to apply boosts outside of your regular query (i.e. add a query to bq that replicates your original query), but the previous suggestion will probably work just fine.
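As a rough sketch, the request parameters for the qf suggestion could be assembled in Python like this. The field names exact_field and text_field come from the answer above; the build_qf helper is hypothetical, and the boost of 10 is just the example value.

```python
from urllib.parse import urlencode


def build_qf(field_boosts):
    """Join (field, boost) pairs into a qf string; a boost of None means no boost."""
    parts = []
    for field, boost in field_boosts:
        parts.append(field if boost is None else f"{field}^{boost}")
    return " ".join(parts)


params = {
    "q": "car",
    "defType": "dismax",
    "qf": build_qf([("exact_field", 10), ("text_field", None)]),
    "fl": "*,score",
}
# URL-encoded query string that could be appended to the select handler URL.
print(urlencode(params))
```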

Solr - Nested Edismax Query

I am using Solr (with pySolr) to search products in my database, returning products, facets and facet.pivots:
result = solr.search(query_s, **{
'rows': '24',
'sort': formatted_sort,
'facet': 'on',
'facet.limit': '-1',
'facet.mincount': '1',
'facet.field': ['gender', 'material'],
'facet.pivot': 'brand,series',
'fq': '-in_stock:(0 OR 99 OR 100 OR 101)'
})
The query_s selects specific fields, for example: brand:Target AND gender:Men's.
I would like to combine the above query with a DisMax query so that I can also run a full-text search over specified fields. I found an article which demonstrates nested queries. I have tried to implement something like this:
q: "gender:* AND _query_:"{!edismax qf=brand series}Summer""
For some reason 'Target' will return results for Target brand shirts, but only with correct capitalization. 'Summer' which is a series of Target, won't return any results. Why am I not seeing a list of docs ordered by relevancy?
Am I overcomplicating things by using Dismax altogether?
The dismax parsers are useful for making sense of more "natural" queries, i.e. queries where users just type what they're looking for, which is how most search engines work.
In your case it sounds like brand:Target AND gender:Men's are filters for which documents should be shown, while the query is the part that the user has typed. Usually you'll want to have the filters in fq as they don't affect score (i.e. they're exact values matching a field value), and the query in q.
I assume that Summer is what the user would have typed into your search box, which would give you:
q=Summer&defType=edismax&qf=series
But this assumes that the series field is defined as a text field that has an analyzer attached, so that the values are lowercased and split appropriately.
If you also have a description field you'd like to search, you can do:
q=Summer&defType=edismax&qf=series^20 description
.. which would search for Summer in both the series and description fields, but give 20 times more weight to a hit in the series field. This is a good way to naturally boost documents that match more exact data in your documents. If you also include the brand field, you'd be able to let your users search for "target summer" and similar queries.
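Putting the pieces together, the keyword arguments for a pySolr call could be built roughly like this. It is a sketch under assumptions: build_search_kwargs is a hypothetical helper, and the filter values are quoted as phrases, which assumes brand and gender are plain string fields.

```python
def build_search_kwargs(field_boosts, filters):
    """Build edismax parameters: boosted fields in qf, exact filters in fq."""
    qf = " ".join(f"{f}^{b}" if b != 1 else f for f, b in field_boosts)
    # Quote each value so spaces and apostrophes stay inside one term;
    # values containing double quotes would need escaping first.
    fq = [f'{field}:"{value}"' for field, value in filters]
    return {"defType": "edismax", "qf": qf, "fq": fq}


kwargs = build_search_kwargs(
    field_boosts=[("series", 20), ("description", 1)],
    filters=[("brand", "Target"), ("gender", "Men's")],
)
# With pySolr this could be passed as: solr.search("Summer", **kwargs)
print(kwargs)
```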

dynamic fields as facet in solr

I am trying to develop a filter system using dynamic fields in solr. These dynamic fields may vary from product to product and have a prefix attribute_filter_ to help me recognize the filter field. So given a search query, I want to get faceted results based on these dynamic fields.
For example, I have 3 products as docs in solr
{ID:1, attribute_filter_color:"white", attribute_filter_brand:"Dell"}
{ID:2, attribute_filter_color:"red", attribute_filter_category:"electronics"}
{ID:3, attribute_filter_size:"mobiles", attribute_filter_brand:"samsung"}
When my search query matches doc 1 and doc2, I want only filters color, brand and category and so facet fields are attribute_filter_color, attribute_filter_brand and attribute_filter_category.
When my search query matches doc 2 and doc3, I want filters color, size, category and brand and so facet fields are attribute_filter_color, attribute_filter_size, attribute_filter_category and attribute_filter_brand.
When my search query matches doc 1 and doc3, I want filters color, brand and size and so facet fields are attribute_filter_color, attribute_filter_brand and attribute_filter_size.
Also these filters can be ~300 total over 10^5 products. This creates another problem for making a GET URL with 300 facet fields which might cross the limit for GET URL.
This jira ticket shows how regex could have helped in this situation.
My solution would be to index the field names to an additional field, so that you have "facet_fields": ["attribute_filter_color","attribute_filter_brand"] for the documents containing the fields as well.
Generate a facet across your document result set, then use that result in a new query to generate facets across the fields you're interested in. It will be an extra query, but should scale decently. The part that will be expensive will be the larger number of different fields you're faceting on anyway - the facet_fields field will be quick to calculate and return.
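The two steps could look roughly like this in Python. This is a sketch that simulates the first facet query in memory: with_facet_fields and second_query_params are hypothetical helpers, and in practice the set of names would come from Solr's facet response on the facet_fields field.

```python
PREFIX = "attribute_filter_"


def with_facet_fields(doc):
    """At index time, record which attribute_filter_* fields the document has."""
    names = sorted(k for k in doc if k.startswith(PREFIX))
    return {**doc, "facet_fields": names}


def second_query_params(facet_field_values):
    """Build the facet parameters for the follow-up query."""
    return {"facet": "on", "facet.field": sorted(facet_field_values)}


docs = [
    {"ID": 1, "attribute_filter_color": "white", "attribute_filter_brand": "Dell"},
    {"ID": 2, "attribute_filter_color": "red", "attribute_filter_category": "electronics"},
    {"ID": 3, "attribute_filter_size": "mobiles", "attribute_filter_brand": "samsung"},
]
indexed = [with_facet_fields(d) for d in docs]

# Pretend the search matched doc 1 and doc 2; faceting on facet_fields
# would return exactly these field names as facet values.
matched = indexed[:2]
names = {name for d in matched for name in d["facet_fields"]}
params = second_query_params(names)
print(params["facet.field"])
```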

Boosting search results for numbers in solr

Suppose I have two documents with just one field as follows:
Document 1: foo bar 1
Document 2: foo baz 2
And a user searches for "foo baz 1"
Document 1 matches "foo" and "1" and Document 2 matches "baz" and "foo" so they would ordinarily be tied. Is there any way to weight a match on a number higher than a match on text that would cause Document 1's match to be preferred over Document 2?
I don't want to boost by the number that matched, I want all numbers to be boosted by the same amount.
Your question is about boosting numbers in a query.
At query time you can boost a term, or you could use payloads at index time: Adding Boost to Score According to Payload of Multivalued Field at Solr
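A minimal sketch of the query-time option, assuming the query is rewritten in application code before it is sent to Solr. The boost_numeric_terms helper and the boost value of 5 are hypothetical; the term^boost syntax is the standard query parser's term boost.

```python
def boost_numeric_terms(query, boost=5):
    """Append the same ^boost to every purely numeric term in the query."""
    terms = []
    for term in query.split():
        terms.append(f"{term}^{boost}" if term.isdigit() else term)
    return " ".join(terms)


print(boost_numeric_terms("foo baz 1"))  # foo baz 1^5
```

Every number gets the same boost regardless of its value, which matches the requirement in the question.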

Difference between Solr Facet Fields and Filter Queries

I am using SolrMeter to test the Apache Solr search engine. The difference between Facet fields and Filter queries is not clear to me. The SolrMeter tutorial lists this as an example of Facet fields:
content
category
fileExtension
and this as an example of Filter queries:
category:animal
category:vegetable
category:vegetable price:[0 TO 10]
category:vegetable price:[10 TO *]
I am having a hard time wrapping my head around it. Could somebody explain by example? Can I use SolrMeter without specifying either facets or filters?
Facet fields are used to get statistics about the returned documents - specifically, for each value of that field, how many returned documents have that value for that field. So for example, if you have 10 products matching a query for "soft rug" and you facet on "origin", you might get 6 documents for "Oklahoma" and 4 for "Texas". The facet field query will give you the numbers 6 and 4.
Filter queries on the other hand are used to filter the returned results by adding another constraint. The thing to remember is that the query when used in filtering results doesn't affect the scoring or relevancy of the documents. So for example, you might search your index for a product, but you only want to return results constrained by a geographic area or something.
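The difference can be mimicked on a toy in-memory result set in Python. This is only an analogy, not how Solr computes either one: the product data is made up to match the "soft rug" example, and both helper functions are hypothetical.

```python
from collections import Counter

# Ten made-up products that all match the query "soft rug".
products = [
    {"name": f"soft rug {i}", "origin": "Oklahoma" if i < 6 else "Texas"}
    for i in range(10)
]


def facet_counts(docs, field):
    """Facet: count how many matching documents carry each value of the field."""
    return Counter(d[field] for d in docs)


def apply_filter(docs, field, value):
    """Filter query: keep only documents satisfying the extra constraint."""
    return [d for d in docs if d[field] == value]


print(facet_counts(products, "origin"))
print(len(apply_filter(products, "origin", "Texas")))
```

The facet leaves the result set untouched and only reports counts, while the filter shrinks the result set without changing how the remaining documents are scored.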
A facet is a field (type) of the document, so category is the field. As Ansari said, facets are used to get statistics and provide grouping capabilities. You could apply grouping on the category field to show everything vegetable as one group.
Edit: The parts about searching inside of a specific field are wrong. It will not search inside of the field only. It should be 'adding a constraint to the search' instead.
Performing a filter query of category:vegetable will search for vegetable in the category field and no other fields of the document. It is used to search just specific fields rather than every field. Sometimes you know that the term you want only is in one field so you can search just that one field.
