Solr complicated faceting - solr

I have problems with faceting. Imagine this situation. Product can be in more than one category. This is common behavior for faceting:
Category
Android (25)
iPhone (55)
other (25)
Now when I select "Android", I make new query with "fq" => "category:Android", I will get:
Category
Android
iPhone (15)
other (2)
But this means that there is 15 products, that are in categories "Android" AND "iPhone". I would like something like this: ("Android" OR "iPhone")
Category
Android
iPhone (+5)
other (+1)
Meaning I will get 25 results by selecting "Android (25)" and another 5 by selecting "iPhone (+5)", so finally I will get 30 search results..
Does anyone know if this is possible with SOLR's faceting? Or perhaps with more than one query and calculate it manually?
Thanks for advice!

Try a new query with the negative of the selections, like "fq" => "-category:Android" - you should then get the facet counts you are looking for.

Depending on all the permutations you need, you probably want to look into query facets that enable you to get counts for arbitrary queries. For instance, you can do facet.query=category:("Android" OR "iPhone") and get a count results keyed on category:("Android" OR "iPhone"). And, you can do this for any number of queries you want counts for. So, in your case, you can probably get to a final solution with some combination of straight field facets and query facets.
Edit: Re-reading you question, you may also want to look into tagging and excluding parts of an extra fq, depending on how you are allowing your users to "select into" the choices. (The example in the docs is fairly close to your original setup, although I'm not sure the end behavior is exactly as you desire).

Related

Solr 8.8.2 reduce recall and improve precision for multi token queries - mm, qs, shingles

I'm facing a issue wherein I have huge amount of data in Solr and as a result, searching for a multi token query is generating a big recall set. For ex - if i search for "apple watch series 4 42mm", i get back 4 million results. My parser is edismax, minimum match setting is 2 as of now, and am using WhiteSpace Tokenizer with a bunch of filters. The goal here is to reduce this recall set to display more relevant results.
Things that I explored are -
MinimumMatch - Am trying setting mm to 2>2 4>3 to see how it results. Also tried finding out if i could apply mm on individual fields and found out that it used to be possible with local params in Solr but has been discontinued since Solr 7.2. I do not want to get into writing a custom parser or tweaking Solr's code since that could lead to other problems. Nor do i want to change the default parser to Lucene. Is there any other way that i could apply mm separately to category_name, product_name, product_description, brand_name, etc?
Query slop - Am not using qs as of now, tried a few examples converting my query into phrase query and applying qs. It does reduce recall but i have a problem there. Suppose i have a product which has "apple" in brand_name and "watch series 4 42mm" in product name, that is a relevant result but will not be returned because the phrase query has to have all tokens in the field. Is there a way to apply qs to suit my purpose?
ShingleFilterFactory - I'm trying this filter with outputUnigrams true because i do not want the individual terms to not be indexed. But with that, index size would explode and result set won't be that good either. Can i use other levers like mm or something else along with this to make it work? Also, is there a way to make outputUnigrams a query param?
Explored pf2, pf3, ps also but those will be used for boosting. Right now, my aim is filtering the most relevant results.
Can someone please help me with the above? Thanks

Solr morelikethis returns no documents

I try to build a product recommender using the Text from the productdescription as input for recommendations.
But for some reasons I don't get any results. I setup the productdescription as textfield in the Schema.XML . I also marked it as a vector field.
My query looks like this select?q=id:189&mlt=true&mlt.fl=productdescription&mlt.mintf=1&mlt.mindf=0
From my understanding this query should somehow alsways bring me some similar items even if the score would be very low as df is set to 0.
But the only result I get is sometimes a duplicate of a product with the same description but an different ID (the dataset is not perfect).
So my question is: how can I always get the next nearest document even if there is no 1:1 match from the whole Text
I had a similar issue where no results were showing until I added mlt.mintf=1, which I see you have, but perhaps play around with those mintf mindf parameters to see if something yields results. I've actually been looking for more in depth examples for MoreLikeThis..

how to get resultcounts for each word if multiword-search was without results

On our webshop I want to implement a feature which should do the following:
If a user e.g. searches for "phone magnum", there will be no results.
If there were no results I want to give him the possibility to see
that search for "phone" will give him 139 results
and search for "magnum" will get 12 results.
I don't want to start several queries only for getting those counts. But at the moment I have no Idea how to do that.
I read the Solr-wiki for faceting, but didn't find anything useful for my problem. Maybe I missed something ....
Not sure why you want to avoid multiple queries. If your first search on the phrase "phone magnum" does not return any results, you could issue one query per search keyword with rows=0 which will give you only the counts. This should be efficient, since you are not building any result documents and only getting the result counts.
However, if you really want to avoid the subsequent queries, here is one apporach: Have a field in your index which does not take IDF into account. (See this on how to do that.) Once that field is available (call it say name_no_idf) issue a query against this field name_no_idf:(phone magnum). Notice that this is not a phrase search.
The documents which contain both phone and magnum in the name_no_idf field will get a score of 2, while the docs matching only one word will get a score of 1. To this query you add facet=true&facet.field=name. Then the facet counts you get for these two words will be the counts you are looking for. But few warnings:
if one of the words is very infrequent, you may need to increase facet.limit
facet queries are expensive

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try it the multiple requests solution and benchmark it -- solr tends to be a lot faster than what people put in front of it, so an couple few requests might not be that big of a deal.
you could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
if you want the term to be searched over the same fields every time, you have 2 options not breaking the "single query" requirement:
1) copyField: you group at index time all the fields that should match togheter. With just one copyfield your problem doesn't exist, if you need more than one, you're at the same spot.
2) you could filter the query each time dynamically adding the "fq" parameter at the end
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
this works if you'll be searching always on the same two fields (or you can setup them before querying) otherwise you'll always need at least a second query
Since i said "you have 2 options" but you actually have 3 (and i rushed my answer), here's the third:
3) the dismax plugin described by them like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
so, if you can use it, you may want to give it a look and start from the qf parameters (that is what the option number 2 wanted to be about, but i changed it in favor of fq... don't ask me why...)
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+titlle&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure if I understand properly what you want to achieve but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you would want both parts to be boosted by the same scale (I am not sure if this isn't the default behaviour of Solr) you would want to use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be a OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of pseudo-field calculated using the query() function to report how any set of results score on some other query/queries. So if for example the original document set was the top-N from query_A, and then those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting-field …&fl=bscore:query({!dismax v="query_B"})&…. Then the document's scores against query_B would be included in the output (as bscore).
Finally, the result-grouping functionality can be used both collect the top-N for one query and scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)

Resources