Solr: ranking of results when querying multiple shards - solr

If I'm querying across two shards and first shard returned 10 rows and second one returned 100 rows, how is the combined result set ranked? Will I end up with results from first shard (the one with least result) appearing first?

Each shard sorts its own results for the query by similarity score, a relative measure of how well a document matches the search query.
The results from the different shards are then merged by similarity score and presented to the user or application. The scores are calculated within each shard before the merge happens, so the number of rows a shard returns does not decide where its documents appear in the combined list.
You can add the parameters shards.info=true and fl=*,score to the query and inspect the response. Compare the maxScore reported by each shard with the score of each document to see how the results are merged.
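The merge described above can be illustrated with a small sketch. This is not Solr's actual implementation, only a model of the idea: each shard hands back a list already sorted by descending score, and the coordinator interleaves them by score. The function name, the document IDs, and the scores are made up for the example.

```python
import heapq
from itertools import islice

def merge_shard_results(shard_results, rows=10):
    """Merge per-shard hit lists (each sorted by descending score) into one top-`rows` list."""
    # heapq.merge with reverse=True expects every input to be sorted largest-first.
    merged = heapq.merge(*shard_results, key=lambda hit: hit["score"], reverse=True)
    return list(islice(merged, rows))

# Hypothetical per-shard responses; the smaller shard does not win by default.
shard_a = [{"id": "a1", "score": 2.1}, {"id": "a2", "score": 0.4}]
shard_b = [{"id": "b1", "score": 3.0}, {"id": "b2", "score": 1.7}, {"id": "b3", "score": 0.9}]

top = merge_shard_results([shard_a, shard_b], rows=4)
print([hit["id"] for hit in top])
```

In this model a document's position depends only on its score, not on which shard it came from or how many rows that shard returned.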

Related

Dynamic facet limits using Solr

How can I group my Solr query results using a numeric field into x buckets, where the bucket start and end values are determined when the query is run?
For example, if I want to count and group documents into 5 buckets by a wordCount field, the results should be:
250-500 words: 3438 results
500-750 words: 4554 results
750-1000 words: 9854 results
1000-1250 words: 3439 results
1250-1500 words: 38 results
Solr's faceting API docs assume that the facet buckets are known in advance, but this isn't possible for numeric fields because the lower and upper buckets depend on the search results.
My current query (which doesn't work) is:
curl http://localhost:8983/solr/pages/query -d '
q=*:*&
rows=0&
json.facet={
wordCount : {
type: range,
field : wordCount,
start : max(wordCount),
end : min(wordCount),
gap : 1000
}
}'
I have read this question, which suggests calculating the buckets in the application code before sending them to Solr for counting. This is not ideal because it involves querying the database multiple times. The answer is also several years out of date: since then Solr has added the JSON faceting API, which allows more complicated faceting settings.
In SQL, this type of dynamic bucketing is possible with union queries, in which each query in the union calculates a specific bucket's lower and upper bounds and counts the results in that bucket. So it seems odd that in Solr, where a lot of effort has gone into making faceting easy, this kind of query is not possible.
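For reference, the application-side approach mentioned above might look roughly like the following sketch. It assumes the smallest and largest wordCount values have already been fetched by an earlier request; the helper name range_facet and the sample bounds are hypothetical, and whether a value exactly equal to the upper bound falls into the last bucket depends on the facet's include settings.

```python
import json

def range_facet(field, lo, hi, buckets):
    """Build a JSON Facet API range facet with `buckets` equal-width buckets."""
    # Ceiling division so that buckets * gap spans at least hi - lo.
    gap = max(1, -(-(hi - lo) // buckets))
    return {
        "type": "range",
        "field": field,
        "start": lo,
        "end": lo + gap * buckets,
        "gap": gap,
    }

# lo and hi are assumed to come from a previous min/max lookup.
facet = {"wordCount": range_facet("wordCount", lo=250, hi=1500, buckets=5)}
print(json.dumps(facet))
```

The printed object is what would be passed as the json.facet value in the second request.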

SOLR Faceting not returning all facets when searching for "All" (*:*)

I've noticed something curious with our SOLR 7 results.
We have faceting enabled on, for example, a manufacturer field.
When a search is performed for a particular manufacturer, the facet data will include a number of results for that manufacturer (in this case, 99 results). Also, all the facet results add up to match the total number of documents matching the query (which makes sense).
If a "blank" search is performed (resulting in a *:* query), all documents are returned from SOLR (~242,000). The facet results for the manufacturer field are no longer adding up to the total number of documents returned, however. It ends up being ~36,000 documents short. The specific manufacturer that I searched for in the prior example, which DID return a count of 99 in the facet data for that manufacturer, now returns nothing for that manufacturer. There is no facet result shown for that manufacturer.
If I query solr for the specific manufacturer value in the specific field we're faceting on, then it finds the 99 matches, and the facet data also shows the 99 results.
I think this problem is only happening when a *:* (or blank q) query is done.
Any suggestions?
Please let me know if you require more information.
Thanks,
Bill
I'm not sure I understand your problem correctly, but here are some typical solutions.
You can use the "enum" facet method for huge facets:
facet.method=enum
You also need to control how many facet values are returned. By default facet.limit is 100, so values outside the top 100 by count are not listed in the response:
facet.limit=10000 //maximum number of returned facets
facet.offset=0
For more information about Solr facet parameters, see:
https://wiki.apache.org/solr/SimpleFacetParameters
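As a rough illustration, the parameters above could be assembled into a query string like this. The helper name facet_params and the field name manufacturer are assumptions for the example; only standard Solr facet parameters are used, and nothing is sent over the network.

```python
from urllib.parse import parse_qs, urlencode

def facet_params(field, limit=10000, offset=0, method="enum"):
    """Return a URL-encoded query string for a match-all facet request."""
    return urlencode({
        "q": "*:*",
        "rows": 0,               # only the facet counts are needed
        "facet": "true",
        "facet.field": field,
        "facet.method": method,
        "facet.limit": limit,    # a negative value means no limit
        "facet.offset": offset,
    })

query_string = facet_params("manufacturer")
print(query_string)
print(parse_qs(query_string)["facet.limit"])
```

The string can be appended to the collection's select URL to check whether the missing manufacturer shows up once the limit is raised.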

Using group.ngroups during query search in Solr

Does using result grouping with group.ngroups (which includes the number of groups that matched the query) affect Solr's performance? Searches slowed down quite significantly after I added the group.ngroups parameter.
I need the number of groups that matched the query. Is there another way to retrieve that value?
I have more than 10 million documents, with an index size of more than 500GB, and I'm using Solr 5.4.0.
Regards,
Edwin
Yes, it will affect performance. Everything that has to be done to a result set (such as grouping) affects performance in some way. How much depends on too many factors to give an exact figure, but you have already observed the effect yourself.
You can get the number of unique values (which should be the same as grouping for that field and counting the number of groups) for a field in a number of ways, which Yonik shows in his Count Distinct Values blog post.
The unique facet function is Solr’s fastest implementation to calculate the number of distinct values.
$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
x : "unique(manu_exact)" // manu_exact is the manufacturer indexed as a single string
}'

Is it possible to know the individual scores of the components of a query?

I have a boolean query consisting of an OR between two query components, both of which are of type TermQuery.
Once the result documents are retrieved, a final score is associated, which is derived by a combination of the scores returned by the two term queries that are part of the main query.
Is there a way to know what were the two individual scores that resulted in the final score?
Use the debugQuery=true parameter to get a breakdown of the score for each individual document, showing the fields that matched and how much each contributed to the score.

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure I properly understand what you want to achieve, but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you want both parts boosted on the same scale (I am not sure whether this is already Solr's default behaviour), you could use dismax or edismax and change your query to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
Both the documents named by ID and the query results would then be scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents which match this other query. explainOther only has an effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to use a pseudo-field calculated with the query() function to report how any set of results scores against some other query or queries. For example, if the original document set was the top-N from query_A, and those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. The documents' scores against query_B would then be included in the output (as bscore).
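A minimal sketch of building such a request follows. The helper name and the id field are assumptions, query_A and query_B are the placeholders from the text above, and the sketch does not escape quotes inside the second query.

```python
from urllib.parse import parse_qs, urlencode

def score_against_other_query(main_query, other_query, alias="bscore"):
    """Build parameters that run main_query and also report each hit's score on other_query."""
    # Assumes other_query contains no double quotes; real input would need escaping.
    pseudo_field = '%s:query({!dismax v="%s"})' % (alias, other_query)
    return urlencode({
        "q": main_query,
        "fl": "id,score," + pseudo_field,
    })

params = score_against_other_query("query_A", "query_B")
print(parse_qs(params)["fl"][0])
```

urlencode takes care of percent-encoding the braces, quotes, and exclamation mark, which is easy to get wrong when typing the URL by hand.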
Finally, the result-grouping functionality can be used both to collect the top-N for one query and to get scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)
