Using group.ngroups during query search in Solr

I would like to check: does using result grouping with group.ngroups (which includes the number of groups that matched the query) affect Solr's search performance? I found that searches slowed down quite significantly after I added the group.ngroups parameter.
I need the number of groups that matched the query. Is there another way to retrieve that value?
I have more than 10 million documents, with an index size of more than 500GB, and I'm using Solr 5.4.0.
Regards,
Edwin

Yes, it will affect performance. Everything that has to be done to a result set (such as grouping) affects performance in some way. How much depends on too many factors to quantify in general (but you've already observed the effect).
You can get the number of unique values (which should be the same as grouping for that field and counting the number of groups) for a field in a number of ways, which Yonik shows in his Count Distinct Values blog post.
The unique facet function is Solr’s fastest implementation to calculate the number of distinct values.
$ curl http://localhost:8983/solr/techproducts/query -d '
q=*:*&
json.facet={
x : "unique(manu_exact)" // manu_exact is the manufacturer indexed as a single string
}'
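As a rough sketch, the same request parameters can be assembled with Python's standard library. The helper name `unique_count_params` is hypothetical, and `rows=0` is an assumption added here because only the count is wanted. The code only builds the request body and does not contact Solr.

```python
import json
from urllib.parse import urlencode

def unique_count_params(field, query="*:*"):
    """Build parameters asking Solr's JSON Facet API for the distinct-value count of `field`."""
    facet = {"x": f"unique({field})"}
    # rows=0 is an assumption: skip returning documents, keep only the facet result.
    return {"q": query, "rows": "0", "json.facet": json.dumps(facet)}

params = unique_count_params("manu_exact")
body = urlencode(params)  # form-encoded body, e.g. for POST to /solr/techproducts/query
print(body)
```

The count would then be read from the `facets` section of the response under the key `x`.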

Related

Solr facet performance

I am working with Solr facet fields and come across a performance problem I don't understand. Consider these two queries:
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
The only difference is an empty facet.prefix in the first query.
The first query returns after some 20 seconds (QTime 20000 in the result) while the second one takes only 80 msec (QTime 80). Why is this?
And as side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT.
This is with Solr 1.4.
From this doc: http://docs.lucidworks.com/display/solr/Faceting
The facet.prefix parameter limits the terms on which to facet to those
starting with the given string prefix.
That means that you facet on fewer terms.
Now, I'm quite sure the faceting time is included in the QTime (as seems to be demonstrated by this post: http://www.mail-archive.com/solr-user@lucene.apache.org/msg39859.html).
So fewer terms means less time.
Maybe don't facet on CONTENT, as it probably has many different terms and faceting on it makes little sense. Try faceting on a category field or some other field with fewer unique terms.
Have you tried executing them in the opposite order after a fresh restart of Solr server?
Usually the first query takes more time, and if later queries have something in common with earlier ones, there will be cache hits and the response time will be much better.
In addition, please note that 'enum' is more suitable for facet fields with a small number of unique terms.
Also, try increasing the filterCache size to a really big number and check your cache-hit ratio at
SOLR_DOMAIN:PORT/solr/#/collection1/plugins/cache?entry=fieldValueCache,filterCache
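To illustrate why a non-empty facet.prefix can help so much with facet.method=enum: conceptually, enum walks the field's terms and intersects each term's document set with the query result, so fewer terms visited means less work. The following is a toy model only, not Solr's implementation; the index contents and the function name `enum_facet` are invented for illustration.

```python
from bisect import bisect_left

def enum_facet(term_index, result_docs, prefix="", limit=10, mincount=1):
    """Toy enum faceting: visit terms starting with `prefix`, count overlap with result_docs.

    term_index maps term -> set of doc ids. Returns (top counts, terms visited).
    """
    terms = sorted(term_index)
    start = bisect_left(terms, prefix)  # sorted terms allow jumping to the prefix range
    counts, visited = [], 0
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # past the end of the prefix range
        visited += 1
        n = len(term_index[term] & result_docs)  # one set intersection per visited term
        if n >= mincount:
            counts.append((term, n))
    counts.sort(key=lambda tc: (-tc[1], tc[0]))
    return counts[:limit], visited

index = {"apple": {1, 2}, "avocado": {2}, "banana": {1, 3}, "cherry": {3}, "date": {4}}
hits = {1, 2, 3}
_, visited_all = enum_facet(index, hits, prefix="")
top_a, visited_a = enum_facet(index, hits, prefix="a")
print(visited_all, visited_a, top_a)
```

With an empty prefix every term is visited; with prefix "a" only the two matching terms are. On a free-text field such as CONTENT the same effect applies to a far larger term dictionary.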

How can I limit my Solr search to an arbitrary set of 100,000 documents?

I've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the searches to a subset of documents defined by a list of FLRID values. The list of FLRID values can change between every search and it will be rare enough to call it "never" that any two searches will have the same set of FLRIDs to limit on.
What we're doing right now is, roughly:
q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))
Each of those parentheticals can be 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limit on the number of terms that can be ORed together.
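For concreteness, the chunked query described above could be assembled roughly like this. The function names are hypothetical and the chunk size of 1,000 is taken from the description; this only sketches the current approach and is not a faster alternative.

```python
def chunked_id_clause(field, ids, chunk_size=1000):
    """Group ids into parenthesized chunks of at most chunk_size, ORed together."""
    ids = list(ids)
    groups = []
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i + chunk_size]
        groups.append(f"{field}:({' '.join(str(x) for x in chunk)})")
    return "(" + " OR ".join(groups) + ")"

def build_query(main_clause, field, ids, chunk_size=1000):
    """Combine the main clause with the chunked id restriction."""
    return f"{main_clause} AND {chunked_id_clause(field, ids, chunk_size)}"

# Tiny chunk size so the structure is visible.
q = build_query("title:dogs", "flrid", [123, 125, 139, 34823, 34837], chunk_size=2)
print(q)
```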
The problem with this approach (besides that it's clunky) is that it seems to scale at O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. If we have 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps up to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.
How can we do this better?
Things we've tried or considered:
Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
What we're hoping for:
An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
Have Solr do big ORs as a set operation not as (what we assume is) a naive one-at-a-time matching.
A way to create a match vector that gets passed to the query, because strings of fqs in the query seems to be a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
solr search within subset defined by list of keys
Searching within a subset of data - Solr
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html

SOLR index time boost depending on the field value

Is it possible to boost a document on the indexing stage depending on the field value?
I'm indexing a text field pulled from the database. I would like to boost results that are shorter over the longer ones. So the value of boost should depend on the length of the text field.
This is needed to alter the standard SOLR behavior that in my case tends to return documents with multiple matches first.
Considering I have a field that stores the length of the document, the equivalent in the query of what I need at indexing would be:
q={!boost b=sqrt(length)}text:abcd
Example:
I have two items in the DB:
ABCDEBCE
ABCD
I always want to get ABCD first for the 'BC' query even though the other item contains the search query twice.
The other solution to the problem would be ability to 'switch off' the feature that scores multiple matches higher at query time. Don't know if that is possible either...
Doing this at index time is important as the hardware I run the SOLR on is not too powerful and trying to boost on query time returns with OutOfMemory Exception. (Even If I could work around that increasing memory for java I prefer to be on the safe side and implement the index the most efficient way possible.)
Yes and no - but how you do it depends on how you're indexing your documents.
As far as I know there's no way of resolving this only on the solr server side at the moment.
If you're using the regular XML based interface to submit documents, let the code that generates the submitted XML add boost=".." values to the field or to the document depending on the length of the text field.
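A minimal sketch of that idea, assuming a Solr version that still accepts the boost attribute in XML update messages. The boost formula (1 / sqrt(length)) and the field names `id` and `text` are assumptions chosen for illustration, not something Solr prescribes.

```python
import math
import xml.etree.ElementTree as ET

def length_boost(text):
    """Hypothetical boost: shorter text gets a larger value, 1 / sqrt(length)."""
    return 1.0 / math.sqrt(max(len(text), 1))

def build_add_xml(docs):
    """Build an <add> update message with a per-document boost derived from text length."""
    add = ET.Element("add")
    for doc_id, text in docs:
        doc = ET.SubElement(add, "doc", boost=f"{length_boost(text):.3f}")
        ET.SubElement(doc, "field", name="id").text = str(doc_id)
        ET.SubElement(doc, "field", name="text").text = text
    return ET.tostring(add, encoding="unicode")

payload = build_add_xml([(1, "ABCDEBCE"), (2, "ABCD")])
print(payload)
```

Here the shorter item ABCD ends up with the larger document boost, which is the direction the question asks for. Whether that is enough to outweigh the term-frequency effect would need testing on real data.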
You can look at the DIH special commands, which include a $docBoost command:
$docBoost : Boost the current doc. The value can be a number or the
toString of a number
However, there seems to be no $fieldBoost command.
In your case, though, if you are using DefaultSimilarity, shorter fields are boosted higher than longer fields in the score calculation.
You can also implement your own Similarity class with the TF (term frequency) and lengthNorm calculations changed to suit your needs.

Can SOLR/Lucene report calculated score of extra named documents, even if they're not in top N results?

I'd like to submit a query to SOLR/Lucene, plus a list of document IDs. From the query, I'd like the usual top-N scored results, but I'd also like to get the scores for the named documents... no matter how low they are.
Can anyone think of an easy/supported way to do this in a single index scan, where the scores for the 'added' (non-ranking/pinned-for-inclusion) docs are comparable/same-scaled as those for the top-N results? (Patching SOLR with specialized classes would be OK; I figure that's what I may have to do if there's no existing support.)
Or failing that, could it be simulated with a followup query, ideally in a way that the named-document scores could be scaled to be roughly comparable to the top-N for the reference query?
Alternatively -- and perhaps as good or better for my intended use -- could I make a single request against a SOLR/Lucene index which includes M (with M=2 or more) distinct queries, and return the results that are in the top-N for any of the M queries, and for every result include its score against all M of the distinct queries?
(Even in my above formulation, the list of documents that I want scored along with a new query will typically have been the results from a prior query.)
Solutions or even just fragments of possible approaches appreciated!
I am not sure if I understand properly what you want to achieve but wouldn't a simple
q: (somequery) OR id: (1 OR 2 OR 4)
be enough?
If you would want both parts to be boosted by the same scale (I am not sure if this isn't the default behaviour of Solr) you would want to use dismax or edismax and your query would change to something like:
q: (somequery)^10 OR id: (1 OR 2 OR 4)^10
You would then have both the elements defined by the IDs and the query results scored the same way.
To self-answer, reporting what I've found since posting...
One clumsy option is the explainOther parameter, which takes another query. (This query could be an OR list of interesting document IDs.) The response will then include a full scoring explanation for documents that match this other query. explainOther only has an effect when combined with the also-required debugQuery parameter.
All that debug/explain information is overkill for the need, but may be useful, or the code paths that implement it might provide a guide to making a hypothetical new more narrowly-focused 'scoreOther' option.
Another option would be to make use of a pseudo-field calculated using the query() function to report how any set of results scores on some other query or queries. So if, for example, the original document set was the top-N from query_A, and those are the exact documents that you also want to score against query_B, you would execute query_A again with a reporting field …&fl=bscore:query({!dismax v="query_B"})&…. Then the documents' scores against query_B would be included in the output (as bscore).
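A small sketch of how such a request could be put together. The helper name `score_other_params` and the inclusion of an `id` field in `fl` are assumptions; the `bscore:query(...)` expression follows the form given above.

```python
from urllib.parse import urlencode

def score_other_params(main_query, other_query, alias="bscore", rows=10):
    """Build params that run main_query and report each hit's score on other_query."""
    # Assumes other_query contains no double quotes, since it is embedded in v="...".
    pseudo_field = f'{alias}:query({{!dismax v="{other_query}"}})'
    return {"q": main_query, "rows": str(rows), "fl": f"id,score,{pseudo_field}"}

params = score_other_params("query_A", "query_B")
print(params["fl"])
print(urlencode(params))  # URL-encoded form, ready to append to the select URL
```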
Finally, the result-grouping functionality can be used to collect both the top-N for one query and scores for lesser documents intersecting with other queries in one go. For example, if querying for query_B and adding …&group=true&group.query=query_B&group.query=query_A&…, you'll get back groups that satisfy query_B (ranked by query_B), and that satisfy both query_B and query_A (but again ranked by query_B). This could be mixed with the functional field above to get the scores by another query (like query_A) as well.
However, all groups will share the same sort order (from either the master query or something specified by a group.sort parameter), so it's not currently possible (SOLR-4.0.0-beta) to get several top-N results according to different scorings, just the top-Ns according to one scoring, limited by certain groups. (There's a comment in the source code suggesting alternate sorts per group may be envisioned as a future capability.)

Order solr documents with same score by date added descending

I want to have search results from SOLR ordered like this:
All the documents that have the same score will be ordered descending by date added.
So when I query Solr I will get n documents. In this result set there will be groups of documents with the same score. I want each of these groups to be ordered descending by date added.
I discovered I can accomplish this using function queries, more exactly using rord function http://wiki.apache.org/solr/FunctionQuery#rord, but as it is stated in the documentation
WARNING: as of Solr 1.4, ord() and rord() can cause excess memory use
since they must use a FieldCache entry at the top level reader, while
sorting and function queries now use entries at the segment level.
Hence sorting or using a different function query, in addition to
ord()/rord() will double memory use.
it will cause excess memory use.
What other options do I have?
I was thinking of using recip(ms(NOW,startTime),1,1,0). Is this the best approach?
Is there any negative performance impact if I use recip and ms?
You can use multiple SORT conditions:
Multiple sort orderings can be separated by a comma, ie: sort=<field name>+<direction>[,<field name>+<direction>]...
http://wiki.apache.org/solr/CommonQueryParameters
So, in your case it would be:
sort=score DESC, date_added DESC
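The effect of a two-key sort like this can be illustrated outside Solr. The documents and field values below are made up, and the sketch only demonstrates the tie-breaking behaviour, not how Solr implements it.

```python
from datetime import date

docs = [
    {"id": "a", "score": 1.5, "date_added": date(2012, 1, 10)},
    {"id": "b", "score": 2.0, "date_added": date(2011, 6, 1)},
    {"id": "c", "score": 1.5, "date_added": date(2012, 3, 5)},
    {"id": "d", "score": 2.0, "date_added": date(2012, 2, 1)},
]

# Primary key: score, secondary key: date_added, both descending.
ranked = sorted(docs, key=lambda d: (d["score"], d["date_added"]), reverse=True)
order = [d["id"] for d in ranked]
print(order)
```

Documents with equal scores stay grouped together, and within each group the most recently added one comes first.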
Since your question says:
All the documents that have the same score will be ordered descending
by date added.
the other answer you got is perfect.
Anyway, I'd suggest making sure that you really want to sort by date only for documents with the same score. In my experience this has always been wrong. In fact, the Solr score is not absolute but only relative to other documents, and each document is different.
Therefore I wouldn't sort by score and then by something else, because it's hard to predict when different documents will have the same score.
I would personally sort only on score and use a function to boost recent documents. You can find a good example on the solr wiki, the function used there is recip(ms(NOW,date_field),3.16e-11,1,1).
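To get a feel for what that function does: recip(x,m,a,b) computes a / (m*x + b), and 3.16e-11 is roughly 1 divided by the number of milliseconds in a year. A quick sketch of the resulting boost at different document ages (the helper names and the year length of 365.25 days are assumptions for the illustration):

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # assumed year length in milliseconds

def recip(x, m, a, b):
    """Same formula as Solr's recip function: a / (m * x + b)."""
    return a / (m * x + b)

def recency_boost(age_ms):
    """Boost for a document of the given age, using the constants quoted above."""
    return recip(age_ms, 3.16e-11, 1, 1)

new_doc = recency_boost(0)
one_year = recency_boost(MS_PER_YEAR)
ten_years = recency_boost(10 * MS_PER_YEAR)
print(new_doc, round(one_year, 2), round(ten_years, 2))
```

A brand-new document gets a boost of 1, a one-year-old document about half of that, and a ten-year-old document about one eleventh, so the boost decays smoothly with age instead of imposing a strict date ordering.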
If you're worried about performance you can try index-time boosting, which should be faster than query-time boosting. Have a look here.
