Improve Slow Solr Query Performance

We are using Solr on Windows with multiple collections. Each collection has multiple stored and indexed fields and approximately 200k documents. The use case is search for an e-commerce website. The index size is approximately 200 MB.
While a normal search takes only a few ms, the queries that need to find all data for multiple categories take somewhere around 1100ms to 1200ms. The query includes approximately 400 categories ORed together, something like:
Category:(5 OR 33 OR 312 OR 1192 OR 1193 OR 1196 OR .....)
I have increased the heap size to 4 GB and configured the Solr caches to be larger; this reduced the query time from 2000ms to 1100ms, but we are looking for further improvement.
I also found the following on the Solr UI:
lockFactory=org.apache.lucene.store.NativeFSLockFactory#56761b2a; maxCacheMB=48.0 maxMergeSizeMB=4.0
But I am not sure whether that has any impact, and if so, how to change it?
Can you advise what else we can do? Let me know if you need more details.
Thank you in anticipation.

You should post your full request so it's easier to give advice. But from your sentence "The query includes appx. 400 categories with OR something like.." I understand you are putting your huge clause in the q param? That is not the right approach.
Instead, use q=*:* and put your clause in fq. This way the filter will be cached, and with a good cache hit rate, subsequent queries will be significantly faster.
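For example, the query from the question would become (same categories, just moved into a filter):
q=*:*&fq=Category:(5 OR 33 OR 312 OR 1192 OR 1193 OR 1196)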
A second thing you might try (but go with the above first) is transforming the big OR clause into a range clause, or a combination of range clauses, such as:
Category:[5 TO 1190] OR Category:[1192 TO 1196]
If your field type is a tint and you can transform the clause into a combination of ranges that significantly decreases its size, that might help too.
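Combining both suggestions, a sketch (the actual range boundaries depend on which category IDs are contiguous in your data):
q=*:*&fq=Category:([5 TO 1190] OR [1192 TO 1196])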

Related

Solr query long execution time for multiple ORs

I have a question regarding Solr queries.
My query basically contains thousands of OR conditions for authors (author:name1 OR author:name2 OR author:name3 OR author:name4 ...).
The execution time on my index is huge (around 15 sec). When I tag all the associated documents with a custom field and value, like authorlist:1, and then change my query to just search for authorlist:1, it executes in 78 ms. How come there is such a big difference in execution time?
Can somebody please explain why there is such a difference (maybe the query parser?) and whether there is a way to speed this up?
Thanks for the help.

Solr: How to re-sort the top documents from a query?

I am running Solr 4.10.3 and trying to re-sort the top 10 documents returned by Solr. How can I do that? I am thinking of a sub-query but don't know how to do that; I need help.
Example:
Suppose that for the query "car" Solr returns 250 documents ranked by relevancy score. Now, from those 250 documents, take the top 10 and re-sort them on the basis of a custom field.
I can't do this:
select?q=car&sort=score desc, pr desc
because it will sort the entire 250 documents. So is there any solution?
I think you mean query reranking? This is what is used when you want to:
use a first, more lightweight query to get a result based on the score
get a fair top x number of matches, like 2000 for example
use a heavier query to resort only those again, to get the final score
That is what you need to use if you want to do it in two steps, as you state. Now, I am not sure you need query reranking for your use case; maybe just boosting by that field, or sorting on a function, would be enough for you.
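For the two-step route, a minimal sketch using the ReRank query parser (available since Solr 4.9, so usable on 4.10.3); the reRankWeight of 1000 is an assumed value chosen so the pr field dominates the combined score of the top 10:
select?q=car&rq={!rerank reRankQuery=$rqq reRankDocs=10 reRankWeight=1000}&rqq={!func}field(pr)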

Solr facet performance

I am working with Solr facet fields and have come across a performance problem I don't understand. Consider these two queries:
q=word&facet.field=CONTENT&facet=true&facet.prefix=&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
q=word&facet.field=CONTENT&facet=true&facet.prefix=a&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0
The only difference is an empty facet.prefix in the first query.
The first query returns after some 20 seconds (QTime 20000 in the result) while the second one takes only 80 msec (QTime 80). Why is this?
And as a side note: facet.method=fc makes the queries run 'forever' and eventually fail with org.apache.solr.common.SolrException: Too many values for UnInvertedField faceting on field CONTENT.
This is with Solr 1.4.
From this doc: http://docs.lucidworks.com/display/solr/Faceting
The facet.prefix parameter limits the terms on which to facet to those
starting with the given string prefix.
That means you facet over fewer terms.
Now, I'm quite sure the faceting time is included in the QTime (as seems demonstrated by this post: http://www.mail-archive.com/solr-user@lucene.apache.org/msg39859.html).
So: fewer terms, less time.
Maybe don't facet on CONTENT at all, as it probably has many distinct terms and makes little sense to facet on. Try faceting on a category field or some other field with fewer unique terms.
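For instance, assuming a hypothetical lower-cardinality field named category, the query above would become:
q=word&facet.field=category&facet=true&facet.limit=10&facet.mincount=1&facet.method=enum&rows=0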
Have you tried executing them in the opposite order after a fresh restart of Solr server?
Usually the first query takes more time, and if subsequent queries have more in common with any of the previous ones, there will be cache hits and response times will improve dramatically.
In addition, please note that 'enum' is more suitable for facet fields with a smaller number of unique terms.
Also, try increasing the filterCache to a really big number and check your cache-hit ratio at
SOLR_DOMAIN:PORT/solr/#/collection1/plugins/cache?entry=fieldValueCache,filterCache
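For reference, the filterCache is configured in solrconfig.xml; a sketch with placeholder sizes that you would tune against the observed hit ratio:
<filterCache class="solr.FastLRUCache" size="16384" initialSize="4096" autowarmCount="512"/>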

How can I limit my Solr search to an arbitrary set of 100,000 documents?

I've got an 11,000,000-document index. Most documents have a unique ID called "flrid", plus a different ID called "solrid" that is Solr's PK. For some searches, we need to be able to limit the search to a subset of documents defined by a list of FLRID values. The list of FLRID values can change with every search, and it is rare enough to call it "never" that any two searches have the same set of FLRIDs to limit on.
What we're doing right now is, roughly:
q=title:dogs AND
(flrid:(123 125 139 .... 34823) OR
flrid:(34837 ... 59091) OR
... OR
flrid:(101294813 ... 103049934))
Each of those parenthesized subclauses can contain 1,000 FLRIDs strung together. We have to subgroup to get past Solr's limit on the number of terms that can be ORed together.
The problem with this approach (besides being clunky) is that it seems to scale as O(N^2) or so. With 1,000 FLRIDs, the search comes back in 50ms or so. With 10,000 FLRIDs, it comes back in 400-500ms. With 100,000 FLRIDs, that jumps to about 75000ms. We want it to be on the order of 1000-2000ms at most in all cases up to 100,000 FLRIDs.
How can we do this better?
Things we've tried or considered:
Tried: Using dismax with minimum-match mm:0 to simulate an OR query. No improvement.
Tried: Putting the FLRIDs into the fq instead of the q. No improvement.
Considered: dumping all the FLRIDs for a given search into another core and doing a join between it and the main core, but if we do five or ten searches per second, it seems like Solr would die from all the commits. The set of FLRIDs is unique between searches so there is no reuse possible.
Considered: Translating FLRIDs to SolrID and then limiting on SolrID instead, so that Solr doesn't have to hit the documents in order to translate FLRID->SolrID to do the matching.
What we're hoping for:
An efficient way to pass a long set of IDs, or for Solr to be able to pull them from the app's Oracle database.
Have Solr treat big ORs as a set operation, not as (what we assume is) naive one-at-a-time matching.
A way to create a match vector that gets passed to the query, because long strings of fqs in the query seem to be a suboptimal way to do it.
I've searched SO and the web and found people asking about this type of situation a few times, but no answers that I see beyond what we're doing now.
solr search within subset defined by list of keys
Searching within a subset of data - Solr
http://lucene.472066.n3.nabble.com/Filtered-search-for-subset-of-ids-td502245.html
http://lucene.472066.n3.nabble.com/Search-within-a-subset-of-documents-td1680475.html
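Worth noting, though it is not an answer from the original thread: Solr 4.10 and later ship the TermsQParserPlugin, which accepts a plain comma-separated list of terms, bypasses the boolean-clause limit, and avoids per-clause scoring overhead, e.g. (IDs taken from the example above):
fq={!terms f=flrid}123,125,139,34823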

Solr * vs *:* query performance

We're running Solr 3.4 and have a relatively small index of 90,000 documents or so. These documents are split over several logical sources, and so each search will have an applied filter query for a particular source, e.g:
?q=<query>&fq=source:<source>
where source is a classic string field. We're using edismax with a default search field of text.
We are currently seeing q=* taking on average 20 times longer to run than q=*:*. The difference is quite noticeable, with *:* taking 100ms and * taking up to 3500ms. A search for a common word in the document set (matching nearly 50% of all documents) will return a result in less than 200ms.
Looking at the queries with debugQuery on, we can see that * is parsed to a DisjunctionMaxQuery((text:*)), while *:* is parsed to a MatchAllDocsQuery(*:*). This makes sense, but I still don't feel like it accounts for a slowdown of this magnitude (a slowdown of 2000% over something that matches 50% of the documents).
What could be causing this? Is there anything we can tweak?
When you pass just *, you are asking Solr to check every term in the field and match it against *, and that is a lot of work. However, when you use *:*, you are asking Solr to give you everything and skip matching entirely.
Solr/Lucene is optimized to execute *:* fast and efficiently!
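In other words, when the intent is "match all documents", issue the MatchAllDocsQuery form explicitly and keep the restriction in the filter query:
?q=*:*&fq=source:<source>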
