Solr PERFORMANCE WARNING: Overlapping onDeckSearchers

We've been having a number of problems with our Solr search engine in our test environments. We have a SolrCloud setup on version 4.6, single shard, 4 nodes. The CPU flat-lines at 100% on the leader node for several hours, then the server starts to throw OutOfMemory errors, 'PERFORMANCE WARNING: Overlapping onDeckSearchers' starts appearing in the logs, the leaders enter recovery mode, the filter cache and query cache warmup times hit around 60 seconds (normally less than 2 seconds), the leader node goes down, and we suffer an outage for the whole cluster for a few minutes while it recovers and elects a new leader.
We think we're hitting a number of Solr bugs in the 4.6 and 4.x branch, and so are looking to move to 5.3. We also recently dropped our soft commit time from 10 minutes down to 2 minutes. I am seeing regular CPU spikes every 2 minutes on all nodes, but the spikes are low, from 20-50% (max 100), on a 2-minute cycle. When the CPU is maxed out, obviously I can't see those spikes. Hard commits are every 15 seconds, with openSearcher set to false. We have a heavy query and index load scenario.
I am wondering whether the frequent soft commits are having a significant effect on this issue, or whether the long auto warm times on the caches are caused by the other issues we are experiencing (cause or symptom)? We recently increased the indexing load on the server, but we need to address these issues in the test environment before we can promote to production.
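For reference, the commit intervals described above correspond roughly to the following solrconfig.xml settings (a sketch; the real config may differ in other details):
<autoSoftCommit>
  <maxTime>120000</maxTime>   <!-- 2 minutes -->
</autoSoftCommit>
<autoCommit>
  <maxTime>15000</maxTime>    <!-- 15 seconds -->
  <openSearcher>false</openSearcher>
</autoCommit>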
Cache settings:
<filterCache class="solr.FastLRUCache"
size="5000"
initialSize="5000"
autowarmCount="1000"/>
<queryResultCache class="solr.LRUCache"
size="20000"
initialSize="20000"
autowarmCount="5000"/>

We had this problem with Solr 4.10 (and, very rarely, 5.1). In our case, we were indexing quite frequently and commits were starting to become too close together. Sometimes our optimize command would run a bit longer than expected.
We solved it by making sure no indexing or commits occurred for at least ten minutes after the optimize operation started. We also auto warmed fewer queries for our caches. The following links will probably be useful to you if you haven't found them already:
Overlapping onDeckSearchers--Solr mailing list
The Solr Wiki
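As a concrete illustration of the "auto warmed fewer queries" point above (the numbers here are arbitrary examples, not values from the original answer), that means lowering autowarmCount relative to the settings shown in the question:
<filterCache class="solr.FastLRUCache"
size="5000"
initialSize="5000"
autowarmCount="100"/>
<queryResultCache class="solr.LRUCache"
size="20000"
initialSize="20000"
autowarmCount="500"/>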

Related

Solr cloud: No registered leader was found after waiting for 4000ms

I have created 6 collections, each collection having 3 shards and 2 replicas (Solr version 5.5.0). For a few days my setup was working fine, but after some days I started getting the following error:
Error while trying to recover.
core=Collection1_shard3_replica2:org.apache.solr.common.SolrException: No registered leader was found after waiting for 4000ms, collection: Collection1 slice: shard3
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
    at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
I tried restarting both ZooKeeper and Solr, and also increased the heap memory to 10 GB, but I am still getting the issue.
We ran into the same problem with a 3-node cluster (6 CPUs and 30 GB of memory per node). Below are the steps I took on the way to a solution.
What we already tried and did not work:
Stop solr processes and restart
Increase/decrease memory of the Solr JVM
Recreated collection, but this was only a temporary fix for a day or so
Solr GC tuning:
https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
What fixed the problem of "No registered leader found":
Reduced the number of shards; basically we were oversharding. We reduced the number of shards from 6 to 3 and kept 3 replicas. This means that every node now has 3 shards.
However, since we were indexing tens of thousands of messages per second, I was also wondering what our CPU was doing. So I monitored the CPU load and CPU IO. I found that the CPUs were running at their maximum all the time, causing high IO wait, which I thought was causing the biggest trouble.
The replicas had a hard time keeping in sync because of this high IO wait.
I reduced the workload (messages being sent to Solr) so that the indexes didn't grow as fast as before. This helped bring everything back to normal. My Solr cluster has now been green for a while and hasn't experienced any "election problems". IO wait dropped below 25 ms and CPU usage sat around 70% instead of being at almost 100% all the time.
Generally, it is very difficult to tackle such a problem, since a Solr cluster may work fine for a few days (I have even seen months in other posts). Monitor IO wait and the traffic coming into your Solr nodes. If traffic spikes occur, (daily!) indices may grow too large. You may also add more nodes and split shards, which reduces the load on one machine. I chose to reduce the traffic to the Solr machines, since we use Solr as an audit store and do not need part of the audit logging.
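For reference, recreating a collection with fewer shards, as described above, is done through the Collections API; the host and collection name below are placeholders:
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=3&replicationFactor=3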

Solr indexing slowdown

I'm working on a product which indexes a high volume of small documents.
When Solr starts, it provides an indexing rate of 35k docs/sec for around 20 minutes, and then it slows down to 24k/sec after a while.
If I restart the server, it will again index at 35k/sec for 20 minutes and then slow down.
I have a softCommit every 5 seconds and a hard commit every minute.
I was wondering if someone might have some insight about this?
I don't think it is related to merges since I see merger threads kicking in after 2-3 minutes.
You should check the usual suspects:
There is a problem with the Java (or whatever language you are using) application that you're using to index. If that's the case, please specify the implementation details and I will provide more guidelines.
Your NRT cache fills up after 20 minutes and the hard commit doesn't happen quickly enough. To check this option, set the maximum number of documents to index before writing the docs from cache to disk, in the following way: <autoCommit><maxDocs>10000</maxDocs></autoCommit>. If this is the issue, you can then tune the autoCommit settings or the NRT cache management.
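For example, a hard commit policy bounded by both document count and time (the values here are only illustrative, matching the once-a-minute hard commit mentioned in the question) looks like this in solrconfig.xml:
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>60000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>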

SolrCloud replicas go into recovery mode right after update

We have a SolrCloud cluster with 10 shards and 4 replicas per shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas per shard. Our current commit settings are as follows:
<autoSoftCommit>
<maxDocs>500000</maxDocs>
<maxTime>180000</maxTime>
</autoSoftCommit>
<autoCommit>
<maxDocs>2000000</maxDocs>
<maxTime>180000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
We indexed roughly 90 Million docs. We have two different ways to index documents
a) Full indexing. It takes 4 hours to index 90 Million docs and the rate of docs coming to the searcher is around 6000 per second
b) Incremental indexing. It takes an hour to index delta changes. Roughly there are 3 million changes and the rate of docs coming to the searchers is 2500 per second
We have two collections search1 and search2. When we do full indexing , we do it in search2 collection while search1 is serving live traffic. After it finishes we swap the collection using aliases so that the search2 collection serves live traffic while search1 becomes available for next full indexing run.
When we do incremental indexing we do it in the search1 collection which is serving live traffic.
All our searchers have 12 GB of RAM available and have quad-core Intel(R) Xeon(R) X5570 CPUs @ 2.93GHz
We have observed the following issue when we trigger indexing.
In about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens on all the shards. In about 20 minutes, more and more replicas go into recovery mode. After about half an hour, all replicas except the leaders are in recovery mode. We cannot throttle the indexing load as that would increase our overall indexing time. So to overcome this issue, we remove all the replicas before we trigger the indexing and add them back after the indexing finishes.
We observe the same behavior of replicas going into recovery when we do incremental indexing. We cannot remove replicas during our incremental indexing because it is also serving live traffic. We tried to throttle our indexing speed; however, the cluster still goes into recovery.
If we leave the cluster as it is, it eventually recovers a while after the indexing finishes. But as it is serving live traffic, we cannot have these replicas go into recovery mode, because our tests have shown that it also degrades search performance.
We have tried different commit settings like the below:
a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing
b) No auto soft commit, auto hard commit, and a commit at the end of indexing
c) Auto soft commit, no auto hard commit
d) Auto soft commit, auto hard commit
e) Different frequency settings for the commits above
Unfortunately, all of the above yield the same behavior. The replicas still go into recovery.
We have increased the ZooKeeper timeout from 30 seconds to 5 minutes and the problem persists.
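For reference, that timeout is the zkClientTimeout, configured in solr.xml; a sketch of the 5-minute value mentioned above (other solrcloud settings omitted):
<solrcloud>
<int name="zkClientTimeout">300000</int>
</solrcloud>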
Is there any setting that would fix this issue?
Garbage Collection pauses can exceed clientTimeout, resulting in the Zookeeper connection being broken, which causes an infinite cycle of recovery.
Frequent optimizes, commits, or updates, and poorly tuned segment merge configuration can result in excessive overhead when recovering. This overhead can cause a recovery loop.
Lastly, there seems to be some type of bug that can be encountered during recovery, which our organization has experienced. It's rare, but it seems to happen when network connections are flapping or unreliable. ZooKeeper disconnects trigger a recovery, and the recovery spikes memory; sometimes this can even cause an out-of-memory condition.
Update: BEWARE GRAPH QUERIES
The organization I work at experienced pauses from graph queries within Solr. The graph queries were part of a type-ahead plugin/component. When someone submitted long strings for type-ahead, the graph query grew complex and caused huge memory usage and GC pauses.

SOLR autoCommit vs autoSoftCommit

I'm very confused about autoCommit and autoSoftCommit. Here is what I understand:
autoSoftCommit - after an autoSoftCommit, if the Solr server goes down, the autoSoftCommit documents will be lost.
autoCommit - does a hard commit to disk, makes sure all the autoSoftCommit commits are written to disk, and commits any other documents.
My following configuration seems to work only with autoSoftCommit. autoCommit on its own does not seem to be doing any commits. Is there something I am missing?
<updateHandler class="solr.DirectUpdateHandler2">
<updateLog>
<str name="dir">${solr.ulog.dir:}</str>
</updateLog>
<autoSoftCommit>
<maxDocs>1000</maxDocs>
<maxTime>1200000</maxTime>
</autoSoftCommit>
<autoCommit>
<maxDocs>10000</maxDocs>
<maxTime>120000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
</updateHandler>
Why isn't autoCommit working on its own?
I think this article will be useful for you. It explains in detail how hard commits and soft commits work, and the tradeoffs that should be taken into account when tuning your system.
I always shudder at this, because any recommendation will be wrong in some cases. My first recommendation would be to not overthink the problem. Some very smart people have tried to make the entire process robust. Try the simple things first and only tweak things as necessary. In particular, look at the size of your transaction logs and adjust your hard commit intervals to keep these “reasonably sized”. Remember that the penalty is mostly the replay-time involved if you restart after a JVM crash. Is 15 seconds tolerable? Why go smaller then?
We’ve seen situations in which the hard commit interval is much shorter than the soft commit interval, see the bulk indexing bit below.
These are places to start.
HEAVY (BULK) INDEXING
The assumption here is that you’re interested in getting lots of data to the index as quickly as possible for search sometime in the future. I’m thinking original loads of a data source etc.
Set your soft commit interval quite long. As in 10 minutes. Soft commit is about visibility, and my assumption here is that bulk indexing isn't about near real time searching, so don't do the extra work of opening any kind of searcher.
Set your hard commit intervals to 15 seconds, openSearcher=false. Again the assumption is that you’re going to be just blasting data at Solr. The worst case here is that you restart your system and have to replay 15 seconds or so of data from your tlog. If your system is bouncing up and down more often than that, fix the reason for that first.
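Put together, those two bulk-indexing settings correspond roughly to this solrconfig.xml sketch:
<autoSoftCommit>
<maxTime>600000</maxTime> <!-- ~10 minutes: visibility only -->
</autoSoftCommit>
<autoCommit>
<maxTime>15000</maxTime> <!-- 15 seconds -->
<openSearcher>false</openSearcher>
</autoCommit>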
Only after you’ve tried the simple things should you consider refinements, they’re usually only required in unusual circumstances. But they include:
Turning off the tlog completely for the bulk-load operation
Indexing offline with some kind of map-reduce process
Only having a leader per shard, no replicas for the load, then turning on replicas later and letting them do old-style replication to catch up. Note that this is automatic, if the node discovers it is “too far” out of sync with the leader, it initiates an old-style replication. After it has caught up, it’ll get documents as they’re indexed to the leader and keep its own tlog.
etc.
INDEX-HEAVY, QUERY-LIGHT
By this I mean, say, searching log files. This is the case where you have a lot of data coming at the system pretty much all the time. But the query load is quite light, often to troubleshoot or analyze usage.
Set your soft commit interval quite long, up to the maximum latency you can stand for documents to be visible. This could be just a couple of minutes or much longer. Maybe even hours with the capability of issuing a hard commit (openSearcher=true) or soft commit on demand.
Set your hard commit to 15 seconds, openSearcher=false
INDEX-LIGHT, QUERY-LIGHT OR HEAVY
This is a relatively static index that sometimes gets a small burst of indexing. Say every 5-10 minutes (or longer) you do an update
Unless NRT functionality is required, I’d omit soft commits in this situation and do hard commits every 5-10 minutes with openSearcher=true. This is a situation in which, if you’re indexing with a single external indexing process, it might make sense to have the client issue the hard commit.
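A sketch of that setup, using an illustrative 10-minute interval from the 5-10 minute range mentioned above, with no autoSoftCommit section at all:
<autoCommit>
<maxTime>600000</maxTime>
<openSearcher>true</openSearcher>
</autoCommit>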
INDEX-HEAVY, QUERY-HEAVY
This is the Near Real Time (NRT) case, and is really the trickiest of the lot. This one will require experimentation, but here’s where I’d start
Set your soft commit interval to as long as you can stand. Don’t listen to your product manager who says “we need no more than 1 second latency”. Really. Push back hard and see if the user is best served or will even notice. Soft commits and NRT are pretty amazing, but they’re not free.
Set your hard commit interval to 15 seconds.
In my case (index heavy, query heavy), master-slave replication was taking too long, slowing down the queries to the slave. By increasing the softCommit to 15 min and the hardCommit to 1 min, the performance improvement was great. Now the replication works with no problems, and the servers can handle many more requests per second.
This is my use case though: I realized I don't really need the items to be available on the master in real time, since the master is only used for indexing items, and new items are available in the slaves every replication cycle (5 min), which is totally fine for my case. You should tune these parameters for your own case.
You have openSearcher=false for hard commits, which means that even though the commit happened, the searcher has not been reopened and cannot see the changes. Try changing that setting and you will not need soft commits.
A soft commit does reopen the searcher. So if you have both sections, the soft commit makes new changes visible (even if they are not hard-committed) and, as configured, the hard commit saves them to disk but does not change visibility.
This allows you to set the soft commit to 1 second and have documents show up quickly, while hard commits happen less frequently.
Soft commits are about visibility.
Hard commits are about durability.
Optimize is about performance.
Soft commits are very fast; their changes are visible, but the changes are not persisted (they are only in memory). So during a crash these changes might be lost.
Hard commit changes are persisted to disk.
Optimize is like a hard commit, but it also merges the Solr index segments into a single segment to improve performance. However, it is very costly.
A commit (hard commit) operation makes index changes visible to new search requests. A hard commit uses the transaction log to get the id of the latest document changes, and also calls fsync on the index files to ensure they have been flushed to stable storage and no data loss will result from a power failure.
A soft commit is much faster since it only makes index changes visible and does not fsync index files or write a new index descriptor. If the JVM crashes or there is a loss of power, changes that occurred after the last hard commit will be lost. Search collections that have NRT requirements (that want index changes to be quickly visible to searches) will want to soft commit often but hard commit less frequently. A softCommit may be "less expensive" in terms of time, but not free, since it can slow throughput.
An optimize is like a hard commit except that it forces all of the index segments to be merged into a single segment first. Depending on the use, this operation should be performed infrequently (e.g., nightly), if at all, since it involves reading and re-writing the entire index. Segments are normally merged over time anyway (as determined by the merge policy), and optimize just forces these merges to occur immediately.
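For completeness, all three operations can also be triggered explicitly against the update handler (the collection name here is a placeholder):
/solr/mycollection/update?commit=true
/solr/mycollection/update?softCommit=true
/solr/mycollection/update?optimize=true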
Auto commit properties can be managed in the solrconfig.xml file:
<autoCommit>
<maxTime>1000</maxTime>
</autoCommit>
<!-- SoftAutoCommit
Perform a 'soft' commit automatically under certain conditions.
This commit avoids ensuring that data is synched to disk.
maxDocs - Maximum number of documents to add since the last
soft commit before automatically triggering a new soft commit.
maxTime - Maximum amount of time in ms that is allowed to pass
since a document was added before automatically
triggering a new soft commit.
-->
<autoSoftCommit>
<maxTime>1000</maxTime>
</autoSoftCommit>
References:
https://wiki.apache.org/solr/SolrConfigXml
https://lucene.apache.org/solr/guide/6_6/index.html

Solr appears to block update requests while committing

We're running a master-slave setup with Solr 3.6 using the following auto-commit options:
maxDocs: 500000
maxTime: 600000
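In solrconfig.xml terms, that is roughly (a sketch):
<autoCommit>
<maxDocs>500000</maxDocs>
<maxTime>600000</maxTime>
</autoCommit>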
We have approx 5 million documents in our index which takes up approx 550GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15GB). We don't have a particularly high write throughput - about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25mins during which time we're unable to add any new documents to Solr.
One of the answers in the following question suggests using 2 cores and swapping them once one is fully updated. However, this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
Given the bounty reason and the problem mentioned in the question, here is a solution from Solr:
Solr has a capability called SolrCloud, beginning with the 4.x versions of Solr. Instead of the previous master/slave architecture, there are leaders and replicas. Leaders are responsible for indexing documents and replicas answer queries. The system is managed by ZooKeeper. If a leader goes down, one of its replicas is selected as the new leader.
All in all, if you want to divide up your indexing process, that happens automatically with SolrCloud, because there is one leader for each shard and each leader is responsible for indexing its shard's documents. When you send a query into the system, there will be some Solr nodes (of course, if there are more Solr nodes than the shard count) that are not responsible for indexing but are ready to answer the query. When you add more replicas, you will get faster query results (but it will cause more inbound network traffic when indexing, etc.).
For those facing a similar problem: the cause of my problem was that I had too many fields in the document. I used automatic fields (*_t), and the number of fields grows pretty fast; when it reaches a certain number, it just hogs Solr and commits take forever.
Secondly, I did some profiling, and it turned out most of the time is consumed by the string.intern() function call. The number of fields in the document seems to matter: when that number goes up, string.intern() seems to get slower.
The Solr 4 source appears to no longer use string.intern(). But a large number of fields still kills the performance quite easily.
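For context, the automatic *_t fields referred to above are dynamic fields, declared along these lines in schema.xml (the field type name here is the stock example one; the actual schema may differ):
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>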
