Solr cloud: No registered leader was found after waiting for 4000ms

I have created 6 collections, each with 3 shards and 2 replicas (Solr version 5.5.0). For a few days my setup was working fine, but then I started getting the following error:
Error while trying to recover. core=Collection1_shard3_replica2:org.apache.solr.common.SolrException:
No registered leader was found after waiting for 4000ms, collection: Collection1 slice: shard3
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:607)
at org.apache.solr.common.cloud.ZkStateReader.getLeaderRetry(ZkStateReader.java:593)
at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:308)
at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:224)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:231)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I tried restarting both ZooKeeper and Solr, and also increased the heap memory to 10 GB, but I am still getting the issue.

We experienced the same problem on a 3-node cluster (6 CPUs and 30 GB of memory per node). Below are the steps I took to reach a solution.
What we already tried and did not work:
Stopping the Solr processes and restarting them
Increasing/decreasing the memory of the Solr JVM
Recreating the collection, but this was only a temporary fix for a day or so
Solr GC tuning:
https://wiki.apache.org/solr/SolrPerformanceProblems#GC_pause_problems
What fixed the problem of "No registered leader found":
Reducing the number of shards; basically we were oversharding. We went from 6 shards down to 3 and kept 3 replicas, which means that every node now hosts 3 shards.
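For reference, recreating a collection with that kind of layout is normally done through the Collections API; a rough sketch (host, port and the collection name are just taken from this post, adjust them to your setup):
http://localhost:8983/solr/admin/collections?action=CREATE&name=Collection1&numShards=3&replicationFactor=3&maxShardsPerNode=3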
However, since we were indexing tens of thousands of messages per second, I was also wondering what our CPUs were doing. So I monitored CPU load and CPU IO and found that the CPUs were working at their maximum all the time, causing high IO wait, which I suspected was the biggest problem.
Because of this high IO wait, the replicas had a hard time staying in sync.
I reduced the workload (messages being sent to Solr) so that the indexes didn't grow as fast as before. This helped bring everything back to normal. My Solr cluster has been green for a while now, with no further "election problems". IO wait dropped below 25 ms and CPU usage sat around 70% instead of being at almost 100% all the time.
Generally, this kind of problem is very difficult to tackle, since a Solr cluster may work fine for days (I have seen reports of months in other posts) before it surfaces. Monitor IO wait and the traffic coming into your Solr nodes: if traffic spikes occur, (daily) indices may grow too large. You can also add more nodes and split shards, which reduces the load on each machine (see the sketch below). I chose to reduce the traffic to the Solr machines instead, since we use Solr as an audit store and do not need part of the audit logging.
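If you go the route of adding nodes and splitting shards instead, the Collections API provides a SPLITSHARD action; a minimal sketch (collection and shard names are illustrative):
http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=Collection1&shard=shard1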

Related

Solr indexing slowdown

I'm working on a product which indexes a high volume of small documents.
When Solr starts, it indexes at a rate of 35k docs/sec for around 20 minutes and then slows down to 24k/sec.
If I restart the server, it will again index at 35k/sec for 20 minutes and then slow down again.
I have a softCommit every 5 seconds and a hard commit every minute.
I was wondering if someone might have some insight about this?
I don't think it is related to merges, since I see merger threads kicking in after 2-3 minutes.
You should check the usual suspects:
There is a problem with the Java (or whatever language you are using) application that you use to index. If that's the case, please specify the implementation details and I will provide more guidelines.
Your NRT cache fills up after 20 minutes and the hard commit doesn't happen quickly enough. To check this option, set the maximum number of documents to index before the cached docs are written to disk, in the following way:
<autoCommit>
<maxDocs>10000</maxDocs>
</autoCommit>
If this turns out to be the issue, you can then tune the autoCommit settings or the NRT cache management.
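A commonly used variant, not from the answer above but useful for comparison, pairs a time-based hard commit that does not open a searcher with a more frequent soft commit; the values below simply mirror the 5-second soft commit and 1-minute hard commit mentioned in the question:
<autoCommit>
<maxTime>60000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
<autoSoftCommit>
<maxTime>5000</maxTime>
</autoSoftCommit>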

Solr PERFORMANCE WARNING: Overlapping onDeckSearchers

We've been having a number of problems with our Solr search engine in our test environments. We have a SolrCloud setup on version 4.6, single shard, 4 nodes. We see the CPU flatline at 100% on the leader node for several hours, then the server starts to throw OutOfMemory errors, 'PERFORMANCE WARNING: Overlapping onDeckSearchers' starts appearing in the logs, the leaders enter recovery mode, the filter cache and query cache warmup times hit around 60 seconds (normally less than 2 secs), the leader node goes down, and we suffer an outage for the whole cluster for a few minutes while it recovers and elects a new leader. We think we're hitting a number of Solr bugs in the 4.6 and 4.x branch, and so are looking to move to 5.3.
We also recently dropped our soft commit time from 10 mins to 2 mins. I am seeing regular CPU spikes every 2 mins on all nodes, but the spikes are low, from 20-50% (max 100) on a 2-minute cycle; when the CPU is maxed out I obviously can't see those spikes. Hard commits are every 15 seconds, with openSearcher set to false. We have a heavy query and index load type of scenario.
I am wondering whether the frequent soft commits are having a significant effect on this issue, or whether the long auto warm times on the caches are caused by the other issues we are experiencing (cause or symptom)? We recently increased the indexing load on the server, but we need to address these issues in the test environment before we can promote to production.
Cache settings:
<filterCache class="solr.FastLRUCache"
size="5000"
initialSize="5000"
autowarmCount="1000"/>
<queryResultCache class="solr.LRUCache"
size="20000"
initialSize="20000"
autowarmCount="5000"/>
We had this problem with Solr 4.10 (and, very rarely, 5.1). In our case, we were indexing quite frequently and commits were starting to become too close together. Sometimes our optimize command would run a bit longer than expected.
We solved it by making sure no indexing or commits occurred for at least ten minutes after the optimize operation started. We also auto-warmed fewer queries for our caches (a sketch of that kind of change follows after the links). The following links will probably be useful to you if you haven't found them already:
Overlapping onDeckSearchers--Solr mailing list
The Solr Wiki
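As a rough sketch of that kind of change (the numbers are illustrative, not the original poster's actual values): lower the autowarmCount values from the question, and cap the number of concurrent warming searchers in solrconfig.xml, which is what the Overlapping onDeckSearchers warning relates to:
<filterCache class="solr.FastLRUCache"
size="5000"
initialSize="5000"
autowarmCount="100"/>
<queryResultCache class="solr.LRUCache"
size="20000"
initialSize="20000"
autowarmCount="500"/>
<maxWarmingSearchers>2</maxWarmingSearchers>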

SolrCloud disappears after 15-20 minutes

The setup
We have set up a SolrCloud (Solr version 4.10.4) cluster consisting of 6 servers distributed over 2 datacenters (3 in each DC).
The cluster is set up with 3 shards and a replication factor of 2 and handles one core with 45M documents, averaging about 100GB per shard. There are 3 ZooKeeper instances regulating the cluster, residing on 3 of the 6 servers (the ones in the first DC).
The core resides on a 6Gb/s SSD drive on all shards.
The intra-DC ping time is in the region of 0.3ms, while the inter-DC one is in the region of 3 ms.
The cluster runs on Tomcat 7.0.61 and Java 7 with 26GB of memory allocated, while each server has 32GB available; each node is configured to contact ZooKeeper every 30 seconds.
The cache configuration for each solr node is as follows
<filterCache class="solr.FastLRUCache"
size="40000"
initialSize="40000"
autowarmCount="0"/>
<queryResultCache class="solr.LRUCache"
size="50000"
initialSize="20000"
autowarmCount="0"/>
<documentCache class="solr.LRUCache"
size="2000000"
initialSize="2000000"
/>
<fieldValueCache class="solr.FastLRUCache"
size="8"
autowarmCount="8"
showItems="8" />
On top of that we have an API application that performs search operations that most of the time look like:
q=Fragmento+de+retablo+NOT+DATA_PROVIDER%3A%22CER.ES%3A+Red+Digital+de+Colecciones+de+museos+de+Espa%C3%B1a%22&
rows=12&start=0&
sort=score+desc&
timeAllowed=30000&fl=*%2Cscore&facet.mincount=1
We use one or at most two sort parameters (the second one being the unique id of our schema, though not in this example).
The problem
Our API sends around 5-10 queries per second to the cluster. Even that minimal number of requests eventually overwhelms the cluster: nodes start disappearing while a lot of disk I/O is observed at the same time. We do some manual cache warming for about 10 minutes before we make the core available to the API, and we notice that after a while (and before the crash of the cluster) the hit ratio on the caches is 1 for all except queryResultCache=0.67 and documentCache=0.9, with no evictions happening either. Memory consumption is around 88%.
Any ideas what can be wrong or where we should focus will be highly appreciated.
A memory consumption of around 88 percent can quickly jump to 100 and kill the cores.
That happened to us... Look for core dump files in the individual cores' logs.
SolrCloud is also susceptible to high CPU spikes that can make ZooKeeper think the node is dead... Recovery is slow and sometimes doesn't happen at all.
You can change the default ZooKeeper timeout to prevent this from happening.
You can see this bug, for example, in this issue:
https://issues.apache.org/jira/browse/SOLR-5565
From your comment I see that you should probably raise the timeout to about 2 minutes.
This comes at a price of course - try to read a bit and understand what it means:
https://zookeeper.apache.org/doc/r3.1.2/zookeeperStarted.html
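For reference, in Solr 4.x the ZooKeeper client timeout is usually raised via the <solrcloud> section of solr.xml (this assumes the new-style solr.xml; 120000 ms matches the roughly 2 minutes suggested above):
<solrcloud>
<!-- other solrcloud settings left as they are; only the client timeout is raised -->
<int name="zkClientTimeout">${zkClientTimeout:120000}</int>
</solrcloud>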

SolrCloud replicas go into recovery mode right after an update

We have a SolrCloud cluster with 10 shards and 4 replicas per shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas per shard. Our current commit settings are as follows:
<autoSoftCommit>
<maxDocs>500000</maxDocs>
<maxTime>180000</maxTime>
</autoSoftCommit>
<autoCommit>
<maxDocs>2000000</maxDocs>
<maxTime>180000</maxTime>
<openSearcher>false</openSearcher>
</autoCommit>
We indexed roughly 90 million docs. We have two different ways to index documents:
a) Full indexing. It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6,000 per second.
b) Incremental indexing. It takes an hour to index the delta changes. There are roughly 3 million changes and the rate of docs coming to the searchers is 2,500 per second.
We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 serves live traffic. After it finishes, we swap the collections using aliases so that search2 serves live traffic while search1 becomes available for the next full indexing run.
When we do incremental indexing we do it in the search1 collection which is serving live traffic.
All our searchers have 12 GB of RAM available and quad-core Intel(R) Xeon(R) X5570 CPUs @ 2.93GHz.
We have observed the following issue when we trigger indexing.
About 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes more and more replicas go into recovery mode, and after about half an hour all replicas except the leaders are in recovery. We cannot throttle the indexing load as that would increase our overall indexing time. So to overcome this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes.
We observe the same behavior of replicas going into recovery when we do incremental indexing. We cannot remove replicas during incremental indexing because it is also serving live traffic. We tried to throttle our indexing speed; however, the cluster still goes into recovery.
If we leave the cluster as is, it eventually recovers a while after the indexing finishes. But as it is serving live traffic, we cannot have these replicas go into recovery mode, because our tests have shown that it also degrades search performance.
We have tried different commit settings like the ones below:
a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing
b) No auto soft commit, auto hard commit, and a commit at the end of indexing
c) Auto soft commit, no auto hard commit
d) Auto soft commit, auto hard commit
e) Different frequency settings for the commits above
Unfortunately, all of the above yield the same behavior: the replicas still go into recovery.
We have increased the zookeeper timeout from 30 seconds to 5 minutes and the problem persists.
Is there any setting that would fix this issue ?
Garbage collection pauses can exceed the ZooKeeper client timeout, resulting in the ZooKeeper connection being broken, which causes an infinite cycle of recovery.
Frequent optimizes, commits, or updates, and poorly tuned segment merge configuration can result in excessive overhead when recovering. This overhead can cause a recovery loop.
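Where poorly tuned segment merging is the suspect, the merge policy lives in the <indexConfig> section of solrconfig.xml. A hedged sketch for newer Solr releases (older 4.x configs use <mergePolicy> instead of <mergePolicyFactory>, and the values are purely illustrative, not a recommendation):
<indexConfig>
<mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
<int name="maxMergeAtOnce">10</int>
<int name="segmentsPerTier">10</int>
</mergePolicyFactory>
</indexConfig>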
Lastly, there seems to be some type of bug that can be encountered during recovery, which our organization has experienced. It's rare, but it seems to happen when network connections are flapping or unreliable. ZooKeeper disconnects trigger a recovery, and the recovery spikes memory; sometimes this can even cause an out-of-memory condition.
Update BEWARE GRAPH QUERIES
The organization I work at experienced pauses from graph queries within Solr. The graph queries were part of a type-ahead plugin/component. When someone submitted long strings for type-ahead, the graph query grew complex and caused huge memory usage and GC pauses.

Solr appears to block update requests while committing

We're running a master-slave setup with Solr 3.6 using the following auto-commit options:
maxDocs: 500000
maxTime: 600000
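In solrconfig.xml terms those auto-commit options correspond roughly to:
<autoCommit>
<maxDocs>500000</maxDocs>
<maxTime>600000</maxTime>
</autoCommit>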
We have approx 5 million documents in our index which takes up approx 550GB. We're running both master and slave on Amazon EC2 XLarge instances (4 virtual cores and 15GB). We don't have a particularly high write throughput - about 100 new documents per minute.
We're using Jetty as a container which has 6GB allocated to it.
The problem is that once a commit has started, all our update requests start timing out (we're not performing queries against this box). The commit itself appears to take approx 20-25mins during which time we're unable to add any new documents to Solr.
One of the answers to the following question suggests using 2 cores and swapping them once the updated one is fully built. However, this seems a little over the top.
Solr requests time out during index update. Perhaps replication a possible solution?
Is there anything else I should be looking at regarding why Solr seems to be blocking requests? I'm optimistically hoping there's a "dontBlockUpdateRequestsWhenCommitting" flag in the config that I've overlooked...
Many thanks,
Given the bounty reason and the problem mentioned in the question, here is a solution from the Solr side:
Starting with version 4.x, Solr has a capability called SolrCloud. Instead of the previous master/slave architecture, there are leaders and replicas. Leaders are responsible for indexing documents, and replicas answer queries. The system is managed by ZooKeeper. If a leader goes down, one of its replicas is selected as the new leader.
All in all, if you want to distribute your indexing process, SolrCloud handles that automatically, because there is one leader for each shard and each leader is responsible for indexing its shard's documents. When you send a query into the system, there will be some Solr nodes (provided there are more Solr nodes than shards) that are not responsible for indexing but are ready to answer the query. When you add more replicas, you will get faster query results (but it will cause more inbound network traffic when indexing, etc.).
For those facing a similar problem: the cause of my problem was that I had too many fields in the document. I used the automatic fields *_t, so the number of fields grows pretty fast, and when it reaches a certain number it just hogs Solr and commits take forever.
Secondly, I put some effort into profiling, and it turned out most of the time was consumed by the string.intern() function call; it seems the number of fields in the document matters, and when that number goes up, string.intern() seems to get slower.
The Solr 4 source appears to no longer use string.intern(), but a large number of fields still kills performance quite easily.
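For context, the *_t fields mentioned above are the kind produced by a dynamic field rule in schema.xml such as the following (the field type name depends on the actual schema):
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>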
