We are using Solr 4.2.1 and ZooKeeper 3.4.5, with 2 Solr servers.
Solr is reporting "No registered leader was found" and "WARNING ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK".
ZooKeeper is reporting "Exception when following the leader".
After restarting both, it works for some time and then reports the issue again.
Here are some additional logs from Solr:
SEVERE ZkController There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
org.apache.solr.common.SolrException: No registered leader was found, collection:www-live slice:shard1
SEVERE: shard update error StdNode: http://10.23.3.47:8983/solr/www-live/:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://10.23.3.47:8983/solr/www-live
SEVERE: Recovery failed - trying again... (5) core=www-live
From ZooKeeper:
2016-01-14 11:25:08,423 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower#89] - Exception when following the leader
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
Any help is much appreciated.
Thank you.
How many ZooKeeper servers do you have?
The ensemble must have an odd number of nodes for leader election. If you are running an even number, change it to an odd number and try again.
Three ZooKeeper servers is the minimum recommended size for an
ensemble, and we also recommend that they run on separate machines.
For reliable ZooKeeper service, you should deploy ZooKeeper in a
cluster known as an ensemble. As long as a majority of the ensemble
are up, the service will be available. Because Zookeeper requires a
majority, it is best to use an odd number of machines. For example,
with four machines ZooKeeper can only handle the failure of a single
machine; if two machines fail, the remaining two machines do not
constitute a majority. However, with five machines ZooKeeper can
handle the failure of two machines.
http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html
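For reference, a minimal three-node ensemble configuration looks roughly like this (hostnames and dataDir are placeholders for your environment); the same server.N lines go into zoo.cfg on all three nodes, and each node's myid file contains its own N:

# zoo.cfg - identical on zk1, zk2 and zk3 (hostnames are placeholders)
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=zk1.example.com:2888:3888
server.2=zk2.example.com:2888:3888
server.3=zk3.example.com:2888:3888

You can then check each node's role with a four-letter-word command such as echo stat | nc zk1.example.com 2181, which reports whether that node is currently the leader or a follower.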
Related
In our project's current architecture we use Solr for gathering, storing and indexing documents from different sources and making them searchable in near real-time.
Our web applications run on Tomcat and connect to Solr to create/modify the documents.
Solr uses ZooKeeper to keep the configuration centralized.
There are 5 servers in our cluster where we are running Solr.
When ZooKeeper restarts on one of the servers, the daemon thread created on that server doesn't complete its execution, and we get continuous logs with the exception below while trying to connect to ZooKeeper from the Tomcat instance:
org.apache.catalina.loader.WebappClassLoaderBase.checkStateForResourceLoading Illegal access: this web application instance has been stopped already. Could not load [org.apache.zookeeper.ClientCnxn$SendThread]. The following stack trace is thrown for debugging purposes as well as to attempt to terminate the thread which caused the illegal access.
After some time the server runs out of threads.
Can someone please help me with the question below?
Why doesn't the daemon thread complete its execution when we restart ZooKeeper?
Solr version: 8.5.1
ZooKeeper version: 3.5.5
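From what I have read so far, the ClientCnxn$SendThread in that stack trace belongs to the ZooKeeper/SolrJ client object created by the web application and only exits when that client is closed; if the webapp is stopped or reloaded without closing the client, the thread keeps running against the stopped classloader, which would explain the warning. A rough sketch of what I mean, tying the client to the webapp lifecycle (class, field and host names below are only illustrative):

import java.util.Collections;
import java.util.Optional;

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import javax.servlet.annotation.WebListener;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

// Hypothetical listener: ties the Solr/ZooKeeper client lifecycle to the webapp lifecycle.
@WebListener
public class SolrClientLifecycle implements ServletContextListener {

    // single shared client for the whole webapp (illustrative)
    private CloudSolrClient client;

    @Override
    public void contextInitialized(ServletContextEvent sce) {
        // the zkHost value is a placeholder for the real ensemble
        client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1.example.com:2181"), Optional.empty()).build();
        sce.getServletContext().setAttribute("solrClient", client);
    }

    @Override
    public void contextDestroyed(ServletContextEvent sce) {
        // Closing the client shuts down its ZooKeeper connection, including the
        // ClientCnxn$SendThread, so no thread outlives the webapp classloader.
        try {
            if (client != null) {
                client.close();
            }
        } catch (Exception e) {
            // best-effort cleanup during shutdown
        }
    }
}

Is something along these lines the expected way to handle it, or should the thread recover on its own after the ZooKeeper restart?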
Production server: Solr 5.4.1, Ruby on Rails, Ubuntu Server.
Solr suddenly stopped. When I restarted it, select/get queries worked, but as soon as any update/reindex job executed, Solr stopped again. I cannot find any error statement in the log either.
I compared the Solr logs of the running system and the stopped system, and found that after DirectUpdateHandler2 end_commit_flush runs, the log line below does not appear in the non-working system's log:
97588877 INFO (searcherExecutor-7-thread-1-processing-x:namecol) [x:namecol] o.a.s.c.SolrCore [namecol] Registered new searcher Searcher#1bf35cb6[namecol main{ExitableDirectoryReader(UninvertingDirectoryReader(Uninverting(_3rc22(5.4.1):C68771/19227:delGen=227) Uninverting(_4ee4k(5.4.1):C43777/12974) Uninverting(_4fogn(5.4.1):C13374/2400) Uninverting(_4fopo(5.4.1):c1712/83) Uninverting(_4fomr(5.4.1):c1150/216) Uninverting(_4foqs(5.4.1):c995/64) Uninverting(_4for4(5.4.1):c156) Uninverting(_4for8(5.4.1):c94) Uninverting(_4for9(5.4.1):c3)))}
Which part do I need to check? I have set softCommit to -1, so Solr no longer stops after frontend changes, but it also does not reflect updates in select results until I restart it again.
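For reference, as far as I understand it, whether updates become visible without a restart is controlled by the commit settings in solrconfig.xml; the relevant section looks roughly like this (the times below are only illustrative):

<!-- hard commit: persists data to disk but does not open a new searcher -->
<autoCommit>
  <maxTime>60000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- soft commit: makes new documents visible to searches; a maxTime of -1 disables it,
     so updates stay invisible until an explicit commit or a restart -->
<autoSoftCommit>
  <maxTime>30000</maxTime>
</autoSoftCommit>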
As a workaround, I created a new core and re-indexed all the data.
I also updated Solr to version 8.8.2 for a more stable release.
PROBLEM!!
After setting up my logical replication, with everything running smoothly, I wanted to dig into the logs just to confirm there were no errors there. But when I ran tail -f postgresql.log, I found the following error kept recurring: ERROR: could not start WAL streaming: ERROR: replication slot "sub" is active for PID 124898
SOLUTION!!
This is the simple solution: I went into my postgresql.conf file and searched for wal_sender_timeout on the master and wal_receiver_timeout on the slave. The value was 120s for both, and I changed both to 300s, which is equivalent to 5 minutes. Then remember to reload both servers; a restart is not required. Wait for about 5 to 10 minutes and the error is fixed.
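In config terms the change comes down to something like this (file locations depend on your install):

# postgresql.conf on the publisher (master)
wal_sender_timeout = 300s

# postgresql.conf on the subscriber (slave)
wal_receiver_timeout = 300s

Both settings can be picked up without a restart, for example by running SELECT pg_reload_conf(); in psql on each server.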
We had an identical error message in our logs and tried this fix, and unfortunately our case was much more diabolical. Putting the notes here just for the next poor soul. In our case the publishing instance was an AWS managed RDS server, and it managed (ha ha) to create such a WAL backlog that it kept going into catchup state, processing the WAL and running out of memory (getting killed by the OS every time) before it caught up. The experience on the client side was exactly what you see here: timeouts and failed WAL streaming. The fix was kind of nasty: we had to drop the whole replication link and rebuild it (fortunately it was a test database, so no harm done, but it's a situation you want to avoid). It was obvious after looking at the logs on the publisher side, but from the subscription side it was more mysterious.
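For the next person: "dropping the whole replication link and rebuilding it" roughly meant dropping and recreating the subscription on the subscriber side; a sketch with placeholder names and connection details (this discards the old slot and re-copies the published tables):

-- on the subscriber; subscription, publication and connection values are placeholders
DROP SUBSCRIPTION IF EXISTS sub;
CREATE SUBSCRIPTION sub
    CONNECTION 'host=publisher.example.com dbname=appdb user=repl password=secret'
    PUBLICATION pub
    WITH (copy_data = true);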
Solr version 8.5.1
My Solr is not starting anymore. I use the solr start command to start Solr. Every time I run this command I see the following error:
Java HotSpot(TM) 64-Bit Server VM warning: JVM cannot use large page memory because it does not have enough privilege to lock pages in memory.
Waiting up to 30 to see Solr running on port 8983
ERROR: Solr at http://localhost:8983/solr did not come online within 30 seconds!
There is no error in the log files, but connecting to Solr fails. This was working earlier.
Could someone please help me troubleshoot the issue?
I found out what the issue is. Even though the message indicated that the server did not start within 30 seconds, it did start after some time.
I had closed the console window, assuming the server was running in the background, and that killed the server. The server stays up only as long as I keep open the command window that I used to start it.
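For anyone hitting the same thing, a couple of standard bin/solr commands are useful here (assuming a tarball install; on Windows the script is bin\solr.cmd):

# check whether Solr is actually up, regardless of the 30-second warning
bin/solr status

# run Solr in the foreground so it deliberately lives and dies with the terminal
bin/solr start -f

# stop the instance listening on port 8983
bin/solr stop -p 8983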
I read the documentation, which says 7199 is the JMX port, 8983 is the Solr port and 9160 is the Cassandra client port. But if I start
dse cassandra -s
it starts Solr. If I start the Cassandra client on the same machine with
dse cassandra -f
it says:
Error: Exception thrown by the agent : java.rmi.server.ExportException: Port already in use: 7199; nested exception is:
java.net.BindException: Address already in use
So I understand that both try to use the same JMX port number.
Is there any way to specify two port numbers, one for Solr and one for Cassandra, OR is there any way to start both on the same machine?
I am using the DataStax Enterprise 2.2.2 tarball setup.
Any ideas?
You only need to start DSE once. It runs Search and C* in the same JVM and serves on all the ports you mentioned above.
As you mention above, use this command for a tarball install to start DSE in Search mode. Do this across your cluster (rolling restart, no downtime required):
bin/dse cassandra -s
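Once it is up, you can verify the node and its workload with dsetool (as far as I know, -f just runs the same server process in the foreground rather than starting a separate client):

# verify the node is up and which workload it is running
bin/dsetool ring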