SolrCloud with embedded ZooKeeper server says: "ZooKeeperServer not running" - solr

When I start my SolrCloud server, Solr opens a socket connection to the embedded ZooKeeper server but says: "ZooKeeperServer not running".
It doesn't state a reason.
How can I figure out why the ZooKeeper server isn't actually running?
2012-05-30 15:02:36.538 [main] INFO org.apache.solr.cloud.SolrZkServer - STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983
2012-05-30 15:02:36.545 [Thread-14] INFO o.a.z.server.ZooKeeperServerMain - Starting server
2012-05-30 15:02:36.552 [Thread-14] INFO o.a.zookeeper.server.ZooKeeperServer - Server environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
... [snip] ...
2012-05-30 15:02:37.092 [main-SendThread()] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:9983
2012-05-30 15:02:37.097 [main-SendThread(localhost:9983)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to localhost/127.0.0.1:9983, initiating session
2012-05-30 15:02:37.097 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] INFO o.a.zookeeper.server.NIOServerCnxn - Accepted socket connection from /127.0.0.1:43635
2012-05-30 15:02:37.100 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN o.a.zookeeper.server.NIOServerCnxn - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2012-05-30 15:02:37.100 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] INFO o.a.zookeeper.server.NIOServerCnxn - Closed socket connection for client /127.0.0.1:43635 (no session established for client)
2012-05-30 15:02:37.101 [main-SendThread(localhost:9983)] INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect

In my case, a number of extra files in my conf/ directory seemed to be causing the problem. Keep only the files that are strictly necessary in that directory so that the embedded ZooKeeper can start properly.
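To see what the server itself reports, one option is to send it ZooKeeper's four-letter commands ("ruok", "stat") over a plain socket. The sketch below is a minimal probe, not part of Solr or ZooKeeper: the helper name four_letter_word is made up for illustration, and port 9983 is simply the embedded port shown in the log above. Newer ZooKeeper releases restrict these commands through the 4lw.commands.whitelist setting, so an empty reply does not by itself prove the server is down.

```python
import socket


def four_letter_word(host, port, command, timeout=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            # The server closes the connection once it has replied.
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Port 9983 is the embedded ZooKeeper port from the log above.
    for cmd in ("ruok", "stat"):
        try:
            reply = four_letter_word("localhost", 9983, cmd).strip()
            print(cmd, "->", reply or "(empty reply)")
        except OSError as exc:
            print(cmd, "failed:", exc)
```

A healthy server normally answers "ruok" with "imok"; the "stat" output is the more useful one for finding out whether the instance is actually serving requests.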

Related

Flink cluster unable to boot up - getChildren() failed w/ error = -6

I set up a new Flink cluster (v1.15) in a Kubernetes cluster. The new cluster is set up in the same namespace in which an existing Flink cluster (v1.13) is running fine.
The job-manager of the new Flink cluster is in a CrashLoopBackOff state. It continuously prints the following set of messages, which includes a specific ERROR message:
2022-10-10 23:02:47,214 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,214 ERROR org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState [] - Authentication failed
2022-10-10 23:02:47,215 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established, initiating session, client: /100.98.125.116:57754, server: flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session establishment complete on server flink-zk-client-service/100.65.161.135:2181, sessionid = 0x381a609d51f0082, negotiated timeout = 4000
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: RECONNECTED
2022-10-10 23:02:47,216 INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,218 ERROR org.apache.flink.shaded.curator5.org.apache.curator.framework.recipes.leader.LeaderLatch [] - getChildren() failed. rc = -6 <============
2022-10-10 23:02:47,218 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Unable to read additional data from server sessionid 0x381a609d51f0082, likely server has closed socket, closing socket connection and attempting reconnect
It seems the error message indicates that a specific node of the new cluster in ZK either does not exist or does not have any children, but I could be off. I am using the same ZooKeeper for the v1.13 cluster and have no issues with that cluster.
Contents of ZooKeeper (the cluster ID is dev-cl2):
ls /dev-cl2/dev-cl2
[leader]
ls /dev-cl2/dev-cl2/leader
[]
get /dev-cl2/dev-cl2/leader
// Nothing printed
Any help or pointers to troubleshooting this issue would be greatly appreciated. Thank you.
Update1:
I noticed the following in the 1.15 release notes:
A new multiple component leader election service was implemented that only runs a single leader election per Flink process. If this should cause any problems, then you can set high-availability.use-old-ha-services: true in the flink-conf.yaml to use the old high availability services.
As a test, I set high-availability.use-old-ha-services: true. It did not have any effect.
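A note on reading the rc value: in ZooKeeper's KeeperException.Code enumeration, -6 is UNIMPLEMENTED, while a missing node would be NONODE (-101). UNIMPLEMENTED generally means the server did not recognise the requested operation, which can happen when a newer client talks to an older server; the post does not confirm whether that is the case here. The lookup below is a partial table written from memory, and describe_rc is a hypothetical helper name, so check it against the ZooKeeper version in use.

```python
# Partial copy of ZooKeeper's KeeperException.Code values (written from memory;
# verify against the ZooKeeper release you actually run).
KEEPER_CODES = {
    0: "OK",
    -1: "SYSTEMERROR",
    -2: "RUNTIMEINCONSISTENCY",
    -3: "DATAINCONSISTENCY",
    -4: "CONNECTIONLOSS",
    -5: "MARSHALLINGERROR",
    -6: "UNIMPLEMENTED",
    -7: "OPERATIONTIMEOUT",
    -8: "BADARGUMENTS",
    -101: "NONODE",
    -102: "NOAUTH",
    -103: "BADVERSION",
    -110: "NODEEXISTS",
    -111: "NOTEMPTY",
    -112: "SESSIONEXPIRED",
    -115: "AUTHFAILED",
}


def describe_rc(rc):
    """Translate a numeric ZooKeeper result code into its symbolic name."""
    return KEEPER_CODES.get(rc, f"UNKNOWN({rc})")


if __name__ == "__main__":
    # The value reported by LeaderLatch in the log above.
    print("rc = -6 means", describe_rc(-6))
```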

Using a remote database as spring boot datasource

I'm trying to configure a remote IBM DB2 database as the Spring Boot datasource. I have added the following configuration to my application.properties file:
spring.jpa.hibernate.ddl-auto=none
spring.datasource.url=jdbc:db2://<dbhost>:<dbport>/<db>
spring.datasource.username=<username>
spring.datasource.password=<password>
I even added the same properties in application.yml:
spring:
  datasource:
    url: jdbc:db2://dashdb-txn-sbox.services.eu-gb.bluemix.net:3000/BLUDB:sslConnection=true;
    username: <username>
    password: <password>
    driverClassName: com.ibm.db2.jcc.DB2Driver
  jpa:
    properties:
      hibernate:
        dialect: org.hibernate.dialect.DB2Dialect
However, I'm still getting this error:
A communication error occurred during operations on the connection's underlying socket, socket input stream, or socket output stream. Error location: Reply.fill() - socketInputStream.read (-1). Message: Read timed out. ERRORCODE=-4499, SQLSTATE=08001
This question is more about configuration than programming.
See this FAQ for JDBC ERRORCODE -4499
which mentions:
(A.5) Message: Read timed out
This message is returned when client is waiting for reply from the
server and the server did not reply in time. Could be caused by client
timeout. Ensure no timeouts set in JDBC driver properties:
blockingReadConnectionTimeout=0 (default)
commandTimeout=0 (default)
loginTimeout = 0 (default)
Could also be caused by server or network issues.
If the issue is persistent, ensure you are using the latest Db2 JDBC driver (at the time of writing, that would be version 4.26.14 or higher).
You can use a JDBC trace (follow the instructions in the IBM Db2 documentation to enable it) to look under the covers and see exactly what is happening.
Ensure the remote Db2 server has sufficient compute resources to respond in time. You may need to open a ticket with your cloud vendor (IBM) if the JDBC trace suggests a server-side issue that is not under your direct control.
Also, 50001 is the usual (default) port number for SSL connections, not 3000 as you have in your question.
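The port point can be checked mechanically. The sketch below splits a Db2 JDBC URL of the form jdbc:db2://host:port/database:property=value; into its parts and flags the combination of sslConnection=true with a port other than 50001. Both function names are hypothetical helpers, and the expected SSL port is just the default mentioned above; a given service may listen elsewhere.

```python
import re

# jdbc:db2://<host>[:<port>]/<database>[:<prop>=<value>;...]
_DB2_URL = re.compile(
    r"^jdbc:db2://(?P<host>[^:/]+)(?::(?P<port>\d+))?/(?P<database>[^:]+)"
    r"(?::(?P<props>.*))?$"
)


def parse_db2_url(url):
    """Split a Db2 JDBC URL into host, port, database and properties."""
    match = _DB2_URL.match(url.strip())
    if match is None:
        raise ValueError(f"not a recognised Db2 JDBC URL: {url!r}")
    props = {}
    for item in (match.group("props") or "").split(";"):
        if item.strip():
            key, _, value = item.partition("=")
            props[key.strip()] = value.strip()
    port = match.group("port")
    return {
        "host": match.group("host"),
        "port": int(port) if port else None,
        "database": match.group("database"),
        "properties": props,
    }


def ssl_port_warning(url, expected_ssl_port=50001):
    """Return a warning string if SSL is requested on an unexpected port."""
    parsed = parse_db2_url(url)
    uses_ssl = parsed["properties"].get("sslConnection", "").lower() == "true"
    if uses_ssl and parsed["port"] != expected_ssl_port:
        return (
            f"sslConnection=true but port is {parsed['port']}, "
            f"expected {expected_ssl_port}"
        )
    return None


if __name__ == "__main__":
    # The URL from the application.yml in the question.
    url = ("jdbc:db2://dashdb-txn-sbox.services.eu-gb.bluemix.net:3000/"
           "BLUDB:sslConnection=true;")
    print(parse_db2_url(url))
    print(ssl_port_warning(url))
```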

Zookeeper errors

I am using Solr with ZooKeeper (ZK 3.4.10 and Solr 6.6) and see the following errors in the ZooKeeper logs:
EndOfStreamException: Unable to read additional data from client sessionid 0x1XXXXXXX, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:745)
2019-04-28 06:24:59,939 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1044] - Closed socket connection for client /10.40.96.193:46260 which had sessionid 0x1XXXXXXX
The ZooKeeper config:
tickTime=2000
initLimit=10
syncLimit=5
Do these config values cause the above exception? If so, can someone explain whether we should increase or decrease initLimit and syncLimit?
Thanks in advance.
Those 3 config parameters only apply to the ZooKeeper servers (the ensemble) and are irrelevant to your exception. They govern synchronization between the leader and the followers.
Your client connection exception is more likely caused by a network issue (maybe TCP keep-alive settings).
See the ZooKeeper Administrator's Guide, "Cluster Options", for more information on initLimit and syncLimit.
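For a sense of scale: initLimit and syncLimit are expressed in ticks, and tickTime is the tick length in milliseconds, so the posted values give followers 20 seconds to connect and sync initially and 10 seconds to stay in sync. The sketch below does that arithmetic. The function names are made up for illustration, and the session timeout bounds use the documented defaults of 2 and 20 ticks, which apply only if minSessionTimeout and maxSessionTimeout are not set explicitly.

```python
def parse_zoo_cfg(text):
    """Parse key=value lines of a zoo.cfg, ignoring blanks and comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def timing_summary(text):
    """Convert tick-based limits into milliseconds."""
    cfg = parse_zoo_cfg(text)
    tick = int(cfg["tickTime"])
    return {
        # Time followers get to connect to and sync with the leader.
        "init_limit_ms": int(cfg["initLimit"]) * tick,
        # How far a follower may lag behind before it is dropped.
        "sync_limit_ms": int(cfg["syncLimit"]) * tick,
        # Bounds the server applies to client session timeouts (defaults).
        "min_session_timeout_ms": int(cfg.get("minSessionTimeout", 2 * tick)),
        "max_session_timeout_ms": int(cfg.get("maxSessionTimeout", 20 * tick)),
    }


if __name__ == "__main__":
    posted = "tickTime=2000\ninitLimit=10\nsyncLimit=5\n"
    for name, value in timing_summary(posted).items():
        print(f"{name}: {value}")
```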

org.apache.curator.ConnectionState Connection timed out for connection string

I have been facing this issue with my Solr instance, which is managed by ZooKeeper.
It appears that Solr is able to send requests to ZooKeeper, which momentarily accepts the request and then refuses it.
In the ZooKeeper logs, I have been seeing this error:
INFO org.apache.zookeeper.ZooKeeper.Client.environment:user.dir=/ [1635628661#qtp-2049348234-50]
INFO org.apache.zookeeper.ZooKeeper Initiating client connection, connectString=localhost:2181 sessionTimeout=150000 watcher=org.apache.curator.ConnectionState#c4f2fbd [1635628661#qtp-2049348234-50]
INFO org.apache.zookeeper.ClientCnxn Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) [1635628661#qtp-2049348234-50-SendThread(localhost:2181)]
INFO org.apache.zookeeper.ClientCnxn Socket connection established to localhost/127.0.0.1:2181, initiating session [1635628661#qtp-2049348234-50 SendThread(localhost:2181)]
ERROR org.apache.curator.ConnectionState Connection timed out for connection string (localhost:2181) and timeout (15000) / elapsed (15290) [1635628661#qtp-204934823450]
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:191)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:86)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)
at org.apache.curator.framework.imps.CuratorFrameworkImpl getZooKeeper(CuratorF
Any help is appreciated here.
According to the log, a socket was opened to localhost:2181, so the line:
INFO org.apache.zookeeper.ClientCnxn Socket connection established to localhost/127.0.0.1:2181, initiating session [1635628661#qtp-2049348234-50 SendThread(localhost:2181)]
means that an open socket was found and the client now attempts to write data to it. It sends a connection request containing the session ID and password. If no session has been established yet, it sends 0 as the session ID, but still sends a password.
If you enable debug output, you should then see something like this in the log:
Session establishment request sent on <remote address>
The log record you are asking about:
ERROR org.apache.curator.ConnectionState Connection timed out for connection string (localhost:2181) and timeout (15000) / elapsed (15290) [1635628661#qtp-204934823450]
comes from Curator itself. If the client is not connected, it calls checkTimeouts(), and if the result of the timeout check is 'CONNECTION_TIMEOUT', it generates a record like the one above.
There is not much information to go on, but my guess is that there is a ZooKeeper on your localhost and the connection is being rejected; maybe a password is required, or something else.
Hope this helps.
(my answer is based on curator code from master here -> https://github.com/apache/curator)
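To make the numbers in that record concrete, here is a simplified model of the timeout check, not Curator's actual code. It assumes the threshold is the smaller of the session timeout (150000 in the log) and the connection timeout (15000), and the "WAITING" result name is invented for the sketch.

```python
def check_timeouts(elapsed_ms, session_timeout_ms, connection_timeout_ms):
    """Simplified model: has the client waited too long to get connected?"""
    # Assumption: the smaller of the two timeouts is the one that counts.
    min_timeout = min(session_timeout_ms, connection_timeout_ms)
    if elapsed_ms >= min_timeout:
        return "CONNECTION_TIMEOUT"
    return "WAITING"


if __name__ == "__main__":
    # Values taken from the log record: timeout (15000) / elapsed (15290).
    print(check_timeouts(15290, 150000, 15000))
```

Under this model the client waited 15290 ms without a session being established, which is past the 15000 ms limit, so the error was raised even though the TCP socket itself had connected.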

Connection refused when starting Solr with external Zookeeper

I have set up 3 servers on Amazon EC2, each with the following ZooKeeper config:
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
server.1=server1address:2888:3888
server.2=server3address:2888:3888
server.3=server3address:2888:3888
I start zookeeper on each server, and after I start Solr on the servers, I get errors like this in Solr:
3766 [main] INFO org.apache.solr.common.cloud.ConnectionManager – Waiting for client to connect to ZooKeeper
3790 [main-SendThread(*serverAddress*:2181)] WARN org.apache.zookeeper.ClientCnxn – Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
This apparently happened because ZooKeeper wasn't running properly. I then found that ZooKeeper was producing this error:
2013-06-09 08:00:57,953 [myid:1] - INFO [ec2amazonaddress.com/ipaddress#amazon:QuorumCnxManager$Listener#493] - Received connection request /ipaddress:60855
2013-06-09 08:00:57,963 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager#368] - Cannot open channel to 3 at election address ec2amazonaddress/ipaddress#amazon:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
So the problem is with ZooKeeper. I tried starting a different server first, before the one I had previously started first, and then it worked. However, after some restarts that no longer worked. In other words, it seems that the order in which the ZK servers are started matters. I could see that some servers that were started first went straight into follower mode instead of leader mode, and maybe that's the reason. I have deleted and reinstalled my whole setup, but the problem was still there.
I have checked the ports and killed all processes using ports 2181 and 2888/3888 before launching ZooKeeper. What bothers me is that the same setup worked earlier.
I hope some of you have experience with this problem. Any suggestion related to not being able to connect to the ZK servers is also welcome.
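One detail in the configuration as posted: server.2 and server.3 both point at server3address. Whether that is a slip in the post or is also in the real files is not known, but a duplicated entry would leave one ensemble member unreachable for leader election. The sketch below scans a zoo.cfg for hosts listed under more than one server ID; find_duplicate_hosts is a hypothetical helper, not a ZooKeeper tool.

```python
import re
from collections import defaultdict

# Matches lines such as: server.1=server1address:2888:3888
_SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")

# The ensemble configuration exactly as posted in the question.
POSTED_CFG = """\
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
server.1=server1address:2888:3888
server.2=server3address:2888:3888
server.3=server3address:2888:3888
"""


def find_duplicate_hosts(cfg_text):
    """Return {host: [server ids]} for hosts that appear more than once."""
    ids_by_host = defaultdict(list)
    for line in cfg_text.splitlines():
        match = _SERVER_LINE.match(line.strip())
        if match:
            ids_by_host[match.group(2)].append(int(match.group(1)))
    return {host: ids for host, ids in ids_by_host.items() if len(ids) > 1}


if __name__ == "__main__":
    for host, ids in find_duplicate_hosts(POSTED_CFG).items():
        print(f"{host} is listed for server ids {ids}")
```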

Resources