org.apache.curator.ConnectionState Connection timed out for connection string - solr

I have been facing this issue with my Solr instance, which is managed by Zookeeper.
It appears that Solr is able to send requests to Zookeeper, which momentarily accepts the request and then refuses it.
In the Zookeeper logs, I have been seeing this error:
INFO org.apache.zookeeper.ZooKeeper.Client.environment:user.dir=/ [1635628661#qtp-2049348234-50]
INFO org.apache.zookeeper.ZooKeeper Initiating client connection, connectString=localhost:2181 sessionTimeout=150000 watcher=org.apache.curator.ConnectionState#c4f2fbd [1635628661#qtp-2049348234-50]
INFO org.apache.zookeeper.ClientCnxn Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error) [1635628661#qtp-2049348234-50-SendThread(localhost:2181)]
INFO org.apache.zookeeper.ClientCnxn Socket connection established to localhost/127.0.0.1:2181, initiating session [1635628661#qtp-2049348234-50 SendThread(localhost:2181)]
ERROR org.apache.curator.ConnectionState Connection timed out for connection string (localhost:2181) and timeout (15000) / elapsed (15290) [1635628661#qtp-204934823450]
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:191)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:86)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:113)
at org.apache.curator.framework.imps.CuratorFrameworkImpl getZooKeeper(CuratorF
Any help is appreciated here.

According to the log, a socket was opened to localhost:2181, so the line:
INFO org.apache.zookeeper.ClientCnxn Socket connection established to localhost/127.0.0.1:2181, initiating session [1635628661#qtp-2049348234-50 SendThread(localhost:2181)]
says, in effect: OK, we found an open socket, so now we attempt to write some data. The client then sends a connection request containing the session ID and password. If no session has been established yet, it sends 0 as the session ID, but still sends the password.
If you enable debug output, you will then see something like this in the log:
Session establishment request sent on <remote address>
The log record you are asking about,
ERROR org.apache.curator.ConnectionState Connection timed out for connection string (localhost:2181) and timeout (15000) / elapsed (15290) [1635628661#qtp-204934823450]
comes from Curator itself. If the client is not connected, it calls checkTimeouts(), and if the result of the timeout check is 'CONNECTION_TIMEOUT' it generates a record like the one above.
There is not much information to go on, but my guess is that there is a ZooKeeper on your localhost and the connection is being rejected, perhaps because a password is required or for some other reason.
Hope this helps.
(My answer is based on the Curator code on master: https://github.com/apache/curator)
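To make the numbers in that ERROR line easier to read, here is a deliberately simplified model of such a check. It is not Curator's actual code; the function names are made up for illustration. It only shows that the record fires once the elapsed time passes the timeout printed in the message (15000 ms, which appears to be the connection timeout and is separate from the sessionTimeout=150000 in the earlier INFO line).

```python
def connection_timed_out(elapsed_ms: int, connection_timeout_ms: int) -> bool:
    """Return True when a not-yet-connected client has waited too long."""
    return elapsed_ms >= connection_timeout_ms


def describe(connect_string: str, elapsed_ms: int, connection_timeout_ms: int) -> str:
    # Mirrors the shape of the ERROR line quoted above; hypothetical helper,
    # not part of Curator.
    if connection_timed_out(elapsed_ms, connection_timeout_ms):
        return (f"Connection timed out for connection string ({connect_string}) "
                f"and timeout ({connection_timeout_ms}) / elapsed ({elapsed_ms})")
    return "still waiting for connection"


# Values taken from the log record in the question.
print(describe("localhost:2181", 15290, 15000))
```

So the session handshake was started but did not complete within the timeout, which is consistent with a server that accepts the TCP connection and then never finishes establishing the session.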

Related

What happens when a server closes the connection and the client sends some data at the same time?

I have a server written in C that closes the connection if the connection is sitting idle for a specific time. I have an issue (that rarely happens). Read is failing on the client side and it says Connection broken. I suspect the server is closing the connection and the client is sending some data at the same time.
Consider the following scenario (A is server, B is the client)
B initiates the connection and the connection between A and B is established.
B is sitting idle and the idle timeout is reached.
A initiates the close
Before B receives the FIN from A, it starts sending a request to A
After B sends the request, it will read the response
Since A has already closed the connection, B is not able to read.
My questions are:
1. Is this a possible situation?
2. How should I handle the idle timeout for clients?
3. How can I close the connection between A and B properly (so that B does not send a request during the process)? In short, how can I close the connection atomically?
Going by my little-more-than-rudimentary network experience, and assuming that you are talking about a connection-oriented protocol like TCP/IP, as opposed to UDP/IP, which is connectionless:
1. Yes, of course. You cannot avoid it.
2. There are multiple ways to do it, but all of them come down to this: send something from the client before the server's timeout elapses. If the client has no data to send, let it send something like a "life sign". This could be an empty data message; it all depends on your application protocol. Or make the timeout as long as necessary, including some margin. Some protocols time out only after three times the allowed idle time.
3. You cannot close the connection atomically, because client and server are separate machines. Each packet on the network needs some time to be transmitted, and both sides can start sending at the very same moment: the server its closing message, and the client a new data message. There is nothing you can do about this.
You need to make the client handle this situation properly. For example, it can accept such a broken connection and interpret it as closed. You should already have some reaction in place for the case where the server closes the connection while the client is idle.
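As a rough illustration of "accept a broken connection and interpret it as closed", here is a minimal Python sketch (the question is about C, but the idea carries over). The `connect` factory and the function name are hypothetical, and retrying like this is only safe if the request can be repeated without harm.

```python
import socket
from typing import Callable, Tuple


def request_with_reconnect(connect: Callable[[], socket.socket],
                           sock: socket.socket,
                           payload: bytes,
                           bufsize: int = 4096) -> Tuple[bytes, socket.socket]:
    """Send payload and read a reply; if the peer has closed, reconnect and retry once.

    Returns (reply, sock), where sock may be a fresh connection.
    """
    for attempt in range(2):
        try:
            sock.sendall(payload)
            reply = sock.recv(bufsize)
            if reply:                      # b"" means the peer closed the connection
                return reply, sock
        except (BrokenPipeError, ConnectionResetError, ConnectionAbortedError):
            pass                           # treat a broken connection as closed
        sock.close()
        if attempt == 0:
            sock = connect()               # hypothetical factory supplied by the caller
    raise ConnectionError("server closed the connection twice in a row")
```

The point is that the race itself is not prevented; the client simply treats "closed while I was sending" as a normal event and opens a new connection.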
How to close the connection between A and B properly (avoid B sending request during the process).
The Server detects the timeout.
The Server sends a timeout detection message to the Client.
The Server waits for a reply (if that wait times out, assume the Client is dead).
If the Client receives a timeout detection message from the Server, it replies with an ACK (or something like that).
If the Server receives an ACK from the Client, it 'gracefully' closes the connection.
From now on, neither the Server nor the Client should send or receive any messages. (After sending the ACK, the Client should not close the connection immediately; it should linger for the agreed timeout. See setsockopt: SO_LINGER.)
On top of that, as the other answers suggested, the Client should send a heartbeat when idle (to avoid timeout detections in the first place).
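The server side of the steps above could be sketched roughly like this in Python. The `TIMEOUT` and `ACK` messages are invented for the example (your application protocol would define its own), and the sketch assumes the ACK arrives in a single read.

```python
import socket

TIMEOUT_NOTICE = b"TIMEOUT\n"   # hypothetical application-level message
ACK = b"ACK\n"                  # hypothetical application-level reply


def close_idle_connection(conn: socket.socket, ack_timeout: float) -> bool:
    """Notify the peer of the idle timeout, wait for its ACK, then close.

    Returns True if the peer acknowledged, False if it is assumed dead
    (no reply in time, a different reply, or a socket error).
    """
    try:
        conn.sendall(TIMEOUT_NOTICE)
        conn.settimeout(ack_timeout)
        acked = conn.recv(len(ACK)) == ACK
    except OSError:                 # includes the timeout raised by recv()
        acked = False
    finally:
        conn.close()
    return acked
```

Note that a request sent by the client just before it sees the notice would arrive in place of the ACK; in this sketch that counts as "not acknowledged" and the connection is closed anyway, so the client still needs to cope with a broken connection.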

The server DISCONNECT and I receive LWT message?

If the server did not receive any messages from the client within 1.5 * KeepAlive time, and the client did not send a PINGREQ within that period, shouldn't the server disconnect the client?
If so, why am I receiving the LWT message, which should not be sent when a DISCONNECT occurs?
The Last Will and Testament will be sent if the client does not explicitly disconnect itself.
If the broker disconnects the client due to a ping timeout, then the LWT will be sent; this is the specific reason the LWT feature exists.
Or do you mean that your now-disconnected client is receiving its own LWT?
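The rule can be written down as a tiny decision function. This is only a model of the behaviour described above, not code from any real broker; the `Session` type and the topic name in the usage lines are made up.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Session:
    will_topic: Optional[str]
    will_payload: Optional[bytes]


def on_connection_end(session: Session,
                      clean_disconnect: bool) -> Optional[Tuple[str, Optional[bytes]]]:
    """Return the (topic, payload) the broker should publish, or None."""
    if clean_disconnect or session.will_topic is None:
        return None      # client sent DISCONNECT itself (or set no will)
    return session.will_topic, session.will_payload


session = Session("status/client1", b"offline")
print(on_connection_end(session, clean_disconnect=True))    # client sent DISCONNECT
print(on_connection_end(session, clean_disconnect=False))   # keepalive timeout
```

A keepalive timeout is the broker dropping the client, not the client sending DISCONNECT, so it falls into the second case.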

Connection refused when starting Solr with external Zookeeper

I have set up 3 servers on Amazon EC2, and each server has the following Zookeeper config.
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
server.1=server1address:2888:3888
server.2=server3address:2888:3888
server.3=server3address:2888:3888
I start zookeeper on each server, and after I start Solr on the servers, I get errors like this in Solr:
3766 [main] INFO org.apache.solr.common.cloud.ConnectionManager – Waiting for client to connect to ZooKeeper
3790 [main-SendThread(*serverAddress*:2181)] WARN org.apache.zookeeper.ClientCnxn – Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
This apparently happened because Zookeeper wasn't running properly. I then figured out that Zookeeper was producing this error:
2013-06-09 08:00:57,953 [myid:1] - INFO [ec2amazonaddress.com/ipaddress#amazon:QuorumCnxManager$Listener#493] - Received connection request /ipaddress:60855
2013-06-09 08:00:57,963 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager#368] - Cannot open channel to 3 at election address ec2amazonaddress/ipaddress#amazon:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
So the problem is with ZooKeeper. What I did was start a different server first, before the one I had previously started first, and then it worked. However, after some restarts that didn't work anymore. In other words, it seems like the order in which you start the ZK servers matters. I was able to see that some servers that were started first went into follower mode instead of leader mode right away, and maybe that's the reason. I have deleted and reinstalled my whole setup, but the problem was still there.
I have checked the ports and have killed all processes using ports 2181 and 2888/3888 before launching Zookeeper. What bothers me is that this has worked with the same setup earlier.
I hope some of you have experience with this problem. Any suggestion related to not being able to connect to the ZK servers is also welcome.
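One thing that may be worth ruling out: in the config as posted above, server.2 and server.3 both point at server3address. That could simply be a typo in the post, but if the real file looks the same, one ensemble member would be unreachable for leader election. A small, hypothetical sanity check for the server.N lines (the helper name and regular expression are illustrative only):

```python
import re
from collections import defaultdict

# Matches lines of the form server.<id>=<host>:<port>:<port>
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:\s]+):(\d+):(\d+)")


def duplicate_hosts(cfg_text: str) -> dict:
    """Map each host that appears in more than one server.N line to its ids."""
    ids_by_host = defaultdict(list)
    for line in cfg_text.splitlines():
        match = SERVER_LINE.match(line.strip())
        if match:
            ids_by_host[match.group(2)].append(int(match.group(1)))
    return {host: ids for host, ids in ids_by_host.items() if len(ids) > 1}


# The config exactly as posted in the question.
cfg = """\
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
server.1=server1address:2888:3888
server.2=server3address:2888:3888
server.3=server3address:2888:3888
"""
print(duplicate_hosts(cfg))   # {'server3address': [2, 3]}
```

An empty result would mean every server.N entry names a distinct host, in which case the cause lies elsewhere.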

MQTT recv from a publish and mqtt ping C

I've got this problem in a test program where I'm developing an MQTT client. I'm subscribed to a topic, and after that I wait for a PUBLISH message from the server to my client.
After a successful recv (of a PUBLISH message) or after a recv timeout, I send an MQTT PINGREQ to the server.
After a PINGREQ I wait for a PINGRESP, so I call recv just as I would if I were waiting for a PUBLISH message.
If the flow is this:
Client -> PINGREQ
Server -> PUBLISH
Server -> PINGRESP
then the server's PUBLISH message is lost. How can I solve this? I'm using MQTT at QoS 0. Does it make sense to solve this problem at this QoS level, or is it smarter to handle this case at QoS 1?
I think you've got things a bit confused. PINGREQ/PINGRESP are used when there isn't any other network traffic passing between the client and server, in order to let both the client and server know if the connection drops.
Your client should keep track of when the last outgoing or incoming communication with the server was, and send a PINGREQ if it is going to exceed the keepalive timer it set with its CONNECT command. The server will disconnect the client at 1.5*keepalive if no communication is received. The client should assume the server has been disconnected if it does not receive a PINGRESP in response to its PINGREQ within keepalive of sending the PINGREQ.
The QoS level isn't that important, you have to ensure the keepalive timeout is maintained regardless.
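A minimal sketch of that bookkeeping, in Python for brevity (the same logic is easy to write in C). The class and method names are made up; timestamps are passed in explicitly so the logic is easy to follow and test.

```python
class KeepAliveTracker:
    """Tracks when a PINGREQ is due and when the server should be assumed gone."""

    def __init__(self, keepalive_s: float, now: float):
        self.keepalive_s = keepalive_s
        self.last_activity = now      # last packet sent or received
        self.ping_sent_at = None      # set while a PINGRESP is outstanding

    def on_traffic(self, now: float) -> None:
        """Call for every packet sent to or received from the server."""
        self.last_activity = now

    def on_pingreq_sent(self, now: float) -> None:
        self.ping_sent_at = now
        self.last_activity = now

    def on_pingresp(self, now: float) -> None:
        self.ping_sent_at = None
        self.last_activity = now

    def ping_due(self, now: float) -> bool:
        """True when the connection has been idle for a full keepalive period."""
        return (self.ping_sent_at is None
                and now - self.last_activity >= self.keepalive_s)

    def server_lost(self, now: float) -> bool:
        """True when a PINGREQ has gone unanswered for a full keepalive period."""
        return (self.ping_sent_at is not None
                and now - self.ping_sent_at >= self.keepalive_s)
```

In a real client you would feed it `time.monotonic()` values from your network loop and call `on_traffic` for every PUBLISH as well, so a busy connection never needs a PINGREQ at all.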
It also occurs to me that it sounds like you're using blocking network calls - it might be best to move to non-blocking if you can to get more flexibility.
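Whichever I/O style you use, the PUBLISH in the question is most likely lost because the receive code assumes the next packet is a PINGRESP. A hedged sketch of dispatching on the packet type instead (Python for brevity; the callback names and the sample topic are invented, and the sketch assumes each packet has already been read completely):

```python
PUBLISH = 3     # MQTT control packet type numbers (upper 4 bits of byte 1)
PINGRESP = 13


def dispatch(packet: bytes, on_publish, on_pingresp) -> None:
    """Route one complete MQTT packet by its type, not by what was expected."""
    packet_type = packet[0] >> 4
    if packet_type == PUBLISH:
        on_publish(packet)
    elif packet_type == PINGRESP:
        on_pingresp()
    # other packet types are ignored in this sketch


received = []
state = {"awaiting_pingresp": True}


def handle_pingresp() -> None:
    state["awaiting_pingresp"] = False


# The order from the question: PUBLISH arrives before PINGRESP.
publish_packet = b"\x30\x07\x00\x03a/bhi"   # QoS 0 PUBLISH, topic "a/b", payload "hi"
pingresp_packet = b"\xd0\x00"
for pkt in (publish_packet, pingresp_packet):
    dispatch(pkt, received.append, handle_pingresp)
```

With this structure the order in which PUBLISH and PINGRESP arrive no longer matters, at QoS 0 or any other level.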

Connection time out of TCP write (netstat shows ESTABLISHED)

I made an experiment:
A server listens on port 8804, accepts a connection from a client, and then sends data to the client endlessly. I then shut down the network.
When I run netstat -anotp | grep 8804, it shows that the connection is "ESTABLISHED" on both server and client, but there is no data transmission.
After a while, the server throws an error: "Connection time out"
I ran netstat -anotp | grep 8804 again and found that the client is still "ESTABLISHED".
So:
1. Why does the server, which is blocked in the "write" system call, throw the "Connection timeout" error? Why not the client?
2. How can I let the client find out that the connection is actually down?
3. Why are the server's and the client's statuses both "ESTABLISHED" when the network is not working?
Thanks for your answer!
1. Your server is expecting TCP ACKs for the individual data segments that it sends to the client; the client, however, has no idea how long the server's data is. Since you shut down the network, the server no longer gets ACKs from the client. Result: connection timeout on the server (see Note 1).
2. Use TCP Keepalives on your socket (see Note 2).
3. You have not enabled TCP Keepalives. If you are using Python, you can do so like this (assuming your socket is named s):
# Do this before you accept() anything on the socket
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
EDIT:
Since you're using C, see the Linux TCP Keepalive HOWTO.
NOTES
1. RFC 1122: Section 4.2.3.5 "TCP Connection Failures"
2. RFC 1122: Section 4.2.3.6 "TCP Keepalives"
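With default settings the first keepalive probe is typically sent only after a long idle period (two hours on many Linux systems), so in practice you usually want to tune the timers as well. A sketch in Python, assuming a Linux-style socket API; the helper name and the timer values are arbitrary examples, and the same option names apply to setsockopt() in C.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 5) -> None:
    """Turn on TCP keepalive and, where the platform exposes them, tune its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These option names exist on Linux; other platforms may lack some of them,
    # so each one is looked up defensively.
    for name, value in (("TCP_KEEPIDLE", idle_s),      # idle time before first probe
                        ("TCP_KEEPINTVL", interval_s), # time between probes
                        ("TCP_KEEPCNT", probes)):      # failed probes before giving up
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s, idle_s=30, interval_s=5, probes=3)
```

With these example values, an idle client on a dead network would get an error on its socket after roughly 30 + 5 * 3 seconds instead of staying "ESTABLISHED" indefinitely.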
