My team has recently been using hbase-indexer on CDH to index HBase table columns into Solr. When we deployed the hbase-indexer server (which is called the Key-Value Store Indexer) and began testing, we found that when we put data into HBase frequently (we are using Apache Phoenix, a SQL layer on top of HBase), the hbase-indexer process exits on its own. We checked the log and found a ZooKeeper session expired ERROR like this:
2016-04-18 12:17:50,340 INFO org.apache.zookeeper.ClientCnxn: Client session timed out, have not heard from server in 31481ms for sessionid 0x2541e69d8a2001a, closing socket connection and attempting reconnect
2016-04-18 12:17:50,446 WARN com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper: Disconnected from ZooKeeper
2016-04-18 12:17:51,202 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server slave1/192.168.27.166:2181. Will not attempt to authenticate using SASL (unknown error)
2016-04-18 12:17:51,204 INFO org.apache.zookeeper.ClientCnxn: Socket connection established, initiating session, client: /192.168.27.167:59529, server: slave1/192.168.27.166:2181
2016-04-18 12:17:51,211 INFO org.apache.zookeeper.ClientCnxn: Unable to reconnect to ZooKeeper service, session 0x2541e69d8a2001a has expired, closing socket connection
2016-04-18 12:17:51,211 ERROR com.ngdata.hbaseindexer.util.zookeeper.StateWatchingZooKeeper: ZooKeeper session expired, shutting down.
2016-04-18 12:17:51,228 INFO org.mortbay.log: Stopped SelectChannelConnector#0.0.0.0:11060
2016-04-18 12:17:51,336 INFO com.ngdata.hbaseindexer.supervisor.IndexerSupervisor: IndexerWorker.EventWorker interrupted.
2016-04-18 12:17:51,448 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2541e69d8a20020 closed
2016-04-18 12:17:51,448 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-04-18 12:17:51,462 INFO org.apache.hadoop.hbase.ipc.RpcServer: Stopping server on 44594
2016-04-18 12:17:51,463 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.listener,port=44594: stopping
2016-04-18 12:17:51,473 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopped
2016-04-18 12:17:51,473 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopping
2016-04-18 12:17:51,488 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper
2016-04-18 12:17:51,488 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.uniquekey.StringUniqueKeyFormatter
2016-04-18 12:17:51,514 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2541e69d8a2001f closed
2016-04-18 12:17:51,515 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-04-18 12:17:51,515 INFO org.apache.hadoop.hbase.ipc.RpcServer: Stopping server on 47364
2016-04-18 12:17:51,516 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.listener,port=47364: stopping
2016-04-18 12:17:51,518 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper
2016-04-18 12:17:51,518 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopped
2016-04-18 12:17:51,519 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopping
2016-04-18 12:17:51,519 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.uniquekey.StringUniqueKeyFormatter
2016-04-18 12:17:51,527 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2541e69d8a2001e closed
2016-04-18 12:17:51,527 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-04-18 12:17:51,528 INFO org.apache.hadoop.hbase.ipc.RpcServer: Stopping server on 49605
2016-04-18 12:17:51,528 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.listener,port=49605: stopping
2016-04-18 12:17:51,530 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopped
2016-04-18 12:17:51,530 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopping
2016-04-18 12:17:51,531 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper
2016-04-18 12:17:51,531 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.uniquekey.StringUniqueKeyFormatter
2016-04-18 12:17:51,539 INFO org.apache.zookeeper.ClientCnxn: EventThread shut down
2016-04-18 12:17:51,539 INFO org.apache.zookeeper.ZooKeeper: Session: 0x2541e69d8a2001c closed
2016-04-18 12:17:51,540 INFO org.apache.hadoop.hbase.ipc.RpcServer: Stopping server on 39464
2016-04-18 12:17:51,540 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.listener,port=39464: stopping
2016-04-18 12:17:51,546 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopped
2016-04-18 12:17:51,547 INFO org.apache.hadoop.hbase.ipc.RpcServer: RpcServer.responder: stopping
2016-04-18 12:17:51,547 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.morphline.MorphlineResultToSolrMapper
2016-04-18 12:17:51,547 ERROR com.ngdata.sep.util.io.Closer: Do not know how to close object of type com.ngdata.hbaseindexer.uniquekey.StringUniqueKeyFormatter
The software environment is:
CDH5.4
HBase1.0
Phoenix4.6
Hbase-Indexer (hbase-solr-1.5-cdh5.4.2)
The Java heap size of hbase-indexer is configured to 1 GB.
Has anyone else run into this situation?
All right, it turned out to be the poor network in our cluster causing the hbase-indexer ZooKeeper session to time out, which in turn caused the hbase-indexer process to shut down automatically.
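For background, an expired ZooKeeper session cannot be revived: once the server declares it expired, the client's ephemeral nodes and watches are gone and it must build a brand-new handle, which is why hbase-indexer shuts itself down when it sees the Expired state. Below is a minimal sketch of how a client observes this through the plain ZooKeeper API; the connect string and the 30-second timeout are placeholders, not hbase-indexer's actual configuration.

import java.io.IOException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooKeeper;

public class SessionWatcherSketch {
    public static void main(String[] args) throws IOException {
        // Placeholder connect string; 30000 ms is the requested session timeout,
        // the effective value is negotiated with the server.
        ZooKeeper zk = new ZooKeeper("slave1:2181", 30000, new Watcher() {
            @Override
            public void process(WatchedEvent event) {
                switch (event.getState()) {
                    case Disconnected:
                        // Transient: the client library will try to reconnect on its own.
                        System.out.println("Disconnected from ZooKeeper, waiting for reconnect...");
                        break;
                    case Expired:
                        // Fatal for this handle: the application must create a new
                        // ZooKeeper instance (or exit, as hbase-indexer does) and
                        // re-register its ephemeral state.
                        System.err.println("ZooKeeper session expired, shutting down.");
                        break;
                    default:
                        break;
                }
            }
        });
        System.out.println("Session id: 0x" + Long.toHexString(zk.getSessionId()));
    }
}

If the ~31-second silences in the log keep recurring, the usual knobs are the client's requested session timeout and the server-side tickTime/maxSessionTimeout bounds in zoo.cfg, but as noted above the root cause here was the network.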
Related
I am trying out a small program with a local Flink cluster, set up according to the instructions here. The sample word count program runs fine, but when I attempt to run my own program, it stalls and fails while connecting to the JobManager. This is Flink 1.5 with JDK 1.8.
The relevant part of the code is:
FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
options.setStreaming(true);
options.setFlinkMaster("localhost:6123");
options.setRunner(FlinkRunner.class);
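For context, here is a minimal, self-contained sketch of how such options are typically handed to a Beam pipeline and submitted to the Flink cluster; the class name and the Create transform are illustrative stand-ins, not the poster's actual Test.java.

import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.runners.flink.FlinkRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;

public class MinimalFlinkRunnerSketch {
    public static void main(String[] args) {
        FlinkPipelineOptions options = PipelineOptionsFactory.as(FlinkPipelineOptions.class);
        options.setStreaming(true);
        // Should match jobmanager.rpc.address and jobmanager.rpc.port in flink-conf.yaml.
        options.setFlinkMaster("localhost:6123");
        options.setRunner(FlinkRunner.class);

        Pipeline p = Pipeline.create(options);
        p.apply("Dummy", Create.of("a", "b", "c")); // illustrative transform only
        p.run().waitUntilFinish();                  // submission to the JobManager happens here
    }
}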
I start the cluster with start-cluster.sh, and I can see that the two processes (JobManager and TaskManager) are running. The Flink logs don't have much. On the client side, after turning on debug logging, I can see the following:
18:43:20.507 [flink-akka.actor.default-dispatcher-4] INFO akka.remote.Remoting - Remoting started; listening on addresses :[akka.tcp://flink#talonx:38183]
18:43:20.511 [main] INFO org.apache.flink.client.program.StandaloneClusterClient - Actor system started at akka.tcp://flink#talonx:38183
18:43:20.511 [main] INFO org.apache.flink.client.program.StandaloneClusterClient - Submitting job with JobID: dbf63281771465550fd3598b2b67b91f. Waiting for job completion.
Submitting job with JobID: dbf63281771465550fd3598b2b67b91f. Waiting for job completion.
18:43:20.521 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Received SubmitJobAndWait(JobGraph(jobId: dbf63281771465550fd3598b2b67b91f)) but there is no connection to a JobManager yet.
18:43:20.522 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Received job test-talonx-0618131319-b721a69a (dbf63281771465550fd3598b2b67b91f).
18:43:20.523 [flink-akka.actor.default-dispatcher-4] INFO org.apache.flink.runtime.client.JobSubmissionClientActor - Disconnect from JobManager null.
After a while, I get the following exception on the client
19:03:19.396 [main] ERROR org.apache.beam.runners.flink.FlinkRunner - Pipeline execution failed
org.apache.flink.client.program.ProgramInvocationException: The program execution failed: Couldn't retrieve the JobExecutionResult from the JobManager.
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:492)
at org.apache.flink.client.program.StandaloneClusterClient.submitJob(StandaloneClusterClient.java:105)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:456)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:449)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.executeRemotely(RemoteStreamEnvironment.java:212)
at org.apache.flink.streaming.api.environment.RemoteStreamEnvironment.execute(RemoteStreamEnvironment.java:176)
at org.apache.beam.runners.flink.FlinkPipelineExecutionEnvironment.executePipeline(FlinkPipelineExecutionEnvironment.java:126)
at org.apache.beam.runners.flink.FlinkRunner.run(FlinkRunner.java:115)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:311)
at org.apache.beam.sdk.Pipeline.run(Pipeline.java:297)
at Test.main(Test.java:106)
Caused by: org.apache.flink.runtime.client.JobExecutionException: Couldn't retrieve the JobExecutionResult from the JobManager.
at org.apache.flink.runtime.client.JobClient.awaitJobResult(JobClient.java:300)
at org.apache.flink.runtime.client.JobClient.submitJobAndWait(JobClient.java:387)
at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:481)
... 10 common frames omitted
Caused by: org.apache.flink.runtime.client.JobClientActorConnectionTimeoutException: Lost connection to the JobManager.
at org.apache.flink.runtime.client.JobClientActor.handleMessage(JobClientActor.java:219)
at org.apache.flink.runtime.akka.FlinkUntypedActor.handleLeaderSessionID(FlinkUntypedActor.java:104)
at org.apache.flink.runtime.akka.FlinkUntypedActor.onReceive(FlinkUntypedActor.java:71)
at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
at akka.actor.ActorCell.invoke(ActorCell.scala:495)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
at akka.dispatch.Mailbox.run(Mailbox.scala:224)
at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
What might be missing here?
I have a SolrCloud cluster (6.6) set up with an external ZooKeeper ensemble (3.4.8) of 5 nodes. Recently, one machine (ip1:port1) that ran a ZooKeeper server with id=1 went down. This is what I've done to replace that ZooKeeper node:
1) Start ZooKeeper on another machine with the same id (=1).
2) Change zoo.cfg on the 4 live ZooKeeper nodes to point at the new server, and restart them.
3) Update the ZK_HOST variable in solr.in.sh to point at the new server.
4) Restart Solr.
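For reference, a ZooKeeper client only ever contacts the hosts listed in the connect string it was created with, which is why steps 3) and 4) matter: until solr.in.sh is updated and Solr is restarted, its client keeps cycling through the old list and logging errors for ip1:port1. A minimal sketch of that client side, with placeholder host names and ports:

import java.io.IOException;
import org.apache.zookeeper.ZooKeeper;

public class ConnectStringSketch {
    public static void main(String[] args) throws IOException {
        // The client cycles through exactly these hosts and no others.
        // If the dead ip1:port1 were still in this list, the SendThread would keep
        // retrying it and logging errors until the process is restarted with an
        // updated connect string (for Solr, the ZK_HOST variable in solr.in.sh).
        String zkHosts = "ip2:2181,ip3:2181,ip4:2181,ip5:2181,new-ip1:2181"; // placeholders
        ZooKeeper zk = new ZooKeeper(zkHosts, 30000, event ->
                System.out.println("ZooKeeper state: " + event.getState()));
        System.out.println("Session requested against: " + zkHosts);
    }
}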
After that, my Solr cluster seemed to be functioning well, but in solr.log it looked like the Solr client and the ZooKeeper servers were still trying to connect to the old ZooKeeper:
Solr log:
2017-12-01 15:04:38.782 WARN (Timer-0-SendThread(ip1:port1)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 30029ms for sessionid 0x0
2017-12-01 15:04:40.807 WARN (Timer-0-SendThread(ip1:port1)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 31030ms for sessionid 0x0
Zookeeper log:
2017-12-01 13:53:57,972 [myid:] - INFO [main-SendThread(ip1:port1):ClientCnxn$SendThread#1032] - Opening socket connection to server ip1:port1. Will not attempt to authenticate using SASL (unknown error)
2017-12-01 13:54:03,972 [myid:] - WARN [main-SendThread(ip1:port1):ClientCnxn$SendThread#1162] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2017-12-01 13:54:05,074 [myid:] - INFO [main-SendThread(ip1:port1):ClientCnxn$SendThread#1032] - Opening socket connection to server ip1:port1. Will not attempt to authenticate using SASL (unknown error)
2017-12-01 13:54:06,974 [myid:] - WARN [main-SendThread(ip1:port1):ClientCnxn$SendThread#1162] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
I've done some searching on adding/removing ZooKeeper servers but didn't find a document for it. My ZooKeeper version (3.4.7) does not support dynamic reconfiguration (which was introduced in ZooKeeper 3.5).
Is there a way I can manually remove/add a ZooKeeper server from/to the ensemble?
Thanks for your attention!
I am trying to implement SolrCloud. I followed the official documentation at https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud. It works fine with the embedded ZooKeeper, but it is recommended to use an external ZooKeeper. I installed ZooKeeper on my system and created a data directory named zookeeper in my home folder. I created subfolders named 1 and 2, and created a myid file containing 1 and 2 respectively in each folder, as mentioned in the doc. I created config files for ZooKeeper, zoo.cfg:
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2879:3879
server.2=localhost:2888:3888
and zoo2.cfg:
initLimit=5
syncLimit=2
clientPort=2182
server.1=localhost:2878:3878
server.2=localhost:2888:3888
Next I cd into the ZooKeeper directory and run:
bin/zkServer.sh start zoo.cfg
bin/zkServer.sh start zoo2.cfg
And it started successfully. Next I run:
bin/solr start -e cloud -z localhost:2181,localhost:2182
The system asks me the number of shards etc., as in Getting Started; I select port 8990 for node 1 and 8991 for node 2. It gives this error:
Waiting to see Solr listening on port 8991 [/] Still not seeing Solr listening on 8991 after 30 seconds!
WARN - 2015-10-30 09:47:04.827; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:05.929; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:06.030; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:07.131; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:07.232; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Where am I going wrong? I have gone through many docs, but the Apache doc is not adequate for an external ZooKeeper setup.
Your ZooKeeper ensemble must have an odd number of nodes: 1, 3, 5, etc.
If you want to test the ZK clustering feature, then you have to set up at least 3 ZK instances. In that case, don't forget:
To set the ZK server id correctly in the myid file, which must be created in the dataDir directory referenced by your zoo.cfg.
To use a separate dataDir and dataLogDir for each ZK instance.
I am trying to set up ZooKeeper on two EC2 instances, as given here and here.
I am trying to run ZooKeeper, which fails with an error:
command: bin/zkCli.sh -server localhost:2181
2015-03-15 00:22:35,644 [myid:] - INFO [main:ZooKeeper#438] - Initiating client connection, connectString=localhost:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher#3ff0efca
Welcome to ZooKeeper!
2015-03-15 00:22:35,671 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread#975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
2015-03-15 00:22:35,677 [myid:] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread#1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[zk: localhost:2181(CONNECTING) 0] 2015-03-15 00:22:36,796 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread#975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-03-15 00:22:36,797 [myid:] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread#1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
zoo.cfg is as below:
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=localhost:2888:3888
server.2=<My ec2 private IPs>:2889:3889
Also, I have created the myid file on both EC2 instances at /var/lib/zookeeper/myid.
I also tried to edit the /etc/hosts file but am still facing the same issue.
Also, how can I start both of the ZooKeeper instances with one command?
Note: the server starts successfully when I try the bin/zkCli.sh start command.
Thanks in advance!
Look at the ZK log zookeeper.out; if there is a connection-limit error, add the following to zoo.cfg:
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=60
This is a temporary error; for me, it went away after some time.
This is my zoo.cfg file:
dataDir=../data
clientPort=2181
tickTime=2000
initLimit=5
This error occurred when I forgot to run %ZOOKEEPER_HOME%\bin\zkServer.cmd.
After running it, the problem was resolved.
Correct this property in server.properties.
The default would be localhost; change it to match the ZooKeeper server startup IP and port:
zookeeper.connect=0.0.0.0:2181
I have 3 ZooKeeper nodes. Those nodes were working fine, but when I restarted them using ./zkServer.sh restart, ZooKeeper did not come up again.
When I checked the ZooKeeper status, it returned:
./zkServer.sh status
JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
My zoo.cfg is:
dataDir=/var/lib/zookeeperdata/3
clientPort=2181
initLimit=50
tickTime=2000
syncLimit=10
maxClientCnxns=100000
server.1=IP1 value:2888:3888
server.2=IP2 value:2889:3889
server.3=127.0.0.1:2890:3890
This is unstable behavior, because maybe after two hours or tomorrow, if I restart the 3 ZooKeeper nodes, they will see each other and work fine; this has happened to me before.
zookeeper log:
2014-05-14 15:22:34,236 [myid:3] - INFO [main:NIOServerCnxnFactory#94] - binding to port 0.0.0.0/0.0.0.0:2181
2014-05-14 15:22:34,282 [myid:3] - INFO [main:QuorumPeer#913] - tickTime set to 2000
2014-05-14 15:22:34,283 [myid:3] - INFO [main:QuorumPeer#933] - minSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO [main:QuorumPeer#944] - maxSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO [main:QuorumPeer#959] - initLimit set to 50
2014-05-14 15:22:34,356 [myid:3] - INFO [main:FileSnap#83] - Reading snapshot /var/lib/zookeeperdata/3/version-2/snapshot.f100000001
2014-05-14 15:22:43,387 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /127.0.0.1:50923
2014-05-14 15:22:43,396 [myid:3] - INFO [Thread-1:QuorumCnxManager$Listener#486] - My election bind port: 0.0.0.0/0.0.0.0:3890
2014-05-14 15:22:43,404 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2014-05-14 15:22:43,404 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1001] - Closed socket connection for client /127.0.0.1:50923 (no session established for client)
2014-05-14 15:22:43,427 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer#670] - LOOKING
2014-05-14 15:22:43,429 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection#740] - New election. My id = 3, proposed zxid=0xf100000001
2014-05-14 15:22:48,438 [myid:3] - WARN [WorkerSender[myid=3]:QuorumCnxManager#368] - Cannot open channel to 1 at election address /54.76.10.81:3888
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
at java.lang.Thread.run(Thread.java:662)
2014-05-14 15:22:53,440 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager#368] - Cannot open channel to 1 at election address /54.76.10.81:3888
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:388)
I searched a lot on this but did not find anything useful, so I hope someone can help me.
Thanks
I've seen behavior like this as well. A ZK configuration that's been running fine will sometimes simply fail to restart. When this happens I've tried the following:
1) look at the logs for all of the servers...often one will list an error
2) stop all servers and restart
3) stop all servers and restart the servers one at a time
4) verify that each server's myid file exists, has correct permissions and has the right value.
I've used clusterssh to open windows to each of the servers so that the restarts can be at the very same time...and then I've tailed all of the server logs. Keep in mind that during restart the ZK cluster is doing a lot: both starting each server and electing a leader. I've had times when the cluster seemed to fail and then after a few more minutes it seems to figure it out.
There is a great tool called zktop that I've used for monitoring ZK.
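Besides zktop, you can check whether an individual server is actually serving by sending one of ZooKeeper's built-in four-letter commands (ruok, srvr, stat) to its client port, which is essentially what zkServer.sh status does under the hood. A small sketch of such a probe, with placeholder host and port:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

public class ZkFourLetterProbe {
    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "127.0.0.1"; // placeholder
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 2181;

        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 5000);
            OutputStream out = socket.getOutputStream();
            out.write("srvr".getBytes(StandardCharsets.US_ASCII)); // "ruok" -> "imok" is the simplest check
            out.flush();
            socket.shutdownOutput();

            InputStream in = socket.getInputStream();
            StringBuilder reply = new StringBuilder();
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                reply.append(new String(buf, 0, n, StandardCharsets.US_ASCII));
            }
            // A healthy server reports its version, latencies and "Mode: leader/follower/standalone";
            // a reply like "This ZooKeeper instance is not currently serving requests" typically means
            // the process is up but has not joined a quorum, which lines up with the
            // "Error contacting service" message from zkServer.sh.
            System.out.println(reply);
        }
    }
}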
I fixed it by changing the IP 127.0.0.1 to the internal IP of the Amazon node. After making this change on the three nodes and restarting, the problem did not happen again. I hope this answer can help someone asking about the same problem.
Make sure you have put the correct dataDir in each of your node configurations,
and also put a myid file in the dataDir, containing a number between 1-255, for each of your nodes.
I think that will resolve the issue.