Replace zookeeper server from zookeeper ensemble (with SolrCloud) - solr

I have a SolrCloud cluster (6.6) setup with external Zookeeper Ensemble (3.4.8) of 5 nodes. Recently, one machine (ip1:port1) that run 1 Zookeeper with id=1 went down. This is what I've done to replace zookeeper:
Start zookeeper in another machine with the same id (=1).
Change zoo.cfg in 4 live zookeeper to match new zookeeper server and restart.
Update ZK_HOST variable in solr.in.sh to match new zookeeper server.
Restart solr.
After that, my solr cluster seemed to functioning well, but in solr.log, it looked like solr client and zookeeper servers still try to connect to the old zookeeper:
Solr log
2017-12-01 15:04:38.782 WARN (Timer-0-SendThread(ip1:port1)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 30029ms for sessionid 0x0
2017-12-01 15:04:40.807 WARN (Timer-0-SendThread(ip1:port1)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 31030ms for sessionid 0x0
Zookeeper log:
2017-12-01 13:53:57,972 [myid:] - INFO [main-SendThread(ip1:port1):ClientCnxn$SendThread#1032] - Opening socket connection to server ip1:port1. Will not attempt to authenticate using SASL (unknown error)
2017-12-01 13:54:03,972 [myid:] - WARN [main-SendThread(ip1:port1):ClientCnxn$SendThread#1162] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2017-12-01 13:54:05,074 [myid:] - INFO [main-SendThread(ip1:port1):ClientCnxn$SendThread#1032] - Opening socket connection to server ip1:port1. Will not attempt to authenticate using SASL (unknown error)
2017-12-01 13:54:06,974 [myid:] - WARN [main-SendThread(ip1:port1):ClientCnxn$SendThread#1162] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
I've done some search in add/remove zookeeper but didn't find a document for it. My zookeeper version (3.4.7) is not supported for dynamic reconfiguration (which is in zookeeper 3.5).
Is there a way I can manually remove/add zookeeper server from ensemble?
Thanks for your attention!

Related

Solr Cloud with ZooKeeper in local machine

I am trying to setup up in my local laptop with 3 Solr, 3 ZooKeeper and 1 Load Balancer.
I have followed the post https://www.codehousegroup.com/insight-and-inspiration/tech-stream/how-to-configure-sitecore-with-solr-cloud and https://medium.com/#sarkaramrit2/setting-up-solr-cloud-6-3-0-with-zookeeper-3-4-6-867b96ec4272 and few others
I have setup successfully 3 solr in my laptop and URL's are as follows
https://solrcloud1:6161/solr/#/
https://solrcloud2:6162/solr/#/
https://solrcloud3:6163/solr/#/
I have installed the local load balancer "GoBetween" and mapped my above 3 solr paths over there. When I hit https://solrcloud:3010/solr/#/ this URL, I am able to receive response from different Solr instances also. It looks it's working fine as well.
ZooKeeer
I have downloaded zookeeper and placed that in all the 3 Solr locations as well
E.g.: SolrCloud1 = \LocalSolrCloud\SolrCloud1\ contains "solr-8.8.2" and "zookeeper-3.5.6" same structure for SolrCloud2 and SolrCloud3 also.
In Each zookeeper location, I have created a data folder and created "myid" file and placed the value as "1", "2", "3" respectively.
Each zoopkeeper's "zoo.cfg" file contains the below
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/SolrCloud1/zookeeper-3.5.6/data
clientPort=2181
autopurge.snapRetainCount=4
autopurge.purgeInterval=24
server.1=locahost:2888:3888
server.2=locahost:2889:3889
server.3=locahost:2890:3890
When I run the zooKeeper from the command prompt, I am getting the below error.
2022-12-15 11:47:20,051 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager#679] -
Cannot open channel to 2 at election address localhost/127.0.0.1:3889
java.net.ConnectException: Connection refused: connect
at java.net.DualStackPlainSocketImpl.waitForConnect(Native Method)
at java.net.DualStackPlainSocketImpl.socketConnect(DualStackPlainSocketImpl.java:85)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:172)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
at java.net.Socket.connect(Socket.java:589)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:650)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:707)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:620)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:477)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:456)
at java.lang.Thread.run(Thread.java:748)

Which ports should I open in firewall on nodes with Apach Flink?

When I try to run my flow on Apache Flink standalone cluster I see the following exception:
java.lang.IllegalStateException: Update task on instance aaa0859f6af25decf1f5fc1821ffa55d # app-2 - 4 slots - URL: akka.tcp://flink#192.168.38.98:46369/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink#192.168.38.98:46369/user/taskmanager#1804590378]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
Seems like port 46369 blocked by firewall. It is true because I read configuration section and open these ports only:
6121:
comment: Apache Flink TaskManager (Data Exchange)
6122:
comment: Apache Flink TaskManager (IPC)
6123:
comment: Apache Flink JobManager
6130:
comment: Apache Flink JobManager (BLOB Server)
8081:
comment: Apache Flink JobManager (Web UI)
The same ports described in flink-conf.yaml:
jobmanager.rpc.address: app-1.stag.local
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
blob.server.port: 6130
parallelism.default: 4
jobmanager.web.port: 8081
state.backend: jobmanager
restart-strategy: none
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 60s
So, I have two questions:
This exception related to blocked ports. Right?
Which ports should I open on firewall for standalone Apache Flink cluster?
UPDATE 1
I found configuration problem in masters and slaves files (I skip new line separators between hosts described in these files). I fixed it and now I see other exceptions:
flink--taskmanager-0-app-1.stag.local.log
flink--taskmanager-0-app-2.stag.local.log
I have 2 nodes:
app-1.stag.local (with running job and task managers)
app-2.stag.local (with running task manager)
As you can see from these logs the app-1.stag.local task manager can't connect to other task manager:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'app-2.stag.local/192.168.38.98:35806' has failed. This might indicate that the remote task manager has been lost.
but app-2.stag.local has open port:
2016-03-18 16:24:14,347 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 39 ms). Listening on SocketAddress /192.168.38.98:35806
So, I think problem related to firewall but I don't understand where I can configure this port (or range of ports) in Apache Flink.
I have found a problem: taskmanager.data.port parameter was set to 0 by default (but documentation say what it should be set to 6121).
So, I set this port in flink-conf.yaml and now all works fine.

Using external zookeper with solr cloud

I am trying to implement solrcloud.I foollowed doc from official resource https://cwiki.apache.org/confluence/display/solr/Getting+Started+with+SolrCloud .It works fine with embeded zookeper but it is recomended to use external zookeper. I insalled zookeper on my system created data dictionary zookeper on my home folder.I created sub folders named 1 and 2 and created myid file with text 1 and two respectively i each folder as mentioned in doc.I created config files for zookeper zoo.cnfg
clientPort=2181
initLimit=5
syncLimit=2
server.1=localhost:2879:3879
server.2=localhost:2888:3888
and zoo2.cnfg
initLimit=5
syncLimit=2
clientPort=2182
server.1=localhost:2878:3878
server.2=localhost:2888:3888
Next I run cd
bin/zkServer.sh start zoo.cfg
bin/zkServer.sh start zoo2.cfg
And its started sucessfully. next I run
bin/solr start -e cloud -z localhost:2181,localhost:2182
system ask me no of shards etc like in getting started i select port for node1 8990 and for node 2 8991. It gives error
Waiting to see Solr listening on port 8991 [/] Still not seeing Solr listening on 8991 after 30 seconds!
WARN - 2015-10-30 09:47:04.827; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:05.929; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:06.030; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:07.131; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
WARN - 2015-10-30 09:47:07.232; [ ] org.apache.zookeeper.ClientCnxn$SendThread; Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
Where I am missing ? gone through many docs but apche doc is not proper for external zookeper setup.
Your Zookeeper ensemble must have an impair number of nodes : 1, 3, 5, etc...
If you want to test ZK clustering feature than you have to set up at least 3 ZK instances. In this case, don't forget :
To set correctly the ZK server id in the file myid, that must be created in the directory dataDir, referenced by your zoo.cfg.
Separate the dataDir and dataLogDir for each ZK instance.

not attempt to authenticate using SASL (unknown error)

I am trying to setup zookeeper on ec2 two instances. as given here and here.
I am trying to run zookeeper which fails with an error:
command: bin/zkCli.sh -server localhost:2181
> 2015-03-15 00:22:35,644 [myid:] - INFO [main:ZooKeeper#438] - Initiating client connection, connectString=localhost:2181 sessionTimeout=30000 watcher=org.apache.zookeeper.ZooKeeperMain$MyWatcher#3ff0efca
Welcome to ZooKeeper!
2015-03-15 00:22:35,671 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread#975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
JLine support is enabled
2015-03-15 00:22:35,677 [myid:] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread#1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
[zk: localhost:2181(CONNECTING) 0] 2015-03-15 00:22:36,796 [myid:] - INFO [main-SendThread(localhost:2181):ClientCnxn$SendThread#975] - Opening socket connection to server localhost/127.0.0.1:2181. Will not attempt to authenticate using SASL (unknown error)
2015-03-15 00:22:36,797 [myid:] - WARN [main-SendThread(localhost:2181):ClientCnxn$SendThread#1102] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
zoo.cfg as bellow
tickTime=2000
initLimit=10
syncLimit=5
dataDir=/var/lib/zookeeper
clientPort=2181
server.1=localhost:2888:3888
server.2=<My ec2 private IPs>:2889:3889
also I have created myId file as on both ec2 instances - /var/lib/zookeeper/myid
I also tried to edit /ect/hosts file but still facing the same issue.
also how I can start both of the zookeeper instances by 1 command?
Note: Server get started successfully if I tried with bin/zkCli.sh start command.
Thanks in advance!
look zk log zookeeper.out,if there have connection limit error, configure the following to zoo.cfg.
# the maximum number of client connections.
# increase this if you need to handle more clients
maxClientCnxns=60
This is temporary error , for mine after some time , It gone away :-
This is my zoo.conf file ::-
Dir=../data
clientPort=2181
tickTime=2000
initLimit=5
This error occurred when I forgot to run% ZOOKEEPER_HOME% \ bin \ zkserver.cmd
By running, the problem has been resolved.
Correct this property on the server.properties
default would be localhost change it to match the zookeeper server starup ip and port
zookeeper.connect=0.0.0.0:2181

zookeeper is not running after restart

I have 3 zookeeper nodes. Those node was working fine but when I restart those nodes using ./zkServer.sh restart, the zookeeper did not got up again.
When I checked on the zookeeper status, it return:
./zkServer.sh status
JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.
my zoo.cnf is:
dataDir=/var/lib/zookeeperdata/3
clientPort=2181
initLimit=50
tickTime=2000
syncLimit=10
maxClientCnxns=100000
server.1=IP1 value:2888:3888
server.2=IP2 value:2889:3889
server.3=127.0.0.1:2890:3890
This is unstable behavior because may be after two hours or tomorrow if I made restart for the 3 zookeeper nodes, they will see each others and working fine because this happened before with me.
zookeeper log:
2014-05-14 15:22:34,236 [myid:3] - INFO [main:NIOServerCnxnFactory#94] - binding to port 0.0.0.0/0.0.0.0:2181
2014-05-14 15:22:34,282 [myid:3] - INFO [main:QuorumPeer#913] - tickTime set to 2000
2014-05-14 15:22:34,283 [myid:3] - INFO [main:QuorumPeer#933] - minSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO [main:QuorumPeer#944] - maxSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO [main:QuorumPeer#959] - initLimit set to 50
2014-05-14 15:22:34,356 [myid:3] - INFO [main:FileSnap#83] - Reading snapshot /var/lib/zookeeperdata/3/version-2/snapshot.f100000001
2014-05-14 15:22:43,387 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory#197] - Accepted socket connection from /127.0.0.1:50923
2014-05-14 15:22:43,396 [myid:3] - INFO [Thread-1:QuorumCnxManager$Listener#486] - My election bind port: 0.0.0.0/0.0.0.0:3890
2014-05-14 15:22:43,404 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#354] - Exception causing close of session 0x0 due to java.io.IOExce
ption: ZooKeeperServer not running
2014-05-14 15:22:43,404 [myid:3] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1001] - Closed socket connection for client /127.0.0.1:50923 (no se
ssion established for client)
2014-05-14 15:22:43,427 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer#670] - LOOKING
2014-05-14 15:22:43,429 [myid:3] - INFO [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection#740] - New election. My id = 3, proposed zxid=0xf100000001
2014-05-14 15:22:48,438 [myid:3] - WARN [WorkerSender[myid=3]:QuorumCnxManager#368] - Cannot open channel to 1 at election address /54.76.10.81:3888
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
at java.lang.Thread.run(Thread.java:662)
2014-05-14 15:22:53,440 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager#368] - Cannot open channel to 1 at election address /54.76.10.81:3
888
java.net.SocketTimeoutException: connect timed out
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
at java.net.Socket.connect(Socket.java:529)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:388)
I searched a lot on this but I did not found anything useful for me so I hope someone can help me.
Thanks
I've seen behavior like this as well. A ZK configuration that's been running fine will sometimes simply fail to restart. When this happens I've tried the following:
1) look at the logs for all of the servers...often one will list an error
2) stop all servers and restart
3) stop all servers and restart the servers one at a time
4) verify that each server's myid file exists, has correct permissions and has the right value.
I've used clusterssh to open windows to each of the servers so that the restarts can be at the very same time...and then I've tailed all of the server logs. Keep in mind that during restart the ZK cluster is doing a lot: both starting each server and electing a leader. I've had times when the cluster seemed to fail and then after a few more minutes it seems to figure it out.
There is a great tool called zktop that I've used for monitoring ZK.
I fixed it by changing the IP 127.0.0.1 to the internal IP for amazon node, after making this change for the three nodes and restart, this problem did not happened again. I hope this answer can help someone asking about the same problem.
make sure you have put correct data Dir in each of your node configuration.
and also put a myid file in data Dir and put a number between 1-255 for each of you node in the myid file.
I think it resole the issue.

Resources