Flink cluster unable to boot up - getChildren() failed with error = -6

I set up a new Flink cluster (v1.15) in a Kubernetes cluster. The new cluster is deployed in the same namespace in which an existing Flink cluster (v1.13) is running fine.
The job-manager of the new cluster is in a CrashLoopBackOff state and continuously prints the following set of messages, which includes one specific ERROR message:
2022-10-10 23:02:47,214 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,214 ERROR org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState [] - Authentication failed
2022-10-10 23:02:47,215 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established, initiating session, client: /100.98.125.116:57754, server: flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session establishment complete on server flink-zk-client-service/100.65.161.135:2181, sessionid = 0x381a609d51f0082, negotiated timeout = 4000
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: RECONNECTED
2022-10-10 23:02:47,216 INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,218 ERROR org.apache.flink.shaded.curator5.org.apache.curator.framework.recipes.leader.LeaderLatch [] - getChildren() failed. rc = -6 <============
2022-10-10 23:02:47,218 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Unable to read additional data from server sessionid 0x381a609d51f0082, likely server has closed socket, closing socket connection and attempting reconnect
The error message seems to indicate that a specific znode of the new cluster either does not exist or does not have any children, but I could be off. I am using the same ZooKeeper for the v1.13 cluster, and that cluster has no issues.
Contents of ZooKeeper (the ClusterId is dev-cl2):
ls /dev-cl2/dev-cl2
[leader]
ls /dev-cl2/dev-cl2/leader
[]
get /dev-cl2/dev-cl2/leader
// Nothing printed
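For what it's worth, the rc in that Curator error is a raw ZooKeeper result code, and it can be decoded with the ZooKeeper client classes (shown unshaded here; Flink bundles shaded copies of the same enum). A minimal sketch:

import org.apache.zookeeper.KeeperException;

public class DecodeRc {
    public static void main(String[] args) {
        // -6 decodes to UNIMPLEMENTED: the server rejected an operation it
        // does not support. A missing znode would instead be NONODE (-101).
        System.out.println(KeeperException.Code.get(-6));   // UNIMPLEMENTED
        System.out.println(KeeperException.Code.get(-101)); // NONODE
    }
}

If rc = -6 really is UNIMPLEMENTED, that points away from a missing znode and toward the server rejecting a request it does not implement; one thing worth checking is the ZooKeeper server version, since the Curator 5 client shipped with Flink 1.15 may issue requests that very old servers (3.4.x) do not support.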
Any help or pointers to troubleshooting this issue would be greatly appreciated. Thank you.
Update 1:
I noticed the following in the 1.15 release notes:
A new multiple component leader election service was implemented that only runs a single leader election per Flink process. If this should cause any problems, then you can set high-availability.use-old-ha-services: true in the flink-conf.yaml to use the old high availability services.
As a test, I set high-availability.use-old-ha-services: true, but it had no effect.
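For reference, a flink-conf.yaml HA section matching this setup would look roughly like this (the quorum host and path values are inferred from the logs and the znode listing above; the storageDir is a placeholder):

high-availability: zookeeper
high-availability.zookeeper.quorum: flink-zk-client-service:2181
high-availability.zookeeper.path.root: /dev-cl2
high-availability.cluster-id: dev-cl2
high-availability.storageDir: s3://<bucket>/ha
high-availability.use-old-ha-services: true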

Related

Rolling restart of zetcd causes Flink process to terminate

I am running zetcd and Flink in containers on AWS Fargate. The zetcd cluster contains three nodes, and the deployment strategy is to replace one node at a time to maintain quorum. Deployments to the zetcd cluster cause Flink processes to die because they fail to connect to ZooKeeper.
I observe the following scenario:
Starting condition: a healthy zetcd cluster with three nodes, and a healthy Flink cluster.
When the first zetcd node is replaced, some Flink instances may lose their connection to ZooKeeper if they were talking to that specific zetcd node, but they reconnect to a different healthy zetcd node.
When the second zetcd node is replaced, the same thing happens. Notably, Flink never tries to connect to the newly provisioned zetcd nodes.
When the last zetcd node is replaced, Flink fails to re-establish a connection to zetcd, and the Flink process terminates.
When all Flink nodes are reprovisioned, the system returns to a healthy state.
My theory is that Flink resolves and caches the zetcd node addresses on startup and is unaware of zetcd node replacements. Once all of the initial zetcd nodes have been replaced, Flink can no longer connect to ZooKeeper and dies.
Flink uses Apache Curator; perhaps this behavior is an artifact of how Curator manages connections to ZooKeeper?
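One cheap way to test this theory from inside a Flink container is to watch what the service name resolves to while nodes are being replaced: if the A records change but Flink never reconnects to the new addresses, stale address caching is the likely culprit. A minimal sketch (the hostname is the one from the config below; note that the JVM's own DNS cache, networkaddress.cache.ttl, can itself mask changes):

import java.net.InetAddress;
import java.util.Arrays;

public class WatchZetcdDns {
    public static void main(String[] args) throws Exception {
        // Re-resolve the zetcd service name every 10 seconds and print the
        // current A records, so node replacements become visible over time.
        while (true) {
            InetAddress[] addrs = InetAddress.getAllByName("zetcd-service.local");
            System.out.println(System.currentTimeMillis() + " -> " + Arrays.toString(addrs));
            Thread.sleep(10_000);
        }
    }
}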
I appreciate any guidance on how to keep Flink up to date with the current list of zetcd nodes, or if I am entirely wrong in the first place :)
Relevant flink-conf.yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: zetcd-service.local:2181
high-availability.storageDir: s3://flink-state/ha
high-availability.jobmanager.port: 6123
Flink loses its connection to ZK and attempts to reconnect:
00:42:07.788 [main-SendThread(ip-10-0-59-233.us-west-2.compute.internal:2181)] INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x79526ef2595a9606, likely server has closed socket, closing socket connection and attempting reconnect
00:42:07.888 [main-EventThread] INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/dispatcher no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://10.0.38.41:8081 no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/resourcemanager no longer participates in the leader election.
00:42:07.889 [Curator-PathChildrenCache-0] DEBUG org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Received CONNECTION_SUSPENDED event
00:42:07.889 [Curator-PathChildrenCache-0] WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
00:42:08.820 [main-SendThread(ip-10-0-160-244.us-west-2.compute.internal:2181)] INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10-0-160-244.us-west-2.compute.internal/10.0.160.244:2181
Flink fails to connect to a ZK node, and dies:
00:42:22.892 [Curator-Framework-0] ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Connection timed out for connection string (zetcd-service.local:2181) and timeout (15000) / elapsed (15004)
org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [flink-dist_2.11-1.8.1.jar:1.8.1]

Zookeeper errors

I am using Solr with ZooKeeper and see the following errors in the ZooKeeper logs.
Using ZooKeeper 3.4.10 and Solr 6.6.
EndOfStreamException: Unable to read additional data from client sessionid 0x1XXXXXXX, likely client has closed socket
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:239)
at org.apache.zookeeper.server.NIOServerCnxnFactory.run(NIOServerCnxnFactory.java:203)
at java.lang.Thread.run(Thread.java:745)
2019-04-28 06:24:59,939 [myid:] - INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn#1044] - Closed socket connection for client /10.40.96.193:46260 which had sessionid 0x1XXXXXXX
The ZooKeeper config:
tickTime=2000
initLimit=10
syncLimit=5
Do these config values result in the above exception? If yes, can someone explain whether we should increase or decrease initLimit and syncLimit?
Thanks in advance.
Those three config parameters only concern the ZooKeeper servers (the ensemble) and are irrelevant to your exception; they govern synchronization between the leader and the followers.
Your client connection exception is more likely caused by a network issue (perhaps TCP keep-alive settings).
See the ZooKeeper Administrator's Guide, section "Cluster Options", for more information on initLimit and syncLimit.
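Concretely, with the values from the question, the two limits translate into wall-clock times like this (both are measured in ticks):

tickTime=2000    # one tick = 2000 ms
initLimit=10     # a follower gets 10 ticks = 20 s to connect to and sync with the leader
syncLimit=5      # a follower may be at most 5 ticks = 10 s out of sync before it is dropped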

Running a Flink built-in program sometimes raises an exception: java.io.IOException: Connecting the channel failed

I have set up a Flink standalone cluster with one master and three slaves, all SUSE Linux machines. In the master's dashboard (http://flink-master:8081/) I can see 3 task managers and 3 task slots, since I have set taskmanager.numberOfTaskSlots: 1 in flink-conf.yaml on each of the slaves.
When I run a built-in Flink program, like examples/streaming/Iteration.jar, I often get this exception:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:197)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:132)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:84)
at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:59)
at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:156)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:480)
at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:502)
at org.apache.flink.streaming.runtime.io.BarrierTracker.getNextNonBlocked(BarrierTracker.java:93)
at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:214)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:718)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'ccr202/127.0.0.2:49651' has failed. This might indicate that the remote task manager has been lost.
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:220)
at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:132)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
... 1 more
Caused by: java.net.ConnectException: Connection refused: ccr202/127.0.0.2:49651
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
... 6 more
It seems that the network causes the problem, but sometimes the Flink program finishes successfully. So what is the reason?
I also encounter this issue very frequently, especially when there are many task managers. It happens when a task manager reads a remote partition through a Netty connection and the connection request times out. Increasing the config taskmanager.network.netty.server.numThreads solved the issue for me.
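For reference, that option (and its client-side counterpart) goes in flink-conf.yaml; the values below are only an example starting point, not a tuned recommendation:

taskmanager.network.netty.server.numThreads: 8
taskmanager.network.netty.client.numThreads: 8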

Connection refused when starting Solr with external Zookeeper

I have set up 3 servers on Amazon EC2, each with the following ZooKeeper config:
tickTime=2000
initLimit=10
syncLimit=5
clientPort=2181
server.1=server1address:2888:3888
server.2=server2address:2888:3888
server.3=server3address:2888:3888
I start ZooKeeper on each server, and after I start Solr on the servers, I get errors like this in Solr:
3766 [main] INFO org.apache.solr.common.cloud.ConnectionManager – Waiting for client to connect to ZooKeeper
3790 [main-SendThread(*serverAddress*:2181)] WARN org.apache.zookeeper.ClientCnxn – Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:692)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
This was apparently happening because ZooKeeper wasn't running properly. What I then figured out was that ZooKeeper was producing this error:
2013-06-09 08:00:57,953 [myid:1] - INFO [ec2amazonaddress.com/ipaddress#amazon:QuorumCnxManager$Listener#493] - Received connection request /ipaddress:60855
2013-06-09 08:00:57,963 [myid:1] - WARN [WorkerSender[myid=1]:QuorumCnxManager#368] - Cannot open channel to 3 at election address ec2amazonaddress/ipaddress#amazon:3888
java.net.ConnectException: Connection refused
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
at java.net.Socket.connect(Socket.java:579)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
So the problem is with ZooKeeper. What I did was start a different server before the one I had previously started first, and then it worked. However, after some restarts that didn't work anymore. In other words, it seems like the order in which you start the ZK servers matters. I was able to see that some servers that were fired up first went into follower mode instead of leader mode right away, and maybe that's the reason. I have deleted and reinstalled my whole setup, but the problem is still there.
I have checked the ports and killed all processes using ports 2181 and 2888/3888 before launching ZooKeeper. What bothers me is that this has worked with the same setup earlier.
I hope some of you have experience with this problem. Any suggestion related to not being able to connect to the ZK servers is also welcome.
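When debugging this kind of start-order behavior, a quick way to see whether the quorum has formed, and which role each node took, is ZooKeeper's stat four-letter command or the status script shipped with ZooKeeper (both available in 3.4.x), for example:

# ask a node for its state (leader/follower) and connection stats
echo stat | nc server1address 2181

# or, on each server, use the bundled script
bin/zkServer.sh status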

SolrCloud with embedded ZooKeeper server says: "ZooKeeperServer not running"

When I start my SolrCloud server, Solr opens a socket connection to the embedded ZooKeeper server but says: "ZooKeeperServer not running".
It doesn't state a reason.
How can I figure out why the ZooKeeper server isn't actually running?
2012-05-30 15:02:36.538 [main] INFO org.apache.solr.cloud.SolrZkServer - STARTING EMBEDDED STANDALONE ZOOKEEPER SERVER at port 9983
2012-05-30 15:02:36.545 [Thread-14] INFO o.a.z.server.ZooKeeperServerMain - Starting server
2012-05-30 15:02:36.552 [Thread-14] INFO o.a.zookeeper.server.ZooKeeperServer - Server environment:zookeeper.version=3.3.3-1203054, built on 11/17/2011 05:47 GMT
... [snip] ...
2012-05-30 15:02:37.092 [main-SendThread()] INFO org.apache.zookeeper.ClientCnxn - Opening socket connection to server localhost/127.0.0.1:9983
2012-05-30 15:02:37.097 [main-SendThread(localhost:9983)] INFO org.apache.zookeeper.ClientCnxn - Socket connection established to localhost/127.0.0.1:9983, initiating session
2012-05-30 15:02:37.097 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] INFO o.a.zookeeper.server.NIOServerCnxn - Accepted socket connection from /127.0.0.1:43635
2012-05-30 15:02:37.100 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] WARN o.a.zookeeper.server.NIOServerCnxn - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2012-05-30 15:02:37.100 [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:9983] INFO o.a.zookeeper.server.NIOServerCnxn - Closed socket connection for client /127.0.0.1:43635 (no session established for client)
2012-05-30 15:02:37.101 [main-SendThread(localhost:9983)] INFO org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x0, likely server has closed socket, closing socket connection and attempting reconnect
In my case specifically, it seemed that having a bunch of extra files in my conf/ directory was causing problems. Try to keep only the minimum set of files necessary in that directory to ensure the embedded ZooKeeper runs properly.
