Rolling restart of zetcd causes Flink process to terminate - apache-flink

I am running zetcd and Flink in containers on AWS Fargate. The zetcd cluster contains three nodes. The deployment strategy is to replace one node at a time to maintain quorum. Deployments to the zetcd cluster cause the Flink processes to die because they fail to connect to ZooKeeper.
I observe the following scenario:
Starting condition: a healthy zetcd cluster with three nodes and a healthy Flink cluster.
When the first zetcd node is deployed, some Flink instances may lose their connection to ZooKeeper if they are talking to that specific zetcd node, but they re-establish the connection to a different healthy zetcd node.
When the second zetcd node is deployed, the same thing happens. Notably, Flink never tries to connect to the newly provisioned zetcd nodes.
When the last zetcd node is deployed, Flink fails to re-establish a connection with zetcd, and the Flink process terminates.
When all Flink nodes are reprovisioned, the system returns to a healthy state.
My hypothesis is that Flink caches the zetcd node addresses at startup and is never made aware of zetcd node replacements. Once all of the initial zetcd nodes have been replaced, Flink can no longer connect to ZooKeeper and dies.
Flink uses Apache Curator; perhaps this behavior is an artifact of how Curator manages connections to ZooKeeper?
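For context, here is roughly how a Curator client gets built against a fixed connection string (a minimal sketch using the stock, unshaded Curator API, not Flink's actual code). The point is that the connection string, and therefore the set of hosts the client will ever try, is supplied once when the client is constructed; depending on the ZooKeeper client version, the DNS name may also only be resolved at that point.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorConnectionSketch {
    public static void main(String[] args) {
        // The connection string is fixed when the client is built; the client only
        // ever cycles through the hosts derived from this string.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zetcd-service.local:2181",            // same value as high-availability.zookeeper.quorum
                new ExponentialBackoffRetry(1000, 3)); // base sleep 1 s, max 3 retries
        client.start();
        // Leader election / retrieval recipes are layered on top of a client like this.
        client.close();
    }
}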
I appreciate any guidance on how to keep Flink up to date with the current list of zetcd nodes, or if I am entirely wrong in the first place :)
Relevant flink-conf.yaml
high-availability: zookeeper
high-availability.zookeeper.quorum: zetcd-service.local:2181
high-availability.storageDir: s3://flink-state/ha
high-availability.jobmanager.port: 6123
Flink loses connection to ZK, and attempts to reconnect.
00:42:07.788 [main-SendThread(ip-10-0-59-233.us-west-2.compute.internal:2181)] INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Unable to read additional data from server sessionid 0x79526ef2595a9606, likely server has closed socket, closing socket connection and attempting reconnect
00:42:07.888 [main-EventThread] INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/dispatcher no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Connection to ZooKeeper suspended. Can no longer retrieve the leader from ZooKeeper.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender http://10.0.38.41:8081 no longer participates in the leader election.
00:42:07.888 [Curator-ConnectionStateManager-0] WARN org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Connection to ZooKeeper suspended. The contender akka.tcp://flink@10.0.38.41:6123/user/resourcemanager no longer participates in the leader election.
00:42:07.889 [Curator-PathChildrenCache-0] DEBUG org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - Received CONNECTION_SUSPENDED event
00:42:07.889 [Curator-PathChildrenCache-0] WARN org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore - ZooKeeper connection SUSPENDING. Changes to the submitted job graphs are not monitored (temporarily).
00:42:08.820 [main-SendThread(ip-10-0-160-244.us-west-2.compute.internal:2181)] INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10-0-160-244.us-west-2.compute.internal/10.0.160.244:2181
Flink fails to connect to a ZK node, and dies.
00:42:22.892 [Curator-Framework-0] ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Connection timed out for connection string (zetcd-service.local:2181) and timeout (15000) / elapsed (15004)
org.apache.flink.shaded.curator.org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [flink-dist_2.11-1.8.1.jar:1.8.1]
at org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [flink-dist_2.11-1.8.1.jar:1.8.1]

Related

Flink cluster unable to boot up - getChildren() failed w/ error = -6

I set up a new Flink cluster (v1.15) in a Kubernetes cluster. The new cluster runs in the same namespace as an existing Flink cluster (v1.13), which is working fine.
The job-manager of the new Flink cluster is in a CrashLoopBackOff state and continuously prints the following set of messages, which includes a specific ERROR message:
2022-10-10 23:02:47,214 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Opening socket connection to server flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,214 ERROR org.apache.flink.shaded.curator5.org.apache.curator.ConnectionState [] - Authentication failed
2022-10-10 23:02:47,215 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Socket connection established, initiating session, client: /100.98.125.116:57754, server: flink-zk-client-service/100.65.161.135:2181
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Session establishment complete on server flink-zk-client-service/100.65.161.135:2181, sessionid = 0x381a609d51f0082, negotiated timeout = 4000
2022-10-10 23:02:47,216 INFO org.apache.flink.shaded.curator5.org.apache.curator.framework.state.ConnectionStateManager [] - State change: RECONNECTED
2022-10-10 23:02:47,216 INFO org.apache.flink.runtime.leaderelection.ZooKeeperMultipleComponentLeaderElectionDriver [] - Connection to ZooKeeper was reconnected. Leader election can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,217 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalDriver [] - Connection to ZooKeeper was reconnected. Leader retrieval can be restarted.
2022-10-10 23:02:47,218 ERROR org.apache.flink.shaded.curator5.org.apache.curator.framework.recipes.leader.LeaderLatch [] - getChildren() failed. rc = -6 <============
2022-10-10 23:02:47,218 INFO org.apache.flink.shaded.zookeeper3.org.apache.zookeeper.ClientCnxn [] - Unable to read additional data from server sessionid 0x381a609d51f0082, likely server has closed socket, closing socket connection and attempting reconnect
The error message seems to indicate that a specific znode of the new cluster either does not exist in ZK or does not have any children, but I could be off. The v1.13 cluster uses the same ZooKeeper with no issues.
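For reference, the rc value can be translated into a symbolic error code with the ZooKeeper client classes (a minimal sketch against the stock org.apache.zookeeper API rather than Flink's shaded copy). If the shaded client matches stock ZooKeeper, -6 corresponds to Code.UNIMPLEMENTED rather than to a missing-node code such as NONODE (-101):

import org.apache.zookeeper.KeeperException;

public class DecodeRc {
    public static void main(String[] args) {
        int rc = -6; // the value reported by LeaderLatch in the log above
        // Code.get(int) maps a ZooKeeper result code to its enum constant,
        // e.g. -4 -> CONNECTIONLOSS, -101 -> NONODE, -6 -> UNIMPLEMENTED.
        KeeperException.Code code = KeeperException.Code.get(rc);
        System.out.println(rc + " -> " + code);
    }
}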
Content of ZooKeeper (the cluster ID is dev-cl2):
ls /dev-cl2/dev-cl2
[leader]
ls /dev-cl2/dev-cl2/leader
[]
get /dev-cl2/dev-cl2/leader
// Nothing printed
Any help or pointers to troubleshooting this issue would be greatly appreciated. Thank you.
Update 1:
I noticed the following in the 1.15 release notes.
A new multiple component leader election service was implemented that only runs a single leader election per Flink process. If this should cause any problems, then you can set high-availability.use-old-ha-services: true in the flink-conf.yaml to use the old high availability services.
As a test, I set high-availability.use-old-ha-services: true, but it did not have any effect.

ActiveMQ slave broker accepts incoming connection from Apache Camel

I have the following configuration:
Two active Tomcat instances running Apache Camel 2.20.2 that use the competing-consumers pattern to read messages from the same JMS message queue
ActiveMQ 5.15.0 in a master/slave configuration using a shared kahaDB
It happens that one of the Camel instances connects to the slave broker even though the slave broker is not active (i.e., as far as I can tell from the log files, it did not get the lock on the kahaDB).
When this occurs, the route on that Camel instance is blocked, we get an ExchangeTimedOutException, and messages queue up.
WARN EndpointMessageListener:213 - Execution of JMS message listener failed. Caused by: [org.apache.camel.RuntimeCamelException - org.apache.camel.ExchangeTimedOutException: The OUT message was not received within: 30000 millis. Exchange[ID-MXPBMES-01P-I02-1625784159041-1-16108]]
Is it normal that a slave broker accepts a connection from a client application (Camel in our case)?
The secondary broker should not accept connections, so this sounds like a bug. However, you are not using the latest broker, so before doing anything else you should update to the latest release, as bug fixes are going in all the time.
Some issues can also arise if the underlying file system does not provide a reliable locking mechanism, which can lead to both the primary and the backup broker becoming active.

Configuring Ports for Flink Job/Task Manager Metrics

I am running Flink in Amazon EMR. In flink-conf.yaml, I have metrics.reporter.prom.port: 9249-9250
Depending on whether the job manager and task manager are running on the same node, the task manager metrics are reported on port 9250 (if running on the same node as the job manager) or on port 9249 (if running on a different node).
Is there a way to configure this so that the task manager metrics are always reported on port 9250?
I saw a post saying that we can "provide each *Manager with a separate configuration." How do I do that?
Thanks
You can configure different ports for the JM and TM by starting the processes with differently configured flink-conf.yaml files.
On YARN, Flink currently uses the same flink-conf.yaml for all processes.

Remote debugging Flink local cluster

I want to deploy my jobs on a local Flink cluster during development (i.e. JobManager and TaskManager running on my development laptop) and use remote debugging. I tried adding
"-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005" to the flink-conf.yaml file. Since the job and task manager run on the same machine, the task manager throws an exception stating that the socket is already in use and terminates. Is there any way I can get this running?
You are probably setting env.java.opts, which affects all JVMs started by Flink. Since the jobmanager gets started first, it grabs the port before the taskmanager is started.
You can use env.java.opts.taskmanager to pass parameters only for taskmanager JVMs.
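For example, reusing the agent string from the question (a sketch; adjust suspend/address as needed), the taskmanager-only setting in flink-conf.yaml would look like:
env.java.opts.taskmanager: "-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
If you also want to debug the jobmanager, env.java.opts.jobmanager can carry a separate agent string with a different port.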

ZooKeeper - SOLR issue

We are using Solr 4.2.1 and ZooKeeper 3.4.5 and there are 2 Solr servers.
Solr is reporting "No registered leader was found" and "WARNING ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK".
ZooKeeper is reporting "Exception when following the leader".
But after restarting both, it works for some time and then reports the issue again.
Here are some additional logs from Solr:
SEVERE ZkController There was a problem finding the leader in
zk:org.apache.solr.common.SolrException: Could not get leader props
org.apache.solr.common.SolrException: No registered leader was found, collection:www-live slice:shard1
SEVERE: shard update error StdNode: http://10.23.3.47:8983/solr/www-live/:org.apache.solr.client.solrj.SolrServerException: Server refused connection at: http://10.23.3.47:8983/solr/www-live
SEVERE: Recovery failed - trying again... (5) core=www-live
From ZooKeeper
2016-01-14 11:25:08,423 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
at java.io.DataInputStream.readInt(DataInputStream.java:375)
at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:108)
at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:152)
at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:740)
Any help is much appreciated.
Thank you.
How many ZooKeeper servers do you have?
It must be an odd number for leader election. If you have an even number, please change it to an odd number and try again.
Three ZooKeeper servers is the minimum recommended size for an ensemble, and we also recommend that they run on separate machines.
For reliable ZooKeeper service, you should deploy ZooKeeper in a cluster known as an ensemble. As long as a majority of the ensemble are up, the service will be available. Because Zookeeper requires a majority, it is best to use an odd number of machines. For example, with four machines ZooKeeper can only handle the failure of a single machine; if two machines fail, the remaining two machines do not constitute a majority. However, with five machines ZooKeeper can handle the failure of two machines.
http://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html
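As a quick illustration of the majority rule above (plain arithmetic, nothing ZooKeeper-specific): an ensemble of n servers keeps a majority as long as no more than floor((n - 1) / 2) of them fail, which is why 3 and 4 servers both tolerate only one failure.

public class QuorumMath {
    // Number of failed servers an ensemble of the given size can tolerate
    // while a strict majority remains available.
    static int tolerableFailures(int ensembleSize) {
        return (ensembleSize - 1) / 2;
    }

    public static void main(String[] args) {
        for (int n = 3; n <= 5; n++) {
            System.out.println(n + " servers -> tolerates " + tolerableFailures(n) + " failure(s)");
        }
        // Prints:
        // 3 servers -> tolerates 1 failure(s)
        // 4 servers -> tolerates 1 failure(s)
        // 5 servers -> tolerates 2 failure(s)
    }
}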
