Which ports should I open in the firewall on nodes running Apache Flink? - apache-flink

When I try to run my flow on an Apache Flink standalone cluster, I see the following exception:
java.lang.IllegalStateException: Update task on instance aaa0859f6af25decf1f5fc1821ffa55d @ app-2 - 4 slots - URL: akka.tcp://flink@192.168.38.98:46369/user/taskmanager failed due to:
at org.apache.flink.runtime.executiongraph.Execution$6.onFailure(Execution.java:954)
at akka.dispatch.OnFailure.internal(Future.scala:228)
at akka.dispatch.OnFailure.internal(Future.scala:227)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:174)
at akka.dispatch.japi$CallbackBridge.apply(Future.scala:171)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
at scala.runtime.AbstractPartialFunction.applyOrElse(AbstractPartialFunction.scala:28)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:134)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://flink@192.168.38.98:46369/user/taskmanager#1804590378]] after [10000 ms]
at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:599)
at scala.concurrent.BatchingExecutor$class.execute(BatchingExecutor.scala:109)
at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:597)
at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
at java.lang.Thread.run(Thread.java:745)
It seems that port 46369 is blocked by the firewall. That is plausible, because I read the configuration section and opened only these ports:
6121:
  comment: Apache Flink TaskManager (Data Exchange)
6122:
  comment: Apache Flink TaskManager (IPC)
6123:
  comment: Apache Flink JobManager
6130:
  comment: Apache Flink JobManager (BLOB Server)
8081:
  comment: Apache Flink JobManager (Web UI)
The same ports are set in flink-conf.yaml:
jobmanager.rpc.address: app-1.stag.local
jobmanager.rpc.port: 6123
jobmanager.heap.mb: 1024
taskmanager.heap.mb: 2048
taskmanager.numberOfTaskSlots: 4
taskmanager.memory.preallocate: false
blob.server.port: 6130
parallelism.default: 4
jobmanager.web.port: 8081
state.backend: jobmanager
restart-strategy: none
restart-strategy.fixed-delay.attempts: 2
restart-strategy.fixed-delay.delay: 60s
So, I have two questions:
Is this exception related to blocked ports?
Which ports should I open in the firewall for a standalone Apache Flink cluster?
UPDATE 1
I found a configuration problem in the masters and slaves files (I had omitted the newline separators between the hosts listed in these files). I fixed it, and now I see other exceptions:
flink--taskmanager-0-app-1.stag.local.log
flink--taskmanager-0-app-2.stag.local.log
I have 2 nodes:
app-1.stag.local (with running job and task managers)
app-2.stag.local (with running task manager)
As you can see from these logs, the app-1.stag.local task manager can't connect to the other task manager:
java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'app-2.stag.local/192.168.38.98:35806' has failed. This might indicate that the remote task manager has been lost.
but app-2.stag.local has the port open:
2016-03-18 16:24:14,347 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful initialization (took 39 ms). Listening on SocketAddress /192.168.38.98:35806
So I think the problem is related to the firewall, but I don't understand where I can configure this port (or range of ports) in Apache Flink.

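I found the problem: the taskmanager.data.port parameter was set to 0 by default (although the documentation says it should be set to 6121). A value of 0 makes the TaskManager pick a free port at startup, which is why a port such as 35806 showed up in the log above.
So I set this port explicitly in flink-conf.yaml, and now everything works fine.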
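Before writing firewall rules, it can help to check which port settings are still missing or set to 0 in flink-conf.yaml. The following is a minimal Python sketch, assuming the flat "key: value" layout shown in the question. The helper names (parse_flat_conf, find_unpinned_ports) are hypothetical, and the key list only covers settings mentioned in these threads, so it is not a complete list for every Flink version. Treating a missing key the same as 0 is also an assumption; the real default depends on the Flink release.

```python
# Port settings mentioned in these threads (assumption: not exhaustive).
PORT_KEYS = [
    "jobmanager.rpc.port",
    "blob.server.port",
    "jobmanager.web.port",
    "taskmanager.rpc.port",
    "taskmanager.data.port",
]


def parse_flat_conf(text):
    """Parse flat 'key: value' lines, skipping blank lines and '#' comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        # Split on the first colon only, so values may contain colons.
        key, _, value = line.partition(":")
        conf[key.strip()] = value.strip()
    return conf


def find_unpinned_ports(text, keys=PORT_KEYS):
    """Return the keys that are missing or set to 0 (treated as 'not fixed')."""
    conf = parse_flat_conf(text)
    return [k for k in keys if conf.get(k, "0") == "0"]


if __name__ == "__main__":
    sample = """
    jobmanager.rpc.port: 6123
    blob.server.port: 6130
    jobmanager.web.port: 8081
    """
    print(find_unpinned_ports(sample))
```

Run against the three port settings from the question, this reports taskmanager.rpc.port and taskmanager.data.port as candidates to pin explicitly.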

Related

Flink: all traffic goes to one Task Manager (in cluster with 1 Job Manager + 2 Task Managers)

I have the following setup:
Installation type k8s: 1.18
Flink version: 1.12
1 Job Manager
2 Task Managers
In the task manager's flink-conf.yaml:
flink-conf.yaml: |
  state.backend: rocksdb
  blob.server.port: 6124
  jobmanager.rpc.port: 6123
  parallelism.default: 2
  queryable-state.proxy.ports: 6125
  taskmanager.numberOfTaskSlots: 2
  taskmanager.rpc.port: 6122
  jobmanager.memory.process.size: 2900m
  taskmanager.memory.process.size: 2900m
  jobmanager.web.address: 0.0.0.0
  rest.address: 0.0.0.0
  rest.bind-address: 0.0.0.0
In the job manager's flink-conf.yaml:
flink-conf.yaml: |
  state.backend: rocksdb
  blob.server.port: 6124
  jobmanager.rpc.port: 6123
  parallelism.default: 2
  queryable-state.proxy.ports: 6125
  taskmanager.rpc.port: 6122
  jobmanager.memory.process.size: 2900m
  taskmanager.memory.process.size: 2900m
  jobmanager.web.address: 0.0.0.0
  rest.address: 0.0.0.0
  rest.bind-address: 0.0.0.0
  rest.port: 8081
With the above configuration, only one task manager is active (that is, receives traffic) and the other task manager remains idle, even when the number of events grows very large.
Please let me know if I am missing anything.
You have set the parallelism to 2, and given each task manager 2 slots. Thus a single task manager can provide the requested parallelism, and that's what will happen by default.
If you want the scheduler to behave differently, you could set
cluster.evenly-spread-out-slots: true
or you could reduce taskmanager.numberOfTaskSlots to 1.
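The effect of the two suggestions can be illustrated with a toy model. This is a sketch for intuition only, not Flink's scheduler: the function assign_slots is hypothetical, "fill first" stands in for the default behaviour described above, and "least used first" stands in for cluster.evenly-spread-out-slots: true.

```python
def assign_slots(parallelism, num_task_managers, slots_per_tm, spread_out=False):
    """Toy model: return how many slots each TaskManager ends up using.

    spread_out=False fills one TaskManager before touching the next.
    spread_out=True always picks the TaskManager with the fewest used slots.
    This is a simplification for illustration, not Flink's scheduler.
    """
    if parallelism > num_task_managers * slots_per_tm:
        raise ValueError("not enough slots for the requested parallelism")
    used = [0] * num_task_managers
    for _ in range(parallelism):
        # Only TaskManagers that still have a free slot are candidates.
        candidates = [i for i in range(num_task_managers) if used[i] < slots_per_tm]
        if spread_out:
            target = min(candidates, key=lambda i: used[i])
        else:
            target = candidates[0]
        used[target] += 1
    return used


if __name__ == "__main__":
    print(assign_slots(2, 2, 2))                   # setup from the question
    print(assign_slots(2, 2, 2, spread_out=True))  # evenly spread
    print(assign_slots(2, 2, 1))                   # one slot per TaskManager
```

In this model, the setup from the question puts both subtasks on the first task manager, while either suggested change places one subtask on each.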

Flink job .UnfulfillableSlotRequestException: Could not fulfill slot req. Req resource profile (ResourceProfile{UNKNOWN}) is unfulfillable

Flink job submission
$ ./bin/flink run -m 10.0.2.4:6123 /streaming/mvn-flinkstreaming-scala/mvn-flinkstreaming-scala-1.0.jar
Stream processing!!!!!!!!!!!!!!!!!
org.apache.flink.streaming.api.datastream.DataStreamSink@40ef3420
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 31 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
... 29 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not fulfill slot request
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: java.util.concurrent.ExecutionException: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
But when I checked the job's logs in the UI, I saw a different error:
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:607)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:591)
... 31 more
Caused by: org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException: No pooled slot available and request to ResourceManager for new slot failed
... 29 more
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.resourcemanager.exceptions.ResourceManagerException: Could not fulfill slot request ea
What should I check? My configuration parameters are below.
A) Is -m ip_address:6123 the right option, or should the port be 8081?
Config:
# Note this accounts for all memory usage within the TaskManager process, including JVM metaspace and other overhead.
taskmanager.memory.process.size: 1568m
# To exclude JVM metaspace and overhead, please, use total Flink memory size instead of 'taskmanager.memory.process.size'.
# It is not recommended to set both 'taskmanager.memory.process.size' and Flink memory.
#
# taskmanager.memory.flink.size: 1280m
# The number of task slots that each TaskManager offers. Each slot runs one parallel pipeline.
taskmanager.numberOfTaskSlots: 2
# The parallelism used for programs that did not specify and other parallelism.
parallelism.default: 2
Starting the cluster:
$ bin/start-cluster.sh
Starting cluster.
Starting standalonesession daemon on host centos1.
Starting taskexecutor daemon on host centos2.
Starting taskexecutor daemon on host centos3.
On the master node, grep finds the Flink process:
]$ ps -ef | grep flink
root 12300 1 10 07:22 pts/0 00:00:05 java -Xms16384m -Xmx16384m -Dlog.file=/storage/flink-1.10.0/log/
On the task manager hosts, I cannot find any Flink process:
centos2 ~]$ psg flink
Is this the expected state?
I faced the same problem in Flink 1.10.0, so please make sure you have enough memory for your data load.
The error I was getting:
java.lang.OutOfMemoryError: Metaspace
jobmanager.heap.size: 1024m (Default)
taskmanager.memory.flink.size: 1280m (Default)
taskmanager.memory.jvm-metaspace.size: 256m (Default)
So I have increased taskmanager.memory.jvm-metaspace.size according to data load and it solved my issue.
I've encountered this problem before; it most likely indicates insufficient memory on your Flink cluster. The different error messages also make sense, as they relate to each other.
Check jobmanager.heap.size and taskmanager.heap.size in the config and increase them to a generously oversized amount; you should then no longer see this error. From there you can fine-tune the actual memory settings.

Flink TaskManager livenessProbe doesn't work

I'm following this doc to configure probes for JobManager and TaskManager on Kubernetes.
JobManager works perfectly, but TaskManager doesn't work. I noticed in the pod log that the liveness probe failed:
Normal Killing 3m36s kubelet, gke-dagang-test-default-pool-494df2ba-vhs5 Killing container with id docker://taskmanager:Container failed liveness probe.. Container will be killed and recreated.
Warning Unhealthy 37s (x8 over 7m37s) kubelet, gke-dagang-test-default-pool-494df2ba-vhs5 Liveness probe failed: dial tcp 10.20.1.54:6122: connect: connection refused
Does the TaskManager actually listen on 6122?
Flink version: 1.9.0
It turns out the cause was that I hadn't added taskmanager.rpc.port: 6122 to flink-config.yaml. Now it works perfectly.
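The "connection refused" message comes from a plain TCP connection attempt, and the same check can be reproduced by hand. The following is a minimal Python sketch of such a check, not the kubelet's implementation; tcp_probe is a hypothetical helper, and "localhost" is a placeholder for the pod IP.

```python
import socket


def tcp_probe(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # 6122 is the taskmanager.rpc.port value from the answer above;
    # "localhost" is a placeholder for the TaskManager pod IP.
    print(tcp_probe("localhost", 6122))
```

If this returns False from inside the pod while the TaskManager is running, the process is probably listening on a different port than the one the probe targets.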

Failed to submit JobGraph Apache Flink

I am trying to run the simple code below after building everything from Flink's GitHub master branch. I get the exception below. What runs on port 9065, and how can I fix this exception?
val dataStream = senv.fromElements(1, 2, 3, 4)
dataStream.countWindowAll(2).sum(0).print()
senv.execute("My streaming program")
The exception:
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$18(RestClusterClient.java:306)
at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.rest.RestClient.lambda$submitRequest$222(RestClient.java:196)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:680)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:603)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:563)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:424)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:268)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:284)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: java.net.ConnectException: Connection refused: localhost/127.0.0.1:9065
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:943)
at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:926)
... 16 more
Caused by: java.net.ConnectException: Connection refused: localhost/127.0.0.1:9065
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:281)
I built it from source as follows (following the instructions on Flink's GitHub page):
git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests
cd build-target
./bin/start-scala-shell.sh local
The underlying distributed runtime is currently under heavy development in master. Starting with 1.5, the default runtime will be the one known as FLIP-6, so occasionally some parts might not work. I think it would be very helpful if you could create a JIRA ticket for this.
As for what runs on port 9065: in the new architecture it is the default port of the Dispatcher.
I had the same exception. My issue was a port conflict when I started the cluster with a Docker image on my machine, so I had changed the REST port in the Flink config file from 8081 to 8084. With that change the cluster started properly, but I was unable to submit jobs. When I killed the conflicting process and reverted the port to 8081, I could submit jobs successfully.
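A quick way to spot this kind of conflict before starting the cluster is to test whether the port can still be bound locally. A minimal Python sketch, assuming the check runs on the same machine as the cluster; port_is_free is a hypothetical helper, and 8081 is the REST port mentioned above.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if this process can bind host:port right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
            return True
        except OSError:
            # Typically "address already in use"; ports below 1024 may also
            # fail with a permission error for unprivileged users.
            return False


if __name__ == "__main__":
    # 8081 is the REST port from the answer above.
    print(port_is_free(8081))
```

The result only reflects the moment of the check: another process can still take the port between this test and the cluster start.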
I got the same error.
Use JDK 1.8 with Flink 1.7.2.

Replace zookeeper server from zookeeper ensemble (with SolrCloud)

I have a SolrCloud cluster (6.6) set up with an external ZooKeeper ensemble (3.4.8) of 5 nodes. Recently, the machine (ip1:port1) that ran the ZooKeeper server with id=1 went down. This is what I did to replace that ZooKeeper server:
Started ZooKeeper on another machine with the same id (=1).
Changed zoo.cfg on the 4 live ZooKeeper servers to point to the new server, and restarted them.
Updated the ZK_HOST variable in solr.in.sh to point to the new ZooKeeper server.
Restarted Solr.
After that, my Solr cluster seemed to be functioning well, but according to solr.log, the Solr client and the ZooKeeper servers still try to connect to the old ZooKeeper server:
Solr log:
2017-12-01 15:04:38.782 WARN (Timer-0-SendThread(ip1:port1)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 30029ms for sessionid 0x0
2017-12-01 15:04:40.807 WARN (Timer-0-SendThread(ip1:port1)) [ ] o.a.z.ClientCnxn Client session timed out, have not heard from server in 31030ms for sessionid 0x0
Zookeeper log:
2017-12-01 13:53:57,972 [myid:] - INFO [main-SendThread(ip1:port1):ClientCnxn$SendThread@1032] - Opening socket connection to server ip1:port1. Will not attempt to authenticate using SASL (unknown error)
2017-12-01 13:54:03,972 [myid:] - WARN [main-SendThread(ip1:port1):ClientCnxn$SendThread@1162] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:361)
at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1141)
2017-12-01 13:54:05,074 [myid:] - INFO [main-SendThread(ip1:port1):ClientCnxn$SendThread@1032] - Opening socket connection to server ip1:port1. Will not attempt to authenticate using SASL (unknown error)
2017-12-01 13:54:06,974 [myid:] - WARN [main-SendThread(ip1:port1):ClientCnxn$SendThread@1162] - Session 0x0 for server null, unexpected error, closing socket connection and attempting reconnect
I searched for how to add or remove a ZooKeeper server but didn't find any documentation for it. My ZooKeeper version (3.4.7) does not support dynamic reconfiguration (which was introduced in ZooKeeper 3.5).
Is there a way to manually remove a ZooKeeper server from the ensemble, or add one to it?
Thanks for your attention!
