Apache Flink, job with big graph - submission times out on cluster

We are trying to build a Flink job for price aggregation with quite complicated logic.
For example, the previous version had a job graph as shown below.
After another development iteration, I added even more complexity to the job.
The new version runs fine from the IDE; however, deployment to the cluster fails with:
Caused by: org.apache.flink.runtime.client.JobClientActorSubmissionTimeoutException: Job submission to the JobManager timed out.
If I reconfigure the job (reducing graph complexity), it gets deployed without any problem.
My questions are:
Are there any limitations on job graph size and complexity when submitting to a standalone cluster?
Is there any way to disable the graphical graph representation? (I suspect the problem is caused by the graph view, since the job works locally.)
Are there any debugging tools for understanding what happens during job submission and why it times out?
Thanks in advance.

The solution was to use the latest Flink version (1.5 at the time of writing).
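If upgrading is not immediately possible, one workaround sometimes suggested for pre-1.5 standalone clusters is to raise the client-side submission timeout in flink-conf.yaml. The key below is an assumption based on the legacy (pre-1.5) submission client and should be checked against the configuration docs for your exact version:
akka.client.timeout: 600 s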

Related

Millisecond behind latest jumps after Flink version upgrade

I noticed very strange behavior after a recent version bump from Flink 1.14.4 to 1.15.2. My project consumes around 30K records per second from a sharded Kinesis stream, and during a version upgrade we follow the usual best practice: first trigger a savepoint from the running job, start the new job from that savepoint, and then remove the old job. So far so good, and this procedure has been tested multiple times without any issue on 1.14.4. Usually, after a version upgrade, our job has a few minutes of delay in millisecond behind latest, but it catches up quickly (within 30 minutes). Our savepoint is around one hundred MB, and our job DAG becomes 90-100% busy with some backpressure when we redeploy, but after 10-20 minutes it goes back to normal.
Then the strange thing happened: when I redeployed with the 1.15.2 upgrade from a running 1.14.4 job, I could see that a savepoint had been created and the new job was running, and all the metrics looked fine, except that millisecond behind latest suddenly jumped to 10 hours, and it took days for my application to catch up with the latest record in the Kinesis stream. I don't understand why it jumps from 0 seconds to 10+ hours when we restart the new job. The only significant change I introduced with the version bump was switching failOnError from true to false, but I don't think that is the root cause.
One observation: when I redeploy the 1.15.2 job with a changed parallelism, redeploying from an already-running 1.15.2 job does not introduce a big delay, so I assume the issue only happens when bumping from 1.14.4 to 1.15.2. I tried the version bump twice and saw the same 10+ hour jump in delay both times.
Any insights are welcome, thank you.
Looking through the Flink 1.15 changes related to the Kinesis consumer, nothing obvious stands out to me. I would recommend filing a Jira ticket with the Flink community about this issue. See https://issues.apache.org/jira/projects/FLINK/issues

Reuse Apache Beam Workflow Graph

I'm using Apache Beam to run Batch pipelines on Flink, running on AWS EMR.
Beam pipelines are created and submitted to a long-running Flink cluster. However, I see that there is a cost associated with building the job graph and submitting it to EMR, consistently taking in excess of 2 minutes.
I want to reduce this time and was wondering if there is a way to cache the pipeline, or otherwise prevent Beam from rebuilding it on every run of the driver cron job.
The job graph is a bit complex, so the 2-minute build time is justified; I'm just looking for ways to reduce or eliminate it from happening every time.

Flink task managers are not processing data after restart

I am new to Flink and I deployed my Flink application, which basically performs simple pattern matching. It is deployed in a Kubernetes cluster with 1 JobManager and 6 TaskManagers. For load testing, I am sending messages of size 4.4 KB, 200k messages every 10 minutes, to an Event Hub topic. I added a restart strategy and checkpointing as below, and I am not explicitly using any state in my code as there is no requirement for it:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(interval, CheckpointingMode.EXACTLY_ONCE);
// advanced options:
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(1000);
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(120000);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
// allow job recovery fallback to checkpoint when there is a more recent savepoint
env.getCheckpointConfig().setPreferCheckpointForRecovery(true);
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
5, // number of restart attempts
Time.of(5, TimeUnit.MINUTES) // delay
));
Initially I was facing a Netty server issue with network buffers, so I followed this link on Flink network and heap memory optimization, https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/config.html#taskmanager-network-memory-floating-buffers-per-gate, applied the settings below, and everything is working fine:
taskmanager.network.memory.min: 256mb
taskmanager.network.memory.max: 1024mb
taskmanager.network.memory.buffers-per-channel: 8
taskmanager.memory.segment-size: 2mb
taskmanager.network.memory.floating-buffers-per-gate: 16
cluster.evenly-spread-out-slots: true
taskmanager.heap.size: 1024m
taskmanager.memory.framework.heap.size: 64mb
taskmanager.memory.managed.fraction: 0.7
taskmanager.memory.framework.off-heap.size: 64mb
taskmanager.memory.network.fraction: 0.4
taskmanager.memory.jvm-overhead.min: 256mb
taskmanager.memory.jvm-overhead.max: 1gb
taskmanager.memory.jvm-overhead.fraction: 0.4
But I have the below questions:
If a task manager restarts because of a failure, it restarts successfully and gets registered with the job manager, but afterwards the restarted task manager doesn't process any data; it just sits idle. Is this normal Flink behavior, or do I need to add some setting to make the task manager start processing again?
Sorry, and correct me if my understanding is wrong: Flink has a restart strategy, and in my code I set a limit of 5 restart attempts. What will happen if my Flink job does not successfully overcome the task failure? Will the entire Flink job remain in an idle state so that I have to restart it manually, or is there any mechanism I can add to restart my job even after it has crossed the limit of restart attempts?
Is there any document for calculating the number of cores and the amount of memory I should assign to a Flink job cluster, based on the data size and the rate at which my system receives the data?
Is there any documentation on Flink CEP optimization techniques?
This is the error stack trace I am seeing in the job manager logs before the pattern matching:
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager '/10.244.9.163:46377'. This might indicate that the remote task manager was lost.
at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:393)
at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:358)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:236)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1416)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:257)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:243)
at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:912)
at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:816)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:416)
at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:515)
at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:918)
at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Thanks in advance; please help me resolve my doubts.
Various points:
If your patterns involve matching temporal sequences (e.g., "A followed by B"), then you need state to do this. Most of Flink's sources and sinks also use state internally to record offsets, etc., and this state needs to be checkpointed if you care about exactly-once guarantees. If the patterns are being streamed in dynamically, then you'll want to store the patterns in Flink state as well.
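For illustration, here is a minimal sketch of an "A followed by B" pattern (the Event type and its getType() accessor are hypothetical, not from the original post). Even without any explicit state in user code, the CEP operator keeps partial matches like these in keyed state, and that state is what checkpointing protects:
import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// Matches an "A" event followed (not necessarily immediately) by a "B" event for the same key.
Pattern<Event, ?> aFollowedByB = Pattern.<Event>begin("A")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "A".equals(e.getType()); // hypothetical accessor
            }
        })
        .followedBy("B")
        .where(new SimpleCondition<Event>() {
            @Override
            public boolean filter(Event e) {
                return "B".equals(e.getType());
            }
        })
        .within(Time.minutes(10)); // bound how long partial matches are kept in state

PatternStream<Event> matches = CEP.pattern(keyedEvents, aFollowedByB); // keyedEvents: a keyed DataStream<Event>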
Some of the comments in the code don't match the configuration parameters: e.g., "500 ms of progress" vs. 1000, "checkpoints have to complete within one minute" vs 120000. Also, keep in mind that the section of the documentation that you copied these settings from is not recommending best practices, but is instead illustrating how to make changes. In particular, env.getCheckpointConfig().setPreferCheckpointForRecovery(true); is a bad idea, and that config option should probably not exist.
Some of your entries in config.yaml are concerning. taskmanager.memory.managed.fraction is rather large (0.7) -- this only makes sense if you are using RocksDB, since managed memory has no other purpose for streaming. And taskmanager.memory.network.fraction and taskmanager.memory.jvm-overhead.fraction are both very large, and the sum of these three fractions is 1.5, which doesn't make sense.
In general the default network configuration works well across a wide range of deployment scenarios, and it is unusual to need to tune these settings, except in large clusters (which is not the case here). What sort of problems did you encounter?
As for your questions:
After a TM failure and recovery, the TMs should automatically resume processing from the most recent checkpoint. To diagnose why this isn't happening, we'll need more information. To gain experience with a deployment that handles this correctly, you can experiment with the Flink Operations Playground.
Once the configured restart strategy has played itself out, the job will FAIL, and Flink will no longer try to recover that job. You can, of course, build your own automation on top of Flink's REST API, if you want something more sophisticated.
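As a rough illustration of such automation (the host, port, and JSON check below are assumptions; the exact response shape of the /jobs/&lt;jobid&gt; endpoint should be verified against the REST API docs for your Flink version), a simple watchdog could poll the job status and trigger alerting or resubmission once the job reaches a terminal FAILED state:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical watchdog that polls the JobManager REST API for a single job id
// (passed as args[0]) and reacts once the job has failed terminally.
public class JobWatchdog {
    public static void main(String[] args) throws Exception {
        String jobUrl = "http://jobmanager:8081/jobs/" + args[0]; // placeholder host/port
        while (true) {
            HttpURLConnection conn = (HttpURLConnection) new URL(jobUrl).openConnection();
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            // Crude string check; a real implementation would parse the JSON properly.
            if (body.toString().contains("\"state\":\"FAILED\"")) {
                System.err.println("Job " + args[0] + " failed - resubmit or alert here");
                return;
            }
            Thread.sleep(30_000L); // poll every 30 seconds
        }
    }
}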
Documentation on capacity planning? No, not really. This is generally figured out through trial and error. Different applications tend to have different requirements in ways that are difficult to anticipate. Things like your choice of serializer, state backend, number of keyBys, the sources and sinks, key skew, watermarking, and so on can all have significant impacts.
Documentation on optimizing CEP? No, sorry. The main points are:
- do everything you can to constrain the matches; avoid patterns that must keep state indefinitely
- getEventsForPattern can be expensive (see the sketch below)
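To make that second point concrete, here is a sketch of an iterative condition (Event and its getPrice() accessor are hypothetical). For every candidate event, all events already matched under the "start" pattern are read back from state, which is why getEventsForPattern can be expensive on a hot path:
import org.apache.flink.cep.pattern.conditions.IterativeCondition;

// Accepts the next event only while the running sum of prices stays under a limit.
IterativeCondition<Event> sumBelowLimit = new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event value, Context<Event> ctx) throws Exception {
        double sum = value.getPrice();
        for (Event previous : ctx.getEventsForPattern("start")) {
            sum += previous.getPrice(); // re-reads all prior matches from state
        }
        return sum < 100.0;
    }
};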

Flink EMR Program failing

I have what I would consider a fairly simple Flink program: sourced from a Kafka stream, filters applied, a process function applied, a flat map applied, and the results sent to a Redis sink. Running this locally in a standalone environment on my dev box, there is no problem. I am trying to push this into production on AWS EMR, and I followed the guide for running a Flink program on EMR. After my first test, I had a GC overhead limit exceeded error, so I made adjustments to reduce the amount of data stored. On my next try the program ran for much longer, but it eventually failed without giving any indication of the type of error as it had previously.
I am unsure how to go about debugging problems that I suspect may be a side effect of running on EMR. Most of the monitoring metrics in the EMR console are useless as far as I can tell. If it matters, I am running the program as a Step in EMR; the guide I followed is here: http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-flink.html. This program is also supposed to be an always-up solution; basically, it will constantly be reading from the Kafka stream and processing the data (if that matters at all, I'm not sure whether there is a different configuration I should be using for an always-up solution).
I'll be happy to provide any information needed to help me get this into production.
Thank you

Debugging on the remote cluster

I have a program which works fine on a local cluster but does not run properly when executed on a remote cluster. I would like to know: what are the best and most common ways of debugging a program running on a remote Flink cluster?
Any help is appreciated!
There are several ways to debug a Flink application on a remote cluster.
Since using a real debugger is complicated, I would first try to log as much as possible to find out the error.
Another approach that could be helpful is using Flink's accumulators. With them, you can gather some statistics: for example, when you have a filter, you can determine how many elements passed the filter, and so on.
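As a minimal sketch (the String element type and the predicate are just placeholders), a filter that also counts how many elements pass could look like this; the accumulator's value is reported back with the job result and shows up in the web UI:
import org.apache.flink.api.common.accumulators.IntCounter;
import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.configuration.Configuration;

// Filter that tracks how many elements it lets through via an accumulator.
public class CountingFilter extends RichFilterFunction<String> {
    private final IntCounter passed = new IntCounter();

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().addAccumulator("elements-passed", passed);
    }

    @Override
    public boolean filter(String value) {
        boolean keep = value != null && !value.isEmpty(); // placeholder predicate
        if (keep) {
            passed.add(1);
        }
        return keep;
    }
}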
The last resort is attaching a debugger to one of the Flink TaskManager JVMs.
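If it comes to that, one common way to make a TaskManager JVM attachable (assuming you can restart the TaskManagers with extra JVM options; the env.java.opts.taskmanager key should be checked against your version's configuration docs) is to enable the standard JDWP agent in flink-conf.yaml and then connect your IDE's remote debugger to the chosen port:
env.java.opts.taskmanager: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005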
Also check out my presentation on the topic: http://de.slideshare.net/robertmetzger1/apache-flink-hands-on
