Flink Yarn infinite restart on task failure - apache-flink

I am running a Flink streaming job on an AWS YARN cluster with the following configuration:
Master Node - 1, Core Node - 1, Task Nodes - 3
And I enabled
jobmanager.execution.failover-strategy: region
One of my task nodes is failing, and Flink tries to restart at the region level (in my case, at the task node level). I enabled the restart strategy as fixed-delay restart with 5 attempts and a 5-minute delay, and my checkpoints are disabled.
Reference Image
As you can see in the image, it is restarting more times than expected.
Can anybody help me understand why it is behaving like this?

The documentation has a section about the "Restart Pipelined Region Failover Strategy" [1]. The bottom line is, if you have a streaming job with an operator that physically partitions the stream, such as keyBy, all tasks will end up being in the same region, and therefore all tasks will be restarted as a whole. For batch jobs, you need to configure the ExecutionMode [2] to be BATCH or BATCH_FORCED.
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/task_failure_recovery.html#restart-pipelined-region-failover-strategy
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.9/api/java/org/apache/flink/api/common/ExecutionMode.html
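For reference, the setup described in the question maps to flink-conf.yaml entries roughly like the following (a minimal sketch; key names as documented for Flink 1.9):
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 5
restart-strategy.fixed-delay.delay: 5 min
jobmanager.execution.failover-strategy: region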

Related

Is it possible to add a new embedded worker while the cluster is running on Statefun?

Here is the deal:
I'm trying to add a new (embedded) worker to a running cluster (Flink Statefun 2.2.1).
As you can see, the new task manager registers with the cluster:
Screenshot of newly deployed taskmanager
But it doesn't initialize (it doesn't deploy the sources).
What am I missing here? (Do the master and workers have to have the same jar files? Or should deploying the task manager with the jar file be enough?)
Any help would be appreciated,
Thx.
Flink supports two different approaches to rescaling: active and reactive.
Reactive mode is new in Flink 1.13 (released just this week), and works as you expected: add (or remove) a task manager, and your application will adjust to the new parallelism. You can read about elastic scaling and reactive mode in the docs.
Reactive mode is currently a work in progress, but might meet your needs.
In broad strokes, for active mode rescaling you need to:
Do a stop with savepoint to bring down your current job while taking a snapshot of its state.
Relaunch with the new parallelism, using the savepoint as the starting point.
The exact details depend on how your cluster is deployed.
For a step-by-step tutorial, see Upgrading & Rescaling a Job in the Flink Operations Playground.
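As a rough sketch of those two steps on the command line (the savepoint directory, job ID, parallelism, and jar name are placeholders, not taken from your setup):
$ bin/flink stop --savepointPath /tmp/savepoints <jobId>
$ bin/flink run -s /tmp/savepoints/savepoint-<id> -p 8 statefun-job.jar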
The above applies to rescaling statefun embedded functions. Being stateless, remote functions can be rescaled more straightforwardly.

Can't get checkpoint to load in Flink

I'd like the latest checkpoint to be loaded in Flink, but it just isn't. I've written a word count application that is meant to resume counting where it left off after a restart. I am running it from my IDE, so I'm not starting a Flink cluster.
Here's the code I wrote: https://github.com/edu05/wordcount/tree/simple
It is inspired by the checkpointing example provided by the Flink creators: https://github.com/streaming-with-flink/examples-scala
What am I missing? How can I also avoid re-printing some of the word counts? I don't see many Apache Flink contributors on Stack Overflow; is there another, more appropriate forum?
By default, checkpoints are not retained; they are only used to resume a job after failures.
If you need to start your job from a retained checkpoint, you have to do it manually, just as with a savepoint, in the following way:
$ bin/flink run -s :checkpointMetaDataPath [:runArgs]
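If you want checkpoints to survive cancellation so they can be resumed from, you can enable retained (externalized) checkpoints. A minimal flink-conf.yaml sketch, assuming a recent Flink version (the directory path is a placeholder):
state.checkpoints.dir: file:///tmp/flink-checkpoints
execution.checkpointing.externalized-checkpoint-retention: RETAIN_ON_CANCELLATION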

Flink EMR Installation

I am new to Flink and trying to deploy it on an EMR cluster. I am using a 3-node cluster (1 master and 2 slaves) and have not made any configuration changes, sticking with the defaults.
I am curious to understand the following points:
How do the master and slaves communicate with each other, given that I have not mentioned any IPs in conf/slaves on the master node?
I can see a Flink library on the master node (path: /usr/lib/flink) but cannot find a Flink library on the slave nodes. How does my code get executed on the slave nodes?
I will change some configuration in conf/flink-conf.yaml according to my requirements, if needed. Do I need to make any other changes on the master or slave nodes apart from this?
See the Running flink-crawler in EMR wiki page for details on how we run a Flink streaming job on top of EMR. Note that in this mode Flink is running via YARN, thus the Flink conf/slaves file isn't being used. You should also take a look at the YARN Setup documentation to better understand how Flink runs on top of YARN.
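For illustration, this is roughly what launching Flink on YARN looks like when done by hand (the memory sizes and the job class/jar are placeholders): YARN allocates TaskManager containers dynamically, which is why conf/slaves is never consulted.
$ bin/yarn-session.sh -jm 1024m -tm 2048m
$ bin/flink run -c com.example.MyJob myJob.jar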

Why "Configuration" section of running job is empty?

Can anybody explain to me why the "Configuration" section of a running job in the Apache Flink Dashboard is empty?
How can I use this job configuration in my flow? It seems this is not described in the documentation.
The configuration tab of a running job shows the values of the ExecutionConfig. Depending on the version of Flink, you will experience different behaviour.
Flink <= 1.0
The ExecutionConfig is only accessible for finished jobs. For running jobs, it is not possible to access it. Once the job has finished or has been stopped/cancelled, you should be able to see the ExecutionConfig.
Flink > 1.0
The ExecutionConfig can also be accessed for running jobs.

Apache Flink Quick Start "Analyze the Result" error for K-Means

I followed the Apache Flink quick start via: quick_start
I am not able to perform the last task, i.e. 'Analyze the Result', because there is no result file inside the kmeans folder.
If you look at the above screenshot of the Flink JobManager, you can see the status FAILED for the KMeans Example job, and maybe due to this failed status there is no result file inside the kmeans folder.
Now, on clicking the KMeans Example, I get the following visualization:
Screenshot of the job visualization
And below is a screenshot of the exceptions:
Screenshot of the exceptions
Could you please guide me on what I am doing wrong?
The problem is that the cluster has been started with a single TaskManager which has only a single slot, while you want to execute the KMeans job with a parallelism of 4.
In order to run the job with a parallelism of 4, you have to increase the number of TaskManagers in your cluster or the number of slots on each TaskManager. The latter can be set in the Flink configuration flink-conf.yaml with taskmanager.numberOfTaskSlots: 4. For the former, you can modify the conf/slaves file to add new machines for the additional TaskManagers, as sketched below.
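A minimal sketch of those two edits (the hostnames are placeholders):
In conf/flink-conf.yaml:
taskmanager.numberOfTaskSlots: 4
In conf/slaves, one hostname per line; each listed host runs an additional TaskManager:
worker1
worker2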
Alternatively, you can decrease the parallelism of your job to 1. You can control the parallelism with the command line option -p, e.g. bin/flink run -p 1 -c JobClass job.jar.
