Flink: How to configure Flink such that the TaskManagers auto restart after a failure?

How to configure Flink such that the TaskManagers auto restart after a failure?

On YARN and Kubernetes, Flink has a native resource manager (YarnResourceManager and KubernetesResourceManager) that will arrange for the requested number of slots to be available. In other environments you'll need to use cluster-framework-specific mechanisms to take care of this yourself.
Note that for K8s, only session clusters are supported by this new, more active mode implemented by KubernetesResourceManager. Job clusters still need to be managed in the old-fashioned way, as described in the docs.
And then there are managed Flink environments where these details are taken care of for you -- e.g., Ververica Platform or Kinesis Data Analytics.
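To illustrate what "cluster-framework-specific mechanisms" can look like, here is a minimal sketch for a standalone Kubernetes setup, loosely based on the reference TaskManager Deployment from the Flink docs: Kubernetes itself restarts any TaskManager pod that fails and keeps the requested replica count, so the slots come back without Flink doing anything. The name, labels, image tag, and replica count below are placeholders to adapt.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager        # placeholder name
spec:
  replicas: 2                    # number of TaskManager pods to keep alive
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      containers:
        - name: taskmanager
          image: flink:1.16      # assumption: any recent official Flink image
          args: ["taskmanager"]  # entrypoint argument that starts a TaskManager
          ports:
            - containerPort: 6122  # TaskManager RPC port

When a TaskManager pod crashes, the ReplicaSet recreates it and the new pod registers its slots with the JobManager again.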

Related

Flink Session Cluster vs Job Cluster

Please can someone help me with this to understand it better.
How do I create a Flink session-based cluster, and how do I create a job-based cluster? Do we have any specific configuration params?
How do we know whether a cluster is session based or job based?
This is covered in the Flink documentation. For an overview, see https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/overview/, and for the details, see the pages for your specific environment, for example:
standalone
standalone kubernetes
native kubernetes
yarn
Note that job mode deployments have been deprecated in favor of application mode.
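As a rough sketch of how the two modes differ in a standalone Kubernetes deployment (following those docs), the JobManager container's entrypoint arguments select the mode; the image name and job class below are placeholders:

# Session cluster: the JobManager starts empty and waits for jobs to be submitted to it.
containers:
  - name: jobmanager
    image: flink:1.16                  # assumption: any recent official Flink image
    args: ["jobmanager"]

# Application cluster (the replacement for the deprecated job mode): the cluster is
# dedicated to a single application and is started from the job's main class.
containers:
  - name: jobmanager
    image: my-flink-app:latest         # assumption: image containing your job jar
    args: ["standalone-job", "--job-classname", "com.example.MyJob"]

In practice you can tell the two apart by how the JobManager was started: a session cluster's JobManager runs without a job and accepts submissions, while an application cluster's JobManager is launched with the job's entry point and shuts down when that job finishes.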

How to auto scale up/down Flink Stateful Functions on K8s

My Current Flink Application
Based on Flink Stateful Functions 3.1.1, it reads messages from Kafka, processes them, and then sinks the results to a Kafka egress.
The application has been deployed on K8s following this guide and is running well: Stateful Functions Deployment
Based on the standard deployment, I have turned on Kubernetes HA.
My Objectives
I want to auto scale the stateful functions up and down.
I also want to know how to create more standby job managers.
My Observations about the HA
I tried to set kubernetes.jobmanager.replicas in the flink-config ConfigMap:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: shadow-fn
data:
  flink-conf.yaml: |+
    kubernetes.jobmanager.replicas: 7
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
I see no standby job managers in K8s.
Then I directly adjusted the replicas of the deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: statefun-master
spec:
  replicas: 7
Standby job managers show up. I checked the pod logs and the leader election completed successfully. However, when I access the UI in the web browser, it says:
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
What's wrong with my approach?
My Questions about the scaling
Reactive Mode is exactly what I need. I tried it but failed; the job manager has this error message:
Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Reactive mode is configured for an unsupported cluster type. At the moment, reactive mode is only supported by standalone application clusters (bin/standalone-job.sh).
It seems that stateful function auto scaling shouldn't be done in this way.
What's the correct way to do the auto scaling, then?
Potential Approach (probably incorrect)
After some research, my current direction is:
The Job Manager has nothing to do with auto scaling; it is related to HA on K8s. I just need to make sure the Job Manager has the correct failover behavior.
My stateful functions are Flink remote services, i.e., regular K8s services. They can be deployed in the form of KNative services to achieve auto scaling. The number of service replicas goes up only when HTTP requests come in from Flink's workers.
The most important part is Flink's worker (or TaskManager); I have no idea how to do the auto scaling for it yet. Maybe I should use KNative to deploy the Flink worker?
If it doesn't work with KNative, maybe I should completely change the Flink runtime deployment, e.g., try the original reactive demo. But I'm afraid Stateful Functions are not intended to work like that.
Finally
I have read the Flink documentation and GitHub samples over and over but cannot find any more information on how to do this. Any hints/instructions/guidelines are appreciated!
Since Reactive Mode is a new, experimental feature, not all features supported by the default scheduler are also available with Reactive Mode (and its adaptive scheduler). The Flink community is working on addressing these limitations.
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
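The error in the question matches the restriction described there: Reactive Mode requires a standalone application cluster. As a minimal sketch of how it is enabled in general (not StateFun-specific; the job class below is a placeholder):

# flink-conf.yaml
scheduler-mode: reactive      # enables the adaptive scheduler in reactive mode

# The cluster must then be started as a standalone application cluster, e.g.:
#   ./bin/standalone-job.sh start --job-classname com.example.MyJob
# TaskManagers can afterwards be added or removed (for instance by resizing
# their Deployment/ReplicaSet), and the job rescales to the available slots.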

Deployment of new version of Flink application failed

env
flink 1.7.1
kafka 1.0.1
I use a Flink application for stream processing.
It reads a topic from Kafka and sinks it to a new Kafka topic.
When I change the application with a new version of the code and deploy it, the application execution fails.
If I deploy with the same group.id after changing the application code, could there be a conflict with the previous checkpointed state?
Yes, if you are trying to do a stateful upgrade of your Flink application, there are a few things that can cause it to fail.
The UIDs of the stateful operators are used to find the state for each operator. If you haven't set the UIDs, then if the job's topology has changed, state restore will fail because Flink won't be able to find the state. See the docs on Assigning Operator IDs for details.
If you have dropped a stateful operator, then you should run the new job while specifying -allowNonRestoredState.
If you have modified your data types, the job can fail when attempting to deserialize the state in the checkpoint or savepoint. Flink 1.7 had only very limited support for automatic schema evolution and state migration. In more recent versions of Flink, if you stick to POJOs or Avro, this is handled automatically. Otherwise you need custom serializers.
If this doesn't help you figure out what's going wrong, please share the information from the logs showing the specific exception.

Flink: Multi-data center deployment possible?

Does Flink support multi-data center deployments so that jobs can survive a single data-center outage (i.e. resume failed jobs in a different data-center) ?
My limited research on Google suggests the answer: there is no such support.
Thanks
While there is no explicit support for multi-datacenter failover in Flink, it can be (and has been) set up. A Flink Forward talk provides one example: engineers from Uber describe their experience with this and related topics in Practical Experience running Flink in Production. There are a number of components to their solution, but the key ideas are: (1) running extra YARN resource managers, configured with a longer inetaddr.ttl, and (2) checkpointing to a federation of two HDFS clusters.

Where is the JobManager on embedded Flink instances?

I am developing an application with multiple (micro)services.
I am using Flink (over Kafka) to stream messages between the services. Flink is embedded in the Java applications, each running in a separate docker container.
This is the first time I'm trying Flink and after reading the docs I still have a feeling I'm missing something basic.
Who is managing the jobs?
Where is the JobManager running?
How do I monitor the processing?
Thanks,
Moshe
I would recommend this talk by Stephan Ewen at Flink Forward 2016. It explains the current Apache Flink architecture (10:45) for different deployments as well as future goals.
In general, the JobManager is managing Flink jobs and TaskManagers execute your job consisting of multiple tasks. How the components are orchestrated depends on your deployment (local, Flink cluster, YARN, Mesos etc.).
The best tool for monitoring your processing is the Flink Web UI (at port 8081 by default); it offers various metrics for debugging and monitoring (e.g., checkpointing or back pressure).
