I am developing an application with multiple (micro)services.
I am using Flink (over Kafka) to stream messages between the services. Flink is embedded in the Java applications, each running in a separate Docker container.
This is the first time I'm trying Flink and after reading the docs I still have a feeling I'm missing something basic.
Who is managing the jobs?
Where is the JobManager running?
How do I monitor the processing?
Thanks,
Moshe
I would recommend this talk by Stephan Ewen at Flink Forward 2016. It explains the current Apache Flink architecture (10:45) for different deployments as well as future goals.
In general, the JobManager manages Flink jobs, and TaskManagers execute the tasks your job consists of. How the components are orchestrated depends on your deployment (local, standalone Flink cluster, YARN, Mesos, etc.).
The best tool for monitoring your processing is the Flink Web UI (on port 8081 by default); it offers various metrics for debugging and monitoring (e.g., checkpointing or back-pressure).
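In addition to the Web UI, the same port serves Flink's monitoring REST API, which is convenient for scripting. A small sketch (the job id and JSON below are canned sample data, not real output; against a live cluster you would query the REST endpoint directly):

```shell
# Against a running cluster you would fetch live data, e.g.:
#   curl -s http://localhost:8081/jobs/overview
# Here we parse a canned sample response so the snippet is self-contained:
SAMPLE='{"jobs":[{"jid":"abc123","name":"my-job","state":"RUNNING"}]}'
echo "$SAMPLE" | python3 -c '
import json, sys
# Print id and state for each job in the overview response
for job in json.load(sys.stdin)["jobs"]:
    print(job["jid"], job["state"])
'
```

The same API exposes per-job details such as `/jobs/<jobid>/checkpoints` for checkpoint statistics.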
Related
Can someone please help me understand this better?
How do I create a Flink session-based cluster, and how do I create a job-based cluster? Are there any specific configuration parameters?
How do we know whether a cluster is session based or job based?
This is covered in the Flink documentation. For an overview, see https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/overview/, and for the details, see the pages for your specific environment, for example:
standalone
standalone kubernetes
native kubernetes
yarn
Note that job mode deployments have been deprecated in favor of application mode.
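To make the distinction concrete, here is a hedged sketch of how the two standalone flavors are started, and one way to tell them apart on a running host (the example jar and job class names are placeholders):

```shell
# Session cluster: started once, many jobs are submitted to it.
#   ./bin/start-cluster.sh
#   ./bin/flink run examples/streaming/WordCount.jar
#
# Application cluster: the cluster's lifetime is bound to a single application.
#   ./bin/standalone-job.sh start --job-classname org.example.MyJob
#   ./bin/taskmanager.sh start
#
# One way to tell them apart is the JobManager's entrypoint class: a session
# cluster runs StandaloneSessionClusterEntrypoint, an application cluster runs
# StandaloneApplicationClusterEntryPoint. With a canned sample of `ps` output:
PS_SAMPLE="java ... org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint"
case "$PS_SAMPLE" in
  *SessionClusterEntrypoint*)     echo "session cluster" ;;
  *ApplicationClusterEntryPoint*) echo "application cluster" ;;
esac
```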
My Current Flink Application
Based on Flink Stateful Functions 3.1.1, it reads messages from Kafka, processes them, and then sinks the results to a Kafka egress.
The application has been deployed on K8s following this guide and is running well: Stateful Functions Deployment
On top of the standard deployment, I have turned on Kubernetes HA.
My Objectives
I want to auto scale the stateful functions up/down.
I also want to know how to create additional standby JobManagers.
My Observations about the HA
I tried to set kubernetes.jobmanager.replicas in the flink-config ConfigMap:
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: flink-config
  labels:
    app: shadow-fn
data:
  flink-conf.yaml: |+
    kubernetes.jobmanager.replicas: 7
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
I see no standby job managers in K8s.
Then I directly adjusted the replicas of the Deployment:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: statefun-master
spec:
  replicas: 7
Standby JobManagers show up. I checked the pod logs, and the leader election completes successfully. However, when I access the UI in a web browser, it says:
{"errors":["Service temporarily unavailable due to an ongoing leader election. Please refresh."]}
What's wrong with my approach?
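For debugging situations like this: Kubernetes HA records the elected leader in ConfigMaps, so one way to inspect the election state is via kubectl. A hedged sketch (the ConfigMap name and the address below are made-up placeholders, not taken from this cluster):

```shell
# On a live cluster (names are placeholders):
#   kubectl get configmaps | grep leader
#   kubectl get configmap <cluster-id>-restserver-leader -o yaml
# The leader's address is stored in the ConfigMap data. With a canned sample line:
SAMPLE_LINE='address: https://statefun-master-0:8081'
echo "leader rest endpoint: ${SAMPLE_LINE#address: }"
```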
My Questions about the scaling
Reactive Mode is exactly what I need. I tried it but failed; the JobManager has this error message:
Exception in thread "main" org.apache.flink.configuration.IllegalConfigurationException: Reactive mode is configured for an unsupported cluster type. At the moment, reactive mode is only supported by standalone application clusters (bin/standalone-job.sh).
It seems that stateful function auto scaling shouldn't be done in this way.
What's the correct way to do the auto scaling, then?
Potential Approach (Probably Incorrect)
After some research, my current direction is:
The JobManager has nothing to do with auto scaling; it is related to HA on K8s. I just need to make sure the JobManager has the correct failover behavior.
My stateful functions are Flink remote services, i.e., regular K8s services. They can be deployed as Knative services to achieve auto scaling; replicas of a service go up only when HTTP requests come in from Flink's workers.
The most important part: Flink's workers (TaskManagers). I have no idea how to auto scale them yet. Maybe I should use Knative to deploy the Flink workers as well?
If that doesn't work with Knative, maybe I should change the Flink runtime deployment entirely, e.g., try the original Reactive Mode demo. But I'm afraid Stateful Functions are not intended to work like that.
Finally
I have read the Flink documentation and the GitHub samples over and over but cannot find any more information on this. Any hints/instructions/guidelines are appreciated!
Since Reactive Mode is a new, experimental feature, not all features supported by the default scheduler are also available with Reactive Mode (and its adaptive scheduler). The Flink community is working on addressing these limitations.
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/elastic_scaling/
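To show what the supported setup looks like, here is a hedged flink-conf.yaml fragment for Reactive Mode on a standalone application cluster (the job class name is a placeholder):

```yaml
# flink-conf.yaml fragment enabling Reactive Mode. This only takes effect when
# the cluster is started as a standalone application cluster, e.g.:
#   ./bin/standalone-job.sh start --job-classname org.example.MyStreamingJob
#   ./bin/taskmanager.sh start
scheduler-mode: reactive
```

With this in place, the job's parallelism follows the number of available TaskManager slots, so scaling is done by adding or removing TaskManagers.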
So, I am very new to all the Apache frameworks I am trying to use. I would like your suggestions on a couple of workflow designs for an IoT streaming application:
Since NiFi connectors are available for Flink, and we can easily use the Beam abstraction over Flink, can I use NiFi as the dataflow tool to move data from MiNiFi to the Flink cluster (storing it in memory or similar) and then use a Beam pipeline to process the data further?
Is there any NiFi connector for Beam? If not, can we build one, so that we stream data directly from NiFi to the Beam job (running on a Flink cluster)?
I am still in the early design phase, it would be great if we can discuss possible workarounds. Let me know if you need any other details.
Does Flink support multi-data-center deployments so that jobs can survive a single data-center outage (i.e., resume failed jobs in a different data center)?
My limited research on Google suggests the answer: there is no such support.
Thanks
While there is no explicit support for multi-datacenter failover in Flink, it can be (and has been) set up. A Flink Forward talk provides one example: engineers from Uber describe their experience with this and related topics in Practical Experience running Flink in Production. There are a number of components to their solution, but the key ideas are: (1) running extra YARN resource managers, configured with a longer inetaddr.ttl, and (2) checkpointing to a federation of two HDFS clusters.
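Mechanically, the cross-DC part lives in where checkpoints and HA metadata are written, not in Flink itself. A hedged sketch of the relevant flink-conf.yaml keys (the hdfs:// paths and the ZooKeeper-based HA here are illustrative assumptions; the storage layer behind the URI must span or replicate across data centers):

```yaml
# Flink only sees a filesystem URI; durability across data centers comes from
# the storage layer (e.g. an HDFS federation of two clusters) behind it.
state.checkpoints.dir: hdfs://shared-federation/flink/checkpoints
high-availability: zookeeper
high-availability.storageDir: hdfs://shared-federation/flink/ha
```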
How do I configure Flink so that TaskManagers automatically restart after a failure?
On YARN and Kubernetes, Flink has native resource managers (YarnResourceManager and KubernetesResourceManager) that will arrange for the requested number of slots to be available. In other environments, you'll need to use cluster-framework-specific mechanisms to take care of this yourself.
Note that for K8s, only session clusters are supported by this new, more active mode implemented by KubernetesResourceManager. Job clusters still need to be managed in the old-fashioned way, as described in the docs.
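For standalone Kubernetes (the "old-fashioned way"), automatic TaskManager restarts come from running them as a Deployment, so the kubelet replaces failed pods. A minimal, hedged sketch (names, image tag, and replica count are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flink-taskmanager
spec:
  replicas: 2                  # Kubernetes keeps this many TaskManager pods alive
  selector:
    matchLabels:
      app: flink
      component: taskmanager
  template:
    metadata:
      labels:
        app: flink
        component: taskmanager
    spec:
      containers:
        - name: taskmanager
          image: flink:1.16    # placeholder image tag
          args: ["taskmanager"]
```

If a TaskManager pod crashes, the Deployment controller starts a replacement, which then re-registers with the JobManager.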
And then there are managed Flink environments where these details are taken care of for you -- e.g., Ververica Platform or Kinesis Data Analytics.