Apache Flink in Kubernetes - apache-flink

Could anyone please let me know how I can set up Flink on my serverless platform (FaaS) to perform event-driven operations?
I looked at Flink functions and they seem promising. Could anyone clarify the following?
What do I need to install in my FaaS environment to trigger the Flink function when an event (a file change in my S3 bucket) occurs?
I don't have a big data platform, so I am planning to use Flink in my serverless/Kubernetes environment.
Thanks in advance!!

To use StateFun you would generally need:
1. An Ingress that triggers the functions.
2. The actual code that reacts to your events (the stateful function), Dockerized.
3. A way to launch your application.
Specifically:
1. Every stateful function application starts with an Ingress, which is basically a funnel of events that your functions can react to. In your case, you can use Amazon Kinesis as your Ingress and make sure that your S3 events end up there.
2. Next, you need to familiarize yourself with a stateful function SDK, in either Java or Python, and write the logic that deals with the incoming events. The result of this stage is a Docker image.
3. Then you need to launch the image obtained in (2). You can use Kubernetes for that, but you don't have to. There are Helm charts provided for your convenience, and a simple utility to generate the necessary k8s resources.

Related

Flink CLI vs Flink Web Console

We have a requirement to replace the Flink console UI and provide all the functionality of the Flink web console through CLI utilities. For some of it, like starting jobs and taking savepoints, we are already using the Flink CLI.
My questions are:
Does the Flink CLI have parity with the Flink web UI console?
If not, are there alternative ways to do, without the UI, what is possible via the Flink console (like checking/monitoring the backpressure of a job)?
I am trying to find a solution where an on-call engineer can completely monitor and operate Flink from the command line / terminal without needing to go to the web UI.
Thanks in advance
In theory the Flink CLI plus the REST API provide a superset of the functionality available via the web UI. But some things, like identifying a busy task that's causing backpressure, can be done much more quickly with the web UI. For monitoring and troubleshooting I think you'll need to build some tooling, set up a metrics dashboard (e.g., using Grafana in combination with your preferred metrics reporter), or both.
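As a rough illustration of the REST route, the sketch below reads back-pressure information for one job vertex using only the JDK's HTTP client (Java 11+). The class and method names are made up for this example, and the base URL and IDs are placeholders. The paths `/jobs/overview` and `/jobs/:jobid/vertices/:vertexid/backpressure` come from Flink's REST API, but check them against the documentation for your Flink version.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

/** Hypothetical helper for terminal-only monitoring; not part of Flink. */
final class FlinkRestProbe {
    private final String baseUrl;
    private final HttpClient client = HttpClient.newHttpClient();

    FlinkRestProbe(String baseUrl) {
        // Strip a trailing slash so paths can be appended uniformly.
        this.baseUrl = baseUrl.endsWith("/")
                ? baseUrl.substring(0, baseUrl.length() - 1)
                : baseUrl;
    }

    /** Lists jobs; the response contains the job IDs needed below. */
    URI jobsOverviewUri() {
        return URI.create(baseUrl + "/jobs/overview");
    }

    /** Back-pressure details for one vertex (operator chain) of a job. */
    URI backPressureUri(String jobId, String vertexId) {
        return URI.create(baseUrl + "/jobs/" + jobId
                + "/vertices/" + vertexId + "/backpressure");
    }

    /** Performs a GET and returns the raw JSON body, or throws on a non-200 status. */
    String get(URI uri) throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IOException("HTTP " + response.statusCode() + " from " + uri);
        }
        return response.body();
    }
}
```

An on-call engineer could get the same JSON with `curl` against these paths; wrapping the calls in a small tool mainly helps when the output has to be filtered or polled repeatedly.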

Dynamic Job Creation and Submission to Flink

Hi, I am planning to use Flink as a backend for a feature where we will show a UI that lets the user graphically create event patterns, e.g. multiple login failures from the same IP address.
We will create the Flink pattern programmatically from the criteria the user gives in the UI.
Is there any documentation on how to dynamically create the jar file and dynamically submit the job with it to a Flink cluster?
Is there any best practice for this kind of use case with Apache Flink?
Another way to achieve this is to have one jar that contains something like an “interpreter”, to which you pass the definition of your patterns in some format (e.g. JSON). The “interpreter” then translates this JSON into Flink operators. This is how it is done in https://github.com/TouK/nussknacker/, a Flink-based execution engine. If you use this approach, you will need to handle redeployment of new definitions in your own application.
One straightforward way to achieve this would be to generate a SQL script for each pattern (using MATCH_RECOGNIZE) and then use Ververica Platform's REST API to deploy and manage those scripts: https://docs.ververica.com/user_guide/application_operations/deployments/artifacts.html?highlight=sql#sql-script-artifacts
Flink doesn't provide tooling for automating the creation of JAR files or for submitting them. That's the sort of thing you might use a CI/CD pipeline to do (e.g., GitHub Actions).
Disclaimer: I work for Ververica.
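To make the SQL-generation idea more concrete, here is a minimal sketch that turns a few user-supplied criteria into a MATCH_RECOGNIZE query string. The class name, the `Logins` table and its columns are hypothetical, and the generated query has not been run against a cluster. It assumes the ordering column is a time attribute, which Flink requires for MATCH_RECOGNIZE.

```java
/** Hypothetical generator; all names here are illustrative, not a Flink API. */
final class PatternSqlGenerator {

    /**
     * Builds a query that matches `times` consecutive rows satisfying
     * `condition`, separately for each value of `partitionColumn`.
     * The arguments are concatenated as-is, so anything coming from a UI
     * must be validated before it reaches this method.
     */
    static String repeatedEventQuery(String table,
                                     String partitionColumn,
                                     String timeColumn,
                                     String condition,
                                     int times) {
        if (times < 1) {
            throw new IllegalArgumentException("times must be >= 1");
        }
        return String.join("\n",
                "SELECT *",
                "FROM " + table,
                "MATCH_RECOGNIZE (",
                "  PARTITION BY " + partitionColumn,
                "  ORDER BY " + timeColumn,
                "  MEASURES",
                "    FIRST(E." + timeColumn + ") AS first_event_time,",
                "    LAST(E." + timeColumn + ") AS last_event_time",
                "  ONE ROW PER MATCH",
                "  AFTER MATCH SKIP PAST LAST ROW",
                "  PATTERN (E{" + times + "})",
                "  DEFINE",
                "    E AS " + condition,
                ")");
    }
}
```

For the "multiple login failures from the same IP" case, a call such as `repeatedEventQuery("Logins", "ip", "event_time", "E.status = 'FAILED'", 3)` would produce one script per pattern, which could then be handed to whatever deployment mechanism you use.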

Can I expose an endpoint from my Flink streaming application

I would like to expose an endpoint from my Flink streaming application that returns some static metadata about the app. What are the possible ways to implement this? Please help.
What sort of metadata would you like to retrieve? Flink exposes a CLI that enables you to gather data about the running job. You can use it whether you're running on, e.g., Kubernetes or AWS KDA.
You can also define and expose your own metrics if the CLI doesn't fulfil your use case.

Migrating from Google App Engine Mapreduce to Apache Beam

I have been a long-time user of Google App Engine's Mapreduce library for processing data in the Google Datastore. Google no longer supports it and it doesn't work at all in Python 3. I'm trying to migrate our older Mapreduce jobs to Google's Dataflow / Apache Beam runner, but the official documentation is awful: it just describes Apache Beam and does not tell you how to migrate.
In particular, the issues are these:
In Mapreduce, the jobs would run on your existing deployed application. However, in Beam you have to create and deploy a custom Docker image to build the environment for Dataflow. Is this right?
To create a new job template in Mapreduce, you just need to edit a yaml file and deploy it. To create one in Apache Beam, you need to create custom runner code and a template file deployed to Google Cloud Storage, and link them up with the Docker image. Is this right?
Is the above accurate? If so, is it generally the case that working with Dataflow is much more difficult than Mapreduce? Are there any libraries or tips for making this easier?
In technical terms that's what is happening, but unless you have some specific advanced use cases, you won't need to set up any custom Docker images manually. Dataflow does some work in the background to run your code and its dependencies in a container so that it can execute them on its VMs.
In Dataflow, writing a job template mainly requires writing some pipeline code in your chosen language (Java or Python), and possibly writing some metadata. Once your code is written, creating and staging the template itself isn't much different than running a normal Dataflow job. There's a page documenting the process.
I agree the page on Mapreduce to Beam migration is very sparse and unhelpful, although I think I understand why that is. Migrating from Mapreduce to Beam isn't a straightforward 1:1 migration where only the syntax changes. It's a different pipeline model and most likely will require some level of rewriting your code for the migration. A migration guide that fully covered everything would end up repeating most of the existing documentation.
Since it sounds like most of your questions are around setting up and executing Beam pipelines, I encourage you to begin with the Dataflow quickstart in your chosen language. It won't teach you how to write pipelines, but will teach you how to set up your environment to write and run pipelines. There are links in the quickstarts which direct you to Apache Beam tutorials that teach you the Beam API and how to write your own pipelines, and those will be useful for rewriting your Mapreduce code in Beam.

Using Flink LocalEnvironment for Production

I wanted to understand the limitations of LocalExecutionEnvironment and whether it can be used to run in production.
Appreciate any help/insight. Thanks
LocalExecutionEnvironment spins up a Flink MiniCluster, which runs the entire Flink system (JobManager, TaskManager) in a single JVM. So you're limited to CPU cores and memory available on that one machine. You also don't have HA from multiple JobManagers. I haven't looked at other limitations of the MiniCluster environment, but I'm sure more exist.
A LocalExecutionEnvironment doesn't load a config file on startup, so you have to do all of the configuration in the application. By default it also doesn't offer a REST endpoint. You can solve both these issues by doing something like this:
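// Load the Flink configuration file from the current working directory.
String cwd = Paths.get(".").toAbsolutePath().normalize().toString();
Configuration conf = GlobalConfiguration.loadConfiguration(cwd);
// Starting the web UI (and its REST endpoint) requires flink-runtime-web on the classpath.
StreamExecutionEnvironment env = StreamExecutionEnvironment.createLocalEnvironmentWithWebUI(conf);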
Logging may be another issue that will require a workaround.
I don't believe you'll be able to use the Flink CLI to control the job, but if you create the Web UI (as shown above) you can at least use the REST API to do things like triggering savepoints (after first using the REST API to get the job ID).
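As a sketch of that last point, the helper below builds the two REST requests involved in a savepoint: a `POST` to `/jobs/:jobid/savepoints` to trigger it, and a `GET` to `/jobs/:jobid/savepoints/:triggerid` to poll its status. The class name, base URL and target directory are placeholders chosen for this example, and the JSON body is assembled by hand on the assumption that the directory contains no characters needing escaping.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Hypothetical helper; builds requests only, sending them is left to the caller. */
final class SavepointRequests {

    /** Request that asks Flink to start a savepoint without cancelling the job. */
    static HttpRequest trigger(String baseUrl, String jobId, String targetDirectory) {
        // Assumes targetDirectory needs no JSON escaping (no quotes or backslashes).
        String body = "{\"target-directory\": \"" + targetDirectory
                + "\", \"cancel-job\": false}";
        return HttpRequest.newBuilder(
                        URI.create(baseUrl + "/jobs/" + jobId + "/savepoints"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    /** Request that polls the savepoint started by trigger(). */
    static HttpRequest status(String baseUrl, String jobId, String triggerId) {
        return HttpRequest.newBuilder(
                        URI.create(baseUrl + "/jobs/" + jobId + "/savepoints/" + triggerId))
                .GET()
                .build();
    }
}
```

The trigger response carries a `request-id` field, which is the value to pass as `triggerId` when polling. Both requests can be sent with `java.net.http.HttpClient`, or reproduced with `curl` from a terminal.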

Resources