We are building a stream processing job using Flink v1.12.2 and plan to run it on a Kubernetes cluster. In the official Flink documentation we came across two main ways of submitting Flink jobs to a Kubernetes cluster: Standalone mode and Native mode. We noticed that the latter needs no YAML config files and looks simpler. Which mode is recommended, and what are the pros and cons of each? Thank you.
Glad to hear you're trying out Flink on K8s!
The Native mode is the current recommendation for starting out on Kubernetes as it is the simplest option, like you noted. In Flink 1.13 (to be released in the coming weeks), there is added support for specifying Pod templates. One of the drawbacks to this approach is its limited ability to integrate with CI/CD.
Some other popular approaches for a more "Kubernetes" style of running jobs (i.e. just YAML manifests) include Lyft's Operator, the Ververica Platform (disclaimer: I work here, on this), and Google Cloud Platform's Operator. These are all more work to set up but offer a better CI/CD story, which can help make using Flink in production less effort in the long run.
If you'd like to talk about any of these more in-depth, the User Mailing List is full of helpful people that can weigh some of the pros/cons that apply to your use case.
I am considering using Flink or Apache Beam (with the Flink runner) for different stream processing applications. I am trying to compare the two options and make the better choice. Here are the criteria I am looking into and for which I am struggling to find info on the Flink runner (I have already found basically all the info for standalone Flink):
Ease of use
Scalability
Latency
Throughput
Versatility
Metrics generation
Can deploy with Kubernetes (easily)
Here are the other criteria for which I think I already know the answers:
Ability to do stateful operations: Yes for both
Exactly-once guarantees: Yes for both
Integrates well with Kafka: Yes for both (might be a little harder with Beam)
Language supported:
Flink: Java, Scala, Python, SQL
Beam: Java, Python, Go
If you have any insight on these criteria for the Flink runner, please let me know! I will update the post if I find answers!
Update: Good article I found on the advantages of using Beam (ignore the Airflow part):
https://www.astronomer.io/blog/airflow-vs-apache-beam/
As OneCricketeer's comment says, comparing these two is quite subjective.
If you are absolutely sure that you are going to use FlinkRunner, you could cut out the middleman and use Flink directly. That also saves you trouble if Beam turns out to be incompatible with a specific FlinkRunner version you want to use in the future (or if there is a bug). And if you are sure that all the I/Os you are going to use are well supported by Flink, and you know where and how to set up your FlinkRunner (in its different modes), it makes sense to just use Flink.
If you are considering moving to other languages or runners in the future, Beam offers language and runner portability, so that you can write a pipeline once and run it everywhere.
Beam supports more than Java, Python and Go:
JavaScript: https://github.com/robertwb/beam-javascript
Scala: https://github.com/spotify/scio
Euphoria API
SQL
Runners:
DataflowRunner
FlinkRunner
NemoRunner
SparkRunner
SamzaRunner
Twister2Runner
Details can be found on https://beam.apache.org/roadmap/.
I have been using Apache Camel for quite a long time and have found it to be a fantastic solution for all kinds of system-integration business needs. But a couple of years back I came across Apache NiFi. After some googling I found that, although NiFi can work as an ETL tool, it is actually meant for stream processing.
In my opinion, "Which is better?" is a very bad question to ask, as the answer depends on many things. But it would be nice if somebody could describe the basic differences between the two and answer the obvious question of when to use which.
That would help me decide which is the better option for my current requirements, or whether I should use both of them together.
The biggest and most obvious distinction is that NiFi is a no-code approach - 99% of NiFi users will never see a line of code. It is a web-based GUI with a drag-and-drop interface to build pipelines.
NiFi can perform ETL, and can be used in batch use cases, but it is geared towards data streams. It is not just about moving data from A to B, it can do complex (and performant) transformations, enrichments and normalisations. It comes out of the box with support for many specific sources and endpoints (e.g. Kafka, Elastic, HDFS, S3, Postgres, Mongo, etc.) as well as generic sources and endpoints (e.g. TCP, HTTP, IMAP, etc.).
NiFi is not just about messages - it can work natively with a wide array of different formats, but can also be used for binary data and large files (e.g. moving multi-GB video files).
NiFi is deployed as a standalone application - it's not a framework or API or library or something that you integrate into something else. It is a self-contained application that is fully featured out of the box with no additional development, though it can be extended with custom development if required.
NiFi is natively clustered - it is designed to be deployed (though this isn't required) on multiple hosts that work together as a cluster for performance, availability and redundancy.
So, the two tools are used quite differently - hopefully that helps highlight some of the key differences.
It's true that there is some functional overlap between NiFi and Camel, but they were designed very differently:
Apache NiFi is a data processing and integration platform that is mostly used centrally. It has a low-code approach and prefers configuration.
Apache Camel is an integration framework which is mostly used in distributed solutions. Solutions are coded in Java. Example solutions are adapters, flows, APIs, connectors, cloud functions and so on.
They can be used very well together, especially with a message broker like Apache ActiveMQ or Apache Kafka.
An example: A Java application is enhanced with Camel so that it can send messages to Kafka. In NiFi the first step is consuming those messages from Kafka. Then, in the NiFi flow, the message is changed in various steps. In the middle, the message is put on another Kafka topic. A Camel function (Camel K) in the cloud performs various operations on the message and, when it's finished, puts the message on a Kafka topic. The message then goes through a NiFi flow which at the end calls an API created with Camel.
I wrote a blog post covering in detail the various ways to combine Camel and NiFi:
https://raymondmeester.medium.com/using-camel-and-nifi-in-one-solution-c7668fafe451
What are the advantages and disadvantages of using Python or Java when developing Apache Flink Stateful Functions?
Is there any performance difference? Which one is more efficient for the same operation?
Can we develop the application completely in Python?
What are the features that one supports and the other does not?
StateFun supports embedded functions and remote functions.
Embedded functions are bundled and deployed within the JVM processes that run Flink. Therefore they must be implemented in a JVM language (like Java) and they would be the most performant. The downside is that any change to the function code requires a restart of the Flink cluster.
Remote functions execute in a separate process and are invoked by the Flink cluster for every incoming message addressed to them. They are therefore expected to be less performant than embedded functions, but they provide great flexibility in:
Choosing an implementation language
Fast scaling up and down
Fast restart in case of a failure.
Rolling upgrades
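To make the remote-function idea a bit more concrete, here is a toy model in plain Python. It does not use the StateFun SDK, and every name in it (ToyRuntime, greeter_v1, greeter_v2, deliver, swap_handler) is hypothetical. It assumes, purely for illustration, that the runtime side keeps the state and passes it to the function together with each message, so the function itself can be replaced without losing anything.

```python
from typing import Callable, Dict, Tuple

# Hypothetical handler signature: (current state, message) -> (new state, reply)
Handler = Callable[[int, str], Tuple[int, str]]


def greeter_v1(seen: int, name: str) -> Tuple[int, str]:
    """Stateless handler: everything it needs arrives with the call."""
    seen += 1
    return seen, f"Hello {name} (visit {seen})"


def greeter_v2(seen: int, name: str) -> Tuple[int, str]:
    """An upgraded version of the same handler with a different greeting."""
    seen += 1
    return seen, f"Welcome back {name} (visit {seen})"


class ToyRuntime:
    """Stands in for the cluster side: owns the state, invokes the handler per message."""

    def __init__(self, handler: Handler) -> None:
        self.handler = handler
        self.state: Dict[str, int] = {}

    def swap_handler(self, handler: Handler) -> None:
        # Replacing the handler leaves the state held by the runtime untouched.
        self.handler = handler

    def deliver(self, address: str, message: str) -> str:
        # One handler invocation per incoming message addressed to `address`.
        new_state, reply = self.handler(self.state.get(address, 0), message)
        self.state[address] = new_state
        return reply


runtime = ToyRuntime(greeter_v1)
first = runtime.deliver("user-1", "Ada")
runtime.swap_handler(greeter_v2)  # "upgrade" without restarting the runtime
second = runtime.deliver("user-1", "Ada")
print(first)
print(second)
```

In this model the visit counter survives the handler swap because it lives on the runtime side, which is one way to picture why restarts and upgrades of a separately running function can be cheap.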
Can we develop the application completely in Python?
It is possible to develop an application completely in Python; see the Python greeter example.
What are the features that one supports and the other does not?
The following features are currently supported only in the Java SDK:
Richer routing logic from an ingress to a function. Any routing logic that you can describe via code.
A few more state types, like a table and a buffer.
Exposing existing Flink sources and sinks as ingresses and egresses.
Could someone explain the benefits/issues with hosting a database in Kubernetes via a persistent volume claim combined with a storage volume over using an actual cloud database resource?
It's essentially a trade-off: convenience vs control. Take a concrete example: let's say you pay Amazon money to use Athena, which is really just a nicely packaged version of Facebook Presto which AWS kindly operates for you in exchange for $$$. You could run Presto on EKS yourself, but why would you?
Now, let's say you want or need to use Apache Drill or Apache Impala. Amazon doesn't offer them. Nor do any of the other big public cloud providers at the time of writing, as far as I know.
Another thought: what if you want to migrate off of AWS? Your data has gravity as well.
Could someone explain the benefits/issues with hosting a database in Kubernetes ... over using an actual cloud database resource?
As the previous excellent answer noted:
It's essentially a trade-off: convenience vs control
In addition to the previous example (Athena), take a look at RDS as well and see what you would need to handle yourself (and why would you, as already said):
Automatic backups
Multizone deployments
Snapshots
Engine upgrades
Read replicas
and other bells and whistles that come with a managed service as opposed to a self-hosted, self-managed one.
But there is more to it than just convenience and control, and that is what this post is trying to shed light on:
Kubernetes adds another layer of abstraction (pods, services...), and depending on how you handle storage (persistent volumes) you can have two additional considerations:
Access speed (depending on your use case this can be negligible or a showstopper).
The storage you have at hand might not be optimized for relational-database-type I/O (or might restrict your ability to schedule pods efficiently). These are the very same reasons you are advised not to run a database on NFS, for example.
Several recent conference talks on Kubernetes point out that databases are a big no-no for Kubernetes (although this is highly opinionated; we do run average-load MySQL and PostgreSQL databases in k8s), and that heavy load and fast I/O are somewhat challenging to get right on k8s, as opposed to a managed cloud solution where somebody has already fine-tuned everything for you.
In conclusion:
It is all about convenience, controls and capabilities.
We are currently using Apache Camel for ETL; that is, we take daily/weekly/monthly exports from various databases, perform the needed actions and then publish the results somewhere for other databases to ingest.
Recently I saw a talk on Apache Airflow, and it seems to me that it can do the work Camel is doing, only more easily. By easier I mean that it looks like it would be more self-documenting and therefore easier to maintain. Am I correct? And why are there no comparisons between the two, like there are between Camel and Mule?
Apache Camel and Apache Airflow were written for different purposes: the former as an Enterprise Integration Framework, the latter as a platform to programmatically author, schedule and monitor workflows. This is why they are not generally compared side by side.
Apache Camel can be used for ETL: think of ETL as a process integrating the operational DB and the data warehouse, and think of each step in the ETL process as a message.
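To illustrate the "each step as a message" view, here is a rough sketch in plain Python. It is not Camel code. The Message type, the step functions and the sample rows are all made up for illustration, and the extract and load steps are stand-ins for real database access.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Message:
    """One unit of work moving through the pipeline (hypothetical type)."""
    body: List[Dict[str, object]]
    headers: Dict[str, str] = field(default_factory=dict)


def extract(_: Message) -> Message:
    # Stand-in for a query against the operational DB (made-up rows).
    rows = [{"id": 1, "amount": "10.50"}, {"id": 2, "amount": "n/a"}]
    return Message(body=rows, headers={"step": "extract"})


def transform(msg: Message) -> Message:
    # Keep only rows whose amount parses as a number.
    clean = []
    for row in msg.body:
        try:
            clean.append({"id": row["id"], "amount": float(str(row["amount"]))})
        except ValueError:
            continue
    return Message(body=clean, headers={**msg.headers, "step": "transform"})


def load(msg: Message) -> Message:
    # Stand-in for a bulk insert into the data warehouse.
    headers = {**msg.headers, "step": "load", "rows_loaded": str(len(msg.body))}
    return Message(body=msg.body, headers=headers)


def run_route(steps: List[Callable[[Message], Message]], msg: Message) -> Message:
    """Pass the message through each step in order; each step returns a new message."""
    for step in steps:
        msg = step(msg)
    return msg


result = run_route([extract, transform, load], Message(body=[]))
print(result.headers, result.body)
```

Each step receives a message and hands a new one on, so the steps stay independent of one another and can be reordered or swapped without touching the rest of the route.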
Would it be easier to perform the task we are doing now if we changed to Airflow? Well, how well suited a framework is to a specific company's needs generally depends on how things are set up on site. In our case we have chosen Java, and we want our processes to run on both Windows and Linux machines. The comparison then becomes:
Camel's main advantages are that we are already using it, it's Java, and there is even a Spring Boot auto-configuration.
The main disadvantage is that it is hard to maintain: understanding exactly what happens when, and why, is hard. This is not directly caused by the features Camel has as an Enterprise Integration Framework, but by the fact that it is not tailored to simplifying workflows.
Airflow is specifically written with scheduling interdependent jobs in mind; it even has a GUI to simplify this task.
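For a feel of what "scheduling interdependent jobs" means, here is a small sketch using only Python's standard library (graphlib, Python 3.9+). This is not Airflow's API. The job names are hypothetical, and a real scheduler would launch each job where this sketch merely records its name.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical jobs; each job maps to the set of jobs it must wait for.
dependencies: Dict[str, Set[str]] = {
    "export_orders": set(),
    "export_customers": set(),
    "join_reports": {"export_orders", "export_customers"},
    "publish_results": {"join_reports"},
}


def run_in_dependency_order(deps: Dict[str, Set[str]]) -> List[str]:
    """Run every job only after all of its upstream jobs have finished."""
    executed: List[str] = []
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the jobs depend on each other in a loop
    while sorter.is_active():
        for job in sorter.get_ready():
            executed.append(job)  # a real scheduler would launch the job here
            sorter.done(job)      # marking it done unblocks its downstream jobs
    return executed


order = run_in_dependency_order(dependencies)
print(order)
```

The two export jobs have no dependencies, so their relative order is not fixed; the only guarantee is that a job never runs before the jobs it depends on.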
For us it would require additional installations, and it may not work with our Java-written jobs out of the box (I know that it is possible to call Java from Python, but this just adds more complexity).
For my needs I'm going to explore other options and maybe just leave things the way they are.
It depends on the type of problem(s) you are looking to solve. Apache Camel is an enterprise integration framework that implements well-known, accepted Enterprise Integration Patterns to provide specific solutions to well-known types of problems.
Apache Airflow does not implement these integration patterns and therefore would be less useful in solving these specific types of problems.
From my experience with Camel, it is often misused as a generic platform to solve problems that are not enterprise-integration problems, which leads to dealing with the unnecessary overhead and constraints of the framework.
Using your ETL problem as an example, I would think that Apache Camel would be unnecessary unless you were doing some form of Message Routing or Message Transformation of the data that would benefit from an integration solution such as Camel. The solutions that Apache Camel offers for these well-known integration problems are the real benefit of using Apache Camel over another tool or doing it by hand.
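To show what Message Routing and Message Transformation look like in practice, here is a hand-rolled sketch in plain Python. It is not Camel code. The record fields, country codes and destination names are invented for illustration; Camel ships these patterns ready-made, whereas here they are written by hand.

```python
from typing import Dict, List

Record = Dict[str, object]

# Hypothetical routing rule: which countries go to the EU destination.
EU_COUNTRIES = {"DE", "FR", "NL"}


def translate(record: Record) -> Record:
    """Message transformation: reshape a source record into the target format."""
    return {"customer_id": record["id"], "country": str(record["country"]).upper()}


def route(record: Record) -> str:
    """Message routing: pick a destination based on the message content."""
    return "eu_warehouse" if record["country"] in EU_COUNTRIES else "default_warehouse"


def process(records: List[Record]) -> Dict[str, List[Record]]:
    """Transform every record, then group the results by routed destination."""
    destinations: Dict[str, List[Record]] = {}
    for record in records:
        out = translate(record)
        destinations.setdefault(route(out), []).append(out)
    return destinations


result = process([{"id": 1, "country": "de"}, {"id": 2, "country": "us"}])
print(result)
```

If an ETL job needs nothing like route or translate, that is a hint that an integration framework may be more machinery than the job requires.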
TL;DR: To answer your question, Apache Camel is an Enterprise Integration Framework for solving specific types of integration problems and Apache Airflow is not. That is likely why there is no comparison between the two - they are apples and oranges, in a sense.
While you may be able to do some of the same things in both, Apache Camel will also have complex integration solutions out of the box that Airflow won't.