I'd appreciate some advice around the use of Stateful functions.
We are currently using Flink, whereby we consume from a number of Kafka streams, aggregate, run a computation, and then output to a new stream.
The problem is that the computation element is provided by a different team whose language of choice is Python. We would like to provide them with the ability to develop and update their component independently of the streaming elements.
Initially, we just ported their code to Java.
Stateful Functions seems to offer an alternative here, whereby we would keep some of our functionality as is and host the model as a stateful function in Python. I'm wondering, however, if there is any advantage to this over just hosting the computation module in its own pipeline and using an AsyncFunction in Flink to interact with it.
If we were to move to Stateful Functions, I can't help feeling that we would be adding complexity without using its power, but I may be missing some important considerations around speed and resilience.
I want to begin by noting that Stateful Functions does have a DataStream interop module. This means you can use StateFun to handle the Python functions of your pipeline without rewriting the entire Flink Job.
That said, what advantages does Stateful Functions bring over using AsyncIO and doing it yourself?
Automated handling of connections, batching, back-pressure, and retries. Even if you are using a single Python function and no state, Stateful Functions has been heavily optimized to be as fast and efficient as possible, with continual improvements from the community that you get to leverage for free. StateFun has more sophisticated back-pressure and retry mechanisms in place than AsyncIO, which you would otherwise need to build on your own.
Higher-level APIs. StateFun's Python SDK (and the other SDKs) provides well-defined, typed APIs that are easy to develop against. The other team you are working with will only need a few lines of glue code to integrate with StateFun, while the project handles the transport protocols for you.
State! As the name of the project implies, stateful functions are, well, stateful. Python functions can maintain state, and you get Flink's exactly-once guarantees out of the box.
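To give a feel for how small the glue code is and how state shows up in the API, here is a rough sketch of a remote function written with StateFun's Java remote SDK (the Python SDK follows the same shape); the typename, state name, and computation are purely illustrative and not part of your pipeline:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.statefun.sdk.java.Context;
import org.apache.flink.statefun.sdk.java.StatefulFunction;
import org.apache.flink.statefun.sdk.java.StatefulFunctionSpec;
import org.apache.flink.statefun.sdk.java.TypeName;
import org.apache.flink.statefun.sdk.java.ValueSpec;
import org.apache.flink.statefun.sdk.java.message.Message;

public class ComputeFn implements StatefulFunction {

  // Illustrative typename and state name, made up for this sketch.
  static final TypeName TYPENAME = TypeName.typeNameOf("example", "compute");
  static final ValueSpec<Integer> SEEN = ValueSpec.named("seen").withIntType();

  public static final StatefulFunctionSpec SPEC =
      StatefulFunctionSpec.builder(TYPENAME)
          .withValueSpec(SEEN)
          .withSupplier(ComputeFn::new)
          .build();

  @Override
  public CompletableFuture<Void> apply(Context context, Message message) {
    // Durable state scoped to the function's address, with exactly-once semantics.
    int seen = context.storage().get(SEEN).orElse(0) + 1;
    context.storage().set(SEEN, seen);

    // Run the computation on the message here, then send the result onwards,
    // e.g. to another function or to an egress via context.send(...).
    return context.done();
  }
}
```

The function is served behind an HTTP endpoint and registered with the StateFun runtime, so the other team can deploy and update it independently of the Flink job.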
I am new to Flink, and our use case deals with stateless computation.
We read an event, process it, and persist it into a database. But the Flink documentation never seems to speak about stateless processing. Is there any example repository or documentation with stateless examples?
Finally, for this use case, which Flink model fits: a streaming application or an event-driven application?
There's a strong emphasis in the docs on stateful stream processing, because the community is proud of having created a highly performant, fault-tolerant stream processor that provides exactly-once guarantees even when operating stateful pipelines at large scale. But you can certainly make good use of Flink without using state.
However, truly stateless applications are rare. Even in cases where the application doesn't do anything obviously stateful (such as windowing or pattern matching), state is required to deliver exactly-once, fault-tolerant semantics. Flink can ensure that each incoming event is persisted exactly once into the sink(s), but doing so requires that Flink's sources and sinks keep state, and that state must be checkpointed (and recovered after failures). This is all handled transparently, except that you need to enable and configure checkpointing, assuming you care about exactly-once guarantees.
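For example, with the Java DataStream API (Flink 1.x), enabling exactly-once checkpointing only takes a couple of lines; the interval and job name below are just illustrative values:

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingSetup {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Take a checkpoint every 10 seconds (illustrative interval) with exactly-once semantics.
    env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

    // ... define sources, transformations, and sinks here ...

    env.execute("stateless-etl");
  }
}
```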
The Flink docs include a tutorial on Data Pipelines and ETL that includes some examples and an exercise (https://github.com/apache/flink-training/tree/master/ride-cleansing) that are stateless.
Flink has three primary APIs:
DataStream API: this low-level API is very capable, but the other APIs listed here have strong advantages for some use cases. The tutorials in the docs make for a good starting point. See also https://training.ververica.com/.
Flink SQL and the Table API: this is especially well suited for ETL and analytics workloads. https://github.com/ververica/sql-training is a good starting point.
Stateful Functions API: this API offers a different set of abstractions, and a cloud-native, language-agnostic runtime that supports a variety of SDKs. This is a good choice for event-driven applications. https://flink.apache.org/stateful-functions.html and https://github.com/ververica/flink-statefun-workshop are good starting points.
What are the advantages and disadvantages of using Python or Java when developing Apache Flink Stateful Functions?
Is there any performance difference? Which one is more efficient for the same operation?
Can we develop the application completely in Python?
What are the features that one supports and the other does not?
StateFun supports embedded functions and remote functions.
Embedded functions are bundled and deployed within the JVM processes that run Flink. They must therefore be implemented in a JVM language (like Java), and they are the most performant option. The downside is that any change to the function code requires a restart of the Flink cluster.
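As a rough illustration of what an embedded function looks like against the Java embedded SDK (the class name, state name, and body are made up for the example):

```java
import org.apache.flink.statefun.sdk.Context;
import org.apache.flink.statefun.sdk.StatefulFunction;
import org.apache.flink.statefun.sdk.annotations.Persisted;
import org.apache.flink.statefun.sdk.state.PersistedValue;

// Runs inside the Flink JVM; redeploying it means restarting the cluster.
public class SeenCountFunction implements StatefulFunction {

  @Persisted
  private final PersistedValue<Integer> seenCount =
      PersistedValue.of("seen-count", Integer.class);

  @Override
  public void invoke(Context context, Object input) {
    int seen = seenCount.getOrDefault(0) + 1;
    seenCount.set(seen);
    // React to the input here, e.g. send a message to another function or an egress.
  }
}
```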
Remote functions are functions that execute in a separate process and are invoked by the Flink cluster for every incoming message addressed to them. They are therefore expected to be less performant than embedded functions, but they provide great flexibility in:
Choosing an implementation language
Fast scaling up and down
Fast restart in case of a failure.
Rolling upgrades
Can we develop the application completely in Python?
It is possible to develop an application completely in Python; see the Python greeter example.
What are the features that one supports and the other does not?
The following features are currently supported only in the Java SDK:
Richer routing logic from an ingress to a function. Any routing logic that you can describe via code.
A few more state types, such as a table and a buffer.
Exposing existing Flink sources and sinks as ingresses and egresses.
I'm trying to decide whether to build a Logic App or a Web App.
It has to do things I'm quite comfortable doing in C#: receive messages in various formats (a few thousand per day), translate them, make API calls and forward them. None of the endpoints are widely used, so the out-of-the-box connectors won't be a benefit. Some require custom headers, the contents of which are calculated using a hashing algorithm. Some of the work involves converting Json into XML and vice-versa.
From what I've read, one of the key points of difference of Logic Apps is that you don't have to write any code. Since our organisation is quite comfortable with code, that doesn't feel like it will actually be a benefit.
Am I missing something? Are there any compelling reasons why a Logic App would be better than a Web App in this instance?
Using Logic Apps has a few additional benefits over just writing code which include:
Out-of-the-box monitoring. For every execution you can see exactly what happened in each step of the process, with a monitoring view that mirrors your Logic App design view.
Built-in failure handling. Logic Apps will automatically retry calls on failure, and also lets you either customize the retry policy or build your own retry logic with a do-until pattern.
Out of box alerting. You can configure alerts to inform you of failures.
Serverless. You don't worry about sizing or scaling and you pay by consumption.
Faster development. Logic Apps lets you build out the solution faster, especially considering that you don't have to write code for the monitoring views, alerting, and error handling that come out of the box.
Easy to extend. If you are already using a Logic App, access to over 125 connectors for various services makes it easy to add business value or make the workflow smarter, for example by including cognitive services, with very little extra effort.
I've decided to keep away from Logic Apps for these reasons:
It is not supported outside Azure. We aren't tied to any other providers, and to use Logic Apps would break that independence.
I don't know how much of the problem is readily soluble using Logic Apps. (It seems I would be solving all sorts of problems which wouldn't be problems if I were using C#. This article details some issues encountered while developing a simple process using an earlier version of Logic Apps.)
Nobody has come up with an argument more compelling than the reasons I've given above (especially the first one) why we should use it, so it would be a gamble with little to gain and plenty to lose.
You can think of Logic Apps as an orchestrator - something that takes external pieces of functionality, and weaves a workflow together.
It has nothing to do with your requirement of "writing code": your code can live in external functions on any platform (on-prem, AWS, Azure, Zendesk), and all of it can be connected together using Logic Apps.
Regardless of which platform you choose, you will still have cross-cutting concerns such as monitoring, logging, alerting, and deployments, and Logic Apps addresses all of those requirements very robustly.
Good morning everybody,
I have already used Apache Storm to build topologies, and I found that a nice thing about the API it exposes is the ability to "manually" connect the operators in the topology graph.
You can create loops, for example.
I was wondering if there is a best practice to achieve the same "expressivity" in Flink.
Thank you so much!
Cyclic topologies are not supported in Flink; you can perform iterations only through a specific operator. Apart from cycles, you define your graph through the standard API, and it's rather flexible compared to, for example, Spark. Many DataSet and DataStream API operations accept both plain functions and custom implementations of classes like RichMapFunction, RichFlatMapFunction, and so on. This gives a huge degree of flexibility and customizability, together with modularity and reusability. It takes some time to go beyond the standard API and learn how to customize your Flink jobs properly, but it's worth it.
Flink has an "easy mode" that resembles the API of Spark, in which you can do most of what you need. When you want to express something that falls outside the scope and use cases of the standard API, instead of resorting to awkward workarounds as you would in Spark, you can work directly with a layer that sits partially below the standard API. There are many pieces that you can extend and customize and then plug in in place of the provided operators, triggers, sources, sinks, and so on. This is mostly documented feature by feature.
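As a small example of the kind of customization mentioned above, here is a keyed deduplicator built as a RichFlatMapFunction with managed state; the class name and state name are just for illustration:

```java
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Emits an event only when its value differs from the previous value seen for the same key.
public class LastValueDeduplicator extends RichFlatMapFunction<Long, Long> {

  private transient ValueState<Long> lastSeen;

  @Override
  public void open(Configuration parameters) {
    lastSeen = getRuntimeContext().getState(
        new ValueStateDescriptor<>("last-seen", Long.class));
  }

  @Override
  public void flatMap(Long value, Collector<Long> out) throws Exception {
    Long previous = lastSeen.value();
    if (previous == null || !previous.equals(value)) {
      lastSeen.update(value);
      out.collect(value);
    }
  }
}
```

You would plug it in with something like stream.keyBy(x -> x).flatMap(new LastValueDeduplicator()).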
I am slowly getting familiar with Camel; however, I am struggling to understand the level of granularity at which it should be considered. Should Camel be used only for passing messages from one application to another, or is it also appropriate to use Camel to pass messages between components and/or layers within a single application?
For example, I have a requirement to expose a web service that accepts bookings, validates them and writes them to a queue. Would you recommend using Camel in this scenario, or does it really depend on the level of flexibility I want my solution to allow?
Put another way, if I were required to save the bookings to a database, I would never have considered Camel and would instead have just built a traditional app that calls a DAL to save the booking. Of course I could use camel-ibatis to insert the data, but in this context using Camel seems like overkill.
Thank you for any pointers on this.
As you obviously suspect, it's somewhat of a grey line. The more flexibility you need, the more benefit you'll get from using Camel.
Just this past week, I built a prototype of an app that needed to accept an HTTP post, put the data on a queue, and then pull messages from the queue and use them to update a Mongo database.
Initially, I used Camel, and it worked well. Then the requirement for the HTTP POST was removed (it became just consuming messages from a queue and updating the database), and the database update became more complex than was easily supported via a simple string-based Camel Mongo endpoint spec, so I wound up doing away with Camel and rewrote it with just a JMS connector and the Mongo API.
So, as usual, it depends. I would say that if you're just moving data between two endpoints, and there are no content-based decisions or routing, then you probably won't benefit from using Camel. Once you actually want to use one or more of the Enterprise Integration Patterns, then Camel will be a benefit.
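For the booking scenario in the question, the route itself can stay quite small; here is a rough sketch in Camel's Java DSL, where the endpoint URIs and the validator bean are assumptions rather than a prescription:

```java
import org.apache.camel.builder.RouteBuilder;

public class BookingRoute extends RouteBuilder {
  @Override
  public void configure() {
    // Accept bookings over HTTP, validate them, then hand them off to a queue.
    from("jetty:http://0.0.0.0:8080/bookings")
        .to("bean:bookingValidator")        // hypothetical validation bean registered elsewhere
        .to("jms:queue:incomingBookings");
  }
}
```

Whether this is worth it over a plain controller plus a JMS client depends, as above, on how much routing and mediation you expect to grow around it.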
As Camel has lots of components, it looks like it can do everything. But sometimes it can be more straightforward to use the third-party library directly, if you don't have much business logic that could leverage the Enterprise Integration Patterns.
The real benefit of Camel is that you can focus on how to route your messages to meet your business needs without caring about the implementation details of the components.
As suggested, there are no hard rules... here is my take:
use Camel to simplify technology challenges
complex message routing algorithms (EIPs)
technology integrations (components)
use Camel for these types of requirements
highly event based processes (EIPs)
exposing multiple interfaces to biz logic (http, file, jms, etc)
complex runtime management needs (lifecycle, policies)
that said...
don't use for just one simple use case
don't add unnecessary complexity
have a quorum of use cases/reasons/justification to use it
Along these lines, I presented the following at ApacheCon, focused on why/how to use Camel:
http://www.consulting-notes.com/2014/06/apachecon-2014-presentation-apache.html