Flink for Stateless processing - apache-flink

I am new to Flink and our use case deals with stateless computation: read an event, process it, and persist it into a database. But the Flink documentation never seems to speak about stateless processing. Is there any example repository or documentation with stateless examples?
Finally, for this use case, which Flink model works: a streaming application or an event-driven application?

There's a strong emphasis in the docs on stateful stream processing, because the community is proud of having created a highly performant, fault-tolerant stream processor that provides exactly-once guarantees even when operating stateful pipelines at large scale. But you can certainly make good use of Flink without using state.
However, truly stateless applications are rare. Even in cases where the application doesn't do anything obviously stateful (such as windowing or pattern matching), state is required to deliver exactly-once, fault-tolerant semantics. Flink can ensure that each incoming event is persisted exactly once into the sink(s), but doing so requires that Flink's sources and sinks keep state, and that state must be checkpointed (and recovered after failures). This is all handled transparently, except that you need to enable and configure checkpointing, assuming you care about exactly-once guarantees.
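For example, enabling checkpointing is a couple of calls on the execution environment. A minimal sketch (the interval and timeout values here are arbitrary):

```java
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EnableCheckpointing {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 10 seconds; exactly-once is the default mode,
        // but it is spelled out here for clarity.
        env.enableCheckpointing(10_000L, CheckpointingMode.EXACTLY_ONCE);

        // Abort a checkpoint if it hasn't completed within a minute.
        env.getCheckpointConfig().setCheckpointTimeout(60_000L);

        // ... define sources, transformations, and sinks, then:
        // env.execute("my-job");
    }
}
```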
The Flink docs include a tutorial on Data Pipelines and ETL that includes some examples and an exercise (https://github.com/apache/flink-training/tree/master/ride-cleansing) that are stateless.
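To make "stateless" concrete, a read-transform-persist pipeline needs no user-defined state at all. Here is a minimal sketch; the in-memory source and print sink are stand-ins for your real source and database sink:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StatelessPipeline {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(10_000L); // only needed for the exactly-once guarantees discussed above

        env.fromElements("a", "b", "c")      // stand-in for a real source, e.g. Kafka
           .map(String::toUpperCase)         // stateless transformation
           .filter(s -> !s.isEmpty())        // another stateless step
           .print();                         // stand-in for a real sink, e.g. JDBC

        env.execute("stateless-pipeline");
    }
}
```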
Flink has three primary APIs:
DataStream API: this low-level API is very capable, but the other APIs listed here have strong advantages for some use cases. The tutorials in the docs make for a good starting point. See also https://training.ververica.com/.
Flink SQL and the Table API: this is especially well suited for ETL and analytics workloads. https://github.com/ververica/sql-training is a good starting point.
Stateful Functions API: this API offers a different set of abstractions, and a cloud-native, language-agnostic runtime that supports a variety of SDKs. This is a good choice for event-driven applications. https://flink.apache.org/stateful-functions.html and https://github.com/ververica/flink-statefun-workshop are good starting points.

Related

Unified connectors for DataStream/Table APIs

I am writing a simple connector (source/sink) for Flink 1.14.4 which mostly wraps the official Kafka Connector and automatically sets up custom serializers/deserializers. I'm a bit confused about the current state of the new source/sink interfaces introduced in FLIP-27 and FLIP-143. Is it currently possible to write truly unified connectors, that is, connectors that work across different APIs, such as DataStream and Table? Looking at the code of the current Kafka Connector, I see it comes with both legacy and new flavours, but AFAIK the connector for the Table API still relies on the legacy API only. Also, reading the official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sourcessinks/
it seems that the new interfaces still cannot be used for the Table API. To make it worse, I find it very confusing that only sources are mentioned in the DataStream section, which already describes the new approach:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/sources/
but nothing is said regarding sinks. Overall, I think this leaves users without a clear picture of how to approach the creation of custom connectors as of today. In particular, I would expect an equivalent section for the DataStream API, i.e., one covering the creation of user-defined sources & sinks, like the one given above for the Table API.
The unified Source and Sink APIs (FLIP-27 and FLIP-143) were introduced to provide one interface for connectors so they can be used for both bounded (batch) and unbounded (streaming) data.
Both interfaces allow for building a source/sink that you can use in either DataStream or Table/SQL API. That's currently already the case for FileSystem, Kafka and Pulsar (as of Flink 1.15 which will be released shortly).
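As a sketch of what this looks like from the DataStream side, the FLIP-27-style KafkaSource is wired in roughly like this (broker address, topic, and group id below are placeholders):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class UnifiedKafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")              // placeholder
                .setTopics("input-topic")                           // placeholder
                .setGroupId("my-group")                             // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source");

        stream.print();
        env.execute("kafka-source-example");
    }
}
```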
You're absolutely right that the current documentation doesn't make this clear. At the moment, the Flink community is working on externalizing the connectors (moving each of them from the Flink repository to their own individual repository) and overhauling the documentation and guides on how to write a connector.

Flink Stateful Functions with an existing Flink application

I'd appreciate some advice around the use of Stateful functions.
We are currently using Flink: we consume from a number of Kafka streams, aggregate, run a computation and then output to a new stream.
The problem is that the computation element is provided by a different team whose language of choice is Python. We would like to provide them with the ability to develop and update their component independently of the streaming elements.
Initially, we just ported their code to Java.
Stateful Functions seems to offer an alternative here whereby we would keep some of our functionality as is and host the model as a stateful function in Python. I'm wondering, however, if there is any advantage to this over just hosting the computation module in its own pipeline and using an AsyncFunction in Flink to interact with it.
If we were to move to Stateful Functions, I can't help feeling that we are adding complexity without using its power, but I may be missing some important considerations around speed and resilience.
I want to begin by noting that Stateful Functions does have a DataStream interop module. This means you can use StateFun to handle the Python functions of your pipeline without rewriting the entire Flink Job.
That said, what advantages does Stateful Functions bring over using AsyncIO and doing it yourself?
Automated handling of connections, batching, back-pressuring, and retries. Even if you are using a single Python function and no state, Stateful Functions has been heavily optimized to be as fast and efficient as possible, with continual improvements from the community that you get to leverage for free. StateFun has more sophisticated back-pressure and retry mechanisms in place than AsyncIO, which you would otherwise need to redevelop on your own.
Higher-level APIs. StateFun's Python SDK (and the others) provide well-defined, typed APIs that are easy to develop against. The other team you are working with will only need a few lines of glue code to integrate with StateFun, while the project handles the transport protocols for you.
State! As the name of the project implies, stateful functions are, well, stateful. Python functions can maintain state, and you get Flink's exactly-once guarantees out of the box.
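For comparison, the do-it-yourself AsyncIO route mentioned above would look roughly like the sketch below. Event, Scored, and ModelClient are hypothetical types standing in for your records and an async client for the external Python service; note that batching, retries, and fine-grained back-pressure would be up to you:

```java
import java.util.Collections;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class ModelScoringFunction extends RichAsyncFunction<Event, Scored> {

    private transient ModelClient client; // hypothetical async HTTP/gRPC client

    @Override
    public void open(Configuration parameters) {
        client = new ModelClient("http://model-service:8080"); // placeholder endpoint
    }

    @Override
    public void asyncInvoke(Event input, ResultFuture<Scored> resultFuture) {
        client.score(input) // assumed to return a CompletableFuture<Scored>
              .whenComplete((scored, error) -> {
                  if (error != null) {
                      resultFuture.completeExceptionally(error); // retries are your problem
                  } else {
                      resultFuture.complete(Collections.singleton(scored));
                  }
              });
    }
}

// Wiring it into the pipeline:
// DataStream<Scored> scored =
//     AsyncDataStream.unorderedWait(events, new ModelScoringFunction(), 1, TimeUnit.SECONDS, 100);
```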

Apache Flink Stateful Functions Python vs Java performance

What are the advantages and disadvantages of using Python or Java when developing an Apache Flink stateful function?
Is there any performance difference? Which one is more efficient for the same operation?
Can we develop the application completely in Python?
What are the features that one supports and the other does not?
StateFun supports embedded functions and remote functions.
Embedded functions are bundled and deployed within the JVM processes that run Flink. Therefore they must be implemented in a JVM language (like Java), and they are the most performant. The downside is that any change to the function code requires a restart of the Flink cluster.
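For a sense of what an embedded function looks like, here is a minimal sketch against the embedded Java SDK; the function type, state name, and the counting logic are made up for illustration:

```java
import org.apache.flink.statefun.sdk.Context;
import org.apache.flink.statefun.sdk.FunctionType;
import org.apache.flink.statefun.sdk.StatefulFunction;
import org.apache.flink.statefun.sdk.annotations.Persisted;
import org.apache.flink.statefun.sdk.state.PersistedValue;

public class CountingFunction implements StatefulFunction {

    public static final FunctionType TYPE = new FunctionType("example", "counter"); // made up

    @Persisted
    private final PersistedValue<Integer> seen = PersistedValue.of("seen", Integer.class);

    @Override
    public void invoke(Context context, Object input) {
        int count = seen.getOrDefault(0) + 1;
        seen.set(count);
        // context.send(...) to forward a message to another function or an egress
    }
}
```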
Remote functions execute in a separate process and are invoked by the Flink cluster for every incoming message addressed to them. They are therefore expected to be less performant than embedded functions, but they provide great flexibility in:
Choosing an implementation language
Fast scaling up and down
Fast restart in case of a failure.
Rolling upgrades
Can we develop the application completely in Python?
It is possible to develop an application completely in Python; see the Python greeter example.
What are the features that one supports and the other does not?
The following features are currently supported only in the Java SDK:
Richer routing logic from an ingress to a function. Any routing logic that you can describe via code.
A few more state types, like a table and a buffer.
Exposing existing Flink sources and sinks as ingresses and egresses.

In Project Reactor or Akka Streams, what is the conceptual difference between sink and subscriber?

The concepts of a sink and a subscriber seem similar to me. Also, I don't see the concept of a sink being explicitly defined in the Reactive Streams spec.
I see that Oleh Dokuka, from Project Reactor (missing disclaimer there), posted an answer already; however, many of its assumptions about Akka Streams and Reactive Streams are incorrect, so allow me to clarify below.
Disclaimer: I have participated in Reactive Streams since its early days, and authored most of its Technology Compatibility Kit. I also maintain Akka and Akka Streams.
Also note that Reactive Streams has been included in Java 9, where it is known as java.util.concurrent.Flow.*, so all the comments below regarding RS apply in exactly the same way to j.u.c.Flow.Subscriber and the other types.
The answer
Reactive Streams is a Service Provider Interface (SPI) Specification
Reactive Streams, and specifically the Publisher / Subscriber / Subscription / Processor types, are a Service Provider Interface. This is confirmed even in the earliest discussions about the specification dating back to 2014.
In the earliest days of the specification, the spec even attempted to hide the Publisher, Subscriber and other types behind a user-facing API. Sadly, the types would leak out of the API then under consideration regardless, so the API(!) was removed and the SPI types are all that remained.
Nowadays you see some implementations of Reactive Streams claim that directly extending these types is somehow a benefit. This is not correct: that was not, and is not, the goal of the Reactive Streams interfaces. It is rather a misunderstanding of what these types are -- strictly the inter-op interfaces that Reactive Streams libraries agree to understand and "speak" (a protocol).
For reference, RxJava 2.0 as well as Reactor directly extend these types, while Akka Streams remains true to the RS design and principles by hiding them from the application-developer-facing API -- which is why Sink does not extend Subscriber. This has nothing to do with "native support", as I've seen people describe the direct IS-A relationship (rather, claiming that an inter-op library's types are your "native" types is a misunderstanding of the concept).
Sinks and Subscribers, Sources and Publishers
The concepts of sink and subscriber seem similar to me.
Correct, they are, on purpose and by design, similar.
A Sink is a lifted representation of something that effectively yields a Subscriber. To simplify, you can think of it as a "Subscriber factory" (more specifically, the Sink is the "blueprint", and the Materializer takes the Sink's blueprint and creates the appropriate RS stages, including Publishers for Sources and Subscribers for Sinks).
So when you write Sink.ignore, it actually is a factory that will end up creating a Subscriber that does all the requesting and ignoring, according to Reactive Streams. The same goes for all other methods declared on Sink.
The same applies to Source, which relates 1:1 to a Reactive Streams Publisher. So a Source.single(1) is something that will internally materialize into a Publisher that does its job: emits that 1 element if it's allowed to do so by its downstream.
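A small Java DSL example of that factory/blueprint relationship (Akka 2.6 style, where the ActorSystem acts as the materializer):

```java
import akka.actor.ActorSystem;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;

public class BlueprintExample {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");

        // Source.single(1) and Sink.ignore() are only blueprints; nothing runs yet.
        // Running (materializing) the graph is what creates the underlying
        // Reactive Streams Publisher/Subscriber stages behind the scenes.
        Source.single(1)
              .runWith(Sink.ignore(), system) // yields a CompletionStage<Done>
              .thenRun(system::terminate);
    }
}
```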
A.K.A. why is there no Sink in Reactive Streams?
As mentioned above, Akka's Sink does not directly extend a Subscriber. It is however basically a factory for them.
You may ask: "Does the user never see these Publisher/Subscriber types at all during normal usage?" And the answer is: indeed they don't, and this is a feature as well as a design goal (in accordance with what Reactive Streams is). If the underlying Publisher and Subscriber instances were exposed to users all the time directly, one might call them incorrectly, causing bugs and confusion. If these types are never exposed unless explicitly asked for, there is less chance of accidental mistakes!
Some have misunderstood that design and claimed that there is no "native" support for it in Akka Streams (which is not true). Let's look at what being detached from Subscriber in the API gains us:
Also, I don't see the concept of sink being explicitly defined in the reactive streams spec.
Indeed, Sinks are not part of Reactive Streams, and that's absolutely fine.
Benefits from avoiding the "Sink IS-A Subscriber"
Sinks are part of Akka Streams, and their purpose is to provide the fluent DSL, as well as be factories for Subscribers. In other words, if Subscriber is the LEGO blocks, Sink is what builds them (and the Akka Stream Materializer is what puts the various LEGO blocks together in order to "run" them).
In fact, it is beneficial to users that Sink does not carry any definitive IS-A relationship with Subscriber (sic!) like other libraries' types do:
Since org.reactivestreams.Subscriber has now been included in Java 9 and has become part of Java itself, libraries should migrate to java.util.concurrent.Flow.Subscriber instead of org.reactivestreams.Subscriber. Libraries which chose to expose and directly extend the Reactive Streams types will now have a tougher time adapting to the JDK9 types -- all their classes that extend Subscriber and friends will need to be copied or changed to extend the exact same interface, but from a different package. In Akka we simply expose the new type when asked to -- and have supported the JDK9 types from the day JDK9 was released.
With Reactive Streams being an SPI -- a Service Provider Interface -- it is intended to be shared by libraries so that they can "talk the same types and protocol". All communication that Akka Streams and other Reactive Streams libraries do adheres to those rules, and if you want to connect some other library to Akka Streams, you'd do just that -- give Akka Streams the inter-op type, which is the Subscriber, Processor, or Publisher; not the Sink, since that's Akka's own DSL (domain specific language), which adds convenience and other niceties on top, hiding (on purpose!) the Subscriber type.
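Concretely, inter-op looks like this in the Java DSL: you ask Akka Streams for the Reactive Streams types only at the boundary. The publisher below is produced locally just to keep the example self-contained; in practice it would come from Reactor, RxJava, or any other RS library:

```java
import akka.actor.ActorSystem;
import akka.stream.javadsl.AsPublisher;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import org.reactivestreams.Publisher;

public class InteropExample {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("interop");

        // Ask Akka Streams for the inter-op type explicitly: materialize a Publisher.
        Publisher<Integer> publisher =
                Source.range(1, 10).runWith(Sink.asPublisher(AsPublisher.WITHOUT_FANOUT), system);

        // Hand any Reactive Streams Publisher back to Akka Streams at the boundary.
        Source.fromPublisher(publisher)
              .map(i -> i * 2)
              .runWith(Sink.foreach(System.out::println), system)
              .thenRun(system::terminate);
    }
}
```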
Another reason Akka (and, to be honest, other RS implementations were encouraged to do so as well, but chose not to) hides these types is that they are easy to do the wrong thing with. If you pass out a Subscriber, anyone could call things on it, and even unknowingly break rules and guarantees that the Reactive Streams specification requires from anyone interacting with the type.
In order to avoid mistakes from happening, the Reactive Streams types in Akka Streams are "hidden" and only exposed when explicitly asked for—minimizing the risk of people making mistakes by accidentally calling methods on “raw” Reactive Streams types without following their protocol.

How does Flink handle serialization of managed state?

In what format does Flink persist the managed state of an operator, either for checkpointing or for communication between logical operators (i.e. along edges of the job graph)?
The documentation reads
Standard types such as int, long, String etc. are handled by serializers we ship with Flink. For all other types, we fall back to Kryo.
What are those serializers that ship with Flink?
Background: I am considering switching from JSON to Avro, both for ingesting data into my sources and for emitting data to my sinks. However, the auto-generated POJO classes created by Avro are rather noisy. So within the job graph (for communication between Flink operators) I am contemplating whether there is any performance benefit to using a binary serialization format like Avro. It may be that there is no material performance impact (since Flink potentially uses an optimized format as well) and that it has more to do with type compatibility. But I just wanted to get more information on it.
Flink uses its own, built-in serialization framework for basic types, POJOs, and case classes, and it is designed to be efficient. Avro does have advantages in the area of schema evolution, which is relevant when considering Flink's savepoints. On that topic, see this message on the user mailing list.
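If you want to know whether your types are covered by Flink's own serializers or will silently fall back to Kryo, one quick check is sketched below; MyEvent is a placeholder for one of your own classes:

```java
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SerializationCheck {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Fail with an exception if any type would fall back to Kryo,
        // instead of silently paying the generic-serialization cost at runtime.
        env.getConfig().disableGenericTypes();

        // Prints a POJO type info if Flink's POJO serializer applies,
        // or a generic type info if the type would be handled by Kryo.
        System.out.println(TypeInformation.of(MyEvent.class)); // MyEvent is a placeholder
    }
}
```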
