How does Flink handle serialization of managed state? - apache-flink

What format does Flink use to persist the managed state of an operator (either for checkpointing or for communication between logical operators, i.e. along the edges of the job graph)?
The documentation reads
Standard types such as int, long, String etc. are handled by
serializers we ship with Flink. For all other types, we fall back to
Kryo.
What are those serializers that ship with Flink?
Background: I am considering switching from JSON to Avro, both for ingesting data into my Sources and for emitting data from my Sinks. However, the auto-generated POJO classes created by Avro are rather noisy. So within the job graph (for communication between Flink operators) I am contemplating whether there is any performance benefit to using a binary serialization format like Avro. It may be that there is no material performance impact (since Flink potentially uses an optimized format as well) and that the choice has more to do with type compatibility. But I just wanted to get more information on it.

Flink uses its own, built-in serialization framework for basic types, POJOs, and case classes, and it is designed to be efficient. Avro does have advantages in the area of schema evolution, which is relevant when considering Flink's savepoints. On that topic, see this message on the user mailing list.
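If you want to see which serializer Flink's built-in framework picks for one of your types, a small sketch like the following can help (MyEvent is just an illustrative placeholder type; the exact serializer class names can vary between Flink versions):

import org.apache.flink.api.common.ExecutionConfig;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeutils.TypeSerializer;

public class SerializerCheck {

    // A trivial example type; with a public no-arg constructor and public fields,
    // Flink's type extractor should treat it as a POJO.
    public static class MyEvent {
        public long id;
        public String payload;
        public MyEvent() {}
    }

    public static void main(String[] args) {
        TypeInformation<MyEvent> typeInfo = TypeInformation.of(MyEvent.class);
        TypeSerializer<MyEvent> serializer = typeInfo.createSerializer(new ExecutionConfig());

        // Prints a PojoSerializer for recognized POJOs,
        // or a Kryo-based serializer if Flink has to fall back to Kryo.
        System.out.println(serializer.getClass().getName());
    }
}

During development you can also call env.getConfig().disableGenericTypes() to make a job fail fast whenever a type would fall back to Kryo.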

Related

Why not write data in Hudi or Iceberg format in flink-table-store?

Recently I had a chance to get to know the flink-table-store project. I was attracted by the idea behind it at first glance.
After reading the docs, a question has been on my mind for a while. It's about the design of the file storage.
It looks like it could be implemented on top of other popular open-source libraries rather than by creating a totally new (LSM-tree-based) component. Hudi or Iceberg look like good choices, since they both support saving and querying change logs.
If it were done that way, there would be no need to create a component for each related computation engine (Spark, Hive or Trino), since they are already supported by Hudi and Iceberg. That looks like a better solution to me than reinventing the wheel.
So, here are my questions: Is there any issue writing data as Hudi or Iceberg? Why not choose them in the first design decision?
Looking for a design explanation.
Flink Table Store is a new project created to natively support update/delete operations on DFS tables using data snapshots.
These features are already available in Apache Hudi (the first open lakehouse format), Delta Lake (the lakehouse format developed and maintained by Databricks), and Apache Iceberg, all of which are evolving quickly.
Tables created with these tools can be queried from different tools/engines (Spark, Flink, Trino, Athena, Spectrum, Dremio, ...), but supporting all of these tools requires design choices that can affect performance. Flink Table Store, on the other hand, is created and optimized for Flink, so it gives you the best performance with Apache Flink compared with the other three projects.
Is there any issue writing data as Hudi or Iceberg?
Not at all, a lot of companies use Hudi and Iceberg with Spark, Flink and Trino in production, and they have no issues.
Why not choose them in the first design decision?
If you want to create tables readable by other tools, you should avoid using Flink Table Store and choose between the other options. But the main idea of Flink Table Store was to create internal tables used to transform your streaming data, similar to KTables in Kafka Streams: you write your streaming data to Flink Table Store tables, transform it over multiple stages, and at the end write the result to a Hudi or Iceberg table so it can be queried by the different tools.

Unified connectors for DataStream/Table APIs

I am writing a simple connector (source/sink) for Flink 1.14.4 which mostly wraps the official Kafka connector and automatically sets up custom serializers/deserializers. I'm a bit confused about the current state of the new source/sink interfaces introduced in FLIP-27 and FLIP-143. Is it currently possible to write truly unified connectors (that is, connectors that work across different APIs, such as DataStream/Table)? Looking at the code of the current Kafka connector, I see it comes with both legacy and new flavours, but AFAIK the connector for the Table API still relies on the legacy API only. Also, reading the official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sourcessinks/
It seems that the new interfaces still cannot be used for the Table API. To make it worse, I find it confusing that only sources are mentioned in the DataStream section, which already describes the new approach:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/sources/
but nothing is said regarding sinks. Overall, I think this leaves the user unsure how to approach the creation of custom connectors today. In particular, I would expect an equivalent section for the DataStream API, i.e., one covering the creation of user-defined sources & sinks, like the one given above for the Table API.
The unified Source and Sink APIs (FLIP-27 and FLIP-143) were created to provide a single set of interfaces for connectors so they can be used for both bounded (batch) and unbounded (streaming) data.
Both interfaces allow for building a source/sink that you can use in either the DataStream or Table/SQL API. That's currently already the case for FileSystem, Kafka and Pulsar (as of Flink 1.15, which will be released shortly).
You're absolutely right that the current documentation doesn't make this clear. At the moment, the Flink community is working on externalizing the connectors (moving each of them from the Flink repository to their own individual repository) and overhauling the documentation and guides on how to write a connector.
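For reference, here is a minimal sketch of the FLIP-27-style KafkaSource used from the DataStream API (Flink 1.14/1.15; bootstrap servers, topic and group id are placeholders):

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class Flip27KafkaExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The FLIP-27 KafkaSource replaces the legacy FlinkKafkaConsumer.
        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")      // placeholder
                .setTopics("input-topic")                   // placeholder
                .setGroupId("my-group")                     // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // fromSource() is the entry point for the new unified Source API
        // (as opposed to the legacy addSource()).
        DataStream<String> stream =
                env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kafka Source");

        stream.print();
        env.execute("FLIP-27 Kafka source example");
    }
}

On the Table/SQL side the same connector is reached through the usual CREATE TABLE ... WITH ('connector' = 'kafka', ...) DDL.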

Using Python Processors in Java Flink Application

I have a use case where I want to implement an AWS Kinesis Data Application with Flink in Java. It will listen to multiple Kinesis streams via the Data Streams API. However, the analysis of those streams will be done in Python (since our data scientists prefer Python).
From this answer, there appears to be support for calling Python UDFs from Java. However, I want to be able to convert an incoming stream to a table, via
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
Table sessionsTable = tableEnv.fromDataStream(inputStream);
...and then have a Python processor that is invoked to process that stream.
I really have 3 questions here:
Is this a supported use case?
If so, is there documentation that describes how to do so?
If so, will this add significant overhead to the application?
The starting point in the Flink documentation for learning about using Python with Tables and Datastreams is at https://ci.apache.org/projects/flink/flink-docs-stable/docs/dev/python/overview/.
The Python APIs only provide a subset of what's available from Java; you'll have to look and see if what you need is included.
Not sure about performance, but you can, for example, convert back and forth between Flink Tables and Pandas dataframes.
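For completeness, here is a rough sketch (based on the Flink 1.14 docs for Python dependency management) of how a Python UDF can be registered and used from the Java Table API. The udfs.py module and the score function are hypothetical, and depending on your setup additional python.* options (such as the Python executable) may be required:

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

public class PythonUdfSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Ship the (hypothetical) udfs.py file with the job so the Python workers can import it.
        tableEnv.getConfig().getConfiguration().setString("python.files", "/path/to/udfs.py");

        // Register a Python function (here a made-up `score`) so it can be called from Java SQL.
        tableEnv.executeSql(
            "CREATE TEMPORARY SYSTEM FUNCTION score AS 'udfs.score' LANGUAGE PYTHON");

        // From here on, the stream from the question could be converted and processed, e.g.:
        // Table sessions = tableEnv.fromDataStream(inputStream);
        // tableEnv.createTemporaryView("sessions", sessions);
        // Table scored = tableEnv.sqlQuery("SELECT sessionId, score(payload) AS result FROM sessions");
    }
}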

Flink for Stateless processing

I am new to Flink and our use case deals with stateless computation: read an event, process it, and persist it into a database. But the Flink documentation never speaks about stateless processing. Is there any example repository or documentation with stateless examples?
Finally, for this use case, which Flink model works: a streaming application or an event-driven application?
There's a strong emphasis in the docs on stateful stream processing, because the community is proud of having created a highly performant, fault-tolerant stream processor that provides exactly-once guarantees even when operating stateful pipelines at large scale. But you can certainly make good use of Flink without using state.
However, truly stateless applications are rare. Even in cases where the application doesn't do anything obviously stateful (such as windowing or pattern matching), state is required to deliver exactly-once, fault-tolerant semantics. Flink can ensure that each incoming event is persisted exactly once into the sink(s), but doing so requires that Flink's sources and sinks keep state, and that state must be checkpointed (and recovered after failures). This is all handled transparently, except that you need to enable and configure checkpointing, assuming you care about exactly-once guarantees.
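As a rough illustration of that setup, here is a minimal sketch of a stateless read-transform-persist job with checkpointing enabled. The source, SQL statement, JDBC URL and driver are placeholders, and the plain JdbcSink shown here gives at-least-once delivery into the database:

import org.apache.flink.connector.jdbc.JdbcConnectionOptions;
import org.apache.flink.connector.jdbc.JdbcExecutionOptions;
import org.apache.flink.connector.jdbc.JdbcSink;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class StatelessEtlJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpointing is what makes the sources/sinks fault tolerant,
        // even though the user code below keeps no state of its own.
        env.enableCheckpointing(10_000, CheckpointingMode.EXACTLY_ONCE);

        // Placeholder source; in practice this would be Kafka, Kinesis, etc.
        DataStream<String> events = env.fromElements("a", "b", "c");

        // Purely stateless transformations.
        DataStream<String> cleaned = events
                .filter(e -> !e.isEmpty())
                .map(String::toUpperCase);

        // Persist into a database with the JDBC sink (at-least-once).
        cleaned.addSink(JdbcSink.sink(
                "INSERT INTO events (payload) VALUES (?)",
                (statement, payload) -> statement.setString(1, payload),
                JdbcExecutionOptions.builder().withBatchSize(100).build(),
                new JdbcConnectionOptions.JdbcConnectionOptionsBuilder()
                        .withUrl("jdbc:postgresql://localhost:5432/mydb")
                        .withDriverName("org.postgresql.Driver")
                        .build()));

        env.execute("Stateless ETL");
    }
}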
The Flink docs include a tutorial on Data Pipelines and ETL that includes some examples and an exercise (https://github.com/apache/flink-training/tree/master/ride-cleansing) that are stateless.
Flink has three primary APIs:
DataStream API: this low-level API is very capable, but the other APIs listed here have strong advantages for some use cases. The tutorials in the docs make for a good starting point. See also https://training.ververica.com/.
Flink SQL and the Table API: this is especially well suited for ETL and analytics workloads; see the short sketch after this list. https://github.com/ververica/sql-training is a good starting point.
Stateful Functions API: this API offers a different set of abstractions, and a cloud-native, language-agnostic runtime that supports a variety of SDKs. This is a good choice for event-driven applications. https://flink.apache.org/stateful-functions.html and https://github.com/ververica/flink-statefun-workshop are good starting points.
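To illustrate the Flink SQL / Table API item above, here is a tiny self-contained sketch using the built-in datagen connector (the table and field names are made up):

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class SqlSketch {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // The built-in datagen connector produces random rows, handy for trying things out.
        tEnv.executeSql(
            "CREATE TABLE rides (rideId BIGINT, distance DOUBLE) "
                + "WITH ('connector' = 'datagen', 'rows-per-second' = '5')");

        // A stateless ETL-style query; in a real job the result would be
        // INSERTed into a sink table instead of printed.
        tEnv.executeSql("SELECT rideId, distance FROM rides WHERE distance > 1.0").print();
    }
}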

In project reactor or akka streams, what is the conceptual difference between sink and subscriber?

The concepts of sink and subscriber seem similar to me. Also, I don't see the concept of sink being explicitly defined in the reactive streams spec.
I see that Oleh Dokuka, from Project Reactor (missing disclaimer there), posted an answer already; however, many of its assumptions about Akka Streams and Reactive Streams are incorrect, so allow me to clarify below.
Disclaimer: I have participated in Reactive Streams since its early days, and authored most of its Technology Compatibility Kit. I also maintain Akka and Akka Streams.
Also note that Reactive Streams has been included in Java 9, where it is known as java.util.concurrent.Flow.*, so all the comments below regarding RS apply in exactly the same way to j.u.c.Flow.Subscriber and the other types.
The answer
Reactive Streams is a Service Provider Interface (SPI) Specification
Reactive Streams, and specifically the Publisher / Subscriber / Subscription / Processor types, are a Service Provider Interface. This is confirmed even in the earliest discussions about the specification dating back to 2014.
In the earliest days of the specification, there were even attempts to hide the Publisher, Subscriber and other types behind a user-facing API. Sadly, the types would leak through that proposed API regardless, so the API(!) was removed and the SPI types are all that remained.
Nowadays you see some implementations of Reactive Streams claim that directly extending these types is somehow a benefit. This is not correct: that was not, and is not, the goal of the Reactive Streams interfaces. It is rather a misunderstanding of what these types are: strictly the inter-op interfaces that Reactive Streams libraries agree to understand and "speak" (a protocol).
For reference, RxJava 2.0 as well as Reactor do directly extend these types, while Akka Streams remains true to RS's design and principles by hiding them behind its application-developer-facing API, which is why Sink does not extend Subscriber. This has nothing to do with "native support", as I have seen people describe the direct IS-A relationship (rather, claiming that an inter-op library's type is your "native" type is a misunderstanding of the concept).
Sinks and Subscribers, Sources and Publishers
The concepts of sink and subscriber seem similar to me.
Correct, they are, on purpose and by design, similar.
A Sink is a lifted representation of something that effectively yields a Subscriber. To simplify, you can think of it as a "Subscriber factory" (more specifically, the Sink is the "blueprint", and the Materializer takes the Sink's blueprint and creates the appropriate RS stages, including Publishers for Sources and Subscribers for Sinks).
So when you write Sink.ignore, it is actually a factory that will end up creating a Subscriber that does all the requesting and ignoring, in accordance with Reactive Streams. The same goes for all other methods declared on Sink.
The same applies to Source, which relates 1:1 to a Reactive Streams Publisher. So Source.single(1) is something that will internally materialize into a Publisher that does its job: it emits that 1 element if it is allowed to do so by its downstream.
A.K.A. Why there is no Sink in Reactive Streams?
As mentioned above, Akka's Sink does not directly extend a Subscriber. It is however basically a factory for them.
You may ask: "Does the user never see these Publisher/Subscriber types at all during normal usage?" And the answer is: indeed they don't, and this is a feature as well as a design goal (in accordance with what Reactive Streams is). If the underlying Publisher and Subscriber instances were exposed to users all the time, one might call them incorrectly, causing bugs and confusion. If these types are never exposed unless explicitly asked for, there is less chance of accidental mistakes!
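To make this concrete, here is a minimal sketch using the Akka Streams Java DSL (assuming Akka 2.6+, where the run methods accept the ActorSystem directly), showing that the Reactive Streams inter-op types only appear when you explicitly ask for them:

import akka.actor.ActorSystem;
import akka.stream.javadsl.AsPublisher;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import org.reactivestreams.Publisher;

public class RsInteropExample {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");

        // Normal usage: only the Akka DSL types (Source, Sink) are visible.
        Source.single(1).runWith(Sink.foreach(System.out::println), system);

        // Explicitly asking for the Reactive Streams inter-op type:
        // materializing Sink.asPublisher hands you a plain org.reactivestreams.Publisher.
        Publisher<Integer> publisher =
                Source.single(1).runWith(Sink.asPublisher(AsPublisher.WITHOUT_FANOUT), system);

        // Conversely, Sink.fromSubscriber(...) and Source.fromPublisher(...) wrap
        // inter-op types coming from another Reactive Streams library.

        system.terminate();
    }
}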
Some have misunderstood that design and claimed that there is no "native" support for it in Akka Streams (which is not true). Let's look at what being detached from the Subscriber in the API gains us:
Also, I don't see the concept of sink being explicitly defined in the reactive streams spec.
Indeed, Sinks are not part of Reactive Streams, and that's absolutely fine.
Benefits from avoiding the "Sink IS-A Subscriber"
Sinks are part of Akka Streams, and their purpose is to provide the fluent DSL, as well as to be factories for Subscribers. In other words, if Subscribers are the LEGO blocks, a Sink is what builds them (and the Akka Streams Materializer is what puts the various LEGO blocks together in order to "run" them).
In fact, it is beneficial to users that Sink does not have a definitive IS-A relationship with Subscriber (sic!) like the corresponding types in other libraries do:
Since org.reactivestreams.Subscriber has now been included in Java 9, and has become part of Java itself, libraries should migrate to using java.util.concurrent.Flow.Subscriber instead of org.reactivestreams.Subscriber. Libraries that chose to expose and directly extend the Reactive Streams types will have a tougher time adapting to the JDK9 types: all their classes that extend Subscriber and friends will need to be copied or changed to extend the exact same interface, but from a different package. In Akka we simply expose the new type when asked to, and have supported the JDK9 types from the day JDK9 was released.
With Reactive Streams being an SPI (a Service Provider Interface), it is intended to be shared by libraries so that they can "talk" the same types and protocol. All communication that Akka Streams does, and that other Reactive Streams libraries do, adheres to those rules. If you want to connect some other library to Akka Streams, you do just that: give Akka Streams the inter-op type, which is the Subscriber, Processor, or Publisher; not the Sink, since that is Akka's own DSL (domain specific language), which adds convenience and other niceties on top, hiding (on purpose!) the Subscriber type.
Another reason Akka hides these types (and, to be honest, other RS implementations were encouraged to do so as well, but chose not to) is that they are easy to do the wrong thing with. If you pass out a Subscriber, anyone can call things on it, and even unknowingly break the rules and guarantees that the Reactive Streams Specification requires from anyone interacting with the type.
In order to avoid mistakes from happening, the Reactive Streams types in Akka Streams are "hidden" and only exposed when explicitly asked for—minimizing the risk of people making mistakes by accidentally calling methods on “raw” Reactive Streams types without following their protocol.
