Does Apache Flink checkpointing need to be used with stateful functions?

Does the Apache Flink checkpointing feature need to be used with stateful functions?

You don't need to. If your functions do not have state, nothing will be checkpointed. But be aware that certain built-in functions have state of their own, e.g. the FlinkKafkaConsumer.
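Either way, checkpointing is enabled on the execution environment rather than per function, so stateless operators simply contribute nothing to the snapshot. A minimal sketch (the interval is an arbitrary choice):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 10 seconds. Only operators that actually hold state
        // (e.g. a Kafka source's partition offsets) contribute to the snapshot;
        // purely stateless functions add nothing.
        env.enableCheckpointing(10_000);

        // ... build the pipeline and call env.execute(...)
    }
}
```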

Related

How does Flink handle unused keyed state fields when we update our job?

We have a job in which all user features and information are stored in keyed state. Each user feature corresponds to a state descriptor. But we are evolving our features, so some features are abandoned in the next release/version, and we no longer declare the abandoned feature's state descriptor in our code. My question is: how does Flink take care of that abandoned state? Will it no longer restore the abandoned state automatically?
If you are using Flink POJOs or Avro types, then Flink will automatically migrate the types and state for you. Otherwise, it will not, and you could implement a custom serializer instead. Or you could use the State Processor API to clean things up.
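As an illustration of the State Processor API route, the sketch below loads an existing savepoint, drops the state of the operator whose descriptor was removed, and writes out a cleaned savepoint. The paths and uid are hypothetical, and the exact API varies across Flink versions (newer releases replace this DataSet-based `Savepoint` class with `SavepointReader`/`SavepointWriter`):

```java
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.Savepoint;

public class DropAbandonedState {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical savepoint paths and operator uid, for illustration only.
        Savepoint
            .load(env, "hdfs://savepoints/savepoint-old", new MemoryStateBackend())
            .removeOperator("abandoned-feature-uid")   // drop that operator's state entirely
            .write("hdfs://savepoints/savepoint-cleaned");

        env.execute("clean-savepoint");
    }
}
```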

Does my Flink application need watermarks? If not, do I need WatermarkStrategy.noWatermarks?

I'm not sure if my Flink application actually requires Watermarks. When are they necessary?
And if I don't need them, what is the purpose of WatermarkStrategy.noWatermarks()?
A Watermark for time t marks a location in a data stream and asserts that the stream, at that point, is now complete up through time t.
The only purpose watermarks serve is to trigger the firing of event-time-based timers.
Event-time-based timers are directly exposed by the KeyedProcessFunction API, and are also used internally by:
- event-time windows
- the CEP (pattern-matching) library, which uses watermarks to sort the incoming stream(s) if you specify event-time-based processing
- Flink SQL, again only when doing event-time-based processing: e.g., ORDER BY, versioned table joins, windows, MATCH_RECOGNIZE, etc.
Common cases where you don't need watermarks include applications that rely only on processing time, batch jobs, and applications that process data with timestamps but never rely on event-time timers (e.g., simple event-by-event processing).
Flink's new source interface, introduced by FLIP-27, does require a WatermarkStrategy:
env.fromSource(source, watermarkStrategy, sourceName);
In cases where you don't actually need watermarks, you can use WatermarkStrategy.noWatermarks() in this interface.
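For example, a FLIP-27 source wired up with noWatermarks() might look like the following sketch (the broker address and topic are hypothetical):

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class NoWatermarksExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
            .setBootstrapServers("localhost:9092")   // hypothetical broker
            .setTopics("events")                     // hypothetical topic
            .setValueOnlyDeserializer(new SimpleStringSchema())
            .build();

        // No timestamps or watermarks are generated, so event-time timers
        // downstream will never fire -- fine for pure processing-time logic.
        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source")
           .print();

        env.execute();
    }
}
```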

Apache Flink Stateful Functions state scaling

We are able to scale stateless functions as much as we want, but stateful functions are likely to become a bottleneck if we don't scale them along with the stateless ones. Scaling state seems tricky because of how the data is distributed across the nodes. If stateful functions become a bottleneck, can we scale them too?
Stateful functions are scaled in exactly the same way as stateless ones.
In StateFun, the remote functions are in fact stateless processes that receive the state just prior to the invocation for a given key, and after a successful invocation the changes to the state are communicated back to the Flink cluster. Therefore scaling stateless or stateful functions is really the same.
To learn more, I'd recommend watching the keynote introducing StateFun 2.0 or visiting the distributed architecture page.
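To make the per-key state concrete, here is a hedged sketch of a StateFun 2.x function using the embedded Java SDK; the class and state names are illustrative. The PersistedValue is scoped to the function's address (type + id), which is what lets StateFun move state together with its key as the cluster scales:

```java
import org.apache.flink.statefun.sdk.Context;
import org.apache.flink.statefun.sdk.StatefulFunction;
import org.apache.flink.statefun.sdk.annotations.Persisted;
import org.apache.flink.statefun.sdk.state.PersistedValue;

public class GreetCounter implements StatefulFunction {

    // State is tied to this function's address, not to a worker: Flink hands
    // the current value to the function on each invocation and persists any
    // change afterwards, so adding instances doesn't strand state anywhere.
    @Persisted
    private final PersistedValue<Integer> seen =
        PersistedValue.of("seen", Integer.class);

    @Override
    public void invoke(Context context, Object input) {
        int count = seen.getOrDefault(0);
        seen.set(count + 1);
    }
}
```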

Async I/O for DataSet in Apache Flink

What is the equivalent of async I/O on a DataSet in Flink? For DataStream it's basically AsyncDataStream.
Doing a blocking call in the map function?
Are there any best practices?
I'd implement that with a RichMapPartitionFunction, which provides an iterator over the input and a collector to emit results.
Since the DataSet API does not need to integrate with the checkpointing mechanism or respect the order of records and timestamps, the implementation shouldn't be very involved, although MapPartitionFunction does not provide any async-specific tooling.
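A hedged sketch of that approach, with a placeholder lookupAsync standing in for a real async client: fire the requests for the whole partition first, then collect the results, so the calls overlap instead of blocking one by one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.functions.RichMapPartitionFunction;
import org.apache.flink.util.Collector;

public class AsyncLookupPartition extends RichMapPartitionFunction<String, String> {

    @Override
    public void mapPartition(Iterable<String> values, Collector<String> out) throws Exception {
        // Issue all requests up front without waiting on any of them.
        List<CompletableFuture<String>> pending = new ArrayList<>();
        for (String value : values) {
            pending.add(lookupAsync(value));
        }
        // Drain the futures before the task finishes; result order matches
        // input order, though (unlike AsyncDataStream) nothing requires that.
        for (CompletableFuture<String> future : pending) {
            out.collect(future.join());
        }
    }

    // Hypothetical async call -- replace with your real non-blocking client.
    private CompletableFuture<String> lookupAsync(String key) {
        return CompletableFuture.supplyAsync(() -> key.toUpperCase());
    }
}
```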

Querying Data from Apache Flink

I am looking to migrate from a homegrown streaming server to Apache Flink. One thing we have is an Apache Storm-like DRPC interface to run queries against the state held in the processing topology.
So for example: I have a bunch of sensors that I am running a moving average on. I want to run a query against the topology and return all the sensors where that average is above a fixed value.
Is there an equivalent in Flink, or if not, what is the best way to achieve equivalent functionality?
Out of the box, Flink does not currently come with a solution for querying the internal state of operators. You're in luck, however, because there are two options: we built an example of a stateful word count that allows querying its state. It is available here: https://github.com/dataArtisans/query-window-example
For one of the upcoming versions of Flink we are also working on a generic solution to the queryable state use case. This will allow querying the state of any internal operation.
Alternatively, could it suffice in your case to periodically output the values to something like Elasticsearch using a window operation? The results could then simply be queried from Elasticsearch.
Flink is shipping an out-of-the-box solution called Queryable State in the next release.
Here is an example
https://github.com/apache/flink/blob/master/flink-tests/src/test/java/org/apache/flink/test/query/QueryableStateITCase.java
But I suggest reading up on it first, then looking at the example.
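For reference, the client side of Queryable State looks roughly like the sketch below, assuming the state was registered in the job via setQueryable("moving-average"). The host, port, job id, and names are all hypothetical, and note that the Queryable State API was deprecated in later Flink releases:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.flink.api.common.JobID;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.queryablestate.client.QueryableStateClient;

public class MovingAverageQuery {
    public static void main(String[] args) throws Exception {
        // Hypothetical proxy host/port of a TaskManager's queryable-state server.
        QueryableStateClient client = new QueryableStateClient("localhost", 9069);

        // Must match the descriptor used inside the job.
        ValueStateDescriptor<Double> descriptor =
            new ValueStateDescriptor<>("moving-average", Types.DOUBLE);

        CompletableFuture<ValueState<Double>> result = client.getKvState(
            JobID.fromHexString("..."),   // the running job's id (placeholder)
            "moving-average",             // name passed to setQueryable(...)
            "sensor-42",                  // hypothetical key to look up
            Types.STRING,
            descriptor);

        Double average = result.get().value();
        System.out.println("sensor-42 average = " + average);

        client.shutdownAndWait();
    }
}
```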
