Flink pipeline without a data sink with checkpointing on

I am researching building a Flink pipeline without a data sink, i.e. my pipeline ends when it makes a successful API call to a datastore.
In that case, if we don't use a sink operator, how will checkpointing work? Checkpointing is based on the concept of a pre-checkpoint epoch (all events that are persisted in state or emitted into sinks) and a post-checkpoint epoch. Is having a sink required for a Flink pipeline?

Yes, sinks are required as part of Flink's execution model:
DataStream programs in Flink are regular programs that implement
transformations on data streams (e.g., filtering, updating state,
defining windows, aggregating). The data streams are initially created
from various sources (e.g., message queues, socket streams, files).
Results are returned via sinks, which may for example write the data
to files, or to standard output (for example the command line
terminal).
One could argue that the call to your datastore is the actual sink implementation that you could use. You could define your own sink and execute the datastore call there.
I don't know the details of your datastore, but one can assume that you are serializing these events and sending them to the datastore in some way. In that case, you could route all of your elements to the sink operator and buffer each of them in some ListState, which you can continuously drain and send. This way, if your application needs to be upgraded, in-flight records will not be lost; they will be recovered and sent once the job has been restored.
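A minimal sketch of that idea, assuming a hypothetical DatastoreClient that wraps your API call (the client and the batch size are placeholders, not a real library):

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

public class DatastoreSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private final List<String> buffer = new ArrayList<>();      // records not yet sent
    private transient ListState<String> checkpointedBuffer;     // persisted copy of the buffer
    private transient DatastoreClient client;                   // hypothetical client for your datastore

    @Override
    public void open(Configuration parameters) {
        client = new DatastoreClient();                          // open the connection to the datastore
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        buffer.add(value);
        if (buffer.size() >= 100) {                              // flush in batches (size is arbitrary)
            client.send(buffer);
            buffer.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Persist everything that has not been sent yet, so nothing is lost on upgrade/failure.
        checkpointedBuffer.update(new ArrayList<>(buffer));
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedBuffer = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pending-records", String.class));
        if (context.isRestored()) {
            for (String record : checkpointedBuffer.get()) {
                buffer.add(record);                              // reload in-flight records after a restore
            }
        }
    }
}
```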

Related

Need advice on migrating from Flink DataStream Job to Flink Stateful Functions 3.1

I have a working Flink job built on the Flink DataStream API. I want to REWRITE the entire job based on Flink Stateful Functions 3.1.
The functions of my current Flink Job are:
Read message from Kafka
Each message is a slice of a data packet, e.g. (s for slice):
s-0, s-1 are for packet 0
s-4, s-5, s-6 are for packet 1
The job merges slices into several data packets and then sinks the packets to HBase
Window functions are applied to deal with out-of-order slice arrival
My Objectives
Currently I already have a Flink Stateful Functions demo running on my k8s cluster. I want to rewrite my entire job on top of Stateful Functions.
Save data into MinIO instead of HBase
My current plan
I have read the doc and got some ideas. My plans are:
There's no need to deal with Kafka anymore; the Kafka Ingress (https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/apache-kafka/) handles it
Rewrite my job based on the Java SDK. Merging is straightforward, but how about window functions?
Maybe I should use persistent state with TTL to mimic window function behaviors
An egress for MinIO is not in the list of default Flink I/O connectors, therefore I need to write a custom Flink I/O connector for MinIO myself, according to https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/flink-connectors/
I want to avoid the Embedded module because it prevents scaling. Auto-scaling is the key reason why I want to migrate to Flink Stateful Functions
My Questions
I don't feel confident with my plan. Is there anything wrong with my understanding/plan?
Are there any best practices I should refer to?
Update:
windows were used to assemble results
get a slice, inspect its metadata and know it is the last one of the packet
it also knows the packet should contain 10 slices
if there are already 10 slices, merge them
if there are not enough slices yet, wait for some time (e.g., 10 minutes) and then either merge or record packet errors.
I want to get rid of windows during the rewrite, but I don't know how
Background: Use KeyedProcessFunctions Rather than Windows to Assemble Related Events
With the DataStream API, windows are not a good building block for assembling together related events. The problem is that windows begin and end at times that are aligned to the clock, rather than being aligned to the events. So even if two related events are only a few milliseconds apart they might be assigned to different windows.
In general, it's more straightforward to implement this sort of use case with keyed process functions, and use timers as needed to deal with missing or late events.
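A rough sketch of that pattern for the packet-assembly use case described in the update, assuming the stream is keyed by packet id and that Slice and Packet are placeholder types (the slice count of 10 and the 10-minute timeout come from the question):

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

public class PacketAssembler extends KeyedProcessFunction<String, Slice, Packet> {

    private static final int EXPECTED_SLICES = 10;               // from the question
    private static final long TIMEOUT_MS = 10 * 60 * 1000L;      // 10-minute wait for missing slices

    private transient ListState<Slice> slices;

    @Override
    public void open(Configuration parameters) {
        slices = getRuntimeContext().getListState(
                new ListStateDescriptor<>("slices", Slice.class));
    }

    @Override
    public void processElement(Slice slice, Context ctx, Collector<Packet> out) throws Exception {
        List<Slice> buffered = new ArrayList<>();
        slices.get().forEach(buffered::add);

        if (buffered.isEmpty()) {
            // First slice of this packet: register a timeout in case later slices never arrive.
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + TIMEOUT_MS);
        }

        buffered.add(slice);
        if (buffered.size() == EXPECTED_SLICES) {
            out.collect(Packet.merge(buffered));                  // all slices present: merge and emit
            slices.clear();
        } else {
            slices.update(buffered);                              // keep waiting for the rest
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Packet> out) throws Exception {
        // The timeout fired before the packet completed: merge what is there or record a
        // packet error, as appropriate, then clean up. (If the packet already completed,
        // the state is empty and there is nothing to do.)
        slices.clear();
    }
}
```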
Doing this with the Statefun API
You can use the same pattern mentioned above. The function id will play the same role as the key, and you can use a delayed message instead of a timer:
as each slice arrives, add it to the packet that's being assembled
if it is the first slice, send a delayed message that will act as a timeout
when all the slices have arrived, merge them and send the packet
if the delayed message arrives before the packet is complete, do whatever is appropriate (e.g., go ahead and send the partial packet)

Periodically refreshing static data in Apache Flink?

I have an application that receives much of its input from a stream, but some of its data comes from an RDBMS and also from a series of static files.
The stream will continuously emit events, so the Flink job will never end, but how do you periodically refresh the RDBMS data and the static files to capture any updates to those sources?
I am currently using the JDBCInputFormat to read data from the database.
For each of your two sources that might change (RDBMS and files), create a Flink source that uses a broadcast stream to send updates to the Flink operators that are processing the data from Kafka. Broadcast streams send each record to every task/instance of the receiving operator.
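A rough sketch of the broadcast pattern (Event and ReferenceRow are placeholder types; the key accessors and enrichment method are assumptions, not real APIs):

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastRefreshSketch {

    public static DataStream<Event> enrich(DataStream<Event> kafkaEvents,
                                           DataStream<ReferenceRow> referenceUpdates) {

        final MapStateDescriptor<String, ReferenceRow> descriptor =
                new MapStateDescriptor<>("reference-data", String.class, ReferenceRow.class);

        // Every parallel instance of the downstream operator receives every update.
        BroadcastStream<ReferenceRow> broadcast = referenceUpdates.broadcast(descriptor);

        return kafkaEvents
                .connect(broadcast)
                .process(new BroadcastProcessFunction<Event, ReferenceRow, Event>() {

                    @Override
                    public void processElement(Event event, ReadOnlyContext ctx,
                                               Collector<Event> out) throws Exception {
                        // Look up the latest reference data for this event.
                        ReferenceRow row = ctx.getBroadcastState(descriptor).get(event.key());
                        out.collect(event.enrichWith(row));
                    }

                    @Override
                    public void processBroadcastElement(ReferenceRow row, Context ctx,
                                                        Collector<Event> out) throws Exception {
                        // Store/refresh the broadcast copy of the reference data.
                        ctx.getBroadcastState(descriptor).put(row.key(), row);
                    }
                });
    }
}
```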
For each of your sources, files and RDBMS, you can periodically create a snapshot in HDFS or another store (for example, every 6 hours) and calculate the difference between two snapshots. The result will be pushed to Kafka. This solution works when you cannot modify the database or file structure to add extra information (e.g., in the RDBMS, a column named last_update).
Another solution is to add a column named last_update, used to filter the data that has changed between two queries, and push that data to Kafka.

Flink two phase commit for map function to implement exactly-once semantics

Background:
We have a Flink pipeline which consists of multiple sources, multiple sinks and multiple operators along the pipeline which also update databases.
For the sake of the question and to make it simpler let's assume we have a pipeline which looks like so:
Source -> KeyBy -> FlatMap -> Filter -> Sink
This pipeline is supposed to allow us to listen to notifications regarding changes in some data (each notification contains an ID). For each notification, we read data from the DB, run an algorithm, and update the same DB row. After that we also emit the magnitude of the change of the data. Only if the change magnitude is large enough do we emit a notification to another Kafka topic.
The Source subscribes to Kafka topic to listen for the notifications on the changed data IDs.
The KeyBy is keying by the ID to make sure the same ID is not processed by 2 instances of the operators at the same time.
Given the ID, the FlatMap reads the data from the DB, runs an algorithm and updates the same DB row. It emits the change magnitude. It is a FlatMap and not a Map because in some cases we don't want to emit any change magnitude, for example if we had some specific errors.
The Filter filters out magnitudes less than some threshold
The Sink is sending the filtered notifications to another Kafka topic.
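For reference, a hypothetical sketch of that pipeline shape (Notification, ChangeMagnitude, DbUpdatingFlatMap, the threshold, and the Kafka sink wiring are placeholders, not the actual implementation):

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

public class ChangeNotificationPipeline {

    private static final double THRESHOLD = 0.5;                      // example threshold

    public static void build(DataStream<Notification> notifications,
                             SinkFunction<ChangeMagnitude> kafkaSink) {
        notifications
                .keyBy(Notification::getId)                            // one ID is never processed concurrently
                .flatMap(new DbUpdatingFlatMap())                      // read row, run algorithm, update DB, maybe emit
                .filter(change -> change.getMagnitude() >= THRESHOLD)  // drop small changes
                .addSink(kafkaSink);                                   // notify the second Kafka topic
    }
}
```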
Question:
We want to run the pipeline with exactly-once semantics.
From what we see, Flink supports exactly-once semantics for Kafka sources, for Kafka sinks, and for stateful or stateless operators in the middle. We couldn't find anything explaining how to achieve exactly-once with resources you update along the pipeline.
There is a TwoPhaseCommitSinkFunction that allows creating a sink function with exactly-once semantics.
We cannot use it because we want to update the database and after that emit a change notification to Kafka. Doing it in 2 separate sinks will create race conditions where we can receive a magnitude notification before the DB is actually updated.
Are we missing something? Is there a way to implement 2 phase commits in Map/FlatMap operators? Is there another solution?
Thanks!

Flink Kinesis Consumer not storing last successfully processed sequence nos

We are using Flink Kinesis Consumer to consume data from Kinesis stream into our Flink application.
KCL library uses a DynamoDB table to store last successfully processed Kinesis stream sequence nos. so that the next time application starts, it resumes from where it left off.
But it seems that the Flink Kinesis Consumer does not maintain any such sequence numbers in any persistent store. As a result, we need to rely on the ShardIteratorType (TRIM_HORIZON, LATEST, etc.) to decide where to resume Flink application processing upon application restart.
A possible solution could be to rely on Flink's checkpointing mechanism, but that only works when the application resumes after a failure, not when it has been deliberately cancelled and needs to be restarted from the last successfully consumed Kinesis stream sequence number.
Do we need to store these last successfully consumed sequence numbers ourselves?
Best practice with Flink is to use checkpoints and savepoints, as these create consistent snapshots that contain offsets into your message queues (in this case, Kinesis stream sequence numbers) together with all of the state throughout the rest of the job graph that resulted from having consumed the data up to those offsets. This makes it possible to recover or restart without any loss or duplication of data.
Flink's checkpoints are snapshots taken automatically by Flink itself for the purpose of recovery from failures, and are in a format optimized for rapid restoration. Savepoints use the same underlying snapshot mechanism, but are triggered manually, and their format is more concerned about operational flexibility than performance.
Savepoints are what you are looking for. In particular, cancel with savepoint and resume from savepoint are very useful.
Another option is to use retained checkpoints with ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION.
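A minimal sketch of that configuration (the checkpoint interval is just an example value):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(60_000);                           // checkpoint every 60 seconds
        // Keep the latest checkpoint around when the job is cancelled, so the job can later
        // be restarted from its path (flink run -s <checkpoint-path> ...).
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        // ... define sources, transformations, and sinks here ...
        env.execute("retained-checkpoints-example");
    }
}
```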
To add to David's response, I'd like to explain the reasoning behind not storing sequence numbers.
Any kind of offset committing into the source system would limit the checkpointing/savepointing feature to fault tolerance only. That is, only the latest checkpoint/savepoint would be usable for recovery.
However, Flink actually supports jumping back to a previous checkpoint/savepoint. Consider an application upgrade. You take a savepoint beforehand, upgrade, and let the job run for a couple of minutes, during which it creates a few checkpoints. Then you discover a critical bug. You would like to roll back to the savepoint that you took and discard all checkpoints.
Now, if Flink committed the source offsets only to the source systems, we would not be able to replay the data between now and the restored savepoint. So Flink needs to store the offsets in the savepoint itself, as David pointed out. At that point, additionally committing them to the source system does not yield any benefit and is confusing when restoring to a previous savepoint/checkpoint.
Do you see any benefit in storing the offsets additionally?

Does every record in a Flink EventTime application need a timestamp?

I'm building a Flink streaming system that can handle both live data and historical data. All data comes from the same source and is then split into historical and live. The live data gets timestamped and watermarked, while the historical data is received in order. After the live stream is windowed, both streams are unioned and flow into the same processing pipeline.
I cannot find anywhere whether all records in an event-time streaming environment need to be timestamped, or whether Flink can even handle this mix of live and historical data at the same time. Is this a feasible approach, or will it create problems that I am too inexperienced to see? What will the impact be on the order of the data?
We have this setup to allow us to do partial-backfills. Each stream is keyed by an id, and we send in historical data to replace the observed data for one id while not affecting the live processing of other ids.
Generally speaking, the best approach is to have proper event-time timestamps on every event, and to use event-time everywhere. This has the advantage of being able to use the exact same code for both live data and historic data -- which is very valuable when the need arises to re-process historic data in order to fix bugs or upgrade your pipeline. With this in mind, it's typically possible to do backfill by simply running a second copy of the application -- one that's processing historic data rather than live data.
As for using a mix of historic and live data in the same application, and whether you need to have timestamps and watermarks for the historic events -- it depends on the details. For example, if you are going to connect the two streams, the watermarks (or lack of watermarks) on the historic stream will hold back the watermarks on the connected stream. This will matter if you try to use event-time timers (or windows, which depend on timers) on the connected stream.
I don't think you're going to run into problems, but if you do, a couple of ideas:
You could go ahead and assign timestamps on the historic stream, and write a custom periodic watermark generator that always returns Watermark.MAX_WATERMARK (see the sketch after this list). That will effectively disable any effect the watermarks for the historic stream would have on the watermarking when it's connected to the live stream.
Or you could decouple the backfill operations, and do that in another application (by putting some sort of queuing in-between the two jobs, like Kafka or Kinesis).
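A sketch of the first idea, using the legacy periodic watermark assigner interface and a hypothetical Event type with a getTimestamp() accessor:

```java
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.watermark.Watermark;

// Assigns event-time timestamps on the historic stream but always reports MAX_WATERMARK,
// so this stream never holds back the watermark of the stream it is connected/unioned with.
public class HistoricStreamWatermarks implements AssignerWithPeriodicWatermarks<Event> {

    @Override
    public long extractTimestamp(Event element, long previousElementTimestamp) {
        return element.getTimestamp();
    }

    @Override
    public Watermark getCurrentWatermark() {
        return Watermark.MAX_WATERMARK;
    }
}
```

It would be applied with something like historicStream.assignTimestampsAndWatermarks(new HistoricStreamWatermarks()).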
