Apache Flink for real-time processing of finite streams - apache-flink

I am considering how to use Apache Flink for a voting system I’m currently developing. I am a complete newbie to Flink and any help would be appreciated.
The requirements are:
A user with some privilege can start a vote on an arbitrary issue. The user can close the vote any time they like.
As many as hundreds of thousands of people may join a vote.
Counting of votes should start immediately after a vote is started, and the intermediate results should be updated over time, so that they can be shown to the participants.
When the system finishes counting after a vote has been closed, it should notify the participants of the final result.
In my understanding, Flink's stream processing is for real-time processing of infinite streams, while batch processing is for non-real-time processing of finite streams.
How can I apply Flink to my requirement, which is real-time processing of finite streams?

Flink's DataStream API can process events of finite streams without any problems. The DataStream program will simply terminate when the stream reaches its end.
You can simulate this behavior if you use a SocketTextStreamFunction to read text data from a socket. Once you close the socket, the program will terminate. Alternatively, you can also read data from a file which is also some kind of finite stream. However, keep in mind that incomplete windows will not be automatically evaluated. So you have to make sure that you do not lose data in windows if you use them.
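To make this concrete, here is a minimal sketch of such a finite DataStream job for the voting use case above. The host, port, and one-vote-per-line format are illustrative assumptions, not anything prescribed by Flink. The running per-option counts are the intermediate results, and the job terminates once the socket is closed (i.e. once the vote is closed).

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class VoteCountJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Each line on the socket is assumed to be one vote, e.g. "candidate-A".
        // The stream is finite: the job terminates once the socket is closed.
        env.socketTextStream("localhost", 9999)
            .map(new MapFunction<String, Tuple2<String, Long>>() {
                @Override
                public Tuple2<String, Long> map(String vote) {
                    return Tuple2.of(vote, 1L);
                }
            })
            .keyBy(value -> value.f0)  // group by the voted-for option
            .sum(1)                    // running count, updated on every new vote
            .print();                  // in a real system this would go to a sink the UI reads

        env.execute("vote counting");
    }
}
```

Because this sketch uses a running aggregate rather than windows, the caveat about incomplete windows does not apply; if you do use windows, make sure unfinished windows are not silently dropped when the stream ends.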

Related

Flink: implement a job which should start processing events only once its parent job has finished bootstrapping

I have a use case to implement in which historic data processing needs to be done before my streaming job can start processing live events.
My streaming job will become part of an already running system, which means data is already present. This data first needs to be processed before my job starts processing the live streaming events.
So how should I design this? The approaches I can think of are the following:
a) First process the historic data, and only once that is done, start the streaming job.
b) Start the historic data processing and the streaming job simultaneously, but keep buffering the live events until the historic data has been processed.
c) Make one job that has both capabilities: historic data processing plus streaming of live events.
Pros and cons of the above approaches:
Approach (a) is simple but needs manual intervention. Also, the historic data will take time to load, and once that is done and I start the streaming job, what should the Flink consumer property be for reading from the stream: earliest, latest, or timestamp based? The reason to think about this is that the moment the job starts, it will be a fresh consumer with no offset/consumer group id registered with the Kafka broker (in my case it is Oracle Streaming Service).
Approach (b): the buffer needs to be large enough to hold the buffered events' state. Also, the window holding the events needs to buffer up to some timestamp value 'x' the first time only, while afterwards it should be a value 'y' (ideally much smaller than 'x', since the bootstrapping is already done). How can this be made possible?
Approach (c) sounds good, but the historic processing happens only the first time, and most importantly, after the historic processing only the buffered events need to be processed. So the next time, when no historic processing is required, how would the streaming side know that it should just keep processing the events?
Appreciate any help/suggestions to implement & design my use case better.
The HybridSource is designed for this scenario. The current implementation has some limitations, but you may find it meets your needs.
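For illustration, here is a minimal sketch of how a HybridSource could be wired up for this use case (Flink 1.14+; connector class names as in recent releases). The file path, broker address, and topic name are placeholder assumptions, and the Kafka connector stands in for the Kafka-compatible Oracle Streaming Service endpoint mentioned in the question.

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HistoricThenLiveJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded source for the historic data (here: text files in a bucket/directory).
        FileSource<String> historic = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://my-bucket/historic/"))
                .build();

        // Unbounded source for the live events.
        KafkaSource<String> live = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // HybridSource drains the bounded source first, then switches to the unbounded one.
        HybridSource<String> source = HybridSource.<String>builder(historic)
                .addSource(live)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "historic-then-live")
           .print();

        env.execute("bootstrap then stream");
    }
}
```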
You might go for the approach explained in the 2019 Flink Forward talk A Tale of Dual Sources.
From what I remember, their situation was slightly different, in that they had two sources for the same data, a historic store (S3) and a queue with fresh events (Kafka), yet the data content and processing was the same.
They attempted to write a custom source which read from both Kafka and S3 at the same time, but that failed due to some idiosyncrasies of Flink source initialization.
They also did something like approach b, but the buffered data would often become way too large to handle.
They ended up making one job that can read both sources, but only reads S3 at first, then terminates itself by throwing an exception, and upon being restarted by Flink, starts reading Kafka.
With this restarting trick, you can essentially get the advantages of a and c, without having to worry about needing any manual intervention for the switch.

Flink consumes RabbitMQ messages in parallel; how to ensure sequential consumption?

I listen to the MySQL binlog through Flink, then push the changes to a RabbitMQ queue and consume the RabbitMQ messages in Flink. I set the parallelism to 1 to get sequential consumption of messages, but this causes the Flink task to OOM. Is there any way to support a higher parallelism while still consuming sequentially? Please advise, thanks!
According to your description of the problem, it seems like you want to use multiple event sources and process the events sequentially.
But it depends on what order that sequence is in.
You may want to check the concept of time semantics in Flink.
If you can define an event time for each event sent from the multiple parallel sources, you can use event-time semantics together with assigned watermarks.
That way, when Flink receives the events, it knows to process them in event-time order regardless of the time at which it receives them (which is processing time).
Keywords: Event Time (which is the default) and Processing Time.
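As a rough illustration of the watermarking part, the sketch below assigns event time from a hypothetical binlog commit timestamp field; the event class, its field names, the stand-in source, and the 5-second out-of-orderness bound are assumptions for the example, not part of the original answer.

```java
import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class EventTimeOrderingSketch {

    // Assumed event type carrying the binlog commit timestamp; field names are illustrative.
    public static class BinlogEvent {
        public long commitTimestampMillis;
        public String payload;
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Stand-in for the parallel RabbitMQ source; in the real job this would be
        // a RabbitMQ consumer running with parallelism > 1.
        DataStream<BinlogEvent> events = env.fromElements(new BinlogEvent(), new BinlogEvent());

        DataStream<BinlogEvent> withEventTime = events.assignTimestampsAndWatermarks(
                WatermarkStrategy
                        // tolerate events arriving up to 5 seconds out of order across parallel subtasks
                        .<BinlogEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        // tell Flink where the event time lives in each record
                        .withTimestampAssigner((event, recordTs) -> event.commitTimestampMillis));

        // Downstream event-time windows and timers now fire in commit-timestamp order,
        // regardless of which parallel subtask delivered each record.
        withEventTime.print();

        env.execute("event-time ordering sketch");
    }
}
```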

Flink Kinesis Consumer not storing last successfully processed sequence nos

We are using the Flink Kinesis Consumer to consume data from a Kinesis stream into our Flink application.
The KCL library uses a DynamoDB table to store the last successfully processed Kinesis stream sequence numbers, so that the next time the application starts, it resumes from where it left off.
However, it seems that the Flink Kinesis Consumer does not maintain any such sequence numbers in a persistent store. As a result, we need to rely on the ShardIteratorType (TRIM_HORIZON, LATEST, etc.) to decide where to resume Flink application processing upon application restart.
A possible solution could be to rely on Flink's checkpointing mechanism, but that only works when the application resumes after a failure, and not when the application has been deliberately cancelled and needs to be restarted from the last successfully consumed Kinesis stream sequence number.
Do we need to store these last successfully consumed sequence numbers ourselves?
Best practice with Flink is to use checkpoints and savepoints, as these create consistent snapshots that contain offsets into your message queues (in this case, Kinesis stream sequence numbers) together with all of the state throughout the rest of the job graph that resulted from having consumed the data up to those offsets. This makes it possible to recover or restart without any loss or duplication of data.
Flink's checkpoints are snapshots taken automatically by Flink itself for the purpose of recovery from failures, and are in a format optimized for rapid restoration. Savepoints use the same underlying snapshot mechanism, but are triggered manually, and their format is more concerned about operational flexibility than performance.
Savepoints are what you are looking for. In particular, cancel with savepoint and resume from savepoint are very useful.
Another option is to use retained checkpoints with ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION.
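A minimal sketch of enabling retained checkpoints is shown below. The checkpoint interval and the trivial pipeline are arbitrary example values standing in for the real Kinesis job, and depending on the Flink version the configuration method may be named slightly differently (newer releases deprecate enableExternalizedCheckpoints in favour of an equivalent setter).

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RetainedCheckpointsSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Take a checkpoint every 60 seconds.
        env.enableCheckpointing(60_000);

        // Keep the latest completed checkpoint around even when the job is cancelled,
        // so it can be used to restore the job (including Kinesis sequence numbers)
        // when resubmitting it later.
        env.getCheckpointConfig().enableExternalizedCheckpoints(
                CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

        // Trivial pipeline just so the sketch runs end-to-end; replace with the real
        // Kinesis consumer and downstream operators.
        env.fromElements(1, 2, 3).print();

        env.execute("job with retained checkpoints");
    }
}
```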
To add to David's response, I'd like to explain the reasoning behind not storing sequence numbers.
Committing any kind of offsets back into the source system would limit the checkpointing/savepointing feature to fault tolerance only. That is, only the latest checkpoint/savepoint would be usable for recovery.
However, Flink actually supports jumping back to a previous checkpoint/savepoint. Consider an application upgrade: you take a savepoint beforehand, upgrade, and let the job run for a couple of minutes, during which it creates a few checkpoints. Then you discover a critical bug. You would like to roll back to the savepoint you took and discard all the checkpoints.
Now, if Flink committed the source offsets only to the source systems, we would not be able to replay the data between now and the restored savepoint. So Flink needs to store the offsets in the savepoint itself, as David pointed out. At that point, additionally committing them to the source system does not yield any benefit and is confusing when restoring to a previous savepoint/checkpoint.
Do you see any benefit in storing the offsets additionally?

Does every record in a Flink EventTime application need a timestamp?

I'm building a Flink Streaming system that can handle both live data and historical data. All data comes from the same source and is then split into historical and live. The live data gets timestamped and watermarked, while the historical data is received in order. After the live stream is windowed, both streams are unioned and flow into the same processing pipeline.
I cannot find anywhere if all records in an EventTime streaming environment need to be timestamped, or if Flink can even handle this mix of live and historical data at the same time. Is this a feasible approach or will it create problems that I am too inexperienced to see? What will the impact be on the order of the data?
We have this setup to allow us to do partial-backfills. Each stream is keyed by an id, and we send in historical data to replace the observed data for one id while not affecting the live processing of other ids.
Generally speaking, the best approach is to have proper event-time timestamps on every event, and to use event-time everywhere. This has the advantage of being able to use the exact same code for both live data and historic data -- which is very valuable when the need arises to re-process historic data in order to fix bugs or upgrade your pipeline. With this in mind, it's typically possible to do backfill by simply running a second copy of the application -- one that's processing historic data rather than live data.
As for using a mix of historic and live data in the same application, and whether you need to have timestamps and watermarks for the historic events -- it depends on the details. For example, if you are going to connect the two streams, the watermarks (or lack of watermarks) on the historic stream will hold back the watermarks on the connected stream. This will matter if you try to use event-time timers (or windows, which depend on timers) on the connected stream.
I don't think you're going to run into problems, but if you do, a couple of ideas:
You could go ahead and assign timestamps on the historic stream, and write a custom periodic watermark generator that always returns Watermark.MAX_WATERMARK (a sketch appears after these two ideas). That will effectively disable any effect the watermarks for the historic stream would have on the watermarking when it's connected to the live stream.
Or you could decouple the backfill operations, and do that in another application (by putting some sort of queuing in-between the two jobs, like Kafka or Kinesis).
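Here is a rough sketch of the first idea, the always-MAX_WATERMARK generator. The HistoricEvent class and its timestamp field are assumptions standing in for whatever the historic records actually look like; the strategy would be passed to assignTimestampsAndWatermarks on the historic stream before the union/connect.

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class HistoricWatermarks {

    // Assumed event type for the historic stream; replace with the real one.
    public static class HistoricEvent {
        public long timestampMillis;
    }

    // A WatermarkStrategy whose generator always emits MAX_WATERMARK, so the historic
    // stream never holds back the event-time clock of the connected/unioned stream.
    public static WatermarkStrategy<HistoricEvent> maxWatermarkStrategy() {
        return WatermarkStrategy
                .<HistoricEvent>forGenerator(ctx -> new WatermarkGenerator<HistoricEvent>() {
                    @Override
                    public void onEvent(HistoricEvent event, long eventTimestamp, WatermarkOutput output) {
                        // nothing per event; watermarks are emitted periodically
                    }

                    @Override
                    public void onPeriodicEmit(WatermarkOutput output) {
                        output.emitWatermark(Watermark.MAX_WATERMARK);
                    }
                })
                .withTimestampAssigner((event, recordTs) -> event.timestampMillis);
    }
}
```

Usage would be something like historicStream.assignTimestampsAndWatermarks(HistoricWatermarks.maxWatermarkStrategy()).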

Does Flink support suspending a job?

I am just beginning to learn Apache Flink and have run into the following problem:
How can I suspend a Flink job and then resume it?
Does Flink support suspending a job using the command line?
Yes, you certainly can do this with Flink. You want to read about savepoints, which can be triggered from the command line or from the REST API.
Updated
Normally the goal of a stream processor is to do continuous, immediate processing of new elements as they become available. If you want to suspend processing, then I guess this might be with the goal of ignoring the source(s) for a while and dropping the arriving events, or with a desire to conserve computing resources for a time and to later resume without losing any input.
RichCoFlatMapFunction and CoProcessFunction are building blocks you might find useful. You could set up a control stream connected to a socket (for example), and when you want to "suspend" the primary stream, send an event that causes the primary stream to start dropping its input, do a blocking read, or sleep, for example.
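As a rough sketch of that idea (not a complete implementation), the CoProcessFunction below drops primary-stream records while a "suspend" command from the control stream is in effect. The String record type and the 'suspend'/'resume' commands are assumptions for illustration; for fault tolerance the flag would need to live in Flink state rather than a transient field.

```java
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

// Connected as: primaryStream.connect(controlStream).process(new SuspendableProcessor())
public class SuspendableProcessor extends CoProcessFunction<String, String, String> {

    // A transient flag is enough to sketch the idea; use Flink state for exactly-once behaviour.
    private transient boolean suspended = false;

    @Override
    public void processElement1(String event, Context ctx, Collector<String> out) throws Exception {
        if (!suspended) {
            out.collect(event);  // normal processing of the primary stream
        }
        // else: drop the event (or buffer it in state, if no data loss is acceptable)
    }

    @Override
    public void processElement2(String command, Context ctx, Collector<String> out) throws Exception {
        if ("suspend".equals(command)) {
            suspended = true;
        } else if ("resume".equals(command)) {
            suspended = false;
        }
    }
}
```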
Or you might think about adding your own layer of abstraction on top of jobs, and cope with the fact that the job IDs will change. Note that jobs can have names that remain unchanged across savepoints/restarts.

Resources