I am just beginning to learn Apache Flink and have run into the following problem:
How can I suspend a Flink job and then resume it?
Does Flink support suspending a job from the command line?
Yes, you certainly can do this with Flink. You want to read about savepoints, which can be triggered from the command line or from the REST API.
Updated
Normally the goal of a stream processor is to do continuous, immediate processing of new elements as they become available. If you want to suspend processing, the goal is presumably either to ignore the source(s) for a while and drop the arriving events, or to conserve computing resources for a time and later resume without losing any input.
RichCoFlatMapFunction and CoProcessFunction are building blocks you might find useful. You could set up a control stream connected to a socket (for example), and when you want to "suspend" the primary stream, send an event that causes the primary stream to start dropping its input, do a blocking read, or sleep; a sketch of this follows below.
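Here is a minimal sketch of that idea, assuming string-typed primary and control streams; the class name and the "suspend" message are made up for illustration. A flag toggled by the control stream decides whether primary events are forwarded or dropped.

```java
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// Hypothetical example: toggles forwarding of the primary stream based on control messages.
public class SuspendableFlatMap extends RichCoFlatMapFunction<String, String, String> {

    // true while the primary stream should be ignored
    private boolean suspended = false;

    @Override
    public void flatMap1(String event, Collector<String> out) {
        // primary stream: forward only while not suspended, otherwise drop the event
        if (!suspended) {
            out.collect(event);
        }
    }

    @Override
    public void flatMap2(String control, Collector<String> out) {
        // control stream: "suspend" starts dropping, anything else resumes
        suspended = "suspend".equalsIgnoreCase(control.trim());
    }
}
```

You would wire it up with something like `primary.connect(control).flatMap(new SuspendableFlatMap())`. With parallelism greater than one you would typically broadcast the control stream (or use the broadcast state pattern) so every parallel instance sees the suspend/resume events.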
Or you might think about adding your own layer of abstraction on top of jobs, and cope with the fact that the job IDs will change. Note that jobs can have names that remain unchanged across savepoints/restarts.
I have a use case to implement in which historic data processing needs to be done before my streaming job can start processing live events.
My streaming job will become part of already running system, which means data is already present. And this data first needs to be processed before my job starts processing the live streaming events.
So how should I design this? The approaches I can think of are the following:
a) First process the historic data; only once that is done, start the streaming job.
b) Start the historic data processing & streaming job simultaneously. But keep buffering the events until the historic data has been processed.
c) Make one job having both the capabilities of historic data processing + streaming live events.
Pros & cons of the above approaches:
Approach (a) is simple but needs manual intervention. Also, the historic data will take time to load, and once that is done and I start the streaming job, what should the Flink consumer property be for reading from the stream: earliest, latest, or timestamp based? This matters because the moment the job starts, it will be a fresh consumer with no offset/consumer group ID registered with the Kafka broker (in my case it is Oracle Streaming Service).
Approach (b): the buffer needs to be large enough to hold all the pending event state. Also, the window that holds the events needs to buffer up to some timestamp 'x' the first time only, while afterwards it should buffer up to 'y' (ideally much smaller than 'x', since the bootstrapping is already done). How can this be made possible?
Approach (c) sounds good, but the historic processing is only needed the first time, and, most importantly, after the historic processing only the buffered events need to be processed. So on subsequent runs, when no historic processing is required, how would the streaming side know that it should just keep processing the events?
Appreciate any help/suggestions to implement & design my use case better.
The HybridSource is designed for this scenario. The current implementation has some limitations, but you may find it meets your needs.
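As a rough sketch of what that could look like (the paths, brokers, and topic below are placeholders, and the exact text-format class name varies a little between Flink versions), a HybridSource first drains a bounded file source holding the historic data and then switches to Kafka:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.source.hybrid.HybridSource;
import org.apache.flink.connector.file.src.FileSource;
import org.apache.flink.connector.file.src.reader.TextLineInputFormat;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class HybridSourceSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Bounded source with the historic data (placeholder path)
        FileSource<String> historic = FileSource
                .forRecordStreamFormat(new TextLineInputFormat(), new Path("s3://my-bucket/historic/"))
                .build();

        // Unbounded source with the live events (placeholder brokers/topic)
        KafkaSource<String> live = KafkaSource.<String>builder()
                .setBootstrapServers("broker:9092")
                .setTopics("live-events")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        // Read all of the historic data first, then switch over to Kafka
        HybridSource<String> source = HybridSource.builder(historic)
                .addSource(live)
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "historic-then-live")
                .print();

        env.execute("hybrid-source-sketch");
    }
}
```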
You might go for the approach explained in the 2019 Flink Forward talk A Tale of Dual Sources.
From what I remember, their situation was slightly different, in that they had two sources for the same data, a historic store (S3) and a queue with fresh events (Kafka), yet the data content and processing was the same.
They attempted to write a custom source which read from both Kafka and S3 at the same time, but that failed due to some idiosyncrasies of Flink source initialization.
They also did something like approach b, but the buffered data would often become way too large to handle.
They ended up making one job that can read both sources, but only reads S3 at first, then terminates itself by throwing an exception, and upon being restarted by Flink, starts reading Kafka.
With this restarting trick, you essentially get the advantages of (a) and (c), without any manual intervention for the switch.
I listen to the MySQL binlog through Flink and then push the changes to a RabbitMQ queue, and I consume the RabbitMQ messages in Flink. I set the parallelism to 1 for sequential consumption of messages, but this causes the Flink task to OOM. Is there any way to support a higher parallelism while still consuming sequentially? Please advise, thanks!
According to your description of the problem, it seems like you want to use multiple event sources and process them sequentially.
But it depends on what order that sequence should be in.
You may want to check the concept of time semantics in Flink.
If you can define an event time for each event sent from the multiple parallel sources, you can use event-time semantics together with assigned watermarks.
That way, when Flink receives the events, it knows to process them in event-time order regardless of the time at which it receives them (which is processing time).
Keywords are: event time (which is the default) and processing time.
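For instance, assuming each binlog event carries its own timestamp (the `BinlogEvent` type and its `getTimestamp()` accessor below are made up for illustration), you could assign event timestamps and watermarks like this; downstream event-time operators would then order results by event time even though the parallel source instances deliver the events interleaved:

```java
import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;

// Assumed: a stream of binlog change records that each know their own timestamp.
// DataStream<BinlogEvent> events = ...;

WatermarkStrategy<BinlogEvent> strategy = WatermarkStrategy
        // tolerate events that arrive up to 5 seconds out of order
        .<BinlogEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        // take the event time from the record itself
        .withTimestampAssigner((event, recordTimestamp) -> event.getTimestamp());

DataStream<BinlogEvent> withTimestamps = events.assignTimestampsAndWatermarks(strategy);
```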
I have a working Flink job built on the Flink DataStream API. I want to rewrite the entire job based on Flink Stateful Functions 3.1.
The functions of my current Flink Job are:
Read message from Kafka
Each message contains a slice of a data packet, e.g. (s for slice):
s-0, s-1 are for packet 0
s-4, s-5, s-6 are for packet 1
The job merges slices into several data packets and then sinks the packets to HBase.
Window functions are applied to deal with out-of-order slice arrival.
My Objectives
Currently I already have a Flink Stateful Functions demo running on my k8s cluster. I want to rewrite my entire job on top of Stateful Functions.
Save data into MinIO instead of HBase
My current plan
I have read the doc and got some ideas. My plans are:
There's no need to deal with Kafka directly anymore; the Kafka ingress (https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/apache-kafka/) handles it.
Rewrite my job based on the Java SDK. Merging is straightforward, but what about window functions?
Maybe I should use persistent state with TTL to mimic window function behavior.
An egress for MinIO is not in the list of default Flink I/O connectors, so I need to write a custom Flink I/O connector for MinIO myself, according to https://nightlies.apache.org/flink/flink-statefun-docs-release-3.0/docs/io-module/flink-connectors/
I want to avoid the embedded module because it prevents scaling. Auto-scaling is the key reason why I want to migrate to Flink Stateful Functions.
My Questions
I don't feel confident about my plan. Is there anything wrong with my understanding or plan?
Are there any best practices I should refer to?
Update:
windows were used to assemble results
get a slice, inspect its metadata, and learn that it is the last one of the packet
the metadata also says the packet should contain 10 slices
if there are already 10 slices, merge them
if there are not enough slices yet, wait for some time (e.g. 10 minutes) and then either merge what has arrived or record a packet error.
I want to get rid of windows during the rewrite, but I don't know how
Background: Use KeyedProcessFunctions Rather than Windows to Assemble Related Events
With the DataStream API, windows are not a good building block for assembling together related events. The problem is that windows begin and end at times that are aligned to the clock, rather than being aligned to the events. So even if two related events are only a few milliseconds apart they might be assigned to different windows.
In general, it's more straightforward to implement this sort of use case with keyed process functions, and use timers as needed to deal with missing or late events.
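Here is a rough sketch of that pattern with a KeyedProcessFunction, keyed by packet id. The `Slice` and `Packet` types and their methods (`expectedSliceCount()`, `merge()`) are assumptions based on the description above, and the 10-minute timeout mirrors the behavior you described:

```java
import java.time.Duration;
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical sketch: assembles Slice records into a Packet per key (the packet id).
public class PacketAssembler extends KeyedProcessFunction<String, Slice, Packet> {

    private transient ListState<Slice> slices;     // slices collected so far
    private transient ValueState<Integer> count;   // how many slices have arrived

    @Override
    public void open(Configuration parameters) {
        slices = getRuntimeContext().getListState(new ListStateDescriptor<>("slices", Slice.class));
        count = getRuntimeContext().getState(new ValueStateDescriptor<>("count", Integer.class));
    }

    @Override
    public void processElement(Slice slice, Context ctx, Collector<Packet> out) throws Exception {
        Integer seen = count.value();
        if (seen == null) {
            // first slice of this packet: arm a timeout
            // (processing time is used here; an event-time timer works the same way)
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + Duration.ofMinutes(10).toMillis());
            seen = 0;
        }

        slices.add(slice);
        count.update(seen + 1);

        if (seen + 1 == slice.expectedSliceCount()) {   // assumed accessor on the slice metadata
            // packet is complete: merge and emit (a real job would also delete the pending timer)
            out.collect(Packet.merge(slices.get()));     // assumed merge helper
            slices.clear();
            count.clear();
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Packet> out) throws Exception {
        if (count.value() != null) {
            // timeout fired before the packet completed: emit the partial packet or record an error
            out.collect(Packet.merge(slices.get()));
            slices.clear();
            count.clear();
        }
    }
}
```

The input stream would be keyed by packet id, e.g. `sliceStream.keyBy(Slice::packetId).process(new PacketAssembler())` (again, `packetId` is an assumed accessor).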
Doing this with the Statefun API
You can use the same pattern mentioned above. The function id will play the same role as the key, and you can use a delayed message instead of a timer (a sketch follows the list below):
as each slice arrives, add it to the packet that's being assembled
if it is the first slice, send a delayed message that will act as a timeout
when all the slices have arrived, merge them and send the packet
if the delayed message arrives before the packet is complete, do whatever is appropriate (e.g., go ahead and send the partial packet)
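A very rough sketch of that flow with the StateFun Java SDK might look like the following. All names, the string-typed slice payload, the fixed packet size of 10, and the 10-minute timeout are assumptions for illustration; a real function would use a custom slice type and send the merged packet to an egress (e.g., your MinIO egress), and the value specs would have to be registered with the function's StatefulFunctionSpec.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import org.apache.flink.statefun.sdk.java.AddressScopedStorage;
import org.apache.flink.statefun.sdk.java.Context;
import org.apache.flink.statefun.sdk.java.StatefulFunction;
import org.apache.flink.statefun.sdk.java.ValueSpec;
import org.apache.flink.statefun.sdk.java.message.Message;
import org.apache.flink.statefun.sdk.java.message.MessageBuilder;

// Hypothetical sketch: one function instance per packet id assembles the slices.
public class PacketAssemblerFn implements StatefulFunction {

    static final ValueSpec<Integer> RECEIVED = ValueSpec.named("received").withIntType();
    static final ValueSpec<String> BUFFER = ValueSpec.named("buffer").withUtf8StringType();

    @Override
    public CompletableFuture<Void> apply(Context context, Message message) {
        AddressScopedStorage storage = context.storage();

        if (message.isUtf8String() && "timeout".equals(message.asUtf8String())) {
            // the delayed message fired before the packet completed:
            // emit the partial packet or record an error, then clear state
            storage.remove(RECEIVED);
            storage.remove(BUFFER);
            return context.done();
        }

        int received = storage.get(RECEIVED).orElse(0);
        if (received == 0) {
            // first slice: arm the timeout via a delayed self-message
            context.sendAfter(
                    Duration.ofMinutes(10),
                    MessageBuilder.forAddress(context.self()).withValue("timeout").build());
        }

        storage.set(BUFFER, storage.get(BUFFER).orElse("") + message.asUtf8String());
        storage.set(RECEIVED, received + 1);

        if (received + 1 == 10) {   // assumed fixed packet size from the description
            // all slices arrived: merge and forward the packet (e.g., via an egress), then clear state
            storage.remove(RECEIVED);
            storage.remove(BUFFER);
        }
        return context.done();
    }
}
```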
We are using Flink Kinesis Consumer to consume data from Kinesis stream into our Flink application.
The KCL library uses a DynamoDB table to store the last successfully processed Kinesis stream sequence numbers, so that the next time the application starts, it resumes from where it left off.
But it seems that the Flink Kinesis Consumer does not maintain any such sequence numbers in a persistent store. As a result, we need to rely upon the ShardIteratorType (TRIM_HORIZON, LATEST, etc.) to decide where the Flink application resumes processing upon restart.
A possible solution could be to rely on Flink's checkpointing mechanism, but that only works when the application resumes after a failure, not when the application has been deliberately cancelled and needs to be restarted from the last successfully consumed Kinesis stream sequence number.
Do we need to store these last successfully consumed sequence numbers ourselves?
Best practice with Flink is to use checkpoints and savepoints, as these create consistent snapshots that contain offsets into your message queues (in this case, Kinesis stream sequence numbers) together with all of the state throughout the rest of the job graph that resulted from having consumed the data up to those offsets. This makes it possible to recover or restart without any loss or duplication of data.
Flink's checkpoints are snapshots taken automatically by Flink itself for the purpose of recovery from failures, and are in a format optimized for rapid restoration. Savepoints use the same underlying snapshot mechanism, but are triggered manually, and their format is more concerned about operational flexibility than performance.
Savepoints are what you are looking for. In particular, cancel with savepoint and resume from savepoint are very useful.
Another option is to use retained checkpoints with ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION.
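For example, a minimal sketch of enabling retained checkpoints (note that newer Flink versions replace `enableExternalizedCheckpoints` with `setExternalizedCheckpointCleanup`, so the exact setter depends on your version):

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// take a checkpoint every 60 seconds
env.enableCheckpointing(60_000);

// keep the latest checkpoint around even when the job is cancelled,
// so it can be used to restart the job later
env.getCheckpointConfig()
   .enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
```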
To add to David's response, I'd like to explain the reasoning behind not storing sequence numbers.
Committing offsets back into the source system would limit the checkpointing/savepointing feature to fault tolerance only. That is, only the latest checkpoint/savepoint would be usable for recovery.
However, Flink actually supports jumping back to a previous checkpoint/savepoint. Consider an application upgrade: you take a savepoint beforehand, upgrade, and let the job run for a couple of minutes, during which it creates a few checkpoints. Then you discover a critical bug. You would like to roll back to the savepoint that you have taken and discard all the checkpoints.
Now, if Flink committed the source offsets only to the source systems, we would not be able to replay the data between now and the restored savepoint. So Flink needs to store the offsets in the savepoint itself, as David pointed out. At that point, additionally committing them to the source system yields no benefit and is confusing when restoring to a previous savepoint/checkpoint.
Do you see any benefit in storing the offsets additionally?
I am considering how to use Apache Flink for a voting system I’m currently developing. I am a complete newbie to Flink and any help would be appreciated.
The requirements are:
A user with some privilege can start a vote on an arbitrary issue. The user can close the vote any time they like.
As many as hundreds of thousands of people may join a vote.
Counting of votes should be started immediately after a vote is started, and the intermediate results should be updated over time, so that it can be shown to the participants.
When the system finishes counting after a vote has been closed, it should notify the participants of the final result.
In my understanding, Flink's stream processing is for real-time processing of infinite streams, while batch processing is for non-real-time processing of finite streams.
How can I apply Flink to my requirement, which is a real-time processing of finite streams?
Flink's DataStream API can process events of finite streams without any problems. The DataStream program will simply terminate when the stream reaches its end.
You can simulate this behavior if you use a SocketTextStreamFunction to read text data from a socket. Once you close the socket, the program will terminate. Alternatively, you can also read data from a file, which is also a kind of finite stream. However, keep in mind that incomplete windows will not be automatically evaluated, so you have to make sure that you do not lose data in windows if you use them.
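As a small sketch of that idea (the socket source, the one-line-per-vote format, the printed output, and the job name are assumptions for illustration): the job keeps emitting updated counts while votes arrive, and it terminates, having produced the final counts, once the socket is closed.

```java
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class VoteCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A finite stream: the job terminates once the socket is closed (i.e., the vote is closed)
        DataStream<String> votes = env.socketTextStream("localhost", 9999);

        votes
            .map(line -> Tuple2.of(line.trim(), 1))         // assume: one line = one vote for an option
            .returns(Types.TUPLE(Types.STRING, Types.INT))  // needed because of lambda type erasure
            .keyBy(vote -> vote.f0)                         // group by the voted option
            .sum(1)                                         // continuously updated intermediate counts
            .print();                                       // stand-in for notifying participants

        env.execute("vote-count-sketch");
    }
}
```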