Flink consumes RabbitMQ messages in parallel, how to ensure sequential consumption - apache-flink

I listen to the MySQL binlog through Flink, push the change events to a RabbitMQ queue, and then consume the RabbitMQ messages in Flink. I set the parallelism to 1 for sequential consumption of the messages, but this causes the Flink task to OOM. Is there any way to support a parallelism greater than 1 while still consuming sequentially? Please advise, thanks!

According to your description of the problem, it seems like you want to use multiple event sources and process the events sequentially.
But it depends on what order that sequence is in.
You may want to check the concept of time semantics in Flink.
If you can define an event time for each event sent from the multiple parallel sources, you can use event-time semantics together with assigned watermarks.
That way, when Flink receives the events, it knows to process them in event-time order regardless of when it receives them (which is processing time).
Keywords: event time and processing time.
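A minimal sketch of that setup on an older (pre-1.12) Flink, assuming a hypothetical BinlogEvent POJO with a getTimestamp() method and a hypothetical BinlogEventSchema for deserializing the RabbitMQ messages; the connection settings, queue name, and 5-second bound are placeholders:

    import org.apache.flink.streaming.api.TimeCharacteristic;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.connectors.rabbitmq.RMQSource;
    import org.apache.flink.streaming.connectors.rabbitmq.common.RMQConnectionConfig;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

    RMQConnectionConfig connectionConfig = new RMQConnectionConfig.Builder()
            .setHost("localhost").setPort(5672)
            .setUserName("guest").setPassword("guest").setVirtualHost("/")
            .build();

    DataStream<BinlogEvent> events = env
            .addSource(new RMQSource<>(connectionConfig, "binlog-queue", new BinlogEventSchema()))
            .assignTimestampsAndWatermarks(
                    // tolerate events arriving up to 5 seconds out of order
                    new BoundedOutOfOrdernessTimestampExtractor<BinlogEvent>(Time.seconds(5)) {
                        @Override
                        public long extractTimestamp(BinlogEvent event) {
                            // event time taken from the binlog record itself
                            return event.getTimestamp();
                        }
                    });

    // downstream operators can now run with parallelism > 1; event-time
    // windows and timers will fire in timestamp order, guided by watermarks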

Related

Processing or Catching Up on Outstanding Kafka Messages

When we restart a Flink job, it needs to process millions of outstanding messages from several Kafka topics that accumulated while the job was down. As soon as the job starts, there is a burst of input messages into the Flink job. The challenge we have is to process those messages on multiple topics in the order they occurred, i.e. based on their event time. What are the best ways/solutions to handle this scenario? Is it a good idea to partition the messages by a key, sort them by event time using a PriorityQueue, and define a window of one minute or more before emitting them for downstream processing? Or are there better solutions to address this? Is there a way to resolve this using watermarks?
The messages may have been created/written a few hours apart.

Apache Flink timeout using onTimer and processElement

I am using the Apache Flink processElement1, processElement2, and onTimer streaming design pattern to implement a timeout use case. I observed that the throughput of the system decreased by orders of magnitude when I included the timeout functionality.
Any hint on the internal implementation of onTimer in Flink: is it one thread per keyed stream (unlikely), or a pool or single execution thread that continuously polls buffered callbacks and picks up the timed-out ones for execution?
To the best of my knowledge, Flink is based on the actor model and reactive patterns (Akka), which encourages the judicious use of a few non-blocking threads, so one thread per keyed stream for onTimer or any other pattern is presumably not used!
There are two kinds of timers in Flink: event-time and processing-time timers. The implementations are completely different, but in neither case should you see a significant performance impact. Something else must be going on. Can you share a small, reproducible example, or at least show us more of what's going on and how you are taking the measurements?
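For reference, a minimal sketch of the timeout pattern the question describes, assuming hypothetical Request and Response types keyed by a String id; the 30-second deadline and the output strings are placeholders:

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
    import org.apache.flink.util.Collector;

    public class TimeoutFunction extends KeyedCoProcessFunction<String, Request, Response, String> {

        private transient ValueState<Long> timerState;

        @Override
        public void open(Configuration parameters) {
            timerState = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("timer", Long.class));
        }

        @Override
        public void processElement1(Request req, Context ctx, Collector<String> out) throws Exception {
            // register a processing-time timer that fires if no response arrives within 30s
            long deadline = ctx.timerService().currentProcessingTime() + 30_000L;
            ctx.timerService().registerProcessingTimeTimer(deadline);
            timerState.update(deadline);
        }

        @Override
        public void processElement2(Response resp, Context ctx, Collector<String> out) throws Exception {
            // the response arrived in time: cancel the pending timer
            Long deadline = timerState.value();
            if (deadline != null) {
                ctx.timerService().deleteProcessingTimeTimer(deadline);
                timerState.clear();
            }
            out.collect("matched: " + resp);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // no response within the deadline: emit a timeout marker
            out.collect("timeout for key " + ctx.getCurrentKey());
            timerState.clear();
        }
    }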

Flink, basic rule for checkpointing?

I have two questions regarding Flink's checkpointing strategy.
I know that checkpointing is related to state (right?), so if I'm not using state (ValueState and the like) explicitly in my job code, do I need to care about checkpointing? Is it still necessary?
If I do need to enable checkpointing, what should the interval be? Are there any basic rules for setting the interval? Suppose we're talking about a quite busy system (Kafka + Flink), with several billion messages per day.
Many thanks.
Even if you are not using state explicitly in your application, Flink's Kafka source and sink connectors use state on your behalf in order to provide you with either at-least-once or exactly-once guarantees -- assuming you care about those guarantees. Some other operators, such as windows and other streaming aggregations, also use state somewhat transparently on your behalf.
If your Flink job fails, it will be rewound to the most recent successful checkpoint and resume processing from there. So, for example, if your checkpoint interval is 10 minutes, then after recovery your job might have 10+ minutes of data to catch up on before it can resume processing live data. So choose a checkpoint interval that you can live with from this perspective.
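A minimal sketch of enabling checkpointing, using the 10-minute interval from the example above; the pause and timeout values are arbitrary placeholders to tune for your workload:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // checkpoint every 10 minutes with exactly-once guarantees
    env.enableCheckpointing(10 * 60 * 1000L, CheckpointingMode.EXACTLY_ONCE);

    // leave some breathing room between consecutive checkpoints
    env.getCheckpointConfig().setMinPauseBetweenCheckpoints(60 * 1000L);

    // abort a checkpoint that has not completed within 5 minutes
    env.getCheckpointConfig().setCheckpointTimeout(5 * 60 * 1000L);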

Flink Kafka connector 0.10.0 Event time Clarification and ProcessFunction Clarification

I'm struggling with an issue regarding the event time of Flink's Kafka consumer connector.
Citing the Flink docs:
Since Apache Kafka 0.10+, Kafka’s messages can carry timestamps, indicating the time the event has occurred (see “event time” in Apache Flink) or the time when the message has been written to the Kafka broker.
The FlinkKafkaConsumer010 will emit records with the timestamp attached, if the time characteristic in Flink is set to TimeCharacteristic.EventTime (StreamExecutionEnvironment.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)).
The Kafka consumer does not emit watermarks.
Some questions and issues come to mind:
How do I know whether the timestamp taken is the time the event occurred or the time it was written to the Kafka broker?
If the consumer does not emit watermarks and TimeCharacteristic.EventTime is set, does this mean a message that is a few days late can still enter and be processed?
The main flow does not contain a window function and basically looks like the following: source (Kafka) -> filter -> processFunction -> sink. Does this mean the event is fired at the moment it is consumed by the Kafka connector?
I currently use Kafka connector 0.10.0 with TimeCharacteristic.EventTime set, and a processFunction which, as expected, does some state cleanup every X minutes.
However, I'm seeing a strange situation: when I start the Flink program, the OnTimerContext contains timestamps which start from 0 and grow up to the current timestamp. Is this a bug?
Thanks in advance to all helpers!
That depends on the configuration of the Kafka topic these events are written to: its message.timestamp.type property should be set to either CreateTime or LogAppendTime.
Your Flink application is responsible for creating watermarks; the Kafka consumer will take care of the timestamps, but not the watermarks. It doesn't matter how late an event is, it will still enter your pipeline.
Yes.
It's not clear to me what part of this is strange.
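Since the consumer attaches timestamps but never emits watermarks, you can supply a watermark assigner on the consumer yourself. A minimal sketch, assuming a hypothetical MyEvent type and MyEventSchema deserializer; the Kafka-attached timestamp arrives as previousElementTimestamp, and the 10-second out-of-orderness bound is an arbitrary placeholder:

    import java.util.Properties;
    import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
    import org.apache.flink.streaming.api.watermark.Watermark;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010;

    Properties kafkaProps = new Properties(); // broker and group settings go here

    FlinkKafkaConsumer010<MyEvent> consumer =
            new FlinkKafkaConsumer010<>("my-topic", new MyEventSchema(), kafkaProps);

    consumer.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<MyEvent>() {
        private long maxTimestamp = Long.MIN_VALUE;

        @Override
        public long extractTimestamp(MyEvent element, long previousElementTimestamp) {
            // previousElementTimestamp is the timestamp the consumer attached,
            // i.e. the Kafka record's CreateTime or LogAppendTime
            maxTimestamp = Math.max(maxTimestamp, previousElementTimestamp);
            return previousElementTimestamp;
        }

        @Override
        public Watermark getCurrentWatermark() {
            // allow events up to 10 seconds out of order
            return new Watermark(maxTimestamp == Long.MIN_VALUE
                    ? Long.MIN_VALUE : maxTimestamp - 10_000L);
        }
    });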

Does Flink support suspending a job?

I am just beginning to learn Apache Flink and have met the following problem:
How can I suspend a Flink job and then resume it?
Does Flink support suspending a job using the command line?
Yes, you certainly can do this with Flink. You want to read about savepoints, which can be triggered from the command line or from the REST API.
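For example, with the bin/flink command-line tool (the job id, savepoint directory, and jar name below are placeholders):

    # take a savepoint and cancel (i.e. suspend) the running job
    bin/flink cancel -s hdfs:///flink/savepoints <jobId>

    # later, resume the job from the savepoint that was written
    bin/flink run -s hdfs:///flink/savepoints/savepoint-<id> your-job.jar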
Updated
Normally the goal of a stream processor is to do continuous, immediate processing of new elements as they become available. If you want to suspend processing, then I guess this might be with the goal of ignoring the source(s) for a while and dropping the arriving events, or with a desire to conserve computing resources for a time and to later resume without losing any input.
RichCoFlatMapFunction and CoProcessFunction are building blocks you might find useful. You could set up a control stream connected to a socket (for example), and when you want to "suspend" the primary stream, send an event that causes the primary stream to either start dropping its input, or do a blocking read, or sleep, for example.
Or you might think about adding your own layer of abstraction on top of jobs and cope with the fact that the job ids will change. Note that jobs can have names that remain unchanged across savepoints/restarts.
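A minimal sketch of that control-stream idea, assuming a hypothetical Event type and a String control stream where "suspend"/"resume" messages toggle dropping of the primary input:

    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    public class SuspendableFlatMap extends RichCoFlatMapFunction<Event, String, Event> {

        private boolean suspended = false;

        @Override
        public void flatMap1(Event value, Collector<Event> out) {
            // primary stream: drop events while suspended
            if (!suspended) {
                out.collect(value);
            }
        }

        @Override
        public void flatMap2(String control, Collector<Event> out) {
            // control stream: toggle the flag
            suspended = "suspend".equals(control);
        }
    }

    // wiring it up; broadcast the control stream so every parallel instance sees it:
    // primary.connect(controlStream.broadcast()).flatMap(new SuspendableFlatMap());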
