Flink sending same data to the same partition - apache-flink

I'm getting data from kafka topic, then exploding array and producing multiple events using flatMap.
Incoming event format:
Event(eventId: Long, time: Long)
IncomingEvent(customerId: Long, events: List[Event])
Event format after exploding incoming event:
EventAfterExploding(customerId: Long, eventId: Long, time: Long)
These events will be written to the MySQL by using JDBC sink provided by Flink.
The data stored in the same Kafka partitions has the same customer id, that means I don't have any ordering issue here. But there can be lots of eventIds in an event, so that means there can be lots of events in the same flink partition after flatMap operation. This will cause latency or maybe OOM issues because one operator have to process more data. In order to prevent this issue, I can apply repartition or increase parallelism. But there is also one more concern here is that every (customerId, eventId) pair have to be sent to the same sink operator because there can be race condition issues if different writers try to operate same pair. For example;
event1 => EventAfterExploding(1, 1, 1)
event2 => EventAfterExploding(1, 1, 2)
In this scenario, database have to contains event2 that has the latest time, but if these two data go to different sink partitions, event1 can be in database instead of event2.
How can I solve race condition problem and scaling problem that happens when there are lots of data in the same partition? Does applying code block given below solves these problem? I though code given below, because after keyBy operation data will be redistribute and it will also guarantee that the same data will ben sent to the same partition but just want to make sure. Thanks!
incomingEvents
.flatMap(new ExplodingFunction())
.keyBy(event => (event.customerId, event.eventId))
.addSink(JdbcSink.sink(...))

Yes, that code will have the effect you are looking for. All events for the same customerId and eventId will go to the same instance of the sink.

Related

How to handle the case for watermarks when num of kafka partitions is larger than Flink parallelism

I am trying to figure out a solution to the problem of watermarks progress when the number of Kafka partitions is larger than the Flink parallelism employed.
Consider for example that I have Flink app with parallelism of 3 and that it needs to read data from 5 Kafka partitions. My issue is that when starting the Flink app, it has to consume historical data from these partitions. As I understand it each Flink task starts consuming events from a corresponding partition (probably buffers a significant amount of events) and progress event time (therefore watermarks) before the same task transitions to another partition that now will have stale data according to watermarks already issued.
I tried considering a watermark strategy using watermark alignment of a few seconds but that
does not solve the problem since historical data are consumed immediately from one partition and therefore event time/watermark has progressed.Below is a snippet of code that showcases watermark strategy implemented.
WatermarkStrategy.forGenerator(ws)
.withTimestampAssigner(
(event, timestamp) -> (long) event.get("event_time))
.withIdleness(IDLENESS_PERIOD)
.withWatermarkAlignment(
GROUP,
Duration.ofMillis(DEFAULT_MAX_WATERMARK_DRIFT_BETWEEN_PARTITIONS),
Duration.ofMillis(DEFAULT_UPDATE_FOR_WATERMARK_DRIFT_BETWEEN_PARTITIONS));
I also tried using a downstream operator to sort events as described here Sorting union of streams to identify user sessions in Apache Flink but then again also this cannot effectively tackle my issue since event record times can deviate significantly.
How can I tackle this issue ? Do I need to have the same number of Flink tasks as the number of Kafka partitions or I am missing something regarding the way data are read from Kafka partitions
The easiest solution to this problem will be using the fromSource with WatermarkStrategy instead of assigning that by using assignTimestampsAndWatermarks.
When You use the WatermarkStrategy directly in fromSource with kafka connector, the watermarks will be partition aware, so the Watermark generated by the given operator will be minimum of all partitions assinged to this operator.
Assigning watermarks directly in source will solve the problem You are facing, but it has one main drawback, since the generated watermark in min of all partitions processed by the given operator, if some partition is idle watermark for this operator will not progress either.
The docs describe kafka connector watermarking here.

Ordering Guarantees in FlinkKinesisProducer

I'm implementing a real-time streaming ETL pipeline using Apache Flink. The pipeline has these characteristics:
Ingest a single Kinesis stream: stream-A
The stream has records of type EventA which have a category_id, representing distinct logical streams
Because of how they are written to Kinesis (separate producer per category_id, writing serially), these logical streams are guaranteed to be read in order by FlinkKinesisConsumer
Flink does some in-order processing work, keyed by the category_id, generating a stream of EventB data records
These records are written to Kinesis stream-B
A separate service ingests the data from stream-B and it is important that this happens in order.
The processing looks something like this:
val in_events = env.addSource(new FlinkKinesisConsumer[EventA]( # these are guaranteed ordered
"stream-A",
new EventASchema,
consumerConfig))
val out_events = in_events
.keyBy(event => event.category_id)
.process(new EventAStreamProcessor)
out_events.addSink(new FlinkKinesisProducer[EventB](
"stream-B",
new EventBSchema,
producerConfig))
# a separate service reads the out_events and wants them in-order
Based on the guidelines here, it seems like it is impossible to guarantee the ordering of EventB records written to the sink. I only care that events with the same category_id are written in order, since the downstream service will keyBy this. Thinking from first principles, if I were to implement the threading manually, I would have a separate queue per category_id KeyedStream and ensure those are written serially to Kinesis (this seems like a strict generalization over what is done by default, which is to use a ThreadPool, which has a single global queue). Does the FlinkKinesisProducer support this mechanism or is there a way around this limitation using Flink's keyBy or similar construct? Separate sink per category_id maybe? For this last option, I'm anticipating 100k category_ids so this might have too much of a memory overhead.
One option is to buffer events read from stream-B in the downstream service to order them (with high probability if buffer window is large). This in theory should work, but it makes the downstream service more complex then it needs to be, precludes determinism since it depends on random timing of network calls, and, more importantly, adds latency to the pipeline (though maybe less latency overall then forcing serial writes to stream-B?). So ideally, I'm hoping to go with another option. And, this feels like a common problem, so perhaps there are more clever solutions out there or I'm missing something obvious
Many thanks in advance.

Are Flink Stream messages sent to downstream in order

I am new to Flink. I have a question that if all the messages sent to the downstream nodes are in order? For example,
[Stream] -> [DownStream]
Stream: [1,2,3,4,5,6,7,8,9]
Downstream get [3,2,1,4,5,7,6,8,9]
If so, how do we handle this case if we want it in order?
Any help would be very appreciated!
An operator can have multiple input channels. It will process the events from each channel in the order in which they were received. (Operators can also have multiple output channels.)
If your job has more than one pathway between stream and downstream, then the events can race and the the ordering will be non-deterministic. Otherwise the ordering will be preserved.
An example: Suppose you are reading, in parallel, from a Kafka topic with multiple partitions. Further imagine that all events from a given user are in the same Kafka partition (and are in order, by timestamp, for each user). Then in Flink you can use keyBy(user) and be sure that the event stream for each user will remain in order. On the other hand, if the events for a given user are spread across multiple partitions, then keyBy(user) will end up creating a stream of events for each user that is (almost certainly) out of order, because it will be pulling together events from several different FlinkKafkaConsumer instances that are reading in parallel.

Partition the whole dataStream in flink at the start of source and maintain the partition till sink

I am consuming trail logs from a Queue (Apache Pulsar). I use 5 keyedPrcoessFunction and finally sink the payload to Postgres Db. I need ordering per customerId for each of the keyedProcessFunction. Right now I achieve this by
Datasource.keyBy(fooKeyFunction).process(processA).keyBy(fooKeyFunction).process(processB).keyBy(fooKeyFunction).process(processC).keyBy(fooKeyFunction).process(processE).keyBy(fooKeyFunction).sink(fooSink).
processFunctionC is very time consuming and takes 30 secs on worst-case to finish. This leads to backpressure. I tried assigning more slots to processFunctionC but my throughput never remains constant. it mostly remains < 4 messages per second.
Current slot per processFunction is
processFunctionA: 3
processFunctionB: 30
processFunctionc: 80
processFunctionD: 10
processFunctionC: 10
In Flink UI it shows backpressure starting from the processB, meaning C is very slow.
Is there a way to use apply partitioning logic at the source itself and assing the same slots per task to each processFunction. For example:
dataSoruce.magicKeyBy(fooKeyFunction).setParallelism(80).process(processA).process(processB).process(processC).process(processE).sink(fooSink).
This will lead to backpressure to happen for only a few of the tasks and not skew the backpressure which is caused by multiple KeyBy.
Another approach that I can think of is to combine all my processFunction and sink into single processFunction and apply all those logic in the sink itself.
I don't think there exists anything quite like this. The thing that is the closest is DataStreamUtils.reinterpretAsKeyedStream, which recreates the KeyedStream without actually sending any data between the operators since it's using the partitioner that only forwards data locally. This is more or less something You wanted, but it still adds partitioning operator and under the hood recreates the KeyedStream, but it should be simpler and faster and perhaps it will solve the issue You are facing.
If this does not solve the issue, then I think the best solution would be to group operators so that the backpressure is minimalized as You suggested i.e. merge all operators into one bigger operator, this should minimize backpressure.

Enrich fast stream keyed by (X,Y) with a slowly change stream keyed by (X) in Flink

I need to enrich my fast changing streamA keyed by (userId, startTripTimestamp) with slowly changing streamB keyed by (userId).
I use Flink 1.8 with DataStream API. I consider 2 approaches:
Broadcast streamB and join stream by userId and most recent timestamp. Would it be equivalent of DynamicTable from the TableAPI? I can see some downsides of this solution: streamB needs to fit into RAM of each worker node, it increase utilization of RAM as whole streamB needs to be stored in RAM of each worker.
Generalise state of streamA to a stream keyed by just (userId), let's name it streamC, to have common key with the streamB. Then I am able to union streamC with streamB, order by processing time, and handle both types of events in state. It's more complex to handle generaised stream (more code in the process function), but not consume that much RAM to have all streamB on all nodes. Are they any more downsides or upsides of this solution?
I have also seen this proposal https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API where it is said:
In general, most of these follow the pattern of joining a main stream
of high throughput with one or several inputs of slowly changing or
static data:
[...]
Join stream with slowly evolving data: This is very similar to
the above case but the side input that we use for enriching is
evolving over time. This can be done by waiting for some initial data
to be available before processing the main input and the continuously
ingesting new data into the internal side input structure as it
arrives.
Unfortunately, it looks like a long time ahead to reach this feature https://issues.apache.org/jira/browse/FLINK-6131 and no alternatives are described. Therefore I would like to ask of the currently recommended approach for the described use case.
I've seen Combining low-latency streams with multiple meta-data streams in Flink (enrichment), but it not specify what are keys of that streams, and moreover it is answered at the time of Flink 1.4, so I expect the recommended solution might have changed.
Building on top of what Gaurav Kumar has already answered.
The main question is do you need to exactly match records from streamA and streamB or is it best effort match? For example, is it an issue for you, that because of a race condition some (a lot of?) records from streamA can be processed before some updates from streamB arrive, for example during the start up?
I would suggest to draw an inspiration from how Table API is solving this issue. Probably Temporal Table Join is the right choice for you, which would leave you with the choice: processing time or event time?
Both of the Gaurav Kumar's proposal are implementations of processing time Temporal Table joins, which assumes that records can be very loosely joined and do not have to timed properly.
If records from streamA and streamB have to be timed properly, then one way or another you have to buffer some of the records from both of the streams. There are various of ways how to do it, depending on what semantic you want to achieve. After deciding on that, the actual implementation is not that difficult and you can draw an inspiration from Table API join operators (org.apache.flink.table.runtime.join package in flink-table-planner module).
Side inputs (that you referenced) and/or input selection are just tools for controlling the amount of unnecessary buffered records. You can implement a valid Flink job without them, but the memory consumption can be hard to control if one stream significantly overtakes the other (in terms of event time - for processing time it's non-issue).
The answer depends on size of your state of streamB that needs to be used to enrich streamA
If you broadcast your streamB state, then you are putting all userIDs from streamB to each of the task managers. Each task on task manager will only have a subset of these userIds from streamA on it. So some userId data from streamB will never be used and will stay as a waste. So if you think that the size of streamB state is not big enough to really impact your job and doesn't take significant memory to leave less memory for state management, you can keep the whole streamB state. This is your #1.
If your streamB state is really huge and can consume considerable memory on task managers, you should consider approach #2. KeyBy same Id both the streams to make sure that elements with same userID reach the same tasks, and then you can use managed state to maintain the per key streamB state and enrich streamA elements using this managed state.

Resources