Combining low-latency streams with multiple meta-data streams in Flink (enrichment) - apache-flink

I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to fulfil a kind of ETL setup we are doing in a legacy system today.
A very common scenario is that we have keyed, slow-throughput meta-data streams that we want to use to enrich high-throughput data streams.
This raises two questions concerning Flink: How does one enrich a fast-moving stream with slowly updating streams whose time windows overlap but are not equal (meta-data can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?
I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.

Flink has several mechanisms that can be used for enrichment.
I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.
The simplest approach is probably to use a RichFlatMapFunction and load static enrichment data in its open() method (docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update the enrichment data.
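As a rough sketch of that first approach (the Event and EnrichedEvent types and the loadReferenceData() helper are illustrative assumptions, not part of Flink), it could look like this:
import org.apache.flink.api.common.functions.RichFlatMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.util.Collector

case class Event(key: String, value: Double)
case class EnrichedEvent(event: Event, metadata: String)

class StaticEnrichmentFlatMap extends RichFlatMapFunction[Event, EnrichedEvent] {
  // loaded once per task instance when the job starts, never updated afterwards
  private var referenceData: Map[String, String] = _

  override def open(parameters: Configuration): Unit = {
    // stand-in for reading from a file, database, or service at startup
    referenceData = loadReferenceData()
  }

  override def flatMap(event: Event, out: Collector[EnrichedEvent]): Unit =
    referenceData.get(event.key).foreach(meta => out.collect(EnrichedEvent(event, meta)))

  private def loadReferenceData(): Map[String, String] =
    Map("user-1" -> "premium", "user-2" -> "basic")
}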
For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.
Assuming you want to actually stream in the enrichment data, a RichCoFlatMapFunction is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, with a RichCoFlatMapFunction you have no ability to take the timing of the stream elements into account. If you are concerned about one stream getting ahead of, or behind, the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then using a CoProcessFunction is the right approach.
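A rough sketch of the RichCoFlatMapFunction variant, reusing the Event and EnrichedEvent types from the sketch above and assuming a hypothetical Metadata type that shares the key, might look like this:
import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction
import org.apache.flink.util.Collector

case class Metadata(key: String, info: String)

// Event and EnrichedEvent as in the previous sketch
class EnrichmentCoFlatMap extends RichCoFlatMapFunction[Event, Metadata, EnrichedEvent] {
  // latest metadata seen for the current key, kept as managed keyed state
  private var metadataState: ValueState[Metadata] = _

  override def open(parameters: Configuration): Unit = {
    metadataState = getRuntimeContext.getState(
      new ValueStateDescriptor[Metadata]("metadata", classOf[Metadata]))
  }

  // fast stream: enrich with whatever metadata has arrived so far for this key
  override def flatMap1(event: Event, out: Collector[EnrichedEvent]): Unit = {
    val meta = metadataState.value()
    if (meta != null) out.collect(EnrichedEvent(event, meta.info))
    // else: drop, buffer, or emit un-enriched, depending on your requirements
  }

  // slow stream: just remember the most recent metadata for this key
  override def flatMap2(metadata: Metadata, out: Collector[EnrichedEvent]): Unit =
    metadataState.update(metadata)
}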
You will find a detailed example, plus code, in the Apache Flink training materials.
If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that admittedly becomes rather awkward at some point. An alternative is to use a union operator to combine all of the meta-data streams (note that this requires that all the streams have the same type), followed by a RichCoFlatMapFunction or CoProcessFunction that joins this unified enrichment stream with the primary stream.
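A brief sketch of the union-based wiring, building on the operator sketched above and assuming meta1, meta2 and meta3 are DataStream[Metadata] values and events is a DataStream[Event]:
import org.apache.flink.streaming.api.scala._

// all meta-data streams must share the same type before they can be unioned
val allMetadata: DataStream[Metadata] = meta1.union(meta2, meta3)

val enriched: DataStream[EnrichedEvent] =
  events.keyBy(_.key)
    .connect(allMetadata.keyBy(_.key))
    .flatMap(new EnrichmentCoFlatMap)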
Update:
Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
This example joins orders with their corresponding shipments if the shipment occurred within 4 hours of the order being placed.

Related

How to join two streams without time window in flink?

I am getting data from two streams. I want to join these two streams based on a key. For example, consider two streams. Data in stream A can come first. Sometimes data in stream B can come first. The joining data in the streams can come at any time. Because of this nature, I can't use a windowed join. Is it possible to join two unbounded streams in flink?
I believe a non-windowed Join will behave as you want: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/joins/#regular-joins
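For reference, a regular (non-windowed) join of the kind described in the linked docs looks roughly like this (the table and column names are assumed):
SELECT a.id, a.payload AS a_payload, b.payload AS b_payload
FROM StreamA AS a
JOIN StreamB AS b
  ON a.id = b.id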
If you are using the DataStream API instead of the SQL API, you can implement this behavior with a CoFlatMap operator that keeps the elements from both sides in shared keyed state and joins them whenever either side receives an update.
Take into account that this requires keeping both sides in state forever, which can make your state grow indefinitely.
The note in the Flink SQL documentation advises setting a state TTL: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/config/#table-exec-state-ttl. The DataStream equivalent is: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/state/#state-time-to-live-ttl The problem is that if some records in the state expire and an update later arrives that would need to be joined with the expired element, the result will be incorrect.
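A minimal sketch of enabling TTL on the DataStream side, following the second link (the descriptor name and the 7-day retention are just examples):
import org.apache.flink.api.common.state.{StateTtlConfig, ValueStateDescriptor}
import org.apache.flink.api.common.time.Time

val ttlConfig = StateTtlConfig
  .newBuilder(Time.days(7))                                   // keep join state for 7 days
  .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)  // refresh the timer on every write
  .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
  .build()

// attach the TTL to whatever state descriptor backs the join buffer
val joinBuffer = new ValueStateDescriptor[String]("join-buffer", classOf[String])
joinBuffer.enableTimeToLive(ttlConfig)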

Custom Key logic to avoid shuffling

I am using Flink 1.11.
My application reads data from Kafka, so messages are already ordered within each Kafka partition. After consuming messages from Kafka, I want to apply a TumblingWindow. As per the Flink documentation, keyBy is required to use a TumblingWindow. Using keyBy means shuffling of data will be triggered, which I want to avoid. Since in each task slot the records are already ordered (due to how they are consumed from Kafka), how can shuffling be avoided? The parallelism can be greater than, equal to, or less than the number of Kafka partitions. My concerns are:
Can a TumblingWindow be used without keyBy?
If not, how can keyBy be customised to ensure data remains on the same task slot and no shuffling is triggered?
What you are asking for is very difficult to achieve using the DataStream API. But the SQL/Table API automatically applies various optimizations when you use window-valued table functions, which will likely be good enough. See the docs for the tumble window TVF, mini-batch aggregation, and local/global aggregation.
Note however that window TVFs were added to Flink in 1.13.
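For illustration, a tumbling window aggregation with the window TVF in Flink 1.13+ might look like this (the Orders table, its order_time event-time attribute, and the amount column are assumed):
SELECT window_start, window_end, SUM(amount) AS total_amount
FROM TABLE(
  TUMBLE(TABLE Orders, DESCRIPTOR(order_time), INTERVAL '10' MINUTES))
GROUP BY window_start, window_end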

How to preserve order of records when implementing an ETL job with Flink?

Suppose I want to implement an ETL job with Flink, source and sink of which are both Kafka topic with only one partition.
The order of records in the source and sink matters downstream (there are more jobs that consume the sink of my ETL; those jobs are maintained by other teams).
Is there any way to make sure the order of records in the sink is the same as in the source, while keeping the parallelism greater than 1?
https://stackoverflow.com/a/69094404/2000823 covers parts of your question. The basic principle is that two events will maintain their relative ordering so long as they take the same path through the execution graph. Otherwise, the events will race against each other, and there is no guarantee regarding ordering.
If your job only has FORWARD connections between the tasks, then the order will always be preserved. If you use keyBy or rebalance (to change the parallelism), then it will not.
A Kafka topic with one partition cannot be read from (or written to) in parallel. You can increase the parallelism of the job, but this will only have a meaningful effect on intermediate tasks (since in this case the source and sink cannot operate in parallel) -- which then introduces the possibility of events ending up out-of-order.
If it's enough to maintain the ordering on a key-by-key basis, then with just one partition, you'll always be fine. With multiple partitions being consumed in parallel, then if you use keyBy (or GROUP BY in SQL), you'll be okay only if all events for a key are always in the same Kafka partition.

Ordering Guarantees in FlinkKinesisProducer

I'm implementing a real-time streaming ETL pipeline using Apache Flink. The pipeline has these characteristics:
Ingest a single Kinesis stream: stream-A
The stream has records of type EventA which have a category_id, representing distinct logical streams
Because of how they are written to Kinesis (separate producer per category_id, writing serially), these logical streams are guaranteed to be read in order by FlinkKinesisConsumer
Flink does some in-order processing work, keyed by the category_id, generating a stream of EventB data records
These records are written to Kinesis stream-B
A separate service ingests the data from stream-B and it is important that this happens in order.
The processing looks something like this:
val in_events = env.addSource(new FlinkKinesisConsumer[EventA](  // these are guaranteed ordered
  "stream-A",
  new EventASchema,
  consumerConfig))
val out_events = in_events
  .keyBy(event => event.category_id)
  .process(new EventAStreamProcessor)
out_events.addSink(new FlinkKinesisProducer[EventB](
  "stream-B",
  new EventBSchema,
  producerConfig))
// a separate service reads the out_events and wants them in-order
Based on the guidelines here, it seems like it is impossible to guarantee the ordering of EventB records written to the sink. I only care that events with the same category_id are written in order, since the downstream service will keyBy this. Thinking from first principles, if I were to implement the threading manually, I would have a separate queue per category_id KeyedStream and ensure those are written serially to Kinesis (this seems like a strict generalization over what is done by default, which is to use a ThreadPool, which has a single global queue). Does the FlinkKinesisProducer support this mechanism or is there a way around this limitation using Flink's keyBy or similar construct? Separate sink per category_id maybe? For this last option, I'm anticipating 100k category_ids so this might have too much of a memory overhead.
One option is to buffer events read from stream-B in the downstream service to order them (with high probability, if the buffer window is large). This in theory should work, but it makes the downstream service more complex than it needs to be, precludes determinism since it depends on the random timing of network calls, and, more importantly, adds latency to the pipeline (though maybe less latency overall than forcing serial writes to stream-B?). So ideally, I'm hoping to go with another option. And, this feels like a common problem, so perhaps there are more clever solutions out there or I'm missing something obvious.
Many thanks in advance.

Enrich fast stream keyed by (X,Y) with a slowly change stream keyed by (X) in Flink

I need to enrich my fast changing streamA keyed by (userId, startTripTimestamp) with slowly changing streamB keyed by (userId).
I use Flink 1.8 with DataStream API. I consider 2 approaches:
Broadcast streamB and join the streams by userId and the most recent timestamp. Would that be the equivalent of a DynamicTable from the Table API? I can see some downsides of this solution: streamB needs to fit into the RAM of each worker node, which increases overall RAM utilization since the whole of streamB has to be stored in the RAM of every worker.
Generalise the state of streamA to a stream keyed by just (userId), let's name it streamC, to have a common key with streamB. Then I am able to union streamC with streamB, order by processing time, and handle both types of events in state. It's more complex to handle the generalised stream (more code in the process function), but it doesn't consume that much RAM since streamB doesn't have to be kept on all nodes. Are there any more downsides or upsides to this solution?
I have also seen this proposal https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API where it is said:
In general, most of these follow the pattern of joining a main stream
of high throughput with one or several inputs of slowly changing or
static data:
[...]
Join stream with slowly evolving data: This is very similar to
the above case but the side input that we use for enriching is
evolving over time. This can be done by waiting for some initial data
to be available before processing the main input and the continuously
ingesting new data into the internal side input structure as it
arrives.
Unfortunately, it looks like it will be a long time before this feature is available (https://issues.apache.org/jira/browse/FLINK-6131), and no alternatives are described. Therefore I would like to ask what the currently recommended approach is for the described use case.
I've seen Combining low-latency streams with multiple meta-data streams in Flink (enrichment), but it does not specify what the keys of those streams are, and moreover it was answered at the time of Flink 1.4, so I expect the recommended solution might have changed.
Building on top of what Gaurav Kumar has already answered.
The main question is: do you need to exactly match records from streamA and streamB, or is a best-effort match fine? For example, is it an issue for you that, because of a race condition, some (a lot of?) records from streamA can be processed before some updates from streamB arrive, for example during start-up?
I would suggest drawing inspiration from how the Table API solves this issue. Probably a Temporal Table Join is the right choice for you, which leaves you with the choice: processing time or event time? (A rough sketch follows at the end of this answer.)
Both of Gaurav Kumar's proposals are implementations of processing-time Temporal Table joins, which assume that records can be joined very loosely and do not have to be timed precisely.
If records from streamA and streamB have to be timed properly, then one way or another you have to buffer some of the records from both streams. There are various ways to do it, depending on what semantics you want to achieve. Once you have decided on that, the actual implementation is not that difficult, and you can draw inspiration from the Table API join operators (the org.apache.flink.table.runtime.join package in the flink-table-planner module).
Side inputs (which you referenced) and/or input selection are just tools for controlling the amount of unnecessarily buffered records. You can implement a valid Flink job without them, but the memory consumption can be hard to control if one stream significantly overtakes the other (in terms of event time; for processing time it's a non-issue).
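For reference, a rough sketch of a processing-time temporal table join in the (older) Scala Table API; the Trip and Profile types, the field names, and the registered TripEvents table are assumptions for illustration, and the exact way to obtain the table environment differs slightly across Flink versions:
import org.apache.flink.streaming.api.scala._
import org.apache.flink.table.api.scala._

case class Trip(userId: String, startTripTimestamp: Long)
case class Profile(userId: String, info: String)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val tEnv = StreamTableEnvironment.create(env)  // creation call varies by Flink version

val streamA: DataStream[Trip] = ???     // fast stream, keyed by (userId, startTripTimestamp)
val streamB: DataStream[Profile] = ???  // slowly changing stream, keyed by userId

// register the fast stream as a table with a processing-time attribute
tEnv.registerDataStream("TripEvents", streamA, 'userId, 'startTripTimestamp, 'proctime.proctime)

// turn the slow stream into a temporal table function versioned by processing time, keyed by userId
val profileHistory = tEnv.fromDataStream(streamB, 'userId, 'info, 'proctime.proctime)
tEnv.registerFunction("Profiles", profileHistory.createTemporalTableFunction('proctime, 'userId))

// each trip event is joined with the profile version that is current at its processing time
val enriched = tEnv.sqlQuery(
  """
    |SELECT t.userId, t.startTripTimestamp, p.info
    |FROM TripEvents AS t,
    |     LATERAL TABLE (Profiles(t.proctime)) AS p
    |WHERE t.userId = p.userId
    |""".stripMargin)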
The answer depends on the size of the streamB state that is needed to enrich streamA.
If you broadcast your streamB state, then you are putting all userIds from streamB onto each of the task managers. Each task on a task manager will only see a subset of these userIds from streamA, so some of the streamB data on that task will never be used and is wasted. If the streamB state is small enough that it doesn't take significant memory away from state management, you can keep the whole streamB state on every node. This is your option #1 (a rough sketch follows below).
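A rough sketch of option #1 using the broadcast state pattern (the TripEvent, Profile and Enriched types are assumed for illustration):
import org.apache.flink.api.common.state.MapStateDescriptor
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class TripEvent(userId: String, startTripTimestamp: Long)
case class Profile(userId: String, info: String)
case class Enriched(trip: TripEvent, info: String)

// broadcast state: one entry per userId, holding the latest streamB record
val profileDescriptor =
  new MapStateDescriptor[String, Profile]("profiles", classOf[String], classOf[Profile])

def enrich(streamA: DataStream[TripEvent], streamB: DataStream[Profile]): DataStream[Enriched] =
  streamA
    .keyBy(a => (a.userId, a.startTripTimestamp))
    .connect(streamB.broadcast(profileDescriptor))
    .process(new KeyedBroadcastProcessFunction[(String, Long), TripEvent, Profile, Enriched] {

      override def processElement(trip: TripEvent, ctx: ReadOnlyContext,
                                  out: Collector[Enriched]): Unit = {
        // look up the latest profile broadcast for this user, if any has arrived yet
        val profile = ctx.getBroadcastState(profileDescriptor).get(trip.userId)
        if (profile != null) out.collect(Enriched(trip, profile.info))
      }

      override def processBroadcastElement(profile: Profile, ctx: Context,
                                           out: Collector[Enriched]): Unit =
        // every parallel instance stores the full set of profiles
        ctx.getBroadcastState(profileDescriptor).put(profile.userId, profile)
    })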
If your streamB state is really huge and could consume considerable memory on the task managers, you should consider approach #2: key both streams by userId to make sure that elements with the same userId reach the same tasks, and then use managed keyed state to maintain the per-key streamB state and enrich streamA elements from that state.
