Why does Flink Stream not support left join expressions? - flink-streaming

Flink Stream supports inner join expressions such as window joins and interval joins, but it does not support left join / full join expressions. Admittedly, the window-cogroup expression can implement the same semantics, but it has to wait a full window-size interval even when events could have been joined immediately (a sketch of this workaround appears below). My questions are:
How can the absence of left join / full join expressions in Flink Stream be explained from a design point of view?
How could I achieve it with the Flink DataStream API (ideally forwarding joined events immediately)?
Is there a way to extend the Flink DataStream API to support a left join like:
.leftJoin()
.where()
.window()
.apply()
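For context, the window-cogroup workaround I am referring to looks roughly like the following minimal sketch (Event and its key field are placeholders, and the two input streams are assumed to have timestamps and watermarks assigned); note that unmatched left elements are only emitted once the window closes:

import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

// given hypothetical DataStream<Event> left, right
DataStream<String> leftJoined = left
    .coGroup(right)
    .where(l -> l.key)
    .equalTo(r -> r.key)
    .window(TumblingEventTimeWindows.of(Time.minutes(1)))
    .apply(new CoGroupFunction<Event, Event, String>() {
        @Override
        public void coGroup(Iterable<Event> lefts, Iterable<Event> rights,
                            Collector<String> out) {
            // emit every match, plus unmatched left elements with no partner,
            // which is left-join semantics, but only at window end
            for (Event l : lefts) {
                boolean matched = false;
                for (Event r : rights) {
                    out.collect(l + " joined " + r);
                    matched = true;
                }
                if (!matched) {
                    out.collect(l + " joined null");
                }
            }
        }
    });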

The difference between temporally constrained joins like windowed or interval joins, and regular joins, is that in a streaming context, regular joins require indefinite state retention.
Regular left/full joins are available using Flink's Table and SQL APIs. The direction the Flink community has been going is to not put any further effort into developing relational operations with the DataStream API, but to instead improve the interoperability between the DataStream and Table APIs. Flink 1.13 marked a new milestone in making it even easier to convert between streams and tables and back again, and this is the recommended approach whenever relational operations on DataStreams are required.
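A minimal sketch of that round trip, assuming Flink 1.13+ and two hypothetical (id, payload) input streams; the field names and sample data are illustrative:

import static org.apache.flink.table.api.Expressions.$;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class LeftJoinViaTableApi {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // hypothetical inputs: (id, payload) pairs
        DataStream<Tuple2<Long, String>> leftStream =
                env.fromElements(Tuple2.of(1L, "l1"), Tuple2.of(2L, "l2"));
        DataStream<Tuple2<Long, String>> rightStream =
                env.fromElements(Tuple2.of(1L, "r1"));

        tableEnv.createTemporaryView("L",
                tableEnv.fromDataStream(leftStream, $("id"), $("payload")));
        tableEnv.createTemporaryView("R",
                tableEnv.fromDataStream(rightStream, $("id"), $("payload")));

        // a regular LEFT JOIN; Flink must retain both sides in state
        // indefinitely unless table.exec.state.ttl is configured
        Table joined = tableEnv.sqlQuery(
                "SELECT L.id, L.payload AS l_payload, R.payload AS r_payload " +
                "FROM L LEFT JOIN R ON L.id = R.id");

        // back to a DataStream; the result is a changelog stream because a
        // regular join may retract and update rows it emitted earlier
        DataStream<Row> result = tableEnv.toChangelogStream(joined);
        result.print();
        env.execute();
    }
}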

Related

How to join two streams without a time window in Flink?

I am getting data from two streams. I want to join these two streams based on a key. For example, consider two streams: data in stream A can come first, or sometimes data in stream B can come first. The matching data can arrive at any time, so I can't use a windowed join. Is it possible to join two unbounded streams in Flink?
I believe a non-windowed Join will behave as you want: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/joins/#regular-joins
If you are using the DataStream API instead of the SQL API, a CoFlatMap operator that keeps the elements from both sides in shared state and joins them whenever an update arrives would allow you to implement this behavior as well.
Take into account that this requires keeping both sides in state forever, which can make your state grow without bound.
The note in the Flink SQL documentation advises setting a state TTL: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/config/#table-exec-state-ttl. The DataStream equivalent is state time-to-live: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/state/#state-time-to-live-ttl. The catch is that if records in state expire and a later update needs to join with an expired element, the result will be incorrect.
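A minimal sketch of that DataStream approach, using hypothetical String-typed streams keyed by a shared key, with state TTL configured so unjoined elements eventually expire (all names and the 7-day TTL are illustrative):

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

// buffers each side in keyed state and emits a result whenever the other
// side is already present; TTL bounds how long unjoined elements are kept
public class StatefulJoin extends RichCoFlatMapFunction<String, String, String> {
    private transient ValueState<String> aState;
    private transient ValueState<String> bState;

    @Override
    public void open(Configuration parameters) {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.days(7))   // illustrative retention period
                .build();
        ValueStateDescriptor<String> aDesc = new ValueStateDescriptor<>("a", String.class);
        ValueStateDescriptor<String> bDesc = new ValueStateDescriptor<>("b", String.class);
        aDesc.enableTimeToLive(ttl);
        bDesc.enableTimeToLive(ttl);
        aState = getRuntimeContext().getState(aDesc);
        bState = getRuntimeContext().getState(bDesc);
    }

    @Override
    public void flatMap1(String a, Collector<String> out) throws Exception {
        aState.update(a);
        if (bState.value() != null) {
            out.collect(a + " joined with " + bState.value());
        }
    }

    @Override
    public void flatMap2(String b, Collector<String> out) throws Exception {
        bState.update(b);
        if (aState.value() != null) {
            out.collect(aState.value() + " joined with " + b);
        }
    }
}

// wiring: streamA.connect(streamB).keyBy(keySelectorA, keySelectorB).flatMap(new StatefulJoin())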

State clean-up behavior with Flink interval join

I am reading
https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/joins/#interval-joins,
which has the following example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.order_id
AND o.order_time BETWEEN s.ship_time - INTERVAL '4' HOUR AND s.ship_time
I have the following two questions:
If o.order_time and s.ship_time are normal time columns, not event time attributes, will all the state be kept in Flink, as with a regular inner join? In that case, potentially large state would be retained.
If o.order_time and s.ship_time are event time attributes, will Flink rely on watermarks to clean up state, so that only a small amount of state is kept?
Yes, that's correct. The reason Flink SQL has the notion of time attributes is so that suitable streaming queries can have their state automatically cleaned up, and an interval join is an example of such a query. Time windows and temporal joins on versioned tables also work in a similar way.
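For illustration, this is roughly how an event time attribute is declared in DDL (a hypothetical table, using the built-in datagen connector); it is the WATERMARK declaration that lets the interval join clean up its state:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class TimeAttributeExample {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // WATERMARK turns order_time into an event time attribute, allowing
        // interval joins on it to drop state once the watermark passes
        tableEnv.executeSql(
                "CREATE TABLE Orders (" +
                "  id STRING," +
                "  order_time TIMESTAMP(3)," +
                "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND" +
                ") WITH ('connector' = 'datagen')");
    }
}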

Stream Joins for Large Time Windows with Flink

I need to join two event sources based on a key. The gap between the events can be up to one year (i.e., event1 with id1 may arrive today, and the corresponding event2 with id1 from the second event source may arrive a year later). Assume I just want to stream out the joined event output.
I am exploring the option of using Flink with the RocksDB backend (I came across the Table APIs, which appear to suit my use case). I have not been able to find reference architectures that do this kind of long-window join. I expect the system to process about 200M events a day.
Questions:
Are there any obvious limitations/pitfalls of using Flink for this kind of long-window join?
Any recommendations on handling this kind of long-window join?
Related: I am also exploring using Lambda with DynamoDB as the state to do stream joins(Related Question). I will be using managed AWS services if this info is relevant.
The obvious challenges of this use case are the large join window size of one year and the high ingestion rate, which together can result in huge state.
The main question here is whether this is a 1:1 join, i.e., whether a record from stream A joins exactly (or at most) once with a record from stream B. This is important because, with a 1:1 join, you can remove a record from state as soon as it has been joined with another record; you don't need to keep it around for the full year. Hence, your state only stores records that have not been joined yet. Assuming that the majority of records are joined quickly, your state might remain reasonably small.
If you have a 1:1 join, the time-window joins of Flink's Table API (and SQL) and the interval join of the DataStream API are not what you want. They are implemented as m:n joins, because every record might join with more than one record of the other input; hence, they keep all records for the full window interval, i.e., for one year in your use case. If you have a 1:1 join, you should implement the join yourself as a KeyedCoProcessFunction, as sketched below.
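A minimal sketch of such a 1:1 join, with hypothetical String payloads; the point is that state is cleared as soon as a pair is matched, so only not-yet-joined records are retained:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class OneToOneJoin extends KeyedCoProcessFunction<String, String, String, String> {
    private transient ValueState<String> pendingA;
    private transient ValueState<String> pendingB;

    @Override
    public void open(Configuration parameters) {
        pendingA = getRuntimeContext().getState(new ValueStateDescriptor<>("pendingA", String.class));
        pendingB = getRuntimeContext().getState(new ValueStateDescriptor<>("pendingB", String.class));
    }

    @Override
    public void processElement1(String a, Context ctx, Collector<String> out) throws Exception {
        String b = pendingB.value();
        if (b != null) {
            out.collect(a + " + " + b);
            pendingB.clear();            // joined: drop the buffered partner
        } else {
            pendingA.update(a);          // buffer until the matching B arrives
            // optionally register a timer a year out to evict never-joined
            // records: ctx.timerService().registerEventTimeTimer(...)
        }
    }

    @Override
    public void processElement2(String b, Context ctx, Collector<String> out) throws Exception {
        String a = pendingA.value();
        if (a != null) {
            out.collect(a + " + " + b);
            pendingA.clear();
        } else {
            pendingB.update(b);
        }
    }
}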
If every record can join multiple times within one year, there's no way around buffering these records. In this case, you can use the time-window joins of Flink's Table API (and SQL) and the Interval join of the DataStream API.

Flink Dynamic Table vs Kafka Streams KTable?

I was reading about the current limitations of joins in Kafka Streams, such as KTable-KTable non-key joins or KTable-GlobalKTable joins.
I discovered that Flink seems to support all of them. From what I read, a dynamic table sounds like a KTable.
I wonder, first of all, whether they are the same concept, and then how Flink achieves this; I could not find documentation about the underlying infrastructure. For instance, I did not find the notion of a broadcast join, which is what happens with a GlobalKTable. Is the underlying infrastructure behind dynamic tables distributed?
Flink's dynamic table and Kafka's KTable are not the same.
In Flink, a dynamic table is a very generic and broad concept, namely a table that evolves over time. This includes arbitrary changes (INSERT, DELETE, UPDATE). A dynamic table does not need a primary key or unique attribute, but it might have one.
A KStream is a special type of dynamic table, namely a dynamic table that only receives INSERT changes, i.e., an ever-growing, append-only table.
A KTable is another type of dynamic table, namely one that has a unique key and is modified by INSERT, DELETE, and UPDATE changes on that key.
Flink supports the following types of joins on dynamic tables. Note that the references to Kafka's joins might not be 100% accurate (happy to fix errors!).
Time-windowed joins should correspond to KSQL's KStream-KStream joins
Temporal table joins are similar to KSQL's KStream-KTable joins. The temporal relation between both tables needs to be explicitly specified in the query to be able to run the same query with identical semantics on batch/offline data.
Regular joins are more generic than KSQL's KTable-KTable joins because they don't require the input tables to have unique keys. Moreover, Flink does not distinguish between primary- or foreign-key joins, but requires that joins are equi-joins, i.e., have at least one equality predicate. At this point, the streaming SQL planner does not support broadcast-forward joins (which I believe should roughly correspond to KTable-GlobalKTable joins).
I am not 100% sure because I don't know all the details of Flink's "dynamic table" concept, but it seems to me it's the same as a KTable in Kafka Streams.
However, there is a difference between a KTable and a GlobalKTable in Kafka Streams; they are not the same thing. (1) A KTable is distributed/sharded, while a GlobalKTable is replicated/broadcasted. (2) A KTable is event time synchronized, while a GlobalKTable is not. For the same reason, a GlobalKTable is fully loaded/bootstrapped on startup, while a KTable is updated based on the changelog records' event timestamps when appropriate (in relation to the event timestamps of the other input streams). Furthermore, during processing, updates to a KTable are event time synchronized, while updates to a GlobalKTable are not (i.e., they are applied immediately and can thus be considered non-deterministic).
Last note: Kafka Streams adds foreign-key KTable-KTable joins in the upcoming 2.4 release. There is also a ticket to add KTable-GlobalKTable joins, but this feature has not been requested very often yet and thus has not been added: https://issues.apache.org/jira/browse/KAFKA-4628

Combining low-latency streams with multiple meta-data streams in Flink (enrichment)

I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to fulfil a kind of ETL setup we are doing in a legacy system today.
A very common scenario is that we have keyed, slow-throughput meta-data streams that we want to use to enrich high-throughput data streams.
This raises two questions concerning Flink: How does one enrich a fast-moving stream with slowly updating streams whose time windows overlap but are not equal (meta-data can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?
I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.
Flink has several mechanisms that can be used for enrichment.
I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.
The simplest approach is probably to use a RichFlatmap and load static enrichment data in its open() method (docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update the enrichment data.
For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.
Assuming you want to actually stream in the enrichment data, a RichCoFlatmap is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, a RichCoFlatmap gives you no ability to take the timing of the stream elements into account. If you are concerned about one stream getting ahead of or behind the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then a CoProcessFunction is the right approach.
You will find a detailed example, plus code, in the Apache Flink training materials.
If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that does become, admittedly, rather awkward at some point. An alternative would be to use a union operator to combine all of the meta-data streams together (note that this requires that all the streams have the same type), followed by a RichCoFlatmap or CoProcessFunction that joins this unified enrichment stream with the primary stream.
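A minimal sketch of that union-then-connect pattern, assuming hypothetical Event, MetaData, and EnrichedEvent types that share a String key; the latest meta-data per key is held in keyed state and attached to each arriving event:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class EnrichmentFunction
        extends KeyedCoProcessFunction<String, Event, MetaData, EnrichedEvent> {

    private transient ValueState<MetaData> latestMeta;

    @Override
    public void open(Configuration parameters) {
        latestMeta = getRuntimeContext()
                .getState(new ValueStateDescriptor<>("latestMeta", MetaData.class));
    }

    @Override
    public void processElement1(Event event, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        // enrich with whatever meta-data has arrived so far for this key
        out.collect(new EnrichedEvent(event, latestMeta.value()));
    }

    @Override
    public void processElement2(MetaData meta, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        latestMeta.update(meta);  // remember the newest meta-data for this key
    }
}

// wiring (hypothetical streams): union the slow meta-data streams, then connect:
//
// DataStream<MetaData> allMeta = meta1.union(meta2, meta3 /* ... */);
// events.keyBy(e -> e.key)
//       .connect(allMeta.keyBy(m -> m.key))
//       .process(new EnrichmentFunction());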
Update:
Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
This example joins orders with their corresponding shipments if the shipment occurred within 4 hours of the order being placed.
