Stream Joins for Large Time Windows with Flink - apache-flink

I need to join two event sources based on a key. The gap between the events can be up to 1 year (i.e., event1 with id1 may arrive today and the corresponding event2 with id1 from the second event source may arrive a year later). Assume I want to just stream out the joined event output.
I am exploring the option of using Flink with the RocksDB backend (I came across the Table APIs, which appear to suit my use case). I am not able to find reference architectures that do this kind of long window join. I expect the system to process about 200M events a day.
Questions:
Are there any obvious limitations/pitfalls of using Flink for this kind of long window join?
Any recommendations on handling this kind of long window join?
Related: I am also exploring using Lambda with DynamoDB as the state to do stream joins(Related Question). I will be using managed AWS services if this info is relevant.

The obvious challenges of this use case are the large join window of one year and the high ingestion rate, which can result in a huge state size.
The main question here is whether this is a 1:1 join, i.e., whether a record from stream A joins exactly (or at most) once with a record from stream B. This is important because, if you have a 1:1 join, you can remove a record from the state as soon as it has been joined with another record; you don't need to keep it around for the full year. Hence, your state only stores records that have not been joined yet. Assuming that the majority of records are joined quickly, your state might remain reasonably small.
If you have a 1:1 join, the time-window joins of Flink's Table API (and SQL) and the Interval join of the DataStream API are not what you want. They are implemented as m:n joins because every record might join with more than one record of the other input. Hence they keep all records for the full window interval, i.e., for one year in your use case. If you have a 1:1 join, you should implement the join yourself as a KeyedCoProcessFunction.
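A minimal sketch of such a 1:1 join as a KeyedCoProcessFunction could look like the following. It assumes hypothetical Event1, Event2, and JoinedEvent POJO types, that both streams are keyed by the join key, and that event-time timestamps are assigned; each record is cleared from state as soon as its partner arrives, and a timer expires records that never find a partner within a year.

import java.time.Duration;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class OneToOneJoin
        extends KeyedCoProcessFunction<String, Event1, Event2, JoinedEvent> {

    private static final long ONE_YEAR_MS = Duration.ofDays(365).toMillis();

    private transient ValueState<Event1> bufferedLeft;
    private transient ValueState<Event2> bufferedRight;

    @Override
    public void open(Configuration parameters) {
        bufferedLeft = getRuntimeContext().getState(
                new ValueStateDescriptor<>("buffered-left", Event1.class));
        bufferedRight = getRuntimeContext().getState(
                new ValueStateDescriptor<>("buffered-right", Event2.class));
    }

    @Override
    public void processElement1(Event1 e1, Context ctx, Collector<JoinedEvent> out)
            throws Exception {
        Event2 partner = bufferedRight.value();
        if (partner != null) {
            out.collect(new JoinedEvent(e1, partner));
            bufferedRight.clear();    // joined: the partner no longer needs to be kept
        } else {
            bufferedLeft.update(e1);  // buffer until the partner arrives
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + ONE_YEAR_MS);
        }
    }

    @Override
    public void processElement2(Event2 e2, Context ctx, Collector<JoinedEvent> out)
            throws Exception {
        Event1 partner = bufferedLeft.value();
        if (partner != null) {
            out.collect(new JoinedEvent(partner, e2));
            bufferedLeft.clear();
        } else {
            bufferedRight.update(e2);
            ctx.timerService().registerEventTimeTimer(ctx.timestamp() + ONE_YEAR_MS);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<JoinedEvent> out) {
        // expire records that never found a partner within one year; a production
        // version would track which buffered record each timer belongs to
        bufferedLeft.clear();
        bufferedRight.clear();
    }
}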
If every record can join multiple times within one year, there's no way around buffering these records. In this case, you can use the time-window joins of Flink's Table API (and SQL) and the Interval join of the DataStream API.
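For reference, a minimal sketch of the m:n case with the DataStream interval join, assuming event-time streams events1 and events2 of the same hypothetical Event1/Event2 types, keyed by getId(). Note that it buffers all records for the full interval of one year in either direction.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

DataStream<JoinedEvent> joined = events1
    .keyBy(Event1::getId)
    .intervalJoin(events2.keyBy(Event2::getId))
    .between(Time.days(-365), Time.days(365))   // records are kept in state for up to one year
    .process(new ProcessJoinFunction<Event1, Event2, JoinedEvent>() {
        @Override
        public void processElement(Event1 left, Event2 right,
                                   Context ctx, Collector<JoinedEvent> out) {
            out.collect(new JoinedEvent(left, right));
        }
    });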

Related

How to join two streams without time window in flink?

I am getting data from two streams. I want to join these two streams based on a key. Sometimes the data in stream A comes first; sometimes the data in stream B comes first. The matching data can arrive at any time. Because of this, I can't use a windowed join. Is it possible to join two unbounded streams in Flink?
I believe a non-windowed Join will behave as you want: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/sql/queries/joins/#regular-joins
If you are using the DataStream API instead of the SQL API, a CoFlatMap operator that keeps the elements from both sides in shared state and joins them whenever there is an update would allow you to implement this behavior as well.
Take into account that this requires keeping both sides in state forever, which can make your state grow indefinitely.
The note in the Flink SQL documentation advises setting a state TTL: https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/config/#table-exec-state-ttl. The DataStream equivalent is state time-to-live (TTL): https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/state/#state-time-to-live-ttl. The problem is that if some records in the state expire and an update arrives that would need to be joined with an expired element, the result will be incorrect.
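As a rough sketch, state TTL in the DataStream API is configured on the state descriptor used by the hand-rolled join operator; the Event type here is a hypothetical placeholder for whatever is buffered, and the commented SQL config line is the Table/SQL-side equivalent.

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

// Expire buffered records that have not been joined within 30 days.
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(30))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

ValueStateDescriptor<Event> bufferedEvent =
        new ValueStateDescriptor<>("buffered-event", Event.class);
bufferedEvent.enableTimeToLive(ttlConfig);

// SQL / Table API equivalent (applies to all stateful operators of the query):
// tableEnv.getConfig().set("table.exec.state.ttl", "30 d");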

state clean up behavior with flink interval join

I am reading https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/dev/table/sql/queries/joins/#interval-joins, which has the following example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.order_id
AND o.order_time BETWEEN s.ship_time - INTERVAL '4' HOUR AND s.ship_time
I have the following two questions:
If o.order_time and s.ship_time are normal timestamp columns, not event-time attributes, will all the state be kept in Flink, as with a regular inner join, so that potentially large state accumulates?
If o.order_time and s.ship_time are event-time attributes, will Flink rely on the watermark to clean up state, so that only a small amount of state is kept?
Yes, that's correct. The reason Flink SQL has the notion of time attributes is so that suitable streaming queries can have their state automatically cleaned up, and an interval join is an example of such a query. Time windows and temporal joins on versioned tables also work in a similar way.
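For reference, a minimal Java Table API sketch of declaring order_time as an event-time attribute via a WATERMARK clause, which is what allows the interval join above to clean up its state; the table schema and connector properties here are illustrative assumptions.

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class OrdersTimeAttribute {
    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // WATERMARK makes order_time an event-time attribute, enabling automatic
        // state cleanup for interval joins on that column.
        tableEnv.executeSql(
            "CREATE TABLE Orders (" +
            "  id STRING, " +
            "  order_time TIMESTAMP(3), " +
            "  WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND " +
            ") WITH (" +
            "  'connector' = 'kafka', " +
            "  'topic' = 'orders', " +
            "  'properties.bootstrap.servers' = 'localhost:9092', " +
            "  'format' = 'json' " +
            ")");
    }
}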

Flink Dynamic Table vs Kafka Stream Ktable?

I was reading about the current limitations of joins in Kafka Streams, such as KTable-KTable non-key joins or KTable-GlobalKTable joins.
I discovered that Flink seems to support all of these. From what I read, a dynamic table sounds like a KTable.
I wonder, first of all, whether they are the same concept, and then how Flink achieves this; I could not find documentation about the underlying infrastructure. For instance, I did not find the notion of a broadcast join as it happens with a GlobalKTable. Is the underlying infrastructure that implements dynamic tables distributed?
Flink's dynamic table and Kafka's KTable are not the same.
In Flink, a dynamic table is a very generic and broad concept, namely a table that evolves over time. This includes arbitrary changes (INSERT, DELETE, UPDATE). A dynamic table does not need a primary key or unique attribute, but it might have one.
A KStream is a special type of dynamic table, namely a dynamic table that only receives INSERT changes, i.e., an ever-growing, append-only table.
A KTable is another type of dynamic table, namely a dynamic table that has a unique key and is modified by INSERT, DELETE, and UPDATE changes on that key.
Flink supports the following types of joins on dynamic tables. Note that the references to Kafka's joins might not be 100% accurate (happy to fix errors!).
Time-windowed joins should correspond to KSQL's KStream-KStream joins.
Temporal table joins are similar to KSQL's KStream-KTable joins. The temporal relation between both tables needs to be explicitly specified in the query to be able to run the same query with identical semantics on batch/offline data.
Regular joins are more generic than KSQL's KTable-KTable joins because they don't require the input tables to have unique keys. Moreover, Flink does not distinguish between primary- or foreign-key joins, but requires that joins are equi-joins, i.e., have at least one equality predicate. At this point, the streaming SQL planner does not support broadcast-forward joins (which I believe should roughly correspond to KTable-GlobalKTable joins).
I am not 100% sure because I don't know all the details of Flink's "dynamic table" concept, but it seems to me it's the same as a KTable in Kafka Streams.
However, there is a difference between a KTable and a GlobalKTable in Kafka Streams; they are not the exact same thing. (1) A KTable is distributed/sharded, while a GlobalKTable is replicated/broadcast. (2) A KTable is event-time synchronized, while a GlobalKTable is not. For the same reason, a GlobalKTable is fully loaded/bootstrapped on startup, while a KTable is updated based on the changelog records' event timestamps when appropriate (in relation to the event timestamps of the other input streams). Furthermore, during processing, updates to a KTable are event-time synchronized, while updates to a GlobalKTable are not (i.e., they are applied immediately and can thus be considered non-deterministic).
Last note: Kafka Streams adds foreign-key KTable-KTable joins in the upcoming 2.4 release. There is also a ticket to add KTable-GlobalKTable joins, but this feature has not been requested very often yet and thus has not been added: https://issues.apache.org/jira/browse/KAFKA-4628

Combining low-latency streams with multiple meta-data streams in Flink (enrichment)

I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to fulfil a kind of ETL setup we are doing in a legacy system today.
A very common scenario is that we have keyed, slow-throughput meta-data streams that we want to use for enrichment on high-throughput data streams.
This raises two questions concerning Flink: How does one enrich a fast-moving stream with slowly updating streams where the time windows overlap but are not equal (meta-data can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?
I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.
Flink has several mechanisms that can be used for enrichment.
I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.
The simplest approach is probably to use a RichFlatMapFunction and load static enrichment data in its open() method (see the docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update the enrichment data.
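A minimal sketch of that first approach follows; the Event, MetaData, and EnrichedEvent types and the loadReferenceData() helper are hypothetical placeholders, not part of Flink's API.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class StaticEnrichment extends RichFlatMapFunction<Event, EnrichedEvent> {

    private transient Map<String, MetaData> referenceData;

    @Override
    public void open(Configuration parameters) throws Exception {
        // loaded once per task at startup; the data is not updated afterwards
        referenceData = loadReferenceData();
    }

    @Override
    public void flatMap(Event event, Collector<EnrichedEvent> out) {
        MetaData meta = referenceData.get(event.getKey());
        if (meta != null) {
            out.collect(new EnrichedEvent(event, meta));
        }
    }

    // placeholder: in practice, read from a file, database, or service
    private Map<String, MetaData> loadReferenceData() {
        return new HashMap<>();
    }
}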
For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.
Assuming you want to actually stream in the enrichment data, a RichCoFlatMapFunction is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, with a RichCoFlatMapFunction you have no ability to take the timing of the stream elements into account. If you are concerned about one stream getting ahead of, or falling behind, the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then using a CoProcessFunction is the right approach.
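A minimal sketch of that stateful approach on two connected, keyed streams; Event, MetaData, and EnrichedEvent are hypothetical types, the metadata side keeps only the latest value per key, and events arriving before their metadata are simply dropped (buffering them is also possible).

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class StreamingEnrichment
        extends RichCoFlatMapFunction<Event, MetaData, EnrichedEvent> {

    private transient ValueState<MetaData> metaState;

    @Override
    public void open(Configuration parameters) {
        metaState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("meta", MetaData.class));
    }

    @Override
    public void flatMap1(Event event, Collector<EnrichedEvent> out) throws Exception {
        MetaData meta = metaState.value();
        if (meta != null) {
            out.collect(new EnrichedEvent(event, meta));
        }
    }

    @Override
    public void flatMap2(MetaData meta, Collector<EnrichedEvent> out) throws Exception {
        metaState.update(meta);   // keep only the most recent metadata for this key
    }
}

It would be applied with something like events.keyBy(Event::getKey).connect(metaStream.keyBy(MetaData::getKey)).flatMap(new StreamingEnrichment()); a CoProcessFunction variant looks much the same but additionally gives access to timestamps and timers.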
You will find a detailed example, plus code, in the Apache Flink training materials.
If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that does become, admittedly, rather awkward at some point. An alternative would be to use a union operator to combine all of the meta-data streams (note that this requires all of the streams to have the same type), followed by a RichCoFlatMapFunction or CoProcessFunction that joins this unified enrichment stream with the primary stream.
Update:
Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
This example joins orders with their corresponding shipments if the shipment occurred within 4 hours of the order being placed.

Adding new aggregations to a time series database

I'm implementing a database system in PostgreSQL to support fast queries about time series data from users. Events are, for example: user U executed action A at time T. Different event types are split into different tables, currently around 20. As the number of events is currently around 20M and will reach 1B pretty soon, I decided to create aggregation tables. The aggregations are, for example: how many users executed at least one action on a particular day, or the total number of actions executed each day.
I have created insert triggers that insert data into the aggregation tables whenever a row is inserted into the event tables. This works great and offers great performance with the current amount of events, and I think it should scale well, too.
However, if I want to create a new aggregation, only events from that point forward would be aggregated. To have all the old events included, they would have to be re-inserted. I see two ways this could be achieved. The first is to create a "re-run" function that essentially does the following:
Find all the tables this aggregation depends on, and all the tables those aggregations depend on, etc., until you have all direct and indirect dependencies.
Copy the tables to temporary tables
Empty the tables and the aggregation tables.
Re-insert data from the temporary tables.
This poses some questions about atomicity. What if an event is inserted after copying? Should one lock all the tables involved during this operation?
The other solution would be to keep track, for each aggregation table, of which rows in the event tables have been aggregated, and then at some point aggregate all the events that are missing from that tracking table. This seems to me less prone to concurrency errors, but it requires a lot of tracking storage.
Are there any other solutions, and if not, which of the above would you choose?
