Flink Dynamic Table vs Kafka Streams KTable? - apache-flink

I was reading about several current limitations of joins in Kafka Streams, such as KTable-KTable non-key joins or KTable-GlobalKTable joins ....
I discovered that Flink seems to support all of them. From what I read, a dynamic table sounds like a KTable.
I wonder, first of all, whether they are the same concept, and then how Flink achieves that; I could not find documentation about the underlying infrastructure. For instance, I did not find the notion of the broadcast join that happens with a GlobalKTable. Is the underlying infrastructure that implements dynamic tables distributed?

Flink's dynamic table and Kafka's KTable are not the same.
In Flink, a dynamic table is a very generic and broad concept, namely a table that evolves over time. This includes arbitrary changes (INSERT, DELETE, UPDATE). A dynamic table does not need a primary key or unique attribute, but it might have one.
A KStream is a special type of dynamic table, namely one that only receives INSERT changes, i.e., an ever-growing, append-only table.
A KTable is another type of dynamic table, namely one that has a unique key and is modified by INSERT, DELETE, and UPDATE changes on that key.
Flink supports the following types of joins on dynamic tables. Note that the references to Kafka's joins might not be 100% accurate (happy to fix errors!).
Time-windowed joins should correspond to KSQL's KStream-KStream joins.
Temporal table joins are similar to KSQL's KStream-KTable joins. The temporal relation between the two tables needs to be explicitly specified in the query so that the same query can be run with identical semantics on batch/offline data.
Regular joins are more generic than KSQL's KTable-KTable joins because they don't require the input tables to have unique keys. Moreover, Flink does not distinguish between primary- or foreign-key joins, but requires that joins are equi-joins, i.e., have at least one equality predicate. At this point, the streaming SQL planner does not support broadcast-forward joins (which I believe should roughly correspond to KTable-GlobalKTable joins).
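As an illustration of the last point, a regular join on two dynamic tables in Flink SQL needs only an equality predicate and no unique keys on either input (a sketch; the table and column names here are made up):

```sql
-- Regular equi-join on two dynamic tables; neither input needs a
-- unique key. Flink keeps both inputs in state indefinitely for such
-- queries, so configuring a state retention time is advisable.
SELECT r.user_id, m.title, r.score
FROM Ratings r
JOIN Movies m ON r.movie_id = m.movie_id;
```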

I am not 100% sure because I don't know all the details of Flink's "dynamic table" concept, but it seems to me it's the same as a KTable in Kafka Streams.
However, there is a difference between a KTable and a GlobalKTable in Kafka Streams; they are not the same thing. (1) A KTable is distributed/sharded, while a GlobalKTable is replicated/broadcast. (2) A KTable is event-time synchronized, while a GlobalKTable is not. For the same reason, a GlobalKTable is fully loaded/bootstrapped on startup, while a KTable is updated based on the changelog records' event timestamps when appropriate (in relation to the event timestamps of the other input streams). Furthermore, during processing, updates to a KTable are event-time synchronized, while updates to a GlobalKTable are not (i.e., they are applied immediately and can thus be considered non-deterministic).
Last note: Kafka Streams adds foreign-key KTable-KTable joins in the upcoming 2.4 release. There is also a ticket to add KTable-GlobalKTable joins, but this feature has not been requested very often yet and thus has not been added: https://issues.apache.org/jira/browse/KAFKA-4628

Related

Can MongoDB documents processed by an aggregation pipeline be affected by external write during pipeline execution?

Stages of a MongoDB aggregation pipeline are always executed sequentially. Can the documents that the pipeline processes be changed between the stages? E.g. if stage1 matches some docs from collection1 and stage2 matches some docs from collection2, can some documents from collection2 be written to during or just after stage1 (i.e. before stage2)? If so, can such behavior be prevented?
Why this is important: say stage2 is a $lookup stage. Lookup is the NoSQL equivalent of an SQL join. In a typical SQL database, a join query is isolated from writes, meaning that while the join is being resolved, the data affected by the join cannot change. I would like to know if I have the same guarantee in MongoDB. Please note that I am coming from the NoSQL world (just not MongoDB) and I understand the paradigm well. No need to suggest e.g. duplicating the data; if there were such a solution, I would not be asking on SO.
Based on my research, a MongoDB read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved. However, the MongoDB documentation does not say anything about aggregation pipeline locks. Does the aggregation pipeline hold read (shared) locks on all the collections it reads? Or just on the collection used by the current pipeline stage?
More context: I need to run a "query" with multiple "joins" through several collections. The query is generated dynamically; I do not know upfront which collections will be "joined". The aggregation pipeline is the supposed way to do that. However, to get consistent "query" data, I need to ensure that no writes are interleaved between the stages of the pipeline.
E.g. a delete between the $match and $lookup stages could remove one of the joined ("lookuped") documents, making the entire result incorrect/inconsistent. Can this happen? How can I prevent it?
#user20042973 already provided a link to https://www.mongodb.com/docs/manual/reference/read-concern-snapshot/#mongodb-readconcern-readconcern.-snapshot- in the very first comment, but considering the follow-up comments and questions from the OP regarding transactions, it seems a full answer is required for clarity.
First of all, transactions are all about writes, not reads. I can't stress this enough, so please read it again: transactions, or what MongoDB introduced as "multi-document transactions", are there to ensure that multiple updates are committed as a single atomic operation. No changes made within a transaction are visible outside of the transaction until it is committed, and all of the changes become visible at once when the transaction is committed. The docs: https://www.mongodb.com/docs/manual/core/transactions/#transactions-and-atomicity
The OP is concerned that concurrent writes to the database can affect the results of an aggregation operation, especially for $lookup operations that query other collections for each matching document from the main collection.
It's a very reasonable concern, as MongoDB has always been eventually consistent and does not guarantee that such lookups will return the same results if the linked collection is changed during aggregation. Generally speaking, it does not even guarantee that a unique key is unique within a cursor that uses its index: if a document is deleted and a new one with the same unique key is inserted, there is a non-zero chance of retrieving both.
The instrument to work around this limitation is called "read concern", not "transaction". There are a number of read concerns available to balance speed against reliability/consistency: https://www.mongodb.com/docs/v6.0/reference/read-concern/ The OP is after the most expensive one, "snapshot", as https://www.mongodb.com/docs/v6.0/reference/read-concern-snapshot/ puts it:
A snapshot is a complete copy of the data in a mongod instance at a specific point in time.
mongod in this context means "the whole thing": all databases, the collections within these databases, and the documents within these collections.
All operations within a query with "snapshot" concern are executed against the same version of data as it was when the node accepted the query.
Transactions use this snapshot read isolation under the hood and can be used to guarantee consistent results for $lookup queries even if there are no writes within the transaction. I'd recommend using the read concern explicitly instead: less overhead, and, more importantly, it clearly shows the intent to the devs who are going to maintain your app.
Now, regarding this part of the question:
Based on my research, a MongoDB read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved.
It would be nice to have a source for this claim. As of today (v5.0+), aggregation is lock-free, i.e., it is not blocked even if another operation holds an exclusive X lock on the collection: https://www.mongodb.com/docs/manual/faq/concurrency/#what-are-lock-free-read-operations-
When it cannot use a lock-free read, it takes an intent shared (IS) lock on the collection. This lock prevents only collection-level write locks, like these ones: https://www.mongodb.com/docs/manual/faq/concurrency/#which-administrative-commands-lock-a-collection-
An IS lock on a collection still allows X locks on documents within the collection: inserting, updating, or deleting a document requires only an intent exclusive (IX) lock on the collection and an exclusive X lock on the single document being affected by the write operation.
The final note: if such read isolation is critical to the business and you must guarantee strict consistency, I'd advise considering SQL databases; they might be more performant than snapshot queries. There are many more factors to consider, so I'll leave that to you. The point is that Mongo shines where eventual consistency is acceptable. It does pretty well with causal consistency within a server session, which gives enough guarantees for a much wider range of use cases. I encourage you to test how well it does with snapshot queries, especially if you are running multiple lookups, which can on their own be slow enough on larger datasets and might not even work without allowing disk use.
Q: Can MongoDB documents processed by an aggregation pipeline be affected by external write during pipeline execution?
A: It depends on how the transactions are isolated from each other.
Snapshot isolation refers to transactions seeing a consistent view of data: transactions can read data from a “snapshot” of data committed at the time the transaction starts. Any conflicting updates will cause the transaction to abort.
MongoDB transactions support a transaction-level read concern and transaction-level write concern. Clients can set an appropriate level of read & write concern, with the most rigorous being snapshot read concern combined with majority write concern.
To achieve this, set readConcern=snapshot and writeConcern=majority on the connection string/session/transaction (but not on the database/collection/operation, since concern settings at those levels are ignored inside a transaction).
Q: Do transactions apply to all aggregation pipeline stages as well?
A: Not all operations are allowed in a transaction.
For example, according to the MongoDB docs, db.collection.aggregate() is allowed in a transaction, but some stages (e.g. $merge) are excluded.
For the full list of operations supported inside a transaction, refer to the MongoDB docs.
Yes, MongoDB documents processed by an aggregation pipeline can be affected by external writes during pipeline execution. This is because the MongoDB aggregation pipeline operates on the data at the time it is processed, and it does not take into account any changes made to the data after the pipeline has started executing.
For example, if a document is being processed by the pipeline and an external write operation modifies or deletes the same document, the pipeline will not reflect those changes in its results. In some cases, this may result in incorrect or incomplete data being returned by the pipeline.
To avoid this situation, you can use MongoDB's snapshot option, which guarantees that the documents returned by the pipeline are a snapshot of the data as it existed at the start of the pipeline execution, regardless of any external writes that occur during the execution. However, this option can affect the performance of the pipeline.
Alternatively, in MongoDB 4.0 and later you can use a transaction, which provides atomicity and consistency for the write operations on the documents during pipeline execution.

DB Table Concurrency Issue in Microservices

How should concurrency-related issues on a DB table be handled if multiple applications are reading from and writing to it? This case may not be specific to microservices.
OPERATION        STATUS
GET_ORDER        COMPLETE
CALCULATE_PRICE  RUNNING
A very basic use case: multiple applications write to the above table. Before writing, they check whether the same operation is already present in RUNNING status. If it is not present, they insert the entry; otherwise they just skip. Both read and write operations are simple SQL queries.
The problem is that two different applications can read at the same time, find that there is no 'CREATE_INVOICE' operation RUNNING, and both insert it, so the table will now look like:
OPERATION        STATUS
GET_ORDER        COMPLETE
CALCULATE_PRICE  RUNNING
CREATE_INVOICE   RUNNING
CREATE_INVOICE   RUNNING
As a result, the table has two duplicate CREATE_INVOICE records. Besides applying a unique constraint on the table, what are the ways to resolve this?
By "2 different applications" do you mean that there are two completely separate applications which create invoices, or just 2 instances of the same application?
If the former, I'd be curious why there are two applications doing the same thing writing to the same DB.
If the latter, those instances will need to coordinate in some way (a uniqueness constraint on the table is an example of such coordination), and it's important to note that this coordination makes the application a little more stateful.
My preferred way of dealing with this would be to be event-driven (e.g., by tapping into database change data capture) and to shard: for instance, when a GET_ORDER record is marked COMPLETE in the DB (resulting in a CDC record being published), that CDC record is always routed, based on the order ID, to the same shard of the invoice-creation application (or the price-calculation application, for that matter; your second table seems to imply that invoice creation can be simultaneous with price calculation), thus avoiding the conflict.
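The uniqueness-constraint coordination mentioned above can make the claim race-free at the database level instead of relying on a read-then-insert in the application. A minimal sketch with SQLite from Python's stdlib (table and operation names are made up; in PostgreSQL the equivalent would be a partial unique index plus INSERT ... ON CONFLICT DO NOTHING):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ops (operation TEXT, status TEXT)")
# At most one RUNNING row per operation, enforced by the database
# instead of by a racy read-then-insert in the application.
conn.execute(
    "CREATE UNIQUE INDEX one_running ON ops(operation) WHERE status = 'RUNNING'"
)

def try_start(operation):
    # Atomic claim: a concurrent second caller inserts nothing.
    cur = conn.execute(
        "INSERT OR IGNORE INTO ops(operation, status) VALUES (?, 'RUNNING')",
        (operation,),
    )
    return cur.rowcount == 1  # True only for the caller that won the race

print(try_start("CREATE_INVOICE"))  # True: first claim succeeds
print(try_start("CREATE_INVOICE"))  # False: duplicate is skipped
```

Each instance then proceeds based on the returned boolean rather than re-querying the table, so the check and the insert can no longer interleave.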

Stream Joins for Large Time Windows with Flink

I need to join two event sources based on a key. The gap between the events can be up to 1 year (i.e., event1 with id1 may arrive today, and the corresponding event2 with id1 from the second event source may arrive a year later). Assume I just want to stream out the joined event output.
I am exploring the option of using Flink with the RocksDB backend (I came across the Table API, which appears to suit my use case). I am not able to find reference architectures that do this kind of long-window join. I am expecting the system to process about 200M events a day.
Questions:
Are there any obvious limitations/pitfalls of using Flink for this kind of long-window join?
Any recommendations on handling this kind of long-window join?
Related: I am also exploring using Lambda with DynamoDB as the state store to do stream joins (related question). I will be using managed AWS services, if this info is relevant.
The obvious challenges of this use case are the large join window size of one year and the high ingestion rate, which can result in a huge state size.
The main question here is whether this is a 1:1 join, i.e., whether a record from stream A joins exactly (or at most) once with a record from stream B. This is important because, if you have a 1:1 join, you can remove a record from the state as soon as it has been joined with another record; you don't need to keep it around for the full year. Hence, your state only stores records that have not been joined yet. Assuming that the majority of records are joined quickly, your state might remain reasonably small.
If you have a 1:1 join, the time-windowed joins of Flink's Table API (and SQL) and the interval join of the DataStream API are not what you want. They are implemented as m:n joins because every record might join with more than one record of the other input. Hence, they keep all records for the full window interval, i.e., for one year in your use case. If you have a 1:1 join, you should implement the join yourself as a KeyedCoProcessFunction.
If every record can join multiple times within one year, there's no way around buffering those records. In that case, you can use the time-windowed joins of Flink's Table API (and SQL) or the interval join of the DataStream API.
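The 1:1 join logic described above (buffer a record until its partner arrives, then emit and clear the state) can be sketched outside Flink; in a KeyedCoProcessFunction the two dicts below would be keyed state, and a timer would evict entries older than one year:

```python
# Minimal sketch of a 1:1 stream join: each key joins at most once,
# so a record is removed from state as soon as its partner arrives.
pending_a = {}  # key -> record from stream A awaiting its partner
pending_b = {}  # key -> record from stream B awaiting its partner
joined = []

def on_a(key, record):
    partner = pending_b.pop(key, None)
    if partner is not None:
        joined.append((key, record, partner))  # emit and clear state
    else:
        pending_a[key] = record  # buffer (in Flink: keyed state + TTL timer)

def on_b(key, record):
    partner = pending_a.pop(key, None)
    if partner is not None:
        joined.append((key, partner, record))
    else:
        pending_b[key] = record

on_a("id1", {"event": "order"})
on_b("id2", {"event": "shipment"})  # no partner yet, so it is buffered
on_b("id1", {"event": "shipment"})  # joins with id1 and frees its state
print(joined)  # [('id1', {'event': 'order'}, {'event': 'shipment'})]
```

Note how only the unmatched records (here, id2) remain in state, which is what keeps the state small when most records join quickly.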

Combining low-latency streams with multiple meta-data streams in Flink (enrichment)

I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to fulfil the kind of ETL setup we run in a legacy system today.
A very common scenario is that we have keyed, slow-throughput metadata streams that we want to use for enrichment of high-throughput data streams, something along the lines of:
This raises two questions concerning Flink: How does one enrich a fast-moving stream with slowly updating streams where the time windows overlap but are not equal (metadata can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?
I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.
Flink has several mechanisms that can be used for enrichment.
I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.
The simplest approach is probably to use a RichFlatmap and load static enrichment data in its open() method (docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update the enrichment data.
For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.
Assuming you want to actually stream in the enrichment data, then a RichCoFlatmap is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, with a RichCoFlatmap you have no ability to take the timing of the stream elements into account. If you are concerned about one stream getting ahead of or behind the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then using a CoProcessFunction is the right approach.
You will find a detailed example, plus code, in the Apache Flink training materials.
If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that does become, admittedly, rather awkward at some point. An alternative would be to use a union operator to combine all of the meta-data streams together (note that this requires that all the streams have the same type), followed by a RichCoFlatmap or CoProcessFunction that joins this unified enrichment stream with the primary stream.
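The union approach can be sketched as follows (pure Python, no Flink; the stream and field names are made up). All metadata streams are first mapped to a common (key, source, payload) shape, and a single stateful operator keeps the latest payload per key and source:

```python
# The unified enrichment operator keeps the latest metadata per (key, source),
# mimicking Flink keyed state updated by the union of all metadata streams.
meta_state = {}

def on_meta(key, source, payload):
    meta_state[(key, source)] = payload  # upsert the slowly changing metadata

def on_event(key, event, sources):
    # Enrich the fast stream with whatever metadata has arrived for this key.
    event["meta"] = {s: meta_state.get((key, s)) for s in sources}
    return event

on_meta("user1", "profile", {"tier": "gold"})
on_meta("user1", "region", {"geo": "EU"})
enriched = on_event("user1", {"action": "click"}, ["profile", "region", "device"])
print(enriched["meta"])
# {'profile': {'tier': 'gold'}, 'region': {'geo': 'EU'}, 'device': None}
```

A CoProcessFunction variant of this would additionally buffer events or use timers when deterministic, event-time-aligned enrichment is required, rather than enriching with whatever metadata happens to have arrived.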
Update:
Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
This example joins orders with their corresponding shipments if the shipment occurred within 4 hours of the order being placed.

Adding new aggregations to a time series database

I'm implementing a database system in PostgreSQL to support fast queries on time series data from users. Events are, for example: user U executed action A at time T. Different event types are split into different tables, currently around 20. As the number of events is currently around 20M and will reach 1B pretty soon, I decided to create aggregation tables. The aggregations are, for example: how many users executed at least one action on a particular day, or the total number of actions executed each day.
I have created insert triggers that insert data into the aggregation tables whenever a row is inserted into the event tables. This works great and offers great performance with the current number of events, and I think it should scale well too.
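A trigger-maintained aggregation of the kind described can be sketched with SQLite from Python's stdlib (the table and column names are made up; in PostgreSQL the trigger would call a PL/pgSQL function instead, but the UPSERT idea is the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE action_events (user_id INTEGER, day TEXT);
CREATE TABLE daily_actions (day TEXT PRIMARY KEY, action_count INTEGER);

-- Keep the per-day total in sync on every insert into the event table.
CREATE TRIGGER agg_daily AFTER INSERT ON action_events
BEGIN
  INSERT INTO daily_actions(day, action_count) VALUES (NEW.day, 1)
  ON CONFLICT(day) DO UPDATE SET action_count = action_count + 1;
END;
""")

conn.executemany(
    "INSERT INTO action_events VALUES (?, ?)",
    [(1, "2024-01-01"), (2, "2024-01-01"), (1, "2024-01-02")],
)
print(conn.execute(
    "SELECT day, action_count FROM daily_actions ORDER BY day"
).fetchall())
# [('2024-01-01', 2), ('2024-01-02', 1)]
```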
However, if I want to create a new aggregation only events from that point forward would be aggregated. To have all the old events included, they would have to be re-inserted. I see two ways this could be achieved. The first is to create a "re-run" function that essentially does the following:
Find all the tables this aggregation depends on, all the tables those aggregations depend on, etc., until you have all direct and indirect dependencies.
Copy the tables to temporary tables.
Empty the tables and the aggregation tables.
Re-insert the data from the temporary tables.
This poses some questions about atomicity. What if an event is inserted after the copy? Should one lock all the tables involved during this operation?
The other solution would be to keep track, for each aggregation table, of which rows in the event tables have been aggregated, and then at some point aggregate all the events that are missing from that tracking table. This seems to me less prone to concurrency errors, but it requires a lot of tracking storage.
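The tracking overhead can be reduced from per-row bookkeeping to a single high-water mark per aggregation, provided the event tables have a monotonically increasing id. A sketch with SQLite from Python's stdlib (all names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE daily_counts (day TEXT PRIMARY KEY, n INTEGER);
-- One high-water mark per aggregation instead of per-row tracking.
CREATE TABLE agg_progress (agg_name TEXT PRIMARY KEY, last_event_id INTEGER);
""")
conn.execute("INSERT INTO agg_progress VALUES ('daily_counts', 0)")

def catch_up():
    # Aggregate only the events that arrived since the last run.
    with conn:  # one transaction: no event is counted twice or skipped
        last, = conn.execute(
            "SELECT last_event_id FROM agg_progress WHERE agg_name='daily_counts'"
        ).fetchone()
        hi, = conn.execute(
            "SELECT COALESCE(MAX(id), ?) FROM events", (last,)
        ).fetchone()
        conn.execute("""
            INSERT INTO daily_counts(day, n)
            SELECT day, COUNT(*) FROM events
            WHERE id > ? AND id <= ? GROUP BY day
            ON CONFLICT(day) DO UPDATE SET n = n + excluded.n
        """, (last, hi))
        conn.execute(
            "UPDATE agg_progress SET last_event_id=? WHERE agg_name='daily_counts'",
            (hi,),
        )

conn.executemany("INSERT INTO events(day) VALUES (?)",
                 [("2024-01-01",), ("2024-01-01",), ("2024-01-02",)])
catch_up()
conn.execute("INSERT INTO events(day) VALUES ('2024-01-02')")
catch_up()
print(conn.execute("SELECT day, n FROM daily_counts ORDER BY day").fetchall())
# [('2024-01-01', 2), ('2024-01-02', 2)]
```

The same catch-up query also backfills a newly added aggregation from id 0, so the "re-run" machinery and the incremental tracking collapse into one mechanism.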
Are there any other solutions, and if not, which of the above would you choose?
