Is RETRACT stream necessary in Flink? - apache-flink

When using a RETRACT stream in Flink, an update is encoded as two messages (a retraction of the old record followed by an insertion of the new one), while with an UPSERT stream it is encoded as a single message, which is more efficient.
However, an UPSERT stream is not allowed when converting a dynamic table to a DataStream according to this page. Why does such a limitation exist? What kinds of problems would we run into if we completely replaced RETRACT with UPSERT in Flink's design?

What kinds of problems will be met when we totally replace RETRACT with UPSERT in Flink's design?
As the page you linked to mentions, upsert streams require a primary key. Getting rid of retract streams as a concept would consequently break all scenarios where you do not have such a key (either because none exists or because none is defined).
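To make the encoding concrete, here is a minimal sketch (using the legacy toRetractStream API that also appears in a question further down) of converting an updating query result to a DataStream; the clicks table and its columns are made up, and tEnv is assumed to be an existing StreamTableEnvironment. Each update to a count arrives as two elements, a (false, oldRow) retraction followed by a (true, newRow) insertion, which is exactly the two-stage behaviour described above:

// Assumes a StreamTableEnvironment tEnv with a registered "clicks" table.
Table counts = tEnv.sqlQuery(
    "SELECT user_name, COUNT(url) AS cnt FROM clicks GROUP BY user_name");

// Each change to a count is emitted as two messages:
//   (false, old row)  -- retraction of the previous value
//   (true,  new row)  -- insertion of the new value
DataStream<Tuple2<Boolean, Row>> retractStream =
    tEnv.toRetractStream(counts, Row.class);
retractStream.print();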

Related

Can MongoDB documents processed by an aggregation pipeline be affected by external write during pipeline execution?

Stages of a MongoDB aggregation pipeline are always executed sequentially. Can the documents that the pipeline processes be changed between the stages? E.g. if stage1 matches some docs from collection1 and stage2 matches some docs from collection2, can some documents from collection2 be written to during or just after stage1 (i.e. before stage2)? If so, can such behavior be prevented?
Why this is important: Say stage2 is a $lookup stage. Lookup is the NoSQL equivalent to SQL join. In a typical SQL database, a join query is isolated from writes. Meaning while the join is being resolved, the data affected by the join cannot change. I would like to know if I have the same guarantee in MongoDB. Please note that I am coming from noSQL world (just not MongoDB) and I understand the paradigm well. No need to suggest e.g. duplicating the data, if there was such a solution, I would not be asking on SO.
Based on my research, a MongoDB read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved. However, the MongoDB documentation does not say anything about aggregation pipeline locks. Does the aggregation pipeline hold read (shared) locks on all the collections it reads? Or just on the collection used by the current pipeline stage?
More context: I need to run a "query" with multiple "joins" across several collections. The query is generated dynamically; I do not know upfront which collections will be "joined". The aggregation pipeline is the supposed way to do that. However, to get consistent "query" data, I need to ensure that no writes are interleaved between the stages of the pipeline.
E.g. a delete between the $match and $lookup stages could remove one of the joined ("lookuped") documents, making the entire result incorrect/inconsistent. Can this happen? How can it be prevented?
#user20042973 already provided a link to https://www.mongodb.com/docs/manual/reference/read-concern-snapshot/#mongodb-readconcern-readconcern.-snapshot- in the very first comment, but considering the follow-up comments and questions from the OP regarding transactions, it seems a full answer is needed for clarity.
So first of all, transactions are all about writes, not reads. I can't stress this enough, so please read it again: transactions, or what MongoDB introduced as "multi-document transactions", are there to ensure that multiple updates are committed as a single atomic operation. No changes made within a transaction are visible outside of the transaction until it is committed, and all of the changes become visible at once when the transaction is committed. The docs: https://www.mongodb.com/docs/manual/core/transactions/#transactions-and-atomicity
The OP is concerned that concurrent writes to the database can affect the results of their aggregation operation, especially for $lookup stages that query other collections for each matching document from the main collection.
It's a very reasonable concern, as MongoDB has always been eventually consistent and does not guarantee that such lookups will return the same results if the linked collection is changed during the aggregation. Generally speaking, it doesn't even guarantee that a unique key is unique within a cursor that uses the corresponding index: if a document is deleted and then a new one with the same unique key is inserted, there is a non-zero chance of retrieving both.
The instrument to work around this limitation is called "read concern", not "transaction". There are a number of read concerns available to balance speed against reliability/consistency: https://www.mongodb.com/docs/v6.0/reference/read-concern/ The OP is after the most expensive one, "snapshot". As https://www.mongodb.com/docs/v6.0/reference/read-concern-snapshot/ puts it:
A snapshot is a complete copy of the data in a mongod instance at a specific point in time.
"mongod" in this context means "the whole thing": all databases, collections within these databases, and documents within these collections.
All operations within a query with "snapshot" read concern are executed against the same version of the data as it was when the node accepted the query.
Transactions use this snapshot read isolation under the hood and can be used to guarantee consistent results for $lookup queries even if there are no writes within the transaction. I'd recommend using the read concern explicitly instead: less overhead, and more importantly it clearly shows the intent to the devs who are going to maintain your app.
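For completeness, a minimal sketch of running such an aggregation with an explicit snapshot read concern using the MongoDB Java sync driver; it assumes MongoDB 5.0+ on a replica set, and the database, collection, and field names (shop, orders, customers, customerId) are made up:

import com.mongodb.ReadConcern;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;
import java.util.Arrays;
import java.util.List;

public class SnapshotAggregation {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            // Read concern "snapshot" pins the whole aggregation, including the $lookup,
            // to a single point-in-time view of the data.
            MongoCollection<Document> orders = client.getDatabase("shop")
                    .getCollection("orders")
                    .withReadConcern(ReadConcern.SNAPSHOT);

            List<Document> pipeline = Arrays.asList(
                    new Document("$match", new Document("status", "shipped")),
                    new Document("$lookup", new Document("from", "customers")
                            .append("localField", "customerId")
                            .append("foreignField", "_id")
                            .append("as", "customer")));

            orders.aggregate(pipeline).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}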
Now, regarding this part of the question:
Based on my research, MongoDb read query acquires a shared (read) lock that prevents writes on the same collection until it is resolved.
It would be nice to have a source for this claim. As of today (v5.0+), aggregation is lock-free, i.e. it is not blocked even if another operation holds an exclusive (X) lock on the collection: https://www.mongodb.com/docs/manual/faq/concurrency/#what-are-lock-free-read-operations-
When it cannot use a lock-free read, it takes an intent shared (IS) lock on the collection. This lock only prevents collection-level write locks, like the ones taken by these administrative commands: https://www.mongodb.com/docs/manual/faq/concurrency/#which-administrative-commands-lock-a-collection-
An IS lock on a collection still allows X locks on documents within the collection: inserting, updating, or deleting a document requires only an intent exclusive (IX) lock on the collection, plus an exclusive X lock on the single document being affected by the write.
A final note: if such read isolation is critical to the business and you must guarantee strict consistency, I'd advise considering SQL databases; they might be more performant than snapshot queries. There are many more factors to consider, so I'll leave that to you. The point is that Mongo shines where eventual consistency is acceptable. It does pretty well with causal consistency within a server session, which gives enough of a guarantee for a much wider range of use cases. I encourage you to test how well it does with snapshot queries, especially if you are running multiple lookups, which can on their own be slow enough on larger datasets and might not even work without allowing disk use.
Q: Can MongoDB documents processed by an aggregation pipeline be affected by external write during pipeline execution?
A: It depends on how the transactions are isolated from each other.
Snapshot isolation refers to transactions seeing a consistent view of data: transactions can read data from a “snapshot” of data committed at the time the transaction starts. Any conflicting updates will cause the transaction to abort.
MongoDB transactions support a transaction-level read concern and transaction-level write concern. Clients can set an appropriate level of read & write concern, with the most rigorous being snapshot read concern combined with majority write concern.
To achieve this, set readConcern=snapshot and writeConcern=majority on the connection string, session, or transaction (not on the database/collection/operation level, since those concern settings are ignored inside a transaction).
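A hedged sketch of setting these options at the transaction level with the MongoDB Java sync driver; client, orders, and pipeline are assumed to exist as in the earlier snippet in this thread:

import com.mongodb.ReadConcern;
import com.mongodb.TransactionOptions;
import com.mongodb.WriteConcern;
import com.mongodb.client.ClientSession;

TransactionOptions txnOptions = TransactionOptions.builder()
        .readConcern(ReadConcern.SNAPSHOT)       // transaction-level read concern
        .writeConcern(WriteConcern.MAJORITY)     // transaction-level write concern
        .build();

try (ClientSession session = client.startSession()) {
    session.withTransaction(() -> {
        // Pass the session so the aggregation runs inside the transaction;
        // per-operation read/write concern settings would be ignored here.
        return orders.aggregate(session, pipeline).into(new java.util.ArrayList<>());
    }, txnOptions);
}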
Q: Do transactions apply to all aggregation pipeline stages as well?
A: Not all operations are allowed in a transaction.
For example, according to the MongoDB docs, db.collection.aggregate() is allowed in a transaction, but some stages (e.g. $merge) are excluded.
For the full list of operations supported inside a transaction, refer to the MongoDB docs.
Yes, MongoDB documents processed by an aggregation pipeline can be affected by external writes during pipeline execution. This is because the aggregation pipeline operates on the data as it is read during processing and does not, by default, isolate itself from changes made to the data after the pipeline has started executing.
For example, if a document is being processed by the pipeline and an external write operation modifies or deletes that same document, the pipeline will not necessarily reflect those changes in its results. In some cases, this may result in incorrect or incomplete data being returned by the pipeline.
To avoid this situation, you can use MongoDB's "snapshot" read concern, which guarantees that the documents returned by the pipeline are a snapshot of the data as it existed at the start of the pipeline execution, regardless of any external writes that occur during execution. However, this option can affect the performance of the pipeline.
Alternatively, in MongoDB 4.0 and later you can use a transaction, which provides atomicity and a consistent view of the documents while the pipeline executes.

Flink: Handling Keyed Streams with data older than application watermark

I'm using Flink with a Kinesis source and event-time keyed windows. The application will be listening to a live stream of data, windowing (event-time windows) and processing each keyed stream. I have another use case where I also need to be able to support backfill of older data for certain key streams (these will be new key streams with event time < watermark).
Given that I'm using watermarks, this poses a problem since Flink doesn't support per-key watermarks. Hence any keyed stream used for backfill will end up being ignored, since the event time for this stream will be less than the application watermark maintained by the live stream.
I have gone through other similar questions but wasn't able to find a workable approach.
Here are possible approaches I'm considering but still have some open questions.
Possible Approach - 1
(i) Maintain a copy of the application specifically for backfill purposes. The backfill job will happen rarely (~ a few times a month). The stream of data sent to the application copy will have indicators for start and stop in the stream. Using those, I plan on starting/resetting the watermark.
Open question: Is it possible to reset the watermark using an indicator from the stream? I understand that this is not best practice, but I can't think of an alternative solution.
Follow up to : Clear Flink watermark state in DataStream [No definitive solution provided.]
Possible Approach - 2
Have parallel instances for each key, since it's possible to have a different watermark per task. -> Not going with this since I'll have > 5k keyed streams.
Let me know if any other details are needed.
You can address this by running the backfill jobs in BATCH execution mode. When the DataStream API operates in batch mode, the input is bounded (finite), and known in advance. This allows Flink to sort the input by key and by timestamp, and the processing will proceed correctly according to event time without any concern for watermarks or late events.
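For illustration, a minimal sketch of the backfill job's setup (Flink 1.12+); only the runtime-mode line differs from the live streaming job, and the rest of the pipeline is assumed to be the same keyed, event-time windowed logic:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// In BATCH mode the input is treated as bounded and is sorted by key and timestamp,
// so event-time logic runs correctly without watermarks or late-data handling.
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
// ... build the same keyed, event-time windowed pipeline as the live job ...

The mode can also be selected at submission time (execution.runtime-mode=BATCH), which keeps the same job code usable for both the live and the backfill runs.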

Flink: Write retract stream to Kafka sink

Can anyone share a working example of writing a retract stream to a Kafka sink?
I tried the code below, which is not working.
DataStream<Tuple2<Boolean, User>> resultStream =
        tEnv.toRetractStream(result, User.class);
resultStream.addSink(new FlinkKafkaProducer(OutputTopic,
        new ObjSerializationSchema(OutputTopic),
        props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE));
Generally, the simplest solution would be to do something like:
resultStream.map(elem -> elem.f1)
This will allow you to write the User objects to Kafka.
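A hedged sketch of wiring that map into the original sink, assuming ObjSerializationSchema serializes User objects rather than Tuple2<Boolean, User>:

DataStream<User> userStream = resultStream
        .map(elem -> elem.f1)
        .returns(User.class);   // helps Flink's type extraction for the lambda

userStream.addSink(new FlinkKafkaProducer<>(
        OutputTopic,
        new ObjSerializationSchema(OutputTopic),
        props,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE));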
But this isn't really that simple from a business point of view, or at least it depends on the use case. Kafka is an append-only log, and a retract stream represents ADD, UPDATE and DELETE operations. So, while the solution above will allow you to write the data to Kafka, the results in Kafka may not correctly represent the actual computation results, because they won't represent the update and delete operations.
To be able to write the actual, correct computation results to Kafka, you may try one of the following things:
If you know that your use case will never cause any DELETE or UPDATE operations, then you can safely use the solution above.
If updates may only be produced within some bounded interval (for example, a record may only be updated/deleted up to 1 hour after it is produced), you may want to use windows to aggregate all updates and write one final result to Kafka.
Finally, you can extend the User class with a field that marks whether the record is a retract operation and keep that information when writing data to the Kafka topic. This means that you would have to handle all possible UPDATE or DELETE operations downstream (in the consumer of this data).
The easiest solution would be to use the upsert-kafka connector as a table sink. This is designed to consume a retract stream and write it to Kafka.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html
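For example, a hedged sketch of such a sink defined via the Table API; the topic, bootstrap servers, and schema are placeholders, and result is assumed to be the updating Table from the question:

tEnv.executeSql(
    "CREATE TABLE users_sink (" +
    "  user_id STRING," +
    "  user_name STRING," +
    "  PRIMARY KEY (user_id) NOT ENFORCED" +
    ") WITH (" +
    "  'connector' = 'upsert-kafka'," +
    "  'topic' = 'users-output'," +
    "  'properties.bootstrap.servers' = 'localhost:9092'," +
    "  'key.format' = 'json'," +
    "  'value.format' = 'json'" +
    ")");

// Writing the updating result table into the sink emits upserts and deletes keyed by user_id.
result.executeInsert("users_sink");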

Flink Dynamic Table vs Kafka Streams KTable?

I was reading about the current limitations of joins in Kafka Streams, such as KTable-KTable non-key joins or KTable-GlobalKTable joins...
I discovered that Flink seems to support all of them. From what I read, a dynamic table sounds like a KTable.
I wonder, first of all, whether they are the same concept, and then how Flink achieves it; I could not find documentation about the underlying infrastructure. For instance, I did not find the notion of the broadcast join that happens with a GlobalKTable. Is the underlying infrastructure behind dynamic tables distributed?
Flink's dynamic table and Kafka's KTable are not the same.
In Flink, a dynamic table is a very generic and broad concept, namely a table that evolves over time. This includes arbitrary changes (INSERT, DELETE, UPDATE). A dynamic table does not need a primary key or unique attribute, but it might have one.
A KStream is a special type of dynamic table, namely a dynamic table that is only receiving INSERT changes, i.e., an ever-growing, append-only table.
A KTable is another type of dynamic table, i.e., a dynamic table that has a unique key and changes with INSERT, DELETE, and UPDATE changes on the key.
Flink supports the following types of joins on dynamic tables. Note that the references to Kafka's joins might not be 100% accurate (happy to fix errors!).
Time-windowed joins should correspond to KSQL's KStream-KStream joins
Temporal table joins are similar to KSQL's KStream-KTable joins. The temporal relation between both tables needs to be explicitly specified in the query to be able to run the same query with identical semantics on batch/offline data.
Regular joins are more generic than KSQL's KTable-KTable joins because they don't require the input tables to have unique keys. Moreover, Flink does not distinguish between primary- or foreign-key joins, but requires that joins are equi-joins, i.e., have at least one equality predicate. At this point, the streaming SQL planner does not support broadcast-forward joins (which I believe should roughly correspond to KTable-GlobalKTable joins).
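As an illustration of the last category, a hedged sketch of a regular (non-windowed) join in Flink SQL issued from the Table API; the table and column names are made up, and note that in streaming mode both inputs are kept in state indefinitely unless idle-state retention is configured:

Table joined = tEnv.sqlQuery(
    "SELECT o.order_id, o.amount, c.customer_name " +
    "FROM Orders o JOIN Customers c ON o.customer_id = c.customer_id");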
I am not 100% sure because I don't know all the details of Flink's "dynamic table" concept, but it seems to me it's the same as a KTable in Kafka Streams.
However, there is a difference between a KTable and a GlobalKTable in Kafka Streams, and they are not the exact same thing. (1) A KTable is distributed/sharded while a GlobalKTable is replicated/broadcast. (2) A KTable is event-time synchronized while a GlobalKTable is not. For the same reason, a GlobalKTable is fully loaded/bootstrapped on startup, while a KTable is updated based on the changelog records' event timestamps when appropriate (in relation to the event timestamps of other input streams). Furthermore, during processing, updates to a KTable are event-time synchronized while updates to a GlobalKTable are not (i.e., they are applied immediately and can thus be considered non-deterministic).
One last note: Kafka Streams adds foreign-key KTable-KTable joins in the upcoming 2.4 release. There is also a ticket to add KTable-GlobalKTable joins, but this feature has not been requested very often yet and thus has not been added: https://issues.apache.org/jira/browse/KAFKA-4628

Combining low-latency streams with multiple meta-data streams in Flink (enrichment)

I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to fulfil the kind of ETL setup we are doing in a legacy system today.
A very common scenario is that we have keyed, slow-throughput meta-data streams that we want to use for enrichment on high-throughput data streams.
This raises two questions concerning Flink: How does one enrich a fast-moving stream with slowly updating streams where the time windows overlap but are not equal (meta-data can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?
I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.
Flink has several mechanisms that can be used for enrichment.
I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.
The simplest approach is probably to use a RichFlatMapFunction and load static enrichment data in its open() method (see the docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update the enrichment data.
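A minimal sketch of that first approach, assuming the enrichment data fits in memory; Event, MetaData, EnrichedEvent, and MetaDataLoader are hypothetical types:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;
import java.util.Map;

public class StaticEnrichment extends RichFlatMapFunction<Event, EnrichedEvent> {

    private transient Map<String, MetaData> metaByKey;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Loaded once per task instance when the job starts.
        metaByKey = MetaDataLoader.loadAll();
    }

    @Override
    public void flatMap(Event event, Collector<EnrichedEvent> out) {
        MetaData meta = metaByKey.get(event.getKey());
        if (meta != null) {
            out.collect(new EnrichedEvent(event, meta));
        }
    }
}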
For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.
Assuming you want to actually stream in the enrichment data, then a RichCoFlatMapFunction is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, with a RichCoFlatMapFunction you have no ability to take the timing of the stream elements into account. If you are concerned about one stream getting ahead of, or behind, the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then using a CoProcessFunction is the right approach.
You will find a detailed example, plus code, in the Apache Flink training materials.
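Here is a hedged sketch of the connected-streams approach with keyed state; Event, MetaData, and EnrichedEvent are again hypothetical types, and both streams are assumed to be keyed by the same field:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichmentJoin extends RichCoFlatMapFunction<Event, MetaData, EnrichedEvent> {

    private transient ValueState<MetaData> metaState;

    @Override
    public void open(Configuration parameters) {
        metaState = getRuntimeContext().getState(
            new ValueStateDescriptor<>("meta", MetaData.class));
    }

    @Override
    public void flatMap1(Event event, Collector<EnrichedEvent> out) throws Exception {
        MetaData meta = metaState.value();
        if (meta != null) {
            out.collect(new EnrichedEvent(event, meta));
        }
        // Otherwise one could buffer the event in state until the metadata arrives.
    }

    @Override
    public void flatMap2(MetaData meta, Collector<EnrichedEvent> out) throws Exception {
        metaState.update(meta);   // latest metadata wins
    }
}

// Usage:
// dataStream.keyBy(Event::getKey)
//     .connect(metaStream.keyBy(MetaData::getKey))
//     .flatMap(new EnrichmentJoin());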
If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that does become, admittedly, rather awkward at some point. An alternative would be to use a union operator to combine all of the meta-data streams together (note that this requires that all the streams have the same type), followed by a RichCoFlatmap or CoProcessFunction that joins this unified enrichment stream with the primary stream.
Update:
Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
This example joins orders with their corresponding shipments if the shipment occurred within 4 hours of the order being placed.
