Can anyone share a working example of writing a retract stream to a Kafka sink?
I tried the following, which is not working:
DataStream<Tuple2<Boolean, User>> resultStream =
        tEnv.toRetractStream(result, User.class);

resultStream.addSink(new FlinkKafkaProducer(OutputTopic, new ObjSerializationSchema(OutputTopic),
        props, FlinkKafkaProducer.Semantic.EXACTLY_ONCE));
Generally, the simplest solution would be to do something like:
resultStream.map(elem -> elem.f1)
This will allow you to write the User objects to Kafka.
But this isn't really that simple from a business point of view, or at least it depends on the use case. Kafka is an append-only log, while a retract stream represents ADD, UPDATE and DELETE operations. So, while the solution above will let you write the data to Kafka, the contents of the topic may not correctly represent the actual computation results, because they won't capture the update and delete operations.
To write the actual, correct computation results to Kafka, you can try one of the following:
If you know that your use case will never produce any DELETE or UPDATE operations, then you can safely use the solution above.
If updated or deleted records can only be produced within some bounded interval (for example, a record can only be updated or deleted within 1 hour of being produced), you may want to use windows to aggregate all the updates and write one final result per key to Kafka.
Finally, you can extend the User class with a field that marks whether the record is a retraction, and keep that information when writing to the Kafka topic. This means that all possible UPDATE or DELETE operations have to be handled downstream, in the consumer of this data; a sketch of this option follows below.
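For that last option, a minimal sketch might look like the code below. UserWithOp and UserWithOpSerializationSchema are hypothetical names introduced here for illustration, not part of the original code, and the serialization schema is assumed to know how to serialize the wrapper type.

// Hypothetical wrapper type carrying the retract flag alongside the payload.
public class UserWithOp {
    public String opType; // "UPSERT" or "DELETE"
    public User user;

    public UserWithOp() {}

    public UserWithOp(String opType, User user) {
        this.opType = opType;
        this.user = user;
    }
}

// Map each (isAdd, User) tuple into the wrapper before writing to Kafka.
DataStream<UserWithOp> withOps = resultStream
        .map(elem -> new UserWithOp(elem.f0 ? "UPSERT" : "DELETE", elem.f1))
        .returns(UserWithOp.class);

withOps.addSink(new FlinkKafkaProducer<UserWithOp>(
        OutputTopic,
        new UserWithOpSerializationSchema(OutputTopic), // assumed serializer for the wrapper
        props,
        FlinkKafkaProducer.Semantic.EXACTLY_ONCE));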
The easiest solution would be to use the upsert-kafka connector as a table sink. It is designed to consume a retract stream and write it to Kafka.
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/connectors/upsert-kafka.html
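For illustration, a table sink defined with the upsert-kafka connector might look roughly like this; the table name, columns, topic, broker address, and formats below are placeholders, not taken from the question:

// Register an upsert-kafka sink; the primary key determines which Kafka messages
// are upserts and which become tombstones (deletes).
tEnv.executeSql(
    "CREATE TABLE users_sink (" +
    "  user_id BIGINT," +
    "  user_name STRING," +
    "  PRIMARY KEY (user_id) NOT ENFORCED" +
    ") WITH (" +
    "  'connector' = 'upsert-kafka'," +
    "  'topic' = 'output-topic'," +
    "  'properties.bootstrap.servers' = 'localhost:9092'," +
    "  'key.format' = 'json'," +
    "  'value.format' = 'json'" +
    ")");

// Inserting the updating result table into this sink handles retractions for you.
result.executeInsert("users_sink");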
Related
When using a RETRACT stream in Flink, an update is expressed in two stages (a retraction message followed by a new message), while with an UPSERT stream it is a single message, which is more efficient.
However, an UPSERT stream is not allowed when converting a dynamic table to a DataStream, according to this page. Why does such a limitation exist? What kinds of problems would arise if we completely replaced RETRACT with UPSERT in Flink's design?
What kinds of problems would arise if we completely replaced RETRACT with UPSERT in Flink's design?
As the page you linked to mentions, upsert streams require a primary key. Getting rid of retract streams as a concept would consequently break all scenarios where you do not have such a key (either none exists at all, or it simply isn't defined).
I'm implementing a real-time streaming ETL pipeline using Apache Flink. The pipeline has these characteristics:
Ingest a single Kinesis stream: stream-A
The stream has records of type EventA which have a category_id, representing distinct logical streams
Because of how they are written to Kinesis (separate producer per category_id, writing serially), these logical streams are guaranteed to be read in order by FlinkKinesisConsumer
Flink does some in-order processing work, keyed by the category_id, generating a stream of EventB data records
These records are written to Kinesis stream-B
A separate service ingests the data from stream-B and it is important that this happens in order.
The processing looks something like this:
val in_events = env.addSource(new FlinkKinesisConsumer[EventA]( // these are guaranteed ordered
  "stream-A",
  new EventASchema,
  consumerConfig))

val out_events = in_events
  .keyBy(event => event.category_id)
  .process(new EventAStreamProcessor)

out_events.addSink(new FlinkKinesisProducer[EventB](
  "stream-B",
  new EventBSchema,
  producerConfig))

// a separate service reads the out_events and wants them in order
Based on the guidelines here, it seems it is impossible to guarantee the ordering of the EventB records written to the sink. I only care that events with the same category_id are written in order, since the downstream service will keyBy this.

Thinking from first principles, if I were to implement the threading manually, I would keep a separate queue per category_id KeyedStream and ensure those queues are drained serially to Kinesis (this seems like a strict generalization of what is done by default, which is to use a thread pool with a single global queue). Does the FlinkKinesisProducer support this mechanism, or is there a way around this limitation using Flink's keyBy or a similar construct? A separate sink per category_id, maybe? For this last option, I'm anticipating 100k category_ids, so it might have too much memory overhead.
One option is to buffer the events read from stream-B in the downstream service and re-order them there (with high probability of success if the buffer window is large). In theory this should work, but it makes the downstream service more complex than it needs to be, precludes determinism since it depends on the random timing of network calls, and, more importantly, adds latency to the pipeline (though maybe less latency overall than forcing serial writes to stream-B?). So ideally I'm hoping to go with another option. This feels like a common problem, so perhaps there are more clever solutions out there, or I'm missing something obvious.
Many thanks in advance.
I want to read history from state. If the state is null, I read HBase, update the state, and use onTimer to implement a state TTL. The problem is how to read HBase in batches, because reading single records from HBase is not efficient.
In general, if you want to cache/mirror state from an external database in Flink, the most performant approach is to stream the database mutations into Flink -- in other words, turn Flink into a replication endpoint for the database's change data capture (CDC) stream, if the database supports that.
I have no experience with HBase, but https://github.com/mravi/hbase-connect-kafka is an example of something that might work (by putting Kafka in between HBase and Flink).
If you would rather query HBase from Flink, and want to avoid making point queries for one user at a time, then you could build something like this:
               -> queryManyUsers -> keyBy(uId) ->
streamToEnrich                                     CoProcessFunction
               -> keyBy(uId) -------------------->
Here you would split your stream, sending one copy through something like a window, a process function, or async I/O to query HBase in batches, and send the results into a CoProcessFunction that holds the cache and does the enrichment.
When records arrive in this CoProcessFunction directly, along the bottom path, if the necessary data is in the cache, then it is used. Otherwise the record is buffered, pending the arrival of data for the cache from the upper path.
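A minimal sketch of such a cache-and-buffer CoProcessFunction, in Java, might look like the following. UserData, Event, and EnrichedEvent are hypothetical types standing in for the batched HBase results, the records to enrich, and the joined output; both inputs are assumed to be keyed by the user id, as in the diagram above.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.CoProcessFunction;
import org.apache.flink.util.Collector;

public class CachedEnrichmentFunction
        extends CoProcessFunction<UserData, Event, EnrichedEvent> {

    // Cached enrichment data for the current key, filled from the HBase-query path.
    private ValueState<UserData> cache;
    // Events that arrived before their enrichment data; replayed once the cache is filled.
    private ListState<Event> pending;

    @Override
    public void open(Configuration parameters) {
        cache = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cache", UserData.class));
        pending = getRuntimeContext().getListState(
                new ListStateDescriptor<>("pending", Event.class));
    }

    @Override
    public void processElement1(UserData data, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        // Upper path: results of the batched HBase lookups populate the cache.
        cache.update(data);
        for (Event e : pending.get()) {
            out.collect(new EnrichedEvent(e, data));
        }
        pending.clear();
    }

    @Override
    public void processElement2(Event event, Context ctx, Collector<EnrichedEvent> out)
            throws Exception {
        // Lower path: enrich directly if cached, otherwise buffer until the cache data arrives.
        UserData data = cache.value();
        if (data != null) {
            out.collect(new EnrichedEvent(event, data));
        } else {
            pending.add(event);
        }
    }
}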
I am evaluating Flink for a streaming analytics scenario and haven't found sufficient information on how to accomplish the kind of ETL setup we run in a legacy system today.
A very common scenario for us is that we have keyed, slow-throughput meta-data streams that we want to use to enrich high-throughput data streams.
This raises two questions concerning Flink: How does one enrich a fast moving stream with slowly updating streams where the time windows overlap, but are not equal (Meta-data can live for days while data lives for minutes)? And how does one efficiently join multiple (up to 10) streams with Flink, say one data stream and nine different enrichment streams?
I am aware that I can fulfil my ETL scenario with non-windowed external ETL caches, for example with Redis (which is what we use today), but I wanted to see what possibilities Flink offers.
Flink has several mechanisms that can be used for enrichment.
I'm going to assume that all of the streams share a common key that can be used to join the corresponding items.
The simplest approach is probably to use a RichFlatMapFunction and load static enrichment data in its open() method (see the docs about rich functions). This is only suitable if the enrichment data is static, or if you are willing to restart the enrichment job whenever you want to update it. A rough sketch follows below.
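A rough sketch, in Java, assuming the enrichment data fits in memory; Event, MetaData, EnrichedEvent, and the ReferenceData.load() helper are hypothetical placeholders for however the static data is actually read at startup.

import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class StaticEnrichmentFunction extends RichFlatMapFunction<Event, EnrichedEvent> {

    private transient Map<String, MetaData> referenceData;

    @Override
    public void open(Configuration parameters) {
        // Loaded once per task at startup; restarting the job is the only way to refresh it.
        referenceData = ReferenceData.load();
    }

    @Override
    public void flatMap(Event event, Collector<EnrichedEvent> out) {
        MetaData meta = referenceData.get(event.key);
        if (meta != null) {
            out.collect(new EnrichedEvent(event, meta));
        }
    }
}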
For the other approaches described below, you should store the enrichment data as managed, keyed state (see the docs about working with state in Flink). This will enable Flink to restore and resume your enrichment job in the case of failures.
Assuming you want to actually stream in the enrichment data, then a RichCoFlatMapFunction is more appropriate. This is a stateful operator that can be used to merge or join two connected streams. However, with a RichCoFlatMapFunction you have no ability to take the timing of the stream elements into account. If you are concerned about one stream getting ahead of, or behind, the other, for example, and want the enrichment to be performed in a repeatable, deterministic fashion, then using a CoProcessFunction is the right approach. The wiring looks something like the sketch below.
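A minimal wiring sketch, assuming both streams can be keyed by the same field (the stream names, key field, and EnrichmentCoFlatMap class are placeholders):

// Connect the keyed data stream with the keyed meta-data stream and enrich it
// with a stateful two-input function.
DataStream<EnrichedEvent> enriched = events
        .keyBy(e -> e.key)
        .connect(metadata.keyBy(m -> m.key))
        .flatMap(new EnrichmentCoFlatMap()); // extends RichCoFlatMapFunction<Event, MetaData, EnrichedEvent>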
You will find a detailed example, plus code, in the Apache Flink training materials.
If you have many streams (e.g., 10) to join, you can cascade a series of these two-input CoProcessFunction operators, but that does become, admittedly, rather awkward at some point. An alternative would be to use a union operator to combine all of the meta-data streams together (note that this requires all of those streams to have the same type), followed by a RichCoFlatMapFunction or CoProcessFunction that joins this unified enrichment stream with the primary stream, roughly as sketched below.
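For example, assuming the meta-data streams share a common MetaData type (again, all names here are placeholders):

// Combine all meta-data streams into one, then join it with the primary stream.
DataStream<MetaData> allMetadata = metadata1.union(metadata2, metadata3 /*, ... */);

DataStream<EnrichedEvent> enriched = events
        .keyBy(e -> e.key)
        .connect(allMetadata.keyBy(m -> m.key))
        .process(new EnrichmentCoProcessFunction()); // a CoProcessFunction<Event, MetaData, EnrichedEvent>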
Update:
Flink's Table and SQL APIs can also be used for stream enrichment, and Flink 1.4 expands this support by adding streaming time-windowed inner joins. See Table API joins and SQL joins. For example:
SELECT *
FROM Orders o, Shipments s
WHERE o.id = s.orderId AND
o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime
This example joins orders with their corresponding shipments if the shipment occurred within 4 hours of the order being placed.
I often use Camel's idempotent consumer pattern to prevent duplicate processing of discrete messages. What's the best practice for doing this when the data stream in question is a large volume of messages, each with a timestamp?
Consider this route configuration (pseudocode):
timer -> idempotent( search_splunk_as_batch -> split -> sql(insert))
We want to periodically query from splunk and write to sql. We don't want to miss any messages and we don't want any duplicate messages.
Instead of persisting an idempotent marker for each message, I'd like to note the cutoff time for each batch and begin the next query at the cutoff time.
Your method will probably work as long as you can rely on some assumptions:
Your indexers never load data that appears in the past (according to the _time field)
Your Camel route never runs in more than one process at a time writing to the same database table.
If you can make sure these are met, then you can just store the maximum timestamp you receive from each search and use it with the "earliest" parameter of the next Splunk search. Storing and retrieving the max timestamp could be done with something like a file, a separate database table, or a column in your target table. A rough sketch follows below.
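A rough, Camel-agnostic sketch of the cutoff bookkeeping, assuming the cutoff is kept in a small local file and spliced into the query via Splunk's earliest time modifier; the file location and the search string are placeholders:

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class BatchCutoff {

    private static final Path CUTOFF_FILE = Paths.get("splunk-cutoff.txt"); // placeholder location

    // Read the cutoff persisted by the previous batch (epoch seconds), or a default for the first run.
    public static long readCutoff() throws IOException {
        if (Files.exists(CUTOFF_FILE)) {
            return Long.parseLong(Files.readString(CUTOFF_FILE, StandardCharsets.UTF_8).trim());
        }
        return 0L;
    }

    // Persist the maximum _time seen in this batch so the next query starts where this one ended.
    public static void storeCutoff(long maxTimeSeen) throws IOException {
        Files.writeString(CUTOFF_FILE, Long.toString(maxTimeSeen), StandardCharsets.UTF_8);
    }

    // Build the next search, bounded below by the stored cutoff.
    public static String nextSearch() throws IOException {
        return "search index=main sourcetype=mydata earliest=" + readCutoff(); // placeholder query
    }
}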