Can I use KSQL to generate processing-time timeouts?

I am trying to use KSQL to do whatever processing I can within a time limit and get the results at that time limit. See Timely (and Stateful) Processing with Apache Beam under "Processing Time Timers" for the same idea illustrated using Apache Beam.
Given:
A stream of transactions with unique keys;
Updates to these transactions in the same stream; and
A downstream processor that wants to receive the updated transactions at a specific timeout - say 20 seconds - after the transactions appeared in the first stream.
Conceptually, I was thinking of creating a KTable of the first stream to hold the latest state of the transactions, and using KSQL to create an output stream by querying the KTable for keys with (create_time + timeout) < current_time (and adding the timeouts as "updates" to the first stream so I could filter those out from the KTable).
I haven't found a way to do this in the KSQL docs, and even if there were a built-in current_time, I'm not sure it would be evaluated until another record came down the stream.
How can I do this in KSQL? Do I need a custom UDF? If it can't be done in KSQL, can I do it in KStreams?
=====
Update: It looks like KStreams does not support this today - Apache Flink appears to be the way to go for this use case (and many others). If you know of a clever way around KStreams' limitations, tell me!

Take a look at the punctuate() functionality in the Processor API of Kafka Streams, which might be what you are looking for. You can use punctuate() with stream-time (the default, i.e. event-time) as well as with processing-time (via PunctuationType.WALL_CLOCK_TIME). Here, you would implement a Processor or a Transformer, depending on your needs, and use punctuate() for the timeout functionality.
See https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html for more information.
Tip: You can also use such a Processor/Transformer in the DSL of Kafka Streams. This means you can keep using the more convenient DSL, if you like, and only need to plug in the Processor/Transformer at the right place in your DSL-based code.
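To make the mechanics concrete, here is a minimal, language-agnostic sketch (plain Python, deliberately not the Kafka Streams API) of the bookkeeping such a Transformer would do. `on_record` stands in for the per-record transform step, which in Kafka Streams would write to a KeyValueStore, and `on_tick` stands in for the punctuate() callback registered with PunctuationType.WALL_CLOCK_TIME. All names here are illustrative, not from any library.

```python
class TimeoutBuffer:
    """Sketch of the state a processing-time punctuator would manage."""

    def __init__(self, timeout_ms):
        self.timeout_ms = timeout_ms
        self.pending = {}  # key -> (create_time_ms, latest_value)

    def on_record(self, key, value, now_ms):
        # First sighting fixes the create time; later records for the
        # same key only update the stored state.
        if key in self.pending:
            create_time, _ = self.pending[key]
            self.pending[key] = (create_time, value)
        else:
            self.pending[key] = (now_ms, value)

    def on_tick(self, now_ms):
        # Called periodically on wall-clock time (punctuate): emit every
        # transaction whose timeout has elapsed, then forget it.
        due = [k for k, (t, _) in self.pending.items()
               if t + self.timeout_ms <= now_ms]
        return [(k, self.pending.pop(k)[1]) for k in due]
```

For example, with a 20-second timeout, a transaction first seen at t=0 and updated at t=5s is emitted in its latest state by the first tick at or after t=20s, and only once.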

Related

How to get a collection of all latest attributes values from DynamoDB?

I have a one table where I store all of the sensors data.
Id is a Partition key, TimeEpoch is a sort key.
Example table looks like this:
| Id | TimeEpoch | AirQuality | Temperature | WaterTemperature | LightLevel |
|----|-----------|------------|-------------|------------------|------------|
| b8a76d85-f1b1-4bec-abcf-c2bed2859285 | 1608208992 | 95 | | | |
| 3a6930c2-752a-4103-b6c7-d15e9e66a522 | 1608208993 | | | 23.4 | |
| cb44087d-77da-47ec-8264-faccc2a50b17 | 1608287992 | | 5.6 | | |
| latest | 1608287992 | 95 | 5.6 | 23.4 | 1000 |
I need to get all the latest attributes values from the table.
For now I have used an additional item with Id = latest where I store all of the latest values, but I know this is a hacky way that requires the sensor to write each reading twice: once under a new GUID as the Id, and once again under Id = latest.
The attributes are all known and it's possible that one sensor under one Id can store AirQuality and Temperature at the same time.
NoSQL databases like DynamoDB are tricky, because they don't offer the same query patterns as traditional relational databases.
Therefore, you often need non-traditional solutions to valid challenges like the one you present.
My proposal for one such solution would be to use a DynamoDB feature called DynamoDB Streams.
In short, DynamoDB Streams will be triggered every time an item in your table is created, modified or removed. Streams will then send the new (and old) version of that item to a "receiver" you specify. Typically, that would be a Lambda function.
The solution I would propose is to use streams to send new items to a Lambda. This Lambda could then read the attributes of the item that are not empty and write them to whatever datastore you like. Could be another DynamoDB table, could be S3 or whatever else you like. Obviously, the Lambda would need to make sure to overwrite previous values etc, but the detailed business logic is then up to you.
The upside of this approach is, that you could have some form of up-to-date version of all of those values that you can always read without any complicated logic to find the latest value of each attribute. So reading would be simplified.
The downside is that writing becomes a bit more complex, not least because you introduce more parts to your solution (DynamoDB Streams, Lambda, etc.). This will also increase your cost a bit, depending on how often your data changes; since you seem to store sensor data, that might be quite often, so keep an eye on the cost. This solution also introduces more delay, so if latency is an issue, it might not be for you.
Lastly, I want to mention that it is recommended to have at most two "receivers" of a table's stream. For production I would therefore recommend a single receiver Lambda that creates an AWS EventBridge event (e.g. "item created", "item modified", "item removed"). This allows many more Lambdas etc. to "listen" for such events and process them, mitigating the stream's limitation. That makes it an event-driven solution. As before, this adds delay.
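As a sketch of the receiver Lambda's core logic, merging the non-empty attributes of each new item into a running "latest" record might look like the following. This is plain Python: the event shape is the standard DynamoDB Streams record format, but the function name `merge_latest` and the choice to skip the key attributes are my own assumptions, and actually persisting `latest` (to another table via boto3, to S3, etc.) is left out.

```python
def merge_latest(latest, stream_event):
    """Fold the non-empty attributes of INSERT/MODIFY records into `latest`."""
    for record in stream_event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue  # ignore REMOVE events
        image = record["dynamodb"]["NewImage"]
        for name, typed_value in image.items():
            if name in ("Id", "TimeEpoch"):
                continue  # table keys, not sensor readings
            # DynamoDB Streams encodes values as e.g. {"N": "23.4"}
            (_, value), = typed_value.items()
            latest[name] = value
    return latest
```

Each invocation overwrites only the attributes present in the new item, which matches the "one sensor can store AirQuality and Temperature at the same time" case.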

Iteration over multiple streams in Apache Flink

My question is regarding iteration over multiple streams in Apache Flink.
I am a Flink beginner, and I am currently trying to execute a recursive query (e.g., datalog) on Flink.
For example, a query calculates the transitive closure every 5 minutes (tumbling window). Suppose I have one input stream inputStream (consisting of the initial edge information) and another outputStream (the transitive closure), which is initialised from the inputStream. I want to iteratively enrich the outputStream by joining it with the inputStream. For each iteration, the feedback should be the outputStream, and the iteration should last until no more edges can be appended to the outputStream. The computation of my transitive closure should trigger periodically, every 5 minutes. During the iteration, the inputStream should be "held" and provide the data for my outputStream.
Is it possible to do this in Flink? Thanks for any help!
This sounds like a side-input issue, where you want to treat the "inputStream" as a batch dataset (with refresh) that's joined to the other "outputStream". Unfortunately Flink doesn't provide an easy way to implement that currently (see https://stackoverflow.com/a/48701829/231762)
If both of these streams are coming from data sources, then one approach is to create a wrapper source that controls the ordering of the records. It would have to emit something like a Tuple2 where one side or the other is null, and then in a downstream (custom) Function you'd essentially split these, and do the joining.
If that's possible, then this source can block the "output" tuples while it emits the "input" tuples, plus other logic it sounds like you need (5 minute refresh, etc). See my response to the other SO issue above for skeleton code that does this.
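The ordering-and-splitting idea can be sketched in plain Python (this is not Flink code; `interleave` plays the role of the wrapper source, `split` the downstream custom function, and all names are illustrative):

```python
def interleave(input_records, output_records):
    # Wrapper source: emit all "input" records first, each as a pair with
    # the other side None (mimicking a Tuple2 with one side null), then
    # release the "output" records.
    for rec in input_records:
        yield (rec, None)
    for rec in output_records:
        yield (None, rec)

def split(pairs):
    # Downstream custom function: route each pair by which side is present,
    # recovering the two logical streams for the join.
    inputs = [a for a, b in pairs if b is None]
    outputs = [b for a, b in pairs if a is None]
    return inputs, outputs
```

In the real job, the source would additionally block the "output" side until the 5-minute refresh of the "input" side has been emitted, as described above.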

Local aggregation for data stream in Flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally for a Flink application. The idea is similar to a combiner in MapReduce: combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink, but you have to implement your own operator. In the case of the stream environment, you have to extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. The usual drawback of all these solutions is that you have to set the number of items to pre-aggregate, or a timeout after which to pre-aggregate. Without them you can run out of memory, or you never shuffle items (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic: something that adjusts those parameters at run-time.
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).
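The rule-based threshold logic these operators share (flush partial aggregates downstream when either an item count or a timeout is hit) can be sketched in plain Python. This is not Flink code; the class and parameter names (`PreAggregator`, `max_items`, `timeout_ms`) are illustrative, and the real version would live in an operator extending AbstractStreamOperator:

```python
class PreAggregator:
    """Rule-based local combiner: flush on item count or timeout."""

    def __init__(self, max_items, timeout_ms):
        self.max_items = max_items
        self.timeout_ms = timeout_ms
        self.buffer = {}        # key -> partial sum
        self.count = 0
        self.last_flush_ms = 0

    def add(self, key, value, now_ms):
        # Pre-aggregate locally; only shuffle when a threshold is hit.
        self.buffer[key] = self.buffer.get(key, 0) + value
        self.count += 1
        if (self.count >= self.max_items
                or now_ms - self.last_flush_ms >= self.timeout_ms):
            return self.flush(now_ms)
        return []

    def flush(self, now_ms):
        # Emit the partial sums downstream and reset local state.
        out = sorted(self.buffer.items())
        self.buffer.clear()
        self.count = 0
        self.last_flush_ms = now_ms
        return out
```

The two constructor parameters are exactly the knobs described above; a cost-based variant would adjust them at run-time instead of fixing them.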

Implement bunch of transformations applied to same source stream in Apache Flink in parallel and combine result

Could you please help me - I'm trying to use Apache Flink for machine learning tasks with external ensemble/tree libs like XGBoost, so my workflow will be like this:
receive a single stream of data whose atomic event looks like a simple vector event=(X1, X2, X3...Xn); it can be imagined as POJO fields, so initially we have DataStream<event> source=...
a lot of feature extractions code applied to the same event source:
feature1 = source.map(X1...Xn), feature2 = source.map(X1...Xn), etc. For simplicity, let's say DataStream<int> feature(i) = source.map() for all features
then I need to create a vector with the extracted features (feature1, feature2, ...featureK); for now it will be 40-50 features, but I'm sure it will contain more items in the future and can easily contain 100-500 features or more
put these extracted features to dataset/table columns by 10 minutes window and run final machine learning task on such 10 minutes data
In simple words I need to apply several quite different map operations to the same single event in stream and then combine result from all map functions in single vector.
So for now I can't figure out how to implement the final reduce step and run all feature-extraction map jobs in parallel, if possible. I spent several days on the Flink docs site, YouTube videos, googling, and reading Flink's sources, but it seems I'm really stuck here.
The easy solution would be to use a single map operation and run each feature-extraction code sequentially, one by one, in a huge map body, then return the final vector (Feature1...FeatureK) for each input event. But that seems crazy and non-optimal.
Another solution is to use a join for each pair of features, since all feature DataStreams have the same initial event and the same key and only apply some transformation code; but it looks ugly to write 50 joins with some window. And I think joins and cogroups were designed for joining different streams from different sources, not for such map/reduce operations.
It feels like there should be something simple for all these map operations that I'm missing.
Could you please point me how you guys implement such tasks in Flink, and if possible with example of code?
Thanks!
What is the number of events per second that you wish to process? If it's high enough (roughly the number of machines times the number of cores), you should be just fine processing more events simultaneously. Instead of scaling with the number of features, scale with the number of events. If you have a single data source, you could still randomly shuffle events before applying your transformations.
Another solution might be to:
Assign unique eventId and split the original event using flatMap into tuples: <featureId, Xi, eventId>.
keyBy(featureId, eventId) (or maybe do random partitioning with shuffle()?).
Perform your transformations.
keyBy(eventId, ...).
Window and reduce back to one record per event.
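The split-and-reassemble flow above can be sketched in plain Python (this is not Flink code: `extract_features` stands in for the flatMap step, `assemble_vectors` for the keyBy(eventId) + window/reduce step, and all names are illustrative):

```python
def extract_features(event_id, event, feature_fns):
    # flatMap step: one (feature_id, value, event_id) tuple per feature,
    # so each transformation can run on its own parallel instance.
    return [(fid, fn(event), event_id) for fid, fn in feature_fns.items()]

def assemble_vectors(tuples, n_features):
    # keyBy(eventId) + reduce step: collect the per-feature results back
    # into one vector per event, emitting once all features have arrived.
    partial, done = {}, {}
    for fid, value, event_id in tuples:
        vec = partial.setdefault(event_id, {})
        vec[fid] = value
        if len(vec) == n_features:
            done[event_id] = [vec[f] for f in sorted(vec)]
    return done
```

Note that the reassembly step keys only on eventId, so it stays the same whether you have 50 or 500 features; only `feature_fns` grows.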

Can I rely on CamelSplitComplete when streaming?

We have a process where we are processing a large file. We are using a splitter and using streaming().
The docs say
streaming If enabled then Camel will split in a streaming fashion, which means it will split the input message in chunks. This reduces the memory overhead. For example if you split big messages its recommended to enable streaming. If streaming is enabled then the sub-message replies will be aggregated out-of-order, eg in the order they come back. If disabled, Camel will process sub-message replies in the same order as they where splitted.
So I know that exchanges can be aggregated out of order. Does the splitter simply mark the last exchange it handles with CamelSplitComplete set to true? If so, that exchange could get aggregated out of order, and I'd end up considering my aggregation complete before I've aggregated all the messages, which would lead to missing data.
If instead it sets CamelSplitComplete only on the exchange it knows will be the last one to be aggregated, then I believe I can rely on it.
UPDATE:
Assuming that it is safe to rely on CamelSplitComplete in the case above, is it safe to rely on it if my routes do filtering? I assume not, because the last row might match the filter criteria and be removed.
I have done splits of large files with streaming, and I have used the CamelSplitComplete property to do some processing after the split is done. So yes, you can rely on it being set on the last exchange. Of course, it is best to have a Camel unit test to verify this, but it worked for me. I can't say about the filter case, though: what if you filter out the last exchange?
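As an illustration, a route along these lines uses the property to trigger work after the final sub-message. This is a sketch in Camel's XML DSL; the endpoint URIs (file:inbox, direct:afterSplit) and the newline tokenizer are placeholders I chose, not taken from the question:

```xml
<route>
  <from uri="file:inbox"/>
  <split streaming="true">
    <tokenize token="\n"/>
    <!-- per-line processing goes here -->
    <filter>
      <simple>${exchangeProperty.CamelSplitComplete} == true</simple>
      <!-- runs once, on the final sub-message of the split -->
      <to uri="direct:afterSplit"/>
    </filter>
  </split>
</route>
```

Note that this filter is on the splitter's own property, not on message content, so it is separate from the content-filtering concern raised in the update above.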
