My question is regarding iteration over multiple streams in Apache Flink.
I am a Flink beginner, and I am currently trying to execute a recursive query (e.g., datalog) on Flink.
For example, a query computes the transitive closure every 5 minutes (tumbling window). I have one input stream inputStream (consisting of the initial edge information) and another stream outputStream (the transitive closure), which is initialised from the inputStream. I want to iteratively enrich the outputStream by joining it with the inputStream. In each iteration the feedback should be the outputStream, and the iteration should continue until no more edges can be appended to the outputStream. The computation of the transitive closure should trigger periodically, every 5 minutes. During the iteration, the inputStream should be "held" and provide the data for my outputStream.
Is it possible to do this in Flink? Thanks for any help!
This sounds like a side-input issue, where you want to treat the "inputStream" as a batch dataset (with refresh) that's joined to the other "outputStream". Unfortunately Flink doesn't provide an easy way to implement that currently (see https://stackoverflow.com/a/48701829/231762)
If both of these streams are coming from data sources, then one approach is to create a wrapper source that controls the ordering of the records. It would have to emit something like a Tuple2 where one side or the other is null, and then in a downstream (custom) Function you'd essentially split these, and do the joining.
If that's possible, then this source can block the "output" tuples while it emits the "input" tuples, plus other logic it sounds like you need (5 minute refresh, etc). See my response to the other SO issue above for skeleton code that does this.
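To make the idea concrete, here is a rough, untested sketch of such a wrapper source, under my assumptions: InputRecord/OutputRecord and the two fetch methods are placeholders for your actual edge and closure types, and the emit order is what enforces "hold the input side first".

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

import java.util.Collections;

// Hypothetical wrapper source: emits all "input" records before any "output" records,
// as a Tuple2 with exactly one non-null side, and refreshes roughly every 5 minutes.
// InputRecord, OutputRecord and the fetch methods are placeholders.
public class InterleavedSource implements SourceFunction<Tuple2<InputRecord, OutputRecord>> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Tuple2<InputRecord, OutputRecord>> ctx) throws Exception {
        while (running) {
            // 1) emit the current snapshot of the "input" side (edges)
            for (InputRecord edge : fetchInputSnapshot()) {
                ctx.collect(Tuple2.of(edge, null));
            }
            // 2) only afterwards emit the "output" side records
            for (OutputRecord rec : fetchPendingOutput()) {
                ctx.collect(Tuple2.of(null, rec));
            }
            Thread.sleep(5 * 60 * 1000L);   // crude 5-minute refresh
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    // placeholders: replace with reads from your actual backing stores
    private Iterable<InputRecord> fetchInputSnapshot() { return Collections.emptyList(); }
    private Iterable<OutputRecord> fetchPendingOutput() { return Collections.emptyList(); }
}
```

A downstream (custom) function would then check which side of the Tuple2 is non-null, buffer the input records, and do the joining.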
I am trying to use KSQL to do whatever processing I can within a time limit and get the results at that time limit. See Timely (and Stateful) Processing with Apache Beam under "Processing Time Timers" for the same idea illustrated using Apache Beam.
Given:
A stream of transactions with unique keys;
Updates to these transactions in the same stream; and
A downstream processor that wants to receive the updated transactions at a specific timeout - say 20 seconds - after the transactions appeared in the first stream.
Conceptually, I was thinking of creating a KTable of the first stream to hold the latest state of the transactions, and using KSQL to create an output stream by querying the KTable for keys with (create_time + timeout) < current_time. (and adding the timeouts as "updates" to the first stream so I could filter those out from the KTable)
I haven't found a way to do this in the KSQL docs, and even if there were a built-in current_time, I'm not sure it would be evaluated until another record came down the stream.
How can I do this in KSQL? Do I need a custom UDF? If it can't be done in KSQL, can I do it in KStreams?
=====
Update: It looks like KStreams does not support this today - Apache Flink appears to be the way to go for this use case (and many others). If you know of a clever way around KStreams' limitations, tell me!
Take a look at the punctuate() functionality in the Processor API of Kafka Streams, which might be what you are looking for. You can use punctuate() with stream-time (default: event-time) as well as with processing-time (via PunctuationType.WALL_CLOCK_TIME). Here, you would implement a Processor or a Transformer, depending on your needs, which will use punctuate() for the timeout functionality.
See https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html for more information.
Tip: You can use such a Processor/Transformer also in the DSL of Kafka Streams. This means you can keep using the more convenient DSL, if you like to, and only need to plug in the Processor/Transformer at the right place in your DSL-based code.
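To illustrate (untested, and purely as a sketch): assuming a Transaction value type with a firstSeenMs field and a state store named "tx-store" registered on the topology, a Transformer that uses a wall-clock punctuation for the 20-second timeout could look roughly like this:

```java
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

import java.time.Duration;

// Hypothetical sketch: keep the latest state per transaction in a store and emit it
// roughly 20 seconds (wall-clock) after the transaction was first seen.
// "Transaction" (with a firstSeenMs field) and the store name "tx-store" are assumptions.
public class TimeoutTransformer
        implements Transformer<String, Transaction, KeyValue<String, Transaction>> {

    private ProcessorContext context;
    private KeyValueStore<String, Transaction> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, Transaction>) context.getStateStore("tx-store");
        // scan for expired transactions once per second of wall-clock time
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, timestamp -> {
            try (KeyValueIterator<String, Transaction> iter = store.all()) {
                while (iter.hasNext()) {
                    KeyValue<String, Transaction> entry = iter.next();
                    if (timestamp - entry.value.firstSeenMs >= 20_000L) {   // 20s timeout
                        context.forward(entry.key, entry.value);            // emit latest state
                        store.delete(entry.key);
                    }
                }
            }
        });
    }

    @Override
    public KeyValue<String, Transaction> transform(String key, Transaction update) {
        Transaction existing = store.get(key);
        if (existing != null) {
            update.firstSeenMs = existing.firstSeenMs;  // keep the original arrival time
        } else {
            update.firstSeenMs = context.timestamp();   // first time we see this key
        }
        store.put(key, update);                         // latest update wins
        return null;                                    // emit only from the punctuation
    }

    @Override
    public void close() { }
}
```

In the DSL you would plug it in with something like stream.transform(TimeoutTransformer::new, "tx-store") after adding the store to the StreamsBuilder.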
I'm trying to find a good way to combine a Flink keyed WindowedStream locally for a Flink application. The idea is similar to a combiner in MapReduce: combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink. You have to implement your own operator. In the case of stream environment you have to extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides a local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. Usually, the drawback of all those solutions is that you have to set the number of items to pre-aggregate, or a timeout after which to pre-aggregate. If you don't set them, you can run out of memory, or you may never shuffle items downstream (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based, more dynamic: something that adjusts those parameters at run-time.
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).
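For what it's worth, here is a rough sketch of such a rule-based pre-aggregation operator for a (key, count) stream, flushing after a fixed number of records (this is my own simplified version, not KurtYoung's BundleOperator):

```java
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

import java.util.HashMap;
import java.util.Map;

// Hypothetical count-based pre-aggregation operator: sums values per key locally
// and flushes the partial sums downstream every maxCount records (rule-based).
public class PreAggregateOperator
        extends AbstractStreamOperator<Tuple2<String, Long>>
        implements OneInputStreamOperator<Tuple2<String, Long>, Tuple2<String, Long>> {

    private final int maxCount;                     // flush threshold
    private transient Map<String, Long> bundle;     // partial sums per key
    private transient int count;

    public PreAggregateOperator(int maxCount) {
        this.maxCount = maxCount;
    }

    @Override
    public void open() throws Exception {
        super.open();
        bundle = new HashMap<>();
        count = 0;
    }

    @Override
    public void processElement(StreamRecord<Tuple2<String, Long>> element) {
        Tuple2<String, Long> value = element.getValue();
        bundle.merge(value.f0, value.f1, Long::sum);   // combine locally
        if (++count >= maxCount) {
            flush();
        }
    }

    private void flush() {
        for (Map.Entry<String, Long> e : bundle.entrySet()) {
            output.collect(new StreamRecord<>(Tuple2.of(e.getKey(), e.getValue())));
        }
        bundle.clear();
        count = 0;
    }
}
```

You would wire it in with something like stream.transform("pre-aggregate", TypeInformation.of(new TypeHint<Tuple2<String, Long>>() {}), new PreAggregateOperator(1000)) before the keyBy.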
Could you please help me - I'm trying to use Apache Flink for machine learning tasks with external ensemble/tree libs like XGBoost, so my workflow will be like this:
receive a single stream of data whose atomic event looks like a simple vector event = (X1, X2, X3 ... Xn); it can be imagined as POJO fields, so initially we have DataStream<event> source = ...
a lot of feature-extraction code applied to the same event source:
feature1 = source.map(X1...Xn), feature2 = source.map(X1...Xn), etc. For simplicity, let DataStream<int> feature(i) = source.map() for all features
then I need to create a vector with the extracted features (feature1, feature2, ... featureK); for now it will be 40-50 features, but I'm sure it will contain more items in the future and can easily grow to 100-500 features or more
put these extracted features into dataset/table columns over a 10-minute window and run the final machine learning task on that 10-minute window of data
In simple words, I need to apply several quite different map operations to the same single event in the stream and then combine the results from all the map functions into a single vector.
So for now I can't figure out how to implement the final reduce step and run all the feature-extraction map jobs in parallel, if possible. I spent several days on the Flink docs site, YouTube videos, googling, and reading Flink's sources, but it seems I'm really stuck here.
The easy solution here would be to use a single map operation and run each feature-extraction step sequentially, one by one, in a huge map body, and then return the final vector (Feature1...FeatureK) for each input event. But that seems crazy and non-optimal.
Another solution would be to use a join for each pair of features, since all feature DataStreams have the same initial event and the same key and only apply some transformation code, but it looks ugly: writing 50 joins with some window. And I think joins and coGroups were developed for joining different streams from different sources, not for such map/reduce operations.
It seems to me there should be something simple for all these map operations that I'm missing.
Could you please point me how you guys implement such tasks in Flink, and if possible with example of code?
Thanks!
What is the number of events per second that you wish to process? If it's high enough (~number of machines * number of cores) you should be just fine processing more events simultaneously. Instead of scaling with the number of features, scale with the number of events. If you have a single data source you could still randomly shuffle events before applying your transformations.
Another solution might be to (a rough sketch follows these steps):
Assign unique eventId and split the original event using flatMap into tuples: <featureId, Xi, eventId>.
keyBy(featureId, eventId) (or maybe do random partitioning with shuffle()?).
Perform your transformations.
keyBy(eventId, ...).
Window and reduce back to one record per event.
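A rough sketch of those steps, assuming a hypothetical Event POJO with a unique id and fields x1, x2, ... (the feature expressions here are dummies):

```java
import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

// Assume 'source' is a DataStream<Event>, where Event has a unique id and fields x1..xn.
// Explode each event into (eventId, featureId, value) records, transform them
// independently, then window + aggregate them back into one feature vector per event.
DataStream<Tuple3<Long, Integer, Double>> exploded = source
        .flatMap((Event e, Collector<Tuple3<Long, Integer, Double>> out) -> {
            out.collect(Tuple3.of(e.id, 1, e.x1 * 2.0));        // dummy feature extractor 1
            out.collect(Tuple3.of(e.id, 2, Math.log(e.x2)));    // dummy feature extractor 2
            // ... one record per feature
        })
        .returns(Types.TUPLE(Types.LONG, Types.INT, Types.DOUBLE));

DataStream<Map<Integer, Double>> vectors = exploded
        .keyBy(t -> t.f0)                                        // key by eventId
        .window(TumblingProcessingTimeWindows.of(Time.minutes(10)))
        .aggregate(new AggregateFunction<Tuple3<Long, Integer, Double>,
                                         Map<Integer, Double>, Map<Integer, Double>>() {
            @Override
            public Map<Integer, Double> createAccumulator() { return new HashMap<>(); }

            @Override
            public Map<Integer, Double> add(Tuple3<Long, Integer, Double> t,
                                            Map<Integer, Double> acc) {
                acc.put(t.f1, t.f2);                             // collect featureId -> value
                return acc;
            }

            @Override
            public Map<Integer, Double> getResult(Map<Integer, Double> acc) { return acc; }

            @Override
            public Map<Integer, Double> merge(Map<Integer, Double> a, Map<Integer, Double> b) {
                a.putAll(b);
                return a;
            }
        });
```

The AggregateFunction just assembles a featureId -> value map per event; you could replace it with whatever vector type your ML step expects.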
We have a process where we are processing a large file. We are using a splitter and using streaming().
The docs say
streaming If enabled then Camel will split in a streaming fashion, which means it will split the input message in chunks. This reduces the memory overhead. For example if you split big messages its recommended to enable streaming. If streaming is enabled then the sub-message replies will be aggregated out-of-order, eg in the order they come back. If disabled, Camel will process sub-message replies in the same order as they where splitted.
So I know that exchanges can be aggregated out of order. But does the splitter mark the last exchange it handles by setting CamelSplitComplete to true? If so, that exchange could get aggregated out of order, and I'd end up considering my aggregation complete before I've aggregated all the messages. This would lead to missing data.
If instead it marks the exchange CamelSplitComplete only when it knows it's the last one to be aggregated, then I believe I can rely on it.
UPDATE:
Assuming that it is safe to rely on CamelSplitComplete in the case above, is it safe to rely on it if my routes do filtering? I assume not, because the last row might match the filter criteria and be removed.
I have done splits of large files with streaming, and I have used the CamelSplitComplete property to do some processing after the split is done. So yes, you can rely on it being the last exchange. Of course, it is best to have a Camel unit test to verify this, but it worked for me. I can't say about the filter case, though: what if you filtered out the last exchange?
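For illustration, a minimal route sketch along those lines (the endpoints and the tokenizer are assumptions): it splits the file line by line with streaming() and runs a completion step only on the exchange where CamelSplitComplete is true.

```java
import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;

// Rough sketch: split a big file line by line in streaming mode, and run a
// completion step only on the exchange that carries CamelSplitComplete = true.
// The endpoints ("file:inbox", "direct:processLine", "direct:afterSplit") are placeholders.
public class SplitRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("file:inbox?fileName=big.csv")                      // assumed input endpoint
            .split(body().tokenize("\n")).streaming()
                .to("direct:processLine")                        // per-line processing
                .filter(exchangeProperty(Exchange.SPLIT_COMPLETE).isEqualTo(true))
                    .log("last split exchange reached")
                    .to("direct:afterSplit")                     // post-split processing
                .end()                                           // end filter
            .end();                                              // end split
    }
}
```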
I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned using 5 fields. This gives me good distribution. The DB stream is a lot less chatty, and is partitioned using two fields. I am currently connecting the two streams using the two common fields and using a flatMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed in terms of the two common fields. This causes uneven load balancing across the flatMap instances, and a few instances are around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream across all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However the issue is that because there are no keys specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and maintain the state, is there anything else I can do?
Is there a simpler approach I am missing?
You can implement the Checkpointed interface with the CoFlatMapFunction to checkpoint the broadcasted DB updates instead of using the key-value state interface.
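A minimal sketch of that idea; note that newer Flink versions replaced the Checkpointed interface with CheckpointedFunction, which is what I'm using here, and Event, DbUpdate, EnrichedEvent and their accessors are placeholder types:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: enrich the forwarded event stream from a broadcast DB-update
// stream held in a plain map, and snapshot that map as operator state.
public class EnrichingCoFlatMap
        extends RichCoFlatMapFunction<Event, DbUpdate, EnrichedEvent>
        implements CheckpointedFunction {

    private transient Map<String, DbUpdate> dbCache;                  // local copy of DB state
    private transient ListState<Map<String, DbUpdate>> checkpointed;  // its checkpointed form

    @Override
    public void flatMap1(Event event, Collector<EnrichedEvent> out) {
        DbUpdate dbRow = dbCache.get(event.getDbKey());   // may be null if not seen yet
        out.collect(new EnrichedEvent(event, dbRow));
    }

    @Override
    public void flatMap2(DbUpdate update, Collector<EnrichedEvent> out) {
        dbCache.put(update.getKey(), update);             // remember the latest DB state
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        checkpointed.clear();
        checkpointed.add(dbCache);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        dbCache = new HashMap<>();
        checkpointed = ctx.getOperatorStateStore().getListState(
                new ListStateDescriptor<>("db-cache",
                        TypeInformation.of(new TypeHint<Map<String, DbUpdate>>() {})));
        if (ctx.isRestored()) {
            for (Map<String, DbUpdate> restored : checkpointed.get()) {
                dbCache.putAll(restored);
            }
        }
    }
}
```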