Flink filter before partition - apache-flink

Apache Flink uses a DAG-style lazy processing model, similar to Apache Spark (correct me if I'm wrong). That being said, if I use the following code:
DataStream<Element> data = ...;
DataStream<Element> res = data.filter(...).keyBy(...).timeWindow(...).apply(...);
.keyBy() converts the DataStream to a KeyedStream and distributes it among the Flink worker nodes.
My question is: how will Flink handle the filter here? Will the filter be applied to the incoming DataStream before the stream is partitioned/distributed, so that the resulting DataStream contains only the Elements that pass the filter criteria?

Will the filter be applied to the incoming DataStream before the stream is partitioned/distributed, so that the resulting DataStream contains only the Elements that pass the filter criteria?
Yes, that's right. The only thing I would say differently is to clarify that the original stream data will typically already be distributed (parallel) from the source. The filtering is applied in parallel, across multiple tasks, after which the keyBy repartitions/redistributes the stream among the workers.
You can use Flink's web UI to examine a visualization of the execution graph produced from your job.
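If you'd rather check this without opening the web UI, you can also print the job's execution plan as JSON and paste it into Flink's plan visualizer. A minimal sketch, with made-up elements, filter predicate, and key selector purely for illustration:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PlanCheck {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        env.fromElements("a", "bb", "ccc", "dddd")
           .filter(s -> s.length() > 1)                    // runs in parallel, before the shuffle
           .keyBy(String::length)                          // the repartitioning happens here
           .reduce((x, y) -> x.compareTo(y) <= 0 ? x : y)  // placeholder keyed operation
           .print();

        // Prints the plan as JSON (instead of calling env.execute()); the graph
        // shows the filter chained before the hash partitioning that keyBy adds.
        System.out.println(env.getExecutionPlan());
    }
}
```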

From my understanding, the filter is applied before the keyBy. As you said, it is a DAG (D == Directed). Do you see any indication that this is not the case?

Related

can I use KSQL to generate processing-time timeouts?

I am trying to use KSQL to do whatever processing I can within a time limit and get the results at that time limit. See Timely (and Stateful) Processing with Apache Beam under "Processing Time Timers" for the same idea illustrated using Apache Beam.
Given:
A stream of transactions with unique keys;
Updates to these transactions in the same stream; and
A downstream processor that wants to receive the updated transactions at a specific timeout (say, 20 seconds) after the transactions appeared in the first stream.
Conceptually, I was thinking of creating a KTable from the first stream to hold the latest state of the transactions, and using KSQL to create an output stream by querying the KTable for keys where (create_time + timeout) < current_time (and adding the timeouts as "updates" to the first stream so I could filter them out of the KTable).
I haven't found a way to do this in the KSQL docs, and even if there were a built-in current_time, I'm not sure it would be evaluated until another record came down the stream.
How can I do this in KSQL? Do I need a custom UDF? If it can't be done in KSQL, can I do it in KStreams?
=====
Update: It looks like KStreams does not support this today - Apache Flink appears to be the way to go for this use case (and many others). If you know of a clever way around KStreams' limitations, tell me!
Take a look at the punctuate() functionality in the Processor API of Kafka Streams, which might be what you are looking for. You can use punctuate() with stream-time (default: event-time) as well as with processing-time (via PunctuationType.WALL_CLOCK_TIME). Here, you would implement a Processor or a Transformer, depending on your needs, which will use punctuate() for the timeout functionality.
See https://kafka.apache.org/documentation/streams/developer-guide/processor-api.html for more information.
Tip: You can also use such a Processor/Transformer in the DSL of Kafka Streams. This means you can keep using the more convenient DSL, if you like, and only need to plug in the Processor/Transformer at the right place in your DSL-based code.
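For concreteness, here is a rough sketch of such a Transformer. Hedged assumptions on my part: the Txn value type, the store names "txn-store" and "deadline-store", and the 20-second constant are all made up, and both stores would have to be created and attached when wiring the topology.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;

public class TimeoutTransformer implements Transformer<String, Txn, KeyValue<String, Txn>> {
    private ProcessorContext context;
    private KeyValueStore<String, Txn> store;      // latest transaction state per key
    private KeyValueStore<String, Long> deadlines; // wall-clock deadline per key

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        this.context = context;
        this.store = (KeyValueStore<String, Txn>) context.getStateStore("txn-store");
        this.deadlines = (KeyValueStore<String, Long>) context.getStateStore("deadline-store");
        // Fire every second on processing time, independent of incoming records.
        context.schedule(Duration.ofSeconds(1), PunctuationType.WALL_CLOCK_TIME, this::expire);
    }

    @Override
    public KeyValue<String, Txn> transform(String key, Txn value) {
        if (deadlines.get(key) == null) {                      // first sighting: start the timer
            deadlines.put(key, System.currentTimeMillis() + 20_000L);
        }
        store.put(key, value);                                 // keep only the latest update
        return null;                                           // emit nothing until the timeout
    }

    private void expire(long now) {
        List<String> expired = new ArrayList<>();
        try (KeyValueIterator<String, Long> it = deadlines.all()) {
            while (it.hasNext()) {
                KeyValue<String, Long> e = it.next();
                if (e.value <= now) expired.add(e.key);
            }
        }
        for (String key : expired) {
            context.forward(key, store.get(key));              // emit the latest state at the timeout
            store.delete(key);
            deadlines.delete(key);
        }
    }

    @Override
    public void close() {}
}
```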

Iteration over multiple streams in Apache Flink

My question is regarding iteration over multiple streams in Apache Flink.
I am a Flink beginner, and I am currently trying to execute a recursive query (e.g., datalog) on Flink.
For example, a query that computes the transitive closure every 5 minutes (tumbling window). I have one input stream, inputStream (consisting of the initial edge information), and another stream, outputStream (the transitive closure), which is initialized from inputStream. I want to iteratively enrich outputStream by joining it with inputStream. In each iteration, the feedback should be outputStream, and the iteration should continue until no more edges can be appended to outputStream. The transitive-closure computation should be triggered periodically, every 5 minutes. During an iteration, inputStream should be "held" and provide the data for outputStream.
Is it possible to do this in Flink? Thanks for any help!
This sounds like a side-input issue, where you want to treat the "inputStream" as a batch dataset (with refresh) that's joined to the other "outputStream". Unfortunately Flink doesn't provide an easy way to implement that currently (see https://stackoverflow.com/a/48701829/231762)
If both of these streams are coming from data sources, then one approach is to create a wrapper source that controls the ordering of the records. It would have to emit something like a Tuple2 where one side or the other is null, and then in a downstream (custom) Function you'd essentially split these, and do the joining.
If that's possible, then this source can block the "output" tuples while it emits the "input" tuples, plus other logic it sounds like you need (5 minute refresh, etc). See my response to the other SO issue above for skeleton code that does this.
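To illustrate the shape of that downstream "splitter", here is a plain-Java sketch of the logic only (the Edge type and all names are mine; in Flink this would live inside a custom FlatMapFunction fed by the wrapper source, which is not shown). Records arrive as pairs where exactly one side is set: input edges are buffered, output edges are joined against the buffer to derive new transitive edges.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitAndJoin {
    public static class Edge {
        public final String src, dst;
        public Edge(String src, String dst) { this.src = src; this.dst = dst; }
    }

    private final List<Edge> inputEdges = new ArrayList<>(); // buffered "batch" side

    /** Handles one wrapped record (exactly one argument is non-null);
     *  returns any newly derived edges. */
    public List<Edge> accept(Edge input, Edge output) {
        List<Edge> derived = new ArrayList<>();
        if (input != null) {
            inputEdges.add(input);                 // input side: just buffer it
        } else {
            for (Edge in : inputEdges) {           // output side: try to extend paths
                if (output.dst.equals(in.src)) {   // output.src -> output.dst -> in.dst
                    derived.add(new Edge(output.src, in.dst));
                }
            }
        }
        return derived;
    }
}
```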

Local aggregation for data stream in Flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally in a Flink application. The idea is similar to a combiner in MapReduce: to combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink. You have to implement your own operator; in the streaming environment, you extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. Usually, the drawback of all these solutions is that you have to set the number of items to pre-aggregate, or a timeout after which to pre-aggregate. If you don't set them, you can run out of memory, or you may never ship any items (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic, something that adjusts those parameters at run-time.
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).
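To make the rule-based trade-off concrete, here is a plain-Java sketch of the idea (not an actual Flink operator; the class and all names are mine): partial sums are buffered per key and flushed downstream once a size threshold is hit, which is exactly the "number of items to pre-aggregate" knob described above.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

public class PreAggregator {
    private final Map<String, Long> bundle = new HashMap<>();
    private final int maxBundleSize;
    private final BiConsumer<String, Long> downstream; // stands in for output.collect(...)

    public PreAggregator(int maxBundleSize, BiConsumer<String, Long> downstream) {
        this.maxBundleSize = maxBundleSize;
        this.downstream = downstream;
    }

    public void add(String key, long value) {
        bundle.merge(key, value, Long::sum);   // combine locally, like a MapReduce combiner
        if (bundle.size() >= maxBundleSize) {
            flush();                           // rule-based trigger: count threshold
        }
    }

    public void flush() {                      // a timeout trigger would also call this
        bundle.forEach(downstream);
        bundle.clear();
    }
}
```

If the threshold is never reached and there is no timeout, nothing is ever shipped downstream, which is precisely the failure mode the answer warns about.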

Obtain KeyedStream from custom partitioning in Flink

I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream you get a DataStream back and not a KeyedStream.
On the other hand, you cannot override the partitioning strategy for a KeyedStream.
I do want to use KeyedStream, because the API for DataStream does not have reduce and sum operators and because of automatically partitioned internal state.
I mean, if the word count is:
words.map(s -> Tuple2.of(s, 1)).keyBy(0).sum(1)
I wish I could write:
words.map(s -> Tuple2.of(s, 1)).partitionCustom(myPartitioner, 0).sum(1)
Is there any way to accomplish this?
Thank you!
According to Flink's documentation (as of version 1.2.1), what partitioners do is partition the data physically with respect to their keys: they only determine which parallel instance the records are sent to, without logically grouping the data into a keyed stream. To do the summation, we still need to group the records by key using the keyBy operator; only then are we allowed to apply sum.
For details, please refer to https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning :)
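In code, that means the workaround looks like the following sketch (assuming myPartitioner implements Partitioner<String>; the .returns(...) call is only there because the lambda needs explicit type information). partitionCustom controls physical placement only and still yields a DataStream, so the keyBy is still required before sum:

```java
DataStream<Tuple2<String, Integer>> counts =
    words.map(s -> Tuple2.of(s, 1))
         .returns(Types.TUPLE(Types.STRING, Types.INT))
         .partitionCustom(myPartitioner, 0)  // physical placement only, still a DataStream
         .keyBy(0)                           // logical grouping: now a KeyedStream
         .sum(1);
```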

Stream load balancing

I have two streams. One is an event stream, the other is a database update stream. I want to enrich the event stream with information built from the DB update stream.
The event stream is very voluminous and is partitioned on 5 fields. This gives me good distribution. The DB stream is a lot less chatty, and is partitioned on two fields. I am currently connecting the two streams on the two common fields and using a flatMap to enrich the first stream. The flatMap operator uses ValueState to maintain state, which is automatically keyed by the two common fields.
I find that the load in the event stream tends to be skewed in terms of the two common fields. This causes uneven load balancing across the flatMap instances, and a few instances are around 10 times more loaded than the others.
I am thinking a better approach would be to broadcast the DB update stream across all flatMap instances and simply forward the event stream based on its existing partitioning scheme. However, the issue is that, because no keys are specified for the connect operator, I cannot use ValueState.
Other than implementing custom logic to manually extract the key and maintain state, is there anything else I can do?
Is there a simpler approach I am missing?
You can implement the Checkpointed interface with the CoFlatMapFunction to checkpoint the broadcast DB updates, instead of using the key-value state interface.
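A rough sketch of that shape (hedged: DbUpdate, Event, EnrichedEvent, and the joinKey() helper are made-up types/names of mine; here the per-instance table is snapshotted via ListCheckpointed, the successor of the Checkpointed interface mentioned above, and DbUpdate would need to be serializable):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.flink.streaming.api.checkpoint.ListCheckpointed;
import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class EnrichFunction
        implements CoFlatMapFunction<DbUpdate, Event, EnrichedEvent>,
                   ListCheckpointed<DbUpdate> {

    // One full copy of the (small) DB table per parallel instance,
    // keyed by the two common fields combined.
    private final Map<String, DbUpdate> table = new HashMap<>();

    @Override
    public void flatMap1(DbUpdate update, Collector<EnrichedEvent> out) {
        table.put(update.joinKey(), update);          // broadcast side: refresh local copy
    }

    @Override
    public void flatMap2(Event event, Collector<EnrichedEvent> out) {
        DbUpdate match = table.get(event.joinKey());  // event side: enrich and forward
        if (match != null) {
            out.collect(new EnrichedEvent(event, match));
        }
    }

    @Override
    public List<DbUpdate> snapshotState(long checkpointId, long timestamp) {
        return new ArrayList<>(table.values());       // checkpoint the whole local table
    }

    @Override
    public void restoreState(List<DbUpdate> state) {
        for (DbUpdate u : state) table.put(u.joinKey(), u);
    }
}
```

It would be wired up along the lines of dbUpdates.broadcast().connect(events).flatMap(new EnrichFunction()), so the event stream keeps its existing distribution while every instance sees all DB updates.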
