Obtain KeyedStream from custom partitioning in Flink - apache-flink

I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream you get a DataStream back and not a KeyedStream.
On the other hand, you cannot override the partitioning strategy for a KeyedStream.
I do want to use KeyedStream, because the API for DataStream does not have reduce and sum operators and because of automatically partitioned internal state.
I mean, if the word count is:
words.map(s -> Tuple2.of(s, 1)).keyBy(0).sum(1)
I wish I could write:
words.map(s -> Tuple2.of(s, 1)).partitionCustom(myPartitioner, 0).sum(1)
Is there any way to accomplish this?
Thank you!

From Flink's documentation (as of version 1.2.1), a custom partitioner only controls where records are physically placed across the machines; it does not logically group the data into a keyed stream. To do the summation, we still need to group the records by key with the keyBy operator, and only then are we allowed to call sum.
For details, see https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning :)
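A minimal sketch of what that looks like (the partitioner logic is made up here, and words is assumed to be a DataStream<String> as in the question): partitionCustom only decides where records are sent physically, so keyBy is still needed afterwards to obtain a KeyedStream on which sum can be called.

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;

// Hypothetical partitioner: route each word by its hash.
Partitioner<String> myPartitioner =
        (word, numPartitions) -> Math.abs(word.hashCode() % numPartitions);

DataStream<Tuple2<String, Integer>> counts = words
        .map(s -> Tuple2.of(s, 1))
        .returns(new TypeHint<Tuple2<String, Integer>>() {}) // lambdas need an explicit type hint
        .partitionCustom(myPartitioner, 0)                   // physical placement only
        .keyBy(0)                                             // logical grouping -> KeyedStream
        .sum(1);

Note that the keyBy redistributes the data again by the key's hash, so the custom placement is not carried into the keyed operation; that is exactly the limitation described in the question.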

Related

Flink filter before partition

Apache Flink uses a DAG-style lazy processing model similar to Apache Spark (correct me if I'm wrong). That being said, if I use the following code
DataStream<Element> data = ...;
DataStream<Element> res = data.filter(...).keyBy(...).timeWindow(...).apply(...);
.keyBy() converts the DataStream into a KeyedStream and distributes it among the Flink worker nodes.
My question is, how will flink handle filter here? Will filter be applied to incoming DataStream before partitioning/distributing the stream and DataStream will only be created of Element's that pass the filter criteria?
Yes, that's right. The only thing I might say differently is to clarify that the original stream data will typically already be distributed (parallel) from the source. The filtering will be applied in parallel, across multiple tasks, after which the keyBy will repartition/redistribute the stream among the workers.
You can use Flink's web UI to examine a visualization of the execution graph produced from your job.
From my understanding the filter is applied before the keyBy. As you said, it is a DAG (D == Directed). Do you see any indicator which tells you that this is not the case?

Get Operator Name In Flink Latency Metric

I am trying to estimate end to end tuple latency of my events using the latency metrics exported by Flink (I am using a Prometheus metrics reporter). All is good and I can see the latency metric in my Grafana/Prom dashboard. Looks something like
flink_taskmanager_job_latency_source_id_source_subtask_index_operator_id_operator_subtask_index_latency{
host="",instance="",job="",
job_id="",job_name="",operator_id="",operator_subtask_index="0",
quantile="0.99",source_id="",source_subtask_index="0",tm_id=""}
This test job I have is a simple source->map->sink operation, with parallelism set to 1. I can see from the Flink dashboard that all of them get chained together into one task. For one run of my job, I see two sets of latency metrics. Each set shows all quantiles (.5, .95, ...). The only thing that differs between the two sets is the operator_id. I assumed this means one operator_id belongs to the map operator and the other belongs to the sink.
Now my problem is that there is no intuitive way to distinguish between the two (find out which operator_id is the map vs the sink) just by looking at the metrics. So my questions are essentially:
Is my assumption correct?
What is the best way to distinguish the two operators? I tried assigning names to my map and sink. Even though these names show up in other metrics like numRecordsIn, they do not show up in the latency metric.
Is there a way to get the mapping between operator_id and operator_name?
The operator_id is currently a hash value: it is either computed from the hash values of the operator's inputs and the node itself or, if you have set a UID via uid() for an operator, computed as the murmur3_128 hash of this id.
Please open a JIRA issue to add this feature to Flink.
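As a partial workaround (a sketch, not an official API for the mapping): if you assign explicit UIDs to your operators, you can recompute the expected hash offline and match it against the operator_id label. The exact hashing details (seed, charset) may differ between Flink versions, so treat the recomputation as an assumption to verify; MyMapFunction and the uid string are made up here, and Hashing is Guava's com.google.common.hash.Hashing.

// Give the operator a stable, known UID (and a name for the other metrics).
stream.map(new MyMapFunction()).uid("my-map").name("my-map");

// Recompute the id you expect to see in the operator_id label for that uid,
// assuming it is the murmur3_128 hash of the uid string as described above.
String expectedOperatorId =
        Hashing.murmur3_128().hashString("my-map", StandardCharsets.UTF_8).toString();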

Local aggregation for data stream in Flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally for a Flink application. The idea is similar to a combiner in MapReduce: to combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink. You have to implement your own operator; in the case of the streaming environment, you have to extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides a local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. Usually, the drawback of all those solutions is that you have to set the number of items to pre-aggregate or a timeout for the pre-aggregation. If you don't set them, you can run out of memory, or you may never emit items (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic, something that adjusts those parameters at run time.
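To give a concrete feel of the rule-based approach, here is a minimal sketch of a count-based local pre-aggregation written as a plain RichFlatMapFunction (this is not the BundleOperator itself; the flush threshold and the Tuple2<String, Integer> input type are made up, and a plain function like this does not integrate with checkpointing the way an operator-based solution does):

import java.util.HashMap;
import java.util.Map;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class LocalPreAggregator
        extends RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>> {

    private static final int MAX_BUNDLE_SIZE = 1000; // hypothetical flush threshold

    private transient Map<String, Integer> bundle;
    private transient int elementsInBundle;

    @Override
    public void open(Configuration parameters) {
        bundle = new HashMap<>();
        elementsInBundle = 0;
    }

    @Override
    public void flatMap(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) {
        // Combine partial counts per key locally, like a MapReduce combiner.
        bundle.merge(value.f0, value.f1, Integer::sum);
        if (++elementsInBundle >= MAX_BUNDLE_SIZE) {
            for (Map.Entry<String, Integer> e : bundle.entrySet()) {
                out.collect(Tuple2.of(e.getKey(), e.getValue()));
            }
            bundle.clear();
            elementsInBundle = 0;
        }
    }
}

Downstream you would still apply keyBy and the global window aggregation; the pre-aggregator only reduces the amount of data that gets shuffled, and it illustrates the drawback above: the flush threshold (or a timeout) has to be chosen up front.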
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).

Does anyone have a good example of a ProcessFunction that sums or aggregates data at some frequency

I am looking to mimic the behaviour of a window().reduce() operation but without a key, at the task manager level. Sort of like what .windowAll().reduce() does for a stream, but I am looking to get individual results from each task manager.
I tried searching for "flink processFunction examples" but not finding anything useful to look at.
For ProcessFunction examples, I suggest the examples in the Flink docs and in the Flink training materials.
Another approach would be to use windows with a random key selector. That's not as easy as it sounds: you can't just key by a random number, as the value of the key must be deterministic for each stream element. So you could add a field that you set to a random value, and then keyBy that field. Compared to the ProcessFunction approach this will force a shuffle, but it will be simpler.
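A sketch of that random-key variant (the bucket count, the Tuple2<Integer, Long> element type, and values being a DataStream<Long> are all assumptions for illustration; it uses java.util.concurrent.ThreadLocalRandom and Flink's TypeHint, Tuple2, and Time classes):

final int NUM_BUCKETS = 8; // hypothetical fan-out, e.g. the operator parallelism

// Tag each element once with a random bucket; after tagging, the key is
// deterministic per element, so it is safe to keyBy it.
DataStream<Tuple2<Integer, Long>> bucketed = values
        .map(v -> Tuple2.of(ThreadLocalRandom.current().nextInt(NUM_BUCKETS), v))
        .returns(new TypeHint<Tuple2<Integer, Long>>() {});

DataStream<Tuple2<Integer, Long>> partialSums = bucketed
        .keyBy(0)                                        // key by the random bucket field
        .timeWindow(Time.seconds(10))                    // per-bucket windows
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1)); // partial sum per bucket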

What magics does Flink use in distinct()? How are surrogate keys generated?

Regarding generating a surrogate key, the first step is to get the distinct elements and then build an incremental key for each tuple.
So I used a Java Set to get the distinct elements, and it ran out of heap space.
Then I used Flink's distinct() and it totally works.
Could I ask what makes the difference?
Another related question is: can Flink generate a surrogate key in a mapper?
Flink executes a distinct() internally as a GroupBy followed by a ReduceGroup operator, where the reduce operator returns the first element of the group only.
The GroupBy is done by sorting the data. Sorting is done on a binary data representation, if possible in-memory, but might spill to disk if not enough memory is available. This blog post gives some insight about that. GroupBy and Sort are memory-safe in Flink and will not fail with an OutOfMemoryError.
You can also do a distinct on a custom key, by using DataSet.distinct(KeySelector ks). The key selector is basically a MapFunction that generates a custom key.
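For example, a small sketch of the KeySelector variant (the Record type, its getName() accessor, and records being a DataSet<Record> are made up here):

// Deduplicate by a custom key: two records with the same name are considered equal.
DataSet<Record> uniqueByName = records.distinct(
        new KeySelector<Record, String>() {
            @Override
            public String getKey(Record r) {
                return r.getName();
            }
        });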
