Local aggregation for data stream in Flink - apache-flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally in a Flink application. The idea is similar to a combiner in MapReduce: combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?

There have been some initiatives to provide pre-aggregation in Flink. You have to implement your own operator; in the streaming environment that means extending the class AbstractStreamOperator.
Kurt Young implemented a BundleOperator. You can also use the Table API on top of the DataStream API; the Table API already provides a local aggregation. I also have an example of a pre-aggregate operator that I implemented myself.

Usually, the drawback of all those solutions is that you have to set the number of items to pre-aggregate or a timeout for the pre-aggregation. If you don't tune them you can run out of memory, or you never shuffle items downstream (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based, more dynamic: something that adjusts those parameters at run time.
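As a rough illustration of the operator approach, here is a minimal count-based pre-aggregation sketch. The class name and the threshold are made up, and the partial sums are kept on the heap without checkpointing, so treat it as a starting point rather than a production operator.

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

// Buffers partial sums per key and flushes them downstream once a fixed number
// of input records has been seen, similar to a combiner in MapReduce.
public class PreAggregateOperator
        extends AbstractStreamOperator<Tuple2<String, Integer>>
        implements OneInputStreamOperator<Tuple2<String, Integer>, Tuple2<String, Integer>> {

    private final int bundleSize;                  // rule-based flush threshold
    private transient Map<String, Integer> bundle; // partial sums, heap only in this sketch
    private transient int count;

    public PreAggregateOperator(int bundleSize) {
        this.bundleSize = bundleSize;
    }

    @Override
    public void open() throws Exception {
        super.open();
        bundle = new HashMap<>();
        count = 0;
    }

    @Override
    public void processElement(StreamRecord<Tuple2<String, Integer>> element) {
        Tuple2<String, Integer> value = element.getValue();
        bundle.merge(value.f0, value.f1, Integer::sum);
        if (++count >= bundleSize) {
            flush();
        }
    }

    private void flush() {
        for (Map.Entry<String, Integer> entry : bundle.entrySet()) {
            output.collect(new StreamRecord<>(Tuple2.of(entry.getKey(), entry.getValue())));
        }
        bundle.clear();
        count = 0;
    }
}

You would plug it in before the keyed aggregation with DataStream#transform, for example stream.transform("pre-aggregate", typeInfo, new PreAggregateOperator(1000)).keyBy(0).sum(1).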
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).

Related

What would you prefer using the sort() method or the aggregation $sort in MongoDb and Why?

I have been browsing the internet for quite a few hours now and haven't come across a satisfactory answer as to why one is better than the other. If it is situation dependent, then what are the situations in which to use one over the other? It would be great if you could provide a solution for this, with an example if there can be one. I understand that since the aggregation operators came later they are probably the better option, but I have still seen people using the find()+sort() method.
You shouldn't think of this as an issue of "which method is better?", but "what kind of query do I need to perform?"
The MongoDB aggregation pipeline exists to handle a different set of problems than a simple .find() query. Specifically, aggregation is meant to allow processing of data on the database end in order to reduce the workload on the application server. For example, you can use aggregation to generate a numerical analysis on all of the documents in a collection.
If all you want to do is retrieve some documents in sorted order, use find() and sort(). If you want to perform a lot of processing on the data before retrieving the results, then use aggregation with a $sort stage.
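As a small illustration of that split, here is a sketch using the MongoDB Java sync driver. The database, collection, and field names ("shop", "orders", "createdAt", "status") are made up for the example.

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Sorts;
import org.bson.Document;

import java.util.Arrays;

public class SortVsAggregate {
    public static void main(String[] args) {
        MongoCollection<Document> orders = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("shop")
                .getCollection("orders");

        // Just retrieving documents in sorted order: find() + sort() is enough,
        // and an index on createdAt lets the server avoid an in-memory sort.
        for (Document doc : orders.find().sort(Sorts.descending("createdAt")).limit(10)) {
            System.out.println(doc.toJson());
        }

        // Processing on the server before sorting: group by status, then $sort the groups.
        for (Document doc : orders.aggregate(Arrays.asList(
                Aggregates.group("$status", Accumulators.sum("count", 1)),
                Aggregates.sort(Sorts.descending("count"))))) {
            System.out.println(doc.toJson());
        }
    }
}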

Get Operator Name In Flink Latency Metric

I am trying to estimate the end-to-end tuple latency of my events using the latency metrics exported by Flink (I am using a Prometheus metrics reporter). All is good and I can see the latency metric in my Grafana/Prometheus dashboard. It looks something like:
flink_taskmanager_job_latency_source_id_source_subtask_index_operator_id_operator_subtask_index_latency{
host="",instance="",job="",
job_id="",job_name="",operator_id="",operator_subtask_index="0",
quantile="0.99",source_id="",source_subtask_index="0",tm_id=""}
The test job I have is a simple source->map->sink pipeline, with parallelism set to 1. I can see from the Flink dashboard that all of them get chained together into one task. For one run of my job, I see two sets of latency metrics. Each set shows all quantiles (.5, .95, ...). The only difference between the two sets is the operator_id. I assumed this means one operator_id belongs to the map operator and the other belongs to the sink.
Now my problem is that there is no intuitive way to distinguish between the two (to find out which operator_id is the map vs. the sink) just by looking at the metrics. So my questions are essentially:
Is my assumption correct?
What is the best way to distinguish the two operators? I tried assigning names to my map and sink. Even though these names show up in other metrics like numRecordsIn, they do not show up in the latency metric.
Is there a way to get the mapping between operator_id and operator_name?
The operator_id is currently a hash value: it is either computed from the hash values of the inputs and of the node itself, or, if you have set a UID for an operator via uid, it is computed as the murmur3_128 hash of that id.
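For illustration, attaching a stable uid (and a name, which shows up in metrics such as numRecordsIn) looks roughly like this; the source, mapper, and sink classes here are hypothetical:

DataStream<Event> enriched = source
        .map(new EnrichMapper())
        .uid("enrich-map")    // operator_id becomes the murmur3_128 hash of this uid
        .name("Enrich Map");  // visible in metrics like numRecordsIn, not in the latency metric

enriched.addSink(new EventSink())
        .uid("event-sink")
        .name("Event Sink");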
Please open a JIRA issue to add this feature to Flink.

Does anyone have a good example of a ProcessFunction that sums or aggregates data at some frequency

I am looking to mimic the behaviour of a window().reduce() operation, but without a key, at the task manager level. Sort of like what .windowAll().reduce() does for a stream, except that I am looking to get individual results from each task manager.
I tried searching for "flink processFunction examples" but I'm not finding anything useful to look at.
For ProcessFunction examples, I suggest the examples in the Flink docs and in the Flink training materials.
Another approach would be to use windows with a random key selector. That's not as easy as it sounds: you can't just key by a random number, because the value of the key must be deterministic for each stream element. Instead, you can add a field that you set to a random value, and then keyBy that field. Compared to the ProcessFunction approach this will force a shuffle, but it is simpler.
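A rough sketch of that random-key approach, assuming an input DataStream<Long> called events and a made-up bucket count:

import java.util.concurrent.ThreadLocalRandom;

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

final int numBuckets = 4; // roughly one bucket per parallel subtask

// Tag each element once with a random bucket. The key is deterministic per
// element because it is stored in the element itself, not recomputed later.
DataStream<Tuple2<Integer, Long>> tagged = events
        .map(new MapFunction<Long, Tuple2<Integer, Long>>() {
            @Override
            public Tuple2<Integer, Long> map(Long value) {
                return Tuple2.of(ThreadLocalRandom.current().nextInt(numBuckets), value);
            }
        });

// Per-bucket sums every 10 seconds, similar to a window().reduce() per "partition".
tagged.keyBy(0)
        .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
        .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1));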

How to make this query using Prometheus?

I'm really new to Prometheus, and for the moment I want to run some test queries to become a bit more familiar with it.
So the query container_last_seen[10s] returns an array:
container_last_seen{container_label_com_docker_compose_config_hash="dc8a2ab1347ad16ab37ff0ad03f3a00f86b381ea2d85d45a11367331526c3640",container_label_com_docker_compose_container_number="1",container_label_com_docker_compose_oneoff="False",container_label_com_docker_compose_project="dockprom",container_label_com_docker_compose_service="cadvisor",container_label_com_docker_compose_version="1.10.0",container_label_org_label_schema_group="monitoring",id="/docker/2b448d19a33b50411941a55435b03f5a4af19e3b3e9581054a67e4da3363ef19",image="google/cadvisor:v0.24.1",instance="cadvisor:8080",job="cadvisor",name="cadvisor"}
And I want to get only the attribute name.
So my idea was to do something like this:
container_last_seen[10s][name]
But I get a parse error. So how can I make this query?
It may seem a little counterintuitive for this purpose, but the aggregation operators allow reducing labels with the by and without clauses.
sum by(name) (container_last_seen{..criteria..})
should get you closer to what you are wanting by returning objects with only the name key.
I think you want to go a little further though - you don't want values and you don't want the object part - you just want strings. Unfortunately Prometheus deals with numeric metrics that can have labels, and specifically not string metrics.
While it requires additional software, it is officially recommended by Prometheus so I will mention it here as it gets you very close to what I believe is your desired solution:
If you were to chart that query in Grafana, either with all the labels or just the name label, the legend format {{name}} should get you exactly what you want. Grafana also provides label_values to help with filtering for this purpose.
Lastly, if this is not the right direction for you: for intensive string-based metrics, the ELK/EFK stack may be a better fit. There are projects like prometheus-es-exporter that can report the results of Elasticsearch queries as metrics.
This is not possible, as labels like 'name' are separate from the metric value. You should look at the JSON that the query and query_range endpoints return to see how this is exposed.

Obtain KeyedStream from custom partitioning in Flink

I know that Flink comes with custom partitioning APIs. However, the problem is that, after invoking partitionCustom on a DataStream, you get back a DataStream and not a KeyedStream.
On the other hand, you cannot override the partitioning strategy for a KeyedStream.
I do want to use KeyedStream, because the API for DataStream does not have reduce and sum operators and because of automatically partitioned internal state.
I mean, if the word count is:
words.map(s -> Tuple2.of(s, 1)).keyBy(0).sum(1)
I wish I could write:
words.map(s -> Tuple2.of(s, 1)).partitionCustom(myPartitioner, 0).sum(1)
Is there any way to accomplish this?
Thank you!
According to Flink's documentation (as of version 1.2.1), what partitioners do is partition the data physically with respect to their keys: they only determine on which physical partition (machine) an element is stored, but they do not logically group the data into a keyed stream. To do the summation, we still need to group the elements by key using the keyBy operator; only then are we allowed to call sum.
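Put in terms of the snippet from the question (myPartitioner being the questioner's partitioner), that means something like the following sketch - the custom partitioning only controls physical placement, and the logical grouping for sum still has to come from keyBy, which introduces its own repartitioning:

words.map(s -> Tuple2.of(s, 1))
     .partitionCustom(myPartitioner, 0) // physical placement only, still a DataStream
     .keyBy(0)                          // logical grouping; triggers its own key-based partitioning
     .sum(1);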
For details, please refer to https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/datastream_api.html#physical-partitioning :)
