Get Operator Name In Flink Latency Metric - apache-flink

I am trying to estimate end to end tuple latency of my events using the latency metrics exported by Flink (I am using a Prometheus metrics reporter). All is good and I can see the latency metric in my Grafana/Prom dashboard. Looks something like
flink_taskmanager_job_latency_source_id_source_subtask_index_operator_id_operator_subtask_index_latency{
host="",instance="",job="",
job_id="",job_name="",operator_id="",operator_subtask_index="0",
quantile="0.99",source_id="",source_subtask_index="0",tm_id=""}
This test job I have is a simple source->map->sink operation, with parallelism set to 1. I can see from the Flink dashboard that all of them get chained together into one task. For one run of my job, I see two sets of latency metrics. Each set shows all the quantiles (.5, .95, ...). The only thing different between the two sets is the operator_id. I assumed this means one operator_id belongs to the map operator and the other belongs to the sink.
Now my problem is that there is no intuitive way to distinguish between the two (to find out which operator_id is the map vs. the sink) just by looking at the metrics. So my questions are essentially:
Is my assumption correct?
What is the best way to distinguish the two operators? I tried assigning names to my map and sink. Even though these names show up in other metrics like numRecordsIn, they do not show up in the latency metric.
Is there a way to get the mapping between operator_id and operator_name?

The operator_id is currently a hash value. It is either computed from the hash values of the operator's inputs and the node itself, or, if you have set a UID via uid for an operator, it is the murmur3_128 hash of that id.
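This means that if you assign explicit uids to your map and sink (e.g. .map(...).uid("my-map").name("my-map")), you can derive the operator_id yourself and match it against the metric labels. A rough, untested sketch, assuming the ID really is the murmur3_128 (seed 0) hash of the uid string as described above and that this matches the operator_id label in your Flink version (the uid value below is hypothetical; it uses Guava, which Flink itself depends on):

import com.google.common.hash.Hashing;
import java.nio.charset.StandardCharsets;

public class OperatorIdFromUid {
    public static void main(String[] args) {
        // the string passed to .uid(...) in the job (hypothetical value)
        String uid = "my-map";
        // hex string as it should appear in the operator_id metric label
        String operatorId = Hashing.murmur3_128(0)
                .hashString(uid, StandardCharsets.UTF_8)
                .toString();
        System.out.println(uid + " -> " + operatorId);
    }
}

Computing this hash for each uid you assigned gives you an operator_id -> operator name mapping, at least for operators with explicit uids.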
Please open a JIRA issue to add this feature to Flink.

Related

An Alternative Approach for Broadcast stream

I have two different streams in my flink job;
The first one represents a set of rules which will be applied to the actual stream. I've just broadcast this set of rules. Changes come from Kafka, and there can be a few changes each hour (like 100-200 per hour).
The second one is the actual stream, called the customer stream, which contains some numeric values for each customer. It is basically a keyed stream based on customerId.
So, basically, I'm preparing my actual customer stream data, then applying some rules on the keyed stream, and getting the calculated results.
I also know which rules should be calculated by checking a field of the customer stream data. For example, if a field of the customer data contains value X, the job only has to apply rule1, rule2, and rule5 instead of calculating all the rules (let's say there are 90 rules) for the given customer. Of course, in this case, I have to get and filter all rules by the field value of the incoming data.
Everything is ok in this scenario, and it perfectly fits the broadcast pattern. But the problem here is the broadcast size: sometimes it can be very large, like 20 GB or more, which I suppose is far too big for broadcast state.
Is there any alternative approach to get around this limitation? For example, using a RocksDB backend (I know it's not supported, but I could implement a custom state backend for broadcast state if there is no fundamental limitation).
Would anything change if I connected both streams without broadcasting the rules stream?
From your description it sounds like you might be able to avoid broadcasting the rules (by turning this around and broadcasting the primary stream to the rules). Maybe this could work:
1. Make sure each incoming customer event has a unique ID.
2. Key-partition the rules so that each rule has a distinct key.
3. Broadcast the primary stream events to the rules (and don't store the customer events).
4. Union the outputs from applying all the rules.
5. KeyBy the unique ID from step (1) to bring together the results from applying each of the rules to a given customer event, and assemble a unified result.
https://gist.github.com/alpinegizmo/5d5f24397a6db7d8fabc1b12a15eeca6 shows how to do fan-out/fan-in with Flink -- see that for an example of steps 1, 4, and 5 above.
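Here is a rough, untested sketch of steps 2 and 3 (and implicitly 4, since a single operator produces one output stream) using a KeyedBroadcastProcessFunction; the Rule and CustomerEvent types and their methods are hypothetical placeholders, not from your job:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastEventsToRulesSketch {

    // hypothetical POJOs standing in for your rule and customer event types
    public static class Rule {
        public String ruleId;
        public boolean appliesTo(CustomerEvent e) { return true; }
        public double evaluate(CustomerEvent e) { return 0.0; }
    }
    public static class CustomerEvent {
        public String eventId;
    }

    // output: (eventId, ruleId, partial result)
    public static DataStream<Tuple3<String, String, Double>> applyRules(
            DataStream<Rule> rules, DataStream<CustomerEvent> events) {

        // broadcast() needs a state descriptor even though the broadcast state
        // itself is not used here (events are applied immediately, not stored)
        MapStateDescriptor<String, CustomerEvent> unused = new MapStateDescriptor<>(
                "events", Types.STRING, TypeInformation.of(CustomerEvent.class));
        BroadcastStream<CustomerEvent> broadcastEvents = events.broadcast(unused);

        return rules
                .keyBy(r -> r.ruleId)                      // step 2: partition the rules
                .connect(broadcastEvents)                  // step 3: broadcast the events
                .process(new KeyedBroadcastProcessFunction<String, Rule, CustomerEvent,
                        Tuple3<String, String, Double>>() {

                    private final ValueStateDescriptor<Rule> ruleDesc =
                            new ValueStateDescriptor<>("rule", Rule.class);

                    @Override
                    public void processElement(Rule rule, ReadOnlyContext ctx,
                            Collector<Tuple3<String, String, Double>> out) throws Exception {
                        // keep the latest version of this rule in keyed state
                        getRuntimeContext().getState(ruleDesc).update(rule);
                    }

                    @Override
                    public void processBroadcastElement(CustomerEvent event, Context ctx,
                            Collector<Tuple3<String, String, Double>> out) throws Exception {
                        // apply this event to every rule stored on this subtask
                        ctx.applyToKeyedState(ruleDesc, (String ruleId, ValueState<Rule> state) -> {
                            Rule rule = state.value();
                            if (rule != null && rule.appliesTo(event)) {
                                out.collect(Tuple3.of(event.eventId, ruleId, rule.evaluate(event)));
                            }
                        });
                    }
                });
    }
}

Downstream you would keyBy the eventId field of the output (step 5) and assemble the per-rule results for each customer event, e.g. in a short window, as shown in the gist above.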
If there's no way to partition the rules dataset, then I don't think you get a win by trying to connect streams.
I would check out Apache Ignite as a way of sharing the rules across all of the subtasks processing the customer stream. See this article for a description of how this could be done.

Local aggregation for data stream in Flink

I'm trying to find a good way to combine a Flink keyed WindowedStream locally for a Flink application. The idea is similar to a combiner in MapReduce: combine partial results in each partition (or mapper) before the data (which is still a keyed WindowedStream) is sent to a global aggregator (or reducer). The closest function I found is aggregate, but I wasn't able to find a good example of its usage on a WindowedStream.
It looks like aggregate doesn't allow a WindowedStream output. Is there any other way to solve this?
There have been some initiatives to provide pre-aggregation in Flink, but you have to implement your own operator. In the case of the streaming environment, you have to extend the class AbstractStreamOperator.
KurtYoung implemented a BundleOperator. You can also use the Table API on top of the stream API; the Table API already provides local aggregation. I also have an example of a pre-aggregate operator that I implemented myself. Usually, the drawback of all those solutions is that you have to set either the number of items to pre-aggregate or a timeout after which to pre-aggregate. If you don't, you can run out of memory, or you may never emit items (if the threshold number of items is never reached). In other words, they are rule-based. What I would like to have is something cost-based and more dynamic, something that adjusts those parameters at run-time.
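For reference, here is a rough, untested sketch of such a rule-based (count-triggered) pre-aggregation operator; the class name, types, and threshold are placeholders, and a production version would also need to flush on close() and on checkpoints and probably add a timeout:

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

import java.util.HashMap;
import java.util.Map;

// combines (key, count) records locally and only emits once maxBundleSize
// elements have been seen, similar in spirit to a MapReduce combiner
public class PreAggregateOperator
        extends AbstractStreamOperator<Tuple2<String, Long>>
        implements OneInputStreamOperator<Tuple2<String, Long>, Tuple2<String, Long>> {

    private final int maxBundleSize;              // the rule-based threshold
    private transient Map<String, Long> bundle;
    private transient int count;

    public PreAggregateOperator(int maxBundleSize) {
        this.maxBundleSize = maxBundleSize;
    }

    @Override
    public void open() throws Exception {
        super.open();
        bundle = new HashMap<>();
        count = 0;
    }

    @Override
    public void processElement(StreamRecord<Tuple2<String, Long>> element) throws Exception {
        Tuple2<String, Long> value = element.getValue();
        bundle.merge(value.f0, value.f1, Long::sum);   // partial aggregation per key
        if (++count >= maxBundleSize) {
            flush();
        }
    }

    private void flush() {
        for (Map.Entry<String, Long> e : bundle.entrySet()) {
            output.collect(new StreamRecord<>(Tuple2.of(e.getKey(), e.getValue())));
        }
        bundle.clear();
        count = 0;
    }
}

It can be wired in before the keyBy with something like stream.transform("pre-aggregate", typeInfo, new PreAggregateOperator(1000)).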
I hope these links can help you. And, if you have ideas for the cost-based solution, please come to talk with me =).

Implement bunch of transformations applied to same source stream in Apache Flink in parallel and combine result

Could you please help me - I'm trying to use Apache Flink for machine learning tasks with external ensemble/tree libs like XGBoost, so my workflow will be like this:
receive a single stream of data whose atomic event looks like a simple vector event=(X1, X2, X3...Xn); it can be imagined as POJO fields, so initially we have DataStream<event> source=...
a lot of feature extraction code is applied to the same event source:
feature1 = source.map(X1...Xn), feature2 = source.map(X1...Xn), etc. For simplicity, let DataStream<int> feature(i) = source.map() for all features
then I need to create a vector with the extracted features (feature1, feature2, ...featureK); for now it will be 40-50 features, but I'm sure it will contain more items in the future and can easily contain 100-500 features or more
put these extracted features into dataset/table columns over a 10-minute window and run the final machine learning task on each 10 minutes of data
In simple words, I need to apply several quite different map operations to the same single event in the stream and then combine the results from all map functions into a single vector.
So for now I can't figure out how to implement the final reduce step and run all feature extraction map jobs in parallel, if possible. I spent several days on the Flink docs site, YouTube videos, and googling, and read Flink's sources, but it seems I'm really stuck here.
The easy solution here would be to use a single map operation and run each feature extraction sequentially, one by one, in a huge map body, and then return the final vector (Feature1...FeatureK) for each input event. But that seems crazy and non-optimal.
Another solution would be to use a join for each pair of features, since all feature DataStreams have the same initial event and the same key and only apply some transformation code, but it looks ugly: writing 50 joins with some window. And I think that joins and cogroups were developed for joining different streams from different sources, not for such map/reduce operations.
It seems to me there should be something simple for all these map operations that I'm missing.
Could you please point me how you guys implement such tasks in Flink, and if possible with example of code?
Thanks!
What is the number of events per second that you wish to process? If it's high enough (~number of machines * number of cores) you should be just fine processing more events simultaneously. Instead of scaling with the number of features, scale with the number of events. If you have a single data source you could still randomly shuffle events before applying your transformations.
Another solution might be to:
1. Assign a unique eventId and split the original event using flatMap into tuples: <featureId, Xi, eventId>.
2. keyBy(featureId, eventId) (or maybe do random partitioning with shuffle()?).
3. Perform your transformations.
4. keyBy(eventId, ...).
5. Window and reduce back to one record per event.
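A rough, untested sketch of those five steps (the Event POJO, the feature count, the per-feature transformation, and the window length are placeholders):

import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

import java.util.HashMap;
import java.util.Map;

public class FeatureFanOutSketch {

    private static final int NUM_FEATURES = 50;  // illustrative

    // hypothetical POJO with a unique eventId and the raw fields X1..Xn
    public static class Event {
        public String eventId;
        public double[] fields;
    }

    public static DataStream<Map<Integer, Double>> extract(DataStream<Event> source) {
        return source
                // 1. fan out: one (featureId, eventId, rawValue) tuple per feature
                .flatMap(new FlatMapFunction<Event, Tuple3<Integer, String, Double>>() {
                    @Override
                    public void flatMap(Event e, Collector<Tuple3<Integer, String, Double>> out) {
                        for (int featureId = 0; featureId < NUM_FEATURES; featureId++) {
                            out.collect(Tuple3.of(featureId, e.eventId,
                                    e.fields[featureId % e.fields.length]));
                        }
                    }
                })
                // 2. spread the per-feature work across subtasks
                .keyBy(t -> t.f0)
                // 3. per-feature transformation (placeholder: square the value)
                .map(new MapFunction<Tuple3<Integer, String, Double>, Tuple3<Integer, String, Double>>() {
                    @Override
                    public Tuple3<Integer, String, Double> map(Tuple3<Integer, String, Double> t) {
                        return Tuple3.of(t.f0, t.f1, t.f2 * t.f2);
                    }
                })
                // 4. + 5. bring all features of one event back together
                .keyBy(t -> t.f1)
                .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<Tuple3<Integer, String, Double>,
                        Map<Integer, Double>, String, TimeWindow>() {
                    @Override
                    public void process(String eventId, Context ctx,
                            Iterable<Tuple3<Integer, String, Double>> features,
                            Collector<Map<Integer, Double>> out) {
                        Map<Integer, Double> vector = new HashMap<>();
                        for (Tuple3<Integer, String, Double> f : features) {
                            vector.put(f.f0, f.f2);
                        }
                        out.collect(vector);   // one assembled feature vector per event
                    }
                });
    }
}

The fan-in window only needs to be long enough for all features of one event to arrive; the 10-minute ML window from the question would then sit on top of these assembled vectors.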

Does anyone have a good example of a ProcessFunction that sums or aggregates data at some frequency

I am looking to mimic the behaviour of a window().reduce() operation, but without a key, at the task manager level. Sort of like what .windowAll().reduce() does for a stream, but I am looking to get individual results from each task manager.
I tried searching for "flink processFunction examples" but am not finding anything useful to look at.
For ProcessFunction examples, I suggest the examples in the Flink docs and in the Flink training materials.
Another approach would be to use windows with a random key selector. That's not as easy as it sounds: you can't just key by a random number, as the value of the key must be deterministic for each stream element. So you could add a field that you set to a random value, and then keyBy that field. Compared to the ProcessFunction approach this will force a shuffle, but it is simpler.
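Combining the two suggestions, here is a rough sketch of a KeyedProcessFunction that sums values per key and emits the running sum once a minute using a processing-time timer (the class name, types, and the one-minute interval are placeholders); it assumes the stream has already been keyed, e.g. by the random field mentioned above:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class SumEveryMinute
        extends KeyedProcessFunction<String, Tuple2<String, Long>, Tuple2<String, Long>> {

    private transient ValueState<Long> sum;     // running sum for the current key
    private transient ValueState<Long> timer;   // timestamp of the pending timer, if any

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Long.class));
        timer = getRuntimeContext().getState(new ValueStateDescriptor<>("timer", Long.class));
    }

    @Override
    public void processElement(Tuple2<String, Long> value, Context ctx,
                               Collector<Tuple2<String, Long>> out) throws Exception {
        Long current = sum.value();
        sum.update(current == null ? value.f1 : current + value.f1);

        // register a timer 60 seconds from now if none is pending for this key
        if (timer.value() == null) {
            long fireAt = ctx.timerService().currentProcessingTime() + 60_000L;
            ctx.timerService().registerProcessingTimeTimer(fireAt);
            timer.update(fireAt);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<String, Long>> out) throws Exception {
        // emit the aggregate for this key and reset
        out.collect(Tuple2.of(ctx.getCurrentKey(), sum.value()));
        sum.clear();
        timer.clear();
    }
}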

CouchBase view get for multiple ranges

I'm evaluating Couchbase for an application and trying to figure out something about range queries on views. I know I can do a view get for a single key, multiple keys, or a range. Can I do a get for multiple ranges? I.e., I want to retrieve items with view keys 0-10, 50-100, 5238-81902. I might simultaneously need 100 different ranges, so having to make 100 requests to the database seems like a lot of overhead.
As far as I know, there is no way in Couchbase to get values from multiple ranges with one view query. Maybe there are (or will be) some features in Couchbase N1QL for this, but I haven't worked with it.
To answer your question: 100 requests will not be a big overhead. Couchbase is quite fast and it's designed to handle a lot of operations per second. Also, if your view is correctly designed, it will not be "recalculated" on each query.
There is also another way:
1. Determine the minimum and maximum values across your ranges (0..81902 according to your example).
2. Query the view so that it returns only document ids and the value the range is based on, without including the full docs in the result.
3. On the client side, filter the array of results from the previous step according to your ranges (0-10, 50-100, 5238-81902).
4. Use getMulti with the document ids that are left in the array.
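A small sketch of step 3 in Java (hypothetical types; the SDK calls for the view query and getMulti are omitted): filter the (id, key) rows returned by the wide range query against the wanted ranges, then fetch only the surviving ids:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RangeFilter {

    // one row of the view result: document id plus the numeric key the view emitted
    public static class ViewRow {
        public final String id;
        public final long key;
        public ViewRow(String id, long key) { this.id = id; this.key = key; }
    }

    // keep only the ids whose key falls into one of the [from, to] ranges
    public static List<String> idsInRanges(List<ViewRow> rows, long[][] ranges) {
        List<String> ids = new ArrayList<>();
        for (ViewRow row : rows) {
            for (long[] range : ranges) {
                if (row.key >= range[0] && row.key <= range[1]) {
                    ids.add(row.id);
                    break;
                }
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // ranges from the question: 0-10, 50-100, 5238-81902
        long[][] ranges = {{0, 10}, {50, 100}, {5238, 81902}};
        List<ViewRow> rows = Arrays.asList(new ViewRow("doc1", 7), new ViewRow("doc2", 42));
        System.out.println(idsInRanges(rows, ranges));   // [doc1]; pass these ids to getMulti
    }
}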
I don't know your data structure, so you can try both ways, test them, and choose the one that best fits your needs.
