Data streams re-use across Flink transformations - apache-flink

I have a DataStream (let's say inStream) on which I need to apply two different lists of transformations to generate two different output streams (let's say outStream1 and outStream2).
inStream is also constructed after applying a complex list of transformations on the source stream.
My question: since inStream needs to be reused across two branches of transformations, is there a way to cache inStream and reuse it across the two branches?
In Spark this is solved with the rdd.cache() method, which caches the data in memory so that it can be reused across transformations. Does a similar construct exist in Flink?
My other question is how program execution gets triggered in Flink.
In Spark, a program is evaluated lazily and execution is triggered when a Spark action is encountered.
As I understand it (please correct me if I am wrong), the Flink JobManager creates an execution graph that determines the program flow. But does it handle the DataStream reuse described above on its own?
Thanks

Related

Ordering Guarantees in FlinkKinesisProducer

I'm implementing a real-time streaming ETL pipeline using Apache Flink. The pipeline has these characteristics:
Ingest a single Kinesis stream: stream-A
The stream has records of type EventA which have a category_id, representing distinct logical streams
Because of how they are written to Kinesis (separate producer per category_id, writing serially), these logical streams are guaranteed to be read in order by FlinkKinesisConsumer
Flink does some in-order processing work, keyed by the category_id, generating a stream of EventB data records
These records are written to Kinesis stream-B
A separate service ingests the data from stream-B and it is important that this happens in order.
The processing looks something like this:
val in_events = env.addSource(new FlinkKinesisConsumer[EventA]( // these are guaranteed ordered
    "stream-A",
    new EventASchema,
    consumerConfig))
val out_events = in_events
    .keyBy(event => event.category_id)
    .process(new EventAStreamProcessor)
out_events.addSink(new FlinkKinesisProducer[EventB](
    "stream-B",
    new EventBSchema,
    producerConfig))
// a separate service reads the out_events and wants them in-order
Based on the guidelines here, it seems like it is impossible to guarantee the ordering of EventB records written to the sink. I only care that events with the same category_id are written in order, since the downstream service will keyBy this. Thinking from first principles, if I were to implement the threading manually, I would have a separate queue per category_id KeyedStream and ensure those are written serially to Kinesis (this seems like a strict generalization over what is done by default, which is to use a ThreadPool, which has a single global queue). Does the FlinkKinesisProducer support this mechanism or is there a way around this limitation using Flink's keyBy or similar construct? Separate sink per category_id maybe? For this last option, I'm anticipating 100k category_ids so this might have too much of a memory overhead.
One option is to buffer events read from stream-B in the downstream service to order them (with high probability, if the buffer window is large). This should work in theory, but it makes the downstream service more complex than it needs to be, precludes determinism since it depends on the random timing of network calls, and, more importantly, adds latency to the pipeline (though maybe less latency overall than forcing serial writes to stream-B?). So ideally, I'm hoping to go with another option. This feels like a common problem, so perhaps there are more clever solutions out there, or I'm missing something obvious.
Many thanks in advance.

Flink when to split stream to jobs, using uid, rebalance

I am pretty new to Flink and about to deploy our first production version. We have a stream of data, and a stateful filter checks whether the data is new.
1. Would it be better to split the stream into different jobs to gain more control over the parallelism, as shown in option 1, or is option 2 better?
2. Following the documentation's recommendation, should I set a uid per operator? E.g.:
dataStream
    .uid("firstid")
    .keyBy(0)
    .flatMap(flatMapFunction)
    .uid("mappedId")
3. Should I add rebalance after each uid, if at all?
4. What is the difference between calling setMaxParallelism as described here and setting the parallelism from the Flink UI/CLI?
You only need to define .uid("someName") for your stateful operators. Not much need for operators which do not hold state as there is nothing in the savepoints that needs to be mapped back to them (more on this here). Won't hurt if you do though.
rebalance will only help you in the presence of data skew, and only if you aren't using keyed streams. If you process data based on a key and your load isn't uniformly distributed across your keys (i.e. you have "hot" keys), then rebalancing won't help you much.
In your example above I would start with Option 2 and potentially move to Option 1 if the job proves to be too heavy. In general, stateless operators are very fast in Flink, so unless you want to add other consumers to the output of your stateful filter, don't bother splitting it up at this stage.
There is no right or wrong here; it depends on your problem. Start simple and take it from there.
[Update] Re 4: setMaxParallelism, if I am not mistaken, defines the number of key groups and thus the maximum number of parallel instances your stream can be rescaled to. Flink uses this internally, but it doesn't set the parallelism of your job. You usually set it to some multiple of the actual parallelism you set for your job (via -p <n> in the CLI/UI when you deploy it).
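To make the relationship between max parallelism and actual parallelism more concrete, here is a minimal plain-Java sketch of how a key could be mapped first to a key group and then to a parallel subtask. It does not use the Flink API. The class and method names are hypothetical, `Math.floorMod(key.hashCode(), maxParallelism)` is only a stand-in for Flink's internal hashing, and the contiguous-range assignment is an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class KeyGroupSketch {

    // Hypothetical stand-in for Flink's internal key hashing:
    // map a key to one of maxParallelism key groups.
    static int keyGroupFor(Object key, int maxParallelism) {
        return Math.floorMod(key.hashCode(), maxParallelism);
    }

    // Assumed assignment: each subtask owns a contiguous range of key groups.
    static int subtaskFor(int keyGroup, int maxParallelism, int parallelism) {
        return keyGroup * parallelism / maxParallelism;
    }

    // Counts how many key groups each subtask owns for the given settings.
    static List<Integer> keyGroupsPerSubtask(int maxParallelism, int parallelism) {
        int[] counts = new int[parallelism];
        for (int kg = 0; kg < maxParallelism; kg++) {
            counts[subtaskFor(kg, maxParallelism, parallelism)]++;
        }
        List<Integer> result = new ArrayList<>();
        for (int c : counts) {
            result.add(c);
        }
        return result;
    }

    public static void main(String[] args) {
        // Max parallelism is a multiple of the parallelism: even split.
        System.out.println(keyGroupsPerSubtask(128, 4));
        // Max parallelism is not a multiple of the parallelism: uneven split.
        System.out.println(keyGroupsPerSubtask(128, 6));
    }
}
```

Under these assumptions, a max parallelism that is a multiple of the job parallelism gives every subtask the same number of key groups, while other combinations leave some subtasks with one more key group than others.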

Flink Map function with multi-parallelism, and how to make sure the order of the final sink

A simplified version of the pipeline is as follows:
source = env.addSource(kafkaConsumer)
    .map(func).setParallelism(2).sink()
How can I make sure the output stays in order?
To begin, let's assume that everything else in your example has a parallelism of one, and only the map function is going to run in parallel. (Though to actually achieve that, it would have to be configured somewhere; the default parallelism is higher than one.)
Let's also assume that your Kafka consumer is reading from a single topic with one partition, and you are asking how to implement a parallel transformation that preserves the ordering that was present in the input.
With those assumptions, the answer is that there's not a lot you can do. There's a race between the two instances of the map operator, and the non-parallel sink is going to interleave those two incoming streams in an arbitrary way.
If the stream records are marked in some way, say with ascending timestamps or ids, then you could hypothetically introduce some buffering and re-establish the original ordering, either in a custom sink or in a non-parallel RichCoMap function between your map and sink operators.
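The buffering idea can be sketched in plain Java, independent of the Flink API. This is only an illustration under a strong assumption: every record carries a consecutive id with no gaps, starting from a known first id. The class `ReorderBuffer` and its methods are hypothetical names. With timestamps, or ids that can have gaps, you would additionally need a timeout or watermark to decide when to stop waiting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

public class ReorderBuffer {

    // Ids that have arrived but cannot be released yet, smallest first.
    private final PriorityQueue<Long> pending = new PriorityQueue<>();
    // The next id that may be released (assumes consecutive ids, no gaps).
    private long nextExpected;

    ReorderBuffer(long firstId) {
        this.nextExpected = firstId;
    }

    // Accepts one id and returns every id that can now be released in order.
    List<Long> accept(long id) {
        pending.add(id);
        List<Long> ready = new ArrayList<>();
        while (!pending.isEmpty() && pending.peek() == nextExpected) {
            ready.add(pending.poll());
            nextExpected++;
        }
        return ready;
    }

    // Convenience helper: feeds all ids through a buffer and collects the output.
    static List<Long> reorder(long firstId, long... ids) {
        ReorderBuffer buffer = new ReorderBuffer(firstId);
        List<Long> out = new ArrayList<>();
        for (long id : ids) {
            out.addAll(buffer.accept(id));
        }
        return out;
    }

    public static void main(String[] args) {
        // Ids interleaved by two racing map instances.
        System.out.println(reorder(0, 1, 0, 3, 2));
    }
}
```

The buffer holds a record back until all of its predecessors have arrived, so it trades memory and latency for ordering. In a Flink job this logic would have to live in a non-parallel operator or sink, as described above.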
If on the other hand, your source is partitioned or keyed in some way, and you only need to maintain or establish an ordering on a per-key basis, then there are better answers.

Flink - asynchronous windows

This is a two-question topic about Flink streaming, based on experiments I did myself, on which I need some clarification. The questions are:
1. When we use windows on a KeyedStream in Flink, are the computations of the apply function asynchronous? Specifically, will Flink create separate windows per key and process these windows independently of one another?
2. Assume that we use the apply function (to do some computations) on a windowed stream, which then creates a DataStream. If we apply some transformations to the resulting DataStream, will Flink hold the entire WindowedStream in memory? And will Flink wait until all the apply functions of the WindowedStream have finished before moving on to the transformations on the resulting stream?
In all my experiments I used event time and read the data from a file. I observed the behavior described above and would like to confirm it.
Ad 1. Yes, each key is processed independently. This is also how window computations are parallelised.
Ad 2. Flink keeps the window state until the window can be emitted (plus some extra time if allowedLateness is set). Once the results for a window are emitted (in your case, forwarded to the next operator), the state can be cleared.

Integration of non-parallelizable task with high memory demands in Flink pipeline

I am using Flink in a Yarn Cluster to process data using various sources and sinks. At some point in the topology, there is an operation that cannot be parallelized and furthermore needs access to a lot of memory. In fact, the API I am using for this step needs its input in array-form. Right now, I have implemented it something like
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Pojo> input = ...
List<Pojo> inputList = input.collect();
Pojo[] inputArray = inputList.toArray(new Pojo[0]); // typed array; the no-arg toArray() returns Object[]
Pojo[] resultArray = costlyOperation(inputArray);
List<Pojo> resultList = Arrays.asList(resultArray);
DataSet<Pojo> result = env.fromCollection(resultList);
result.otherStuff();
This solution seems rather unnatural. Is there a straightforward way to incorporate this task into my Flink pipeline?
I have read in another thread that the collect() function should not be used for large datasets. I believe the fact that collecting the dataset into a list and then an array does not happen in parallel is not my biggest problem right now, but would you still prefer to write what I called input above into a file and build an array from that?
I have also seen the options to configure managed memory in Flink. In principle, it might be possible to tune this so that enough heap is left for the expensive operation. On the other hand, I am afraid that the performance of all the other operators in the topology might suffer. What is your opinion on this?
You could replace the "collect->array->costlyOperation->array->fromCollection" step with a group reduce on a surrogate key that has the same value for all tuples, so that you get only a single group (and thus a single partition). This would be the Flink-like way.
Your costly operation, implemented as a GroupReduceFunction, receives an iterator over the data. If you do not need to access all data "at once", you also save heap space, as you do not need to keep all data in memory within the reduce (but this of course depends on what your costly operation computes).
As an alternative, you could also call reduce() without a previous groupBy(). However, you then get neither an iterator nor an output collector and can only compute partial aggregates (see "Reduce" in https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations).
Using Flink-style operations has the advantage that the data is kept in the cluster. If you call collect(), the result is transferred to the client, the costly operation is executed in the client, and the result is transferred back to the cluster. Furthermore, if the input is large, Flink will automatically spill the intermediate result to disk for you.

Resources