Does Flink keyby on the same field which isn't changed cause a shuffle? - apache-flink

dataStream.map(func1).keyBy("key") //(1)
.process(func2).keyBy("key") //(2)
.timeWindow().aggregate(func3).addSink(sink)
The process() method doesn't change the key field of the records. Given that the parallelism of all operators is 2, does the keyBy() at (2) also result in a network shuffle? Or does the keyBy() at (2) behave like a forward strategy and avoid the network communication cost, since the key value is unchanged?
Thx soooo much~

A keyBy is always expensive, because it forces the records to go through serialization and deserialization. But in the case where the communication is local (i.e., within the same task slot), Flink will use a shared buffer to communicate the serialized bytes, rather than going through the whole Netty TCP stack. So yes, in your case the second keyBy is less expensive than the first one. But I would not say the cost is small.
If you know that the keyBy is completely unnecessary, you can use reinterpretAsKeyedStream to get back to having a KeyedStream again without any of this overhead.
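To illustrate why an unchanged key sends a record to the same parallel instance at both keyBy calls, here is a minimal plain-Java sketch. It is not Flink code: the class and method names are made up, the maximum parallelism of 128 is an assumed value, and the plain hashCode() stands in for the extra hash mixing Flink applies before assigning a key group. The point is only that the target subtask is a deterministic function of the key, so if process() leaves the key alone, keyBy (2) picks the same subtask index as keyBy (1).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for key-based routing; a simplification, not Flink's implementation.
final class KeyRoutingSketch {

    static final int ASSUMED_MAX_PARALLELISM = 128; // assumption for this sketch

    // Step 1: map the key to a key group. Step 2: map the key group to a subtask index.
    static int subtaskFor(String key, int parallelism) {
        int keyGroup = Math.floorMod(key.hashCode(), ASSUMED_MAX_PARALLELISM);
        return keyGroup * parallelism / ASSUMED_MAX_PARALLELISM;
    }

    // Counts records whose target subtask differs between keyBy (1) and keyBy (2)
    // when the operator in between leaves the key untouched.
    static int recordsThatChangeSubtask(List<String> keys, int parallelism) {
        int moved = 0;
        for (String key : keys) {
            int afterFirstKeyBy = subtaskFor(key, parallelism);
            String keyAfterProcess = key; // process() does not modify the key
            int afterSecondKeyBy = subtaskFor(keyAfterProcess, parallelism);
            if (afterFirstKeyBy != afterSecondKeyBy) {
                moved++;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("a", "b", "0", "customer-42");
        System.out.println(recordsThatChangeSubtask(keys, 2)); // prints 0
    }
}
```

No record changes its subtask index, which is what allows the exchange to stay local when both subtasks run in the same task slot. The records are still serialized and deserialized at the keyBy, which is the overhead reinterpretAsKeyedStream avoids.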

Related

Best approach to reduce checkpoint size for broadcast state

I have a fairly large broadcast state (about 62 MB when serialized as state). I noticed that each instance of my operator is saving a copy of this state during checkpointing. With a parallelism of 400, that's about 24 GB of checkpoint state, most of it duplicated.
This matches the description of Important Considerations in the docs. On the other hand, Checkpointing under backpressure says:
Broadcast partitioning is often used to implement a broadcast state which should be equal across all operators. Flink implements the broadcast state by checkpointing only a single copy of the state from subtask 0 of the stateful operator. Upon restore, we send that copy to all of the operators. Therefore it might happen that an operator will get the state with changes applied for a record that it will soon consume from its checkpointed channels.
The bit about "checkpointing only a single copy of the state from subtask 0" doesn't match what I'm seeing, so I'm hoping someone can clarify.
Regardless...is there any typical workaround for this? For example, I could set up my TMs with one slot (even though they have 8 cores), and then use a thread pool to process incoming non-broadcast elements. This would reduce by 8x the parallelism of the operator. Assuming I deal with concurrency issues (threads accessing state while it's being updated), what other issues are there? E.g. can the collector be saved & then safely called asynchronously by a thread? I don't have watermarks, but wondering about things like checkpoint barriers.
Or I could bail on using a broadcast stream, and replicate the data myself (with carefully constructed keys), but that's also a helicopter stunt.
The bit about "checkpointing only a single copy of the state from subtask 0" is incorrect (I verified this with the author of that sentence). In the current implementation of BroadcastState all operators snapshot their state.
I'm afraid that doesn't help answer your real question, but hopefully clarifies the situation.

Partition the whole dataStream in flink at the start of source and maintain the partition till sink

I am consuming trail logs from a queue (Apache Pulsar). I use 5 KeyedProcessFunctions and finally sink the payload to a Postgres DB. I need ordering per customerId for each of the KeyedProcessFunctions. Right now I achieve this with:
Datasource.keyBy(fooKeyFunction).process(processA)
.keyBy(fooKeyFunction).process(processB)
.keyBy(fooKeyFunction).process(processC)
.keyBy(fooKeyFunction).process(processE)
.keyBy(fooKeyFunction).sink(fooSink)
processFunctionC is very time-consuming and takes 30 seconds in the worst case to finish. This leads to backpressure. I tried assigning more slots to processFunctionC, but my throughput never stays constant. It mostly remains below 4 messages per second.
Current slot per processFunction is
processFunctionA: 3
processFunctionB: 30
processFunctionc: 80
processFunctionD: 10
processFunctionC: 10
In Flink UI it shows backpressure starting from the processB, meaning C is very slow.
Is there a way to apply the partitioning logic at the source itself and assign the same number of slots to each processFunction? For example:
dataSource.magicKeyBy(fooKeyFunction).setParallelism(80).process(processA).process(processB).process(processC).process(processE).sink(fooSink)
This way, backpressure would affect only a few of the tasks, instead of being skewed by the multiple keyBy calls.
Another approach that I can think of is to combine all my process functions and the sink into a single process function and apply all of that logic there.
I don't think anything quite like this exists. The closest thing is DataStreamUtils.reinterpretAsKeyedStream, which recreates the KeyedStream without actually sending any data between the operators, since it uses a partitioner that only forwards data locally. This is more or less what you wanted. It still adds a partitioning operator and recreates the KeyedStream under the hood, but it should be simpler and faster, and perhaps it will solve the issue you are facing.
If this does not solve the issue, then I think the best solution would be to group the operators as you suggested, i.e. merge all operators into one bigger operator, so that backpressure is minimized.

Enrich fast stream keyed by (X,Y) with a slowly changing stream keyed by (X) in Flink

I need to enrich my fast changing streamA keyed by (userId, startTripTimestamp) with slowly changing streamB keyed by (userId).
I use Flink 1.8 with the DataStream API. I am considering two approaches:
1. Broadcast streamB and join the streams by userId and most recent timestamp. Would this be the equivalent of a DynamicTable from the Table API? I can see a downside of this solution: the whole of streamB needs to fit into the RAM of each worker node, which increases overall RAM utilization.
2. Generalise the state of streamA to a stream keyed by just (userId), let's name it streamC, so that it has a common key with streamB. Then I am able to union streamC with streamB, order by processing time, and handle both types of events in state. It's more complex to handle the generalised stream (more code in the process function), but it does not consume as much RAM, since streamB does not have to be held on all nodes. Are there any more downsides or upsides to this solution?
I have also seen this proposal https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API where it is said:
In general, most of these follow the pattern of joining a main stream
of high throughput with one or several inputs of slowly changing or
static data:
[...]
Join stream with slowly evolving data: This is very similar to
the above case but the side input that we use for enriching is
evolving over time. This can be done by waiting for some initial data
to be available before processing the main input and the continuously
ingesting new data into the internal side input structure as it
arrives.
Unfortunately, it looks like this feature is still a long way off (https://issues.apache.org/jira/browse/FLINK-6131) and no alternatives are described. Therefore I would like to ask what the currently recommended approach is for the described use case.
I've seen Combining low-latency streams with multiple meta-data streams in Flink (enrichment), but it does not specify what the keys of those streams are, and moreover it was answered at the time of Flink 1.4, so I expect the recommended solution might have changed.
Building on top of what Gaurav Kumar has already answered.
The main question is: do you need to exactly match records from streamA and streamB, or is a best-effort match enough? For example, is it an issue for you that, because of a race condition, some (a lot of?) records from streamA can be processed before some updates from streamB arrive, for example during start-up?
I would suggest taking inspiration from how the Table API solves this issue. A Temporal Table Join is probably the right choice for you, which would leave you with one decision: processing time or event time?
Both of Gaurav Kumar's proposals are implementations of processing-time Temporal Table joins, which assume that records can be joined very loosely and do not have to be timed properly.
If records from streamA and streamB have to be timed properly, then one way or another you have to buffer some of the records from both streams. There are various ways to do it, depending on what semantics you want to achieve. After deciding on that, the actual implementation is not that difficult, and you can take inspiration from the Table API join operators (org.apache.flink.table.runtime.join package in flink-table-planner module).
Side inputs (that you referenced) and/or input selection are just tools for controlling the amount of unnecessarily buffered records. You can implement a valid Flink job without them, but the memory consumption can be hard to control if one stream significantly overtakes the other (in terms of event time; for processing time it's a non-issue).
The answer depends on the size of the streamB state that is needed to enrich streamA.
If you broadcast streamB, you put the state for all userIds from streamB on each of the task managers. Each task will only see a subset of those userIds in streamA, so the streamB data for the other userIds will never be used and is wasted. If the streamB state is small enough that it does not really impact your job and does not take significant memory away from state management, you can keep the whole streamB state on every task. This is your #1.
If your streamB state is really huge and would consume considerable memory on the task managers, you should consider approach #2. Key both streams by the same userId to make sure that elements with the same userId reach the same task, and then use managed state to maintain the per-key streamB state and enrich streamA elements from it.
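Approach #2 boils down to keeping, for each userId, the latest streamB record and looking it up whenever a streamA record for that userId arrives. A minimal plain-Java sketch of that per-key logic follows. It is not Flink code: the HashMap stands in for keyed managed state, and the class, method, and field names are hypothetical. Records that arrive before any streamB update need some policy. Here they are simply enriched with a placeholder, which corresponds to the loose, processing-time style of join.

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java stand-in for the per-key enrichment logic; the map plays the role of keyed state.
final class EnrichmentSketch {

    // Latest streamB value seen so far for each userId.
    private final Map<String, String> latestProfileByUser = new HashMap<>();

    // Called for every element of the slowly changing streamB.
    void onProfileUpdate(String userId, String profile) {
        latestProfileByUser.put(userId, profile);
    }

    // Called for every element of the fast streamA; returns the enriched record.
    String onTripEvent(String userId, long startTripTimestamp) {
        String profile = latestProfileByUser.getOrDefault(userId, "UNKNOWN");
        return userId + "@" + startTripTimestamp + " -> " + profile;
    }
}
```

In a real job this logic would live in a function applied to the two connected streams after both have been keyed by userId, so each parallel instance only holds the streamB entries for the userIds routed to it.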

Flink Map function with multi-parallelism, and how to make sure the order of the final sink

The simplified pipeline code is as follows:
source = env.addSource(kafkaConsumer)
.map(func).setParallelism(2).sink()
How can I make sure the output stays in order?
To begin, let's assume that everything else in your example has a parallelism of one, and only the map function is going to run in parallel. (Though to actually achieve that, it would have to be configured somewhere; the default parallelism is higher than one.)
Let's also assume that your Kafka consumer is reading from a single topic with one partition, and you are asking how to implement a parallel transformation that preserves the ordering that was present in the input.
With those assumptions, the answer is that there's not a lot you can do. There's a race between the two instances of the map operator, and the non-parallel sink is going to interleave those two incoming streams in an arbitrary way.
If the stream records are marked in some way, say with ascending timestamps or ids, then you could hypothetically introduce some buffering and re-establish the original ordering, either in a custom sink or in a non-parallel RichCoMap function between your map and sink operators.
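The buffering idea can be sketched in plain Java, independent of Flink. The sketch assumes each record carries a gap-free sequence id starting at a known value, which goes beyond what is stated above (plain ascending timestamps would need a timeout or watermark to decide when a record may be released). ReorderBuffer and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Restores the original order of records that arrive interleaved from parallel instances.
final class ReorderBuffer<T> {

    // Records that arrived early, sorted by sequence id.
    private final TreeMap<Long, T> pending = new TreeMap<>();
    private long nextExpected;

    ReorderBuffer(long firstSequenceId) {
        this.nextExpected = firstSequenceId;
    }

    // Accepts one record and returns every record that can now be released in order.
    List<T> accept(long sequenceId, T record) {
        pending.put(sequenceId, record);
        List<T> ready = new ArrayList<>();
        // Release the contiguous run that starts at the next expected id.
        while (!pending.isEmpty() && pending.firstKey() == nextExpected) {
            ready.add(pending.pollFirstEntry().getValue());
            nextExpected++;
        }
        return ready;
    }
}
```

In a job, such a buffer would have to sit in a non-parallel operator or custom sink, and its contents would need to be kept in Flink state to survive a failure. The buffer also grows without bound if a sequence id never arrives.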
If on the other hand, your source is partitioned or keyed in some way, and you only need to maintain or establish an ordering on a per-key basis, then there are better answers.

Integration of non-parallelizable task with high memory demands in Flink pipeline

I am using Flink in a Yarn Cluster to process data using various sources and sinks. At some point in the topology, there is an operation that cannot be parallelized and furthermore needs access to a lot of memory. In fact, the API I am using for this step needs its input in array-form. Right now, I have implemented it something like
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
DataSet<Pojo> input = ...
List<Pojo> inputList = input.collect();
Pojo[] inputArray = inputList.toArray(new Pojo[0]);
Pojo[] resultArray = costlyOperation(inputArray);
List<Pojo> resultList = Arrays.asList(resultArray);
DataSet<Pojo> result = env.fromCollection(resultList);
result.otherStuff()
This solution seems rather unnatural. Is there a straight-forward way to incorporate this task into my Flink pipeline?
I have read in another thread that the collect() function should not be used for large datasets. I believe the fact that collecting the dataset into a list and then an array does not happen in parallel is not my biggest problem right now, but would you still prefer to write what I called input above into a file and build an array from that?
I have also seen the options to configure managed memory in flink. In principle, it might be possible to tune this in a way so that enough heap is left for the expensive operation. On the other hand, I am afraid that the performance of all the other operators in the topology might suffer. What is your opinion on this?
You could replace the "collect->array->costlyOperation->array->fromCollection" step with a group reduce on a surrogate key that has the same value for all tuples, so that you get only a single partition. This would be more Flink-like.
In your costly operation itself, which would be implemented as a GroupReduceFunction, you get an iterator over the data. If you do not need to access all data "at once", you also save heap space, as you do not need to keep all data in memory within the reduce (but this of course depends on what your costly operation computes).
As an alternative, you could also call reduce() without a previous groupBy(). However, you do not get an iterator or an output collector and can only compute partial aggregates. (see "Reduce" in https://ci.apache.org/projects/flink/flink-docs-release-0.10/apis/programming_guide.html#transformations)
Using Flink-style operations has the advantage that the data is kept in the cluster. If you call collect(), the result is transferred to the client, the costly operation is executed in the client, and the result is transferred back to the cluster. Furthermore, if the input is large, Flink will automatically spill the intermediate result to disk for you.
