Flink: An abstraction that implements CheckpointListener w/o element processing - apache-flink

I'm new to Flink and am looking for a way to run some code once a checkpoint completes (presumably by implementing CheckpointListener) without processing events (void processElement(StreamRecord<IN> element)). Currently I have an operator, MyOperator, that runs my code inside its notifyCheckpointComplete function. However, I see a lot of traffic sent to that operator. The operator chain looks as follows:
input = KafkaStream
input -> IcebergSink
input -> MyOperator
I can't find a way to register a CheckpointListener in the Flink execution environment. Is it possible?
Also, I have the following ideas:
map the input stream elements to Void/Unit before sending them to MyOperator
use a Side Output without emitting data to the side output. I'm wondering whether notifyCheckpointComplete will still be called.
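A minimal sketch of the first idea in Flink's Scala DataStream API, assuming MyOperator is a OneInputStreamOperator; MyEvent and icebergSink are hypothetical placeholders for the pieces the question doesn't show:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
val input: DataStream[MyEvent] = ??? // the Kafka stream from the question

input.addSink(icebergSink) // main path: input -> IcebergSink

// Side path: collapse every record to Unit so MyOperator stays in the job
// graph (and keeps receiving checkpoint barriers) while the payload it
// receives is essentially empty.
input
  .map(_ => ())
  .transform("MyOperator", new MyOperator)

env.execute()

This only shrinks the records, not their count; a filter(_ => false) in front of MyOperator would drop the records entirely, and since checkpoint barriers flow through the graph independently of data, notifyCheckpointComplete should still fire, though that is worth verifying on your Flink version.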

Related

SwiftNIO: How "expensive" is transformation in each ChannelHandler?

Checking this tutorial: https://rderik.com/blog/understanding-swiftnio-by-building-a-text-modifying-server/
One thing I do not understand: the main point of using NIO directly is to increase the speed of a backend service.
But when we have this pipeline:
Client: hello
|
v
Server
|
v
BackPressureHandler (Receives a ByteBuffer - passes a ByteBuffer)
|
v
UpcaseHandler(Receives a ByteBuffer - passes a [CChar])
|
v
VowelsHandler(Receives a [CChar] - passes a ByteBuffer)
|
v
ColourHandler(Receives a ByteBuffer - passes a ByteBuffer)
|
v
Client: receives
H*LL* (In green colour)
the parameter gets transformed many times. In UpcaseHandler: NIOAny -> ByteBuffer -> String -> [CChar] -> NIOAny,
then in VowelsHandler again: NIOAny -> ByteBuffer -> String -> [CChar] -> NIOAny.
What is the advantage of having so many independent handlers?
If the server receives a 'flat' JSON payload, is it worth processing it with JSONEncoder when speed is critical and every microsecond counts? try JSONEncoder().encode(d2)
Or is it worthwhile, and common, to implement your own JSON processor, i.e. an event-driven JSON parser?
I think it's useful to use things like an UppercasingHandler when trying to learn and understand SwiftNIO. In the real world, however, this is too fine-grained for a ChannelHandler.
The use case for a ChannelHandler is usually one of the following (not exhaustive):
a whole network protocol (for example, NIOSSLClientHandler, which adds TLS for a client connection)
added value that may be useful with multiple protocols (such as the BackpressureHandler)
added value that may be useful for debugging (for example, NIOWritePCAPHandler)
So whilst the overhead of a ChannelHandler isn't huge, it is definitely not completely free, and I would recommend not overusing them. Abstraction is useful, but even in a SwiftNIO-based application or library we shouldn't try to express everything as ChannelHandlers in a ChannelPipeline.
The value-add of having something in a ChannelHandler is mostly around reusability (the HTTP/1, HTTP/2, ... implementations don't need to know about TLS), testability (we can test a network protocol without actually needing a network connection) and debuggability (if something goes wrong, we can easily log the inputs/outputs of a ChannelHandler).
The NIOWritePCAPHandler is a great example: in most cases we don't need it, but if something goes wrong we can add it between a TLS handler and, say, the HTTP/2 handler(s), and we get a plaintext .pcap file without touching any code apart from the code that inserts it into the ChannelPipeline; that insertion can even be done dynamically after the TCP connection is already established.
There's absolutely nothing wrong with a very short ChannelPipeline. Many great examples have just a few handlers, for example:
TLS handler <--> network protocol handler(s) [HTTP/1.1 for example] <--> application handler (business logic)

Akka HTTP streaming API with cycles never completes

I'm building an application where I take a request from a user, call a REST API to get back some data, then based on that response, make another HTTP call and so on. Basically, I'm processing a tree of data where each node in the tree requires me to recursively call this API, like this:
    A
   / \
  B   C
 / \   \
D   E   F
I'm using Akka HTTP with Akka Streams to build the application, so I'm using its streaming API, like this:
val httpFlow = Http().cachedHostConnectionPool[Data](host = "localhost")
val flow = GraphDSL.create() { implicit builder =>
  import GraphDSL.Implicits._
  val in    = builder.add(takeUserData)
  val merge = builder.add(Merge[Data](2))
  val bcast = builder.add(Broadcast[ResponseData](2))
  in ~> merge ~> createRequest ~> httpFlow ~> processResponse ~> bcast
        merge <~ extractSubtree <~ bcast
  FlowShape(in.in, bcast.out(1))
}
I understand that the best way to handle recursion in an Akka Streams application is to handle recursion outside of the stream, but since I'm recursively calling the HTTP flow to get each subtree of data, I wanted to make sure that the flow was properly backpressured in case the API becomes overloaded.
The problem is that this stream never completes. If I hook it up to a simple source like this:
val source = Source.single(data)
val sink = Sink.seq[ResponseData]
source.via(flow).runWith(sink)
It prints out that it's processing all the data in the tree, then stops printing anything and just idles forever.
I read the documentation about cycles and the suggestion was to put a MergePreferred in there, but that didn't seem to help. This question helped, but I don't understand why MergePreferred wouldn't stop the deadlock, since unlike their example, the elements are removed from the stream at each level of the tree.
Why doesn't MergePreferred avoid the deadlock, and is there another way of doing this?
MergePreferred (in the absence of eagerComplete being true) will complete when all of its inputs have completed, which is generally true of stages in Akka Streams (completion flows down from the start).
So that implies that the merge can't propagate completion until both the input and extractSubtree signal completion. extractSubtree won't signal completion (most likely, without knowing the stages in that flow) until bcast signals completion, which (again, most likely) won't happen until processResponse signals completion, which won't happen until httpFlow signals completion, which won't happen until createRequest signals completion, which won't happen until merge signals completion. Because detecting such a cycle is impossible in general (consider that there are stages for which completion is entirely dynamic), Akka Streams effectively takes the position that if you want to create a cycle like this, it's on you to determine how to break the cycle.
As you've noticed, setting eagerComplete to true changes this behavior, but since the merge will then complete as soon as any input completes (which in this case will always be the outside input, thanks to the cycle), the merge completes and cancels demand on extractSubtree (which by itself could cause the downstream to cancel, depending on whether the Broadcast has eagerCancel set), likely resulting in at least some elements emitted by extractSubtree never getting processed.
If you're absolutely sure that the input completing means the cycle will eventually dry up, you can use eagerComplete = false, provided you have some means to complete extractSubtree once the cycle is dry and the input has completed. A broad outline for going about this (without knowing what, specifically, is in extractSubtree; a code sketch follows the list):
map everything coming into extractSubtree from bcast into a Some of the input
prematerialize a Source.actorRef to which you can send a None, save the ActorRef (which will be the materialized value of this source)
merge the input with that prematerialized source
when extracting the subtree, use a statefulMapConcat stage to track (a) whether a None has been seen and (b) how many subtrees are pending (initial value 1; for each node add the number of its first-generation children minus 1, so a node with no children subtracts 1); if a None has been seen and no subtrees are pending, emit a List(None), otherwise emit a List of each subtree wrapped in a Some
have a takeWhile(_.isDefined), which will complete once it sees a None
if you have more complex things (e.g. side effects) in extractSubtrees, you'll have to figure out where to put them
before merging the outside input, pass it through a watchTermination stage, and in the future's callback (on success) send a None to the ActorRef you got when prematerializing the Source.actorRef. Thus, when the input completes, watchTermination will fire successfully and effectively tell extractSubtree to watch for when it has finished the in-flight tree.
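A rough sketch of that outline, assuming Akka 2.6 APIs and not wired into the full graph; Data, ResponseData, userInput, and the childData/childCount helpers are hypothetical stand-ins for whatever extractSubtree actually does:

import akka.NotUsed
import akka.actor.{ActorRef, ActorSystem, Status}
import akka.stream.{CompletionStrategy, OverflowStrategy}
import akka.stream.scaladsl.{Flow, Source}

implicit val system: ActorSystem = ActorSystem("sketch")
import system.dispatcher

trait Data          // the question's element types, left abstract here
trait ResponseData

// Hypothetical helpers standing in for the real extractSubtree internals.
def childData(resp: ResponseData): List[Data] = ???
def childCount(resp: ResponseData): Int = childData(resp).size

// Step 2: prematerialized control source; sending it None injects the marker.
val (control: ActorRef, controlSource: Source[Option[ResponseData], NotUsed]) =
  Source.actorRef[Option[ResponseData]](
    completionMatcher = { case Status.Success(_) => CompletionStrategy.draining },
    failureMatcher = PartialFunction.empty,
    bufferSize = 8,
    overflowStrategy = OverflowStrategy.dropHead
  ).preMaterialize()

// Steps 1 and 3-5: wrap bcast's output in Some, merge in the control marker,
// do the bookkeeping, and complete once the tree has drained.
val extractSubtree: Flow[ResponseData, Data, NotUsed] =
  Flow[ResponseData]
    .map(Option(_))                      // step 1: everything from bcast becomes a Some
    .merge(controlSource)                // step 3: merge with the control source
    .statefulMapConcat { () =>           // step 4: count pending subtrees
      var pending = 1                    // the root of the tree is pending initially
      var markerSeen = false
      (elem: Option[ResponseData]) =>
        elem match {
          case None =>                   // marker: the outside input has completed
            markerSeen = true
            if (pending == 0) List(None) else Nil
          case Some(resp) =>
            pending += childCount(resp) - 1 // this node is done, its children are pending
            val out = childData(resp).map(Some(_))
            if (markerSeen && pending == 0) out :+ None else out
        }
    }
    .takeWhile(_.isDefined)              // step 5: a None completes the flow
    .collect { case Some(data) => data }

// Final bullet: when the outside input completes, fire the marker.
val userInput: Source[Data, NotUsed] = ???
val outsideInput: Source[Data, NotUsed] =
  userInput.watchTermination() { (mat, done) =>
    done.foreach(_ => control ! None)    // on success, inject the completion marker
    mat
  }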

Flink async IO operator chaining with another sync operator

I have a use case where I am using async IO operators with normal mappers in Flink 1.8. The async operator therefore has to be at the head of its operator chain, so my operator flow looks like this:
Source -> Mapper1 -> AsyncOperator -> Mapper2 -> Sink
Because of the requirement that the async operator be at the head of a chain, there are two operator chains and hence two tasks: 1. Source + Mapper1, 2. AsyncOperator + Mapper2 + Sink.
I have a question regarding the second chain. I think the second chain should be comprised in a single task if the operators are chained correctly. What I am not sure about is whether there is a wait time between the async operator and Mapper2 on the task threads, or whether Mapper2 gets bound to the response handler of the async operator internally. Ideally it would be the latter, but I can't find any documentation on this, hence I'm wondering.
Reference:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/asyncio.html
The AsyncWaitOperator spins up an Emitter in a separate thread, so as soon as results are available they get sent to the operator's collector. Note, though, that if you specify ordered results there can be a "wait time" due to completion order not matching the order of the incoming elements.
BTW, the restriction that the AsyncWaitOperator must be at the head of a chain was removed in Flink 1.11. See FLINK-16219. The only remaining limitation was that it could not follow a SourceFunction. The AsyncWaitOperator can follow the new sources introduced in Flink 1.12.
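A small sketch of the two modes using Flink's Scala AsyncDataStream API; the async function body is a placeholder for a real async client:

import java.util.concurrent.TimeUnit
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.async.{AsyncFunction, ResultFuture}

class MyAsyncFunction extends AsyncFunction[String, String] {
  override def asyncInvoke(input: String, resultFuture: ResultFuture[String]): Unit = {
    // fire off the async call; complete the future when the client responds
    resultFuture.complete(Iterable(input.toUpperCase)) // placeholder for a real async client
  }
}

val env = StreamExecutionEnvironment.getExecutionEnvironment
val mapped: DataStream[String] = env.fromElements("a", "b").map(_.trim) // Mapper1

// unorderedWait emits each result as soon as it completes (no reordering wait);
// orderedWait buffers completed results until all earlier elements have completed.
val async: DataStream[String] =
  AsyncDataStream.unorderedWait(mapped, new MyAsyncFunction, 5, TimeUnit.SECONDS, 100)

async.map(identity).print() // Mapper2 + Sink
env.execute()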

Akka Streams buffer on SubFlows based on parent Flow

I am using akka-streams and I hit an exception because I maxed out the HTTP connection pool in akka-http.
There is a Source of list-elements, which get split and thus transformed to SubFlows.
The SubFlows issue http requests. Although I put a buffer on the SubFlow, it seems the buffer takes effect per SubFlow.
Is there a way to have a buffer based on the Source that takes effect on the SubFlows?
My mistake was that I was merging the substreams without taking the parallelism into consideration, using
def mergeSubstreams(): Flow[In, Out, Mat]
From the documentation
This is identical in effect to mergeSubstreamsWithParallelism(Integer.MAX_VALUE).
Thus my workaround was to use
def mergeSubstreamsWithParallelism(parallelism: Int): Flow[In, Out, Mat]
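For example, a sketch of the bounded merge; callHttp and the grouping key are stand-ins for the real request logic:

import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future

implicit val system: ActorSystem = ActorSystem("sketch")
import system.dispatcher

def callHttp(i: Int): Future[Int] = Future(i) // stand-in for the real HTTP call

val maxOpenSubstreams = 4 // keep this at or below the pool's max-open-requests

Source(1 to 100)
  .groupBy(maxSubstreams = 1024, _ % 10)             // split into per-key substreams
  .mapAsync(1)(callHttp)                             // one request at a time per substream
  .mergeSubstreamsWithParallelism(maxOpenSubstreams) // at most 4 substreams run at once
  .runWith(Sink.ignore)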

Flink trigger on a custom window

I'm trying to evaluate Apache Flink for the use case we're currently running in production using custom code.
So let's say there's a stream of events, each containing a specific attribute X, which is a continuously increasing integer: a bunch of contiguous events have this attribute set to N, then the next batch has it set to N+1, and so on.
I want to break the stream into windows of events with the same value of X and then do some computations on each separately.
So I define a GlobalWindow and a custom Trigger where, in the onElement method, I check the attribute of each element against the saved value of the current X (from a state variable); if they differ, I conclude that we've accumulated all the events with X=CURRENT, so it's time to run the computation and increase the X value in the state.
The problem with this approach is that the element from the next logical batch (with X=CURRENT+1) has already been consumed, but it's not part of the previous batch.
Is there a way to somehow put it back into the stream so that it is properly accounted for in the next batch?
Or maybe my approach is entirely wrong and there's an easier way to achieve what I need?
Thank you.
I think you are on the right track.
A Trigger specifies when a window can be processed and when results for a window can be emitted.
The WindowAssigner is the part that decides to which window an element will be assigned. So I would say you also need to provide a custom implementation of WindowAssigner that assigns the same window to all elements with an equal value of X.
A more idiomatic way to do this with Flink would be to use stream.keyBy(X).window(...). The keyBy(X) takes care of grouping elements by their particular value for X. You then apply any sort of window you like. In your case a SessionWindow may be a good choice. It will fire for each key after that key hasn't been seen for some configurable period of time.
This approach will be much more robust with regard to unordered data which you must always assume in a stream processing system.
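Sketched in Flink's Scala DataStream API; the event type, the session gap, and the per-batch computation are illustrative placeholders:

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows
import org.apache.flink.streaming.api.windowing.time.Time

case class Event(x: Long, payload: String)

val env = StreamExecutionEnvironment.getExecutionEnvironment
val events: DataStream[Event] =
  env.fromElements(Event(1, "a"), Event(1, "b"), Event(2, "c"))

events
  .keyBy(_.x)                                                     // group by the batch counter X
  .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10))) // fires once a key goes quiet
  .reduce((a, b) => Event(a.x, a.payload + b.payload))            // placeholder per-batch computation
  .print()

env.execute()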
