apache flink - the correct way of error handling - apache-flink

I wonder if there is an option of built in error handling in Flink.
there may be 2 cases:
the current message from Kafka (in my case) is invalid, continue to next one
uncaught exception - from what I saw it can stop the stream aggregation completely.
ho can I handle these 2 cases? (java code)

1) This is done idiomatically with a flatMap: if your message is valid, you go on with a list containing your valid element (maybe already processed in the same step). If it's not valid, you simply return an empty list so that no elements are produced by that step. I could provide Scala code but I'm not familiar with Java APIs so I don't want to put you off track. Just check the flatMap call.
2) This depends on the type of exception: if it's provoked by your own code, just catch it and handle it inside the operator, or simply log it and move on. Without any further information about a specific case, this is the best I know of, but again, coming from Scala I haven't experienced runtime exceptions.

Related

Handling poison messages in Apache Flink

I am trying to figure out the best practices to deal with poison messages / unhandled exceptions with Apache Flink. We have a Job doing real time event processing of location data from IoT devices. There are two potential scenarios where this can arise:
Data is bad in some way - e.g. invalid value
Data triggers a bug due to some edge case we have not anticipated.
Currently, all my data processing stops because of just one message.
I've seen two suggestions:
Catch the exceptions - this requires me wrapping every piece of logic with something to catch every runtime exception
Use side outputs as a kind of DLQ - from what I can tell this seems to be a variation on #1 where I have to catch all the exceptions and send them to the side output.
Is there really no way to do this other than wrap every piece of logic with exception handling? Is there no generic way to catch exceptions and not have processing continue?
I think the idea is not to catch all kinds of exceptions and send them elsewhere, but rather to have well-tested and functioning code and use dead letters only for invalid inputs.
So a typical pipeline would be
source => validate => ... => sink
\=> dead letter queue
As soon as your record passes your validate operator, you want all errors to bubble up, as any error in these operators may result in corrupted aggregates and data that - once written - cannot be reverted easily.
The validate step would work with any of the two approaches that you outlined. Typically, side-outputs have better semantics, but you may end up with more code.
Now you may have a service with high SLAs and actually want it to produce output even if it is corrupted just to produce data. Or you have simple transformation pipeline, where you'd miss some events but keep the majority (and downstream can deal with incomplete data). Then you are right that you need to wrap the code of all operators with try-catch. However, you'd typically still would only do it for the fragile operators and not for all of them. Trivial operators should be tested and then trusted to work. Further, you'd usually only catch specific kinds of exceptions to limit the scope to the kind of expected exceptions that can happen.
You might wonder why Flink doesn't have it incorporated as a default pattern. There are two reasons as far as I can see:
If Flink silently ignores any kind of exception and sends an extra message to a secondary sink, how can Flink ensure that the throwing operator is in a sane state afterwards? How can it avoid any kind of leaks that may happen because cleanup code is not executed?
It's more common in Java to let the developers explicitly reason about exceptions and exception handling. It's also not straight-forward to see what the requirements are: Do you want to have the input only? Do you also want to store the exception? What about the operator state that may have influenced the outcome? Should Flink still fail when too many errors have been received in a given time window? It quickly becomes a huge feature for something that should not happen at all in an ideal world where high quality data is ingested and properly processed.
So while it looks easy for your case because you exactly know which kinds of information you want to store, it's not easy to have a solution for all purposes, especially since the extra code that a user has to write is tiny compared to the generic solution.
What you could do is to extract most of the complicated logic things into a single ProcessFunction and use side-outputs as you have outlined. Since it's a central piece, you'd only need to write the side-output function once. If it's done multiple times, you could extract a helper function where you pass your actual code as a RunnableWithException lambda which hides all the side-output logic. Make sure you use plenty of finally blocks to ensure a sane state.
I'd also add quite a few IT cases and use mutation testing to harden your pipeline quicker. If you keep your test data inline, the mutants may also exactly simulate your unexpected data issues, such that your validate operator gets more complete.

AsyncIO Exceptions in Apache Flink

In Apache Flink, I'm using the RichAsyncFunction for data enrichment. In the case of errors/exceptions, I want to funnel those error records into an error stream. I can see that other functions have a "side output" for this sort of scenario, but how is it handled in RichAsyncFunction? I also see use of ResultFuture<>.completeExceptionally, but what does this do or mean when it occurs? Does the stream stop, is it just logged, what is the state with regards to the output element of the stream? All the docs seem to just point out how to handle the happy path or to call completeExceptionally with no explanation of what happens next. What is the proper way to handle/capture errors in RichAsyncFunction?
Thanks!

Flink: what's the best way to handle exceptions inside Flink jobs

I have a flink job that takes in Kafaka topics and goes through a bunch of operators. I'm wondering what's the best way to deal with exceptions that happen in the middle.
My goal is to have a centralized place to handle those exceptions that may be thrown from different operators and here is my current solution:
Use ProcessFunction and output sideOutput to context in the catch block, assuming there is an exception, and have a separate sink function for the sideOutput at the end where it calls an external service to update the status of another related job
However, my question is that by doing so it seems I still need to call collector.collect() and pass in a null value in order to proceed to following operators and hit last stage where sideOutput will flow into the separate sink function. Is this the right way to do it?
Also I'm not sure what actually happens if I don't call collector.collect() inside a operator, would it hang there and cause memory leak?
It's fine to not call collector.collect(). And you don't need to call collect() with a null value when you use the side output to capture the exception - each operator can have its own side output. Finally, if you have multiple such operators with a side output for exceptions, you can union() the side outputs together before sending that stream to a sink.
If for some reason the downstream operator(s) need to know that there was an exception, then one approach is to output an Either<good result, Exception>, but then each downstream operator would of course need to have code to check what it's receiving.

Camel idempotentConsumer always use PUT instead of GET

I am using camel idempotent. Can someone please explain the logic behind idempotentConsumer xml tag.
I received file for first time. All good the idempotentconsumer block executed. on infinispan server I see a log PUT.
I dropped a duplicate file. Now idempotentconsumer identifies duplicated but on infinispan server I see a log with PUT instead of GET. I am wondering is this issue with server side or camel-client?
<idempotentConsumer messageIdRepositoryRef="infinispanRepo" >
<header>CamelFileAbsolutePath</header>
</idempotentConsumer>
No this is working as designed. The Idempotent Consumer EIP will attempt to put the key to the cache with a fixed value of true - that would be an atomic operation on Infinispan. The result of that put operation is then used to know if there was a duplicate or not.
If you do two operations with a GET and then PUT its no longer an atomic operation and you can end up with problems.
See the code at:
https://github.com/apache/camel/blob/master/components/camel-infinispan/src/main/java/org/apache/camel/component/infinispan/processor/idempotent/InfinispanIdempotentRepository.java#L68

MPQueue - what is it and how do I use it?

I encountered a bug that has me beat. Fortunately, I found a work around here (not necessary reading to answer this q) -
http://lists.apple.com/archives/quartz-dev/2009/Oct/msg00088.html
The problem is, I don't understand all of it. I am ok with the event taps etc, but I am supposed to 'set up a thread-safe queue) using MPQueue, add events to it pull them back off later.
Can anyone tell me what an MPQueue is, and how I create one - also how to add items and read/remove items? Google hasn't helped at all.
It's one of the Multiprocessing Services APIs.
… [A] message queue… can be used to notify (that is, send) and wait for (that is, receive) messages consisting of three pointer-sized values in a preemptively safe manner.

Resources