Handling poison messages in Apache Flink

I am trying to figure out the best practices to deal with poison messages / unhandled exceptions with Apache Flink. We have a Job doing real time event processing of location data from IoT devices. There are two potential scenarios where this can arise:
Data is bad in some way - e.g. invalid value
Data triggers a bug due to some edge case we have not anticipated.
Currently, all my data processing stops because of just one message.
I've seen two suggestions:
Catch the exceptions - this requires wrapping every piece of logic with something that catches every runtime exception
Use side outputs as a kind of DLQ - from what I can tell this seems to be a variation on #1 where I have to catch all the exceptions and send them to the side output.
Is there really no way to do this other than wrap every piece of logic with exception handling? Is there no generic way to catch exceptions and not have processing continue?

I think the idea is not to catch all kinds of exceptions and send them elsewhere, but rather to have well-tested and functioning code and use dead letters only for invalid inputs.
So a typical pipeline would be
source => validate => ... => sink
                  \=> dead letter queue
As soon as a record passes your validate operator, you want all errors to bubble up, since any error in the downstream operators may result in corrupted aggregates and data that - once written - cannot easily be reverted.
The validate step would work with either of the two approaches that you outlined. Typically, side outputs have better semantics, but you may end up with more code.
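The routing performed by the validate step can be sketched in plain Python (no Flink API here; the record shape and the validation rules are made-up placeholders for IoT location data):

```python
# Sketch of a validate step that splits records into a main output and a
# dead letter queue. Record shape and validation rules are hypothetical.
def validate(record):
    """Return (is_valid, reason); reason is None for valid records."""
    if not isinstance(record.get("lat"), (int, float)):
        return False, "missing or non-numeric latitude"
    if not -90 <= record["lat"] <= 90:
        return False, "latitude out of range"
    return True, None

def split(records):
    """Route each record to the main output or the dead letter queue."""
    good, dead_letters = [], []
    for record in records:
        ok, reason = validate(record)
        if ok:
            good.append(record)
        else:
            # Keep the raw record plus the reason so it can be replayed later.
            dead_letters.append({"record": record, "reason": reason})
    return good, dead_letters
```

In the actual Flink job the dead-letter branch would be a side output (or a second stream) written to its own sink, while the main branch continues into the aggregation operators.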
Now you may have a service with strict SLAs that must keep producing output even if some of it is corrupted. Or you have a simple transformation pipeline, where you'd rather lose some events and keep the majority (and downstream can deal with incomplete data). Then you are right that you need to wrap operator code in try-catch. However, you'd typically still only do it for the fragile operators and not for all of them: trivial operators should be tested and then trusted to work. Further, you'd usually catch only specific kinds of exceptions, to limit the scope to the failures you actually expect.
You might wonder why Flink doesn't incorporate this as a default pattern. There are two reasons, as far as I can see:
If Flink silently ignores any kind of exception and sends an extra message to a secondary sink, how can Flink ensure that the throwing operator is in a sane state afterwards? How can it avoid any kind of leaks that may happen because cleanup code is not executed?
It's more common in Java to let developers explicitly reason about exceptions and exception handling. It's also not straightforward to see what the requirements are: Do you want to store only the input? Do you also want to store the exception? What about the operator state that may have influenced the outcome? Should Flink still fail when too many errors have been received in a given time window? It quickly becomes a huge feature for something that should not happen at all in an ideal world where high-quality data is ingested and properly processed.
So while it looks easy for your case because you exactly know which kinds of information you want to store, it's not easy to have a solution for all purposes, especially since the extra code that a user has to write is tiny compared to the generic solution.
What you could do is extract most of the complicated logic into a single ProcessFunction and use side outputs as you have outlined. Since it's a central piece, you'd only need to write the side-output handling once. If it's needed in multiple places, you could extract a helper function that takes your actual code as a RunnableWithException lambda and hides all the side-output logic. Make sure you use plenty of finally blocks to ensure a sane state.
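The helper described above might look like this in spirit (a plain-Python sketch; in the Java job you would pass a RunnableWithException and emit to the side-output tag via ctx.output instead of appending to a list; all names here are illustrative):

```python
# Sketch of a reusable wrapper that runs user logic and diverts failures
# to a dead-letter collector instead of failing the whole pipeline.
def process_safely(record, logic, dead_letters, expected=(ValueError, KeyError)):
    """Run `logic` on `record`; on an *expected* exception, record it and
    return None. Unexpected exceptions still bubble up and fail the job."""
    try:
        return logic(record)
    except expected as exc:
        dead_letters.append({"record": record, "error": repr(exc)})
        return None

dead_letters = []
results = [process_safely(r, lambda r: 100 / r["value"], dead_letters)
           for r in [{"value": 4}, {}, {"value": 5}]]
```

Note that only the expected exception types are caught, matching the advice above: anything unexpected still crashes the job rather than silently corrupting state.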
I'd also add quite a few integration test cases and use mutation testing to harden your pipeline quicker. If you keep your test data inline, the mutants may also simulate your unexpected data issues exactly, so that your validate operator becomes more complete.

Related

Multithreading inside Flink's Map/Process function

I have a use case where I need to apply multiple functions to every incoming message, each producing 0 or more results.
Having a loop won't scale for me, and ideally I would like to be able to emit results as soon as they are ready instead of waiting for all the functions to be applied.
I thought about using AsyncIO for this, maintaining a ThreadPool, but if I am not mistaken I can only emit one record using that API. That is not a deal-breaker, but I'd like to know if there are other options, such as using a ThreadPool in a Map/Process function so I can send the results as they are ready.
Would this be an anti-pattern, or cause any problems in regards to checkpointing, at-least-once guarantees?
Depending on the number of different functions involved, one solution would be to fan each incoming message out to n operators, each applying one of the functions.
I fear you'll get into trouble if you try this with a multi-threaded map/process function.
How about this instead:
You could have something like a RichCoFlatMap (or KeyedCoProcessFunction, or BroadcastProcessFunction) that is aware of all of the currently active functions, and for each incoming event, emits n copies of it, each being enriched with info about a specific function to be performed. Following that can be an async i/o operator that has a ThreadPool, and it takes care of executing the functions and emitting results if and when they become available.
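The two-stage idea can be sketched in plain Python (the fan-out stage tags copies of each event, and a thread pool stands in for the async i/o operator; function names and the registry are invented):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def fan_out(event, functions):
    # One tagged copy of the event per active function; downstream
    # decides how and when to actually execute them.
    return [(name, fn, event) for name, fn in functions.items()]

def run_async(copies, max_workers=4):
    """Execute each tagged copy in a thread pool and yield results as
    they complete, not in submission order (mimics the async i/o stage)."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fn, event): name for name, fn, event in copies}
        for fut in as_completed(futures):
            yield futures[fut], fut.result()

# Hypothetical registry of currently active functions.
functions = {"double": lambda e: e * 2, "square": lambda e: e * e}
results = dict(run_async(fan_out(3, functions)))
```

The key property is that results are emitted as each function finishes, which is what the async i/o operator gives you in Flink without putting threads inside a map/process function.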

C design pattern performing a list of actions without blocking?

Embedded C. I have a list of things I want to do, procedurally, mostly READ and WRITE and MODIFY actions, acting on the results of the last statement. They can take up to 2 seconds each, I can’t block.
Each action can have states of COMPLETE and ERROR, where ERROR has sub-states for the reason the error occurred. On complete I’ll want to check or modify some data.
Each list of actions is a big switch; to re-enter, I keep track of which step I’m on, on success I do step++, and I come back in further down the list next time.
Pretty simple, but I’m finding that to not block I’m spending a ton of effort checking states and errors and edges constantly. Over and over.
I would say 80% of my code is just checks and moving the system along. There has to be a better way!
Are there any design patterns for async do thing and come back later for results in a way that efficiently handles some of the exception/edge/handling?
Edit: I know how to use callbacks but don’t really see that as “a solution”, as I just need to get back to a different part of the same list for the next thing to do. Maybe it would be beneficial to know how async and await work under the hood in other languages?
Edit2: I do have an RTOS for other projects but this specific question, assume no threads/tasks, just bare metal superloop.
Your predicament is a perfect fit for state machines (really, probably UML statecharts). Each request can be handled in its own state machine, which handles events (such as COMPLETE or ERROR indications) in a non-blocking, run-to-completion manner. As the events come in, the request's state machine moves through its different states towards completion.
For embedded systems, I often use the QP event-driven framework for such cases. In fact, when I looked up this link, I noticed the very first paragraph uses the term "non-blocking". The framework provides much more than hierarchical state machines (states within states), which are already very powerful on their own.
The site also has some good information on approaches to your specific problem. I would suggest starting with the site's Key Concepts page.
To give you a taste of the content and its relevance to your predicament:
In spite of the fundamental event-driven nature, most embedded systems are traditionally programmed in a sequential manner, where a program hard-codes the expected sequence of events by waiting for the specific events in various places in the execution path. This explicit waiting for events is implemented either by busy-polling or blocking on a time-delay, etc.

The sequential paradigm works well for sequential problems, where the expected sequence of events can be hard-coded in the sequential code. Trouble is that most real-life systems are not sequential, meaning that the system must handle many equally valid event sequences. The fundamental problem is that while a sequential program is waiting for one kind of event (e.g., timeout event after a time delay) it is not doing anything else and is not responsive to other events (e.g., a button press).

For this and other reasons, experts in concurrent programming have learned to be very careful with various blocking mechanisms of an RTOS, because they often lead to programs that are unresponsive, difficult to reason about, and unsafe. Instead, experts recommend [...] event-driven programming.
You can also do state machines yourself without using an event-driven framework like the QP, but you will end up re-inventing the wheel IMO.
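The run-to-completion pattern described above can be sketched as follows (shown in Python purely for brevity; on bare metal the equivalent would be an enum of states with a switch, or a table of function pointers, dispatched from the superloop; step names and events are illustrative):

```python
# Minimal non-blocking state machine: each call to `dispatch` handles one
# event and returns immediately, so the superloop is never blocked.
class ActionList:
    def __init__(self, steps):
        self.steps = steps          # ordered list of step names
        self.index = 0              # which step we're currently on
        self.failed = None          # name of the step that errored, if any

    def dispatch(self, event):
        """Handle a COMPLETE or ERROR event for the current step."""
        if self.failed or self.done():
            return                  # terminal state: ignore further events
        if event == "COMPLETE":
            self.index += 1         # success: advance to the next step
        elif event == "ERROR":
            self.failed = self.steps[self.index]

    def done(self):
        return self.index >= len(self.steps)

m = ActionList(["READ", "MODIFY", "WRITE"])
m.dispatch("COMPLETE")   # READ finished
m.dispatch("COMPLETE")   # MODIFY finished
```

All the "where am I, what happened" bookkeeping lives in one place, instead of being re-checked ad hoc before every statement, which is the 80% of code the question complains about.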

Flink: what's the best way to handle exceptions inside Flink jobs

I have a Flink job that consumes Kafka topics and goes through a bunch of operators. I'm wondering what's the best way to deal with exceptions that happen in the middle.
My goal is to have a centralized place to handle those exceptions that may be thrown from different operators and here is my current solution:
Use a ProcessFunction and, in the catch block, emit to a side output on the context when an exception occurs; then have a separate sink function for the side output at the end, which calls an external service to update the status of another related job.
However, my question is that by doing so it seems I still need to call collector.collect() and pass in a null value in order to proceed to the following operators and reach the last stage, where the side output will flow into the separate sink function. Is this the right way to do it?
Also I'm not sure what actually happens if I don't call collector.collect() inside an operator - would it hang there and cause a memory leak?
It's fine to not call collector.collect(). And you don't need to call collect() with a null value when you use the side output to capture the exception - each operator can have its own side output. Finally, if you have multiple such operators with a side output for exceptions, you can union() the side outputs together before sending that stream to a sink.
If for some reason the downstream operator(s) need to know that there was an exception, then one approach is to output an Either<good result, Exception>, but then each downstream operator would of course need to have code to check what it's receiving.
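The Either pattern mentioned above can be sketched in plain Python (tuples standing in for Flink's Either type; the tag names are made up):

```python
# Sketch: operators pass along ("ok", value) or ("err", details) so that
# downstream operators can see that an upstream step failed.
def safe_map(fn, stream):
    for item in stream:
        tag, value = item
        if tag == "err":
            yield item              # pass failures through unchanged
            continue
        try:
            yield ("ok", fn(value))
        except Exception as exc:
            yield ("err", repr(exc))

stage1 = safe_map(lambda x: 10 // x, [("ok", 2), ("ok", 0)])
stage2 = list(safe_map(lambda x: x + 1, stage1))
```

This is the cost the answer alludes to: every downstream operator must check the tag before touching the value, which is why side outputs are usually the cleaner choice when downstream doesn't need to know about failures.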

Making db.put() failsafe

I would like to make a db.put() operation in my Google App Engine service as resilient as possible, trying to maximize the likelihood of success even in the event of infrastructure issues or overload. What I have come up with at the moment is to catch every possible exception that could occur and to create a task that retries the commit if the first attempt fails:
from google.appengine.ext import db, deferred
from google.appengine.runtime import DeadlineExceededError

try:
    db.put(new_user_record)
except DeadlineExceededError:
    deferred.defer(db.put, new_user_record)
except:
    deferred.defer(db.put, new_user_record)
Does this code trap all possible error paths? Or are there other ways db.put() can fail that would not be caught by this code?
Edit on March 28, 2013 - To clarify when failure is expected
It seems that the answers so far assume that if db.put() fails then it is because the datastore is down. In my experience of running fairly high-workload applications, this is not necessarily the case. Sometimes you run into workload-specific API bottlenecks; sometimes the slowness of one API causes the request deadline to expire in another. Even though such events have a low frequency, their number can be sizable if traffic is high. These are the situations I am trying to cover.
I wouldn't say this is the best approach - whatever caused the original exception is likely to happen again. What I would do for extra resilience is first load the record to be saved into memcache, and in the event of an exception during the put (any exception), attempt a certain number of retries (for example 3) with a short sleep between each attempt. Depending on your application this could be a synchronous operation, or it could be done asynchronously with deferred tasks using the data in memcache.
Finally, I'd actually query the record in the datastore, even if there wasn't an exception, to confirm the row has actually been written.
Well, I don't think it is a good idea to attempt such a fallback at all. If the datastore is down, it's down and you're out of luck (which shouldn't happen frequently :)
Some thoughts to your code:
There are many more exceptions that could be raised during a put operation (like InternalError, Timeout, CommittedButStillApplying, TransactionFailedError).
Some of them don't mean that the put has failed (i.e., CommittedButStillApplying just means the put operation is delayed). With your approach, you would end up with that entry twice in the datastore after your deferred call succeeds.
Tasks are limited to ~100KB (total size, not payload). If your payload is close to or above that limit, the deferred API will automatically try to serialize your payload to the datastore in order to keep the task itself below that limit. If the datastore is really unavailable, this will fail, too.
So it's probably better to catch datastore errors and inform your user that the request failed.
It's all good to retry; however, use exponential backoff and, most importantly, proper transactions so that a failure doesn't end up as a partial write.
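The retry-with-exponential-backoff advice might look like this (a generic sketch, not tied to the GAE API; `put`, the retry count, and the delays are placeholders):

```python
import random
import time

def put_with_backoff(put, record, retries=3, base_delay=0.1):
    """Try `put` up to `retries` times, doubling the delay each attempt
    and adding jitter. Re-raises the last error if all attempts fail."""
    for attempt in range(retries):
        try:
            return put(record)
        except Exception:
            if attempt == retries - 1:
                raise               # out of attempts: surface the error
            # Exponential backoff with jitter to avoid thundering herds.
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

Wrapping the whole thing in a transaction (so a retried put is idempotent) is what prevents the duplicate-entry problem described in the other answer.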

What are the most frequently used flow controls for handling protocol communication?

I am rewriting code to handle some embedded communications, and right now the protocol handling is implemented in a while loop with a large case/switch statement. This method seems a little unwieldy. What are the most commonly used flow-control methods for implementing communication protocols?
It sounds like the "while + switch/case" is a state machine implementation. I believe that a well-thought-out state machine is often the easiest and most readable way to implement a protocol.
When it comes to state machines, breaking some of the traditional programming rules comes with the territory. Rules like "every function should be less than 25 lines" just don't work. One might even argue that state machines are GOTOs in disguise.
For cases where you key off of a field in a protocol header to direct you to the next stage of processing for that protocol, arrays of function pointers can be used. You use the value from the protocol header to index into the array and call the function for that protocol.
You must handle all possible values in this array, even those which are not valid. Eventually you will get a packet containing the invalid value, either because someone is trying an attack or because a future rev of the protocol adds new values.
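The function-pointer-array approach, including a default handler for invalid values, can be sketched as follows (a dict stands in for the C array here; the message types and handler names are invented):

```python
# Sketch: dispatch on a protocol-header field, with every unknown value
# routed to a default handler instead of crashing.
def handle_data(payload):
    return ("data", payload)

def handle_ack(payload):
    return ("ack", payload)

def handle_unknown(payload):
    # Reached for attacks or future protocol revisions with new values.
    return ("dropped", payload)

HANDLERS = {0x01: handle_data, 0x02: handle_ack}

def dispatch(msg_type, payload):
    # .get() supplies the default for values we don't (yet) understand.
    return HANDLERS.get(msg_type, handle_unknown)(payload)
```

In C the same idea is an array indexed by the header value, with every slot that has no real handler pre-filled with a pointer to the default handler.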
If it is all one protocol being handled, then a switch/case statement may be your best bet. However, you should break all the individual message handlers into their own functions.
If your switch statement contains any code that actually handles the messages, then you would be better off breaking it out.
If it is handling multiple similar protocols you could create a class to handle each one based off the same abstract class and when the connection comes in you could determine which protocol it is and create an instance of the appropriate handler class to decode and handle the communications.
I would think this depends largely on the language you are using, and what sort of data set objects you have available to you.
In Python, for example, you could create a dictionary of all the different handlers, and simply look up the right method/function to call.
Case/switch statements aren't bad things, but if they get huge (as they can with massive amounts of protocol handlers) they can become unwieldy to work with.
