I am currently wondering how to handle application errors in Apache Flink streaming applications. In general, I see two cases:
Transient errors, where you want the input data to be replayed and processing might succeed on a second try. An example would be a dependency on an external service which is temporarily unavailable.
Permanent errors, where repeated processing will still fail; for example invalid input data.
For the first case, it looks like the common solution is to just throw some exception. Or is there a better way, e.g. a special kind of exception for more efficient handling, such as FailedException from Apache Storm Trident (see "Error handling in Storm Trident topologies")?
For permanent errors, I couldn't find any information online. A map() operation, for example, always has to return something, so one cannot just silently drop messages as you would in Trident.
What are the available APIs or best practices? Thanks for your help.
Since this question was asked, there has been some development:
This discussion provides the background on why side outputs should help; key extract:
Side outputs (a.k.a. multi-outputs) is one of the most highly requested features in high-fidelity stream processing use cases. With this feature, Flink can:
Side-output corrupted input data and avoid the job falling into a "fail -> restart -> fail" cycle.
Side-output sparsely received late-arriving events while issuing aggressive watermarks in window computation.
This resulted in the JIRA issue FLINK-4460, which has been resolved in Flink 1.1.3 and above.
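As a rough illustration of the first point above, here is a minimal sketch (Java DataStream API, on a reasonably recent Flink version) that diverts records which will never parse to a side output instead of failing the job; the integer parsing just stands in for real deserialization:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SideOutputSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Dummy input; in a real job this would come from Kafka, a socket, etc.
        DataStream<String> raw = env.fromElements("1", "2", "not-a-number", "4");

        // Tag for records that will never parse, no matter how often they are replayed.
        final OutputTag<String> corrupted = new OutputTag<String>("corrupted") {};

        SingleOutputStreamOperator<Integer> parsed =
            raw.process(new ProcessFunction<String, Integer>() {
                @Override
                public void processElement(String value, Context ctx, Collector<Integer> out) {
                    try {
                        out.collect(Integer.parseInt(value));   // happy path
                    } catch (NumberFormatException e) {
                        ctx.output(corrupted, value);           // permanent error: divert, don't fail the job
                    }
                }
            });

        parsed.print();                                  // main output
        parsed.getSideOutput(corrupted).print();         // dead-letter stream, e.g. for a separate sink

        env.execute("side-output-sketch");
    }
}
```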
I hope this helps. If an even more generic solution would be desirable, please think a bit about your use case and consider creating a JIRA issue for it.
We are considering using the Camel framework inside one of our integration systems.
Processing speed is one of our main requirements, and as part of our architecture we are considering a custom caching mechanism. At a high level, if some data exists in the cache there is no need to fetch it from our storage (say, a database).
However, data consistency is another important requirement. We don't want data in the cache that does not reflect the data in the storage. In other words, we need our cache to be written (updated) only if the storage commit was successful.
So my question is: what would be the best way to execute an action (such as updating the cache) only when a transactional route completes? In a plain Spring application I would use either a TransactionalEventListener with the TransactionPhase.AFTER_COMMIT phase, or a plain non-declarative transaction (begin and end of the transaction managed inside the code).
The Camel framework comes with so many good-to-have features that it would be a pity not to use it just because this is not possible. I am sure our use case is not that unusual, and I hope to hear some advice on how to achieve this.
Thank you in advance,
Julian
Check out onCompletion.
By default, onCompletion is triggered when the Exchange is complete, regardless of whether the Exchange completed with success or with a failure (such as an exception being thrown). You can limit the trigger to success only with onCompleteOnly, or to failure only with onFailureOnly.
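For the cache use case from the question, a route-scoped onCompletion restricted to success might look roughly like this; the endpoints and the cacheUpdater bean are placeholders, and the transaction setup (transaction manager, transacted JMS/JPA components) is assumed to be configured elsewhere:

```java
import org.apache.camel.builder.RouteBuilder;

public class CacheAfterCommitRoute extends RouteBuilder {
    @Override
    public void configure() throws Exception {
        from("jms:queue:orders")
            // Only runs when the exchange finished successfully, so a failed
            // (rolled-back) exchange never touches the cache.
            .onCompletion().onCompleteOnly()
                .to("bean:cacheUpdater?method=update")
            .end()
            .transacted()
            .to("jpa:com.example.Order"); // the storage write that has to commit first
    }
}
```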
I could not find any answer to my question on the web so far, so I thought it would be good to ask here.
I know that Apache Flink is asynchronous by design, but I was wondering whether there is any project or design that aims to build a synchronous pipeline with Flink.
By a synchronous response I mean, for example, having an API endpoint to which I send my data, the processing is done by Flink, and the outcome of the processing is returned (in whatever form) in the body of the response to the API call, e.g. a 200.
I already looked into RabbitMQ RPC but I was not able to successfully implement it.
I'm happy for any direction or suggestion.
Thanks,
Jon
The closest thing that comes to mind is deploying a Flink job with the TcpSource available in Apache Bahir. You could have an HTTP endpoint that receives some data, calls Flink on the specified address, processes it, and creates a response. The problem is that only a TcpSource is available in Bahir, which means you would need to write a large part of the code (the whole sink) yourself.
There are other ways of doing this as well (like assigning an id to each message, waiting for the message with that id to arrive on Kafka, and sending it as the response), but they seem troublesome and error-prone.
Another way would be to make the response asynchronous (I know the question specifically mentions a sync response, but I mention this just for the sake of completeness).
However, I would like to say that this seems like a misuse of Flink to me. Flink was primarily designed to allow real-time computations on multiple nodes, which doesn't seem to be the case here. I would suggest looking into different streaming libraries that are much more lightweight, easier to compose, and offer the functionality you want out of the box. You may want to take a look at Akka Streams, for example.
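To illustrate the "lighter library" point, here is a minimal Akka Streams sketch (Java API, assuming Akka 2.6+): an HTTP handler could run a small per-request pipeline like this and put the completed value into the response body; the doubling step stands in for whatever processing the endpoint actually needs:

```java
import akka.actor.ActorSystem;
import akka.stream.javadsl.Sink;
import akka.stream.javadsl.Source;
import java.util.concurrent.CompletionStage;

public class InlineProcessing {
    public static void main(String[] args) {
        ActorSystem system = ActorSystem.create("demo");

        CompletionStage<Integer> result =
            Source.single(21)              // the payload that arrived at the API endpoint
                  .map(x -> x * 2)         // the "processing" step
                  .runWith(Sink.head(), system);

        result.thenAccept(r -> {
            System.out.println("response body: " + r);   // would become the body of the 200
            system.terminate();
        });
    }
}
```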
I am trying to build a CEP system with Apache Flink for correlating events. One of the requirements is to be able to add new patterns for anomaly detection at runtime without losing system availability. Any ideas on how I could do that?
For instance, if I have a stream of security events (e.g. accesses, authentications) and a pattern for detecting anomalies (e.g. >10 logins to the same machine in 1 minute), I would like to be able to change the pattern parameters (for instance, instead of 10 logins maybe I would like 8), and at the same time I would like to be able to create other patterns (on the same stream) to detect a new type of anomaly, all without losing events or system availability.
Regards.
This is already answered at Adding patterns dynamically in Apache Flink without restarting job
Basically, as Flink CEP is coded right now, you can't do it. This feature is expected in future releases, but we are still waiting for it.
Scenario
The DB for an application has gone down. This results in any actor responsible for committing important data to the DB failing to get a connection.
Preferred Behaviour
The important data is written to the db when it comes back up sometime in the future.
Current Implementation
The actor catches the DBException, wraps the data in a DBWriteFailed case class, and sends the message to its supervisor. The supervisor then schedules another write for sometime in the future (e.g. 1 minute) using system.scheduler.scheduleOnce(...) so that we don't spin in circles too much while waiting for the DB to come back up.
This implementation certainly works but I feel there might be a better way.
The protocol gets a bit messier when the committing actor has to respond to the original sender after a successful commit.
The regular flow of messages to the committing actor is not throttled in any way and the actor will happily process the new messages, likely failing to connect to the DB for each and every one of them.
If messages get caught in this retry loop for too long, the mailboxes of the committing actors will start to balloon. It is important that this data be committed, but none of it matters if the application crawls to a halt or crashes due to excessive memory usage.
I am an akka novice and I am largely inexperienced when it comes to supervisor strategies, but I feel as though I may be able to leverage one of those to handle some of this retry logic.
Is there a common approach in akka for solving a problem like this? Am I on the right track or should I be heading in a different direction?
Any help is appreciated.
You can use an Akka Circuit Breaker to reduce connection attempts. Instead of using the scheduler as a retry queue, I would use a buffer (with a max size limit) inside the actor and retry the buffered writes when the circuit breaker closes again (the onClose callback should send a message to the actor itself). An alternative could be to combine the circuit breaker with a stashing mailbox.
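A rough sketch of that shape with the classic actor API (assuming a recent Akka version); WriteRequest, the buffer limit, and dbWrite() are placeholders for the real protocol and database call:

```java
import akka.actor.AbstractActor;
import akka.pattern.CircuitBreaker;
import java.time.Duration;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class DbWriter extends AbstractActor {

    public static final class WriteRequest {
        public final String data;
        public WriteRequest(String data) { this.data = data; }
    }
    private enum Flush { INSTANCE }

    private static final int MAX_BUFFERED = 10_000;
    private final Deque<WriteRequest> buffer = new ArrayDeque<>();

    private final CircuitBreaker breaker =
        new CircuitBreaker(
                getContext().dispatcher(),
                getContext().getSystem().scheduler(),
                5,                       // open after 5 consecutive failures
                Duration.ofSeconds(10),  // call timeout
                Duration.ofMinutes(1))   // how long to stay open before probing again
            .addOnCloseListener(() -> getSelf().tell(Flush.INSTANCE, getSelf()));

    @Override
    public Receive createReceive() {
        return receiveBuilder()
            .match(WriteRequest.class, this::write)
            .matchEquals(Flush.INSTANCE, f -> {
                // Drain the buffer once the DB is reachable again; anything that
                // fails again simply re-enters the buffer.
                List<WriteRequest> pending = new ArrayList<>(buffer);
                buffer.clear();
                pending.forEach(this::write);
            })
            .build();
    }

    private void write(WriteRequest req) {
        try {
            breaker.callWithSyncCircuitBreaker(() -> dbWrite(req));
        } catch (Exception e) {
            if (buffer.size() < MAX_BUFFERED) {
                buffer.add(req);        // bounded buffer instead of an ever-growing mailbox
            }
        }
    }

    private Boolean dbWrite(WriteRequest req) {
        // Placeholder: the real JDBC/Slick write goes here.
        return true;
    }
}
```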
If you're planning to implement full failover in your app
Don't.
Do not bubble database failover responsibility up into the app layer. As far as your app is concerned, the database should just be up and ready to accept reads and writes.
If your database goes down often, spend time making your database more robust (there is a multitude of resources on the web for this: search for terms like 'replication', 'high availability', 'load-balancing' and 'clustering', and learn from the war stories of others at highscalability.com). It all really depends on what the cause of your DB outages is (e.g. I once maxed out the NIC on the DB master, and "fixed" the problem intermittently by enabling GZIP on the wire).
You'll be glad you adhered to a separation of concerns if you go down this route.
If you're planning to implement the odd sprinkling of retry logic and handling DB brown-outs
If you're not expecting your app to become a replacement database, then Patrik's answer is the best way to go.
Note: This is not for unit testing or integration testing. This is for when the application is running.
I am working on a system that communicates with multiple back-end systems, which can be grouped into three types:
Relational database
SOAP or WCF service
File system (network share)
Due to the environment this will run in, there are no guarantees that any of those will be available at run time. In fact some of them seem pretty brittle and go down multiple times a day :(
The thinking is to have a small bit of test code which runs before the actual code. If there is a problem, then persist the request and poll the target system until it is available. Tests could possibly be rerun within the code at logical points to check the target is still available. The ultimate goal is to have a very stable system, regardless of the stability (or lack thereof) of the systems it communicates with.
My questions around this design are:
Are there major issues with it? (small things like the fact it may fail between the test completing and the code running are understandable)
Are there better ways to implement this sort of design?
Would using traditional exception handling and/or transactions be better?
Updates
The system needs to talk to the back end systems in a coordinated way.
The system is very async in nature so using things like queuing technologies is fine.
The system must run even if one or more backend systems are down as others may be up and processing of some information is possible.
You will be needing that traditional exception handling no matter what, since as you point out there's always the chance that things'll fail between your last check and the actual request. So I really think any solution you find should try to interact smoothly with this.
You are not stating if these flaky resources need to interact in some kind of coordinated manner, which would indicate that you should probably be using a transaction manager of some sort to do this. I do not believe you want to get into the footwork of transaction management in application code for most needs.
Sometimes I have also seen people use AOP to encapsulate retry logic to back-end systems that fail (for instance due to time-out issues). Used sparingly this may be a decent solution.
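For illustration, the plain-code equivalent of such an aspect is just a small retry decorator; the attempt count, delay, and the wrapped call below are arbitrary:

```java
import java.util.concurrent.Callable;

public final class Retry {
    private Retry() {}

    // Run the call, retrying a fixed number of times with a fixed delay between attempts.
    public static <T> T withRetries(int maxAttempts, long delayMillis, Callable<T> call) throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;                      // remember the failure
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // back off before the next attempt
                }
            }
        }
        throw last;                            // attempts exhausted: let normal exception handling take over
    }
}

// Usage (soapClient and request are hypothetical):
// Result result = Retry.withRetries(3, 1000, () -> soapClient.call(request));
```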
In some cases you can also use message queuing technology to alleviate unstable back-ends. You could for instance commit to a message queue as part of a transaction, and only pop off the queue when successful. But this design is normally only possible when you're able to live with an asynchronous process.
And as always, real stability can only be achieved by attacking the root cause of the problem. I had a 25-year-old bug in a mainframe TCP/IP stack fixed because we were overrunning it, so it is possible.
The Microsoft Smartclient framework provides a ConnectionMonitor class. Should be easy to use or duplicate.
Our approach to this kind of issue was to run a really basic 'sanity tester' prior to bringing up our main application. This was a thick client, so we could run the test every time the app started. The sanity test would go out and check things like database availability and external network (extranet) access, and it could have been extended to cover web services as well.
If there was a failure, the user was informed, and crucially an email was also sent to the support/dev team. These emails soon became unwieldy as so many were being created, but we then set up filters, so we knew when something really bad was happening. Overall the approach worked pretty well; our biggest win was being able to tell users that the system was down before they had entered data and got part way through a long-winded process. They absolutely loved it.
At a technical level the sanity tester was written in C#; it used exception handling in a conventional way to find the problems it was looking for. The sanity program became a mini app in its own right, standalone from the main app. If I were doing it again I'd use a logging framework to capture issues, which is more flexible than our hard-coded approach.
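The idea is language-agnostic; a minimal sketch of such a pre-flight check (shown here in Java, with the JDBC URL, credentials, and extranet URL as placeholders) could look like this:

```java
import java.net.HttpURLConnection;
import java.net.URL;
import java.sql.Connection;
import java.sql.DriverManager;

public class SanityCheck {
    public static void main(String[] args) {
        report("database", SanityCheck::checkDatabase);
        report("extranet", () -> checkHttp("https://extranet.example.com/health"));
    }

    static void report(String name, Check check) {
        try {
            check.run();
            System.out.println(name + ": OK");
        } catch (Exception e) {
            // In the real system this is where the user message and the support email go.
            System.out.println(name + ": FAILED - " + e.getMessage());
        }
    }

    static void checkDatabase() throws Exception {
        try (Connection c = DriverManager.getConnection("jdbc:postgresql://db/example", "user", "secret")) {
            c.createStatement().execute("SELECT 1");   // cheap query proves the DB answers
        }
    }

    static void checkHttp(String url) throws Exception {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setConnectTimeout(5_000);
        if (conn.getResponseCode() >= 400) {
            throw new IllegalStateException("HTTP " + conn.getResponseCode());
        }
    }

    @FunctionalInterface
    interface Check { void run() throws Exception; }
}
```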