In Apache Flink, I'm using the RichAsyncFunction for data enrichment. In the case of errors/exceptions, I want to funnel those error records into an error stream. I can see that other functions have a "side output" for this sort of scenario, but how is it handled in RichAsyncFunction? I also see use of ResultFuture<>.completeExceptionally, but what does this do or mean when it occurs? Does the stream stop, is it just logged, what is the state with regards to the output element of the stream? All the docs seem to just point out how to handle the happy path or to call completeExceptionally with no explanation of what happens next. What is the proper way to handle/capture errors in RichAsyncFunction?
Thanks!
Related
I have two streams, one is main stream let's say in example of fraud detection I have transactions stream and then I have second stream which is configs, in our example it is rules. So I connect main stream to config stream in order to do processing. But when first time flink starts and we are adding job it starts consuming from transactions and configs stream parallel and when wants process transaction it sometimes see that there is no config and we have to send transaction to dead letter queue. However, what I want to achieve is, if there is patential config which I could get a bit later I want to get that config first then get transaction in order to process it rather then sending it to dead letter queue. I have the same key for transactions and configs.
long story short, is there a way telling flink when first time job starts try to consume one stream until there isn't new value then start processing main stream? How I can make them kind of sequential?
The recommended way to approach this is to connect the 2 streams and apply a RichCoFlatMap that will allow you to buffer events from main while you're waiting to receive the config events.
Check out this useful section of the Flink tutorials. The very last paragraph actually describes your problem.
It is important to recognize that you have no control over the order in which the flatMap1 and flatMap2 callbacks are called. These two input streams are racing against each other, and the Flink runtime will do what it wants to regarding consuming events from one stream or the other. In cases where timing and/or ordering matter, you may find it necessary to buffer events in managed Flink state until your application is ready to process them. (Note: if you are truly desperate, it is possible to exert some limited control over the order in which a two-input operator consumes its inputs by using a custom Operator that implements the InputSelectable interface.
So in a nutshell you should connect your 2 streams and have some kind of ListState where you can "buffer" your main elements while waiting to receive the rules. When you receive an element from the config stream, you check whether you had some pending elements "waiting" for that config in your ListState (your buffer). If you do, you can then process these elements and emit them through the collector of your flatmap.
Starting with version 1.16, you can use the hybrid source support in Flink to read all of once source (configs, in your case) before reading the second source. Though I imagine you'd have to map the events to an Either<config, transaction> so that the data stream has consistent record types.
I am trying to figure out the best practices to deal with poison messages / unhandled exceptions with Apache Flink. We have a Job doing real time event processing of location data from IoT devices. There are two potential scenarios where this can arise:
Data is bad in some way - e.g. invalid value
Data triggers a bug due to some edge case we have not anticipated.
Currently, all my data processing stops because of just one message.
I've seen two suggestions:
Catch the exceptions - this requires me wrapping every piece of logic with something to catch every runtime exception
Use side outputs as a kind of DLQ - from what I can tell this seems to be a variation on #1 where I have to catch all the exceptions and send them to the side output.
Is there really no way to do this other than wrap every piece of logic with exception handling? Is there no generic way to catch exceptions and not have processing continue?
I think the idea is not to catch all kinds of exceptions and send them elsewhere, but rather to have well-tested and functioning code and use dead letters only for invalid inputs.
So a typical pipeline would be
source => validate => ... => sink
\=> dead letter queue
As soon as your record passes your validate operator, you want all errors to bubble up, as any error in these operators may result in corrupted aggregates and data that - once written - cannot be reverted easily.
The validate step would work with any of the two approaches that you outlined. Typically, side-outputs have better semantics, but you may end up with more code.
Now you may have a service with high SLAs and actually want it to produce output even if it is corrupted just to produce data. Or you have simple transformation pipeline, where you'd miss some events but keep the majority (and downstream can deal with incomplete data). Then you are right that you need to wrap the code of all operators with try-catch. However, you'd typically still would only do it for the fragile operators and not for all of them. Trivial operators should be tested and then trusted to work. Further, you'd usually only catch specific kinds of exceptions to limit the scope to the kind of expected exceptions that can happen.
You might wonder why Flink doesn't have it incorporated as a default pattern. There are two reasons as far as I can see:
If Flink silently ignores any kind of exception and sends an extra message to a secondary sink, how can Flink ensure that the throwing operator is in a sane state afterwards? How can it avoid any kind of leaks that may happen because cleanup code is not executed?
It's more common in Java to let the developers explicitly reason about exceptions and exception handling. It's also not straight-forward to see what the requirements are: Do you want to have the input only? Do you also want to store the exception? What about the operator state that may have influenced the outcome? Should Flink still fail when too many errors have been received in a given time window? It quickly becomes a huge feature for something that should not happen at all in an ideal world where high quality data is ingested and properly processed.
So while it looks easy for your case because you exactly know which kinds of information you want to store, it's not easy to have a solution for all purposes, especially since the extra code that a user has to write is tiny compared to the generic solution.
What you could do is to extract most of the complicated logic things into a single ProcessFunction and use side-outputs as you have outlined. Since it's a central piece, you'd only need to write the side-output function once. If it's done multiple times, you could extract a helper function where you pass your actual code as a RunnableWithException lambda which hides all the side-output logic. Make sure you use plenty of finally blocks to ensure a sane state.
I'd also add quite a few IT cases and use mutation testing to harden your pipeline quicker. If you keep your test data inline, the mutants may also exactly simulate your unexpected data issues, such that your validate operator gets more complete.
I'm trying to make some validations in a stream, i'm currently checking for invalid card numbers and was asked if a could persist those invalid card numbers for future validations.
What is the best way to achieve that on Apache Flink.
Thanks
Okay, so if You want to be able to restart the job and keep the datam, then I would suggest using a Flink state which is checkpointed. I don't know the exact use case, so I can't tell whether You should use the KeyedState or Operator State. But basically the idea is to keep the card numbers or anything that You are using for validation in state and then cancel Your job with savepoint and whenever You will want to start it again, You can start from the given savepoint. This way You will never have empty list of invalid card numbers. You can read more on the state here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html
As for the case when You want to store the invalid card numbers externally, then You can for example side output the card numbers that were invalid and sink them to Kafka or file. This way You will be able to access them in any application or component. You can find more on side outputs here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html
I am using camel idempotent. Can someone please explain the logic behind idempotentConsumer xml tag.
I received file for first time. All good the idempotentconsumer block executed. on infinispan server I see a log PUT.
I dropped a duplicate file. Now idempotentconsumer identifies duplicated but on infinispan server I see a log with PUT instead of GET. I am wondering is this issue with server side or camel-client?
<idempotentConsumer messageIdRepositoryRef="infinispanRepo" >
<header>CamelFileAbsolutePath</header>
</idempotentConsumer>
No this is working as designed. The Idempotent Consumer EIP will attempt to put the key to the cache with a fixed value of true - that would be an atomic operation on Infinispan. The result of that put operation is then used to know if there was a duplicate or not.
If you do two operations with a GET and then PUT its no longer an atomic operation and you can end up with problems.
See the code at:
https://github.com/apache/camel/blob/master/components/camel-infinispan/src/main/java/org/apache/camel/component/infinispan/processor/idempotent/InfinispanIdempotentRepository.java#L68
I wonder if there is an option of built in error handling in Flink.
there may be 2 cases:
the current message from Kafka (in my case) is invalid, continue to next one
uncaught exception - from what I saw it can stop the stream aggregation completely.
ho can I handle these 2 cases? (java code)
1) This is done idiomatically with a flatMap: if your message is valid, you go on with a list containing your valid element (maybe already processed in the same step). If it's not valid, you simply return an empty list so that no elements are produced by that step. I could provide Scala code but I'm not familiar with Java APIs so I don't want to put you off track. Just check the flatMap call.
2) This depends on the type of exception: if it's provoked by your own code, just catch it and handle it inside the operator, or simply log it and move on. Without any further information about a specific case, this is the best I know of, but again, coming from Scala I haven't experienced runtime exceptions.