Keep the values from a stream in Apache Flink - apache-flink

I'm trying to do some validation on a stream. I'm currently checking for invalid card numbers, and I was asked whether I could persist those invalid card numbers for future validations.
What is the best way to achieve that in Apache Flink?
Thanks

Okay, so if you want to be able to restart the job and keep the data, then I would suggest using Flink state, which is checkpointed. I don't know the exact use case, so I can't tell whether you should use Keyed State or Operator State, but basically the idea is to keep the card numbers (or anything else you are using for validation) in state, then cancel your job with a savepoint, and whenever you want to start it again, start from that savepoint. This way you will never have an empty list of invalid card numbers. You can read more on state here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/state/state.html
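For illustration, a minimal sketch of the keyed-state idea. It assumes the stream carries plain card-number strings keyed by the card number itself, and the validation check is just a placeholder, not your real logic:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Keyed by card number; remembers per card whether it was already found invalid.
public class InvalidCardFilter extends RichFlatMapFunction<String, String> {

    private transient ValueState<Boolean> knownInvalid;

    @Override
    public void open(Configuration parameters) {
        knownInvalid = getRuntimeContext().getState(
                new ValueStateDescriptor<>("known-invalid", Boolean.class));
    }

    @Override
    public void flatMap(String cardNumber, Collector<String> out) throws Exception {
        if (Boolean.TRUE.equals(knownInvalid.value())) {
            return; // already flagged in an earlier run, restored from the savepoint
        }
        if (!isValid(cardNumber)) {
            knownInvalid.update(true); // persisted by checkpoints/savepoints
            return;
        }
        out.collect(cardNumber);
    }

    private boolean isValid(String cardNumber) {
        return cardNumber != null && cardNumber.length() == 16; // placeholder check
    }
}

You would use it as cardNumbers.keyBy(card -> card).flatMap(new InvalidCardFilter()), and the flags then survive a cancel-with-savepoint/restart cycle.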
As for the case where you want to store the invalid card numbers externally, you can for example side-output the invalid card numbers and sink them to Kafka or a file. This way you will be able to access them from any application or component. You can find more on side outputs here: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/side_output.html
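And a rough sketch of the side-output variant, again assuming plain card-number strings and a placeholder check; the Kafka or file sink at the end is left out:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class CardValidation {

    // Tag identifying the side output that carries invalid card numbers.
    static final OutputTag<String> INVALID_CARDS = new OutputTag<String>("invalid-cards") {};

    static SingleOutputStreamOperator<String> validate(DataStream<String> cards) {
        return cards.process(new ProcessFunction<String, String>() {
            @Override
            public void processElement(String card, Context ctx, Collector<String> out) {
                if (card != null && card.length() == 16) { // placeholder validation
                    out.collect(card);                     // main output: valid cards
                } else {
                    ctx.output(INVALID_CARDS, card);       // side output: invalid cards
                }
            }
        });
    }
}

The invalid numbers are then available via validate(cards).getSideOutput(INVALID_CARDS) and can be sunk to Kafka or a file.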

Related

Flink window aggregation with state

I would like to do a window aggregation with early-trigger logic (think of the aggregation as triggered either when the window closes or by a specific event), and I read in the docs: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The doc mentions "Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient", so the suggestion is to pair it with incremental window aggregation.
My question is about the AverageAggregate in the doc: its state is not saved anywhere, so if the application crashes, the AverageAggregate will lose all the intermediate values, right?
If that is the case, is there a way to do a window aggregation that still supports incremental aggregation and has a state backend to recover from a crash?
The AggregateFunction indeed only describes the mechanism for combining the input events into some result; that specific class does not store any data.
The state is persisted for us by Flink behind the scenes, though, when we write something like this:
input
.keyBy(<key selector>)
.window(<window assigner>)
.aggregate(new AverageAggregate(), new MyProcessWindowFunction());
The .keyBy(<key selector>).window(<window assigner>) tells Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of a crash or restart, no data is lost (assuming the state backend is configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
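For reference, the AverageAggregate from the linked docs looks roughly like this; the Tuple2<Long, Long> accumulator (running sum and count) is exactly the per-key, per-window state that Flink keeps in the configured state backend and restores after a failure:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// Accumulator is (sum, count); Flink stores it as window state between invocations.
public class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return new Tuple2<>(0L, 0L);
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
        return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> accumulator) {
        return ((double) accumulator.f0) / accumulator.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
    }
}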

Can hackers impact information inside state?

In my React application I calculate the price of the order in the back end and then transfer it to the state. But at the end, the PayPal order amount is passed through the state, which means that if a hacker can find a way to change the state to "$1", they can get the items cheaper.
This is just one case of me calculating things inside my state, and I was wondering whether a scenario of a hacker changing the state is possible.
One more case of me doing sensitive stuff with state:
When a user tries to reset their password and their IP is not blacklisted for too many tries, I transfer them to a page where they need to enter the PIN code they received on their phone. If they enter an invalid PIN, I increase the "failedTries" state and won't accept their submission once they have failed 3 times. This is done instead of going all the way to the DB and storing their failed PIN codes. If a hacker changes the state back to 0, they can simply brute-force the phone PIN, which is only 6 digits long.
I think you should save failedTries in the database, not in the UI, just like the calculated price.
You should get the protected content from a server, and this server should only deliver the content when the user sends a valid token.
This way, yes, anyone can flip the switch in the client, but that only shows the UI components, without any data.
This is the usual approach when creating single-page applications. As long as you don't have secret or sensitive data right in your client from the beginning, they are as safe as your server / API that delivers the data.
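A minimal, framework-agnostic sketch of what "keep it on the server" means for the failedTries case; PinVerifier and MAX_TRIES are made-up names, and a real service would persist and expire the counters rather than keep them in memory:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class PinVerifier {

    private static final int MAX_TRIES = 3;

    // userId -> failed attempts; lives on the server, out of reach of client-side state
    private final Map<String, Integer> failedTries = new ConcurrentHashMap<>();

    public boolean verify(String userId, String submittedPin, String expectedPin) {
        int tries = failedTries.getOrDefault(userId, 0);
        if (tries >= MAX_TRIES) {
            return false; // locked out, no matter what the client claims
        }
        if (expectedPin.equals(submittedPin)) {
            failedTries.remove(userId);
            return true;
        }
        failedTries.merge(userId, 1, Integer::sum);
        return false;
    }
}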

AsyncIO Exceptions in Apache Flink

In Apache Flink, I'm using the RichAsyncFunction for data enrichment. In the case of errors/exceptions, I want to funnel those error records into an error stream. I can see that other functions have a "side output" for this sort of scenario, but how is it handled in RichAsyncFunction? I also see use of ResultFuture<>.completeExceptionally, but what does this do or mean when it occurs? Does the stream stop, is it just logged, and what happens to the output element of the stream? All the docs seem to just show how to handle the happy path, or to call completeExceptionally with no explanation of what happens next. What is the proper way to handle/capture errors in RichAsyncFunction?
Thanks!
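For illustration, one commonly used pattern (a sketch, not necessarily the canonical answer): instead of calling completeExceptionally, which as far as I can tell propagates the exception to the async operator and fails the job (subject to the restart strategy), complete with a wrapper that records the failure as data, so a downstream ProcessFunction can side-output the error records. EnrichedOrError and lookupAsync are made-up names:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;

import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class EnrichFunction extends RichAsyncFunction<String, EnrichFunction.EnrichedOrError> {

    // Either a successful enrichment or an error message, never both.
    public static class EnrichedOrError {
        public final String input;
        public final String enriched; // null on failure
        public final String error;    // null on success
        EnrichedOrError(String input, String enriched, String error) {
            this.input = input;
            this.enriched = enriched;
            this.error = error;
        }
    }

    @Override
    public void asyncInvoke(String input, ResultFuture<EnrichedOrError> resultFuture) {
        lookupAsync(input).whenComplete((value, failure) -> {
            if (failure != null) {
                // emit the failure as a regular element instead of failing the stream
                resultFuture.complete(Collections.singleton(
                        new EnrichedOrError(input, null, failure.getMessage())));
            } else {
                resultFuture.complete(Collections.singleton(
                        new EnrichedOrError(input, value, null)));
            }
        });
    }

    // Stand-in for the real asynchronous call to the enrichment service.
    private CompletableFuture<String> lookupAsync(String input) {
        return CompletableFuture.supplyAsync(() -> "enriched-" + input);
    }
}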

Adding patterns dynamically in Apache Flink without restarting job

My use case is that I want to apply different CEP patterns to the same DataStream. The CEP patterns arrive dynamically, and I want them to be added to Flink without having to restart the job. While all conditions can be handled via custom classes that implement IterativeCondition, my main problem is that the temporal condition accepts only a TimeWindow, which cannot be handled this way. Is there some way that the value passed to .within() can be set based on the input elements?
Something similar was asked here: Flink and Dynamic templates recognition
Best Answer:
"What one could add is a co-flat map operator which receives on one input channel the events and on the other input channel patterns. For each newly received pattern one either updates the existing NFA (this functionality is missing) or compiles a new one. In the latter case, one would apply incoming events to all stored NFAs."
I am trying to implement this but I am facing some difficulty, specifically on the point of "In the latter case, one would apply incoming events to all stored NFAs".
The reason is that I apply the stream to a pattern using: PatternStream matchStream = CEP.pattern(tmatchStream, pattern);
But the stream "tmatchStream" would not be defined inside the co-flatMap. Am I missing something here? Any help would be greatly appreciated.
Unfortunately the answer to the linked question is still valid: Flink CEP does not support dynamic patterns at the moment. There is already a JIRA ticket for that, though: FLINK-7129
The earliest reasonable target version for that feature will be 1.6.0.
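For what it's worth, a rough sketch of the co-flat-map idea quoted above. It does not compile real CEP NFAs (that is exactly the part Flink is missing), and Rule is just a made-up substring check, but it shows how pattern definitions can arrive on a second input without restarting the job:

import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

import org.apache.flink.streaming.api.functions.co.CoFlatMapFunction;
import org.apache.flink.util.Collector;

public class DynamicRuleMatcher implements CoFlatMapFunction<String, DynamicRuleMatcher.Rule, String> {

    // Hypothetical stand-in for a CEP pattern definition.
    public static class Rule implements Serializable {
        public String name;
        public String mustContain;
    }

    // In a real job this should live in Flink state so rules survive restarts.
    private final List<Rule> rules = new ArrayList<>();

    @Override
    public void flatMap1(String event, Collector<String> out) {
        // Events arrive on the first input and are checked against every stored rule.
        for (Rule rule : rules) {
            if (event.contains(rule.mustContain)) {
                out.collect(rule.name + " matched: " + event);
            }
        }
    }

    @Override
    public void flatMap2(Rule newRule, Collector<String> out) {
        // New rules arrive on the second input and take effect immediately.
        rules.add(newRule);
    }
}

It would be wired up as events.connect(ruleStream).flatMap(new DynamicRuleMatcher()), which is why the pattern has to be applied inside the operator rather than via CEP.pattern(tmatchStream, pattern) on the outside.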

Camel condition on aggregate of messages

I'm looking for a way to conditionally handle messages based on the aggregation of messages. I've looked into a lot of ways to do this, but it seems that Apache Camel doesn't support it. I'll explain the scenario and then the solutions I tried.
Scenario:
I'm trying to conditionally clean a directory. I poll from the directory every x days and fetch all the files (file://...). I route this into an aggregation that aggregates the files into a single size (directorySize). I then check whether this size passes a certain threshold.
Here is where the problem lies: I now want to remove certain files if this condition passes, but I don't have access to the original messages anymore because they were aggregated into a new exchange.
Solutions:
I tried to fetch the files again to process them. The problem is that you can't make a consumer fetch on demand, as far as I know. I tried using pollEnrich, but that will only fetch a single file, not all files in the directory.
I tried to filter/stop the parent route. The problem here is that filter()/choice()...stop()/end() will only stop the aggregated route with the directory size, not the parent route with the file messages, so I can't conditionally process these.
I tried to move the aggregated condition to another route that I would call, but this causes the same problem as the first solution.
Things I consider doing:
Rewrite the aggregation strategy to aggregate not only the size but also the files themselves into a grouped exchange. This way I can split the aggregation again after the check. I don't really like this solution because it causes a lot of boilerplate, both in code and at runtime.
Move the file size calculation into a processor instead of the aggregator. This would defeat the purpose of using Camel in the first place: I would be manually fetching the files and adding up their sizes, for every single file.
Use a ControlBus to dynamically start the delete route on that directory. Once again, a lot of workaround to achieve something that I feel should be doable in a simple route.
I would like to set the calculated size on every parent message, but I have no clue how this could be achieved.
Another way to stop the parent route that I haven't thought of?
I'm a bit stunned that you can't elegantly filter messages based on an aggregation of those messages. Is there something I missed in Camel that would provide an elegant solution, or is this a case of picking the least bad solution?
Simple Schema
Message(File)
Message(File) --> AggregatedMessage(directorySize) --> delete certain Files?
Message(File)
Camel is really awesome, but sometimes it sure is difficult to see exactly which design pattern to use ;)
Firstly, you need to keep a copy of the file objects, because you don't know whether to delete them or not until you reach your threshold. There are basically (at least) two ways to do this.
Alternative 1
The first way is to use a List in an exchange property. This property will hang around no matter what you do with the exchange body. If you have a look at the source code for GroupedExchangeAggregationStrategy, it does precisely this:
list = new ArrayList<Exchange>();
answer.setProperty(Exchange.GROUPED_EXCHANGE, list);
// ...
list.add(newExchange);
Or you could do the same thing manually on your own exchange property. In any case, it's completely fine to use the Grouped aggregation strategy as you have done.
Alternative 2
The second way to "keep" old messages is to send a copy to a stopped SEDA queue, i.e. to("seda:xyz"), where you define the consuming route as .noAutoStartup(). Then you can send messages to it and they will pile up on an internal queue managed by Camel. When you want to process the messages, you simply start the route via the ControlBus and stop it again afterwards.
Generally, messing around with starting and stopping routes should be avoided unless absolutely necessary, but that's certainly another way to do it.
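A small sketch of alternative 2 with made-up endpoint and route names: files are copied onto a SEDA queue whose consumer route is not auto-started, and a ControlBus message later starts that route to drain the queue:

import org.apache.camel.builder.RouteBuilder;

public class StoppedQueueRoutes extends RouteBuilder {

    @Override
    public void configure() {
        // Every polled file also gets parked on the internal queue.
        from("file:data/inbox?noop=true")
            .to("seda:parkedFiles");

        // Deliberately not started, so messages simply accumulate on the queue.
        from("seda:parkedFiles")
            .routeId("drainParkedFiles")
            .noAutoStartup()
            .log("Processing parked file: ${header.CamelFileName}");

        // Elsewhere, once you decide to clean up:
        //   .to("controlbus:route?routeId=drainParkedFiles&action=start")
        // and stop it again afterwards:
        //   .to("controlbus:route?routeId=drainParkedFiles&action=stop")
    }
}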
Suggested solution
I suggest you do as you have done (i.e. alternative 1):
Aggregate via GroupedExchangeAggregationStrategy to keep the individual files in a list
Compute the total file size (use a processor, or do it along the way with a custom aggregation strategy)
Use a filter(simple("${body} < 123"))
"Unwind" your aggregation via a split(simple("${property.CamelGroupedExchange}"))
Delete your files one by one
Please let me know if this doesn't make sense, or if I have misunderstood your problem in any way.
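A rough route for the steps above, with made-up endpoint names and a made-up 1 MB threshold. It follows the property-based behaviour shown in the snippet earlier, where the grouped exchanges end up in the CamelGroupedExchange property; on newer Camel versions they may land on the message body instead, in which case you would split on the body:

import java.util.List;

import org.apache.camel.Exchange;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.processor.aggregate.GroupedExchangeAggregationStrategy;

public class CleanDirectoryRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("file:data/archive?noop=true")
            // group the whole poll batch into one exchange
            .aggregate(constant(true), new GroupedExchangeAggregationStrategy())
                .completionFromBatchConsumer()
            // compute the total size of the grouped files into a header
            .process(exchange -> {
                List<?> grouped = exchange.getProperty(Exchange.GROUPED_EXCHANGE, List.class);
                long total = 0;
                for (Object o : grouped) {
                    total += ((Exchange) o).getIn().getHeader(Exchange.FILE_LENGTH, Long.class);
                }
                exchange.getIn().setHeader("directorySize", total);
            })
            // only continue when the directory is over the threshold
            .filter(header("directorySize").isGreaterThan(1024 * 1024))
            // unwind the aggregation back into the individual file messages
            .split(simple("${property.CamelGroupedExchange}"))
            // decide per file whether it should be deleted, e.g. in a processor
            // using the CamelFileAbsolutePath header
            .log("Candidate for deletion: ${header.CamelFileName}");
    }
}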
