Adding patterns dynamically in Apache Flink without restarting job - apache-flink

My use case is that I want to apply different CEP patterns to the same datastream. the CEP patterns come dynamically & i want them to be added to flink without having to restart the job. While all conditions can be handled via custom classes that implement IterativeCondition, my main problem is that the temporal condition accepts only TimeWindow; which cannot be handled. Is there some way that the value passed to .within() be set based on the input elements?
Something similar was asked here: Flink and Dynamic templates recognition
Best Answer:
"What one could add is a co-flat map operator which receives on one input channel the events and on the other input channel patterns. For each newly received pattern one either updates the existing NFA (this functionality is missing) or compiles a new one. In the latter case, one would apply incoming events to all stored NFAs."
I am trying to implement this but I am facing some difficulty. Specifically, on the point of "In the latter case, one would apply incoming events to all stored NFAs"
Reason being that I apply stream to pattern using: PatternStream matchStream = CEP.pattern(tmatchStream, pattern);
But the stream "tmatchStream" would not be defined inside the co-flatMap. Am I missing something here??? Any help would be greatly appreciated.

Unfortunately the answer to the linked question is still valid. Flink CEP does not support dynamic patterns at that moment. There is already a JIRA ticket for that though: FLINK-7129
The earliest reasonable target version for that feature will be 1.6.0

Related

providing context for next condition

I want to detect all sequences of the same integer that is followed by the count of these integers.
is there any way to create this pattern with the Flink CEP library?
suppose this:
val input List(1,2,2,2,3,3,2,1,1,1,3,0,1)
these sequences should be detected:
List(2,2,2,3)
List(3,3,2)
List(1,1,1)
I know there are many ways to do it by the Flink itself but I want to do it with CEP and I think it needs support passing context to the next condition am I right? if so, is it possible with Flink CEP to pass context or modify the events on stream and put some extra data to pass information to the next condition?
is it possible with Flink CEP to pass context or modify the events on stream and put some extra data to pass information to the next condition?
Flink CEP supports iterative conditions, which are conditions that can examine the events that matched earlier components of a compound pattern. I think that's powerful enough to handle this use case.
There's an example showing how to use iterative conditions in one of the recipes in the Immerok Cookbook: Alerting when Problems Persist.
(Disclaimer: I work for Immerok.)

One object flink operator (ex. Filter) or two objects in Apache Flink Job

I have Apache Flink Job with 4 input DataStreams (JSON messages) from separate Apache Kafka topics and I've only one object XFilterFunction - which does some filtering. I wrote some data pipeline logics (for primitive example):
FilterFunction<MyEvent> xFilter = new XFilterFunction();
inputDataStream1.filter(xFilter)
.name("Xfilter")
.uid("Xfilter");
inputDataStream2
.union(inputDataStream3)
//here some logics (map, process,...)
.filter(xFilter);
Is it good or bad practice to use one new object XFilterFunction in Job?
Or better to use two new objects XFilterFunction? (2 streams -> 2 new filter objects)
If you instantiate the class several times i.e.
inputDataStream1.filter(new XFilterFunction());
...
inputDataStream2.filter(new XFilterFunction());
there should be not problem. I'm not so sure if otherwise things like state or overridden contextual functions would show unwanted behaviour.
In case it's no specialization of RichFunction, maybe there's even just a pure function invocation happening via delegates, unfortunately I'm not that deep into Flink's internals to say, but with solution above, you should be safe.

Detect absence of a certain event

In the documentation of FlinkCEP, I found that I can enforce that a particular event doesn't occur between two other events using notFollowedBy or notNext.
However, I was wondering If I could detect the absence of a certain event after a time X.
For example, if an event A is not followed by another event A within 10 seconds, fire an alert or do something.
Could be possible to define a FlinkCEP pattern to capture that situation?
Thanks in advance,
Humberto
Although Flink CEP does not support notFollowedBy at the end of a Pattern, there is a way to implement this by exploiting the timeout feature.
The Flink training includes an exercise where the objective is to identify taxi rides with a START event that is not followed by an END event within two hours. You'll find a solution to this exercise that uses CEP
here.
The main idea would be to define a Pattern of A followed by A within 10 seconds, and then capture the case where this times out.

Camel aggregation from two queues in Java DSL

I have two queues which having same type of objects in them. I want to aggregate them into a single queue through java DSL. Could anyone tell me if this is possible? If so, any code references?
If I understand your question correctly, it is possible to do such a thing.
If you need just to drive them into a single route (without any aggregations, enrichments, etc.) you can just proceed with this piece of code:
from('direct:queue1')
.to('direct:start');
from('direct:queue2')
.to('direct:start');
from('direct:start')
//there goes your processing
If you need to aggregate them later on, use Aggregator. Or you can use example from java-addict301's answer if it solves your case.
I believe this may be doable in Camel using the Content Enricher pattern.
Specifically, the following paradigm can be used to retrieve a message from one queue (where direct:start is) and enrich it with a message from the second queue (where direct:resource is). The combined message can then be built in your AggregationStrategy implementation class.
AggregationStrategy aggregationStrategy = ...
from("direct:start")
.enrich("direct:resource", aggregationStrategy)
.to("direct:result");
from("direct:resource")

Camel condition on aggregate of messages

I'm looking for a way to conditionally handle messages based on the aggregation of messages. I've looked into a lot of ways to do this, but it seems that Apache Camel doesn't support it. I'll explain the scenario and then the solutions I tried.
Scenario:
I'm trying to conditionally clean a directory. I poll from the directory every x days and fetch all the files (file://...). I route this into an aggregation, that aggregates the files into a single size (directorySize). I then check if this size passes a certain threshold.
Here is where the problem lies. I now want to remove certain files if this condition passes, but I don't have access to the original messages anymore because they were aggregated in a new exchange.
Solutions:
I tried to fetch the files again to process them. Problem is that you can't make a consumer fetch on demand as far as I know. I tried using pollEnrich, but that will only fetch a single file and not all files in the directory.
I tried to filter/stop the parent route. The problem here is that filter()/choice...stop()/end() will only stop the aggregated route with the directory size and not the parent route with the file messages. I can't conditionally process these.
I tried to move the aggregated condition to another route that I would call, but this causes the same problem as the first solution.
Things I consider doing:
Rewrite the aggregation strategy to not only aggregate the size, but also the files itself into a groupedExchange. This way I can split the aggregation again after the check. I don't really like this solution because it causes a lot boilerplate, both in code as during runtime.
Move the file size calculator to a processor instead of the aggregator. This would defeat the purpose of using camel in the first place.. I would manually be fetching the files and adding the sizes.. And that for every single file..
Use a ControlBus to dynamically start the delete route on that directory. Once again a lot of workaround to achieve something that I feel should be able to be done in a simple route.
I would like to set the calculated size on every parent message, but I have no clue how this could be achieved?
Another way to stop the parent route that I haven't thought of?
I'm a bit stunned that you can't elegantly filter messages based on the aggregation of these messages. Is there something that I missed in Camel that would provide an elegant solution? Or is this a case of the least bad solution?
Simple Schema
Message(File)
Message(File) --> AggregatedMessage(directorySize) --> delete certain Files?
Message(File)
Camel is really awesome, but sometimes it's sure difficult to see exactly which design pattern to use ;)
Firstly, you need to keep a copy of the file objects, because you don't know whether to delete them or not until you reach your threshold - there are basically (at least) two ways to do this.
Alternative 1
The first way is to use a List in an exchange property. This property will hang around no matter what you do with the exchange body. If you have a look at the source code for GroupedExchangeAggregationStrategy, it does precisely this:
list = new ArrayList<Exchange>();
answer.setProperty(Exchange.GROUPED_EXCHANGE, list);
// ...
list.add(newExchange);
Or you could do the same thing manually on your own exchange property. In any case, it's completely fine to use the Grouped aggregation strategy as you have done.
Alternative 2
The second way to "keep" old messages is to send a copy to a stopped SEDA queue. So you would do to("seda:xyz"). You define this queue as .noAutoStartup(). Then you can send messages to it and they will queue up on an internal queue, managed by camel. When you want to process the messages, you simply start it up via controlbus and stop it again afterwards.
Generally, messing around with starting and stopping queues should be avoided unless absolutely necessary, but that's certainly another way to do it
Suggested solution
I suggest you do as you have done (i.e. alternative 1):
aggregate via GroupedExchangeAggregationStrategy to keep the individual files in a list
Compute the total file size (use a processor, or do it along the way with a custom aggregation strategy)
Use a filter(simple("${body} < 123"))
"Unwind" your aggregation via a splitter(simple("${property.CamelGroupedExchange}"))
Delete your files one by one
Please let me know if this doesn'y makes sense, or if I have misunderstood your problem in any way.

Resources