Flink - behaviour of timesOrMore

I want to find a pattern of events as follows.
The inner pattern is:
Have the same value for the key "sensorArea".
Have different values for the key "customerId".
Are within 5 seconds of each other.
And this pattern needs to:
Emit an "alert" only if the above happens 3 or more times.
I wrote something, but I know for sure it is not complete.
Two questions:
I need to access the previous event's fields when I'm in the "next" pattern. How can I do that without using ctx, because it is heavy?
My code produces a weird result - this is my input
and my output is
3> {first=[Customer[timestamp=50,customerId=111,toAdd=2,sensorData=33]], second=[Customer[timestamp=100,customerId=222,toAdd=2,sensorData=33], Customer[timestamp=600,customerId=333,toAdd=2,sensorData=33]]}
even though my desired output should be all of the first six events (users 111/222 and sensorArea 33, then 44, then 55).
Pattern<Customer, ?> sameUserDifferentSensor = Pattern.<Customer>begin("first", skipStrategy)
        .followedBy("second").where(new IterativeCondition<Customer>() {
            @Override
            public boolean filter(Customer currCustomerEvent, Context<Customer> ctx) throws Exception {
                List<Customer> firstPatternEvents = Lists.newArrayList(ctx.getEventsForPattern("first"));
                int i = firstPatternEvents.size();
                int currSensorData = currCustomerEvent.getSensorData();
                int prevSensorData = firstPatternEvents.get(i - 1).getSensorData();
                int currCustomerId = currCustomerEvent.getCustomerId();
                int prevCustomerId = firstPatternEvents.get(i - 1).getCustomerId();
                return currSensorData == prevSensorData && currCustomerId != prevCustomerId;
            }
        })
        .within(Time.seconds(5))
        .timesOrMore(3);

PatternStream<Customer> sameUserDifferentSensorPatternStream = CEP.pattern(customerStream, sameUserDifferentSensor);
DataStream<String> alerts1 = sameUserDifferentSensorPatternStream.select((PatternSelectFunction<Customer, String>) Object::toString);

You will have an easier time if you first key the stream by the sensorArea. Then you will be pattern matching on streams where all of the events are for a single sensorArea, which will make the pattern easier to express and the matching more efficient.
You can't avoid using an iterative condition and the ctx, but it should be less expensive after keying the stream.
Also, your code example doesn't match the text description. The text says "within 5 seconds" and "3 or more times", while the code has within(Time.seconds(2)) and timesOrMore(2).
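A minimal sketch of that keying step (in Scala for brevity; Customer.getSensorArea is an assumed accessor not shown in the question, and sameUserDifferentSensor is assumed to be rebuilt with the Scala CEP API):
import org.apache.flink.streaming.api.scala._
import org.apache.flink.cep.scala.CEP

// Key by sensorArea first: every partial match then only sees events from a
// single sensor area, so the iterative condition only has to compare customerIds.
val keyedCustomers = customerStream.keyBy(_.getSensorArea)

// Apply the pattern per key instead of on the whole stream.
val patternStream = CEP.pattern(keyedCustomers, sameUserDifferentSensor)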


Need to optimize the code for mapping codes to description

I have a text field that contains semicolon-separated codes. These codes have to be replaced with their descriptions. I have a separate map that holds code and description. There is a trigger that replaces the codes with their descriptions. The data will be loaded into this field using the Data Loader. I am afraid it might not work for a large amount of data, since I had to use inner for loops. Is there any way I can achieve this without inner for loops?
public static void updateStatus(Map<Id, Account> oldMap, Map<Id, Account> newMap)
{
    Map<String, String> DataMap = new Map<String, String>();
    List<Data_Mapper__mdt> DataMapList = [select Salseforce_Value__c, External_Value__c from Data_Mapper__mdt
                                          where active__c = true AND Field_API_Name__c = :CUSTOMFIELD_MASSTATUS
                                          AND Object_API_Name__c = :OBJECT_ACCOUNT];
    for(Data_Mapper__mdt dataMapRec : DataMapList) {
        DataMap.put(dataMapRec.External_Value__c, dataMapRec.Salseforce_Value__c);
    }
    for(Account objAcc : newMap.values())
    {
        if(objAcc.Status__c != '') {
            String updatedDescription = '';
            List<String> delimitedList = objAcc.Status__c.split('; ');
            for(String Code : delimitedList) {
                updatedDescription = DataMap.get(Code);
            }
            objAcc.Status__c = updatedDescription;
        }
    }
}
It should be fine. You have map-based access acting like a dictionary, and the query is outside of the loop. Write a unit test that populates close to 200 accounts (that's how the trigger will be called in every Data Loader iteration). There could be some concerns if you had thousands of values in that Status__c, but there's not much that can be done to optimise it.
But I want to ask you 3 things.
The way you wrote it, updatedDescription will always contain only the last decoded value. Are you sure you didn't want to write something like updatedDescription += DataMap.get(Code) + ';';, or add the values to a List<String> and then call String.join on it? If you truly want the first or last element, I'd add break; or just access the last element of the split (and then you're right, you're removing the inner loop). But written like that, it looks... weird.
Have you thought about multiple runs? I mean, if there's a workflow rule/flow/process builder, you might enter this code again. And because you're overwriting the field, I think it'll completely screw you over:
Map<String, String> mapping = new Map<String, String>{
    'one' => '1',
    'two' => '2',
    'three' => '3',
    '2' => 'lol'
};
String text = 'one;two';
List<String> temp = new List<String>();
for(String key : text.split(';')){
    temp.add(mapping.get(key));
}
text = String.join(temp, ';');
System.debug(text); // "1;2"

// Oh noo, a workflow caused my code to run again.
// Or user edited the account.
temp = new List<String>();
for(String key : text.split(';')){
    temp.add(mapping.get(key));
}
text = String.join(temp, ';');
System.debug(text); // "lol", some data was lost

// And again
temp = new List<String>();
for(String key : text.split(';')){
    temp.add(mapping.get(key));
}
text = String.join(temp, ';');
System.debug(text); // "", empty
Are you even sure you need this code? Salesforce is perfectly fine with having separate picklist labels (what's visible to the user) and API values (what's saved to the database, referenced in Apex, validation rules...). Maybe you don't need this transformation at all. Maybe your company should look into Translation Workbench. Or even ditch this code completely and do some search-and-replace before invoking Data Loader, in a real ETL tool (or even MS Excel).

A problem from Flink Training tutorial: LongRidesSolution.scala

What this function (processElement) does is pretty clear:
Based on the keyed stream (keyed by rideId), it iterates over all the elements whose rideId belongs to that key and updates the state based on the condition:
override def processElement(ride: TaxiRide,
                            context: KeyedProcessFunction[Long, TaxiRide, TaxiRide]#Context,
                            out: Collector[TaxiRide]): Unit = {
  val timerService = context.timerService

  if (ride.isStart) {
    // the matching END might have arrived first; don't overwrite it
    if (rideState.value() == null) {
      rideState.update(ride)
    }
  }
  else {
    rideState.update(ride)
  }

  timerService.registerEventTimeTimer(ride.getEventTime + 120 * 60 * 1000)
}
The timer will fire once the watermark reaches the timestamp:
override def onTimer(timestamp: Long,
                     ctx: KeyedProcessFunction[Long, TaxiRide, TaxiRide]#OnTimerContext,
                     out: Collector[TaxiRide]): Unit = {
  val savedRide = rideState.value

  if (savedRide != null && savedRide.isStart) {
    out.collect(savedRide)
  }

  rideState.clear()
}
The problem is: if the END record comes first, then based on the logic the later START will not update the ride state (for that key). The timer then fires after 2 hours, finds the END in state, and so it does not collect and does not emit the record. But what if this record meets our requirement, i.e. the start time of the ride happened more than 2 hours before the end? I think there should be more logic to deal with that.
If the END record is processed before the START record, then it could be that the START record arrives very late, and when it does arrive it supplies evidence that this ride lasted for more than two hours.
However, the goal of this exercise is not to find all rides that last for more than two hours, but rather to flag, in real-time, rides that should have ended by now (because they started more than two hours ago), but haven't. Since these rides you ask about have ended, it's debatable whether they merit alerts.
You've raised an interesting point that should probably be added to the exercise discussion page.
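If one did want to handle that case, here is a minimal sketch of my own (not part of the official solution; it assumes TaxiRide exposes isStart and getEventTime as in the code above): the START branch of processElement would compare a late START against an END already held in state.
// Inside processElement: the START branch also checks whether a saved END
// already proves the ride exceeded two hours.
val twoHoursMs = 120 * 60 * 1000
if (ride.isStart) {
  val saved = rideState.value()
  if (saved == null) {
    rideState.update(ride)
  } else if (!saved.isStart && saved.getEventTime - ride.getEventTime > twoHoursMs) {
    // The ride already ended, but its late START shows it lasted > 2 hours.
    out.collect(ride)
  }
}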

Why does Flink emit duplicate records on a DataStream join + Global window?

I'm learning/experimenting with Flink, and I'm observing some unexpected behavior with the DataStream join, and would like to understand what is happening...
Let's say I have two streams with 10 records each, which I want to join on an id field. Let's assume that each record in one stream has a matching one in the other, and that the IDs are unique within each stream. Let's also say I have to use a global window (requirement).
Join using DataStream API (my simplified code in Scala):
val stream1 = ... // from a Kafka topic on my local machine (I tried with and without .keyBy)
val stream2 = ...

stream1
  .join(stream2)
  .where(_.id).equalTo(_.id)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .apply {
    (row1, row2) => // ...
  }
  .print()
Result:
Everything is printed as expected, each record from the first stream joined with a record from the second one.
However:
If I re-send one of the records (say, with an updated field) from one of the streams to that stream, two duplicate join events get emitted 😞
If I repeat that operation (with or without an updated field), I will get 3 emitted events, then 4, 5, etc... 😞
Could someone in the Flink community explain why this is happening? I would have expected only 1 event emitted each time. Is it possible to achieve this with a global window?
In comparison, the Flink Table API behaves as expected in that same scenario, but for my project I'm more interested in the DataStream API.
Example with Table API, which worked as expected:
tableEnv
  .sqlQuery(
    """
      |SELECT *
      | FROM stream1
      | JOIN stream2
      | ON stream1.id = stream2.id
    """.stripMargin)
  .toRetractStream[Row]
  .filter(_._1) // just keep the inserts
  .map(...)
  .print() // works as expected, after re-sending updated records
Thank you,
Nicolas
The issue is that records are never removed from your global window. So you trigger the join operation on the global window whenever a new record arrives, but the old records are still present.
Thus, to get it running in your case, you'd need to implement a custom evictor. I expanded your example into a minimal working example and added the evictor, which I will explain after the snippet.
val data1 = List(
  (1L, "myId-1"),
  (2L, "myId-2"),
  (5L, "myId-1"),
  (9L, "myId-1"))
val data2 = List(
  (3L, "myId-1", "myValue-A"))

val stream1 = env.fromCollection(data1)
val stream2 = env.fromCollection(data2)

stream1.join(stream2)
  .where(_._2).equalTo(_._2)
  .window(GlobalWindows.create()) // assume this is a requirement
  .trigger(CountTrigger.of(1))
  .evictor(new Evictor[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)], GlobalWindow]() {
    override def evictBefore(
        elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]],
        size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {}

    override def evictAfter(
        elements: lang.Iterable[TimestampedValue[CoGroupedStreams.TaggedUnion[(Long, String), (Long, String, String)]]],
        size: Int, window: GlobalWindow, evictorContext: Evictor.EvictorContext): Unit = {
      import scala.collection.JavaConverters._
      // index of the last element that came from the second (lookup) input
      val lastInputTwoIndex = elements.asScala.zipWithIndex
        .filter(e => e._1.getValue.isTwo)
        .lastOption.map(_._2).getOrElse(-1)
      if (lastInputTwoIndex == -1) {
        println("Waiting for the lookup value before evicting")
        return
      }
      // evict everything except the last element from the second input
      val iterator = elements.iterator()
      for (index <- 0 until size) {
        val cur = iterator.next()
        if (index != lastInputTwoIndex) {
          println(s"evicting ${cur.getValue.getOne}/${cur.getValue.getTwo}")
          iterator.remove()
        }
      }
    }
  })
  .apply((r, l) => (r, l))
  .print()
The evictor is applied after the window function (the join, in this case) has been applied. It's not entirely clear how exactly your use case should work when you have multiple entries in the second input, but for now the evictor only works with single entries.
Whenever a new element comes into the window, the window function is immediately triggered (count = 1). Then the join is evaluated with all elements having the same key. Afterwards, to avoid duplicate outputs, we remove all entries from the first input in the current window. Since the second input may arrive after the first inputs, no eviction is performed when the second input is empty. Note that my Scala is quite rusty; you will be able to write it in a much nicer way. The output of a run is:
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
Waiting for the lookup value before evicting
4> ((1,myId-1),(3,myId-1,myValue-A))
4> ((5,myId-1),(3,myId-1,myValue-A))
4> ((9,myId-1),(3,myId-1,myValue-A))
evicting (1,myId-1)/null
evicting (5,myId-1)/null
evicting (9,myId-1)/null
A final remark: if the Table API already offers a concise way of doing what you want, I'd stick to it and then convert it to a DataStream when needed.

How to buffer and drop a chunked bytestring with a delimiter?

Let's say you have a publisher using broadcast with some fast and some slow subscribers, and you would like to be able to drop sets of messages for the slow subscriber without having to keep them in memory. The data consists of chunked ByteStrings, so dropping a single ByteString is not an option.
Each set of ByteStrings is followed by a terminator ByteString("\n"), so I would need to drop a set of ByteStrings ending with that.
Is that something you can do with a custom graph stage? Can it be done without aggregating and keeping the whole set in memory?
Avoid Custom Stages
Whenever possible, try to avoid custom stages; they are very tricky to get correct as well as being pretty verbose. Usually some combination of the standard akka-stream stages and plain old functions will do the trick.
Group Dropping
Presumably you have some criteria that you will use to decide which group of messages will be dropped:
type ShouldDropTester = () => Boolean
For demonstration purposes I will use a simple switch that drops every other group:
val dropEveryOther : ShouldDropTester =
  Iterator.from(1)
          .map(_ % 2 == 0)
          .next
We will also need a function that will take in a ShouldDropTester and use it to determine whether an individual ByteString should be dropped:
val endOfFile = ByteString("\n")

val dropGroupPredicate : ShouldDropTester => ByteString => Boolean =
  (shouldDropTester) => {
    var dropGroup = shouldDropTester()

    (byteString) =>
      if(byteString equals endOfFile) {
        val returnValue = dropGroup
        dropGroup = shouldDropTester()
        returnValue
      }
      else {
        dropGroup
      }
  }
Combining the above two functions will drop every other group of ByteStrings. This functionality can then be converted into a Flow:
val filterPredicateFunction : ByteString => Boolean =
  dropGroupPredicate(dropEveryOther)

val dropGroups : Flow[ByteString, ByteString, _] =
  Flow[ByteString] filter filterPredicateFunction
As required: the groups of messages do not need to be buffered; the predicate works on individual ByteStrings and therefore consumes a constant amount of memory regardless of file size.
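A minimal usage sketch of my own (not from the answer above; it assumes Akka 2.6+, where an implicit ActorSystem is enough to materialize the stream): two newline-terminated groups are pushed through dropGroups, and with the dropEveryOther tester the first group is filtered out while the second passes through.
import akka.actor.ActorSystem
import akka.stream.scaladsl.Source
import akka.util.ByteString

implicit val system: ActorSystem = ActorSystem("drop-groups-demo")

// two groups of chunks, each terminated by ByteString("\n")
val chunks = Source(List(
  ByteString("a1"), ByteString("a2"), ByteString("\n"),  // group 1
  ByteString("b1"), ByteString("\n")                     // group 2
))

chunks
  .via(dropGroups)
  .runForeach(bs => print(bs.utf8String))  // only group 2 ("b1" + newline) is printed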

Is there a way to link Save Results (for an insert DML operation) back to sObjects?

Background
I have a list of sObjects I need to insert, but I must first check whether the insert will be successful. So I'm setting a database savepoint before performing the insert and checking the save results (for the insert statement). Because I don't want to process anything if errors occurred, if there are any errors in the insert results, the database is rolled back to the savepoint.
Problem & Question
I need to collect the errors for each save (insert) result and associate each error with the specific sObject record that caused it. According to the documentation, save results contain a list of errors, the Salesforce ID of the record inserted (if successful), and a success indicator (Boolean).
How do I associate each Save Result with the original sObject record that was inserted?
Code/Example
Here's an example I put together that demonstrates the concept. The example is flawed in that the InsertResults don't always match the sObjectsToInsert. It's not exactly the code I'm using in my custom class, but it uses the same logic.
Map<Id, sObject> sObjectsToInsert; // this variable is set previously in the code
List<Database.SaveResult> InsertResults;
Map<String, sObject> ErrorMessages = new Map<String, sObject>();
System.Savepoint sp = Database.setSavepoint();

// 2nd parameter must be false to get all errors, if there are errors
// (allow partial successes)
InsertResults = Database.insert(sObjectsToInsert.values(), false);

Integer e = 0; // errors (declared outside the loop so it can be checked afterwards)
for(Integer i = 0; i < InsertResults.size(); i++)
{
    // This method does not guarantee the save result (ir) matches the sObject
    // I need to make sure the insert result matches
    Database.SaveResult ir = InsertResults[i];
    sObject s = sObjectsToInsert.values()[i];
    String em = ''; // error message
    if(!ir.isSuccess())
    {
        system.debug('Not Successful');
        e++;
        for(Database.Error dbe : ir.getErrors()) { em += dbe.getMessage() + ' '; }
        ErrorMessages.put(em, s);
    }
}
if(e > 0)
{
    database.rollback(sp);
    // log all errors in the ErrorMessages Map
}
Your comment says the SaveResult list is not guaranteed to be in order, but I believe that it is. I've used this technique for years and have never had an issue.
