Fraud Detection DataStream API tutorial questions - apache-flink

I am following the tutorial here.
Q1: Why, in the final application, do we clear all state and delete the timer whenever flagState is set, regardless of the current transaction amount? I am referring to this part of the code:
// Check if the flag is set
if (lastTransactionWasSmall != null) {
    if (transaction.getAmount() > LARGE_AMOUNT) {
        // Output an alert downstream
        Alert alert = new Alert();
        alert.setId(transaction.getAccountId());
        collector.collect(alert);
    }
    // Clean up our state [WHY HERE?]
    cleanUp(context);
}
If the transaction stream were 0.5, 10, 600, then flagState would be set for 0.5 and cleared for 10. So for 600 we would skip the code block above and never check for a large amount. But if the 0.5 and 600 transactions occurred within a minute of each other, we should have sent an alert, yet we didn't.
Q2: Why do we use processing time to determine whether two transactions are one minute apart? The Transaction class has a timestamp field, so isn't it better to use event time? Processing time is affected by the speed of the application, so two transactions whose event times are within one minute of each other could be processed more than one minute apart due to lag.

A1: The fraud model being used in this example is explained by this figure:
In your example, the transaction 600 must immediately follow the transaction for 0.5 to be considered fraud. Because of the intervening transaction for 10, it is not fraud, even if all three transactions occur within a minute. It's just a matter of how the use case was framed.
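For context, here is a compact sketch of how the walkthrough's KeyedProcessFunction fits together. It is reconstructed from memory of the tutorial, so treat names such as flagState, timerState and cleanUp as approximate rather than authoritative; the point is that the flag only ever means "the immediately preceding transaction was small", which is why it is cleared on the very next transaction no matter what:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;
// Transaction and Alert come from the walkthrough's flink-walkthrough-common module.

public class FraudDetector extends KeyedProcessFunction<Long, Transaction, Alert> {

    private static final double SMALL_AMOUNT = 1.00;
    private static final double LARGE_AMOUNT = 500.00;
    private static final long ONE_MINUTE = 60 * 1000;

    private transient ValueState<Boolean> flagState;
    private transient ValueState<Long> timerState;

    @Override
    public void open(Configuration parameters) {
        flagState = getRuntimeContext().getState(new ValueStateDescriptor<>("flag", Types.BOOLEAN));
        timerState = getRuntimeContext().getState(new ValueStateDescriptor<>("timer-state", Types.LONG));
    }

    @Override
    public void processElement(Transaction transaction, Context context, Collector<Alert> collector) throws Exception {
        Boolean lastTransactionWasSmall = flagState.value();

        if (lastTransactionWasSmall != null) {
            if (transaction.getAmount() > LARGE_AMOUNT) {
                // A small transaction was immediately followed by a large one: alert.
                Alert alert = new Alert();
                alert.setId(transaction.getAccountId());
                collector.collect(alert);
            }
            // Whatever the current amount is, "the previous transaction was small"
            // no longer holds after this transaction, so the flag and timer are cleared.
            cleanUp(context);
        }

        if (transaction.getAmount() < SMALL_AMOUNT) {
            // Remember this small transaction and forget it again after one minute.
            flagState.update(true);
            long timer = context.timerService().currentProcessingTime() + ONE_MINUTE;
            context.timerService().registerProcessingTimeTimer(timer);
            timerState.update(timer);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) {
        // One minute passed with no follow-up transaction: drop the flag.
        timerState.clear();
        flagState.clear();
    }

    private void cleanUp(Context ctx) throws Exception {
        Long timer = timerState.value();
        if (timer != null) {
            ctx.timerService().deleteProcessingTimeTimer(timer);
        }
        timerState.clear();
        flagState.clear();
    }
}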
A2: Doing this with event time would be a very valid choice, but would make the example much more complex. Not only would watermarks be required, but we would also have to sort the stream by event time, since a realistic example would have to consider that the events might be out-of-order.
At that point, implementing this with a process function would no longer be the best choice. Using the temporal pattern matching capabilities of either Flink's CEP library or Flink SQL with MATCH_RECOGNIZE would be the way to go.
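To give a rough idea of what A2 is pointing at, here is a hedged sketch of the same pattern expressed with Flink's CEP library. The Transaction type and the thresholds are borrowed from the walkthrough; everything else is illustrative and not part of the tutorial:

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternStream;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.windowing.time.Time;

// A small transaction immediately followed by a large one, within one minute.
Pattern<Transaction, ?> fraudPattern = Pattern
        .<Transaction>begin("small")
        .where(new SimpleCondition<Transaction>() {
            @Override
            public boolean filter(Transaction t) {
                return t.getAmount() < 1.00;
            }
        })
        .next("large") // strict contiguity: no other transaction may occur in between
        .where(new SimpleCondition<Transaction>() {
            @Override
            public boolean filter(Transaction t) {
                return t.getAmount() > 500.00;
            }
        })
        .within(Time.minutes(1));

// 'transactions' is assumed to be a DataStream<Transaction>.
PatternStream<Transaction> matches =
        CEP.pattern(transactions.keyBy(Transaction::getAccountId), fraudPattern);

From the resulting PatternStream you would then emit an Alert via select() or process(); for event time you would also attach a WatermarkStrategy to the input stream first.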

Related

Apache Flink Is Windowing dependent on Timestamp assignment of EventTime Events

I am new to Apache Flink and am trying to understand how the concepts of event time and windowing are handled by Flink.
So here's my scenario:
I have a program that runs as a thread and creates a file with 3 fields every second, of which the 3rd field is the timestamp.
There is a little tweak though: every 5 seconds I enter an older timestamp (t-5, you could say) into the newly created file.
Now I run the stream processing job, which reads the 3 fields above into a tuple.
Now I have defined the following code for watermarking and timestamp generation:
WatermarkStrategy
    .<Tuple3<String, Integer, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(4))
    .withTimestampAssigner((event, timestamp) -> event.f2);
I then use the following code to window the above and compute the aggregation:
withTimestampsAndWatermarks
    .keyBy(0)
    .window(TumblingEventTimeWindows.of(Time.milliseconds(4000)))
    .reduce((x, y) -> new Tuple3<String, Integer, Long>(x.f0, x.f1 + y.f1, y.f2))
In other words, I am trying to aggregate the numbers within each window (for a little more context, the values of the field f1 that I am summing are all 1s).
Hence I have the following questions:
The window is just 4 seconds wide, and every fifth entry is an older timestamp, so I am expecting the next window to have a smaller count. Is my understanding wrong here?
If my understanding is right, I do not see any aggregation when running both programs in parallel. Is there something wrong with my code?
Another thing that is bothering me: on what fields or parameters do a window's start time and end time really depend? Is it the timestamp extracted from the events, or is it processing time?
You have to configure the allowed lateness: https://nightlies.apache.org/flink/flink-docs-release-1.2/dev/windows.html#allowed-lateness. If it is not configured, Flink will drop the late messages, so the next window will contain fewer elements than the previous one.
The window is assigned by the following calculation:
return timestamp - (timestamp - offset + windowSize) % windowSize
In your case, offset is 0 (the default). For an event time window, the timestamp is the event time; for a processing time window, it is the processing time at the Flink operator. E.g. if windowSize=3 and timestamp=122, the element is assigned to the window [120, 123).
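As a quick sanity check of that arithmetic, here is a standalone Java snippet (not from the original answer) that mirrors Flink's window-start computation:

// Reproduces the window-start formula above (cf. TimeWindow.getWindowStartWithOffset in Flink).
public class WindowStartDemo {

    static long windowStart(long timestamp, long offset, long windowSize) {
        return timestamp - (timestamp - offset + windowSize) % windowSize;
    }

    public static void main(String[] args) {
        long start = windowStart(122, 0, 3);
        // Prints "[120, 123)": the element with timestamp 122 lands in that window.
        System.out.println("[" + start + ", " + (start + 3) + ")");
    }
}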

What is a watermark in Flink with respect to event time processing? Why is it needed?

What is a watermark in Flink with respect to event time processing, and why is it needed?
Why is it needed in all cases where event time is used? By all cases I mean: if I don't do a window operation, then why do we still need a watermark?
I come from a Spark background. In Spark we need watermarks only when we use windows on the incoming events.
I have read a few articles, and it seems to me that watermarks and windows are the same. If there are differences, please explain and point them out.
After your reply I did some more reading. Below is a more specific query.
Main question: why do we need the out-of-orderness bound when we have allowed lateness?
Given the example below:
Assume you have a BoundedOutOfOrdernessTimestampExtractor with a 2 minute bound and a 10 minute tumbling window that starts at 12:00 and ends at 12:10:
12:01, A
12:04, B
WM, 12:02 // 12:04 - 2 minutes
12:02, C
12:08, D
12:14, E
WM, 12:12
12:16, F
WM, 12:14 // 12:16 - 2 minutes
12:09, G
In the above example the record [12:02, C] is not dropped but is included in the window 12:00-12:10 and later evaluated. Hence the watermark could just as well be the event timestamp.
The record [12:09, G] is included in the window 12:00-12:10 only when an allowed lateness of 5 minutes is configured. This takes care of late and out-of-order events.
So, adding to my previous question: why does the out-of-orderness bound of the BoundedOutOfOrdernessTimestampExtractor need to be some value other than 0, instead of just using the event timestamp itself?
What can the out-of-orderness bound achieve that allowed lateness cannot, and in what scenario?
Watermarks and windows are closely related but they are very different concepts.
Watermarks are needed for any kind of event-time-based aggregation to cut off late events. Windows can only close when they receive an appropriate watermark, and that is when the results of aggregations are published.
If you have no out of order events, you can set watermarks to be equivalent to the timestamps of input events. But that's usually a luxury.
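To make that concrete, here is a small sketch with the newer WatermarkStrategy API (MyEvent and getTimestampMs() are hypothetical placeholders, not types from the question):

import java.time.Duration;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;

// Perfectly ordered events: the watermark can simply follow the maximum timestamp seen.
WatermarkStrategy<MyEvent> inOrder = WatermarkStrategy
        .<MyEvent>forMonotonousTimestamps()
        .withTimestampAssigner((event, previous) -> event.getTimestampMs());

// Out-of-order events: trail the maximum timestamp by a bound (here 2 minutes),
// mirroring the BoundedOutOfOrdernessTimestampExtractor from the question.
WatermarkStrategy<MyEvent> outOfOrder = WatermarkStrategy
        .<MyEvent>forBoundedOutOfOrderness(Duration.ofMinutes(2))
        .withTimestampAssigner((event, previous) -> event.getTimestampMs());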
Edit to address questions in the comments.
Is it a rule of thumb to keep the watermark duration equal to the window duration, because only by doing so is the result calculated and emitted?
No, the durations are independent, but they add up to the total lag on a given event.
Your watermark duration depends on your data and on how much lag you can accept for your application. Let's say most events are in order, 10% arrive up to 1 s late, an additional 5% up to 10 s, and 1% up to 1 h.
If you set the watermark duration to 0, then 16% of your data points are discarded, but Flink adds no additional lag. If your watermark trails 1 s behind your events, you lose 6% of your data, but the results have 1 s more lag. If you want to retain all data, Flink will need to wait 1 h on each aggregation until it can be sure that no data is missing.
But then what is the role of a trigger? And how do sliding windows coordinate with watermarks and triggers? Can you please explain how they work with each other?
Let's say, you have a window of 1 min and a watermark delay of 5 s. A window will only trigger when it is sure that all relevant data has been seen. In this case, it needs to wait 1 min 5 s to trigger, such that the last event of the window has surely arrived.
By the way, events arriving later than the watermark are discarded by default. You can change that behavior.
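For example, here is a hedged sketch of keeping late events instead of dropping them; MyEvent, getKey() and CountAggregate are hypothetical placeholders rather than code from this thread:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// 'events' is assumed to be a DataStream<MyEvent> with timestamps/watermarks assigned.
OutputTag<MyEvent> lateTag = new OutputTag<MyEvent>("late-events") {};

SingleOutputStreamOperator<Long> counts = events
        .keyBy(MyEvent::getKey)
        .window(TumblingEventTimeWindows.of(Time.minutes(10)))
        .allowedLateness(Time.minutes(5))   // keep the window state around for 5 more minutes
        .sideOutputLateData(lateTag)        // anything later than that goes to a side output
        .aggregate(new CountAggregate());   // CountAggregate: a hypothetical AggregateFunction<MyEvent, Long, Long>

// Late events are not silently dropped; they can be inspected or reprocessed here.
DataStream<MyEvent> lateEvents = counts.getSideOutput(lateTag);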

Flink large size / small advance sliding window performance

My use case
the input is raw events keyed by an ID
I'd like to count the total number of events over the past 7 days for each ID.
the output would advance every 10 minutes
Logically, this would be handled by a sliding window of size 7 days with a 10-minute slide.
This post laid out a good optimization: pre-aggregate with a tumbling window of 1 day. So my logic would look like:
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

val oneDayCounts = joins
  .keyBy(keyFunction)
  .map(t => (t.key, 1L, t.timestampMs))
  .keyBy(0)
  .timeWindow(Time.days(1))
  .sum(1) // daily pre-aggregation per key (aggregation added to complete the pipeline)

val sevenDayCounts = oneDayCounts
  .keyBy(0)
  .timeWindow(Time.days(7), Time.minutes(10))
  .sum(1)

// single reducer
sevenDayCounts
  .windowAll(TumblingProcessingTimeWindows.of(Time.minutes(10)))
  .sum(1) // aggregation added to complete the pipeline
P.S. forget about the performance concern of the single reducer.
Question
If I understand correctly, however, this would mean a single event produces 7*24*6 = 1008 records due to the nature of the sliding window. So my question is: how can I reduce this sheer volume?
There's a JIRA ticket -- FLINK-11276 -- and a Google doc on the topic of doing this more efficiently.
I also recommend you take a look at this paper and talk on Efficient Window Aggregation with Stream Slicing.
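To make the pre-aggregation idea concrete, here is a hedged Java sketch of the two-stage approach (the Event type with getId() is a hypothetical placeholder; 'events' is assumed to be a DataStream<Event> with timestamps and watermarks assigned upstream):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

// Stage 1: collapse the raw events to one count per key per day.
DataStream<Tuple2<String, Long>> dailyCounts = events
        .map(e -> Tuple2.of(e.getId(), 1L))
        .returns(Types.TUPLE(Types.STRING, Types.LONG))
        .keyBy(t -> t.f0)
        .window(TumblingEventTimeWindows.of(Time.days(1)))
        .sum(1);

// Stage 2: slide over the much smaller stream of daily counts.
// Each record still falls into 7 * 24 * 6 = 1008 overlapping windows, but now that
// is one record per key per day rather than one per raw event, which is the saving.
DataStream<Tuple2<String, Long>> sevenDayCounts = dailyCounts
        .keyBy(t -> t.f0)
        .window(SlidingEventTimeWindows.of(Time.days(7), Time.minutes(10)))
        .sum(1);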

Designing an access/web statistics counter module for appengine

I need an access statistics module for App Engine that tracks a few request handlers and collects statistics to bigtable. I have not found any ready-made solution on GitHub, and Google's examples are either oversimplified (memcached front-page counter with cron) or overkill (accurate sharded counter). But most importantly, no App Engine counter solution discussed elsewhere includes a time component (hourly, daily counts) needed for statistics.
Requirements: The system does not need to be 100% accurate and could just ignore memcache loss (if infrequent). This should simplify things considerably. The idea is to just use memcache and accumulate stats in time intervals.
UseCase: Users on your system create content (e.g. pages). You want to track approximately how often a user's pages are viewed per hour or day. Some pages are viewed often, some never. You want to query by user and timeframe. Subpages may have fixed IDs (query for the user with the most hits on their homepage). You may want to delete old entries (query for entries of year=xxxx).
class StatisticsDB(ndb.Model):
    # key.id() = something like YYYY-MM-DD-HH_groupId_countableId ... contains the date
    # timeframeId = ndb.StringProperty()  # YYYY-MM-DD-HH, needed for cleanup if the counter uses ancestors
    countableId = ndb.StringProperty(required=True)  # name of counter within group
    groupId = ndb.StringProperty()  # counter group (allows single DB query with timeframe prefix inequality)
    count = ndb.IntegerProperty()  # count per specified timeframe

    @classmethod
    def increment(cls, groupId, countableId):
        # increment memcache
        # save hourly to DB (see below)
        pass
Note: indexes on groupId and countableId are necessary to avoid 2 inequalities in queries (query all countables of a groupId/userId, and the chart/high-count query: the countableId with the highest counts yields the groupId/user). Using ancestors in the DB may not support the chart queries.
The problem is how to best save the memcached counter to DB:
cron: This approach is mentioned in the example docs (example front-page counter), but uses fixed counter ids that are hardcoded in the cron handler. As there is no prefix query for existing memcache keys, determining which counter ids were created in memcache during the last time interval and need to be saved is probably the bottleneck.
task queue: if a counter is created, schedule a task to collect it and write it to the DB. COST: 1 task-queue entry per used counter and one ndb.put per time granularity (e.g. 1 hour) when the queue handler saves the data. This seems the most promising approach for also capturing infrequent events accurately.
infrequently when increment(id) executes: if a new timeframe starts, save the previous one. This needs at least 2 memcache accesses (get date, incr counter) per increment: one for tracking the timeframe and one for the counter. Disadvantage: bursty counters with longer stale periods may lose the cache.
infrequently when increment(id) executes, probabilistic: if random % 100 == 0 then save to DB; however, the counter should have uniformly distributed counting events.
infrequently when increment(id) executes: if the counter reaches e.g. 100, then save to DB.
Did anyone solve this problem? What would be a good way to design this?
What are the weaknesses and strengths of each approach?
Are there alternate approaches that are missing here?
Assumptions: counting can be slightly inaccurate (cache loss), the counterID space is large, and counterIDs are incremented sparsely (some once per day, some often per day).
Update: 1) I think cron can be used similarly to the task queue. One only has to create the DB model of the counter with memcached=True and run a cron query for all counters marked that way. COST: 1 put at the 1st increment, a query at cron time, and 1 put to update the counter. Without thinking it through fully, this appears slightly more costly/complex than the task approach.
Discussed elsewhere:
High concurrency non-sharded counters - no count per timeframe
Open Source GAE Fast Counters - no count per timeframe, nice performance comparison to sharded solution, expected losses due to memcache loss reported
Yep, your #2 idea seems to best address your requirements.
To implement it you need a task execution with a specified delay.
I used the deferred library for this purpose, via deferred.defer()'s countdown argument. I learned in the meantime that the standard queue library has similar support, by specifying the countdown argument for a Task constructor (I have yet to use this approach, though).
So whenever you create a memcache counter also enqueue a delayed execution task (passing in its payload the counter's memcache key) which will:
get the memcache counter value using the key from the task payload
add the value to the corresponding db counter
delete the memcache counter when the db update is successful
You'll probably lose the increments from concurrent requests between the moment the memcache counter is read in the task execution and the memcache counter being deleted. You could reduce such loss by deleting the memcache counter immediately after reading it, but you'd risk losing the entire count if the DB update fails for whatever reason - re-trying the task would no longer find the memcache counter. If neither of these is satisfactory you could further refine the solution:
The delayed task:
reads the memcache counter value
enqueues another (transactional) task (with no delay) for adding the value to the db counter
deletes the memcache counter
The non-delayed task is now idempotent and can be safely re-tried until successful.
The risk of loss of increments from concurrent requests still exists, but I guess it's smaller.
Update:
Update: the Task Queues are preferable to the deferred library; the deferred functionality is available using the optional countdown or eta arguments to taskqueue.add():
countdown -- Time in seconds into the future that this task should run or be leased. Defaults to zero. Do not specify this argument if you specified an eta.
eta -- A datetime.datetime that specifies the absolute earliest time at which the task should run. You cannot specify this argument if the countdown argument is specified. This argument can be time zone-aware or time zone-naive, or set to a time in the past. If the argument is set to None, the default value is now. For pull tasks, no worker can lease the task before the time indicated by the eta argument.
Counting things in a distributed system is a hard problem. There's some good info on the problem from the early days of App Engine. I'd start with Sharding Counters, which, despite being written in 2008, is still relevant.
Here is the code for the implementation of the task-queue approach with an hourly timeframe. Interestingly, it works without transactions or other mutex magic.
Supporting priorities for memcache would increase the accuracy of this solution.
import datetime
import logging

from google.appengine.api import memcache, taskqueue
from google.appengine.ext import ndb

TASK_URL = '/h/statistics/collect/'  # Example: '/h/statistics/collect/{counter-id}"?groupId=" + groupId + "&countableId=" + countableId'
MEMCACHE_PREFIX = "StatisticsDB_"

class StatisticsDB(ndb.Model):
    """
    Memcached counting saved each hour to DB.
    """
    # key.id() = 2016-01-31-17_groupId_countableId
    countableId = ndb.StringProperty(required=True)  # unique name of counter within group
    groupId = ndb.StringProperty()  # counter group (allows single DB query for group of counters)
    count = ndb.IntegerProperty(default=0)  # count per timeframe

    @classmethod
    def increment(cls, groupId, countableId):  # throws InvalidTaskNameError
        """
        Increment a counter. countableId is the unique id of the countable.
        Throws InvalidTaskNameError if ids do not match [a-zA-Z0-9-_]{1,500}.
        """
        # Calculate the memcache key and db key at this time.
        # The counting timeframe is 1h, determined by %H; MUST MATCH the ETA calculation in _add_task().
        counter_key = datetime.datetime.utcnow().strftime("%Y-%m-%d-%H") + "_" + groupId + "_" + countableId
        client = memcache.Client()
        n = client.incr(MEMCACHE_PREFIX + counter_key)
        if n is None:
            cls._add_task(counter_key, groupId, countableId)
            client.incr(MEMCACHE_PREFIX + counter_key, initial_value=0)

    @classmethod
    def _add_task(cls, counter_key, groupId, countableId):
        taskurl = TASK_URL + counter_key + "?groupId=" + groupId + "&countableId=" + countableId
        now = datetime.datetime.now()
        # The counting timeframe is 1h, determined by counter_key; MUST MATCH the ETA calculation.
        eta = now + datetime.timedelta(minutes=(61 - now.minute))  # at most 1h later, randomized over 1 minute, throttled by queue parameters
        task = taskqueue.Task(url=taskurl, method='GET', name=MEMCACHE_PREFIX + counter_key, eta=eta)
        queue = taskqueue.Queue(name='StatisticsDB')
        try:
            queue.add(task)
        except taskqueue.TaskAlreadyExistsError:  # may also occur if 2 increments are done simultaneously
            logging.warning("StatisticsDB TaskAlreadyExistsError lost memcache for %s", counter_key)
        except taskqueue.TombstonedTaskError:  # task name is locked for a while
            logging.warning("StatisticsDB TombstonedTaskError someone ran this task prematurely/manually %s", counter_key)

    @classmethod
    def save2db_task_handler(cls, counter_key, countableId, groupId):
        """
        Save a counter from memcache to the DB. Idempotent method.
        At the time this executes no more increments to this counter occur.
        """
        dbkey = ndb.Key(StatisticsDB, counter_key)
        n = memcache.get(MEMCACHE_PREFIX + counter_key)
        if n is None:
            logging.warning("StatisticsDB lost count for %s", counter_key)
            return
        stats = StatisticsDB(key=dbkey, count=n, countableId=countableId, groupId=groupId)
        stats.put()
        memcache.delete(MEMCACHE_PREFIX + counter_key)  # delete only if the put succeeded
        logging.info("StatisticsDB saved %s n = %i", counter_key, n)

Drools Timer based rule fires multiple times after restart

I have a scenario where I want to use rules purely as a scheduled job for invoking other services. I am using a solution similar to Answer 2 on this. So I have rule 1 which looks like:
rule "ServiceCheck"
timer ( int: 3m 5m )
no-loop true
when
then
boolean isServiceEnabled = DummyServices.getServiceEnabledProperty();
if(isServiceEnabled){
ServicesCheck servicesCheck = new ServicesCheck();
servicesCheck.setServiceEnabled(true);
insert(servicesCheck);
}
end
This inserts a servicesCheck object every 5 minutes if services are enabled. Once this object is inserted my other rules fire and retract the servicesCheck fact from there.
The problem I am facing occurs when I switch off the app and start it the next day. At that point the ServiceCheck rule fires a load of times before coming to a stop. My assumption is that the last fired time is saved in the session; on restart, the engine sees the difference between the current time and the saved time and fires the rule as many times as needed until the two catch up. So effectively, to make up for a 1 hr gap between shutdown and restart, it will fire the rule 12 times, since the interval is 5 minutes. Is there a way to update the last-fired time in the rule session so that it starts fresh, without catching up for lost time?
I suppose you are persisting the entire session, and I suppose you have a shutdown procedure. You can use a single fact, let's call it Trigger. Modify your rule to:
rule "ServiceCheck"
timer ( int: 3m 5m )
when
Trigger()
then
// ... same
end
You'll have to insert one Trigger fact after startup and retract it during shutdown.
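A minimal Java sketch of that startup/shutdown handshake against the Drools 5.x API used in the question (Trigger is an empty marker class you declare yourself and import in the DRL; in Drools 6+ the session type is KieSession and retract() becomes delete()):

import org.drools.KnowledgeBase;
import org.drools.runtime.StatefulKnowledgeSession;
import org.drools.runtime.rule.FactHandle;

public class ServiceCheckRunner {

    private StatefulKnowledgeSession ksession;
    private FactHandle triggerHandle;

    public void start(KnowledgeBase kbase) {
        ksession = kbase.newStatefulKnowledgeSession();
        // Arm the timer rule: "ServiceCheck" only matches while a Trigger fact is present.
        triggerHandle = ksession.insert(new Trigger());
        new Thread(new Runnable() {
            public void run() {
                ksession.fireUntilHalt();
            }
        }).start();
    }

    public void stop() {
        // Disarm the timer rule before halting, so a restored session does not
        // try to catch up on firings missed while the application was down.
        ksession.retract(triggerHandle);
        ksession.halt();
        ksession.dispose(); // or persist the session here, per your shutdown procedure
    }
}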
Later
I've set up an experiment (using 5.5.0) where a session is running, being called with fireUntilHalt in one thread, with a rule like "ServiceCheck". Another thread sleeps for some time, then halts the session after retracting the Trigger fact. After more than double the timer's firing interval, the second thread inserts the Trigger again, signals the first thread to re-enter fireUntilHalt(), and then repeats its cycle. I can observe silence during the period where the Trigger is retracted.
If, however, the Trigger is not retracted/re-inserted, there'll be a burst of firings after the session has been restarted.
This indicates that retracting and re-inserting a Trigger does indeed stop and restart a timer rule.
