Flink: Evaluate window for each incoming element of stream - apache-flink

I have a stream of Booking elements of the following form:
Booking(id=B1, driverId=D1, time=t1, location=l1)
Booking(id=B2, driverId=D2, time=t2, location=l2)
I need to find, per location, count of bookings made in last 15mins. But the window should be evaluated for any new booking coming for a location.
Roughly like:
Assuming `time` field is set as timestamp of record
bookingStream.keyBy(b=>b.location).window(Any window of 15mins).trigger(triggerFunction)
Except that the trigger function should not be evaluated at the end of 15mins but instead whenever any booking arrives at a location, and emit the count of booking in last 15min from timestamp of newly arrived booking.
Approach:
Use RichMap function, maintain a priority queue of location bookings as a managed state(ValueState) with timestamp as priority of bookings. For each element that arrives, first add it to state and remove elements earlier than 15mins from currently arrived elements. Emit the count of remaining elements in priority queue to collector.
Is this the right way or it could be achieved by using some other flink construct in a better way.

If you are running on the heap-based state backend, what you propose should behave reasonably well. But with RocksDB you will have to go through serialization/deserialization of the priority queue for every access, which may be rather painful.
An approach that might perform better on RocksDB would be to keep the current count along with the earliest timestamp in ValueState, and the set of bookings in ListState. The RocksDB state backend can append to ListState without going through ser/de, so you would only have to deserialize and reserialize the whole list when the earliest element is too old.

Related

How to Implement Patterns to Match Brute Force Login and Port Scanning Attacks using Flink CEP

I have a use case where a large no of logs will be consumed to the apache flink CEP. My use case is to find the brute force attack and port scanning attack. The challenge here is that while in ordinary CEP we compare the value against a constant like "event" = login. In this case the Criteria is different as in the case of brute force attack we have the criteria as follows.
username is constant and event="login failure" (Delimiter the event happens 5 times within 5 minutes).
It means the logs with the login failure event is received for the same username 5 times within 5 minutes
And for port Scanning we have the following criteira.
ip address is constant and dest port is variable (Delimiter is the event happens 10 times within 1 minute). It means the logs with constant ip address is received for the 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one ip address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is the roughly same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's github account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table, and only generate alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
.<Event>begin("distinctPorts")
.where(iterative condition 1)
.oneOrMore()
.followedBy("end")
.where(iterative condition 2)
.within(1 minute)
The first iterative condition returns true if the event being added to the pattern has a distinct port from all of the previously matching events. Somewhat similar to the example here, in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.

Flink CEP cannot get correct results on a unioned table

I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber,rowtime,eventtype FROM password_change WHERE channel='ONL')
UNION ALL
(SELECT accountnumber,rowtime, eventtype FROM transfer WHERE channel = 'ONL' )
rowtime column is the event time extracted directly from original eventtime col with watermark periodic bound 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
PARTITION BY accountnumber
ORDER BY rowtime
MEASURES
transfer.eventtype AS event_type,
transfer.rowtime AS transfer_time
ONE ROW PER MATCH
AFTER MATCH SKIP PAST LAST ROW
PATTERN (transfer password_change ) WITHIN INTERVAL '5' SECOND
DEFINE
password_change AS eventtype='password_change',
transfer AS eventtype='transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I got nothing when running Flink 1.11.1 (also no output for 1.10.1).
What's more, I change the pattern to only password_change, it still output nothing, but if I change the pattern to transfer then it outputs several rows but not all transfer rows. If I exchange the eventtime of two tables which means let password_changes happen first, then the pattern password_change will output several rows while transfer not.
On the other hand, if I extract those columns from two tables and merge them in one table manually, then emit them into Flink, the running result is correct.
I searched and tried a lot to get it right including changing the SQL statement, watermark, buffer timeout and so on, but nothing helped. Hope anyone here can help. Thanks.
10/10/2020 update:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka=new Kafka()
.version("universal")
.property("bootstrap.servers", "localhost:9092");
tEnv.connect(
kafka.topic("transfer")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
.field("transid",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("value",DataTypes.DECIMAL(38,18))
).createTemporaryTable("transfer");
tEnv.connect(
kafka.topic("pchange")
).withFormat(
new Json()
.failOnMissingField(true)
).withSchema(
new Schema()
.field("rowtime",DataTypes.TIMESTAMP(3))
.rowtime(new Rowtime()
.timestampsFromField("eventtime")
.watermarksPeriodicBounded(1000)
)
.field("channel",DataTypes.STRING())
.field("accountnumber",DataTypes.STRING())
.field("eventtype",DataTypes.STRING())
).createTemporaryTable("password_change");
Thank #Dawid Wysakowicz's answer. To confirm that, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of transfer table, then the output becomes right, which means it is really some problem about watermarks.
So now the question is how to fix it. Since a user will not change his/her password frequently, the time gap between these two table is unavoidable. I just need the UNION ALL table has the same behavior as that I merged manually.
Update Nov. 4th 2020:
WatermarkStrategy with idle sources may help.
Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables including how you define the time attributes and what are the connectors? It could let me confirm my suspicions.
I think the problem is that one of the sources stops emitting Watermarks. If the transfer table (or the table with lower timestamps) does not finish and produces no records it emits no Watermarks. After emitting the fourth row it will emit Watermark = 3 (4-1 second). The Watermark of a union of inputs is the smallest of values of the two. Therefore the first table will pause/hold the Watermark with value Watermark = 3 and thus you see no progress for the original query and you see some records emitted for the table with smaller timestamps.
If you manually join the two tables, you have just a single input with a single source of Watermarks and thus it progresses further and you see some results.

How to remove redundant data with skip and limit in mongodb

When we fetch data from db with skip and limit then it's very high chance that data may be redundant.
Let me explain you with an example
suppose you are fetching those student records which belongs to some state x and you already fetched 10 student records. Between the time of first and second request one more student record is inserted, deleted or updated then in next query either one data row will come again or inserted data row will be skiped.
How to solve such case?
Method - 1
It can be resolved by using sending 'Created_by' and 'Updated_by' time from UI and filter the data according to it and send the data.
Method - 2
In the second fetch request and there after skip one less then you want and increase limit by 1(Yes, its correct, think about it), and pass those data to UI now UI check the there current list data's last item should match the first item of fetched request's response(just compare by ids), then it mean that no single data added or removed after first query. If it's not match then fetch complete data from first to last in a single query.
If you use individual method then its not rock solid(Yes, some corner case will miss in each case) but if you combine both methods then it will be rock solid.

How to get fullDocument from MongoDB changeStream when a document is deleted?

My Code
I have a MongoDB with two collections, Items and Calculations.
Items
value: Number
date: Date
Calculations
calculation: Number
start_date: Date
end_date: Date
A Calculation is a stored calcluation based off of Item values for all Items in the DB which have dates in between the Calculation's start date and end date.
Mongo Change Streams
I figure a good way to create / update Calculations is to create a Mongo Change Stream on the Items collection which listens for changes to the Items collection to then recalculate relevant Calculations.
The issue is that according to the Mongo Change Event docs, when a document is deleted, the fullDocument field is omitted which would prevent me from accessing the deleted Item's date which would inform which Calculations should be updated.
Question
Is there any way to access the fullDocument of a Mongo Change Event that was fired due to a document deletion?
No I don't believe there is a way. From https://docs.mongodb.com/manual/changeStreams/#event-notification:
Change streams only notify on data changes that have persisted to a majority of data-bearing members in the replica set.
When the document was deleted and the deletion was persisted across the majority of the nodes in a replica set, the document has ceased to exist in the replica set. Thus the changestream cannot return something that doesn't exist anymore.
The solution to your question would be transactions in MongoDB 4.0. That is, you can adjust the Calculations and delete the corresponding Items in a single atomic operation.
fullDocument is not returned when a document is deleted.
But there is a workaround.
Right before you delete the document, set a hint field. (Obviously use a name that does not collide with your current properties.)
await myCollection.updateOne({_id:theId}, {_deleting: true})
await myCollection.deleteOne({_id:theId})
This will trigger one final change event in the stream, with a hint that the document is getting deleted. Then in your stream watcher, you simple check for this value.
stream.on('change', event => {
if (!event.fullDocument) {
// The document was deleted
}
else if (event.fullDocument._deleting) {
// The document is going to be deleted
}
else {
// Document created or updated
}
})
My oplog was not getting updated fast enough, and the update was looking up a document that was already removed, so I needed to add a small delay to get this working.
myCollection.updateOne({_id:theId}, {_deleting: true})
setTimeout( ()=> {
myCollection.deleteOne({_id:theId})
}, 100)
If the document did not exist in the first place, it won't be updated or deleted, so nothing gets triggered.
Using TTL Indexes
Another way to make this work is to add a ttl index, and then set that field to the current time. This will trigger an update first, and then delete the document automatically.
myCollection.setIndex({_deleting:1}, {expireAfterSeconds:0})
myCollection.updateOne({_id:theId}, {$set:{_deleting:new Date()}})
The problem with this approach is that mongo prunes TTL documents only during certain intervals, 60s or more as stated in the docs, so I prefer to use the first approach.

How to send an Email notification by day end using Rules with all the nodes published that day?

I am trying to achieve email notification . The condition is , it should go by end of the day with the current day published content list.
For the same I have tried couple of things using Rules, but stuck in between.
Any help?
I tried using rules, and I created a rule like so:
Events:
After updating existing content of type(content type name)
Cron maintenance tasks are performed
Condition: Data to compare: [node:field-img-status], Data value: Approve
When I am trying to add second condition to check if the node is published within 24hrs, I am unable to achieve it. When I add strtotime("-1 day"), I get an error like:
Wrong date format. Specify the date in the format 2017-05-10 08:17:18.
I tried date('Y-m-d h:i:s',strtotime("-1 day")) but I did not succeed.
Now I am trying one more method to achieve it using Views Rules which is suggested in this answer to the question about 'How to create a Drupal rule to check (on cron) a date field and if passed set field "status" to "ended"?'.
Below is a blueprint of how I'd get this to work ...
Step 1: Create a single eMail for each node that was published
Create a view (using Views) of all the nodes that were published the last 24 hours. Make sure to include a column in that view for the various data you want to be included about each node in your eMail later on.
Use Rules to create a rule with a Rules Action that consists of a "Rules Loop", in which its "list items" are actually the list of nodes that you want to be included in your eMail later on. To create this Rules Loop, use the Views Rules combined with a Views display type of "Views Rules", for the view that you created. Refer to my answer to "How to pass arguments to a view from Rules?" for way more details on how to use the Views Rules module.
For each list item in the Rules Loop of the previous step, you have access to all data for each column in the View you created. By using these data you could add an additional Rules Action (within the same Rules Loop) to send an appropriate eMail about the node being processed.
Step 2: Group all eMails in a single eMail
Obviously, the previous step creates a single eMail for each node that was published in the last 24 hours. If you only have a few nodes that may not be a real issue to worry about. But if you have dozens (or more?) of such nodes then you might want to consider consolidating all such eMails in a single eMail, which contains (in its eMail body) the complete list of nodes.
A possible solution to implement such consolidation, is similar to what is shown in the Rules example included in my answer to "How to concatenate all token values of a list in a single field within a Rules loop?". In your case, you could make it work like so:
Add some new Rules variable that will be used later on as part of the eMail body, before the start of your loop. Say you name the variable nodes_list_var_for_email_body.
Within your loop, for each iteration, prepend or append the value for each "list item" to that variable nodes_list_var_for_email_body.
Move the Rules Action to send an eMail outside your loop, and after the loop completed. And finetune the details (configuration) of your (new) "send an eMail" Rules Action. When doing so, you'll be able to select the token for nodes_list_var_for_email_body to include anywhere in your eMail body.
Step 3: Schedule the daily execution of your rule
Use the Rules Once per Day to schedule the daily execution of your rule. Refer to my answer to "How to limit the execution of a rule for sending an email to only run once in a day?" for way more details about this module.
VoilĂ , that's it ...
This is how I would achieve this:
Make some view which would list all nodes created today.
Make some end-point (from my module, check out: https://api.drupal.org/api/drupal/modules%21system%21system.api.php/function/hook_menu/7.x)
It would call this view, and grab that node list (i.e. with views_get_view_result : https://api.drupal.org/api/views/views.module/function/views_get_view_result/7.x-3.x ), loop through the list, compose the email and send it.
Then I would set cron job to call that end-point at end of every day.

Resources