Apache Flink - Matching with fields having different values in successive patterns

Consider a use case where we need to find the pattern for an attack like 10 failed logons from the same device and the same username, followed by a successful logon from a different device but the same username. This should happen within 10 minutes.
Let us say we have 10 login-failure Windows events with user A as the username and B as the device name, and then a successful logon from user A with a different device C; in that case we should raise an alert. Please let me know how Flink CEP can be used to solve this case.

This is rather similar to Apache Flink - Matching Fields with the same value. In this case you might try MATCH_RECOGNIZE with something like this:
PARTITION BY user
...
PATTERN (F{10} S) WITHIN INTERVAL '10' MINUTE
DEFINE
  F AS F.status = 'failure' AND (LAST(F.device, 1) IS NULL OR F.device = LAST(F.device, 1)),
  S AS S.status = 'success' AND S.device <> LAST(F.device, 1)
The idea is to check that each new F is for the same device as the previous one, and S is for a different device.
BTW, in practice you might rather specify F{10,} so that the pattern matches 10 or more failed attempts in a row, rather than exactly 10.
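Fleshed out, the whole statement might look something like the sketch below, submitted via the Table API. The events table, its columns, the ts time attribute, and the added ORDER BY / MEASURES / AFTER MATCH clauses are assumptions filled in for illustration, not from the question:
// Hedged sketch: the complete MATCH_RECOGNIZE query. Table and column
// names are assumptions; `user` is quoted because it is a reserved word.
TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
tEnv.executeSql(
    "SELECT * FROM events MATCH_RECOGNIZE (" +
    "  PARTITION BY `user`" +
    "  ORDER BY ts" +
    "  MEASURES S.device AS successDevice" +
    "  AFTER MATCH SKIP PAST LAST ROW" +
    "  PATTERN (F{10} S) WITHIN INTERVAL '10' MINUTE" +
    "  DEFINE" +
    "    F AS F.status = 'failure'" +
    "      AND (LAST(F.device, 1) IS NULL OR F.device = LAST(F.device, 1))," +
    "    S AS S.status = 'success' AND S.device <> LAST(F.device, 1)" +
    ")");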

Related

How to Implement Patterns to Match Brute Force Login and Port Scanning Attacks using Flink CEP

I have a use case where a large number of logs will be consumed by Apache Flink CEP. My use case is to find brute-force attacks and port-scanning attacks. The challenge here is that in ordinary CEP we compare a value against a constant, like event = "login". In this case the criteria are different; for a brute-force attack the criteria are as follows.
username is constant and event = "login failure" (delimiter: the event happens 5 times within 5 minutes).
It means that logs with the login-failure event are received for the same username 5 times within 5 minutes.
And for port scanning we have the following criteria.
ip address is constant and dest port is variable (delimiter: the event happens 10 times within 1 minute). It means that logs with a constant ip address are received for 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one ip address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is roughly the same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
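As a minimal sketch (LogEvent, its getUsername() accessor, and loginFailurePattern are assumptions), keying the stream before applying a pattern looks like this:
// Hedged sketch: the pattern is then matched independently per username.
KeyedStream<LogEvent, String> byUser = logs.keyBy(LogEvent::getUsername);
PatternStream<LogEvent> matches = CEP.pattern(byUser, loginFailurePattern);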
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's GitHub account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
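For completeness, a hedged sketch of the analogous brute-force query, mirroring the structure above; the table and column names (events, username, status, ts) are assumptions, and tEnv is a Flink TableEnvironment:
// Hedged sketch: 5 or more login failures for the same username within
// 5 minutes. All names are assumptions for illustration.
tEnv.executeSql(
    "SELECT e1.username, COUNT(DISTINCT e2.ts) AS failures " +
    "FROM events e1, events e2 " +
    "WHERE e1.username = e2.username " +
    "  AND e1.status = 'login failure' AND e2.status = 'login failure' " +
    "  AND timestampDiff(MINUTE, e1.ts, e2.ts) < 5 " +
    "GROUP BY e1.username HAVING COUNT(DISTINCT e2.ts) >= 5");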
Note that when working with streaming SQL, you should give some thought to state retention.
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts for the same IP.
This could be handled by inserting matching IP addresses into an Alert table, and generating alerts only for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
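A minimal sketch of that suppression step, assuming an Alert type keyed by IP and a one-hour suppression window (both assumptions, not from the answer):
// Hedged sketch: emit the first alert per IP, then suppress duplicates
// until a processing-time timer clears the state one hour later.
public static class AlertDeduplicator extends KeyedProcessFunction<String, Alert, Alert> {
    private transient ValueState<Boolean> alerted;

    @Override
    public void open(Configuration parameters) {
        alerted = getRuntimeContext().getState(
            new ValueStateDescriptor<>("alerted", Boolean.class));
    }

    @Override
    public void processElement(Alert alert, Context ctx, Collector<Alert> out) throws Exception {
        if (alerted.value() == null) {
            out.collect(alert);            // first alert for this IP: let it through
            alerted.update(true);
            // re-enable alerting for this IP one hour from now (assumed window)
            ctx.timerService().registerProcessingTimeTimer(
                ctx.timerService().currentProcessingTime() + 3_600_000L);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) {
        alerted.clear();                   // timer fired: allow alerts for this IP again
    }
}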
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this (the two where conditions are sketched below):
Pattern<Event, ?> pattern = Pattern
    .<Event>begin("distinctPorts")
    .where(distinctPort)         // iterative condition 1
    .oneOrMore()
    .followedBy("end")
    .where(tenthDistinctPort)    // iterative condition 2
    .within(Time.minutes(1));
The first iterative condition returns true if the event being added to the pattern has a distinct port from all of the previously matching events. Somewhat similar to the example here, in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
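Here is a hedged sketch of the first condition; the Event type, its getPort() accessor, and the names distinctPort / tenthDistinctPort are assumptions, not Flink APIs. The second condition would be analogous, additionally checking that at least 9 events have already matched.
// Hedged sketch: accept an event only if its port differs from every event
// already matched for "distinctPorts". Event and getPort() are assumptions.
IterativeCondition<Event> distinctPort = new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event event, Context<Event> ctx) throws Exception {
        for (Event prior : ctx.getEventsForPattern("distinctPorts")) {
            if (prior.getPort() == event.getPort()) {
                return false;
            }
        }
        return true;
    }
};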
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.

Fail Flink Job if source/sink/operator has undefined uid or name

In my jobs I'd like every source/sink/operator to have its uid and name properties defined for easier identification.
operator.process(myFunction).uid(MY_FUNCTION).name(MY_FUNCTION);
Right now I need to manually review every job to detect missing settings. How can I tell Flink to fail the job if any name or uid is not defined?
Once you get a StreamExecutionEnvironment you can get the graph of the operators.
When you don't define a name, Flink autogenerates one for you. In addition, if you set a name, Flink adds a prefix (Source: or Sink:) to it, at least in the case of sources and sinks.
When you don't define a uid, the uid value in the graph at this stage is null.
Given your scenario, where the name and uid are always the same, you can check that all operators have been provided with a name and uid as follows:
getExecutionEnvironment().getStreamGraph().getStreamNodes().stream()
    .filter(streamNode -> streamNode.getTransformationUID() == null ||
        !streamNode.getOperatorName().contains(streamNode.getTransformationUID()))
    .forEach(System.out::println);
This snippet will print every operator that doesn't match your rules.
This won't work in 100% of cases, e.g. when a uid happens to be a substring of an unrelated name, but it gives you a general way to access the operator information, apply whatever filters fit your case, and implement your own strategy.
This snippet can be used as part of your CI, or directly in your application.
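If you want the job to fail outright rather than just print the offenders, one possible variant (the exception type is a choice, not a Flink convention) is:
// Hedged variant: abort submission when any operator is missing a uid or
// has a name that doesn't contain it. env is your StreamExecutionEnvironment.
boolean allValid = env.getStreamGraph().getStreamNodes().stream()
    .allMatch(node -> node.getTransformationUID() != null &&
        node.getOperatorName().contains(node.getTransformationUID()));
if (!allValid) {
    throw new IllegalStateException("Found an operator without a uid or matching name");
}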

Delphi - multiple users (sessions) after login (FireDAC)

I am working on Windows desktop application in Delphi using FireDAC driver and MSSQL database system. Currently, I am having a problem in understanding how multiple sessions (users) should work. Right now, I have three test users, and when I log in with any of them, every session has the same data and functionalities. I don't want that. I want that each user (each session) has different data and functionalities.
Note that this is different from distributed systems, where tasks are distributed by hosts in a network. I am not interested in distributed system. I have a desktop application.
Could someone explain how to achieve this (different users (sessions) = different data and functionalities)?
You've indicated in a comment that what you want is for a number of users to be able to see different data rows in the same table or tables.
That's actually quite straightforward: you just need to define, for each user (or user type), the criteria which determine which data rows they are supposed to be able to see, then write a Where clause which selects only those rows. It's generally a bad idea to hard-code users' identities in a database, along with what data they are permitted to see and what operations they are permitted to carry out on it.
It's hard to give a concrete example without getting into the details of what you want to do, but the following simple example might help.
Suppose you have a table of Customers, and one user is supposed to deal with the USA, the second with France, and the third with the rest of the world.
In your app, you could have an enumerated type to represent this:
type
TRegion = (rtUSA, rtFR, rtRoW); // RoW = Rest of the World
Then you could write a function to generate the Where clause of a SQL Select statement like this:
function GetRegionWhereClause(const ARegion : TRegion) : String;
begin
  Result := ' Where ';
  case ARegion of
    rtUSA : Result := Result + ' Customer.Country = ''USA''';
    rtFR  : Result := Result + ' Customer.Country = ''FR''';
    rtRoW : Result := Result + ' not Customer.Country in (''USA'', ''FR'')'
  end; { case }
end;
You could then call GetRegionWhereClause when you generate the SQL to open the Customers table.
Similarly, define for each user type which operations they are permitted to carry out on the data (Update, Insert, Delete). But implementing that would be more a question of selectively enabling and disabling the GUI functionality in your app for the tasks in question.

Flink - Grouping query to external system per operator instance while enriching an event

I am currently writing a streaming application where:
as input, I am receiving some alerts from a Kafka topic (1 alert is linked to 1 resource, for example 1 alert will be linked to my-router-1 or to my-switch-1 or to my-VM-1 or my-VM-2 or ...)
I then need to query an external system in order to enrich the alert with some additional information linked to the resource to which the alert is linked
When querying the external system:
I do not want to do 1 query per alert, nor even 1 query per resource
I rather want to do grouped queries (1 query for several alerts linked to several resources)
My idea was to have something like n buffers (n being a small number representing the number of queries that I will do in parallel), and then, for a given time period (let's say 100ms), put all alerts into one of those buffers; at the end of those 100ms, do my n queries in parallel (1 query being responsible for enriching several alerts belonging to several resources).
In Spark, it is something that I would do through a mapPartitions (if I have n partition, then I will do only n queries in parallel to my external system and each query will be for all the alerts received during the micro-batch for one partition).
Now, I am currently looking at Flink, and I haven't really found the best way of doing this kind of grouping when querying an external system.
When looking at this kind of use case and especially at asyncio (https://ci.apache.org/projects/flink/flink-docs-release-1.4/dev/stream/operators/asyncio.html), it seems that it deals with 1 query per key.
For example, I can very easily:
define the resource id as a key
define a processing time window of 100ms
and then do my query to the external system (synchronously or maybe better asynchrously through the asyncio feature)
But by doing so, I will do 1 query per resource (maybe for several alerts, but all linked to the same key, i.e. the same resource).
That is not what I want to do, as it will lead to too many queries to the external system.
I've then explored the option of defining a kind of technical key for my requests (something like the hashCode of my resource id % the number of queries I want to perform).
So, if I want to do at most 4 queries in parallel, then my key will be something like "resourceId.hashCode % 4".
I was thinking that this was ok, but when looking more closely at some metrics while running my job, I found that my queries were not well distributed over my 4 operator instances (only 2 of them were doing something).
It comes from the mechanism used to assign a key to a given operator instance:
public static int assignKeyToParallelOperator(Object key, int maxParallelism, int parallelism) {
    return computeOperatorIndexForKeyGroup(maxParallelism, parallelism, assignToKeyGroup(key, maxParallelism));
}
(In my case, with parallelism 4, maxParallelism 128, and key values in the range [0,4), 2 of my keys go to operator instance 3 and 2 go to operator instance 4, while operator instances 1 and 2 have nothing to do.)
I was thinking that key 0 would go to operator 0, key 1 to operator 1, key 2 to operator 2 and key 3 to operator 3, but that is not the case.
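A small check along these lines, using Flink's (public, runtime-internal) KeyGroupRangeAssignment utility, reproduces the skewed assignment:
// Hedged sketch: print which operator instance each technical key lands on,
// with parallelism 4 and maxParallelism 128 as in the setup above.
for (int key = 0; key < 4; key++) {
    int instance = KeyGroupRangeAssignment.assignKeyToParallelOperator(key, 128, 4);
    System.out.println("key " + key + " -> operator instance " + instance);
}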
So do you know what would be the best approach for doing this kind of grouping while querying an external system?
I.e., 1 query per operator instance for all the alerts "received" by this operator instance during the last 100ms.
You can put an aggregator function upstream of the async function, where that function (using a timed window) outputs a record with <resource id><list of alerts to query>. You'd key the stream by the <resource id> ahead of the aggregator, which should then get pipelined to the async function.
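A rough sketch of that topology; the Alert type, its getResourceId() accessor, the window size, and the EnrichAlertsAsync function are all assumptions for illustration:
// Hedged sketch: gather alerts per resource over 100 ms windows, emit one
// batch record per resource, then enrich each batch with one async call.
DataStream<List<Alert>> batches = alerts
    .keyBy(Alert::getResourceId)
    .window(TumblingProcessingTimeWindows.of(Time.milliseconds(100)))
    .process(new ProcessWindowFunction<Alert, List<Alert>, String, TimeWindow>() {
        @Override
        public void process(String resourceId, Context ctx,
                            Iterable<Alert> alertsForResource, Collector<List<Alert>> out) {
            List<Alert> batch = new ArrayList<>();
            alertsForResource.forEach(batch::add);
            out.collect(batch);            // one record per resource per window
        }
    });
DataStream<List<Alert>> enriched = AsyncDataStream.unorderedWait(
    batches, new EnrichAlertsAsync(), 1, TimeUnit.SECONDS);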

Selenium Webdriver - How to avoid data duplication

Suppose I have a normal "Add Users" module that I want to automate using Java scripting; how can I avoid data duplication, so as to avoid error messages such as "User already exists"?
There are numerous ways in which this can be automated. You are receiving 'User already exists' due to the fact that you're (probably) running your 'Add Users' test cases using static variables.
Note: For the following examples I will consider a basic registration flow/scenario of a new user: name, email, password being the required fields.
Note-002: My language of choice will be JavaScript. You should be able to reproduce the concept with Java with ease.
1.) Prepending/appending a unique identifier to the information you're submitting (e.g. Number(new Date()) returns the number of milliseconds that have elapsed since January 1, 1970, so it will effectively always be unique when running your test case)
var timestamp = Number(new Date());
var email = 'test.e2e' + timestamp + '@<yourMainDomainHere>'
Note: Usually, the name & password don't need to be unique, so you can actually use hardcoded values without any issues.
2.) The same thing can also be achieved using Math.random() (for JS), which returns a value between 0 and 1 (e.g. 0.8018194703223693), up to 18 digits long.
var almostUnique = Math.random();
// You can go ahead and keep only the decimals
almostUnique = almostUnique.toString().split('.')[1];
var email = 'test.e2e' + almostUnique + '@<yourMainDomainHere>'
!!! Warning: While Math.random() is not guaranteed to be unique, in hundreds of regression runs of 200 functional test cases I never saw a duplicate.
3.) (Not so elegant, and exponentially harder to implement) If you have access to your web app's backend API, and through it you can execute different actions in the DB, then you can actually write some scripts that will run after your registration test cases, like a cleanup suite.
These scripts will have to remove the previously added users from your database.
Hope this helps!
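Since the question mentions Java, here is a minimal Java sketch of the same timestamp idea; the element id and domain are placeholders, and driver is an assumed WebDriver instance:
// Hedged sketch: append the current time in millis so each run registers
// a distinct user. The element id and domain are placeholders.
long timestamp = System.currentTimeMillis();
String email = "test.e2e" + timestamp + "@example.com";
driver.findElement(By.id("email")).sendKeys(email);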
You don't mention what language you are coding in, but you can use whatever random function your language provides to generate random numbers and/or text for the user id. It won't guarantee that there will not be a duplicate, but the nature of testing is such that you should be able to handle both situations anyway. If this is not clear or I don't understand your question correctly, you'll need to provide a lot more information: what you've tried, what language you use, etc.
