Apply anomaly detection on Flink sliding windows - apache-flink

I am new to Flink, so I hope what I am saying makes sense. I would like to apply sliding windows to a DataStream, and then for each of those windows perform anomaly detection, using FlinkML or maybe FlinkCEP (in fact I want to use both). My question is: which function should I use after I have created the sliding windows?
So far I have been trying to achieve this with the apply method, but I am not sure it makes sense. To my understanding, when the apply function runs, all the elements within the window are available to it.

I don't think Flink ML currently provides an off-the-shelf anomaly detection algorithm that works on sliding windows the way you describe.
Suppose you have a function that takes a collection of elements as input and determines whether it contains an anomaly. You can then write a streaming job that applies this function to sliding windows of an input stream, like below:
DataStream<Integer> source = env.fromElements(1, 2, 3);

DataStream<String> alerts = source
    .keyBy(i -> i % 2)
    .window(SlidingEventTimeWindows.of(
        Time.of(1000, TimeUnit.MILLISECONDS),
        Time.of(100, TimeUnit.MILLISECONDS)))
    .process(new ProcessWindowFunction<Integer, String, Integer, TimeWindow>() {
        @Override
        public void process(
                Integer key,
                Context context,
                Iterable<Integer> elements,
                Collector<String> out) throws Exception {
            if (isAbnormal(elements)) {
                out.collect("Found a sequence of abnormal elements");
            }
        }
    });
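The isAbnormal check is whatever detection logic you want to plug in. A minimal sketch, assuming a hypothetical rule that flags a window whose element sum exceeds a fixed threshold:

private static boolean isAbnormal(Iterable<Integer> elements) {
    // Hypothetical rule: flag the window if the sum of its elements exceeds a fixed threshold.
    int sum = 0;
    for (Integer e : elements) {
        sum += e;
    }
    return sum > 100; // threshold chosen arbitrarily for illustration
}

Anything that can work on an Iterable of the window's elements (a FlinkML model scoring call, a statistical test, etc.) can be substituted here.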

Related

Instance of object related to flink Parallelism & Apply Method

First let me ask my question, and then please clarify my assumption about the apply method.
Question: My application creates roughly 1,500,000 records per one-minute interval, and the Flink job reads these records from a Kafka consumer with, let's say, 15+ different operators. Could this logic create latency, backpressure, etc.? (You may assume that the parallelism is 16.)
public class Sample {
    // op1 =
    kafkaSource
        .keyBy(something)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething())
        .name("Name")
        .addSink(kafkaSink);

    // op2 =
    kafkaSource
        .keyBy(something2)
        .timeWindow(Time.seconds(1)) // let's assume that this one is one second
        .apply(new ApplySomething2())
        .name("Name")
        .addSink(kafkaSink);

    // ...

    // op16 =
    kafkaSource
        .keyBy(something16)
        .timeWindow(Time.minutes(1))
        .apply(new ApplySomething16())
        .name("Name")
        .addSink(kafkaSink);
}
// ..
public class ApplySomething ... {
    private AnyObject object;
    private int threshold = 30, 40, 100 ...; // a different threshold per operator

    @Override
    public void open(Configuration parameters) throws Exception {
        object = new AnyObject();
    }

    @Override
    public void apply(Tuple tuple, TimeWindow window, Iterable<Record> input, Collector<Result> out) throws Exception {
        int counter = 0;
        for (Record each : input) {
            counter += each.getValue();
            if (counter > threshold) {
                out.collect(each.getResult());
                return;
            }
        }
    }
}
If yes, should I use flatMap with state (RocksDB) instead of timeWindow?
My prediction is "YES". Let me explain why I think so:
If the parallelism is 16, then there will be 16 instances of each individual ApplySomething1(), ApplySomething2() ... ApplySomething16(), and also sixteen AnyObject() instances per ApplySomething..() class.
While the application runs, if the number of keyBy(something) partitions is larger than 16 (assume my application sees 1,000,000 different values of something per day), then some ApplySomething..() instances will handle several different keys, so one apply() invocation has to wait for the others' for-loops to finish before it is processed. Would this create latency?
Flink's time windows are aligned to the epoch (e.g., if you have a bunch of hourly windows, they will all trigger on the hour). So if you do intend to have a bunch of different windows in your job like this, you should configure them to have distinct offsets, so they aren't all being triggered simultaneously. Doing that will spread out the load. That will look something like this:
.window(TumblingProcessingTimeWindows.of(Time.minutes(1), Time.seconds(15)))
(or use TumblingEventTimeWindows as the case may be). This will create minute-long windows that trigger at 15 seconds after each minute.
Whenever your use case permits, you should use incremental aggregation (via reduce or aggregate), rather than using a WindowFunction (or ProcessWindowFunction) that has to collect all of the events assigned to each window in a list before processing them as a sort of mini-batch.
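For instance, the threshold check from ApplySomething could be expressed incrementally with an AggregateFunction, so that Flink keeps only a running sum per key and window instead of buffering every Record. A sketch modeled on the question's types (the threshold of 100 is illustrative):

DataStream<Integer> windowSums = kafkaSource
    .keyBy(something)
    .timeWindow(Time.minutes(1))
    .aggregate(new AggregateFunction<Record, Integer, Integer>() {
        // Only the running sum is kept as window state, not the full list of Records.
        @Override
        public Integer createAccumulator() { return 0; }
        @Override
        public Integer add(Record value, Integer acc) { return acc + value.getValue(); }
        @Override
        public Integer getResult(Integer acc) { return acc; }
        @Override
        public Integer merge(Integer a, Integer b) { return a + b; }
    });

windowSums
    .filter(sum -> sum > 100) // emit only windows whose sum exceeds the threshold
    .print();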
A keyed time window will keep its state in RocksDB, assuming you have configured RocksDB as your state backend. You don't need to switch to using a RichFlatMap to have access to RocksDB. (Moreover, since a flatMap can't use timers, I assume you would really end up using a process function instead.)
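For reference, selecting RocksDB as the state backend is a small configuration step. A sketch using the EmbeddedRocksDBStateBackend class from Flink 1.13+ (older releases use RocksDBStateBackend instead, and the same thing can be done with state.backend: rocksdb in flink-conf.yaml):

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStateBackend(new EmbeddedRocksDBStateBackend());
env.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints"); // illustrative path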
While any of the parallel instances of the window operator is busy executing its window function (one of the ApplySomethings) you are correct in thinking that that task will not be doing anything else -- and thus it will (unless it completes very quickly) create temporary backpressure. You will want to increase the parallelism as needed so that the job can satisfy your requirements for throughput and latency.
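Parallelism can be raised for the whole job or per operator; the heavy window operators are the natural place to add slots. The numbers below are purely illustrative:

env.setParallelism(16); // job-wide default

kafkaSource
    .keyBy(something)
    .timeWindow(Time.minutes(1))
    .apply(new ApplySomething())
    .setParallelism(32) // give the expensive window operator more parallel instances
    .name("Name")
    .addSink(kafkaSink);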

Consume from two flink dataStream based on priority or round robin way

I have two Flink DataStreams, for example dataStream1 and dataStream2. I want to union both streams into one stream so that I can process them using the same process functions, as the DAG of both DataStreams is the same.
As of now, I need equal priority of consumption of messages for either stream.
The producer of dataStream2 produces 10 messages per minute, while the producer of dataStream1 produces 1000 messages per second. Also, the data types are the same for both DataStreams. DataStream2 is more of a high-priority queue that should be consumed as soon as possible. There is no relation between messages of dataStream1 and dataStream2.
Will dataStream1.union(dataStream2) produce a stream that has elements of both streams?
Probably the simplest solution to this problem, though not necessarily the most efficient one depending on the exact characteristics of your sources, is to connect the two streams. In this solution you can use a CoProcessFunction, which invokes a separate method for each of the connected streams.
In this solution, you could simply buffer the elements of one stream until they can be emitted (for example in a round-robin manner). But keep in mind that this may be quite inefficient if there is a very big difference between the rates at which the sources produce events.
It sounds like the two DataStreams have different types of elements, though you didn't specify that explicitly. If that's the case, then create an Either<stream1 type, stream2 type> via a MapFunction on each stream, then union() the two streams. You won't get exact intermingling of the two, as Flink will alternate consuming from each stream's network buffer.
If you really want nicely mixed streams, then (as others have noted) you'll need to buffer incoming elements via state, and also apply some heuristics to avoid over-buffering if for any reason (e.g. differing network latency, or more likely different performance between the two sources) you have very different data rates between the two streams.
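A minimal sketch of that Either-wrapping approach, with Foo and Bar standing in for the two streams' element types (both names are assumptions here):

// Wrap each stream's elements in an Either so the union has a single element type.
DataStream<Either<Foo, Bar>> left = dataStream1
    .map(f -> Either.<Foo, Bar>Left(f))
    .returns(new TypeHint<Either<Foo, Bar>>() {});

DataStream<Either<Foo, Bar>> right = dataStream2
    .map(b -> Either.<Foo, Bar>Right(b))
    .returns(new TypeHint<Either<Foo, Bar>>() {});

DataStream<Either<Foo, Bar>> merged = left.union(right);

Downstream operators then branch on isLeft()/isRight() to recover the original element.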
You may want to use a custom operator that implements the InputSelectable interface in order to reduce the amount of buffering needed. I've included an example below that implements interleaving without any buffering, but be sure to read the caveat in the docs which explains that
... the operator may receive some data that it does not currently want to process ...
In other words, this simple example can't be relied upon to really work as is.
public class Alternate {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);

        DataStream<Long> positive = env.generateSequence(1L, 100L);
        DataStream<Long> negative = env.generateSequence(-100L, -1L);

        AlternatingTwoInputStreamOperator op = new AlternatingTwoInputStreamOperator();

        positive
            .connect(negative)
            .transform("Hack that needs buffering", Types.LONG, op)
            .print();

        env.execute();
    }
}

class AlternatingTwoInputStreamOperator extends AbstractStreamOperator<Long>
        implements TwoInputStreamOperator<Long, Long, Long>, InputSelectable {

    private InputSelection nextSelection = InputSelection.FIRST;

    @Override
    public void processElement1(StreamRecord<Long> element) throws Exception {
        output.collect(element);
        nextSelection = InputSelection.SECOND;
    }

    @Override
    public void processElement2(StreamRecord<Long> element) throws Exception {
        output.collect(element);
        nextSelection = InputSelection.FIRST;
    }

    @Override
    public InputSelection nextSelection() {
        return this.nextSelection;
    }
}
Note also that InputSelectable was added in Flink 1.9.0.

Apache Flink - Access internal buffer of WindowedStream from another Stream's MapFunction

I have an Apache Flink based streaming application with the following setup:
Data Source: generates data every minute.
Windowed Stream using CountWindow with size=100, slide=1 (sliding count window).
ProcessWindowFunction to apply some computation ( say F(x) ) on the data in the Window.
Data sink to consume the output stream
This works fine. Now, I'd like to enable users to provide a function G(x), apply it to the current data in the window, and send the output to the user in real time.
I am not asking how to apply an arbitrary function G(x) - I am using dynamic scripting to do that. I am asking how to access the buffered data of the window from another stream's map function.
Some code to clarify:
DataStream<Foo> in = .... // source data produced every minute

in
    .keyBy(new MyKeySelector())
    .countWindow(100, 1)
    .process(new MyProcessFunction())
    .addSink(new MySinkFunction());

// The part above is working fine. Note that the windowed stream created by countWindow()
// above has to maintain an internal buffer. Now the new requirement:

DataStream<Function> userRequest = .... // request function from user

userRequest.map(new MapFunction<Function, FunctionResult>() {
    public FunctionResult map(Function Gx) throws Exception {
        Iterable<Foo> windowedDataFromAbove = // HOW TO GET THIS???
        FunctionResult result = Gx.apply(windowedDataFromAbove);
        return result;
    }
});
Connect the two streams, then use a CoProcessFunction. The method call that gets the stream of Functions can apply them to what's in the other method call's window.
If you want to broadcast Functions, then you'll either need to be using Flink 1.5 (which supports connecting keyed and broadcast streams), or use some helicopter stunts to create a single stream that can contain both Foo and Function types, with appropriate replication of Functions (and key generations) to simulate a broadcast.
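For the Flink 1.5+ route, a rough sketch of connecting the keyed Foo stream with a broadcast stream of Functions (the MapStateDescriptor, the String key type and the method bodies are assumptions, not code from the question):

MapStateDescriptor<String, Function> descriptor =
    new MapStateDescriptor<>("functions", String.class, Function.class);

BroadcastStream<Function> broadcastFunctions = userRequest.broadcast(descriptor);

in.keyBy(new MyKeySelector())
    .connect(broadcastFunctions)
    .process(new KeyedBroadcastProcessFunction<String, Foo, Function, FunctionResult>() {
        @Override
        public void processElement(Foo value, ReadOnlyContext ctx,
                                   Collector<FunctionResult> out) throws Exception {
            // Apply the most recently broadcast Function to incoming Foos,
            // or buffer Foos in keyed state until a Function arrives.
        }

        @Override
        public void processBroadcastElement(Function gx, Context ctx,
                                            Collector<FunctionResult> out) throws Exception {
            // Remember the latest user-supplied Function in broadcast state.
            ctx.getBroadcastState(descriptor).put("Gx", gx);
        }
    });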
Assuming Fx aggregates incoming Foos on the fly and Gx processes a window's worth of Foos, you should be able to achieve what you want as follows:
DataStream<Function> userRequest = .... // request function from user
Iterator<Function> iter = DataStreamUtils.collect(userRequest);
Function Gx = iter.next();

DataStream<Foo> in = .... // source data

in
    .keyBy(new MyKeySelector())
    .countWindow(100, 1)
    .fold(new Tuple2<Integer, List<Foo>>(0, new ArrayList<Foo>()), new MyFoldFunc(), new MyProcessorFunc(Gx))
    .addSink(new MySinkFunction());
The fold function (which operates on incoming data as soon as it arrives) can be defined like this:
private static class MyFoldFunc implements FoldFunction<Foo, Tuple2<Integer, List<Foo>>> {
    @Override
    public Tuple2<Integer, List<Foo>> fold(Tuple2<Integer, List<Foo>> acc, Foo f) {
        acc.f0 = acc.f0 + 1; // if Fx is a simple aggregation (count)
        acc.f1.add(f);
        return acc;
    }
}
The processor function can be something like this:
public class MyProcessorFunc
        extends ProcessWindowFunction<Tuple2<Integer, List<Foo>>, Tuple2<Integer, FunctionResult>, String, GlobalWindow> {

    private final Function Gx;

    public MyProcessorFunc(Function Gx) {
        super();
        this.Gx = Gx;
    }

    @Override
    public void process(String key, Context context,
                        Iterable<Tuple2<Integer, List<Foo>>> accIt,
                        Collector<Tuple2<Integer, FunctionResult>> out) {
        Tuple2<Integer, List<Foo>> acc = accIt.iterator().next();
        out.collect(new Tuple2<Integer, FunctionResult>(
            acc.f0,           // your Fx aggregation
            Gx.apply(acc.f1)  // your Gx results
        ));
    }
}
Please note that fold/reduce functions do not internally buffer elements by default. We use fold here to compute the on-the-fly metrics and also to build the list of window items.
If you are interested in applying Gx on tumbling windows (not sliding), you could use tumbling windows in your pipeline. To compute sliding counts too, you could have another branch of your pipeline that computes only the sliding counts (and does not apply Gx). This way, you do not have to keep a 100-element list for every window.
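A sketch of that two-branch layout, with hypothetical ApplyGx, CountAggregate and CountSink helpers standing in for real implementations:

KeyedStream<Foo, String> keyed = in.keyBy(new MyKeySelector());

// Branch 1: tumbling count windows of 100 elements; Gx sees each complete batch exactly once.
keyed.countWindow(100)
    .process(new ApplyGx(Gx))        // a ProcessWindowFunction<Foo, FunctionResult, String, GlobalWindow>
    .addSink(new MySinkFunction());

// Branch 2: sliding count windows that keep only a running count, never a 100-element list.
keyed.countWindow(100, 1)
    .aggregate(new CountAggregate()) // an AggregateFunction<Foo, Long, Long>
    .addSink(new CountSink());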
Note: you may need to add the following dependency to use DataStreamUtils:
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-contrib</artifactId>
    <version>0.10.2</version>
</dependency>

Designing a generic DB utility class

Time and again I find myself creating a database utility class which has multiple functions which all do almost the same thing but treat the result set slightly differently.
For example, consider a Java class which has many functions which all look like this:
public void doSomeDatabaseOperation() {
    Connection con = DriverManager.getConnection("jdbc:mydriver", "user", "pass");
    try {
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT whatever FROM table"); // query will be different each time
        while (rs.next()) {
            // handle result set - differently each time
        }
    } catch (Exception e) {
        // handle
    } finally {
        con.close();
    }
}
Now imagine a class with 20 of these functions.
As you can see, there is tons of boilerplate (opening a connection, the try-finally block), and the only things that change are the query and the way you handle the result set. This kind of code occurs in many languages (assuming you're not using an ORM).
How do you manage your DB utility classes so as to reduce code duplication? What does a typical DB utility class look like in your language/framework?
The way I did it in one of my projects was to follow what Spring does with JdbcTemplate and come up with a small query framework. Basically, create a common class which can take a select statement or PL/SQL call and bind parameters. If the query returns a result set, also pass a RowMapper. This RowMapper object will be called by the framework to convert each row into an object of any kind.
Example -
Query execute = new Query("{any select or pl/sql}",
    // Inputs and Outputs are for bind variables.
    new SQL.Inputs(Integer.class, ...),
    // Outputs is only meaningful for PL/SQL since the
    // ResultSetMetaData should be used to obtain queried columns.
    new SQL.Outputs(String.class));
If you want the RowMapper:
Query execute = new Query("{any select or pl/sql}",
    // Inputs and Outputs are for bind variables.
    new SQL.Inputs(Integer.class, ...),
    // Outputs is only meaningful for PL/SQL since the
    // ResultSetMetaData should be used to obtain queried columns.
    new SQL.Outputs(String.class),
    new RowMapper() {
        public Object mapRow(ResultSet rs, int rowNum) throws SQLException {
            Actor actor = new Actor();
            actor.setFirstName(rs.getString("first_name"));
            actor.setSurname(rs.getString("surname"));
            return actor;
        }
    });
Finally, a Row class is the output, which will contain a list of objects if you have passed the RowMapper:
for (Row r : execute.query(conn, id)) {
    // Handle the rows
}
You can go fancy and use generics so that type safety is guaranteed.
Sounds like you could make use of a Template Method pattern here. That would allow you to define the common steps (and default implementations of them, where applicable) that all subclasses will take to perform the action. Then subclasses need only override the steps which differ: SQL query, DB-field-to-object-field mapping, etc.
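A minimal Java sketch of that idea with plain JDBC (class and method names are illustrative):

import java.sql.*;
import java.util.ArrayList;
import java.util.List;

// The base class owns the boilerplate (connection handling, iteration, cleanup);
// each concrete query only supplies the SQL and the per-row mapping.
public abstract class AbstractQuery<T> {

    protected abstract String sql();                                // varies per query
    protected abstract T mapRow(ResultSet rs) throws SQLException;  // varies per query

    public List<T> run() throws SQLException {
        List<T> results = new ArrayList<>();
        // try-with-resources replaces the manual try/finally cleanup
        try (Connection con = DriverManager.getConnection("jdbc:mydriver", "user", "pass");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql())) {
            while (rs.next()) {
                results.add(mapRow(rs));
            }
        }
        return results;
    }
}

A concrete query then only overrides sql() and mapRow(), and the 20 near-identical methods collapse into 20 small subclasses (or into calls that pass the two varying pieces as parameters).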
When using .NET, the Data Access Application Block is in fairly widespread use to provide support for the following:
The [data access] application block was designed to achieve the following goals:
- Encapsulate the logic used to perform the most common data access tasks.
- Eliminate common coding errors, such as failing to close connections.
- Relieve developers of the need to write duplicated code for common data access tasks.
- Reduce the need for custom code.
- Incorporate best practices for data access, as described in the .NET Data Access Architecture Guide.
- Ensure that, as far as possible, the application block functions work with different types of databases.
- Ensure that applications written for one type of database are, in terms of data access, the same as applications written for another type of database.
There are plenty of examples and tutorials on its usage too: a Google search will find msdn.microsoft, 4guysfromrolla.com, codersource.com and others.

typesafe NotifyPropertyChanged using linq expressions

From Build your own MVVM I have the following code that lets us have typesafe NotifyOfPropertyChange calls:
public void NotifyOfPropertyChange<TProperty>(Expression<Func<TProperty>> property)
{
    var lambda = (LambdaExpression)property;
    MemberExpression memberExpression;
    if (lambda.Body is UnaryExpression)
    {
        var unaryExpression = (UnaryExpression)lambda.Body;
        memberExpression = (MemberExpression)unaryExpression.Operand;
    }
    else
        memberExpression = (MemberExpression)lambda.Body;
    NotifyOfPropertyChange(memberExpression.Member.Name);
}
How does this approach compare to the standard simple-strings approach performance-wise? Sometimes I have properties that change at a very high frequency. Am I safe to use this typesafe approach? After some initial tests it does seem to make a small difference. How much CPU and memory load does this approach potentially induce?
What does the code that raises this look like? I'm guessing it is something like:
NotifyOfPropertyChange(() => SomeVal);
which is implicitly:
NotifyOfPropertyChange(() => this.SomeVal);
which does a capture of this, and pretty much means that the expression tree must be constructed (with Expression.Constant) from scratch each time. And then you parse it each time. So the overhead is definitely non-trivial.
Is it too much, though? That is a question only you can answer, with profiling and knowledge of your app. It is seen as OK for a lot of MVC usage, but that isn't (generally) calling it in a long-running tight loop. You need to profile against a desired performance target, basically.
Emiel Jongerius has a good performance comparison of the various INotifyPropertyChanged implementations:
http://www.pochet.net/blog/2010/06/25/inotifypropertychanged-implementations-an-overview/
The bottom line is that if you are using INotifyPropertyChanged for data binding in a UI, then the performance differences between the different versions are insignificant.
How about using the de facto standard
if (propName == value) return;
propName = value;
OnPropertyChanged("PropName");
and then creating a custom tool that checks and refactors the code files according to this standard? This tool could run as a pre-build task, also on the build server.
Simple, reliable, performs.
A stack walk is slow, and a lambda expression is even slower. We have a solution similar to the well-known lambda-expression approach, but it is almost as fast as a string literal. See
http://zamboch.blogspot.com/2011/03/raising-property-changed-fast-and-safe.html
I use the following method in a base class implementing INotifyPropertyChanged and it is so easy and convenient:
public void NotifyPropertyChanged()
{
    StackTrace stackTrace = new StackTrace();
    MethodBase method = stackTrace.GetFrame(1).GetMethod();
    if (!(method.Name.StartsWith("get_") || method.Name.StartsWith("set_")))
    {
        throw new InvalidOperationException("The NotifyPropertyChanged() method can only be used from inside a property");
    }
    string propertyName = method.Name.Substring(4);
    RaisePropertyChanged(propertyName);
}
I hope you find it useful also. :-)
Typesafe and no performance loss: the NotifyPropertyWeaver extension. It adds all the notification calls automatically before compiling...
