Which open source CEP should I choose for distributed and pipelined processing: Siddhi, Flink, or Esper? - apache-flink

I am a little biased towards Siddhi CEP as it has the Siddhi query language, but it uses Storm for distributed processing, and WSO2 provides a web interface / dashboard to create and deploy applications. I think it will give me less independence to enhance or use some features.
Flink, on the other hand, seems to be a good choice, but it requires a lot of code to implement even simple logic.
Is there a better option than these? I am confused.

What do you mean by less independence? You can use Siddhi 4.x [1] without depending on Storm by using its source and sink features to receive and send messages from one instance to another using TCP, Kafka, HTTP, etc.
WSO2 Stream Processor also uses the new version of Siddhi, and with its editor you can simulate events and also debug.
Update: From 4.1, [WSO2 Stream Processor][2] can run on top of Kafka in fully distributed mode. See https://docs.wso2.com/display/SP4xx/Fully+Distributed+Deployment.
[1] https://wso2.github.io/siddhi/
[2] https://wso2.com/analytics

I would do a test: create 10 queries in each system, something like:
select * from SomeEvent where value = 1
select * from SomeEvent where value = 2
...
select * from SomeEvent where value = 9
select * from SomeEvent where value = 10
The idea is to see how easy it is to create the queries, how the API or deploy steps work and how performance changes with the number of queries.
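If Flink is one of the candidates, the ten queries above could be written as trivial filters. The sketch below is only an illustration: the SomeEvent type, the in-memory source, and the print sinks are assumptions standing in for a real feed and real outputs.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class TenQueriesTest {

    // Hypothetical event type standing in for SomeEvent from the queries above.
    public static class SomeEvent {
        public int value;
        public SomeEvent() {}
        public SomeEvent(int value) { this.value = value; }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A tiny in-memory source; a real test would read from Kafka or a data generator.
        DataStream<SomeEvent> events =
                env.fromElements(new SomeEvent(1), new SomeEvent(2), new SomeEvent(10));

        // One trivial filter per "query": select * from SomeEvent where value = i
        for (int i = 1; i <= 10; i++) {
            final int value = i;
            events.filter(e -> e.value == value).print("query-" + value);
        }

        env.execute("ten trivial queries");
    }
}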

Related

How to set a Flink Operator acting on many streams

I am looking into using Flink as a streaming engine. I am coming from Apache Storm, and as I understand it, Storm's Bolt is similar to Flink's task/operator. In Storm one can have:
builder.setBolt("TEST", new TestBolt(), 5)
       .fieldsGrouping("Source1", "ID1")
       .fieldsGrouping("Source2", "ID2")
       .fieldsGrouping("Source3", "ID1")
       .allGrouping("Source4");
How can I achieve something similar with Flink? Basically I want my TEST bolt to have state from Source2, Source3, and Source4, and do some computation when data from Source1 comes.
Your options for combining streams in Flink include union (for merging n streams of the same type), connect for jointly processing two streams of any type with a CoFlatMap or CoProcessFunction, and broadcast.
In some cases it's preferable to build a sort of binary tree, connecting, for example, streams 1 and 2 to form stream 12, and separately, streams 3 and 4 to create stream 34, and then connect stream12 with stream 34.
The alternative is to create some sort of union type that can hold an object from any of the streams, and then use union to merge the streams. Flink includes an Either type that can be helpful in these situations.
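To make the connect option concrete, here is a minimal sketch using the Java DataStream API. The Tuple2 element types and the shared key in field f0 are assumptions for illustration, not taken from the question.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class ConnectTwoStreams {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical stand-ins for Source1 and Source2; both carry a String key in f0.
        DataStream<Tuple2<String, Integer>> source1 = env.fromElements(Tuple2.of("ID1", 42));
        DataStream<Tuple2<String, String>> source2 = env.fromElements(Tuple2.of("ID1", "context"));

        source1
            .connect(source2)
            .keyBy(t -> t.f0, t -> t.f0)   // key both streams by the shared field
            .process(new KeyedCoProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, String>, String>() {

                // Last value seen from source2 for this key, kept as Flink keyed state.
                private transient ValueState<String> fromSource2;

                @Override
                public void open(Configuration parameters) {
                    fromSource2 = getRuntimeContext().getState(
                            new ValueStateDescriptor<>("fromSource2", String.class));
                }

                @Override
                public void processElement1(Tuple2<String, Integer> e, Context ctx,
                                            Collector<String> out) throws Exception {
                    // Data from source1 arrived: compute with whatever source2 contributed so far.
                    out.collect(e.f0 + " -> " + e.f1 + " (with " + fromSource2.value() + ")");
                }

                @Override
                public void processElement2(Tuple2<String, String> e, Context ctx,
                                            Collector<String> out) throws Exception {
                    // Data from source2 arrived: just remember it for this key.
                    fromSource2.update(e.f1);
                }
            })
            .print();

        env.execute("connect example");
    }
}

Streams 3 and 4 could then be folded in with a further connect or a union, along the lines of the binary-tree approach described above.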

How to Implement Patterns to Match Brute Force Login and Port Scanning Attacks using Flink CEP

I have a use case where a large number of logs will be consumed by Apache Flink CEP. My use case is to detect brute force attacks and port scanning attacks. The challenge here is that in ordinary CEP we compare a value against a constant, like event = "login". In this case the criteria are different; for a brute force attack the criteria are as follows.
The username is constant and event = "login failure" (the constraint is that the event happens 5 times within 5 minutes).
That is, logs with the login failure event are received for the same username 5 times within 5 minutes.
And for port scanning we have the following criteria.
The IP address is constant and the destination port is variable (the constraint is that the event happens 10 times within 1 minute). That is, logs with a constant IP address are received for 10 different ports within 1 minute.
With Flink, when you want to process the events for something like one username or one IP address in isolation, the way to do this is to partition the stream by a key, using keyBy(). The training materials in the Flink docs have a section on Keyed Streams that explains this part of the DataStream API in more detail. keyBy() is roughly the same concept as a GROUP BY in SQL, if that helps.
With CEP, if you first key the stream, then the pattern will be matched separately for each distinct value of the key, which is what you want.
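For instance, a minimal keyed-CEP sketch of the brute-force rule could look like the following (the LogEvent type and its field names are assumptions for illustration; see below for why SQL may be the better fit here).

import java.util.List;
import java.util.Map;

import org.apache.flink.cep.CEP;
import org.apache.flink.cep.PatternSelectFunction;
import org.apache.flink.cep.pattern.Pattern;
import org.apache.flink.cep.pattern.conditions.SimpleCondition;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

public class BruteForceSketch {

    // Hypothetical log event; the field names are not from the question.
    public static class LogEvent {
        public String username;
        public String event;
        public LogEvent() {}
    }

    public static DataStream<String> detect(DataStream<LogEvent> logs) {
        // Five "login failure" events for the same username within five minutes.
        Pattern<LogEvent, ?> bruteForce = Pattern
                .<LogEvent>begin("failures")
                .where(new SimpleCondition<LogEvent>() {
                    @Override
                    public boolean filter(LogEvent e) {
                        return "login failure".equals(e.event);
                    }
                })
                .times(5)
                .within(Time.minutes(5));

        // Keying by username means the pattern is matched per user, in isolation.
        return CEP.pattern(logs.keyBy(e -> e.username), bruteForce)
                .select(new PatternSelectFunction<LogEvent, String>() {
                    @Override
                    public String select(Map<String, List<LogEvent>> match) {
                        return "possible brute force for " + match.get("failures").get(0).username;
                    }
                });
    }
}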
However, rather than CEP, I would instead recommend Flink SQL, perhaps in combination with MATCH_RECOGNIZE, for this use case. MATCH_RECOGNIZE is a higher-level API, built on top of CEP, and it's easier to work with. In combination with SQL, the result is quite powerful.
You'll find some Flink SQL training materials and examples (including examples that use MATCH_RECOGNIZE) in Ververica's github account.
Update
To be clear, I wouldn't use MATCH_RECOGNIZE for these specific rules; neither it nor CEP is needed for this use case. I mentioned it in case you have other rules where it would be helpful. (My reason for not recommending CEP in this case is that implementing the distinct constraint might be messy.)
For example, for the port scanning case you can do something like this:
SELECT e1.ip, COUNT(DISTINCT e2.port)
FROM events e1, events e2
WHERE e1.ip = e2.ip AND timestampDiff(MINUTE, e1.ts, e2.ts) < 1
GROUP BY e1.ip HAVING COUNT(DISTINCT e2.port) >= 10;
The login case is similar, but easier.
Note that when working with streaming SQL, you should give some thought to state retention.
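As a rough illustration only (the exact method and option names differ between Flink versions, e.g. newer releases also expose this as the table.exec.state.ttl option, so treat this as an assumption to verify against your release), idle state retention can be bounded on the table environment:

import java.time.Duration;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

// Keyed state that has not been touched for roughly a day is dropped; the interval is arbitrary.
TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());
tableEnv.getConfig().setIdleStateRetention(Duration.ofHours(24));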
Further update
This query is likely to return a given IP address many times, but it's not desirable to generate multiple alerts.
This could be handled by inserting matching IP addresses into an Alert table, and only generating alerts for IPs that aren't already there.
Or the output of the SQL query could be processed by a de-duplicator implemented using the DataStream API, similar to the example in the Flink docs. If you only want to suppress duplicate alerts for some period of time, use a KeyedProcessFunction instead of a RichFlatMapFunction, and use a Timer to clear the state when it's time to re-enable alerts for a given IP.
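A minimal sketch of such a de-duplicator might look like the following; the Alert type, its ip field, and the one-hour suppression period are assumptions made for illustration.

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Hypothetical alert type; only the ip field matters for de-duplication.
class Alert {
    public String ip;
}

// Emits the first alert per IP, then suppresses duplicates until a processing-time timer fires.
public class AlertDeduplicator extends KeyedProcessFunction<String, Alert, Alert> {

    private static final long SUPPRESSION_MS = 60 * 60 * 1000L;  // arbitrary: one hour

    private transient ValueState<Boolean> alreadyAlerted;

    @Override
    public void open(Configuration parameters) {
        alreadyAlerted = getRuntimeContext().getState(
                new ValueStateDescriptor<>("alreadyAlerted", Boolean.class));
    }

    @Override
    public void processElement(Alert alert, Context ctx, Collector<Alert> out) throws Exception {
        if (alreadyAlerted.value() == null) {
            out.collect(alert);                           // first alert for this IP: let it through
            alreadyAlerted.update(true);
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + SUPPRESSION_MS);
        }
        // otherwise: a duplicate within the suppression window, silently dropped
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Alert> out) {
        alreadyAlerted.clear();                           // re-enable alerts for this IP
    }
}

It would be applied to the keyed alert stream, e.g. alerts.keyBy(a -> a.ip).process(new AlertDeduplicator()).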
Yet another update (concerning CEP and distinctness)
Implementing this with CEP should be possible. You'll want to key the stream by the IP address, and have a pattern that has to match within one minute.
The pattern can be roughly like this:
Pattern<Event, ?> pattern = Pattern
    .<Event>begin("distinctPorts")
    .where(/* iterative condition 1 */)
    .oneOrMore()
    .followedBy("end")
    .where(/* iterative condition 2 */)
    .within(Time.minutes(1));
The first iterative condition returns true if the event being added to the pattern has a distinct port from all of the previously matching events. Somewhat similar to the example here, in the docs.
The second iterative condition returns true if size("distinctPorts") >= 9 and this event also has yet another distinct port.
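A hedged sketch of the first iterative condition, assuming a hypothetical Event type with an int port field, could look like this:

import org.apache.flink.cep.pattern.conditions.IterativeCondition;

// Accept the event only if its port has not already been matched into "distinctPorts".
IterativeCondition<Event> distinctPort = new IterativeCondition<Event>() {
    @Override
    public boolean filter(Event candidate, Context<Event> ctx) throws Exception {
        for (Event previous : ctx.getEventsForPattern("distinctPorts")) {
            if (previous.port == candidate.port) {
                return false;   // this port was already matched: not distinct
            }
        }
        return true;            // a port not yet seen in this partial match
    }
};

The second condition would do the same distinctness check and additionally require that "distinctPorts" already holds at least 9 events.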
See this Flink Forward talk (youtube video) for a somewhat similar example at the end of the talk.
If you try this and get stuck, please ask a new question, showing us what you've tried and where you're stuck.

Apache Flink and Apache Pulsar

I am using Flink to read data from Apache Pulsar.
I have a partitioned topic in Pulsar with 8 partitions.
I produced 1000 messages in this topic, distributed across the 8 partitions.
I have 8 cores in my laptop, so I have 8 sub-tasks (by default parallelism = # of cores).
I opened the Flink UI after executing the code from Eclipse, and I found that some sub-tasks are not receiving any records (they are idle).
I am expecting that all the 8 sub-tasks will be utilized (I am expecting that each sub-task will be mapped to one partition in my topic).
After restarting the job, I found that sometimes 3 sub-tasks are utilized and sometimes 4, while the remaining sub-tasks stay idle.
Please help me clarify this scenario.
Also, how can I know whether or not there is a shuffle between sub-tasks?
My Code:
ConsumerConfigurationData<String> consumerConfigurationData = new ConsumerConfigurationData<>();
Set<String> topicsSet = new HashSet<>();
topicsSet.add("flink-08");
consumerConfigurationData.setTopicNames(topicsSet);
consumerConfigurationData.setSubscriptionName("my-sub0111");
consumerConfigurationData.setSubscriptionType(SubscriptionType.Key_Shared);
consumerConfigurationData.setConsumerName("consumer-01");
consumerConfigurationData.setSubscriptionInitialPosition(SubscriptionInitialPosition.Earliest);
PulsarSourceBuilder<String> builder = PulsarSourceBuilder.builder(new SimpleStringSchema())
        .pulsarAllConsumerConf(consumerConfigurationData)
        .serviceUrl("pulsar://localhost:6650");
SourceFunction<String> src = builder.build();
DataStream<String> stream = env.addSource(src);
stream.print(" >>> ");
For the Pulsar question, I don't know enough to help. I recommend setting up a larger test and seeing how that turns out. Usually, you'd have more partitions than slots and have some slots consume several partitions in a somewhat random fashion.
Also, how can I know whether or not there is a shuffle between sub-tasks?
The easiest way is to look at the topology of the Flink Web UI. There you should see the number of tasks and the channel types. You could post a screenshot if you want more details but in this case, there is nothing that will be shuffled, since you only have a source and a sink.
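As an illustration (assuming the job really is just source -> print): the records currently travel over FORWARD edges between chained operators, so nothing is repartitioned. Inserting a keyBy, as in the sketch below, would redistribute the stream by key hash, and the Web UI would then show a HASH edge between the source and the sink.

stream
    .keyBy(value -> value.hashCode() % 8)   // arbitrary key, just to force a network exchange
    .print(" >>> ");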

Designing a near real time streaming backend

I have the following requirements for designing a streaming backend :
Documents are getting added at 20 docs/sec. Each doc has a timestamp field.
Searches are primarily based on timestamp range ( e.g. show me documents arrived in last 20 minutes )
Search queries per second: 100 searches/sec
Documents older than 2 days could be continuously deleted for optimization purposes ( by a cron )
I am thinking of using Solr (with SolrReplication/NRT). The problem with Solr is basically the frequent updates/deletes. For the freshest data I will need to commit on each update (otherwise the data won't be visible to searchers). Setting pollInterval to ~1 minute might kill both the master and the server. NRT/SolrCloud could be one of the options, but I am not very sure about their stability.
Any other approaches/suggestions based on SQL/NoSQL architectures ?
MySQL + memcached. Facebook runs their entire site on these two widely available, widely supported, free and open source packages.

What's your experience developing on Google App Engine?

Is GQL easy to learn for someone who knows SQL? How is Django/Python? Does App Engine really make scaling easy? Is there any built-in protection against "GQL Injections"? And so on...
I'd love to hear the not-so-obvious ups and downs of using app engine.
Cheers!
My experience with Google App Engine has been great, and the 1000-result limit has been removed; here is a link to the release notes:
app-engine release notes
No more 1000 result limit - That's right: with addition of Cursors and the culmination of many smaller Datastore stability and performance improvements over the last few months, we're now confident enough to remove the maximum result limit altogether. Whether you're doing a fetch, iterating, or using a Cursor, there's no limits on the number of results.
The most glaring and frustrating issue is the datastore api, which looks great and is very well thought out and easy to work with if you are used to SQL, but has a 1000 row limit across all query resultsets, and you can't access counts or offsets beyond that. I've run into weirder issues, with not actually being able to add or access data for a model once it goes beyond 1000 rows.
See the Stack Overflow discussion about the 1000 row limit
Aral Balkan wrote a really good summary of this and other problems
Having said that, App Engine is a really great tool to have at one's disposal, and I really enjoy working with it. It's perfect for deploying micro web services (e.g. JSON APIs) to use in other apps.
GQL is extremely simple - it's a subset of the SQL 'SELECT' statement, nothing more. It's only a convenience layer over the top of the lower-level APIs, though, and all the parsing is done in Python.
Instead, I recommend using the Query API, which is procedural, requires no run-time parsing, and makes 'GQL injection' vulnerabilities totally impossible (though they are impossible in properly written GQL anyway). The Query API is very simple: call .all() on a Model class, or call db.Query(modelname). The Query object has .filter(field_and_operator, value), .order(field_and_direction) and .ancestor(entity) methods, in addition to all the facilities GQL objects have (.get(), .fetch(), .count(), etc.). Each of the Query methods returns the Query object itself for convenience, so you can chain them:
results = MyModel.all().filter("foo =", 5).order("-bar").fetch(10)
is equivalent to:
results = MyModel.gql("WHERE foo = 5 ORDER BY bar DESC LIMIT 10").fetch()
A major downside when working with AppEngine was the 1k query limit, which has been mentioned in the comments already. What I haven't seen mentioned though is the fact that there is a built-in sortable order, with which you can work around this issue.
From the appengine cookbook:
def deepFetch(queryGen, key=None, batchSize=100):
  """Iterator that yields an entity in batches.

  Args:
    queryGen: should return a Query object
    key: used to .filter() for __key__
    batchSize: how many entities to retrieve in one datastore call

  Retrieved from http://tinyurl.com/d887ll (AppEngine cookbook).
  """
  from google.appengine.ext import db

  # AppEngine will not fetch more than 1000 results
  batchSize = min(batchSize, 1000)
  query = None
  done = False
  count = 0

  if key:
    key = db.Key(key)

  while not done:
    print count
    query = queryGen()
    if key:
      query.filter("__key__ > ", key)
    results = query.fetch(batchSize)
    for result in results:
      count += 1
      yield result
    if batchSize > len(results):
      done = True
    else:
      key = results[-1].key()
The above code together with Remote API (see this article) allows you to retrieve as many entities as you need.
You can use the above code like this:
def allMyModel():
  q = MyModel.all()
  return q

myModels = deepFetch(allMyModel)
At first I had the same experience as others who transitioned from SQL to GQL -- kind of weird to not be able to do JOINs, count more than 1000 rows, etc. Now that I've worked with it for a few months I absolutely love the app engine. I'm porting all of my old projects onto it.
I use it to host several high-traffic web applications (at peak time one of them gets 50k hits a minute.)
Google App Engine doesn't use an actual database, and apparently uses some sort of distributed hash map. This will lend itself to some different behaviors that people who are accustomed to SQL just aren't going to see at first. So for example getting a COUNT of items in regular SQL is expected to be a fast operation, but with GQL it's just not going to work the same way.
Here are some more issues:
http://blog.burnayev.com/2008/04/gql-limitations.html
In my personal experience, it's an adjustment, but the learning curve is fine.
