What is complex event processing?

I have built an analytics app for a company that shows Key Performance Indicators (KPIs) based on their employees' selling/buying of items.
Now what they want is to add event functionality like:
* alert them if any employee sells more than $10,000 worth of items
* or alert them if some employee is tired of life and kills himself/herself
basically these silly events.
I am totally blank about what to do. They recommended using Esper, but I find it very tough to understand how event processing works.
I want to know how event processing works in such cases and where I can learn more about it.
Besides programming and databases I know nothing, and I am not a pro either.
Please share your opinions on what I am supposed to do.

Basically, complex event processing is used when you need to analyze "streams" of data and react quickly when a sought pattern emerges from them. For example, when three guys flying from different airports are carrying ingredients to potentially build a bomb and they're heading to the same destination, you need to find that correlation in the data flowing from each of the airports and react to the event quickly.
An employee selling more than $10,000 is not something you need to know about in real time. You can reward him next month, and for that "normal" reporting will work just fine.
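To make that concrete, here is a minimal sketch in plain Python of the kind of sliding-window correlation a CEP engine performs; the event format, threshold, and 60-second window are invented for illustration, not any particular engine's API:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60
recent = defaultdict(deque)  # destination -> deque of (timestamp, passenger)

def on_event(event):
    # event example: {'passenger': 'p1', 'destination': 'JFK', 'flagged_item': True}
    if not event['flagged_item']:
        return
    now = time.time()
    window = recent[event['destination']]
    window.append((now, event['passenger']))
    while window and now - window[0][0] > WINDOW_SECONDS:
        window.popleft()  # expire events that fell out of the window
    if len({p for _, p in window}) >= 3:
        print('ALERT: correlated events heading to', event['destination'])

An engine like Esper lets you declare this kind of pattern in a query language instead of hand-rolling windows and expiry yourself.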
Some papers to read:
Oracle CEP

CEP is actually an excellent way to accomplish your task. You can think of CEP as a pattern matching technique. Once a pattern is matched, you can be notified, or launch another process if your CEP engine is integrated with other tools.
Here's the example from Wikipedia. We can easily infer that it's a wedding based on these incoming signals:
* church bells ringing
* the appearance of a man in a tuxedo with a woman in a flowing white gown
* rice flying through the air
This is an abstract example, and things like this probably aren't getting put into any system, but you get the idea. It's best to see code to understand a CEP implementation. Here's exactly the script you could write on top of NebriOS to build the inference. Other tools like Drools will accomplish the same thing.
class wedding_detect(NebriOS):
    def check(self):
        if self.church_bells == "ringing" and \
           (self.clothes_type == "tuxedo" or \
            self.clothes_type == "wedding gown") and \
           self.rice_flying == True:
            return True
        else:
            return False

    def action(self):
        # fires if check() returned True
        send_email("me@example.com", "A wedding is underway")
For your situation, CEP is easy to program also:
class sales_alert(NebriOS):
    def check(self):
        return self.sales_total > 10000

    def action(self):
        send_email("me@example.com", "You got a $10k sale!")
The sales_total can come into the system from your CRM, for example.
That's CEP - Complex Event Processing! A start, anyway.

Related

Custom Searcher - Blending of hits from different sources

We have a need for "blending of hits from different sources". As per your documentation, it is recommended to write a custom Searcher in Java. Is there a demo of this written somewhere on GitHub? I wouldn't even know where to start :( I understand I can create search "chains", preferably asynchronous, and then blend results in Java before returning them... but then how would I handle pagination, limits, etc.? This all seems very complicated for someone who doesn't even know Java that much. So, I am hoping someone has already written a demo for this? Please? Anyone?
Thank you so much
EDIT to make my question clearer:
We are writing a search engine that fetches data from various websites. Some websites have 10 million indexable items, other websites only 100,000. When we present the results to the end user, we want to include results from all our sources (when a match applies). Let's say 10 results from each of the websites we crawl, so that they all get an equal amount of attention on the page. If we don't do custom blending, what happens is that the largest website with the most items wins all our traffic.
I understand that we can send 10 separate queries to Vespa and blend the results in our front end, but that seems very inefficient. Thus the question of a "custom Searcher". Thank you so much!
That documentation covers some very advanced use cases which you do not have. Are your sources different Vespa schemas or content clusters? If so, Vespa will by default blend the hits returned from each according to their relevance scores, so there's nothing you need to do.
The two other most common use cases are:
* Some (or all) of the data sources are external, so you need to write a Searcher component to fetch the external data and turn it into a Result.
* You want the data to be blended in some custom way (rather than by relevance score). If so, you need to exclude the default blending Searcher (com.yahoo.prelude.searcher.BlendingSearcher) and write your own.
If you provide some more information about your use cases I can give you some code examples.
EDIT: Use grouping to solve the need explained under "EDIT" in the question:
* Create a "siteid" field when feeding (e.g. in document processing).
* Use the grouping expression all(group(siteid) each(max(10) output(summary())))
See http://docs.vespa.ai/documentation/grouping.html
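For example, the grouping expression can be appended to the YQL and the result fetched over the search API. A sketch in Python, assuming a default endpoint on localhost:8080 and an invented query; adjust field names to your deployment:

import requests

# The grouping expression from above is appended to the YQL after "|".
yql = ('select * from sources * where userQuery() '
       '| all(group(siteid) each(max(10) output(summary())))')
response = requests.get('http://localhost:8080/search/',
                        params={'yql': yql, 'query': 'mountain bikes'})
# Each group in the response then holds at most 10 hits per siteid.
print(response.json()['root']['children'])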

Should you lock values in a ConcurrentDictionary, best practice

I'm trying to find the best solution (performant and accurate) for keeping a static list of objects in a web service.
Some web methods will make amendments to these objects and return the state of the object after the amendments; others will request the current state.
This needs to be accurate at every operation as it's money related. This web service will be bombarded with requests from different areas of our large-scale project.
I've been looking at ConcurrentDictionary, and while reading some other SO questions I came across the following answer: https://stackoverflow.com/a/1966462/151625
The following paragraph is something that I do not want:
Now consider this. In the store with one clerk, what if you get all the way to the front of the line, and ask the clerk "Do you have any toilet paper", and he says "Yes", and then you go "Ok, I'll get back to you when I know how much I need", then by the time you're back at the front of the line, the store can of course be sold out. This scenario is not prevented by a threadsafe collection.
So essentially I'm asking, should I lock values within a ConcurrentDictionary, or does it defeat the whole purpose of it? If I should/could, what is the best way to do it, if not, what alternative do I have?
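To illustrate the quoted scenario: it is a check-then-act sequence, and a thread-safe collection only makes each individual call atomic, not the sequence. A language-agnostic sketch (in Python for brevity, with invented account data) of both the race and the per-value lock that closes it:

import threading

balances = {'acct1': 100}            # hypothetical shared state
locks = {'acct1': threading.Lock()}  # one lock per value

def withdraw_racy(acct, amount):
    if balances[acct] >= amount:     # another thread can run between
        balances[acct] -= amount     # this check and this update

def withdraw_safe(acct, amount):
    with locks[acct]:                # check and act as one atomic unit
        if balances[acct] >= amount:
            balances[acct] -= amount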

Timers in Clojure?

I'm wondering if there is any widespread timer solution.
I want to write a simple game engine which would process user input on a tick every 20ms (or perform some actions once after 20ms, or any other period) and basically update "global" state via transactions. I also plan to use futures, so this solution should be able to deal with concurrency caveats.
Can you give me some advice?
You've actually got two distinct issues here.
The first is the issue of timers. You have lots of choices here:
* Start a thread that sleeps between actions, something like (future (loop [] (do-something) (Thread/sleep 20) (when (game-still-running) (recur))))
* Use a Java TimerTask - easy to call from Clojure
* Use a library like my little utility Task that includes a DSL for repeating tasks
* Use the timer functionality from whatever game engine you are using - most of these provide some tools for setting up a game loop
I'd probably just use the simple thread option - it's pretty easy to set up and easy to hack more features in later if you need to.
The second issue is handling game state. This is actually the trickier problem and you will need to design it around the particular type of game you are making, so I'll just give a few points of advice:
* If your game state is immutable, then you get a lot of advantages: your rendering code can draw the current game state independently while the game update is computing the next game state. Immutability has some performance costs, but the concurrency advantages are huge if you can make it work. Both of my mini Clojure games (Ironclad and Alchemy) use this approach.
* You should probably try to store your game state in a single top-level var. I find this works better than splitting different parts of the game state over different vars. It also means that you don't really need ref-based transactions: an atom or agent will usually do the trick.
* You may want to implement a queue of events that need to be processed sequentially by the game state update function. This is particularly important if you have multiple concurrent event sources (e.g. player actions, timer ticks, network events etc.)
Nowadays I would consider core.async a good choice, since:
* it scales very well when it comes to complex tasks that can be separated into activities that communicate using channels
* it avoids binding an activity to a thread
Here is a sketch:
(require '[clojure.core.async :refer [go-loop <! timeout]])

(go-loop []
  (do-something ...)
  (<! (timeout 1000)) ; park instead of blocking; don't Thread/sleep in a go block
  (recur))
This solution assumes you're writing Clojure on the JVM.
Something like this could work:
(import '(java.util TimerTask Timer))

(let [task (proxy [TimerTask] []
             (run [] (println "Here we go!")))]
  (. (new Timer) (schedule task (long 20))))

Building a NetHack bot: is Bayesian Analysis a good strategy?

A friend of mine is beginning to build a NetHack bot (a bot that plays the Roguelike game: NetHack). There is a very good working bot for the similar game Angband, but it works partially because of the ease in going back to the town and always being able to scum low levels to gain items.
In NetHack, the problem is much more difficult, because the game rewards ballsy experimentation and is built basically as 1,000 edge cases.
Recently I suggested using some kind of naive Bayesian analysis, in very much the same way spam is detected.
Basically the bot would at first build a corpus by trying every possible action with every item or creature it finds, and storing that information with, for instance, how close to death, injury or a negative effect it was. Over time it seems like you could generate a reasonably playable model.
Can anyone point us in the right direction of what a good start would be? Am I barking up the wrong tree or misunderstanding the idea of Bayesian analysis?
Edit: My friend put up a GitHub repo of his NetHack patch that allows Python bindings. It's still in a pretty primitive state, but if anyone's interested...
Although Bayesian analysis encompasses much more, the Naive Bayes algorithm well known from spam filters is based on one very fundamental assumption: all variables are essentially independent of each other. For instance, in spam filtering each word is usually treated as a variable, so this means assuming that if the email contains the word 'viagra', that knowledge does not affect the probability that it will also contain the word 'medicine' (or 'foo' or 'spam' or anything else). The interesting thing is that this assumption is quite obviously false when it comes to natural language but still manages to produce reasonable results.
Now one way people sometimes get around the independence assumption is to define variables that are technically combinations of things (like searching for the token 'buy viagra'). That can work if you know specific cases to look for, but in general, in a game environment, it means that you generally can't remember anything. So each time you have to move, perform an action, etc., it's completely independent of anything else you've done so far. I would say even for the simplest games, this is a very inefficient way to go about learning the game.
I would suggest looking into using Q-learning instead. Most of the examples you'll find are usually just simple games anyway (like learning to navigate a map while avoiding walls, traps, monsters, etc.). Reinforcement learning is a type of online unsupervised learning that does really well in situations that can be modeled as an agent interacting with an environment, like a game (or robots). It does this by trying to figure out what the optimal action is at each state in the environment (where each state can include as many variables as needed, much more than just 'where am I'). The trick then is to maintain just enough state to help the bot make good decisions without having a distinct point in your state 'space' for every possible combination of previous actions.
To put that in more concrete terms, if you were to build a chess bot you would probably have trouble if you tried to create a decision policy that made decisions based on all previous moves, since the set of all possible combinations of chess moves grows really quickly. Even a simpler model of where every piece is on the board is still a very large state space, so you have to find a way to simplify what you keep track of. But notice that you do get to keep track of some state, so that your bot doesn't just keep trying to make a left turn into a wall over and over again.
The Wikipedia article is pretty jargon-heavy, but this tutorial does a much better job of translating the concepts into real-world examples.
The one catch is that you do need to be able to define rewards to provide as the positive 'reinforcement'. That is, you need to be able to define the states that the bot is trying to get to, otherwise it will just continue forever.
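As a minimal illustration of the tabular Q-learning update described above (the actions here are invented NetHack-flavoured placeholders; a real bot would need a far richer state and reward design):

import random
from collections import defaultdict

ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # learning rate, discount, exploration
ACTIONS = ['north', 'south', 'east', 'west', 'search', 'eat']

Q = defaultdict(float)  # (state, action) -> estimated long-term reward

def choose_action(state):
    if random.random() < EPSILON:                     # explore occasionally
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])  # otherwise exploit

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    # Move the estimate toward the observed reward plus the discounted
    # value of the best follow-up action.
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])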
There is precedent: the monstrous rog-o-matic program succeeded in playing Rogue and even returned with the Amulet of Yendor a few times. Unfortunately, Rogue was only released as a binary, not source, so it has died (unless you can set up a 4.3BSD system on a MicroVAX), leaving rog-o-matic unable to play any of the clones. It just hangs because they're not close enough emulations.
However, rog-o-matic is, I think, my favourite program of all time, not only because of what it achieved but because of the readability of the code and the comprehensible intelligence of its algorithms. It used "genetic inheritance": a new player would inherit a combination of preferences from a previous pair of successful players, with some random offset, then be pitted against the machine. More successful preferences would migrate up in the gene pool and less successful ones down.
The source can be hard to find these days, but searching "rogomatic" will set you on the path.
I doubt bayesian analysis will get you far because most of NetHack is highly contextual. There are very few actions which are always a bad idea; most are also life-savers in the "right" situation (an extreme example is eating a cockatrice: that's bad, unless you are starving and currently polymorphed into a stone-resistant monster, in which case eating the cockatrice is the right thing to do). Some of those "almost bad" actions are required to win the game (e.g. coming up the stairs on level 1, or deliberately falling in traps to reach Gehennom).
What you could try is working at the "meta" level. Design the bot to choose randomly among a variety of "elementary behaviors". Then try to measure how these bots fare. Then extract the combinations of behaviors which seem to promote survival; Bayesian analysis could do that across a wide corpus of games along with their "success level" (a sketch follows below). For instance, if there are behaviors "pick up daggers" and "avoid engaging monsters in melee", I would assume that analysis would show that those two behaviors fit well together: bots which pick up daggers without using them, and bots which try to throw missiles at monsters without gathering such missiles, will probably fare worse.
This somehow mimics what learning gamers often ask on rec.games.roguelike.nethack. Most questions are similar to: "should I drink unknown potions to identify them?" or "what level should my character be before going that deep in the dungeon?". Answers to those questions heavily depend on what else the player is doing, and there is no good absolute answer.
A difficult point here is how to measure the success at survival. If you simply try to maximize the time spent before dying, then you will favor bots which never leave the first levels; those may live long but will never win the game. If you measure success by how deep the character goes before dying then the best bots will be archeologists (who start with a pick-axe) in a digging frenzy.
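A toy sketch of that meta-level analysis, assuming game logs that record which elementary behaviors a bot used together with a success score (all the data here is invented):

from collections import Counter
from itertools import combinations

# Invented game logs: (set of behaviors used, success score for that game).
games = [
    ({'pick up daggers', 'throw missiles'}, 0.9),
    ({'pick up daggers', 'melee everything'}, 0.3),
    ({'throw missiles', 'melee everything'}, 0.4),
]

pair_score, pair_count = Counter(), Counter()
for behaviors, success in games:
    for pair in combinations(sorted(behaviors), 2):
        pair_score[pair] += success
        pair_count[pair] += 1

# Average success per behavior pair; high values hint at behaviors that fit together.
for pair in pair_score:
    print(pair, pair_score[pair] / pair_count[pair])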
Apparently there are a good number of NetHack bots out there. Check out this listing:
In NetHack, unknown actions usually have a boolean effect -- either you gain or you lose. Bayesian networks are based around "fuzzy logic" values -- an action may give a gain with a given probability. Hence, you don't need a Bayesian network, just a list of "discovered effects" and whether they are good or bad.
No need to eat the cockatrice again, is there?
All in all it depends how much "knowledge" you want to give the bot as a starter. Do you want him to learn everything "the hard way", or will you feed him spoilers till he's stuffed?

Any business examples of using Markov chains?

What business cases are there for using Markov chains? I've seen the sort of play area of a Markov chain applied to someone's blog to write a fake post. I'd like some practical examples though, e.g. something useful in business, or prediction of the stock market, or the like...
Edit: Thanks to all who gave examples, I upvoted each one as they were all useful.
Edit2: I selected the answer with the most detail as the accepted answer. All answers I upvoted.
The obvious one: Google's PageRank.
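For illustration, a toy power-iteration PageRank over a three-page link graph (ignoring refinements like dangling-node handling):

def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:
                new_rank[target] += share
        rank = new_rank
    return rank

# A tiny three-page web: a links to b and c, b to c, c back to a.
print(pagerank({'a': ['b', 'c'], 'b': ['c'], 'c': ['a']}))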
Hidden Markov models are based on a Markov chain and extensively used in speech recognition and especially bioinformatics.
I've seen spam email that was clearly generated using a Markov chain -- certainly that qualifies as a "business use". :)
There is a class of optimization methods based on Markov Chain Monte Carlo (MCMC). These have been applied to a wide variety of practical problems, for example signal and image processing applications for data segmentation and classification. Speech and image recognition, time series analysis: lots of similar examples come out of computer vision and pattern recognition.
We use log-file chain-analysis to derive and promote secondary and tertiary links to otherwise-unrelated documents in our help-system (a collection of 10m docs).
This is especially helpful in bridging otherwise separate taxonomies. e.g. SQL docs vs. IIS docs.
I know AccessData uses them in their forensic password-cracking tools. It lets you explore the more likely password phrases first, resulting in faster password recovery (on average).
Markov chains are used by search companies like Bing to infer the relevance of documents from the sequence of clicks made by users on the results page. The underlying user behaviour in a typical query session is modeled as a Markov chain, with particular behaviours as state transitions...
For example, if the document is relevant, a user may still examine more documents, but with a smaller probability than if it were irrelevant (in which case he examines more documents with a much larger probability).
There are some commercial ray tracing systems that implement Metropolis Light Transport (invented by Eric Veach, who basically applied Metropolis-Hastings to ray tracing), and bidirectional and importance-sampling path tracers also use Markov chains.
The key terms are googlable; I omitted further explanation for the sake of this thread.
We plan to use it for predictive text entry on a handheld device for data entry in an industrial environment. In a situation with a reasonable vocabulary size, transitions to the next word can be suggested based on frequency. Our initial testing suggests that this will work well for our needs.
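A toy version of that idea: a first-order Markov chain over bigram frequencies, suggesting the most likely next words (the vocabulary here is invented):

from collections import Counter, defaultdict

def train(words):
    transitions = defaultdict(Counter)          # word -> counts of next words
    for current, nxt in zip(words, words[1:]):
        transitions[current][nxt] += 1
    return transitions

def suggest(transitions, word, n=3):
    return [w for w, _ in transitions[word].most_common(n)]

corpus = "check valve open check valve closed check pump".split()
model = train(corpus)
print(suggest(model, "check"))   # -> ['valve', 'pump']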
IBM has CELM. Check out this link:
http://www.research.ibm.com/journal/rd/513/labbi.pdf
I recently stumbled on a blog example of using Markov chains for creating test data...
http://github.com/emelski/code.melski.net/blob/master/markov/main.cpp
A Markov model is a way of describing a process that goes through a series of states.
HMMs can be applied in many fields where the goal is to recover a data sequence that is not immediately observable (but depends on some other data on that sequence).
Common applications include:
Crypt-analysis, Speech recognition, Part-of-speech tagging, Machine translation, Stock Prediction, Gene prediction, Alignment of bio-sequences, Gesture Recognition, Activity recognition, Detecting browsing pattern of a user on a website.
Markov chains can be used to simulate user interaction, e.g. when browsing a service.
A friend of mine wrote his diploma thesis on plagiarism detection using Markov chains (he said the input data must be whole books to succeed).
It may not be very 'business', but Markov chains can be used to generate fictitious geographical and personal names, especially in RPG games.
Markov chains are used in life insurance, particularly in the permanent disability model. There are 3 states:
* 0 - the life is healthy
* 1 - the life becomes disabled
* 2 - the life dies
In a permanent disability model the insurer may pay some sort of benefit if the insured becomes disabled and/or the life insurance benefit when the insured dies. The insurance company would then likely run a Monte Carlo simulation based on this Markov chain to determine the likely cost of providing such insurance.
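A toy Monte Carlo sketch of that model, with invented yearly transition probabilities and benefit amounts:

import random

# Yearly transition probabilities (invented for illustration).
P = {
    'healthy':  {'healthy': 0.90, 'disabled': 0.07, 'dead': 0.03},
    'disabled': {'disabled': 0.85, 'dead': 0.15},  # permanent: no recovery
    'dead':     {'dead': 1.0},
}
DISABILITY_BENEFIT, DEATH_BENEFIT = 10_000, 100_000  # invented amounts

def simulate(years=30):
    state, cost = 'healthy', 0.0
    for _ in range(years):
        r, cumulative = random.random(), 0.0
        for next_state, p in P[state].items():
            cumulative += p
            if r < cumulative:
                state = next_state
                break
        if state == 'disabled':
            cost += DISABILITY_BENEFIT   # benefit paid each year while disabled
        elif state == 'dead':
            cost += DEATH_BENEFIT        # death benefit paid once
            break
    return cost

trials = [simulate() for _ in range(100_000)]
print('expected cost per policy:', sum(trials) / len(trials))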
