best way to statistically detect anomalies in data


Our web app collects a huge amount of data about user actions, network activity, database load, and so on.
All of this data is stored in warehouses, and we have quite a lot of interesting views on it.
If something odd happens, chances are it shows up somewhere in the data.
However, to detect manually whether something out of the ordinary is going on, someone has to look through this data continually and search for oddities.
My question: what is the best way to detect changes in dynamic data that can be seen as 'out of the ordinary'?
Are Bayesian filters (I've seen these mentioned when reading about spam detection) the way to go?
Any pointers would be great!
EDIT:
To clarify: the data shows, for example, a daily curve of database load.
This curve typically looks similar to the curve from yesterday.
Over time this curve might change slowly.
It would be nice if a warning could go off when the curve changes from day to day by more than some set parameters.

Take a look at Control Charts, they provide a way to track changes in your data visually and specify when the data is "out of control" or "anomalous". They are heavily used in manufacturing to ensure quality control.
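As a rough illustration of the control chart idea applied to something like a daily database load figure, here is a minimal sketch. It assumes a simplified chart whose limits are the baseline mean plus or minus three sample standard deviations; real control charts often estimate the spread differently (for example from moving ranges). The function names and the sample numbers are made up for the example.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (lower, center, upper) limits computed from in-control history."""
    center = mean(baseline)
    spread = stdev(baseline)  # sample standard deviation of the baseline
    return center - sigmas * spread, center, center + sigmas * spread


def out_of_control(values, lower, upper):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical peak database load for ten "normal" days.
baseline = [52, 48, 50, 51, 49, 50, 53, 47, 50, 50]
lower, center, upper = control_limits(baseline)

# New observations to check against the limits.
new_days = [51, 49, 70, 50, 30]
flagged = out_of_control(new_days, lower, upper)
print(flagged)
```

To follow a curve that drifts slowly over time, the baseline could be recomputed over a sliding window of recent days, so that gradual change moves the limits while sudden jumps still trigger a warning.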

This question is impossible to answer without knowing much more about the particular data you have. For an overview of what kinds of approaches exist, see Anomaly Detection: A Survey by Chandola, Banerjee, and Kumar.

Bayesian classification might help you find some anomalies in your data, depending on the type of data and how well you train your Bayesian filter.
There is even one available as a web service: uClassify.com.

This depends so much on what the data is. Take a statistics class and learn the basics first. This isn't usually an easy or simple problem.


How can I automate this task ? (automatic contouring for radiotherapy)

I've just started my residency as a radiation oncologist. I have a little background in programming (Python, VBA).
I'd like your insights on an issue I have at work.
The issue : For each patient, the radiation oncologist needs to do a contouring. Basically, he contours the main structures (like the aorta, the heart, the lungs, and so on) on a CT scan. This is essential for computing the spatial distribution of the radiations (because you want to avoid those structures). The contouring is done within a 3rd party software (called Isogray). The CT scans come from the hospital database and the radiation distribution is computed on another software.
It takes at least one hour to do a complete contouring. Multiply that by the number of patients (maybe a dozen per week) and by the number of oncologists (we are a team of 15) and you can see that it represents hundreds (maybe even thousands) of man-hours every year.
Software exists that does this automatically, but the hospital doesn't want to rent or buy it. But, seriously, how hard can it be to do a little automation? Can't I do this myself?
My plan of action : Here I'd like your insights. How can I automate this task ? The first thing is that I can't change anything within Isogray, so I need to do the automation externally. What I think I should do :
Create a database of the historical contourings : this means I need to be able to read what Isogray uses as an output files
Design an automatic model : I'm thinking deep learning models here. I don't know if there's anything more optimal to do than calibrating a deep learning model on the contoured CT scans I already have
Create a little software : based on the automatic model, the software will take a 'not contoured' Isogray file and turn it into a 'contoured' file. The oncologist only needs to load the new file into Isogray and validate the contouring
What do you think ? Do you see an easier way to do that ? I don't know anything about Isogray (I just know how to use it). Do you think this is doable? What information do I need before I start this project ?
Any insights will be welcomed :)

From what I have understood it is a problem of semantic segmentation.
You have an input image of N dimensions (or black and white) and you use the neural network to indicate which regions correspond to a specific organ.
You can use an architecture like the U-Net for this task: https://medium.com/#keremturgutlu/semantic-segmentation-u-net-part-1-d8d6f6005066
What I do not know is whether the reliability would be high enough; that depends on many factors.
Neural networks look for distinguishing patterns to tell regions apart, and the most important cues are shape and color. That is why the task is harder when both the color and the shape vary widely.
On the other hand, you will need a lot of images, but you can use a process called data augmentation to generate more (artificial) ones.
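To make the data augmentation idea concrete, here is a minimal sketch that applies the same geometric transform to an image and to its segmentation mask, so the labels stay aligned with the pixels. It uses plain nested lists as stand-ins for a CT slice and its contour mask; the function names are made up for the example, and whether flips or rotations are anatomically sensible for a given organ is something you would have to decide.

```python
def flip_horizontal(grid):
    """Mirror each row left to right."""
    return [list(reversed(row)) for row in grid]


def flip_vertical(grid):
    """Mirror the rows top to bottom."""
    return [list(row) for row in reversed(grid)]


def rotate_90(grid):
    """Rotate the grid 90 degrees clockwise."""
    return [list(row) for row in zip(*reversed(grid))]


def augment(image, mask):
    """Return the original pair plus one transformed pair per transform.

    The same transform is applied to image and mask so they stay aligned.
    """
    pairs = [(image, mask)]
    for transform in (flip_horizontal, flip_vertical, rotate_90):
        pairs.append((transform(image), transform(mask)))
    return pairs


# Tiny 2x2 stand-ins for an image slice and its binary mask.
image = [[1, 2], [3, 4]]
mask = [[0, 1], [0, 0]]
pairs = augment(image, mask)
print(len(pairs))
```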
Another method that is currently used is to work in reverse: image segmentation is a hard problem, but you can design a program that simulates realistic images for which the segmentation is known perfectly.
These are only some key points; I hope I have helped you.
EDIT:
Semantic segmentation in a biomedical context: https://towardsdatascience.com/review-u-net-biomedical-image-segmentation-d02bf06ca760

You need to provide more background on the specifics of the contouring, especially given that this is for medical diagnosis. Truthfully, I wouldn't try to automate this, for liability reasons.
If you make an error, it could cause a misdiagnosis, which as you already know can lead to numerous problems including lawsuits and death. The nice thing about 3rd party products is that they have already been tested robustly against numerous scenarios and approved for medical use.
I'm pretty sure you could make a master's thesis out of something like this.
With that being said, there is a nice GitHub repo for problems like this that I think you could start generating ideas from.

Thought on creating a searchable database

I guess what I need is two things: first, a way to input data into an Excel-like application or a form builder, and then a way to search those entries. For example, for a CAR PART I would put car Part A into Field 1, then Field 2 would be car Type, followed by make and model. The fields would need to be made into a form consisting of preset inputs such as (Title/Type) and (Variable Categories), so a drop-down menu, icons, or checkboxes would help narrow down the list of results. What pieces need to be in place to build or use a lightweight database/application design like this that allows inputting new information and later searching it by those variables? Also, is there any application that does this already, a programming language to learn, or an estimated cost and requirements to have it built?

First, there might be something off the shelf that does this already, and there are applications like this. Microsoft's Access would be a good place to start to see if it would fit your needs -- you can build forms and store data without much programming effort. As time goes on, you can scale up to a SQL Server.
It's not clear to me if your data is relational or not, and it might not matter much at first (any database will likely handle your queries to start). I originally thought your data was not relational, but re-reading your post, I'm not so sure now.
If that doesn't work, or you want more flexibility, then I'd start looking at NoSQL as an option. Some good choices include Mongo and RavenDB (there are many others).
You can program it yourself with just about any major language -- some provide more or less functionality based on the tie-in to the data.
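If you do end up programming it yourself, the core pieces are small. The sketch below is only an illustration: it uses SQLite through Python's built-in sqlite3 module rather than Access or a NoSQL store, and the table name, column names, and sample rows are assumptions based on the car part example in the question. The filter values would come from the drop-downs or checkboxes of the form.

```python
import sqlite3

# Columns a search form is allowed to filter on (assumed from the question).
SEARCHABLE = ("part_name", "car_type", "make", "model")

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE parts (
           id INTEGER PRIMARY KEY,
           part_name TEXT NOT NULL,
           car_type TEXT NOT NULL,
           make TEXT NOT NULL,
           model TEXT NOT NULL
       )"""
)


def add_part(conn, part_name, car_type, make, model):
    """Insert one entry, as a data-entry form would."""
    conn.execute(
        "INSERT INTO parts (part_name, car_type, make, model) VALUES (?, ?, ?, ?)",
        (part_name, car_type, make, model),
    )


def search_parts(conn, **filters):
    """Return rows matching every given filter, e.g. make='Toyota'."""
    clauses, params = [], []
    for column, value in filters.items():
        if column not in SEARCHABLE:
            raise ValueError(f"cannot search on {column!r}")
        clauses.append(f"{column} = ?")
        params.append(value)
    where = " AND ".join(clauses) or "1"  # no filters: match everything
    sql = f"SELECT part_name, car_type, make, model FROM parts WHERE {where} ORDER BY id"
    return conn.execute(sql, params).fetchall()


add_part(conn, "Brake pad", "Sedan", "Toyota", "Corolla")
add_part(conn, "Brake pad", "SUV", "Toyota", "RAV4")
add_part(conn, "Air filter", "Sedan", "Honda", "Civic")

results = search_parts(conn, part_name="Brake pad", car_type="SUV")
print(results)
```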

What are some of the problems with GTFS?

I am interested in replacing my current data format with GTFS, but I hear and read here and there that there are flaws in the GTFS file format.
Most of the time I see that you can't predict things such as delays or other real-time information. They say you can't get the "whole picture" with it.
So what I am asking is: since I am seeing GTFS for the first time, is there anyone more experienced who has used it in some kind of application and could describe the problems they faced while developing?
Maybe someone has a suggestion about a better kind of file format? Or a combination of some formats?

It's hard to say whether GTFS is a good fit or not for your application without knowing what your application's requirements are, but I can offer a few remarks.
If your goal is to provide real-time data to users you should take a look at GTFS-realtime, a complementary data format designed specifically for issuing real-time updates. For most public-transit applications, using a GTFS and a GTFS-realtime feed together does indeed give the "whole picture" about a transit network, or near enough.
In terms of GTFS itself, my main complaint is that it seems designed specifically for route-planning applications and using data in this format for any other purpose can be difficult. For example, while a GTFS feed records information about transit stops and routes, there is no requirement that each of these have a single, canonical entry—if the data spans multiple board periods, there will almost always be (seemingly) duplicate entries for each.
This doesn't matter if you're plotting a route based on where and when a person is travelling, since the links between objects ensure you'll always generate the right result. If you're starting with only a person's location and want to know, "What transit resources are available nearby?", reliably producing an accurate answer requires some contortions.

It depends on whether you need to import existing feeds. If you do, you need to be able to handle GTFS anyhow. In my case import was required, so I use the same format for data that stems from other sources such as PDF timetables; otherwise you need to support two formats. If you do not need it for import (or export), you may consider your own format: I find GTFS does not reveal the actual network.
GTFS needs quite some interpretation and digesting in order to end up with the whole picture that you can answer planning questions on.
I merge stops together if they are close, like a few meters apart, and assume a 'trivial walk' if they are 10-50 meters apart. That automatically handles combining multiple feeds.
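The stop merging described above could look roughly like the sketch below. It is an assumption about how such a merge might be done, not the answerer's actual code: it greedily assigns each stop to the first group whose first stop lies within a threshold, using the haversine distance. The 5 meter threshold, the function names, and the coordinates are made up for the example.

```python
import math


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in meters between two points."""
    radius = 6371000.0  # mean Earth radius in meters
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def merge_close_stops(stops, max_m=5.0):
    """Group (stop_id, lat, lon) tuples lying within max_m of a group's first stop."""
    groups = []
    for stop_id, lat, lon in stops:
        for group in groups:
            if distance_m(lat, lon, group["lat"], group["lon"]) <= max_m:
                group["ids"].append(stop_id)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append({"lat": lat, "lon": lon, "ids": [stop_id]})
    return [group["ids"] for group in groups]


stops = [
    ("s1", 52.0, 4.0),
    ("s2", 52.00002, 4.0),  # roughly 2 m north of s1
    ("s3", 52.01, 4.0),     # roughly 1.1 km north of s1
]
merged = merge_close_stops(stops)
print(merged)
```

The same distance function could be reused with a larger threshold to create 'trivial walk' connections instead of merging.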
Apart from that, I turn the stop_times roughly inside-out to create a 'link' table. The end result is that for each stop you have a list of departures and their destinations.
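One way to read "turning stop_times inside-out" is sketched below. The field names trip_id, stop_id, stop_sequence and departure_time come from the GTFS stop_times.txt file; everything else is an assumption, in particular that a 'destination' means the next stop on the same trip and that the rows are already parsed into dictionaries with integer sequence numbers.

```python
from collections import defaultdict

# A few hypothetical stop_times rows, already parsed from the feed.
stop_times = [
    {"trip_id": "T1", "stop_id": "A", "stop_sequence": 1, "departure_time": "08:00:00"},
    {"trip_id": "T1", "stop_id": "B", "stop_sequence": 2, "departure_time": "08:10:00"},
    {"trip_id": "T1", "stop_id": "C", "stop_sequence": 3, "departure_time": "08:20:00"},
    {"trip_id": "T2", "stop_id": "B", "stop_sequence": 1, "departure_time": "08:05:00"},
    {"trip_id": "T2", "stop_id": "A", "stop_sequence": 2, "departure_time": "08:15:00"},
]


def build_links(stop_times):
    """Map each stop_id to a sorted list of (departure_time, next_stop_id, trip_id)."""
    by_trip = defaultdict(list)
    for row in stop_times:
        by_trip[row["trip_id"]].append(row)

    links = defaultdict(list)
    for trip_id, rows in by_trip.items():
        rows.sort(key=lambda r: r["stop_sequence"])
        # Pair every stop on the trip with the stop that follows it.
        for here, nxt in zip(rows, rows[1:]):
            links[here["stop_id"]].append(
                (here["departure_time"], nxt["stop_id"], trip_id)
            )

    for departures in links.values():
        departures.sort()  # zero-padded HH:MM:SS strings sort chronologically
    return dict(links)


links = build_links(stop_times)
print(links["B"])
```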
The biggest problem so far is that GTFS feeds can record trips from an operator's point of view. Passengers can remain sitting in the bus if it flips the headsign from 351 to 285, takes a new driver on board and continues. That means you need to know which trips actually need to be seen as joined in passenger terms.
I solved a minor problem with manual feed entry by having my GTFS parser accept a handful of constructs that ease editing, such as leaving out the sequence numbers to have them generated incrementally, and recognising 02.13+1 as 26.13.
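The "02.13+1" shorthand can be handled with a few lines. This is a guess at the syntax from the single example given (hours.minutes, optionally followed by +N for N days later), not the answerer's actual parser; the function name is hypothetical. GTFS itself expresses times after midnight of the service day with hours above 24, which is what the conversion produces.

```python
import re

# hours.minutes with an optional "+N" suffix meaning N days later (assumed syntax)
_TIME = re.compile(r"^(\d{1,2})\.(\d{2})(?:\+(\d+))?$")


def normalize_time(text):
    """Convert '02.13+1' style shorthand to an hours-past-midnight form."""
    match = _TIME.match(text)
    if match is None:
        raise ValueError(f"unrecognised time: {text!r}")
    hours = int(match.group(1))
    minutes = int(match.group(2))
    days = int(match.group(3) or 0)
    return f"{hours + 24 * days:02d}.{minutes:02d}"


print(normalize_time("02.13+1"))
```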

Template Matching for relational database

I am trying to do the following:
We are trying to design a fraud detection system for the stock market.
I know the specifications for the frauds (they are like templates).
So I want to know if I can design a template and find all records that match it.
Notice:
I can't use traditional queries because the templates are complex.
For example, one of my frauds is circular trading. It looks like this:
A bought from B, B bought from C, and C bought from A (it's a cycle),
and this cycle can include 4 or 5 persons.
Is there any good suggestion for this situation?

I don't see why you can't use "traditional queries" as you've stated. SQL can be used to write extraordinarily complex queries. For that matter I'm not sure that this is a hugely challenging question.
First, I'd look at the behavior you have described as very transactional, so I'd treat the transactions as a model. I'd likely have a transactions table with columns like buyer, seller, amount, etc.
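With a table like that, the three-party circular trade from the question can be found with an ordinary self-join. The sketch below uses SQLite through Python's sqlite3 module purely for illustration; the table name trades, its columns, and the sample rows are assumptions. Each cycle is reported once, starting from the alphabetically smallest trader.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (buyer TEXT NOT NULL, seller TEXT NOT NULL, amount REAL)")
conn.executemany(
    "INSERT INTO trades (buyer, seller, amount) VALUES (?, ?, ?)",
    [
        ("A", "B", 100.0),  # A bought from B
        ("B", "C", 100.0),  # B bought from C
        ("C", "A", 100.0),  # C bought from A, closing the cycle
        ("D", "E", 50.0),   # unrelated trade
    ],
)

THREE_CYCLE_SQL = """
SELECT DISTINCT t1.buyer, t2.buyer, t3.buyer
FROM trades t1
JOIN trades t2 ON t2.buyer = t1.seller   -- the seller of trade 1 bought in trade 2
JOIN trades t3 ON t3.buyer = t2.seller   -- the seller of trade 2 bought in trade 3
WHERE t3.seller = t1.buyer               -- trade 3 closes the loop
  AND t1.buyer < t2.buyer                -- report each cycle once,
  AND t1.buyer < t3.buyer                -- starting at the smallest name
"""


def find_three_cycles(conn):
    """Return (a, b, c) where a bought from b, b from c, and c from a."""
    return conn.execute(THREE_CYCLE_SQL).fetchall()


cycles = find_three_cycles(conn)
print(cycles)
```

Cycles of 4 or 5 traders need one more join per extra participant, or a recursive query if the database supports it.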
Alternatively, you could have the shares as their own table and store, say, the previous 100 owners of each share in that same table using STI (Single Table Inheritance), by putting all the primary keys of the owners into an "owners" column in your shares table, like 234/823/12334/1234/... That way you can do complex queries to see if a share was owned by the same person before, or look for patterns in the string really easily and quickly.
-update-
I wouldn't suggest making up a "small language". I don't see why you'd want to do something like that when you have a huge selection of wonderful languages and databases to choose from, all of which have well-refined and tested methods to solve exactly what you are doing.
My best advice is to pop open your IDE (thumbs up for TextMate) and pick your favorite language (Ruby in my case). Find some sample data, create your database and start writing some code! You can't go wrong experimenting like this; it will expose better ways to go about it than we can dream up here on Stack Overflow.

Definitely Data Mining. But as you point out, you've already got the models (your templates). Look up fraud DETECTION rather than prevention for better search results?
I know some banks use SPSS PASW Modeler for fraud detection. It is very intuitive and you can see what you are doing as you play around with the data, so you can implement your templates. I agree with Joseph: you need to get playing and make some new data structures.
Maybe a timeseries model?

Theoretically you could develop a "Small Language" first, something with a simple syntax (that makes expressing the domain - in your case fraud patterns - easy) and from it generate one or more SQL queries.
As with most solutions, this can be thought of as a slider: at one extreme there is the "full Fraud Detection Language"; at the other, you could just build stored procedures for the most common cases, and write new stored procedures that use the more "basic" blocks you wrote before to implement the various patterns.
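Near the simple end of that slider, the "language" can be as small as a function that turns a pattern description into SQL. The sketch below is one hypothetical building block: it generates a self-join query for a trading cycle of a given length, using an assumed trades table with buyer and seller columns, and runs it on SQLite via Python's sqlite3 module. All names here are made up for the illustration.

```python
import sqlite3
from itertools import combinations


def cycle_query(length):
    """Build SQL that finds trading cycles with exactly `length` distinct traders."""
    if length < 2:
        raise ValueError("a cycle needs at least two traders")
    aliases = [f"t{i}" for i in range(1, length + 1)]
    joins = [f"FROM trades {aliases[0]}"]
    for prev, cur in zip(aliases, aliases[1:]):
        joins.append(f"JOIN trades {cur} ON {cur}.buyer = {prev}.seller")
    # The last seller must be the first buyer, which closes the loop.
    conditions = [f"{aliases[-1]}.seller = {aliases[0]}.buyer"]
    # Start each cycle at its smallest trader so rotations are not repeated.
    conditions += [f"{aliases[0]}.buyer < {a}.buyer" for a in aliases[1:]]
    # All remaining traders must differ from each other.
    conditions += [f"{a}.buyer <> {b}.buyer" for a, b in combinations(aliases[1:], 2)]
    columns = ", ".join(f"{a}.buyer" for a in aliases)
    return f"SELECT DISTINCT {columns} " + " ".join(joins) + " WHERE " + " AND ".join(conditions)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (buyer TEXT NOT NULL, seller TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO trades (buyer, seller) VALUES (?, ?)",
    [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")],  # a four-trader cycle
)

four_cycles = conn.execute(cycle_query(4)).fetchall()
three_cycles = conn.execute(cycle_query(3)).fetchall()
print(four_cycles, three_cycles)
```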
What you are trying to do falls under the Data Mining umbrella, so you could also try to learn more about it: maybe you can find a Data Mining package for your specific DB (you didn't specify which) and see if it helps you find common patterns in your data.

How do games handle saved content?

I don't see an answer to this question here on SO which makes me afraid that it's incredibly simple and I'm just missing something but here goes.
Background, feel free to skip: I need a single course for my bachelor's degree that I skipped out on years ago. Theoretically it's Computer Graphics, but since I left it has become more Game Development. And that's great because to me it's more interesting than the fill algorithms and translations and whatnot that it used to be. It's a 4th year course only offered every other year, but I've managed to talk the department into letting me take a 4th year independent study on the same topic and call that good enough.
The prof "running" the independent study doesn't teach the actual Computer Graphics course so while he's a smart guy this isn't really his field. So most of my questions are left to me, a text book and the internet. You know...like an independent study should be. :)
/Background
I've got a buddy that likes to develop game systems for fun. I plan to take one of his table top games and make it into a computer game using XNA.
I don't foresee any insurmountable challenges with the game mechanics but one thing I'm curious about is how do most games save their content? I mean that in a couple of ways and hopefully I can express them clearly.
Take the case of any RPG you've ever played. You can hit the "Save" button and save the world, your character's information and whatever other information is necessary. Then later on you can hit the "Load" button and bring it back.
Or the case of NPC dialogue. When I bump into Merchant #853 he randomly spits out one of 3 different greetings.
There are others that I can think of but they're really just variations on the same theme. Even with those two examples it seems to me the same mechanic could be used, but what is that mechanic?
I've been doing web development for years so my mind automatically jumps to "databases!". A database is the solution to any problem. And I can see how it could work here but the overhead seems pretty steep. "Here's my 6 MB compiled game...oh, and a 68 MB MySQL installation." Or even worse, since I'm using XNA, maybe I'd need to find a way to bundle SQL Server. :)
I thought maybe XML but that doesn't feel right to me either. How would it work if I wanted to run on the Xbox? Or Zune? (Those aren't necessary for what I'm doing, but there must be a solution somewhere that takes them into account.)
Anyone know the secret? Or have some ideas anyway?
Thanks
Jeff

There are two main ways in which games are saved, a simple one and a complex one. The first way is to simply store the current level, the current score and a handful of other stats. This is seen in games such as Super Mario Galaxy and most earlier console module based games. The save game doesn't restore your exact position, just which levels you have completed. These save games are generally very simple and require very little memory.
The second way stores not only your overall progress but each and every little detail, such as enemy positions, their current animation frame and so on, so that loading a save game will place you at the exact spot where you stopped, with all the enemies right in place, instead of back at the start of a level. These save games tend to get much bigger than the other kind and thus are mostly seen in PC games.
Databases are used in neither of these schemes. The purpose of a database is to let you query data structures dynamically, but what the game needs isn't a way to query individual pieces, just a way to store them statically. When a save game is loaded, it is loaded completely into memory, and from there on the game engine does its thing with the data. There are a handful of exceptions, such as MMORPGs, which might work on a database, but single-player games generally don't.
How the data is actually stored depends on the game. The most common choice seems to be simple binary data formats, as they are much better in terms of disk space than XML. In older games those binary formats were often raw dumps of pieces of the game process's memory, so they didn't have any well-thought-out structure and often broke when a patch or a different version of the game got released; in some modern games that's still the case. XML can be used too, as well as any other text-based file format.
In large part this is more a game design issue than a programming one, as the way a game can be saved can drastically change how it's played. The simple way, where you just save the level number and some stats, is however a lot easier to implement, as it's just a few lines of code. The second one requires serialization of most of your classes, which for a complex game can be quite a tricky issue and lead to many subtle bugs.
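To show how little code the simple approach needs, here is a minimal sketch that writes a handful of stats to a text file and reads them back. JSON is used here only as one example of a text-based format; the field names and the file name are made up for the illustration.

```python
import json
import os
import tempfile


def save_game(path, state):
    """Write the few stats that make up a 'simple' save game."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(state, handle)


def load_game(path):
    """Read the stats back; the game then restarts the saved level."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


# Hypothetical progress data: no positions, just overall progress.
state = {"level": 4, "score": 12500, "lives": 3}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "savegame.json")
    save_game(path, state)
    restored = load_game(path)

print(restored)
```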

One approach is to use .NET serialization.
Make sure the state of your game is a fully connected graph and that each class in that graph is marked as Serializable (with the SerializableAttribute); then for saving (and loading) you can use normal .NET serialization.
You can look at the codebase for Project Xenocide (open source XNA game) to see how it was done there.

You could use an SQLite database, with the SQLite.NET wrapper. I've used this, and found it quite simple. The whole DLL is only 850KB, and the database itself sits in a single file (with temp files created as needed). So your users shouldn't have an issue.
But you could also use a simple XML file, or a home-grown binary format. It all depends on how you're going to be querying the data, and how much data is involved. There is no one answer.

As others have noted, serialization is the way to go. And Gamasutra just published an article on data baking.

From my limited experience developing games, save games really don't use much storage. As tvanfosson said, you normally store most things in memory while playing the game, so saving state to disk isn't a problem.
Here's a short example. Assuming a single-player RPG, if you needed to save your character's location only, you'd have perhaps a level number, xyz coordinates and maybe the direction you're facing. That's just a few bytes.
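The "few bytes" claim is easy to check with a small binary layout. The sketch below packs a level number, three coordinates and a facing direction using Python's struct module; the field widths (16-bit level, 32-bit floats, 8-bit facing) are assumptions chosen for the example.

```python
import struct

# Little-endian, no padding: uint16 level, three float32 coordinates, uint8 facing.
SAVE_FORMAT = "<H3fB"


def pack_position(level, x, y, z, facing):
    """Serialize the character's location into a compact byte string."""
    return struct.pack(SAVE_FORMAT, level, x, y, z, facing)


def unpack_position(data):
    """Return (level, x, y, z, facing) from a packed save."""
    return struct.unpack(SAVE_FORMAT, data)


data = pack_position(3, 12.5, 0.0, -3.25, 2)
print(len(data))  # 2 + 3 * 4 + 1 = 15 bytes
print(unpack_position(data))
```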
Now assume you need to save the state/location of things like health packs, crates, enemies, character's health and picked up items, etc. You could have a few hundred of these at most which would easily be less than 10KB.
Obviously things can get very complicated with more complex games. The trick is to only store what is truly necessary to recreate the player's experience. A lot of games only let you save at certain places, like the end of a level. In this case you only need to store the new level number plus the outcome of previous levels (e.g. health remaining, picked up items).
Even if you allow arbitrary save points you can ignore the state of any places/levels that you cannot return to. And you probably wouldn't want the user to be able to save mid-jump.
EDIT: With regard to file format... use any way that's convenient for the data type! XML is quite a nice way of doing things. I'm not sure how effective a database would be, since for an RPG each fragment of data can be very different; you might end up with a bunch of tables with one row each.
Most games use their own, binary, file formats. Firstly this reduces the storage amount dramatically. Secondly, it helps prevent users cheating by editing the save game manually - if you have XML like <health value="10"/> it's very easy to edit the file to read <health value="100"/>. The downside of binary is that it's much more difficult for debugging.

While the game is running, I'd try to keep everything relating to the current context in memory. Your initialization can be kept in some suitable serialized format and read in on start up. XML would work, but it's somewhat verbose. A custom compact binary format is probably more appropriate. The same is true of the saved state. Whatever objects need to be reinitialized when the saved game is loaded should be serialized to a custom binary format and then reconstituted on load. If you run into memory problems, a small custom database optimized for speed would be another alternative. It could be pre-populated on installation.