(This question is about strategy and high-level approach to data refining, not programming, so if it is off-topic... sorry in advance, but I couldn't find a better stackexchange community)
So, we are in a (typical) scenario in which new data are introduced by a multitude of users (bottom-up contribution) and periodically refined, corrected, categorized and enriched by moderators/administrators/trusted users (top-down refining).
This scenario is quite common on websites (Stack Exchange tags are a good example).
Is there a "best strategy" to minimize efforts and maximize the quality of data?
Here are some doubts:
Should data be forced through a validation process, or should they be allowed to populate the system (accepting a certain degree of incorrectness/inconsistency), with the most popular entries fixed/enriched as they arise?
Should the system be prefilled top-down with as much data as possible, anticipating the bottom-up arrivals?
Should bottom-up entries be helped to stay consistent with the rest of the data (autocompletes and did-you-mean boxes for the user)?
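As a rough illustration of the did-you-mean idea, here is a minimal sketch using Python's standard `difflib` module. The tag list, the `suggest` function name and the 0.75 cutoff are all assumptions made up for the example, not part of any real system.

```python
import difflib

# Hypothetical list of entries already curated top-down by moderators.
KNOWN_TAGS = ["python", "javascript", "java", "database", "data-mining"]

def suggest(entry, known=KNOWN_TAGS, cutoff=0.75):
    """Return existing entries that closely resemble the user's input."""
    normalized = entry.strip().lower()
    if normalized in known:
        return [normalized]  # exact match, nothing to correct
    # get_close_matches ranks candidates by similarity ratio (0..1).
    return difflib.get_close_matches(normalized, known, n=3, cutoff=cutoff)

if __name__ == "__main__":
    print(suggest("Pyhton"))  # a typo that should map to an existing tag
    print(suggest("zzzz"))    # nothing similar, so the entry would be new
```

An empty result would mean the entry is genuinely new and could be queued for the top-down review, while a non-empty one can be offered to the user before the data is stored.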
I have experience with neural networks, specifically back-propagation networks, and I know that when a hidden layer is introduced, dependencies between the inputs passed to the trainer become part of the resulting model's knowledge.
Is the same true for decision networks?
I have found information about these algorithms (ID3, etc.) somewhat hard to find. I have been able to find the actual algorithms, but information such as expected/optimal dataset formats and other overviews is rare.
Decision trees are actually very easy to provide data to, because all they need is a table of data and the feature (column) of that table you want to predict. The data can be discrete or continuous for any feature. There are several flavors of decision trees, with different support for continuous and discrete values, and they work differently, so understanding how each one works can be challenging.
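To make the "table plus target column" format concrete, here is a small sketch of how an ID3-style tree could pick its first split by information gain. The toy rows and column names are invented for illustration, and only discrete values are handled.

```python
import math
from collections import Counter

# Toy table (made up): each row is one observation, "play" is the target.
ROWS = [
    {"outlook": "sunny", "windy": "no",  "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no",  "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

def entropy(rows, target):
    """Shannon entropy (in bits) of the target column."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, feature, target):
    """Entropy reduction obtained by splitting the rows on one feature."""
    total = len(rows)
    remainder = 0.0
    for value in {r[feature] for r in rows}:
        subset = [r for r in rows if r[feature] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(rows, target) - remainder

def best_split(rows, target):
    """Pick the feature with the highest information gain."""
    features = [k for k in rows[0] if k != target]
    return max(features, key=lambda f: information_gain(rows, f, target))

if __name__ == "__main__":
    print(best_split(ROWS, "play"))
```

Each feature is scored on its own against the target column, which is why the input can be nothing more than a flat table.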
Different decision tree algorithms with comparison of complexity or performance
Depending on the type of algorithm you are interested in, it can be hard to find information without reading the actual papers if you want to implement it. I've implemented the CART algorithm, and the only option for that was to find the original 200-page book about it. Most other treatments discuss ideas like splitting in reasonable detail, but cover every other aspect only at a high level.
As for whether they take dependencies into account: I believe a decision tree only assumes dependence between each input feature and the prediction feature. If an input were independent of the prediction feature, you couldn't use it as a split criterion. Between the input features themselves, though, I believe they must be independent of each other. I'd have to check the book to be sure, but off the top of my head I think that's true.
I have implicitly made this a community wiki seeing that the answers can be quite broad.
I'm working with a start-up company to accomplish the following goal.
In medical research, a patient's medical record can hold an unbounded amount of data relevant to a specific diagnosis, e.g. a smoker has a higher chance of getting lung cancer, but that doesn't necessarily mean that a non-smoker can't get lung cancer. My goal is to create/use a database model that can deal with such parameters.
Now, I also have to come up with ways to mine these parametrized data to produce statistics, e.g. to see the trends across all 40-year-old females who suffered from lung cancer. The report can be generic (graph, tabular, etc.), so that doctors can see trends or analyse possible solutions that might work.
My questions are:
1) Which database systems allow parametrized backend storage (e.g. Cassandra), can easily be used from Java, and are very efficient at data retrieval, linkage, etc.? We are dealing with a large number of patient records per state.
2) What algorithms or AI techniques can I use for data mining? Are there any mining techniques out there that can help me do this?
PS How does Google Analytics deal with parametrised data?
PPS Parametrized data is data that has a key and a value, where the value can be a plain value, another key-value pair, a list of values, or a set of parametrized data (organized or unorganized).
I'm looking forward to your suggestions! :-D
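To illustrate the definition of parametrized data given above, here is a minimal sketch that models such a record as nested dictionaries and lists and flattens it into (path, value) pairs that are easier to aggregate. The record contents and the `flatten` helper are hypothetical.

```python
def flatten(data, prefix=""):
    """Yield (path, value) pairs for every leaf of a nested key-value structure."""
    if isinstance(data, dict):
        for key, value in data.items():
            path = f"{prefix}.{key}" if prefix else str(key)
            yield from flatten(value, path)
    elif isinstance(data, (list, tuple)):
        for index, value in enumerate(data):
            yield from flatten(value, f"{prefix}[{index}]")
    else:
        yield prefix, data  # a plain value ends the recursion

# Hypothetical patient record, invented for this example.
record = {
    "age": 40,
    "sex": "female",
    "diagnosis": {"name": "lung cancer", "risk_factors": ["smoker"]},
}

if __name__ == "__main__":
    for path, value in flatten(record):
        print(path, "=", value)
```

Once every leaf has a path, a question like "all 40-year-old females with lung cancer" becomes a filter on a handful of known paths, whatever store ends up holding the records.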
I'll try to answer your first question only.
Cassandra is a key-value datastore (in your case, parametrized). If you use Cassandra, you will need more computation time to derive complex reports, because it stores data in raw format. NoSQL databases like Cassandra are good if you want to scale very, very big. They are eventually consistent and compromise on data replication and latency.
In your case, since a patient's data can take almost any form, try to fit the model of a triple store (Semantic Web frameworks like Jena, OpenSesame, etc.). They allow loose data structures that can be molded at runtime. Also, their query engines (SPARQL, SeRQL) give you more power than NoSQL stores (like Cassandra), though these querying capabilities are obviously weaker than those of an RDBMS.
For this question, this is how we implemented it.
We created a keyspace called medical and a supercolumn family called patient.
Under the supercolumn family, we have a general supercolumn, which basically stores the patient details, and another supercolumn called operation, which keeps a record of the user occupation.
Don't forget that the general supercolumn keeps a record of the patient each time he/she comes to the doctor. That way, we know the patient's exact condition before, during and after the operation.
I know some data can be duplicated, but no two supercolumns can be identical, as there is no way to have two different patients with identical attributes and sickness.
So basically, Cassandra allows 3 layers of abstraction: Keyspace, Column/Supercolumn family, Column/Supercolumn.
Hope this can help somebody.
I am trying to do the following:
We are trying to design a fraud detection system for the stock market.
I know the specifications of the frauds (they are like templates).
So I want to know whether I can design a template and find all records that match it.
I can't use traditional queries because the templates are complex.
For example, one of my frauds is circular trading, which looks like this:
A bought from B, B bought from C, and C bought from A (it's a cycle),
and this cycle can include 4 or 5 persons.
Are there any good suggestions for this situation?
I don't see why you can't use "traditional queries" as you've stated. SQL can be used to write extraordinarily complex queries. For that matter I'm not sure that this is a hugely challenging question.
Firstly, I'd look at the behavior you have described as very transactional, so I'd treat the transactions as a model. I'd likely have a transactions table with columns like buyer, seller, amount, etc.
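To show that a plain SQL query can express the circular-trading template, here is a minimal sketch using Python's built-in `sqlite3` module and a recursive common table expression. The table layout follows the buyer/seller/amount columns mentioned above; the sample rows, the `find_cycles` name and the `max_len` bound are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (buyer TEXT, seller TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?, ?)",
    [
        ("A", "B", 100.0),  # A bought from B
        ("B", "C", 100.0),  # B bought from C
        ("C", "A", 100.0),  # C bought from A, closing the cycle
        ("D", "A", 50.0),   # unrelated trade
    ],
)

def find_cycles(conn, max_len=5):
    """Return trade chains of at most max_len trades that end where they started."""
    query = """
    WITH RECURSIVE chain(origin, tail, path, depth) AS (
        SELECT buyer, seller, buyer || '>' || seller, 1 FROM transactions
        UNION ALL
        SELECT chain.origin, t.seller, chain.path || '>' || t.seller, chain.depth + 1
        FROM chain JOIN transactions t ON t.buyer = chain.tail
        WHERE chain.depth < ? AND chain.tail <> chain.origin
    )
    SELECT path FROM chain WHERE tail = origin AND depth >= 2
    ORDER BY path
    """
    return [row[0] for row in conn.execute(query, (max_len,))]

if __name__ == "__main__":
    for path in find_cycles(conn):
        print(path)
```

Each cycle is reported once per starting trader, so real code would probably want to deduplicate rotations, and the depth bound is what keeps the recursion finite. Other databases that support recursive queries should allow something similar, though the syntax may differ.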
You could alternatively have the shares as their own table and store, say, the previous 100 owners of each share in that same table using STI (Single Table Inheritance), by putting the primary keys of the owners into an "owners" column in your shares table, like 234/823/12334/1234/... That way you can run complex queries to see whether a share was owned by the same person, or look for patterns in the string really easily and quickly.
I wouldn't suggest making up a "small language". I don't see why you'd want to do something like that when you have a huge selection of wonderful languages and databases to choose from, all of which have well-refined and tested methods to solve exactly what you are doing.
My best advice is to pop open your IDE (thumbs up for TextMate) and pick your favorite language (Ruby in my case). Find some sample data, create your database and start writing some code! You can't go wrong experimenting like this; it will expose better ways to go about it than we can dream up here on Stack Overflow.
Definitely Data Mining. But as you point out, you've already got the models (your templates). Look up fraud DETECTION rather than prevention for better search results?
I know some banks use SPSS PASW Modeler for fraud detection. It is very intuitive, and you can see what you are doing as you play around with the data, so you can implement your templates. I agree with Joseph: you need to get playing and make some new data structures.
Maybe a timeseries model?
Theoretically you could develop a "Small Language" first, something with a simple syntax (that makes expressing the domain - in your case fraud patterns - easy) and from it generate one or more SQL queries.
As with most solutions, this could be thought of as a slider: at one extreme there is the "full Fraud Detection Language"; at the other, you could just build stored procedures for the most common cases, and write new stored procedures which use the more "basic" blocks you wrote before to implement the various patterns.
What you are trying to do falls under the Data Mining umbrella, so you could also try to learn more about it: maybe you can find a Data Mining package for your specific DB (you didn't specify) and see if it helps you find common patterns in your data.
Our webapp collects a huge amount of data about user actions, network business, database load, etc.
All data is stored in warehouses, and we have quite a lot of interesting views on this data.
If something odd happens, chances are it shows up somewhere in the data.
However, to manually detect if something out of the ordinary is going on, one has to continually look through this data, and look for oddities.
My question: what is the best way to detect changes in dynamic data which can be seen as 'out of the ordinary'.
Are Bayesian filters (I've seen these mentioned when reading about spam detection) the way to go?
Any pointers would be great!
To clarify: the data shows, for example, a daily curve of database load.
This curve typically looks similar to the curve from yesterday.
Over time this curve might change slowly.
It would be nice if a warning could go off when the curve changes from day to day by more than some set parameters.
Take a look at Control Charts, they provide a way to track changes in your data visually and specify when the data is "out of control" or "anomalous". They are heavily used in manufacturing to ensure quality control.
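As a rough sketch of the control-chart idea, the snippet below derives limits of mean plus or minus three standard deviations from a baseline period and flags new readings that fall outside them. The load figures are invented, and the three-sigma rule is only the most common default, not the only possible choice.

```python
import statistics

def control_limits(baseline, sigmas=3.0):
    """Lower and upper control limits derived from a baseline sample."""
    mean = statistics.mean(baseline)
    spread = statistics.stdev(baseline)
    return mean - sigmas * spread, mean + sigmas * spread

def out_of_control(baseline, new_values, sigmas=3.0):
    """Return (index, value) for every new reading outside the limits."""
    low, high = control_limits(baseline, sigmas)
    return [(i, v) for i, v in enumerate(new_values) if not low <= v <= high]

# Hypothetical database load at the same hour on previous days.
baseline = [50, 52, 49, 51, 50, 48, 53, 50]

if __name__ == "__main__":
    print(control_limits(baseline))
    print(out_of_control(baseline, [51, 49, 70, 52]))
```

For a curve that drifts slowly over time, the baseline could be a sliding window of recent days, so that the limits follow the drift while sudden jumps are still flagged.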
This question is impossible to answer without knowing much more about the particular data you have. For an overview of what kinds of approaches exist, see Anomaly Detection: A Survey by Chandola, Banerjee, and Kumar.
Bayesian classification might help you find some anomalies in your data, depending on the type of data and how well you train your Bayesian filter.
There is even one available as a web service at uClassify.com.
This depends so much on what the data is. Take a statistics class and learn the basics first. This isn't usually an easy or simple problem.
Let's say you are a GM DBA and you have to design around the GM models.
Is it better to do this?
type {cadillac, saturn, chevrolet}
Or this?
Let's say that the business lines have the same columns for a model and that there are over a million records for each subtype.
There is a lot of CRUD.
There are a lot of very processor-intensive reports.
In either schema, there is a model_detail table that contains 3-5 records for each model, and the details for each model differ (you can't add a Cadillac detail to a Saturn model).
The dev team doesn't have any issues with DB complexity.
I'm not really sure that this is a normalization question. Even though the structures are the same, they might be thought of as different entities.
Reasons for partitioning the structure into multiple tables
- business lines may have different business rules regarding parts
- addModelDetail() could be different for each business line (even though the data format is the same)
- high add/update activity - better performance with partitioned structure instead of single structure (I'm guessing and not sure here)?
I think this is a variation of the EAV problem. When posed as an EAV design, the single-table structure generally gets voted a bad idea. When posed in this manner, the single-table structure generally gets voted a good idea. Interesting...
I think the most interesting answer is having two different structures: one for CRUD and one for reporting. I think I'll try a concatenated/flattened view for reporting and multiple tables for CRUD and see how that works.
Definitely the former example. Do you want to be adding tables to your database whenever you add a new model to your product range?
On data with a lot of writes (e.g. an OLTP application), it is better to have more, narrower tables (i.e. tables with fewer fields). There will be less lock contention because you're only writing small amounts of data into different tables.
So, based on the criteria you have described, the table structure I would have is:
Other common fields
Fields specific to a Caddy
Fields specific to a Saturn
For reporting, I'd have an entirely different database on an entirely different server that does not have the normalized structure (e.g. just has CadillacVehicle and SaturnVehicle tables with all of the fields from the Vehicle table duplicated into them).
With proper indexes, even the OLTP database could be performant in your SELECTs, regardless of the fact that there are tens of millions of rows. However, since you mentioned that there are processor-intensive reports, I would have a completely separate reporting database.
One last comment. About the business rules... the data store cares not about the business rules. If the business rules are different between models, that really shouldn't factor into your design decisions about the database schema (other than to help dictate which fields are nullable and their data types).
Use the former. Setting up separate tables for the specialisations will complicate your code and doesn't bring any advantages that can't be achieved in other ways. It will also massively simplify your reports.
If the tables really do have the same columns, then the former is the best way to do it. Even if they had different columns, you'd probably still want to have the common columns be in their own table, and store a type designator.
You could try having two separate databases.
One is an OLTP (OnLine Transaction Processing) system which should be highly normalized so that the data model is highly correct. Report performance must not be an issue, and you would deal with non-reporting query performance with indexes/denormalization etc. on a case-by-case basis. The data model should try to match up very closely with the conceptual model.
The other is a Reports system which should pull data from the OLTP system periodically, and massage and rearrange that data in a way that makes report-generation easier and more performant. The data model should not try to match up too closely with the conceptual model. You should be able to regenerate all the data in the reporting database at any time from the data currently in the main database.
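A minimal sketch of that split, using Python's built-in `sqlite3` module: a normalized `model` table with a type designator plus the `model_detail` table from the question, and a flattened `report_model` table that can be dropped and regenerated at any time. The column names, the sample rows and the `report_model` table are assumptions, and both sides live in one in-memory database here only to keep the example short.

```python
import sqlite3

def build_oltp():
    """Create the normalized (OLTP-style) tables with a few sample rows."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE model (
            id INTEGER PRIMARY KEY,
            type TEXT NOT NULL,   -- cadillac, saturn, chevrolet
            name TEXT NOT NULL
        );
        CREATE TABLE model_detail (
            id INTEGER PRIMARY KEY,
            model_id INTEGER NOT NULL REFERENCES model(id),
            detail TEXT NOT NULL
        );
        INSERT INTO model VALUES (1, 'cadillac', 'Escalade'), (2, 'saturn', 'Vue');
        INSERT INTO model_detail (model_id, detail) VALUES
            (1, 'leather'), (1, 'v8'), (2, 'hybrid');
    """)
    return db

def rebuild_report(db):
    """Regenerate the flattened reporting table from the normalized data."""
    db.executescript("""
        DROP TABLE IF EXISTS report_model;
        CREATE TABLE report_model AS
        SELECT m.type, m.name,
               COUNT(d.id) AS detail_count,
               GROUP_CONCAT(d.detail, ', ') AS details
        FROM model m LEFT JOIN model_detail d ON d.model_id = m.id
        GROUP BY m.id;
    """)

if __name__ == "__main__":
    db = build_oltp()
    rebuild_report(db)
    for row in db.execute("SELECT * FROM report_model ORDER BY type"):
        print(row)
```

Because `rebuild_report` starts from scratch every time, the reporting side never has to be treated as a source of truth, and adding a new type of model needs no new tables.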
I would say the first way looks better.
Are there reasons you would want to do it the second way?
The first way follows normalization better and is closer to how most relational database schemas are developed.
The second way seems to be harder to maintain.
Unless there is a really good reason for doing it the second way I would go with the first method.
Given the description that you have given us, the answer is either.
In other words you haven't given us enough information to give a decent answer. Please describe what kind of queries you expect to perform on the data.
[Having said that, I think the answer is going to be the first one ;-)
As I imagine, even though they are different models, the data for each model is probably going to be quite similar.
But this is a complete guess at the moment.]
Given your updated edit, I'd say the first one definitely. As they have all the same data then they should go into the same table.
Another thing to consider in defining "better"--will end users be querying this data directly? Highly normalized data is difficult for end-users to work with. Of course this can be overcome with views but it's still something to think about as you're finalizing your design.
I do agree with the other two folks who answered: which form is "better" is subjective and dependent on what you're hoping to achieve. If you're hoping to achieve very quick queries that's one thing. If you're hoping to achieve high programmer productivity--that's a different goal again and possibly conflicts with quick queries.
Choice depends on required performance.
The best database is a normalized database. But a normalized database can have performance issues, and then you have to denormalize it.
The principle "Normalize first, denormalize for performance" works well.
It depends on the datamodel and the use case. If you ever need to report on a query that wants data out of the "models" then the former is preferable because otherwise (with the latter) you'd have to change the query (to include the new table) every time you added a new model.
Oh and by "former" we mean this option:
* type {cadillac, saturn, chevrolet}
@mson has asked the question "What do you do when a question is not satisfactorily answered on SO?", which is a direct reference to the existing answers to this question.
I contributed the following answer to that discussion, primarily critiquing the way the question was asked.
Quote (verbatim):
I looked at the original question yesterday, and decided not to contribute an answer.
One problem was the use of the term 'model' as in 'GM models' - which cited 'Chevrolet, Saturn, Cadillac' as 'models'. To my understanding, these are not models at all; they are 'brands', though there might also be an industry-insider term for them that I'm not familiar with, such as 'division'. A model would be a 'Saturn Vue' or 'Chevrolet Impala' or 'Cadillac Escalade'. Indeed, there could well be models at a more detailed level than that - different variants of the Saturn Vue, for example.
So, I didn't think that the starting point was well framed. I didn't critique it; it wasn't quite compelling enough, and there were answers coming in, so I let other people try it.
The next problem is that it is not clear what your DBMS is going to be storing as data. If you're storing a million records per 'model' ('brand'), then what sorts of data are you dealing with? Lurking in the background is a different scenario - the real scenario - and your question has used an analogy that failed to be sufficiently realistic. That means that the 'it depends' parts of the answer are far more voluminous than the 'this is how to do it' ones. There is just woefully too little background information on the data to be modelled to allow us to guess what might be best.
Ultimately, it will depend on what uses people have for the data. If the information is going to go flying off in all different directions (different data structures in different brands; different data structures at the car model levels; different structures for the different dealerships - the Chevrolet dealers are handled differently from the Saturn dealers and the Cadillac dealers), then the integrated structure provides limited benefit. If everything is the same all the way down, then the integrated structure provides a lot of benefit.
Are there legal reasons (or benefits) to segregating the data? To what extent are the different brands separate legal entities where shared records could be a liability? Are there privacy issues, such that it will be easier to control access to the data if the data for the separate brands is stored separately?
Without a lot more detail about the scenario being modelled, no-one can give a reliable general answer - at least, not more than the top-voted one already gives (or doesn't give).
Data modelling is not easy.
Data modelling without sufficient information is impossible to do reliably.
I have copied the material here since it is more directly relevant. I do think that to answer this question satisfactorily, a lot more context should be given. And it is possible that there needs to be enough extra context to make SO the wrong place to ask it. SO has its limitations, and one of those is that it cannot deal with questions which require long explanations.
From the SO FAQs page:
What kind of questions can I ask here?
Programming questions, of course! As long as your question is:
detailed and specific
written clearly and simply
of interest to at least one other programmer somewhere
What kind of questions should I not ask here?
Avoid asking questions that are subjective, argumentative, or require extended discussion. This is a place for questions that can be answered!
This question is, IMO, close to the 'require extended discussion' limit.