Data modeling in Datomic - data-modeling

I've been looking into Datomic, and it looks really interesting. But while there seems to be very good information on how Datomic works technically, I have not seen much on how one should think about data modeling.
What are some best practices for data modeling in Datomic? Are there any good resources on the subject?

Caveat Lector
As Datomic is new and my experience with it is limited, this answer shouldn't be considered best practices in any way. Take this instead as an intro to Datomic for those with a relational background and a hankering for a more productive data store.
Getting Started
In Datomic, you model your domain data as Entities that possess Values for Attributes. Because a reference to another Entity can be the Value of an Attribute, you can model Relationships between Entities simply.
At first look, this isn't all that different from the way data is modeled in a traditional relational database. In SQL, table rows are Entities and a table's columns name Attributes that have Values. A Relationship is represented by a foreign key Value in one table row referencing the primary key Value of another table row.
This similarity is nice because you can just sketch out your traditional ER diagrams when modeling your domain. You can rely on relationships just like you would in a SQL database, but don't need to mess around with foreign keys since that's handled for you. Writes in Datomic are transactional and your reads are consistent. So you can separate your data into entities at whatever granularity feels right, relying on joins to provide the bigger picture. That's a convenience you lose with many NoSQL stores, where it's common to have BIG, denormalized entities to achieve some useful level of atomicity during updates.
At this point, you're off to a good start. But Datomic is much more flexible than a SQL database.
Taking Advantage
Time is inherently part of all Datomic data, so there is no need to specifically include the history of your data as part of your data model. This is probably the most talked about aspect of Datomic.
In Datomic, your schema is not rigidly defined in the "rectangular shape" required by SQL. That is, an entity1 can have whatever attributes it needs to satisfy your model. An entity need not have NULL or default values for attributes that don't apply to it. And you can add attributes to a particular, individual entity as you see fit.
So you can change the shape of individual entities over the course of time to be responsive to changes in your domain (or changes to your understanding of the domain). So what? This is not unlike Document Stores like MongoDB and CouchDB.
The difference is that with Datomic you can enact schema changes atomically over all affected entities. Meaning that you can issue a transaction to update the shape of all entities, based upon arbitrary domain logic, written in your language[2], that will execute without affecting readers until committed. I'm not aware of anything close to this sort of power in either the relational or document store spaces.
Your entities are not rigidly defined as "living in a single table" either. You decide what defines the "type" of an entity in Datomic. You could choose to be explicit and mandate that every entity in your model will have a :table attribute that connotes what "type" it is. Or your entities can conform to any number of "types" simply by satisfying the attribute requirements of each type.
For example, your model could mandate that:
A Person requires attributes :name, :ssn, :dob
An Employee requires :name, :title, :salary
A Resident requires :name, :address
A Member requires :id, :plan, :expiration
Which means an entity like me:
{:name "Brian" :ssn 123-45-6789 :dob 1976-09-15
:address "400 South State St, Chicago, IL 60605"
:id 42 :plan "Basic" :expiration 2012-05-01}
can be inferred to be a Person, a Resident and a Member but NOT an Employee.
Datomic queries are expressed in Datalog and can incorporate rules expressed in your own language, referencing data and resources that are not stored in Datomic. You can store Database Functions as first-class values inside of Datomic. These resemble Stored Procedures in SQL, but can be manipulated as values inside of a transaction and are also written in your language. Both of these features let you express queries and updates in a more domain-centric way.
Finally, the impedance mismatch between the OO and relational worlds has always frustrated me. Using a functional, data-centric language (Clojure) helps with that, but Datomic looks to provide a durable data store that doesn't require mental gymnastics to bridge from code to storage.
As an example, an entity fetched from Datomic looks and acts like a Clojure (or Java) map. It can be passed up to higher levels of an application without translation into an object instance or general data structure. Traversing that entity's relationships will fetch the related entities from Datomic lazily. But with the guarantee that they will be consistant with the original query, even in the face of concurrent updates. And those entities will appear to be plain old maps nested inside the first entity.
This makes data modeling more natural and much, much less of a fight in my opinion.
Potential Pitfalls
Conflicting attributes
The example above illustrates a potential pitfall in your model. What if you later decide that :id is also an attribute of an Employee? The solution is to organize your attributes into namespaces. So you would have both :member/id and :employee/id. Doing this ahead of time helps avoid conflict later on.
An attribute's definition can't be changed (yet)
Once you've defined an attribute in your Datomic as a particular type, indexed or not, unique, etc. you can't change that later. We're talking ALTER TABLE ALTER COLUMN in SQL parlance here. For now, you could create a replacement attribute with the right definition and move your existing data.
This may sound terrible, but it's not. Because transactions are serialized, you can submit one that creates the new attribute, copies your data to it, resolves conflicts and removes the old attribute. It will run without interference from other transactions and can take advantage of domain-specific logic in your native language to do it's thing. It's essentially what an RDBMS is doing behind the scenes when you issue an ALTER TABLE, but you name the rules.
Don't be "a kid in a candy store"
Flexible schema doesn't mean no data model. I'd advise some upfront planning to model things in a sane way, same as you would for any other data store. Leverage Datomic's flexibility down the road when you have to, not just because you can.
Avoid storing large, constantly changing data
Datomic isn't a good data store for BLOBs or very large data that's constantly changing. Because it keeps a historical record of previous values and there isn't a method to purge older versions (yet). This kind of thing is almost always a better fit for an object store like S3. Update: There is a way to disable history on a per-attribute basis. Update: There is now also a way to excise data; however, storing references to external objects rather than the objects themselves may still stil be the best approach to handling BLOBs. Compare this strategy with using byte arrays.
Resources
Datomic mailing list
IRC channel #datomic on Freenode
Notes
I mean entity in the row sense, not the table sense which is more properly described as entity-type.
My understanding is that Java and Clojure are currently supported, but it is possible that other JVM languages could be supported in the future.

A very nice answer from bkirkbri. I want to make some additions:
If you store many entities of similar, but not equal "types" or schemas, use a type keyword in the schema, like
[:db/add #db/id[:db.part/user] :db/ident :article.type/animal]
[:db/add #db/id[:db.part/user] :db/ident :article.type/weapon]
[:db/add #db/id[:db.part/user] :db/ident :article.type/candy]
{:db/id #db/id[:db.part/db]
:db/ident :article/type
:db/valueType :db.type/ref
:db/cardinality :db.cardinality/one
:db/doc "The type of article"
:db.install/_attribute :db.part/db}
When you read them, get the entity-ids from a query and use datomic.api/entity and the eid and parse them by multimethods dispatching on type if nescessary, since it's hard to make a good query for all attributes in some more complex schema.

Related

How to represent in UML deletion consequences (nullify or cascading) in non-(whole / part) relations?

I think we agree that there is a correspondance between composition and delete cascading on one side and aggregation and nullify on delete on the other, in case we delete the whole instance in a whole / part relationship.
But what if there is no whole / part relationship between two classes:
I understand that we can only use composition and aggregation in cases where the whole / part hierarchy occurs: Car - Wheels, Apartment - Rooms and not in cases where this hierarchy does not occurs (e.g. Car - Driver classes).
So, how should we represent in UML this situation where there are deletion consequences in the database (nullify or cascading) but no "whole / part" relation?
Do we agree on the initial assumption?
The UML literature frequently refers to part-whole relationships regarding aggregation/composition. However, the definitions in the UML standard have evolved (see UML 2.5.1):
Sometimes a Property is used to model circumstances in which one instance is used to group together a set of instances; this is called aggregation. (...)
Shared: Indicates that the Property has shared aggregation semantics. Precise semantics of shared aggregation varies by application area and modeler.
Composite: Indicates that the Property is aggregated compositely, i.e., the composite object has responsibility for the existence and storage of the composed objects.
Composite aggregation is a strong form of aggregation that requires a part object be included in at most one composite object at a time. If a composite object is deleted, all of its part instances that are objects are deleted with it.
In other words, there is no precise semantic specified for the "aggregation" (i.e. shared aggregation) that would make a difference from a simple association: shared aggregation is a modeling placebo.
The relationship between database constraints and UML modeling are therefore not as straightforward as you would assume.
Close match?
Moreover, there is no general one-to-one mapping between a database schema and an UML model. More than one database schema could be used to implement the same UML class diagram. And conversely, more than one UML diagram may represent the design that is implemented by a given database schema. So the best we can do here, is to consider close-matches.
In your database, the table with the FOREIGN KEY constraint would correspond to a potential component in a composition, or an element of a shared aggregation, or an associated instance in a simple association :
a ON DELETE CASCADE could help to implement a composite aggregation: it's the only way in SQL to implement the kind of lifecycle management that you would expect in a composition: the components would be deleted when the composite is. It could as well implement an ordinary association, if some business rules/contracts (e.g. UML post conditions) would require such a related deletion.
a ON DELETE SET NULL could help to implement a shared aggregation, if its smeantics would be defined as you mean: if the aggregate is deleted, its elements would not be deleted, and could therefore be shared. But it could as well implement any ordinary association, since the deletion of an associated instance would not trigger a deletion either and the constraint would allow to maintain a clean referential integrity.
I agree, that composition means cascading delete, because according to UML the whole is responsible for the existence of the parts. A normal association means, you can delete any object without affecting any other objects that might have a link to it. UML doesn't define semantics for aggregation, so they will behave in the same way. But even if we take into account domain specific semantics for aggregation, I don't think there are examples where this is changed.
However, if you have an association with a multiplicity of 1 on one end, you cannot delete the object on this end, because the objects that have been linked to it would be invalid afterwards. This has nothing to do with composition or aggregation.
So, the remaining question is, how to express cascading delete if there is no whole-part relationship? Are there really examples where this happens? I don't see that Car - Driver, could not be in a whole-part relationship. Please bear in mind, that we are not talking about real cars or real people. We are talking about a software system that we want to represent knowledge about the real world for a specific purpose. And if the purpose is to issue boarding cards for cars and their drivers on a ferry, it makes perfect sense to view them as a composition.

Supertype/subtype Notation for ERD

This is more of a notation and 'proper procedure' type of question than anything.
Please see below an image of a few relations in my Enhanced ERD logical model. A patient can be an OUTPATIENT or a RESIDENT, but there are no attributes which are specific to OUTPATIENTS or RESIDENTS. There are relationships which are specific to the subtypes though, as only OUTPATIENTS can be associated with visits and only RESIDENTs can be associated with beds.
I am in the process of converting this to a physical data model. Obviously it makes sense to not have OUTPATIENT or RESIDENT tables and only a PATIENT table which contains a discriminator for the type of patient.
But what is the proper way to model this?
How do I now model the relationships to VISITS and BEDS while still maintaining the constraint that the discriminator must be of a certain value to qualify for those relationships?
Do I just forget about representing this constraint in the physical data model and make sure its implemented in the code when the tables are created?
Or is there a notation for physical data models which represents this type of constraint?
Section of CareCenter schema in Extended ERD
I have done much searching and cannot seem to find anything about this. All of the material I have found talks about creating subtypes for the purpose of isolating attributes specific to a subtype and not relationships specific to a subtype.
Advice or reference to data you have found that I was not able to is greatly appreciated!
(If you are really trying to make sense of my section of EERD it may be helpful to know that PATIENT is a subtype of a PERSON supertype.)
1  Modelling & Notation
1.1  ERD
is pre-Relational, 1960’s.  It cannot handle Relational Keys, which means it is hopeless for Relational Data Modelling.  In the Relational paradigm, the Relational Key (which is composite) is central, therefore the identity of each entity cannot be analysed, or modelled, or defined, in ERD.
There is no definition in ERD for the Relational concepts of Independent/Dependent tables, or Identifying/Non-Identifying relations, as it is meaningless without a Relational Key, which leads to much confusion when extending ERD and attempting to add those.  Further, as you have found, it has no notation of Domain/Datatype; Subtype; etc.
ERD never was a Standard.  Since it is un-useable, each person who attempts to use it for an SQL implementation has to “extend” ERD, and that results in a million notations, all of which are different and incomplete.  And which have to be explained to the reader.  Whereas a Standard needs no explanation because it is complete and documented, once.
Technically, ERD is not a model (which implies a mathematical, logical basis).  The semantics are primitive and nowhere near complete.  In fact, it is hopeless for modelling, period, even for pre-Relational filing systems.
1.2  IDEF1X
is the Standard for Relational Data Modelling, available since the 1980's, a Standard since 1993.  As such it is complete, whereas an extended ERD will never be complete, no matter how much you extend it.
The academics and authors of "textbooks" are clueless: as evidenced, they are 50 years behind the industry (definition) and 40 years behind (implementation on SQL platforms).  They are stuck in 1960's Record Filing Systems, which is physical, characterised by a RecordID, and they market it as "relational".  
Whereas Codd's Relational Model is completely logical, with a mathematical foundation, and provides far more Integrity; Power; and Speed.
To use ERD at all, you have to extend it, using some private notation, as you have done.  Instead of moving incrementally and painfully in the direction of IDEF1X, I suggest you just switch to it, and obtain the full benefit.  You may find this IDEF1X Introduction useful.
1.3  Logical vs Physical Data Model
There is a lot of nonsense written about the distinction.
The Logical model simply progresses, in iterations, to the point where it is stable, and then it is the Physical, which can be implemented on a specific SQL platform.  That is, there is no “convert” process.
In good Data Modelling tools, such as ERwin, it is one file, not two or three, and the Logical vs Physical is simply different views of that one file.  Eg. Domain in the Logical is DataType in the Physical. The Physical is of course specific to the target platform, eg. BOOLEAN in one is BIT in another.  If you are not using a Data Modelling tool, or using a poor one, sure, you will have separate files and you have to deal with the attendant synchronisation problems.
But what is the proper way to model this? How do I now model the relationships to visits and beds while still maintaining the constraint that the discriminator must be of a certain value to qualify for those relationships?
In this regard, the question is not about Logical vs Physical DM, all aspects re the question are implemented in both.
Yes, it is about notation. There is no notation problem, or difference (Logical vs Physical) in IDEF1X, because it is complete.
Do I just forget about representing this constraint in the physical data model
No, they are drawn in both, they are implemented in the DDL.
and make sure its implemented in the code when the tables are created?
If you use a Data Modelling tool, it squirts out SQL that is specific to the target platform. Otherwise, sure, you have to write your own DDL and make sure it is correct. In any case, the SQL is the same (not counting the difference in SQL flavours).
Caveat.  The pretend SQLs (all freeware “sqls” and Oracle) are not SQL compliant, their use of the term is not correct.  They cannot implement ordinary SQL features such as Constraints for Subtypes or ACID Transactions; etc.
Or is there a notation for physical data models which represents this type of constraint?
No, there is no difference in the notation in IDFE1X. Your question appears to be due to your extensions to ERD. First, the ERD is not useable for Relational data modelling, and cannot cope with Relational Keys or Subtypes.  Second, your extensions, good as they may be, do not have the ordinary Relational notation that IDEF1X has. Again, just switch to IDEF1X.
2  Codd’s Relational Model
As distinct from the variety of primitive nonsense written by the academics and in textbooks, misleadingly marketed as “relational”.
2.1  Subtype
I have done much searching and cannot seem to find anything about this. All of the material I have found talks about creating subtypes for the purpose of isolating attributes specific to a subtype and not relationships specific to a subtype.
There is no problem at all with a Subtype that has no attributes, same as there is no problem at all with a row that has no attributes.  Keep in mind that each entity is a Fact (one fact in one place), and the Fact is established by the Relational Key, to which the attributes are quite secondary (Codd’s 3NF properly understood).  Thus Resident and OutPatient are discrete Facts, whether each Subtype has attributes or not; whether the Fact exists for supporting a Foreign Key or not, is a separate issue.
Advice or reference to data you have found that I was not able to is greatly appreciated
You may find this Subtype document useful.  For examples, go to my profile, and look up any answers that interest you.
If you require even further detail, there is a long discourse regarding Subtypes and notation, that I had with the single academic who is trying to cross the great chasm between academia and reality in this field, who recently "found" IDEF1X from my data models.  I use a corrected form of IDEF1X (it was written by an academic), using the pre-existing IEEE notation when it is more precise.  The discourse goes into the whys and wherefores of the original IDEF1X vs the corrected form.  It is long at 70 posts, and there is a document that summarises it. Just ask.
Obviously it makes sense to not have OUTPATIENT or RESIDENT tables and only a PATIENT table which contains a discriminator for the type of patient.
No.  Each Subtype is a separate table, in the Logical models (first) and Physical (last), and the DDL. The physical is merely the implementation level of the Logical, you should not have anything in the Physical that is not in the Logical (you do not want to implement a thing that is not logical, not semantic; not Relational (which is absolutely logical, and unlimited).
Consider that the database may be expanded in the future, and you may have attributes in the Subtypes. 
- If the cluster is Exclusive, the Basetype table must have a Discriminator. 
- If it is Non-Exclusive, there is no Discriminator.
Supertype means something quite different, the academics use terms loosely and incorrectly. Eg. the notion of Superkey is hysterical, and anti-Relational.
2.2  Data Model
Here is the logical model in IDEF1X notation, showing attributes, not domains.  
I have corrected a few errors: given the level of modelling that you have demonstrated, I don't think they need a full explanation.
Person Subtype is Non-Exclusive (no Discriminator)
Patient Subtype is Exclusive (needs a Discriminator)
That is to be used in your code to determine the Subtype, otherwise JOIN to the Subtype
Since Resident::Bed is 1::1, the attributes (Bed FK) can be located in Resident.  
This treatment ensures that the Bed that a Patient may be assigned to, exists.
Consider:
When an OutPatient visits the CareCenter, is not the purpose to obtain a treatment of some kind, which must be recorded ?
Is not the treatment obtained under a Physician’s control, and shouldn’t the treatment details be recorded ?
Therefore an OutPatient obtains a Treatment, same as a Resident, and it is common, in the Basetype.
Visit can be eliminated
(again, whether the treatment is received by a Resident or OutPatient regards the Subtype).
The data model in a PDF.
2.3  Predicate
The Predicates can be read directly from the graphic model, the evaluation of such provides an excellent feedback loop to the modelling process.  Please read them and verify.
Eg. the Predicate Each Bed accommodates 0-to-n Residents would cause a brawl that can be avoided.
Again, the academics and authors do not understand the Relational Model, and thus they are clueless about Predicates. For a good introduction, refer to Relational Table Naming Convention, the Relationship, Verb Phrase section at the top, and the Predicate section at the end.
2.4  Null
Nulls in a Relational database are a clear indication of a Normalisation error. I have removed them.
3  Outstanding
The academics and authors understand only 1960's physical Record Filing Systems (placed in an SQL container for convenience), thus they understand only Referential Integrity.  They do not understand Codd's Relational Model, thus they cannot understand, and they cannot teach, Relational Integrity, which is logical, and provides far more data integrity than 50-year-obsolete filing systems.
Your model allows any Physician to treat any Patient, which is typical for a RFS, if you follow the literature, but sub-normal for Relational.
I doubt that that is what you want in a database.  I think you want only the treating Physician, the ProviderNo to treat the Patient.  
As the model progresses, you may wish to ensure that a Bed is assigned to one Resident only. I didn’t model it because I need to be told: is admission and bed assignment two administrative steps or one ?
Do you not require lookup tables for Speciality and TreatmentName ?
Data Modelling is an iterative exercise: it is only when a model is erected, and contemplated, that the issues are exposed, which leads to the next iteration.

Generic relation for database

I have to design a generic entity that would be able to refer to variated other entities.
In my example, that would be a commentary entity inside a web application. You could post commentaries on to users, classifieds, articles, varieties (botanical ones), and so on.
So that entity would be made like this:
As a matter of fact, the design (kind of) pattern would be this one:
What are the pros and cons of this kind of pattern?
What I see is:
Pros
It decreases the number of entities if the concept is the same (commentaries for example);
You can therefore easily manipulate heterogeneous objects;
You can aggregate these objects easily (e.g. this user's last commentaries in the whole site, presented easily in a same thread);
Cons
This allows you to fall in the ugly (you use it outrageously and your database and source code are ugly);
There is no control in the database, and this one must therefore be done inside the application code.
What are the performances impacts?
Conclusion
Is this kind of pattern suitable for a relational database? How can we do then?
Thank you by advance.
One more con :
This scheme relies on a mapping between values and names for the "entities" referred to by those values. Think of all the fun you'll have resolving issues that in the TEST system, the ORDER entity has number 734 but in production, it has number 256. You can use the entity names themselves as the values of your entity_id stuff, but you will never be able to avoid hardcoding values for them in your programs (or, say, in view definitions) anyway. Thereby defeating whatever advantage it was you thought you could win.
This kind of scheme is a disease mostly suffered by OO programmers. They see structures that are largely similar and they have this instinctive reflex "I must find a way to resue the existing thing for this". Forgetting that database design is not program design.
EDIT
(if it wasn't clear, this means my answer to your question "Is this kind of pattern suitable for a relational database?" is a principled "NO".)
This is the classic Polymorphic Association anti-pattern. There are a number of possible solutions:
1) Exclusive Arcs e.g. for the Commentary entity
Id
User_Id
Classified_Id
Article_Id
Variety_Id
Where User_Id, Classified_Id, Article_Id and Variety_Id are nullable and exactly one must be not null.
2) Reverse the Relationship e.g remove the Target_Entity and Target_Entity_Id from the Commentary entity and create four new entities
User_Commentary
Commentary_Id
User_Id
Classified_Commentary
Commentary_Id
Classified_Id
Article_Commentary
Commentary_Id
Article_Id
Variety_Commentary
Commentary_Id
Variety_Id
Where Commentary_Id is unique and relates to the Id in Commentary.
3) Create a super-type entity for User, Classified, Article and Variety and have the Commentary entity reference the unique attribute of this new entity.
You would need to decide which of these approaches you feel is most appropriate in your specific situation.

Entity Attribute Value model (EAV) and how to achieve it with cfml?

I'm trying to figure out how to implement this relationship in coldfusion. Also if anyone knows the name for this kind of relationship I'd be curious to know it.
I'm trying to create the brown table.
Recreating the table from the values is not the problem, the problem that I've been stuck with for a couple of days now is how to create an editing environment.
I'm thinking that I should have a table with all the Tenants and TenantValues (TenantValues that match TenantID I'm editing) and have the empty values as well (the green table)
any other suggestions?
The name of this relationship is called an Entity Attribute Value model (EAV). In your case Tenant, TenantVariable, TenantValues are the entity, attribute and value tables, respectively. EAV is attempt to allow for the runtime definition or entities and is most found in my experience backing content managements systems. It has been referred to an as anti pattern database model because you lose certain RDBMS advantages, while gaining disadvantages such as having to lock several tables on delete or save. Often a suitable persistence alternative is a NoSQL solution such as Couch.
As for edits, the paradigm I typically see is deleting all the value records for a given ID and inserting inside a loop, and then updating the entity table record. Do this inside of a transaction to ensure consistency. The upshot of this approach is that it's must easier to figure out than delta detection algorithm. Another option is using the MERGE statement if your database supports it.
You may want to consider an RDF Triple Store for this problem. It's an alternative to Relational DBs that's particularly good for sparse categorical data. The data is represented as triples - directed graph edges consisting of a subject, an object, and the predicate that describes the property connecting them:
(subject) (predicate) (object)
Some example triples from your data set would look something like:
<Apple> rdf:type <Red_Fruit>
<Apple> hasWeight "1"^^xsd:integer
RDF triple stores provide the SPARQL query language to retrieve data from your store much like you would use SQL.

Database Normalization Vocabulary

There is lot or material on database normalization available on Steve's Class and the Web. However, I still seem to lack on very definite reasons on explaining normalization.
For example, for a simple design such as a table Item with a Type field, it makes sense to have the Type as a separate table. The reason I forwarded for that was if in future any need arose to add properties to the Type, it would be much easier with a separate table already existing.
Are there more reasons which can be shown to be obvious?
Check these out too:
An Introduction to Database Normalization
A Simple Guide to Five Normal Forms
in Relational Database Theory
This article says it better than I can:
There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations. A customer address change is much easier to implement if that data is stored only in the Customers table and nowhere else in the database.
What is an "inconsistent dependency"? While it is intuitive for a user to look in the Customers table for the address of a particular customer, it may not make sense to look there for the salary of the employee who calls on that customer. The employee's salary is related to, or dependent on, the employee and thus should be moved to the Employees table. Inconsistent dependencies can make data difficult to access because the path to find the data may be missing or broken.
following links can be useful:
http://support.microsoft.com/kb/283878
http://neerajtripathi.wordpress.com/2010/01/12/normalization-of-data-base/
Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization. In his own words:
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
— E.F. Codd, "Further Normalization of the Data Base Relational Model"
Taken word-for-word from Wikipedia:Database normalization

Resources