Uncertainty in developing a database model

I'm trying to develop a database model for candidates, their registered exams, and the results of the exams when they are taken.
This is what I've done so far; however, I'm unsure whether I am on the right track, especially from the Examination table to the ExaminationResult table.
How easy will it be to write an INSERT SQL statement to populate ExaminationResult for a particular candidate?
The examination types are categorised into science, art, and social science. They each have 4 components.

Note on Progression
Given the fact that the Question changes substantially (in clarifying the requirement, not its scope) in response to my Response and TRD, this is going to take some back-and-forth. Let's identify Steps: your Step numbers are odd, starting from 1; mine, in response, are even. Parts of previous Response Steps have become obsolete; they may no longer make sense.
I would suggest a bounty, except for the fact that you have few points.
Response Step 2 to Initial Question & Step 1 Diagram
This is what I've done so far.
You have done some good work, but it is too early for assigning PKs. Besides, assigning an ID on every file as a starting point will cripple the modelling process; the result will not be a database. You have to model the data (not the database) first, then assign Keys when the entities are clear and stable. So drop all your IDs and PKs and model the data, as data. Forget about what you want to do with the data (i.e. forget the app).
How easy will it be to write an INSERT SQL statement to populate ExaminationResult for a particular candidate?
Right now you can't. You have no relationship between Candidate and Examination[Result]. That is not a problem, because the modelling is incomplete at this stage; when it is complete, the code will be simple.
The entity Course is implied, but it is missing.
However, I'm unsure whether I am on the right track, especially from the Examination table to the ExaminationResult table.
You are on the right track with some of the other files, but the Examination cluster needs work. This will take a bit of back-and-forth. Once you answer the questions in the comments, we can proceed.
The main issue is this: how is Examination identified?
An ID field does not identify anything, nor does it provide uniqueness in the data, which is required if you want data integrity. IDs result in a Record Filing System with no integrity; however, it appears you want a database with data integrity. Is that correct?
Go back to the user and discuss how courses and components are identified, what codes they use, etc. Those are the natural Keys that they use to identify their data, that they will enter into the system when they need look something up, or to enter examination results.
Eg. It is not reasonable to contemplate an Examination that exists independently (as you have modelled it). People do not go to a hall and sit for any old exam. The exam exists only in the context of a course, they sit for an exam for a course.
Then the course, and not the exam, has components, which are examined. And each course has a different number of components.
Eg. a Course which is identified as ENG101 for English Literature year 1
And then the components within that. Eg. 2b Short essay on poetry.
They may need to identify the year and semester of the course as well, in which case, you need a CourseOffering per semester.
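Purely as a hedged illustration of what natural Keys for that might look like in SQL (the table names, column names, and datatypes here are my assumptions for discussion, not a final design):

    -- Illustrative sketch only: natural Keys, no ID fields.
    CREATE TABLE Course (
        CourseCode CHAR(6)     NOT NULL,  -- eg. ENG101
        Name       VARCHAR(60) NOT NULL,  -- eg. English Literature year 1
        CONSTRAINT PK_Course PRIMARY KEY (CourseCode)
    );

    CREATE TABLE CourseComponent (
        CourseCode    CHAR(6)     NOT NULL,
        ComponentCode CHAR(2)     NOT NULL,  -- eg. 2b
        Description   VARCHAR(60) NOT NULL,  -- eg. Short essay on poetry
        CONSTRAINT PK_CourseComponent PRIMARY KEY (CourseCode, ComponentCode),
        CONSTRAINT FK_CourseComponent_Course
            FOREIGN KEY (CourseCode) REFERENCES Course (CourseCode)
    );

    -- If year and semester matter, a CourseOffering per semester:
    CREATE TABLE CourseOffering (
        CourseCode CHAR(6)  NOT NULL,
        Year       SMALLINT NOT NULL,
        Semester   SMALLINT NOT NULL,
        CONSTRAINT PK_CourseOffering PRIMARY KEY (CourseCode, Year, Semester),
        CONSTRAINT FK_CourseOffering_Course
            FOREIGN KEY (CourseCode) REFERENCES Course (CourseCode)
    );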
Consider this, as a discussion point. Courier is example data, blue is Key, green is non-key:
TRD Step 2
Response Step 4
Response to Question & Description
This is what I've done so far.
My previous response still applies:
You have done some good work, but it is too early for assigning PKs. Besides, assigning an ID on every file as a starting point will cripple the modelling process; the result will not be a database. You have to model the data (not the database) first, then assign Keys when the entities are clear and stable. So drop all your IDs and PKs and model the data, as data. Forget about what you want to do with the data (i.e. forget the app).
You have not addressed that issue, which I identified in your Step 1 Diagram, in your Step 3 Diagram. It appears, from the evidence, that you might be happy with IDs as "Primary Keys" (they aren't), despite the hindrance having been identified to you. That means your understanding of the data is crippled, and the progress of your diagrams will be slow.
My previous response still applies:
An ID field does not identify anything, nor does it provide uniqueness in the data, which is required if you want data integrity. IDs result in a Record Filing System with no integrity; however, it appears you want a database with data integrity. Is that correct?
You must answer these questions; otherwise your design cannot proceed. These are severe errors that must be corrected. One cannot build on, or progress, a foundation that contains severe errors.
1. Could you please confirm: you do want a Relational Database, with the integrity and performance that Relational Databases are capable of, and that is easy to code against, as opposed to a Record Filing System, with no integrity or speed, that will be difficult to code against. Correct?
2. If [1] is correct: since ID fields as "Primary Keys" do not provide row uniqueness, which is demanded for a Relational Database, how exactly do you intend to provide the required row uniqueness? Alternatively, are you happy to have an RFS that is full of duplicate rows (each with a unique record ID)?
How easy will it be to write an INSERT SQL statement to populate ExaminationResult for a particular candidate?
My previous response still applies:
Right now you can't. You have no relationship between Candidate and Examination[Result]. That is not a problem, because the modelling is incomplete at this stage; when it is complete, the code will be simple.
Ok, in your Step 3 Diagram, you have drawn a line between the Candidate file and the ExaminationResult file (as opposed to inserting a relationship in a database).
In a record filing system, sure, you can just draw a line between any two files, insert the relevant ID field, and hey presto, you have "linked" or "connected" or "mapped" the two files.
But database design (as opposed to file design) does not progress like that: you cannot just draw a line between any two objects, insert the relevant ID field, and hey presto, create a database relationship. No. There is no basis, no integrity, in the dashed line that you have drawn. Eg. in your Step 3 Diagram, any Candidate can be related to any Examination[Result].
That is "normal" or "ordinary" in record filing systems, but in a database, it is something to be recognised and understood as an error, and thus prevented. Because we expect integrity in a database, and because it can be prevented, easily.
However, I'm unsure whether I am on the right track, especially from the Examination table to the ExaminationResult table.
My previous response still applies:
You are on the right track with some of the other files, but the Examination cluster needs work. This will take a bit of back-and-forth. Once you answer the questions in the comments, we can proceed.
The main issue is this: how is Examination identified?
An ID field does not identify a row (it identifies a record, which has no relevance whatsoever in a database).
The same two problems, (a) lack of a valid identifier and (b) lack of row uniqueness, exist with your Candidate, Component, and ExaminationResult files.
Response to Diagram as a Diagram (as opposed to the content)
You have improved it over your Step 1 Diagram, and in response to my Response Step 2, great. But the relationships (most of them) are still incorrect. And the basis of Candidate::Examination is still not resolved.
It appears to me that you are not clear about the notation (notches; circles; crow's feet) and precisely what they mean at the parent and child ends. So you need to learn that first, and then draw the diagram, rather than the other way round.
It is great that you are using a Notation that is meaningful, and many details are shown (many people don't; they draw nice-looking diagrams that lack the detail required for a full understanding of the model). That means that every notch, circle, and crow's foot has specific meaning, and must be drawn correctly, in order to convey that meaning to the reader.
Entities do not exist in isolation; there must always be a parent first, in order for the child to be a child of the parent. There is no such thing as "equal". Dependency is always in one direction.
Your relationships that are 1-and-only-1 on one side and 1-and-only-1 on the other side are incorrect; they indicate a Normalisation error. The field in the subordinate record can be Normalised into the ordinate record.
Eg. AdmissionLetter is not a separate file; some form of AdmissionLetter identifier (not an ID field) should be located in Candidate.
Eg. Title::Candidate is a drawing error; it should be 1 at the Title end and 0-to-many at the Candidate end.
In a data model, bold (by convention) means a migrated Foreign Key. The Primary Key that is migrated is not bold.
Response to Diagram Content
From your replies, the term Subject trumps the term Component; Category trumps various loosely-identified elements into one clear entity.
It is not reasonable to contemplate an Examination that exists independently (as you have modelled it).
People do not go to a hall and sit for any old exam, any old Subject. The exam exists only in the context of a Subject, they sit for an exam for a Subject.
I accept that the Examination is one sitting, for four Subjects.
I accept that the four Subjects are defined by a Category.
I accept that the Candidate is registered for a Category.
Thus the exam exists only in the context of a Subject, which exists only in the context of a Category, and the Candidate sits for an exam which is a Category, which contains four (the number does not matter) Subjects.
Having resolved that, two questions remain:
1. Do you need to record an Examination as an event, independent of the Candidates who sit for that event? Eg. Examination(Location, DateTime)?
2. Does the Examination event examine Candidates in one, or more than one, Category?
The notion of four Subjects implemented as four repeated fields in one record breaks First Normal Form, which demands that repeating fields be Normalised into separate records in a child file.
Therefore, for both your Component and ExaminationResult files, that issue needs to be resolved.
Note that the problem being repeated in two separate files is a second alarm that it is an error.
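To make the repeated-fields problem concrete, here is a minimal before/after sketch in SQL (names and datatypes are my assumptions; the Keys are firmed up later in this discussion):

    -- Before (breaks Normalisation): four repeated fields in one record.
    -- ExaminationResult( CandidateID, Subject1Result, Subject2Result,
    --                    Subject3Result, Subject4Result )

    -- After: one row per Subject examined, in a child file.
    CREATE TABLE ExaminationResult (
        CandidateID CHAR(9)  NOT NULL,
        SubjectCode CHAR(4)  NOT NULL,
        Result      SMALLINT NOT NULL,
        PRIMARY KEY (CandidateID, SubjectCode)
    );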
I have clarified the Category/Subject issues for you, and resolved the Normalisation error.
I have given simple identifiers for Categories and Subjects.
If you do not implement that, you will not have integrity between the Candidate and the Subject they are being Examined for. As well, you will suffer various problems when you get to the coding stage.
I have no idea what you are trying to do with exComp; therefore I have no response. Perhaps you can say a few words about it.
Thus far, there is still no reasonable way of relating Candidates to Examinations or ExaminationResults. That is, it has no basis, nothing has been defined as the basis for the relationship, and thus the relationship has no integrity.
On the basis of what I have been able to ascertain thus far, there must be some sort of registration for an exam. Otherwise you would not know that a Candidate is sitting for an exam.
When the Candidate registers, they register for an exam, and that exam is defined (and therefore constrained) by a Category. Otherwise any Candidate can sit for any exam, which I believe, you would like to prevent.
Further, the [four] exam Subjects that they sit for, should be constrained by the Category that they registered for.
You do want to ensure that you do not record an Economics exam result for a Candidate who is registered for Science, correct?
I have determined that the basis of an exam is the Registration. That is the event, the fact, the recording of which, establishes that a Candidate will sit for an exam.
The identifier virtually jumps out at you: it is CategoryCode plus CandidateID. Voila! We have row uniqueness. Magnifique! We have integrity.
Now the integrity of ExaminationResult can be implemented: it is constrained to the CandidateRegistration::Category and to the Category::Subject.
To be Resolved: Do you need to identify the fact of a Candidate registering for an examination (RegistrationDate, AdmissionLetter or whatever) vs the fact that the Candidate sat for the examination (eg. ExaminationDate)? A sort of roll call.
Right now, I have modelled that as a single fact with no differentiation, and the table is called Examination because you seem to be focussed on that.
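To make that concrete, here is a hedged SQL rendering of the resolved structure (table, column, and datatype choices are mine for illustration, not the TRD's). The point is that ExaminationResult shares CategoryCode with both of its parents, so a result can only be recorded for a Subject within the Category the Candidate registered for, and the INSERT the asker wanted becomes a one-liner:

    CREATE TABLE Category (
        CategoryCode CHAR(2)     NOT NULL PRIMARY KEY,  -- eg. SC, AR, SS
        Name         VARCHAR(30) NOT NULL
    );

    CREATE TABLE Subject (
        CategoryCode CHAR(2)     NOT NULL,
        SubjectCode  CHAR(4)     NOT NULL,              -- eg. PHYS
        Name         VARCHAR(30) NOT NULL,
        PRIMARY KEY (CategoryCode, SubjectCode),
        FOREIGN KEY (CategoryCode) REFERENCES Category (CategoryCode)
    );

    CREATE TABLE Candidate (
        CandidateID CHAR(9)     NOT NULL PRIMARY KEY,
        Name        VARCHAR(60) NOT NULL
    );

    -- The Registration event: the basis of the exam.
    CREATE TABLE Examination (
        CategoryCode     CHAR(2) NOT NULL,
        CandidateID      CHAR(9) NOT NULL,
        RegistrationDate DATE    NOT NULL,
        PRIMARY KEY (CategoryCode, CandidateID),
        FOREIGN KEY (CategoryCode) REFERENCES Category (CategoryCode),
        FOREIGN KEY (CandidateID)  REFERENCES Candidate (CandidateID)
    );

    -- Constrained to the registered Category AND that Category's Subjects.
    CREATE TABLE ExaminationResult (
        CategoryCode CHAR(2)  NOT NULL,
        CandidateID  CHAR(9)  NOT NULL,
        SubjectCode  CHAR(4)  NOT NULL,
        Score        SMALLINT NOT NULL,
        PRIMARY KEY (CategoryCode, CandidateID, SubjectCode),
        FOREIGN KEY (CategoryCode, CandidateID)
            REFERENCES Examination (CategoryCode, CandidateID),
        FOREIGN KEY (CategoryCode, SubjectCode)
            REFERENCES Subject (CategoryCode, SubjectCode)
    );

    -- Simple, and it cannot record an Economics result for a Science
    -- Candidate (example values are invented):
    INSERT INTO ExaminationResult (CategoryCode, CandidateID, SubjectCode, Score)
    VALUES ('SC', '202400017', 'PHYS', 68);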
Predicate
These days, people seem to throw themselves at drawing a diagram, without understanding either the basics of a Relational Database, or the exercise of modelling data. Predictably, that results in an ill-defined diagram (many relevant details are omitted) [thankfully, your diagram has some definition], and it produces a record filing system with no integrity, no relational power, no speed, instead of a Relational Database with integrity, power, and speed.
One concept that is often missing is Predicates. A competent reader can read a good data model, and ascertain the Predicates, because they are drawn in the model, in the form of notation, but a novice doesn't understand the notation, or the relevance of the various items, and therefore will miss the Predicates. In sum, the Predicates are all the constraints that are placed on the data:
Row Identification:
The basis of its existence, and how it is identified: Independent (square corners) or Dependent (round corners)
Row Uniqueness: Primary and Alternate Keys (note, IDs are not Keys)
Relationships between rows:
Identifying (solid lines); or Non-identifying (dashed lines)
Meaning, relevance, purpose: the all-important Verb Phrase
Further, a novice cannot determine the Predicates when there is no diagram, or when the diagram is poor, or when they are designing the filing system and drawing the diagram themselves. Thus they do not identify the relevant Predicates in their diagram.
Predicates are very important during the modelling exercise, in that as well as the model expressing the Predicates, the Predicates confirm the accuracy of the model; it is a feedback loop. It is an essential part of the modelling exercise. Since I am executing the modelling task for you, I am working out the Predicates as I perform that task; they are obvious to me. But they may not be obvious to you.
When the data model is published, and ready for discussion with the users, these Predicates are incorporated into it. They come under the heading of Business Rules, and form a part of that, because that is the way the user perceives them. Consequently, during the walkthroughs and discussions, the Predicates (as well as the other stated Business Rules) are either confirmed or denied by the user. They need to be stated explicitly, because unlike the technically educated developer, the user cannot be expected to read all the relevant Predicates from the notation in a good data model.
In this situation, I am the modeller, and you are the "user". Thus I have decided to provide the Predicates for you, explicitly. So that you can confirm or deny them, and thus we can progress the modelling exercise. Once you get used to reading the Predicates from a good data model, you will not need to have them declared explicitly for you. Again, Predicates are very important because they verify (or not) the accuracy of the model. So please read them carefully and comment on any Predicates that you do not completely agree with, or that you do not understand.
Of course, it is not necessary to explicitly declare all the Predicates; there are just too many. We declare just the more relevant ones, those that relate to:
(a) rows (tables), the basis for their existence
(b) their identification
(c) all dependencies
(d) relationships, both sides (one side is the Verb Phrase).
Step 4 TRD
I have implemented all the above, as detailed. Please consider this TRD as a discussion platform for the next iteration, and comment. Courier indicates example data, blue indicates Key values, green indicates non-key values:
Step 4 TRD
Response Step 6 to Chat Step 5
All issues discussed have been resolved, and implemented in the model. Sorry, I do not have time right now to post details; this simply identifies the updated models.
Entity-Relation and full Predicates on page 1
All resolved issues have been implemented.
Predicates
Now that it is stable, I am giving you the second side of the Relation Predicates (child-to-parent). And now that you understand them, I have deleted the repeated, annoying "Each" that is demanded for novices.
Entity-Relation-Key on page 2
Now that the TRD is stable, we are ready to proceed to Determination of Keys
(Second only to Normalisation, Key Determination is a critical part of the modelling exercise. The two tasks are normally performed side by side; they are inseparable, and I have already determined the Keys. In this case, given the limitations of the communication media, I am presenting it as a sequential step.)
Here, I use an Extension to the IDEF1X Notation that allows me to concentrate on the components that are relevant to the task; I expect that it is self-explanatory. Only the Key columns are given. Foreign Keys are not bold (as they are in the DM). All that is intended to make it easy on the eye.
Most tables have one Key (Primary). Where there are two Keys (Primary and Alternate), the AK is below the line.
This is my recommendation for the Keys, as requested, for your review.
Step 6 TRD and TRK 6

Related

Is it Important to Understand Each Normal Form?

I have been studying database design and programming for quite some time now, but I still can't get a grasp of each individual normal form (1NF, 2NF, 3NF).
Seeing as anytime the data is in Third Normal Form, it is already automatically in Second and First Normal Form, can the whole process actually be accomplished less tediously by fully normalizing the data from the start? I can accomplish this easily by arranging the data so that the columns in each table, other than the primary key, are dependent only on the whole primary key.
How important is it to understand each individual normal form if we can simply fully normalize the data less tediously by doing what I have described?
EDIT: What I'm ultimately asking is: Is it important to go through the steps of each normal form when normalizing data, or is it appropriate to just go to Third Normal Form seeing as the result is ultimately the same?
I highly recommend understanding each normal form, as this will help you determine or investigate any issues a current database may have. You might not have the perfect scenario each time, and understanding each normal form will help you to understand the current problems with an existing database design, if there are any.
Going step by step through the different normal forms will help you to figure out why we do this, which is to achieve the goals specified by E. F. Codd.
The objectives of normalization were stated as follows:
1. To free the collection of relations from undesirable insertion, update and deletion dependencies.
2. To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs.
3. To make the relational model more informative to users.
4. To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
Here is an image to help you understand the different normal forms better.
P.S. BCNF is actually 3.5NF, not 4NF.
It's right that when a table is in 3NF, it is also in 2NF and 1NF. However, the condition for 3NF is not only that all the data depends on the whole candidate key. It also requires that the table already be in 2NF, meaning that every attribute that is not part of the candidate key fully depends on the candidate key, and that it be in 1NF, meaning that every column is atomic. So yes, it is important to understand every NF if you want to have a table in 3NF.
I'll try to explain the Normal Forms to you:
1NF
1NF states that every column has to be atomic. This means there shouldn't be multiple items of data in one column. For example, someone's address shouldn't be stored in one column, but should be split into the country, the state, the street, and so on. Each of these pieces of data should then be stored in its own column.
2NF
2NF states that every attribute that is not part of the candidate key has to be identifiable only by the whole candidate key. That means, for example, that you shouldn't store books and printing labels in one table, because then the name of the book would depend only on the id of the book, while the printing label's name would depend only on the id of the printing label, and not on the whole candidate key.
3NF
3NF states nearly the same as 2NF: no column is allowed to depend on a non-candidate-key column. That means, for example, that you shouldn't store the ISBN of a book and an id of the book in the same table with only the id being the candidate key, as you'd only need the ISBN to find the name of the book.
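For illustration only, the book example might be decomposed like this (my own table and column names, and a simplified design):

    -- 2NF: books and printing labels in separate tables, each fully
    -- dependent on its own key, linked by a foreign key.
    CREATE TABLE PrintingLabel (
        LabelID INTEGER     NOT NULL PRIMARY KEY,
        Name    VARCHAR(60) NOT NULL
    );

    -- 3NF: the title depends on the ISBN alone, so the ISBN is the key;
    -- no non-key column depends on another non-key column.
    CREATE TABLE Book (
        ISBN    CHAR(13)    NOT NULL PRIMARY KEY,
        Title   VARCHAR(80) NOT NULL,
        LabelID INTEGER     NOT NULL REFERENCES PrintingLabel (LabelID)
    );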
If this doesn't explain the matter well enough, there's a lot of information online regarding the normal forms (like Wikipedia).
It's not simply the case that if it's in 3NF it's in 1NF and 2NF; rather, for it to be in 2NF it has to be in 1NF beforehand, and the same goes for 3NF: to normalise to 3NF it has to satisfy 1NF and 2NF first.
1st normal form states that no multivalued attribute should be present.
2NF states that there should be no partial dependency of a non-prime attribute on the key.
3NF states that there should be no transitive dependency.
The only NF (normal form) that matters is 5NF.
A relation (value or variable) is in 5NF when for every way it can be losslessly decomposed the components can be joined back in some order where the common columns of each join are a superkey of the original. (Fagin's PJ/NF paper's membership algorithm.)
This allows a table to be the join of others with overlapping meanings but without update anomalies. (Although update anomalies cease at ETNF, between 4NF & 5NF.)
Anyway, if you wanted a lower NF you should normalize to 5NF and then denormalize. The main reason people settle for lower NFs is ignorance. There are certain costs and benefits, but people don't know or address them: code must restrict updates to account for the problematic update anomalies. Normalization to a given NF is not done by going through lower NFs; one uses an appropriate algorithm for the NF one wants. (This is made clear by most textbooks, although some wrongly say to move through lower NFs; putting a design into a lower NF can prevent good higher-NF versions of the original from turning up later.)
PS: There is no single notion of 1NF, and all it has in common with the higher NFs is that both seek "better" designs.
From what I recall of the process, it's a method that you follow to get to a state where the storage and search facilities of the database are fully optimised. Yes, 3NF does encapsulate the rules below it, 1st and 2nd, but it is far easier to unpick the data if you start at the easier forms of normalization, to see whether your data is in an efficient format for storage in an RDBMS or SQL-based database. Jumping in straight at a higher normal form makes the whole process harder and more intimidating for beginners, and can lead to not analysing the data correctly. To be honest, it will make hard work of difficult data structures that are not just your usual invoice, invoice-lines, address stuff that you tend to deal with day in and day out. Going through the process of normalization, there is sometimes value in unpicking data structures that were not obvious from the start, which not only makes your data more efficient but also helps you reason about what you are trying to accomplish.

Does the min/max notation relationship match what I am trying to get?

Is this the right way to represent the relationship described in the text on the picture? This is in min/max notation.
http://s7.postimg.org/holux2uwb/image.jpg
There is a huge lack of context here, so I'll just attempt an answer blindly.
In many cases while modeling data, an order is usually seen as an event. I do not know exactly what a "Bugel Card" is, but if it is the name of an entity, such as a noun, and it has properties/attributes that must be stored, as I suspect is the case with the Customer, then we have two entities that have a relationship: the Customer entity and the Bugel Card entity. The resulting connection/relationship/link forms the Order event.
If in an Order a Customer ALWAYS uses AT LEAST 1 "Bugel Card", and not more than that, then we have a cardinality (following the min/max notation) of (1,1) between the Customer and Bugel Card entities, on both sides. For (1,1) relationships, it is at the data modeler's discretion on which side the relationship between the entities will be set, that is, where the foreign key will go (once you decompose the Conceptual Model). It is always recommended to leave the foreign key on the side where, in the future, the relationship can become "many" (see the sketch below).
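Purely as an illustration (the real context is unknown, and all names here are invented), a (1,1) relationship with the foreign key on the side that could later become "many" might look like:

    CREATE TABLE Customer (
        CustomerID INTEGER     NOT NULL PRIMARY KEY,
        Name       VARCHAR(60) NOT NULL
    );

    -- The FK sits on the BugelCard side; UNIQUE keeps the relationship
    -- 1:1 for now, and dropping UNIQUE later turns it into 1:many.
    CREATE TABLE BugelCard (
        CardID     INTEGER NOT NULL PRIMARY KEY,
        CustomerID INTEGER NOT NULL UNIQUE REFERENCES Customer (CustomerID)
    );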
If you can improve the context here a little more, I can give you a more accurate (more correct) answer, and remember:
Do not model data without a full context. When you go to an Entity Relationship Diagram starting from the Conceptual Model, you need a context, and one that is very well described. Without a full context there is no diagram, and as a result there is no database schema (much less a system to use and manage).
Other than that, it is not possible to model entities without properties/attributes. Without them, an entity is nothing, because in its decomposition there will be no columns to be created, and so there will be no data to be persisted. Even if your modeling process defers defining the attributes until later, you can end up confusing yourself and/or forgetting something. This is prone to errors.
To be honest, there is no standard way of modeling data. What I have given so far are just data modeling tips. It is up to you what you want to do, and how you want to do it.
Any questions, or anything else you need, please comment and I will help you.

Database Design: Explain this schema

Full disclosure... trying feverishly here to learn more about databases, so I am putting in the time, and I have also tried to get this answer from the source, to no avail.
Barry Williams from databaseanswers has this schema posted.
Clients and Fees Schema
I am trying to understand the split of address tables in this schema. It's clear to me that the Addresses table contains the details of a given address. The Client_Addresses and Staff_Addresses tables are what gets me.
1) I understand the use of Primary Foreign Keys as shown, but I was under the assumption that when these are used you don't have a resident Primary Key in that same table (date_address_from in this case). Can someone explain the reasoning for both, and put into words how this actually works out?
2) Why would you use date_address_from as the primary key instead of something like client_address_id as the PK? What if someone enters two addresses in one day: would there be conflicts in his design? If so, or if not, why?
3) Along the lines of normalization... since both date_address_from and date_address_to are the same in the Client_Addresses and Staff_Addresses tables, should those fields just not be included in the main Addresses table?
Evaluation
First an Audit, then the specific answers.
This is not a Data Model. This is not a Database. It is a bucket of fish, with each fish drawn as a rectangle, and where the fins of one fish are caught in the gills of another, there is a line. There are masses of duplication, as well as masses of missing elements. It is completely unworthy of use as an example to learn anything about database design from.
There is no Normalisation at all; the files are very incomplete (see Mike's answer; there are a hundred more problems like that). The other_details and eg.s crack me up. Each element needs to be identified and stored: StreetNo, ApartmentNo, StreetName, StreetType, etc., not line_1_number_street, which is a group.
Customer and Staff should be normalised into a Person table, with all the elements identified.
And yes, if Customer can be either a Person or an Organisation, then a supertype-subtype structure is required to support that correctly.
So what this really is, in technically accurate terms, is a bunch of flat files, with descriptions for groups of fields. Light years distant from a database, or a relational one. Not ready for evaluation or inspection, let alone building something with. In a Relational Data Model, that would be approximately 35 normalised tables, with no duplicated columns.
Barry has (wait for it) over 500 "schemas" on the web. The moment you try to use a second "schema", you will find that (a) they are completely different in terms of use and purpose, (b) there is no commonality between them, and (c) let's say there was a customer file in both; they would be different forms of customer files.
He needs to Normalise the entire single "schema" first, then present the single normalised data model in 500 sections or subject areas.
I have written to him about it. No response.
It is important to note also that he has used an unrecognisable diagramming convention. The problem with these nice, interesting pictures is that they convey some things, but they do not convey the important things about a database or a design. It is no surprise that a learner is confused; it is not even clear to experienced database professionals. There is a reason why there is a standard for modelling Relational Databases, and for the notation in Data Models: they convey all the details and subtleties of the design.
There is a lot that Barry has not read about yet: naming conventions; relations; cardinality; etc, too many to list.
The web is full of rubbish, anyone can "publish". There are millions of good- and bad-looking "designs" out there, that are not worth looking at. Or worse, if you look, you will learn completely incorrect methods of "design". In terms of learning about databases and database design, you are best advised to find someone qualified, with demonstrated capability, and learn from them.
Answer
He is using composite keys without spelling it out. The PK for client_addresses is (client_id, address_id, date_address_from). That is not a bad key; evidently he expects to record addresses forever.
The notion of keeping addresses in a separate file is a good one, but he has not provided any of the fields required to store normalised addresses, so the "schema" will end up with complete duplication of addresses; in which case, he could remove Addresses, put the lines back in the client and staff files, along with their other_details, and remove three files that serve absolutely no purpose other than occupying disk space.
You are thinking about Associative Tables, which resolve the many-to-many relations in Databases. Yes, there, the columns are only the PKs of the two parent tables. These are not Associative Tables or files; they contain data fields.
It is not the PK; it is the third element of the PK.
The notion of a person being registered at more than one address in a single day is not reasonable; just count the one address they slept the most at.
Others have answered that.
Do not expect to identify any evidence of databases or design or Normalisation in this diagram.
1) In each of those tables the primary key is a compound key consisting of three attributes: (staff_id, address_id, date_address_from) and (client_id, address_id, date_address_from). This presumably means that the mapping of clients/staff to addresses is expected to change over time and that the history of those changes is preserved.
2) There's no obvious reason to create a new "id" attribute in those tables. The compound key does the job adequately. Why would you want to create the same address twice for the same client on the same date? If you did then that might be a reason to modify the design but that seems like an unlikely requirement.
3) No. The apparent purpose is that they are the applicable dates for the mapping of address to client/staff - not dates applicable to the address alone.
3) Along the lines of normalization... since both date_address_from and date_address_to are the same in the Client_Addresses and Staff_Addresses tables, should those fields just not be included in the main Address table?
No. But you did find a problem.
The designer has decided that clients and staff are two utterly different things. By "utterly different", I mean they have no attributes in common.
That's not true, is it? Both clients and staff have addresses. I'm sure most of them have telephones, too.
Imagine that someone on staff is also a client. How many places is that person's name stored? That person's address? Can you hear Mr. Rogers in the background saying, "Can you spell 'update anomaly'? . . . I knew you could."
The problem is that the designer was thinking of clients and staff as different kinds of people. They're not. "Client" describes a business relationship between a service provider (usually, that is, not a retailer) and a customer, which might be either a person or a company. "Staff" describes an employment relationship between a company and a person. Not different kinds of people; different kinds of relationships.
Can you see how to fix that?
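One hedged way to draw that fix in SQL (illustrative names only; a fuller design would also let the client side reference a company, per the paragraph above):

    -- Name, address, telephone are stored once, against the person.
    CREATE TABLE Person (
        PersonID INTEGER     NOT NULL PRIMARY KEY,
        Name     VARCHAR(60) NOT NULL
        -- address/phone columns, or rows in a PersonAddress child table
    );

    -- "Client" and "Staff" become relationships, not kinds of people.
    CREATE TABLE Client (
        PersonID    INTEGER NOT NULL PRIMARY KEY REFERENCES Person (PersonID),
        ClientSince DATE    NOT NULL
    );

    CREATE TABLE Staff (
        PersonID INTEGER NOT NULL PRIMARY KEY REFERENCES Person (PersonID),
        HireDate DATE    NOT NULL
    );

    -- Someone who is both staff and a client has one Person row, one
    -- Staff row, and one Client row: nothing is stored twice.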
These 2 extra tables enable you to keep an address history per person.
You could have them both in one table, but since staff and clients are separated, it is better to separate these as well (because client id = 1 and staff id = 1 can't both be used in the same address-mapping table).
there is no "single" solution to a design problem, you can use 1 person table and then add a column to different between staff and client. BUT The major Idea is that the DB should be clear, readable and efficient, and not to save tables.
About 2: the PK is composite, made up of clientID, AddressID, and the from date.
So if someone lives 6 months in the States, then 6 months in Israel, and then moves back to the same address in the States, you need only 2 addresses in the Address table, and 3 rows in Client_Addresses.
The idea of having the from_Date as part of the key is right, although it doesn't guarantee data integrity: you also need to check manually that there are no overlapping dates between records of the same person (a query for that is sketched below).
About 3: no (see 2).
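A hedged, simplified sketch of that manual check, assuming the column names discussed above and a nullable date_address_to for the current address:

    -- Find overlapping address periods for the same client; the composite
    -- PK alone cannot prevent these. Simplified: pairs with identical
    -- start dates are not caught here.
    SELECT a.client_id, a.address_id, a.date_address_from
    FROM   Client_Addresses a
    JOIN   Client_Addresses b
           ON  a.client_id = b.client_id
           AND a.date_address_from < b.date_address_from
           AND (a.date_address_to IS NULL
                OR a.date_address_to > b.date_address_from);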
Viewing the data model, I think:
1) PF means that the field is both part of the primary key of the table and a foreign key to another table.
2) In the same way, the primary key of Staff_Addresses is {staff_id, address_id, date_address_from}, not just date_address_from.
3) The same as 2).
In reference to the Staff_Addresses table, the Primary Key on date_address_from basically prevents a record with the same staff_id/address_id being entered more than once. Now, I'm no DBA, but I like my PKs to be integers or GUIDs for performance reasons/faster indexing. If I were to do this, I would make a new column, say Staff_Address_Id, make it the PK column, and put a unique constraint on staff_id/address_id/date_address_from.
As for your last concern, the Addresses table is really a generic address storage structure. It shouldn't care about the date ranges during which someone resided there. That's better left to specific implementations of an address, such as Client/Staff addresses.
Hope this helps a little.

What exactly does database normalization do?

Supposedly normalization reduces redundancy of data and increases performance. What is the reason for dividing the master table into other small tables, applying relationships between them, retrieving the data using all possible unions, subqueries, joins etc.? Why can't we have all the data in a single table and retrieve it as required?
The main reason is to eliminate repetition of data. For example, if you had a user with multiple addresses and you stored this information in a single table, the user information would be duplicated along with each address entry. Normalisation would separate the addresses into their own table and then link the two using keys. This way you wouldn't need to duplicate the user data, and your db structure becomes a little cleaner.
Full normalisation will generally not improve performance; in fact it can often make it worse, but it will keep your data duplicate-free. In some special cases I've even denormalised specific data in order to get a performance increase.
Normalization comes from the mathematical concept of being "normal." Another word would be "perpendicular." Imagine a regular two-axis coordinate system. Moving up just changes the y coordinate, moving to the side just changes the x coordinate. So every movement can be broken down into a sideways and an up-down movement. These two are independent of each other.
Normalization in a database essentially means the same thing: if you change a piece of data, this is supposed to change just one single piece of information in the database. Imagine a database of e-mails: if you store the ID and the name of the recipient in the Mails table, but the Users table also associates the name with the ID, that means if you change a user name, you don't only have to change it in the Users table, but also in every single message that this user is involved with. So the axis "message" and the axis "user" are not "perpendicular" or "normal".
If on the other hand, the Mails table only has the user ID, any change to the user name will automatically apply to all the messages, because on retrieval of a message, all user information is gathered from the Users table (by means of a join).
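A toy SQL sketch of that normalised version (table and column names are hypothetical):

    CREATE TABLE Users (
        UserID INTEGER     NOT NULL PRIMARY KEY,
        Name   VARCHAR(60) NOT NULL
    );

    -- Mails stores only the ID; the name lives in exactly one place.
    CREATE TABLE Mails (
        MailID      INTEGER NOT NULL PRIMARY KEY,
        RecipientID INTEGER NOT NULL REFERENCES Users (UserID),
        Body        VARCHAR(4000)
    );

    -- Renaming a user is one UPDATE; every message sees it via the join.
    UPDATE Users SET Name = 'New Name' WHERE UserID = 42;

    SELECT m.MailID, u.Name
    FROM   Mails m
    JOIN   Users u ON u.UserID = m.RecipientID;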
Database normalisation is, at its simplest, a way to minimise data redundancy. To achieve that, certain forms of normalisation exist.
First normal form can be summarised as:
no repeating groups in single tables.
separate tables for related information.
all items in a table related to the primary key.
Second normal form adds another restriction, basically that every column not part of a candidate key must be dependent on every candidate key (a candidate key being defined as a minimal set of columns which cannot be duplicated in the table).
And third normal form goes a little further, in that every column not part of a candidate key must not be dependent on any other non-candidate-key column. In other words, it can depend only on the candidate keys. This leads to the saying that 3NF depends on the key, the whole key and nothing but the key, so help me Codd [1].
Note that the above explanations are tailored toward your question rather than database theorists, so the descriptions are necessarily simplified (and I've used phrases like "summarised as" and "basically").
The field of database theory is a complex one and, if you truly wish to understand it, you'll eventually have to get to the science behind it. But, in terms of your question, hopefully this will be adequate.
Normalization is a valuable tool in ensuring we don't have redundant data (which becomes a real problem if the two redundant areas get out of sync). It doesn't generally increase performance.
In fact, although all databases should start in 3NF, it's sometimes acceptable to drop to 2NF for performance gains, provided you're aware of, and mitigate, the potential problems.
And be aware that there are also "higher" levels of normalisation such as (obviously) fourth, fifth and sixth, but also Boyce-Codd and some others I can't remember off the top of my head. In the vast majority of cases, 3NF should be more than enough.
[1] If you don't know who Edgar Codd (or Christopher Date, for that matter) is, you should probably research them; they're the fathers of relational database theory.
We use normalization to reduce the chances of anomalies that may arise as a result of data insertion, deletion, and updates. Normalization doesn't necessarily increase performance.
There is much material on the internet, so I won't repeat the stuff here again. But you can have a look at:
Normalization rules
Anomalies
(others as well)
As well as all the above, it just makes a certain sense. Say you have a user and you want to record what kind of car they have.
Put that all in one table and then you're fine, until someone owns two cars... You're then going to need two rows for that person, and a way of making sure that you can link those two rows together...
And then what if you also want to record how many dogs they have? Same table with lots of confusing dups? Another table with your own custom logic to manage unique users?
Normalization keeps you away from a lot of these problems...

Can I have non-measure codes mixed with measures in my fact table?

We're doing a complex bit of data accumulation. Our customer sends us some stuff that includes two dimensions (time and a business unit). Time is mostly year-month. The business unit dimension has just a few attributes: a name, and a few categories to which BUs can belong for reporting and analysis purposes.
The stuff they send us includes some current state information (dates and codes). These seem fact-like. They also send some information that characterizes the relationship with the business unit (mostly additional codes). Again, these are unique to the business unit and time period.
Finally, they send us stuff that is clearly additive facts. It includes currency and counts that have proper units.
Should I commingle this qualitative information in a single fact table with the additive facts? Or should I separate the qualitative stuff (which can only be used with counts) from the quantitative stuff (which can be used with sum)?
Only put things in the fact table if they are degenerate (causing high-cardinality/uniqueness problems in your dimension, where they take the dimension to a 1:1 relationship with the fact table). Kimball recommends avoiding the temptation to put anything but degenerate dimensions in with the facts (a unique order number, for instance).
You can always put these in what Kimball calls a "junk" dimension. All those codes can simply be lumped into a junk dimension. Most dates would go in the fact table as keys into your date dimension in a particular role (usually with a natural int key of the form YYYYMMDD; one of the only times we don't use a meaningless identity surrogate key).
I like to naively view the star as all the facts, and then which columns go into which dimensions is simply determined by convenience. One should not necessarily view them as corresponding to particular business entities; remember, the star is not an ERD-style normalized OLTP database.
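A hedged sketch of that shape (every table, column, and code name here is invented for illustration):

    -- Junk dimension: one row per observed combination of the misc codes.
    CREATE TABLE DimMiscCodes (
        MiscKey          INTEGER NOT NULL PRIMARY KEY,
        StateCode        CHAR(2) NOT NULL,
        RelationshipCode CHAR(2) NOT NULL
    );

    CREATE TABLE FactBusinessUnitMonth (
        DateKey    INTEGER NOT NULL,        -- natural int key, eg. 20240131
        BUKey      INTEGER NOT NULL,        -- business unit dimension key
        MiscKey    INTEGER NOT NULL REFERENCES DimMiscCodes (MiscKey),
        Amount     DECIMAL(12,2) NOT NULL,  -- additive: currency
        EventCount INTEGER NOT NULL,        -- additive: counts
        PRIMARY KEY (DateKey, BUKey)
    );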
If the data is both directly related to the additive fact and is not something you want to be grouping/sorting/search on, then putting it in the fact table is okay.
Be aware, though, that non-additive data in the fact table will either prevent roll-ups or make them a lossy operation.
Brad Wilson accurately describes the risk of adding them to your fact table. In the past, I've added junk attributes to my fact table only to require refactoring later.
The stuff they send us includes some current state information (dates and codes). These seem fact-like. They also send some information that characterizes the relationship with the business unit (mostly additional codes). Again, these are unique to the business unit and time period.
What business purpose do the dates serve? Offhand, I'd recommend making these their own dimensions and describing them accurately.
How volatile are the extra codes that come in? If the grain of your fact table is date and BU, why can't they be included in the BU dimension and treated as slowly changing attributes?
Without more details I can't make a firm recommendation but these would be the first questions I'd ask myself.
