How far does one go to eliminate duplicate data in a database? - database

How far does one go to eliminate duplicate data in a database? Because you could go OTT and it would get crazy. Let me give you an example...
If I were to create a Zoo database which contains a table 'Animal' which has a 'name', 'species' and 'country_of_birth'
But there will be duplicate data there as many animals could come from same country and there could be lots of tigers, for example.
So really there should be a 'Species' table and a 'Country_of_birth' table
But then after a while you would have tons of tables
So how far do you go?
In this question I am just using one table as an example. One row in the Animal table stores information about a single animal in the zoo. So that animal's name, species and country of birth, as well as a unique animalID.

But there will be duplicate data there as many animals could come from
same country and there could be lots of tigers, for example.
This suggests you want to keep track of individual animals, not just kinds of animals. Let's assume that the zoos use some kind of numeric tattoo or microchip to identify individual animals.
Assume this sample data is representative. (It's not, but it's ok for teaching.)
Animals
Predicate: Animal having microchip <chip_num> of species <species>
has name <name> and was born in <birth_country_code>.
chip_num name species birth_country_code
--
101234 Anita Panthera tigris USA
101235 Bella Panthera tigris USA
101236 Calla Panthera tigris USA
101237 Dingo Canis lupus CAN
101238 Exeter Canis lupus CAN
101239 Bella Canis lupus USA
101240 Bella Canis lupus CAN
There's no redundant data in that table. None of those columns can be dropped without radically changing the meaning of that table. It has a single candidate key: chip_num. It's in 5NF.
Values are repeated in non-key columns. That's kind of the definition of non-key (non-prime) columns. Values in key columns (or sets of key columns) are unique; values in non-key columns aren't.
If you want to restrict the values in "birth_country_code" to the valid three-letter ISO country codes, you can add a table of valid three-letter ISO country codes, and set a foreign key reference to it. This is generally a Good Thing, but it has nothing to do with normalization.
iso_country_code
--
CAN
USA
You could do the same thing again for "species". That, too, would generally be a Good Thing, and it, too, would have nothing to do with normalization.

First you decide What the table is supposed to carry information about. In your example. is the table about individual animals? or is it about species of animals and how many of each species? The fact that you have country of birth might be an indicator that someone wants it to be the former. If that is the case you must have a key that identifies individual animals. You have an attribute, (a property) that is associated with individuals, so each row must (should?) represent an individual. You should read up Here on the database modeling concepts of Identity, and Individuation.
And to do this properly, actually, you do this for each thing in your data model, and then convert that model into database tables.

It comes down to deciding what is important to your system.
Deciding whether something (your e.g. "country of birth") is merely an attribute or is instead a full-blown entity in its own right depends on what else your system needs to know about countries and how many attributes your system may track that are fully functionally dependent on the country.
You should also consider whether your attributes are susceptible to update anomalies. If your statement of country in the animal table is in the form of the full official name of the country, then you might be at risk if, for example, "The Belgian Congo" suddenly becomes "The Democratic Republic of the Congo" - oh wait, that already happened!
The rules of normalization are not sacrosanct. They are pretty darn useful rules of thumb that are intended to keep you out of trouble, most of the time. Still, rules are made to be broken - but you should only break them knowingly and with a carefully considered understanding of the consequences.

Related

ER diagram that implements a database for trainee

I edited and remade the ERD. I have a few more questions.
I included participation constraints(between trainee and tutor), cardinality constraints(M means many), weak entities (double line rectangles), weak relationships(double line diamonds), composed attributes, derived attributes (white space with lines circle), and primary keys.
Questions:
Apparently to reduce redundant attributes I should only keep primary keys and descriptive attributes and the other attributes I will remove for simplicity reasons. Which attributes would be redundant in this case? I am thinking start_date, end_date, phone number, and address but that depends on the entity set right? For example the attribute address would be removed from Trainee because we don't really need it?
For the part: "For each trainee we like to store (if any) also previous companies (employers) where they worked, periods of employment: start date and end date."
Isn't "periods of employment: start date, end date" a composed attribute? because the dates are shown with the symbol ":" Also I believe I didn't make an attribute for "where they worked" which is location?
Also how is it possible to show previous companies (employers) when we already have an attribute employers and different start date? Because if you look at the Question Information it states start_date for employer twice and the second time it says start_date and end_date.
I labeled many attributes as primary keys but how am I able to distinguish from derived attribute, primary key, and which attribute would be redundant?
Is there a multivalued attribute in this ERD? Would salary and job held be a multivalued attribute because a employer has many salaries and jobs.
I believe I did the participation constraints (there is one) and cardinality constraints correctly. But there are sentences where for example "An instructor teaches at least a course. Each course is taught by only one instructor"; how can I write the cardinality constraint for this when I don't have a relationship between course and instructor?
Do my relationship names make sense because all I see is "has" maybe I am not correctly naming the actions of the relationships? Also I believe schedules depend on the actual entity so they are weak entities.... so does that make course entity set also a weak entity (I did not label it as weak here)?
For the company address I put a composed attribute, street num, street address, city... would that be correct? Also would street num and street address be primary keys?
Also I added the final mark attribute to courses and course_schedule is this in the right entity set? The statement for this attribute is "Each trainee identified by: unique code, social security number, name, address, a unique telephone number, the courses attended and the final mark for each course."
For this part: "We store in the database all classrooms available on the site" do i make a composed attribute that contains site information?
Question Information:
A trainee may be self-employed or employee in a company
Each trainee identified by:
unique code, social security number, name, address, a unique
telephone number, the courses attended and the final mark for each course.
If the trainee is an employee in a company: store the current company (employer), start date.
For each trainee we like to store (if any) also previous companies (employers) where they worked, periods of employment: start date and end date.
If a trainee is self-employed: store the area of expertise, and title.
For a trainee that works for a company: we store the salary and job
For each company (employer): name (unique), the address, a unique telephone number.
We store in the database all known companies in the
city.
We need also to represent the courses that each trainee is attending.
Each course has a unique code and a title.
For each course we have to store: the classrooms, dates, and times (start time, and duration in minutes) the course is held.
A classroom is characterized by a building name and a room number and the maximum places’ number.
A course is given in at least a classroom, and may be scheduled in many classrooms.
We store in the database all classrooms
available on the site.
We store in the database all courses given at least once in the company.
For each instructor we will store: the social security number, name, and birth date.
An instructor teaches at least a course.
Each course is taught by only one instructor.
All the instructors’ telephone numbers must also be stored (each instructor has at least a telephone number).
A trainee can be a tutor for one or many trainees for a specific
period of time (start date and end date).
For a trainee it is not mandatory to be a tutor, but it is mandatory to have a tutor
The attribute ‘Code’ will be your PK because it’s only use seems to be that of a Unique Identifier.
The relationship ‘is’ will work but having a reference to two tables like that can get messy. Also you have the reference to "Employers" in the Trainee table which is not good practice. They should really be combined. See my helpful hints section to see how to clean that up.
Company looks like the complete table of Companies in the area as your details suggest. This would mean table is fairly static and used as a reference in your other tables. This means that the attribute ‘employer’ in Employed would simply be a Foreign Key reference to the PK of a specific company in Company. You should draw a relationship between those two.
It seems as though when an employee is ‘employed’ they are either an Employee of a company or self-employed.
The address field in Company will be a unique address your current city, yes, as the question states the table is a complete list of companies in the city. However because this is a unique attribute you must have specifics like street address because simply adding the city name will mean all companies will have the same address which is forbidden in an unique field.
Some other helpful hints:
Stay away from adding fields with plurals on them to your diagram. When you have a plural field it often means you need a separate table with a Foreign Key reference to that table. For example in your Table Trainee, you have ‘Employers’. That should be a Employer table with a foreign key reference to the Trainee Code attribute. In the Employer Table you can combine the Self-employed and Employed tables so that there is a single reference from Trainee to Employer.
ERD Link http://www.imagesup.net/?di=1014217878605. Here's a quick ERD I created for you. Note the use of linker tables to prevent Many to Many relationships in the table. It's important to note there are several ways to solve this schema problem but this is just as I saw your problem laid out. The design is intended to help with normalization of the db. That is prevent redundant data in the DB. Hope this helps. Let me know if you need more clarification on the design I provided. It should be fairly self explanatory when comparing your design parameters to it.
Follow Up Questions:
If you are looking to reduce attributes that might be arbitrary perhaps phone_number and address may be ones to eliminate, but start and end dates are good for sorting and archival reasons when determining whether an entry is current or a past record.
Yes, periods_of_employment does not need to be stored as you can derive that information with start and end dates. Where they worked I believe is just meant to say previous employers, so no location but instead it’s meant that you should be able to get a list all the employers the trainee has had. You can get that with the current schema if you query the employer table for all records where trainee code equals requested trainee and sort by start date. The reason it states start_date twice is to let you know that for all ‘previous’ employers the record will have a start and end date. Hence the previous. However, for current employers the employment hasn't ended which means there will be no end_date so it will null. That’s what the problem was stating in my opinion.
To keep it simple PK’s are unique values used to reference a record within another table. Redundant values are values that you essentially don’t need in a table because the same value can be derived by querying another table. In this case most of your attributes are fine except for Final_Mark in the Course table. This is redundant because Course_Schedule will store the Final_Mark that was received. The Course table is meant to simply hold a list of all potential courses to be referenced by Course_Schedule.
There is no multivalued attributes in this design because that is bad practice Job and salary are singular and if and job or salary changes you would add a new record to the employer table not add to that column. Multivalued attributes make querying a db difficult and I would advise against it. That’s why I mentioned earlier to abstract all attributes with plurals into their own tables and use a foreign key reference.
You essentially do have that written here because Course_Schedule is a linker table meaning that it is meant to simplify relationships between tables so you don’t have many to many relationships.
All your relationships look right to me. Also since the schedules are linker tables and cannot exist without the supporting tables you could consider them weak entities. Course in this schema is a defined list of all courses available so can be independent of any other table. This by definition is not a weak entity. When creating this db you’d probably fill in the course table and it probably wouldn’t change after that, except rarely when adding or removing an available course option.
Yes, you can make address a composite attribute, and that would be right in your diagram. To be clear with your use of Primary key, just because an attribute is unique doesn’t make it a primary key. A table can have one and only one primary key so you must pick a column that you are certain will not be repeated. In this example you may think street number might be unique but what if one company leaves an address and another company moves into that spot. That would break that tables primary key. Typically a company name is licensed in a city or state so cannot be repeated. That would be a better choice for your primary key. You can also make composite primary keys, but that is a more advanced topic that I would recommend reading about at a later date.
Take final_mark out of courses. That’s table will contain rows of only courses, those courses won’t be linked to any trainee except by course_schedule table. The Final_Mark should only be in that table. If you add final_mark to Course table then, if you have 10 trainees in a course, You’d have 10 duplicate rows in the course table with only differing final_marks. Instead only hold the course_code and title that way you can assign different instructors, trainees and classrooms using the linker tables.
No composite attribute is needed using this schema. You have a Classroom table that will hold all available classrooms and their relevant information. You then use the Classroom_Schedule linker table to assign a given Classroom to a Course_Schedule. No attributes of Classroom can be broken down to simpler attributes.

Database Design: Explain this schema

Full disclosure...Trying feverishly here to learn more about databases so I am putting in the time and also tried to get this answer from the source to no avail.
Barry Williams from databaseanswers has this schema posted.
Clients and Fees Schema
I am trying to understand the split of address tables in this schema. Its clear to me that the Addresses table contains the details of a given address. The Client_Addresses and Staff_Addresses tables are what gets me.
1) I understand the use of Primary Foreign Keys as shown but I was under the assumption that when these are used you don't have a resident Primary Key in that same table (date_address_from in this case). Can someone explain the reasoning for both and put it into words how this actually works out?
2) Why would you use date_address_from as the primary key instead of something like client_address_id as the PK? What if someone enters two addresses in one day would there be conflicts in his design? If so or if not, what?
3) Along the lines of normalization...Since both date_address_from and date_address_to are the same in the Client_Addresses and Staff_Addresses table should those fields just not be included in the main Address table?
Evaluation
First an Audit, then the specific answers.
This is not a Data Model. This is not a Database. It is a bucket of fish, with each fish drawn as a rectangle, and where the fins of one fish are caught in the the gills of another, there is a line. There are masses of duplication, as well as masses of missing elements. It is completely unworthy of using as an example to learn anything about database design from.
There is no Normalisation at all; the files are very incomplete (see Mike's answer, there are a hundred more problem like that). The other_details and eg.s crack me up. Each element needs to be identified and stored: StreetNo, ApartmentNo, StreetName, StreetType, etc. not line_1_number_street, which is a group.
Customer and Staff should be normalised into a Person table, with all the elements identified.
And yes, if Customer can be either a Person or an Organisation, then a supertype-subtype structure is required to support that correctly.
So what this really is, the technically accurate terms, is a bunch of flat files, with descriptions for groups of fields. Light years distant from a database or a relational one. Not ready for evaluation or inspection, let alone building something with. In a Relational Data Model, that would be approximately 35 normalised tables, with no duplicated columns.
Barry has (wait for it) over 500 "schemas" on the web. The moment you try to use a second "schema", you will find that (a) they are completely different in terms of use and purpose (b) there is no commonality between them (c) let's say there was a customer file in both; they would be different forms of customer files.
He needs to Normalise the entire single "schema" first,
then present the single normlaised data model in 500 sections or subject areas.
I have written to him about it. No response.
It is important to note also, that he has used some unrecognisable diagramming convention. The problem with these nice interesting pictures is that they convey some things but they do not convey the important things about a database or a design. It is no surprise that a learner is confused; it is not clear to experienced database professionals. There is a reason why there is a standard for modelling Relational databases, and for the notation in Data Models: they convey all the details and subtleties of the design.
There is a lot that Barry has not read about yet: naming conventions; relations; cardinality; etc, too many to list.
The web is full of rubbish, anyone can "publish". There are millions of good- and bad-looking "designs" out there, that are not worth looking at. Or worse, if you look, you will learn completely incorrect methods of "design". In terms of learning about databases and database design, you are best advised to find someone qualified, with demonstrated capability, and learn from them.
Answer
He is using composite keys without spelling it out. The PK for client_addresses is client_id, address_id, date_address_from). That is not a bad key, evidently he expects to record addresses forever.
The notion of keeping addresses in a separate file is a good one, but he has not provided any of the fields required to store normalised addresses, so the "schema" will end up with complete duplication of addresses; in which case, he could remove addresses, and put the lines back in the client and staff files, along with their other_details, and remove three files that serve absolutely no purpose other than occupying disk space.
You are thinking about Associative Tables, which resolve the many-to-many relations in Databases. Yes, there, the columns are only the PKs of the two parent tables. These are not Associative Tables or files; they contain data fields.
It is not the PK, it is the third element of the PK.
The notion of a person being registered at more than one address in a single day is not reasonable; just count the one address they slept the most at.
Others have answered that.
Do not expect to identify any evidence of databases or design or Normalisation in this diagram.
1) In each of those tables the primary key is a compound key consisting of three attributes: (staff_id, address_id, date_address_from) and (client_id, address_id, date_address_from). This presumably means that the mapping of clients/staff to addresses is expected to change over time and that the history of those changes is preserved.
2) There's no obvious reason to create a new "id" attribute in those tables. The compound key does the job adequately. Why would you want to create the same address twice for the same client on the same date? If you did then that might be a reason to modify the design but that seems like an unlikely requirement.
3) No. The apparent purpose is that they are the applicable dates for the mapping of address to client/staff - not dates applicable to the address alone.
3) Along the lines of
normalization...Since both
date_address_from and date_address_to
are the same in the Client_Addresses
and Staff_Addresses table should those
fields just not be included in the
main Address table?
No. But you did find a problem.
The designer has decided that clients and staff are two utterly different things. By "utterly different", I mean they have no attributes in common.
That's not true, is it? Both clients and staff have addresses. I'm sure most of them have telephones, too.
Imagine that someone on staff is also a client. How many places is that person's name stored? That person's address? Can you hear Mr. Rogers in the background saying, "Can you spell 'update anomaly'? . . . I knew you could."
The problem is that the designer was thinking of clients and staff as different kinds of people. They're not. "Client" describes a business relationship between a service provider (usually, that is, not a retailer) and a customer, which might be either a person or a company. "Staff" describes a employment relationship between a company and a person. Not different kinds of people--different kinds of relationships.
Can you see how to fix that?
This 2 extra tables enables you to have address history per one person.
You can have them both in one table, but since staff and client are separated, it is better to separate them as well (b/c client id =1 and staff id =1 can't be used on the same table of address).
there is no "single" solution to a design problem, you can use 1 person table and then add a column to different between staff and client. BUT The major Idea is that the DB should be clear, readable and efficient, and not to save tables.
about 2 - the pk is combined, both clientID, AddressID and from.
so if someone lives 6 month in the states, then 6 month in Israel, and then back to the states, to the same address - you need only 2 address in address table, and 3 in the client_address.
The idea of heaving the from_Date as part of the key is right, although it doesn't guaranty data integrity - as you also need manually to check that there isn't overlapping dates between records of the same person.
about 3 - no (look at 2).
Viewing the data model, i think:
1) PF means that the field is both part of the primary key of the table and foreign key with other table.
2) In the same way, the primary key of Staff_Addresses is {staff_id,address_id,date_adderess_from} not just date_adderess_from
3) The same that 2)
In reference to Staff_Addresses table, the Primary Key on date_address_from basically prevents a record with the same staff_id/address_id entered more than once. Now, i'm no DBA, but i like my PKs to be integers or guids for performance reasons/faster indexing. If i were to do this i would make a new column, say, Staff_Address_Id and make it the PK column and put a unique constraint on staff_id/address_id/date_address_from.
As for your last concern, Addresses table is really a generic address storage structure. It shouldn't care about date ranges during which someone resided there. It's better to be left to specific implementations of an address such as Client/Staff addresses.
Hope this helps a little.

a layman's term for identifying relationship

There are couples of questions around asking for difference / explanation on identifying and non-identifying relationship in relationship database.
My question is, can you think of a simpler term for these jargons? I understand that technical terms have to be specific and unambiguous though. But having an 'alternative name' might help students relate more easily to the concept behind.
We actually want to use a more layman term in our own database modeling tool, so that first-time users without much computer science background could learn faster.
cheers!
I often see child table or dependent table used as a lay term. You could use either of those terms for a table with an identifying relationship
Then say a referencing table is a table with a non-identifying relationship.
For example, PhoneNumbers is a child of Users, because a phone number has an identifying relationship with its user (i.e. the primary key of PhoneNumbers includes a foreign key to the primary key of Users).
Whereas the Users table has a state column that is a foreign key to the States table, making it a non-identifying relationship. So you could say Users references States, but is not a child of it per se.
I think belongs to would be a good name for the identifying relationship.
A "weak entity type" does not have its own key, just a "partial key", so each entity instance of this weak entity type has to belong to some other entity instance so it can be identified, and this is an "identifying relationship". For example, a landlord could have a database with apartments and rooms. A room can be called kitchen or bathroom, and while that name is unique within an apartment, there will be many rooms in the database with the name kitchen, so it is just a partial key. To uniquely identify a room in the database, you need to say that it is the kitchen in this particular apartment. In other words, the rooms belong to apartments.
I'm going to recommend the term "weak entity" from ER modeling.
Some modelers conceptualize the subject matter as being made up of entities and relationships among entities. This gives rise to Entity-Relationship Modeling (ER Modeling). An attribute can be tied to an entity or a relationship, and values stored in the database are instances of attributes.
If you do ER modeling, there is a kind of entity called a "weak entity". Part of the identity of a weak entity is the identity of a stronger entity, to which the weak one belongs.
An example might be an order in an order processing system. Orders are made up of line items, and each line item contains a product-id, a unit-price, and a quantity. But line items don't have an identifying number across all orders. Instead, a line item is identified by {item number, order number}. In other words, a line item can't exist unless it's part of exactly one order. Item number 1 is the first item in whatever order it belongs to, but you need both numbers to identify an item.
It's easy to turn an ER model into a relational model. It's also easy for people who are experts in the data but know nothing about databases to get used to an ER model of the data they understand.
There are other modelers who argue vehemently against the need for ER modeling. I'm not one of them.
Nothing, absolutely nothing in the kind of modeling where one encounters things such as "relationships" (ER, I presume) is "technical", "precise" or "unambiguous". Nor can it be.
A) ER modeling is always and by necessity informal, because it can never be sufficient to capture/express the entire definition of a database.
B) There are so many different ER dialects out there that it is just impossible for all of them to use exactly the same terms with exactly the same meaning. Recently, I even discovered that some UK university that teaches ER modeling, uses the term "entity subtype" for the very same thing that I always used to name "entity supertype", and vice-versa !
One could use connection.
You have Connection between two tables, where the IDs are the same.
That type of thing.
how about
Association
Link
Correlation

What is a good KISS description of Boyce-Codd normal form?

What is a KISS (Keep it Simple, Stupid) way to remember what Boyce-Codd normal form is and how to take a unnormalized table and BCNF it?
Wikipedia's info: not terribly helpful for me.
Chris Date's definition is actually quite good, so long as you understand what he means:
Each attribute
Your data must be broken into separate, distinct attributes/columns/values which do not depend on any other attributes. Your full name is an attribute. Your birthdate is an attribute. Your age is not an attribute, it depends on the current date which is not part of your birthdate.
must represent a fact
Each attribute is a single fact, not a collection of facts. Changing one bit in an attribute changes the whole meaning. Your birthdate is a fact. Is your full name a fact? Well, in some cases it is, because if you change your surname your full name is different, right? But to a genealogist you have a surname and a family name, and if you change your surname your family name does not change, so they are separate facts.
about the key,
One attribute is special, it's a key. The key is an attribute that must be unique for all information in your data and must never change. Your full name is not a key because it can change. Your Social Insurance Number is not a key because they get reused. Your SSN plus birthdate is not a key, even if the combination can never be reused, because an attribute cannot be a combination of two facts. A GUID is a key. A number you increment and never reuse is a key.
the whole key,
The key alone must be sufficient [and necessary!] to identify your values; you cannot have the same data represented by different keys, nor can a subset of the key columns be sufficient to identify the fact.
Suppose you had an address book with a GUID key, name and address values. It is OK to have the same name appearing twice with different keys if they represent different people and are not the "same data".
If Mary Jones in accounting changes her name to Mary Smith, Mary Jones in Sales does not change her name as well.
On the other hand, if Mary Smith and John Smith have the same street address and it really is the same place, this is not allowed. You have to create a new key/value pair with the street address and a new key.
You are also not allowed to use the key for this new single street address as a value in the address book since now the same street address key would be represented twice.
Instead, you have to make a third key/value pair with values of the address book key and the street address key; you find a person's street address by matching their book key and address key in this group of values.
and nothing but the key
There must be nothing other than the key that identifies your values. For example, if you are allowed an address of "The Taj Mahal" (assuming there is only one) you are not allowed a city value in the same record,
since if you know the address you would also know the city. This would also open up the possibility of there being more than one Taj Mahal in a different city.
Instead, you have to again create a secondary Location key with unique values like the Taj, the White House in DC, and so on, and their cities.
Or forbid "addresses" that are unique to a city.
So help me, Codd.
Here are some helpful excerpts from the Wikipedia page on Third Normal Form:
Bill Kent defines Third Normal Form this way:
Each non-key attribute "must provide
a fact about the key, the whole key,
and nothing but the key."
Requiring that non-key attributes be
dependent on "the whole key" ensures
that a table is in 2NF; further
requiring that non-key attributes be
dependent on "nothing but the key"
ensures that the table is in 3NF.
Chris Date adapts Kent's mnemonic to define Boyce-Codd Normal Form:
"Each attribute must represent a fact
about the key, the whole key, and
nothing but the key." Here the
requirement is concerned with every
attribute in the table, not just
non-key attributes.
This comes into play when a table has multiple compound candidate keys, and an attribute within one candidate keys has a dependency on a part of another candidate key. Third Normal Form wouldn't prohibit this, because it excludes key attributes. But BCNF applies the rule to key attributes as well.
As for how to make a table satisfy BCNF, you need to represent the extra dependency, with another attribute and possibly by splitting attributes into another table.
I googled "boyce codd normal form" and after wikipedia this is the second result. My textbook gives a very simple definition in terms of relational database management systems:
The left side of every nontrivial FD must be a superkey.
-"Database Systems The Complete Book" by Garcia-Molina, Ullman and Widom.
The best informal answer I've read is that, in BCNF, every "arrow" in every functional dependency is an "arrow" out of a candidate key. I don't recall the source, but it was probably something Chris Date wrote.
Basically Boyce-Codd is "fifth normal form". It is visually recognizable by the existance of "Attributive entities" in the data model, for things like Types (e.g. roles, status, process state, location-type, phone-type, etc).
The attributive entities (sub-subtypes) are lists of finite sets of values that further categorize a class level entity. So you may have a phone-type ('mobile', ' desk', 'VOIP') email account type ('business', 'personal', 'gaming'), role (project manager, data modeler, super model) etc.
Another morphological clue is the existance of super-types, (aka. master-classes, super-classes, meta-entities) such as Parties (subtypes being company, person, etc.).
It's basically Taxonomy gone wild (..no the video is not that exciting) to the atomic or leaf-level; see Bill Karwin's comment above for a more technical explanation.
Boyce-Codd level models are essentially highly detailed logical models, derived from more simplistic business-based conceptual models. **They are typically NOT implemented ver batim in the PHYSICAL model, because PDM optimization for performance (or functional simplicity) may result in the super-types and attributive entities being managed as drop-down lists in UIs, or in behind the scenes logic in the application, or in database constraints and methods to enforce referential integrity. (i.e. they may end up as look-up tables in the PDM schema, or they may be handled by code and not represented in the database).
So - why do them if they may not end up in the PDM? For the same reason you build a good 3NF model before you 'optimize', so that the database structure reflects the real world and is hence more stable than the typical kludges we inherit and have to do heroic acts to make work as our business/clients requirements change.
Often times it is easiest to listen to your gut and this will come naturally. Generally speaking, if you meet 3NF you have met BCNF. This doesn't cover detailed analysis of an ERD or have examples but there are thirteen rules according to Codd. I find it best to follow these rules but always remember there is no one correct way to do things so follow them loosely. So regarding the RDBMS, here are the rules:
http://www.87android.com/12-rules-of-relational-database-model-by-codd/
This may not answer the question directly, but if you are asking about how to get to BCNF or an easy way to remember it then you don't understand normalization well enough. That is of no concern though. Relational databases take many forms and very few are done well. The best thing you can do is know what it means to be relational, follow the rules above, and do not worry about the level of normalization. The process of normalization eliminates the duplication of data. Each level more so by moving into migration of functional dependencies. Keep that in mind and you will be fine, your gut and intellect will do the rest.

Database, Table and Column Naming Conventions? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
Whenever I design a database, I always wonder if there is a best way of naming an item in my database. Quite often I ask myself the following questions:
Should table names be plural?
Should column names be singular?
Should I prefix tables or columns?
Should I use any case in naming items?
Are there any recommended guidelines out there for naming items in a database?
I recommend checking out Microsoft's SQL Server sample databases:
https://github.com/Microsoft/sql-server-samples/releases/tag/adventureworks
The AdventureWorks sample uses a very clear and consistent naming convention that uses schema names for the organization of database objects.
Singular names for tables
Singular names for columns
Schema name for tables prefix (E.g.: SchemeName.TableName)
Pascal casing (a.k.a. upper camel case)
Late answer here, but in short:
Plural table names: My preference is plural
Singular column names: Yes
Prefix tables or columns:
Tables: *Usually* no prefixes is best.
Columns: No.
Use any case in naming items: PascalCase for both tables and columns.
Elaboration:
(1) What you must do. There are very few things that you must do a certain way, every time, but there are a few.
Name your primary keys using "[singularOfTableName]ID" format. That is, whether your table name is Customer or Customers, the primary key should be CustomerID.
Further, foreign keys must be named consistently in different tables. It should be legal to beat up someone who does not do this. I would submit that while defined foreign key constraints are often important, consistent foreign key naming is always important
You database must have internal conventions. Even though in later sections you'll see me being very flexible, within a database naming must be very consistent . Whether your table for customers is called Customers or Customer is less important than that you do it the same way throughout the same database. And you can flip a coin to determine how to use underscores, but then you must keep using them the same way. If you don't do this, you are a bad person who should have low self-esteem.
(2) What you should probably do.
Fields representing the same kind of data on different tables should be named the same. Don't have Zip on one table and ZipCode on another.
To separate words in your table or column names, use PascalCasing. Using camelCasing would not be intrinsically problematic, but that's not the convention and it would look funny. I'll address underscores in a moment. (You may not use ALLCAPS as in the olden days. OBNOXIOUSTABLE.ANNOYING_COLUMN was okay in DB2 20 years ago, but not now.)
Don't artifically shorten or abbreviate words. It is better for a name to be long and clear than short and confusing. Ultra-short names is a holdover from darker, more savage times. Cus_AddRef. What on earth is that? Custodial Addressee Reference? Customer Additional Refund? Custom Address Referral?
(3) What you should consider.
I really think you should have plural names for tables; some think singular. Read the arguments elsewhere. Column names should be singular however. Even if you use plural table names, tables that represent combinations of other tables might be in the singular. For example, if you have a Promotions and an Items table, a table representing an item being a part of a promotion could be Promotions_Items, but it could also legitimately be Promotion_Items I think (reflecting the one-to-many relationship).
Use underscores consistently and for a particular purpose. Just general tables names should be clear enough with PascalCasing; you don't need underscores to separate words. Save underscores either (a) to indicate an associative table or (b) for prefixing, which I'll address in the next bullet.
Prefixing is neither good or bad. It usually is not best. In your first db or two, I would not suggest using prefixes for general thematic grouping of tables. Tables end up not fitting your categories easily, and it can actually make it harder to find tables. With experience, you can plan and apply a prefixing scheme that does more good than harm. I worked in a db once where data tables began with tbl, config tables with ctbl, views with vew, proc's sp, and udf's fn, and a few others; it was meticulously, consistently applied so it worked out okay. The only time you NEED prefixes is when you have really separate solutions that for some reason reside in the same db; prefixing them can be very helpful in grouping the tables. Prefixing is also okay for special situations, like for temporary tables that you want to stand out.
Very seldom (if ever) would you want
to prefix columns.
Ok, since we're weighing in with opinion:
I believe that table names should be plural. Tables are a collection (a table) of entities. Each row represents a single entity, and the table represents the collection. So I would call a table of Person entities People (or Persons, whatever takes your fancy).
For those who like to see singular "entity names" in queries, that's what I would use table aliases for:
SELECT person.Name
FROM People person
A bit like LINQ's "from person in people select person.Name".
As for 2, 3 and 4, I agree with #Lars.
I work in a database support team with three DBAs and our considered options are:
Any naming standard is better than no standard.
There is no "one true" standard, we all have our preferences
If there is standard already in place, use it. Don't create another standard or muddy the existing standards.
We use singular names for tables. Tables tend to be prefixed with the name of the system (or its acronym). This is useful if the system complex as you can change the prefix to group the tables together logically (ie. reg_customer, reg_booking and regadmin_limits).
For fields we'd expect field names to be include the prefix/acryonm of the table (i.e. cust_address1) and we also prefer the use of a standard set of suffixes ( _id for the PK, _cd for "code", _nm for "name", _nb for "number", _dt for "Date").
The name of the Foriegn key field should be the same as the Primary key field.
i.e.
SELECT cust_nm, cust_add1, booking_dt
FROM reg_customer
INNER JOIN reg_booking
ON reg_customer.cust_id = reg_booking.cust_id
When developing a new project, I'd recommend you write out all the preferred entity names, prefixes and acronyms and give this document to your developers. Then, when they decide to create a new table, they can refer to the document rather than "guess" what the table and fields should be called.
No. A table should be named after the entity it represents.
Person, not persons is how you would refer to whoever one of the records represents.
Again, same thing. The column FirstName really should not be called FirstNames. It all depends on what you want to represent with the column.
NO.
Yes. Case it for clarity. If you need to have columns like "FirstName", casing will make it easier to read.
Ok. Thats my $0.02
I hear the argument all the time that whether or not a table is pluralized is all a matter of personal taste and there is no best practice. I don't believe that is true, especially as a programmer as opposed to a DBA. As far as I am aware, there are no legitimate reasons to pluralize a table name other than "It just makes sense to me because it's a collection of objects," while there are legitimate gains in code by having singular table names. For example:
It avoids bugs and mistakes caused by plural ambiguities. Programmers aren't exactly known for their spelling expertise, and pluralizing some words are confusing. For example, does the plural word end in 'es' or just 's'? Is it persons or people? When you work on a project with large teams, this can become an issue. For example, an instance where a team member uses the incorrect method to pluralize a table he creates. By the time I interact with this table, it is used all over in code I don't have access to or would take too long to fix. The result is I have to remember to spell the table wrong every time I use it. Something very similar to this happened to me. The easier you can make it for every member of the team to consistently and easily use the exact, correct table names without errors or having to look up table names all the time, the better. The singular version is much easier to handle in a team environment.
If you use the singular version of a table name AND prefix the primary key with the table name, you now have the advantage of easily determining a table name from a primary key or vice versa via code alone. You can be given a variable with a table name in it, concatenate "Id" to the end, and you now have the primary key of the table via code, without having to do an additional query. Or you can cut off "Id" from the end of a primary key to determine a table name via code. If you use "id" without a table name for the primary key, then you cannot via code determine the table name from the primary key. In addition, most people who pluralize table names and prefix PK columns with the table name use the singular version of the table name in the PK (for example statuses and status_id), making it impossible to do this at all.
If you make table names singular, you can have them match the class names they represent. Once again, this can simplify code and allow you to do really neat things, like instantiating a class by having nothing but the table name. It also just makes your code more consistent, which leads to...
If you make the table name singular, it makes your naming scheme consistent, organized, and easy to maintain in every location. You know that in every instance in your code, whether it's in a column name, as a class name, or as the table name, it's the same exact name. This allows you to do global searches to see everywhere that data is used. When you pluralize a table name, there will be cases where you will use the singular version of that table name (the class it turns into, in the primary key). It just makes sense to not have some instances where your data is referred to as plural and some instances singular.
To sum it up, if you pluralize your table names you are losing all sorts of advantages in making your code smarter and easier to handle. There may even be cases where you have to have lookup tables/arrays to convert your table names to object or local code names you could have avoided. Singular table names, though perhaps feeling a little weird at first, offer significant advantages over pluralized names and I believe are best practice.
I'm also in favour of a ISO/IEC 11179 style naming convention, noting they are guidelines rather than being prescriptive.
See Data element name on Wikipedia:
"Tables are Collections of Entities, and follow Collection naming guidelines. Ideally, a collective name is used: eg., Personnel. Plural is also correct: Employees. Incorrect names include: Employee, tblEmployee, and EmployeeTable."
As always, there are exceptions to rules e.g. a table which always has exactly one row may be better with a singular name e.g. a config table. And consistency is of utmost importance: check whether you shop has a convention and, if so, follow it; if you don't like it then do a business case to have it changed rather than being the lone ranger.
our preference:
Should table names be plural?
Never. The arguments for it being a collection make sense, but you never know what the table is going to contain (0,1 or many items). Plural rules make the naming unnecessarily complicated. 1 House, 2 houses, mouse vs mice, person vs people, and we haven't even looked at any other languages.
Update person set property = 'value' acts on each person in the table.
Select * from person where person.name = 'Greg' returns a collection/rowset of person rows.
Should column names be singular?
Usually, yes, except where you are breaking normalisation rules.
Should I prefix tables or columns?
Mostly a platform preference. We prefer to prefix columns with the table name. We don't prefix tables, but we do prefix views (v_) and stored_procedures (sp_ or f_ (function)). That helps people who want to try to upday v_person.age which is actually a calculated field in a view (which can't be UPDATEd anyway).
It is also a great way to avoid keyword collision (delivery.from breaks, but delivery_from does not).
It does make the code more verbose, but often aids in readability.
bob = new person()
bob.person_name = 'Bob'
bob.person_dob = '1958-12-21'
... is very readable and explicit. This can get out of hand though:
customer.customer_customer_type_id
indicates a relationship between customer and the customer_type table, indicates the primary key on the customer_type table (customer_type_id) and if you ever see 'customer_customer_type_id' whilst debugging a query, you know instantly where it is from (customer table).
or where you have a M-M relationship between customer_type and customer_category (only certain types are available to certain categories)
customer_category_customer_type_id
... is a little (!) on the long side.
Should I use any case in naming items?
Yes - lower case :), with underscores. These are very readable and cross platform. Together with 3 above it also makes sense.
Most of these are preferences though. - As long as you are consistent, it should be predictable for anyone that has to read it.
Take a look at ISO 11179-5: Naming and identification principles
You can get it here: http://metadata-standards.org/11179/#11179-5
I blogged about it a while back here: ISO-11179 Naming Conventions
I know this is late to the game, and the question has been answered very well already, but I want to offer my opinion on #3 regarding the prefixing of column names.
All columns should be named with a prefix that is unique to the table they are defined in.
E.g. Given tables "customer" and "address", let's go with prefixes of "cust" and "addr", respectively. "customer" would have "cust_id", "cust_name", etc. in it. "address" would have "addr_id", "addr_cust_id" (FK back to customer), "addr_street", etc. in it.
When I was first presented with this standard, I was dead-set against it; I hated the idea. I couldn't stand the idea of all that extra typing and redundancy. Now I've had enough experience with it that I'd never go back.
The result of doing this is that all of the columns in your database schema are unique. There is one major benefit to this, which trumps all arguments against it (in my opinion, of course):
You can search your entire code base and reliably find every line of code that touches a particular column.
The benefit from #1 is incredibly huge. I can deprecate a column and know exactly what files need to be updated before the column can safely be removed from the schema. I can change the meaning of a column and know exactly what code needs to be refactored. Or I can simply tell if data from a column is even being used in a particular portion of the system. I can't count the number of times this has turned a potentially huge project into a simple one, nor the amount of hours we've saved in development work.
Another, relatively minor benefit to it is that you only have to use table-aliases when you do a self join:
SELECT cust_id, cust_name, addr_street, addr_city, addr_state
FROM customer
INNER JOIN address ON addr_cust_id = cust_id
WHERE cust_name LIKE 'J%';
My opinions on these are:
1) No, table names should be singular.
While it appears to make sense for the simple selection (select * from Orders) it makes less sense for the OO equivalent (Orders x = new Orders).
A table in a DB is really the set of that entity, it makes more sense once you're using set-logic:
select Orders.*
from Orders inner join Products
on Orders.Key = Products.Key
That last line, the actual logic of the join, looks confusing with plural table names.
I'm not sure about always using an alias (as Matt suggests) clears that up.
2) They should be singular as they only hold 1 property
3) Never, if the column name is ambiguous (as above where they both have a column called [Key]) the name of the table (or its alias) can distinguish them well enough. You want queries to be quick to type and simple - prefixes add unnecessary complexity.
4) Whatever you want, I'd suggest CapitalCase
I don't think there's one set of absolute guidelines on any of these.
As long as whatever you pick is consistent across the application or DB I don't think it really matters.
In my opinion:
Table names should be plural.
Column names should be singular.
No.
Either CamelCase (my preferred) or underscore_separated for both table names and column names.
However, like it has been mentioned, any convention is better than no convention. No matter how you choose to do it, document it so that future modifications follow the same conventions.
I think the best answer to each of those questions would be given by you and your team. It's far more important to have a naming convention then how exactly the naming convention is.
As there's no right answer to that, you should take some time (but not too much) and choose your own conventions and - here's the important part - stick to it.
Of course it's good to seek some information about standards on that, which is what you're asking, but don't get anxious or worried about the number of different answers you might get: choose the one that seems better for you.
Just in case, here are my answers:
Yes. A table is a group of records, teachers or actors, so... plural.
Yes.
I don't use them.
The database I use more often - Firebird - keeps everything in upper case, so it doesn't matter. Anyway, when I'm programming I write the names in a way that it's easier to read, like releaseYear.
Definitely keep table names singular, person not people
Same here
No. I've seen some terrible prefixes, going so far as to state what were dealing with is a table (tbl_) or a user store procedure (usp_). This followed by the database name... Don't do it!
Yes. I tend to PascalCase all my table names
Naming conventions allow the development team to design discovereability and maintainability at the heart of the project.
A good naming convention takes time to evolve but once it’s in place it allows the team to move forward with a common language. A good naming convention grows organically with the project. A good naming convention easily copes with changes during the longest and most important phase of the software lifecycle - service management in production.
Here are my answers:
Yes, table names should be plural when they refer to a set of trades, securities, or counterparties for example.
Yes.
Yes. SQL tables are prefixed with tb_, views are prefixed vw_, stored procedures are prefixed usp_ and triggers are prefixed tg_ followed by the database name.
Column name should be lower case separated by underscore.
Naming is hard but in every organisation there is someone who can name things and in every software team there should be someone who takes responsibility for namings standards and ensures that naming issues like sec_id, sec_value and security_id get resolved early before they get baked into the project.
So what are the basic tenets of a good naming convention and standards: -
Use the language of your client and
your solution domain
Be descriptive
Be consistent
Disambiguate, reflect and refactor
Don’t use abbreviations unless they
are clear to everyone
Don’t use SQL reserved keywords as
column names
Here's a link that offers a few choices. I was searching for a simple spec I could follow rather than having to rely on a partially defined one.
http://justinsomnia.org/writings/naming_conventions.html
SELECT
UserID, FirstName, MiddleInitial, LastName
FROM Users
ORDER BY LastName
Table names should always be singular, because they represent a set of objects. As you say herd to designate a group of sheep, or flock do designate a group of birds. No need for plural. When a table name is composition of two names and naming convention is in plural it becomes hard to know if the plural name should be the first word or second word or both.
It’s the logic – Object.instance, not objects.instance. Or TableName.column, not TableNames.column(s).
Microsoft SQL is not case sensitive, it’s easier to read table names, if upper case letters are used, to separate table or column names when they are composed of two or more names.
Table Name: It should be singular, as it is a singular entity representing a real world object and not objects, which is singlular.
Column Name: It should be singular only then it conveys that it will hold an atomic value and will confirm to the normalization theory. If however, there are n number of same type of properties, then they should be suffixed with 1, 2, ..., n, etc.
Prefixing Tables / Columns: It is a huge topic, will discuss later.
Casing: It should be Camel case
My friend, Patrick Karcher, I request you to please not write anything which may be offensive to somebody, as you wrote, "•Further, foreign keys must be named consistently in different tables. It should be legal to beat up someone who does not do this.". I have never done this mistake my friend Patrick, but I am writing generally. What if they together plan to beat you for this? :)
Very late to the party but I still wanted to add my two cents about column prefixes
There seem to be two main arguments for using the table_column (or tableColumn) naming standard for columns, both based on the fact that the column name itself will be unique across your whole database:
1) You do not have to specify table names and/or column aliases in your queries all the time
2) You can easily search your whole code for the column name
I think both arguments are flawed. The solution for both problems without using prefixes is easy. Here's my proposal:
Always use the table name in your SQL. E.g., always use table.column instead of column.
It obviously solves 2) as you can now just search for table.column instead of table_column.
But I can hear you scream, how does it solve 1)? It was exactly about avoiding this. Yes, it was, but the solution was horribly flawed. Why? Well, the prefix solution boils down to:
To avoid having to specify table.column when there's ambiguity, you name all your columns table_column!
But this means you will from now on ALWAYS have to write the column name every time you specify a column. But if you have to do that anyways, what's the benefit over always explicitly writing table.column? Exactly, there is no benefit, it's the exact same number of characters to type.
edit: yes, I am aware that naming the columns with the prefix enforces the correct usage whereas my approach relies on the programmers
Essential Database Naming Conventions (and Style) (click here for more detailed description)
table names
choose short, unambiguous names, using no more than one or two words
distinguish tables easily
facilitates the naming of unique field names as well as lookup and linking tables
give tables singular names, never plural (update: i still agree with the reasons given for this convention, but most people really like plural table names, so i’ve softened my stance)... follow the link above please
Table names singular. Let's say you were modelling a realtionship between someone and their address.
For example, if you are reading a datamodel would you prefer
'each person may live at 0,1 or many address.' or
'each people may live at 0,1 or many addresses.'
I think its easier to pluralise address, rather than have to rephrase people as person. Plus collective nouns are quite often dissimlar to the singular version.
--Example SQL
CREATE TABLE D001_Students
(
StudentID INTEGER CONSTRAINT nnD001_STID NOT NULL,
ChristianName NVARCHAR(255) CONSTRAINT nnD001_CHNA NOT NULL,
Surname NVARCHAR(255) CONSTRAINT nnD001_SURN NOT NULL,
CONSTRAINT pkD001 PRIMARY KEY(StudentID)
);
CREATE INDEX idxD001_STID on D001_Students;
CREATE TABLE D002_Classes
(
ClassID INTEGER CONSTRAINT nnD002_CLID NOT NULL,
StudentID INTEGER CONSTRAINT nnD002_STID NOT NULL,
ClassName NVARCHAR(255) CONSTRAINT nnD002_CLNA NOT NULL,
CONSTRAINT pkD001 PRIMARY KEY(ClassID, StudentID),
CONSTRAINT fkD001_STID FOREIGN KEY(StudentID)
REFERENCES D001_Students(StudentID)
);
CREATE INDEX idxD002_CLID on D002_Classes;
CREATE VIEW V001_StudentClasses
(
SELECT
D001.ChristianName,
D001.Surname,
D002.ClassName
FROM
D001_Students D001
INNER JOIN
D002_Classes D002
ON
D001.StudentID = D002.StudentID
);
These are the conventions I was taught, but you should adapt to whatever you developement hose uses.
Plural. It is a collection of entities.
Yes. The attribute is a representation of singular property of an entity.
Yes, prefix table name allows easily trackable naming of all constraints indexes and table aliases.
Pascal Case for table and column names, prefix + ALL caps for indexes and constraints.

Resources