Normalize database table to 1NF - database

I am creating a database for a DVD rental shop. I have various entities that are related to this question, such as Film and FilmStar.
For each film, you record its unique number, title, the year in which it was made, its category (action adventure, science fiction, horror, romance, comedy, classic, children's), its director, and all stars that appeared in it. For each film, you also want to store the type of DVD hire (new release, classics, other).
I am mostly unsure about "all the stars that appeared in it". I first thought of just having an attribute in the 'Film' entity, for example filmStar, into which each star would be inserted, for example "John Doe, Jane Doe" for each film. But then I realised that this wouldn't be 1NF, since "the domains of attributes must include only atomic values, the value of an attribute must be a single value from the domain of that attribute": the attribute would contain more than one value and so isn't atomic.
I then thought about having a separate entity that contains certain attributes such as filmID and filmStarID. So John Doe would have the filmStarID of '0001' (all of this would be in the FilmStar entity, which is a separate entity). But then the same problem would occur: the filmID attribute would hold all of the filmIDs that the filmStar has starred in, for example John Doe would have "101, 115, 009", which again wouldn't be 1NF.
I was just wondering what your thoughts are on this?

What you're describing is a many-to-many relationship. Storing such a relationship would require a connecting table between the two related entities.
So you have two essential entities here:
Film
--------
ID
Title
etc.
CastMember
--------
ID
Name
etc.
Neither of these can store their relations to the other, because that would be a list of values rather than a single value. So the relationship itself essentially becomes an entity independent of the main entities. Something like this:
FilmCastMember
--------
FilmID
CastMemberID
NameInFilm
etc.
This relationship entity would be where you store any information specific to the relationship itself, but not descriptive of the entities being related. The lines above, for example, include NameInFilm which would be the character name played by that cast member in that film.
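As a minimal sketch of the connecting-table idea, the three entities above can be tried out with Python's built-in sqlite3 module. The film titles, character names and the composite primary key on FilmCastMember are assumptions made for illustration; only the table and column names come from the outline above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE Film (ID INTEGER PRIMARY KEY, Title TEXT NOT NULL);
CREATE TABLE CastMember (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE FilmCastMember (
    FilmID INTEGER NOT NULL REFERENCES Film(ID),
    CastMemberID INTEGER NOT NULL REFERENCES CastMember(ID),
    NameInFilm TEXT,
    PRIMARY KEY (FilmID, CastMemberID)  -- assumption: one row per film/star pair
);
""")

# Hypothetical sample data: one row per (film, star) pair, never a list.
conn.executemany("INSERT INTO Film VALUES (?, ?)",
                 [(101, "Film A"), (115, "Film B"), (9, "Film C")])
conn.executemany("INSERT INTO CastMember VALUES (?, ?)",
                 [(1, "John Doe"), (2, "Jane Doe")])
conn.executemany("INSERT INTO FilmCastMember VALUES (?, ?, ?)",
                 [(101, 1, "Hero"), (115, 1, "Villain"),
                  (9, 1, "Narrator"), (101, 2, "Sidekick")])

# All films John Doe appeared in: three atomic rows instead of "101, 115, 009".
films_for_john = [row[0] for row in conn.execute(
    "SELECT FilmID FROM FilmCastMember WHERE CastMemberID = ? ORDER BY FilmID",
    (1,))]

# All stars of film 101, found by joining through the connecting table.
stars_of_101 = [row[0] for row in conn.execute(
    """SELECT c.Name
       FROM CastMember c
       JOIN FilmCastMember fc ON fc.CastMemberID = c.ID
       WHERE fc.FilmID = ?
       ORDER BY c.Name""",
    (101,))]
```

Every column holds a single value per row, so the design stays in 1NF however many films a star appears in.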

Related

Storing Entities with User-defined Components in Relational Database

I'm struggling to find the best way to store entities with user-defined fields. I would like to be able to do queries on these fields, so I feel NoSQL may not be the best approach. Constant schema migrations seem like a pain, especially since different users may want different fields on similar entities.
For example, let's say we have an entity representing a village. The village has a name (West Town), a type (village), a population (114). The user may want to add their own attributes to the village, say, a nickname. This is not known up front, and may not be required for other villages.
The best technique I've come up with is a table for the entities, and then a separate table for "components" of the entities, consisting of: a component id, a foreign key to the entity it's on, the name of the component, and its value.
So, the village from the example would exist as:
Table 1 - Entity
ID
1
Table 2 - String Components
ID  ENTITY_ID  NAME        VALUE
1   1          name        West Town
2   1          type        village
Table 3 - Integer Components
ID  ENTITY_ID  NAME        VALUE
1   1          population  114
Then, if the user wanted to add a "nickname" to the village, they could push a button, select a string component, call it "nickname" and give it a value of "Wesson":
Table 2 - String Components
ID  ENTITY_ID  NAME        VALUE
1   1          name        West Town
2   1          type        village
3   1          nickname    Wesson
Then, when the entity needs to be displayed, we query the component tables for the entity ID, and display the information:
name: West Town
population: 114
type: village
nickname: Wesson
Is this crazy? It feels both sort of like an elegant way to represent a mutable schema in a relational database, and like trying to get around the whole point of a relational database. Is there a better way?
Answering my own question. This seems to generally be addressed using a pattern known as "entity-attribute-value" which is similar to what I've suggested.
The entities table could be a little richer, storing also information common to all entities, like "name" and maybe a foreign key into an "entity_type" table.
At its simplest, the attributes tables could be as above, with one for each data type.
https://en.wikipedia.org/wiki/Entity%E2%80%93attribute%E2%80%93value_model
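A rough sketch of the entity/component layout described in the question, using Python's sqlite3 module. The lowercase table names, the UNIQUE (entity_id, name) constraint and the load_entity helper are assumptions added for illustration, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY);
CREATE TABLE string_component (
    id INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    name TEXT NOT NULL,
    value TEXT NOT NULL,
    UNIQUE (entity_id, name)   -- assumption: one value per attribute name
);
CREATE TABLE integer_component (
    id INTEGER PRIMARY KEY,
    entity_id INTEGER NOT NULL REFERENCES entity(id),
    name TEXT NOT NULL,
    value INTEGER NOT NULL,
    UNIQUE (entity_id, name)
);
""")

conn.execute("INSERT INTO entity (id) VALUES (1)")
conn.executemany(
    "INSERT INTO string_component (entity_id, name, value) VALUES (?, ?, ?)",
    [(1, "name", "West Town"), (1, "type", "village"), (1, "nickname", "Wesson")])
conn.execute(
    "INSERT INTO integer_component (entity_id, name, value) "
    "VALUES (1, 'population', 114)")


def load_entity(entity_id):
    """Collect every component of one entity into a dict (hypothetical helper)."""
    rows = conn.execute(
        """SELECT name, value FROM string_component WHERE entity_id = ?
           UNION ALL
           SELECT name, value FROM integer_component WHERE entity_id = ?""",
        (entity_id, entity_id)).fetchall()
    return dict(rows)


village = load_entity(1)

# User-defined fields remain queryable with ordinary SQL.
big_places = [row[0] for row in conn.execute(
    "SELECT entity_id FROM integer_component "
    "WHERE name = 'population' AND value > 100")]
```

Adding the "nickname" field needed no schema change, only a new row. The trade-off is that the database can no longer enforce per-attribute rules such as "every village has a population".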

How to put this in a E/R diagram?

Have a simple question, but I think I am overthinking it. I need to make an E/R diagram out of this:
Substantial fees are due every calendar year. Fee payments must be
made via a bank transfer, mentioning the member number and the
membership year it applies to. The database should store the date of
payment.
I am ignoring calendar year, as I think it is not relevant for the E/R diagram. I have an entity called "Members" which I link to "Fee" via the relationship (diamond symbol) "paid via a bank transfer".
Now, my question is: should "member number" and "membership" be part of the "fee" entity or the "member" entity? Or both? Because I am thinking to add a new relationship to "fee" giving it the name "consists of" and then link "member number" and "membership", but I don't know whether that's good or not.
And what to do with the last sentence? "The database should store the date of payment."? Can I ignore it?
From your description I got:
You have entity sets Members and Payments
Members are identified by a member_number
Payments have attributes date and membership_year
Obviously, we also need:
Payments have an attribute amount
How are we going to identify Payments? No combination of the listed attributes is uniquely identifying, in my opinion. A Member could make two identical Payments on the same date with the same amount, for the same membership year, e.g. if they accidentally only paid half of the annual fee at first then made a second payment to correct.
Let's introduce a surrogate key:
Payments are identified by a payment_id
We also need a relationship between the two entity sets:
Each Payment is associated with a single Member
Each Member can make multiple Payments
We can put this info into an ER diagram:
To derive a table diagram, Chen's original method implemented every entity relation (entity key and attributes) and relationship relation (relationship keys (i.e. related entity keys) and relationship attributes) as separate tables:
However, it's common practice to denormalize tables with the same primary key:
I recommend you study Chen's paper The Entity-Relationship Model - Toward a Unified View of Data. Codd's paper A Relational Model of Data for Large Shared Databanks provides valuable background.
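For illustration, the Members/Payments model can be sketched as tables with Python's sqlite3 module. This follows the merged form, where the Payment-to-Member relationship is stored as a foreign key in the payments table. The column name payment_date, the amount stored in cents and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE members (member_number INTEGER PRIMARY KEY);
CREATE TABLE payments (
    payment_id INTEGER PRIMARY KEY,          -- surrogate key
    member_number INTEGER NOT NULL REFERENCES members(member_number),
    membership_year INTEGER NOT NULL,
    payment_date TEXT NOT NULL,              -- ISO date string (assumption)
    amount INTEGER NOT NULL                  -- in cents (assumption)
);
""")

conn.execute("INSERT INTO members VALUES (42)")

# Two payments identical in every listed attribute are both accepted,
# because only the surrogate payment_id has to be unique.
for _ in range(2):
    conn.execute(
        "INSERT INTO payments (member_number, membership_year, payment_date, amount) "
        "VALUES (42, 2024, '2024-01-15', 5000)")

# A payment for an unknown member violates the foreign key.
try:
    conn.execute(
        "INSERT INTO payments (member_number, membership_year, payment_date, amount) "
        "VALUES (99, 2024, '2024-01-15', 5000)")
    unknown_member_rejected = False
except sqlite3.IntegrityError:
    unknown_member_rejected = True

payment_ids = [row[0] for row in conn.execute(
    "SELECT payment_id FROM payments ORDER BY payment_id")]
total_paid = conn.execute(
    "SELECT SUM(amount) FROM payments "
    "WHERE member_number = 42 AND membership_year = 2024").fetchone()[0]
```

The NOT NULL foreign key expresses "each Payment is associated with a single Member", while nothing limits how many payments reference the same member.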

ER diagram that implements a database for trainee

I edited and remade the ERD. I have a few more questions.
I included participation constraints (between trainee and tutor), cardinality constraints (M means many), weak entities (double line rectangles), weak relationships (double line diamonds), composed attributes, derived attributes (white space with lines circle), and primary keys.
Questions:
Apparently to reduce redundant attributes I should only keep primary keys and descriptive attributes and the other attributes I will remove for simplicity reasons. Which attributes would be redundant in this case? I am thinking start_date, end_date, phone number, and address but that depends on the entity set right? For example the attribute address would be removed from Trainee because we don't really need it?
For the part: "For each trainee we like to store (if any) also previous companies (employers) where they worked, periods of employment: start date and end date."
Isn't "periods of employment: start date, end date" a composed attribute, because the dates are shown with the symbol ":"? Also I believe I didn't make an attribute for "where they worked", which is location?
Also how is it possible to show previous companies (employers) when we already have an attribute employers and different start date? Because if you look at the Question Information it states start_date for employer twice and the second time it says start_date and end_date.
I labeled many attributes as primary keys but how am I able to distinguish from derived attribute, primary key, and which attribute would be redundant?
Is there a multivalued attribute in this ERD? Would salary and job held be multivalued attributes, because an employer has many salaries and jobs?
I believe I did the participation constraints (there is one) and cardinality constraints correctly. But there are sentences where for example "An instructor teaches at least a course. Each course is taught by only one instructor"; how can I write the cardinality constraint for this when I don't have a relationship between course and instructor?
Do my relationship names make sense because all I see is "has" maybe I am not correctly naming the actions of the relationships? Also I believe schedules depend on the actual entity so they are weak entities.... so does that make course entity set also a weak entity (I did not label it as weak here)?
For the company address I put a composed attribute, street num, street address, city... would that be correct? Also would street num and street address be primary keys?
Also I added the final mark attribute to courses and course_schedule is this in the right entity set? The statement for this attribute is "Each trainee identified by: unique code, social security number, name, address, a unique telephone number, the courses attended and the final mark for each course."
For this part: "We store in the database all classrooms available on the site", do I make a composed attribute that contains site information?
Question Information:
A trainee may be self-employed or employee in a company
Each trainee identified by:
unique code, social security number, name, address, a unique
telephone number, the courses attended and the final mark for each course.
If the trainee is an employee in a company: store the current company (employer), start date.
For each trainee we like to store (if any) also previous companies (employers) where they worked, periods of employment: start date and end date.
If a trainee is self-employed: store the area of expertise, and title.
For a trainee that works for a company: we store the salary and job
For each company (employer): name (unique), the address, a unique telephone number.
We store in the database all known companies in the
city.
We need also to represent the courses that each trainee is attending.
Each course has a unique code and a title.
For each course we have to store: the classrooms, dates, and times (start time, and duration in minutes) the course is held.
A classroom is characterized by a building name and a room number and the maximum places’ number.
A course is given in at least a classroom, and may be scheduled in many classrooms.
We store in the database all classrooms
available on the site.
We store in the database all courses given at least once in the company.
For each instructor we will store: the social security number, name, and birth date.
An instructor teaches at least a course.
Each course is taught by only one instructor.
All the instructors’ telephone numbers must also be stored (each instructor has at least a telephone number).
A trainee can be a tutor for one or many trainees for a specific
period of time (start date and end date).
For a trainee it is not mandatory to be a tutor, but it is mandatory to have a tutor
The attribute ‘Code’ will be your PK because its only use seems to be that of a Unique Identifier.
The relationship ‘is’ will work but having a reference to two tables like that can get messy. Also you have the reference to "Employers" in the Trainee table which is not good practice. They should really be combined. See my helpful hints section to see how to clean that up.
Company looks like the complete table of Companies in the area as your details suggest. This would mean table is fairly static and used as a reference in your other tables. This means that the attribute ‘employer’ in Employed would simply be a Foreign Key reference to the PK of a specific company in Company. You should draw a relationship between those two.
It seems as though when an employee is ‘employed’ they are either an Employee of a company or self-employed.
The address field in Company will hold a unique address within your current city, since the question states that the table is a complete list of companies in the city. However, because this is a unique attribute, you must include specifics such as the street address: simply storing the city name would give every company the same address, which is forbidden in a unique field.
Some other helpful hints:
Stay away from adding fields with plurals on them to your diagram. When you have a plural field it often means you need a separate table with a Foreign Key reference to that table. For example, in your table Trainee you have ‘Employers’. That should be an Employer table with a foreign key reference to the Trainee Code attribute. In the Employer table you can combine the Self-employed and Employed tables so that there is a single reference from Trainee to Employer.
ERD Link http://www.imagesup.net/?di=1014217878605. Here's a quick ERD I created for you. Note the use of linker tables to prevent Many to Many relationships in the table. It's important to note there are several ways to solve this schema problem but this is just as I saw your problem laid out. The design is intended to help with normalization of the db. That is prevent redundant data in the DB. Hope this helps. Let me know if you need more clarification on the design I provided. It should be fairly self explanatory when comparing your design parameters to it.
Follow Up Questions:
If you are looking to reduce attributes that might be arbitrary perhaps phone_number and address may be ones to eliminate, but start and end dates are good for sorting and archival reasons when determining whether an entry is current or a past record.
Yes, periods_of_employment does not need to be stored, as you can derive that information from the start and end dates. "Where they worked", I believe, is just meant to say previous employers, so it is not a location; instead it means that you should be able to get a list of all the employers the trainee has had. You can get that with the current schema if you query the employer table for all records where the trainee code equals the requested trainee and sort by start date. The reason it states start_date twice is to let you know that all ‘previous’ employer records will have both a start and an end date, hence "previous". For the current employer, however, the employment hasn't ended, which means there is no end_date, so it will be null. That’s what the problem was stating, in my opinion.
To keep it simple PK’s are unique values used to reference a record within another table. Redundant values are values that you essentially don’t need in a table because the same value can be derived by querying another table. In this case most of your attributes are fine except for Final_Mark in the Course table. This is redundant because Course_Schedule will store the Final_Mark that was received. The Course table is meant to simply hold a list of all potential courses to be referenced by Course_Schedule.
There are no multivalued attributes in this design, because that is bad practice. Job and salary are singular, and if a job or salary changes you would add a new record to the employer table, not add to that column. Multivalued attributes make querying a db difficult and I would advise against them. That’s why I mentioned earlier to abstract all attributes with plurals into their own tables and use a foreign key reference.
You essentially do have that written here because Course_Schedule is a linker table meaning that it is meant to simplify relationships between tables so you don’t have many to many relationships.
All your relationships look right to me. Also since the schedules are linker tables and cannot exist without the supporting tables you could consider them weak entities. Course in this schema is a defined list of all courses available so can be independent of any other table. This by definition is not a weak entity. When creating this db you’d probably fill in the course table and it probably wouldn’t change after that, except rarely when adding or removing an available course option.
Yes, you can make address a composite attribute, and that would be right in your diagram. To be clear about your use of primary keys: just because an attribute is unique doesn’t make it a primary key. A table can have one and only one primary key, so you must pick a column that you are certain will not be repeated. In this example you may think street number might be unique, but what if one company leaves an address and another company moves into that spot? That would break that table’s primary key. Typically a company name is licensed in a city or state so cannot be repeated. That would be a better choice for your primary key. You can also make composite primary keys, but that is a more advanced topic that I would recommend reading about at a later date.
Take final_mark out of courses. That table will contain rows of only courses; those courses won’t be linked to any trainee except through the course_schedule table. The Final_Mark should only be in that table. If you added final_mark to the Course table then, with 10 trainees in a course, you’d have 10 duplicate rows in the Course table differing only in final_mark. Instead hold only the course_code and title; that way you can assign different instructors, trainees and classrooms using the linker tables.
No composite attribute is needed using this schema. You have a Classroom table that will hold all available classrooms and their relevant information. You then use the Classroom_Schedule linker table to assign a given Classroom to a Course_Schedule. No attributes of Classroom can be broken down to simpler attributes.
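The point above about deriving employment history from start and end dates can be sketched with Python's sqlite3 module. The linked ERD is not reproduced here, so the table layout, column names and sample rows below are hypothetical and only loosely follow the description.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE trainee (code INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE company (name TEXT PRIMARY KEY);
CREATE TABLE employer (                       -- one row per period of employment
    trainee_code INTEGER NOT NULL REFERENCES trainee(code),
    company_name TEXT NOT NULL REFERENCES company(name),
    start_date TEXT NOT NULL,
    end_date TEXT,                            -- NULL while the job is current
    job TEXT,
    salary INTEGER,
    PRIMARY KEY (trainee_code, company_name, start_date)
);
""")

conn.execute("INSERT INTO trainee VALUES (1, 'Sample Trainee')")
conn.executemany("INSERT INTO company VALUES (?)", [("Acme",), ("Globex",)])
conn.executemany(
    "INSERT INTO employer VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "Acme", "2018-01-01", "2020-06-30", "Clerk", 30000),
     (1, "Globex", "2020-07-01", None, "Analyst", 40000)])

# Full history: every employer the trainee has had, oldest first.
history = conn.execute(
    "SELECT company_name, start_date, end_date FROM employer "
    "WHERE trainee_code = ? ORDER BY start_date", (1,)).fetchall()

# Current employer: the period that has not ended yet.
current = [row[0] for row in conn.execute(
    "SELECT company_name FROM employer "
    "WHERE trainee_code = ? AND end_date IS NULL", (1,))]
```

A change of job or salary becomes a new row with its own dates, so no column ever needs to hold more than one value.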

How far does one go to eliminate duplicate data in a database?

How far does one go to eliminate duplicate data in a database? Because you could go OTT and it would get crazy. Let me give you an example...
If I were to create a Zoo database which contains a table 'Animal' which has a 'name', 'species' and 'country_of_birth'
But there will be duplicate data there as many animals could come from same country and there could be lots of tigers, for example.
So really there should be a 'Species' table and a 'Country_of_birth' table
But then after a while you would have tons of tables
So how far do you go?
In this question I am just using one table as an example. One row in the Animal table stores information about a single animal in the zoo. So that animal's name, species and country of birth, as well as a unique animalID.
But there will be duplicate data there as many animals could come from
same country and there could be lots of tigers, for example.
This suggests you want to keep track of individual animals, not just kinds of animals. Let's assume that the zoos use some kind of numeric tattoo or microchip to identify individual animals.
Assume this sample data is representative. (It's not, but it's ok for teaching.)
Animals
Predicate: Animal having microchip <chip_num> of species <species>
has name <name> and was born in <birth_country_code>.
chip_num  name    species          birth_country_code
--
101234    Anita   Panthera tigris  USA
101235    Bella   Panthera tigris  USA
101236    Calla   Panthera tigris  USA
101237    Dingo   Canis lupus      CAN
101238    Exeter  Canis lupus      CAN
101239    Bella   Canis lupus      USA
101240    Bella   Canis lupus      CAN
There's no redundant data in that table. None of those columns can be dropped without radically changing the meaning of that table. It has a single candidate key: chip_num. It's in 5NF.
Values are repeated in non-key columns. That's kind of the definition of non-key (non-prime) columns. Values in key columns (or sets of key columns) are unique; values in non-key columns aren't.
If you want to restrict the values in "birth_country_code" to the valid three-letter ISO country codes, you can add a table of valid three-letter ISO country codes, and set a foreign key reference to it. This is generally a Good Thing, but it has nothing to do with normalization.
iso_country_code
--
CAN
USA
You could do the same thing again for "species". That, too, would generally be a Good Thing, and it, too, would have nothing to do with normalization.
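As a small sketch of that foreign key reference, using Python's sqlite3 module: the table name iso_country_codes and the rejected sample row are assumptions, while the other values come from the sample data above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE iso_country_codes (iso_country_code TEXT PRIMARY KEY);
CREATE TABLE animals (
    chip_num INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    species TEXT NOT NULL,
    birth_country_code TEXT NOT NULL
        REFERENCES iso_country_codes(iso_country_code)
);
""")

conn.executemany("INSERT INTO iso_country_codes VALUES (?)", [("CAN",), ("USA",)])
conn.executemany("INSERT INTO animals VALUES (?, ?, ?, ?)", [
    (101234, "Anita", "Panthera tigris", "USA"),
    (101237, "Dingo", "Canis lupus", "CAN"),
])

# A code missing from the lookup table is rejected (hypothetical bad row).
try:
    conn.execute(
        "INSERT INTO animals VALUES (101241, 'Fern', 'Canis lupus', 'XXX')")
    bad_code_rejected = False
except sqlite3.IntegrityError:
    bad_code_rejected = True

animal_count = conn.execute("SELECT COUNT(*) FROM animals").fetchone()[0]
```

The animals table keeps exactly the same columns and values as before. The lookup table only restricts which values are allowed, which is why it is a data-integrity measure rather than a normalization step.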
First you decide what the table is supposed to carry information about. In your example, is the table about individual animals, or is it about species of animals and how many of each species? The fact that you have country of birth might be an indicator that someone wants it to be the former. If that is the case you must have a key that identifies individual animals. You have an attribute (a property) that is associated with individuals, so each row must (should?) represent an individual. You should read up on the database modeling concepts of identity and individuation.
And to do this properly, actually, you do this for each thing in your data model, and then convert that model into database tables.
It comes down to deciding what is important to your system.
Deciding whether something (your e.g. "country of birth") is merely an attribute or is instead a full-blown entity in its own right depends on what else your system needs to know about countries and how many attributes your system may track that are fully functionally dependent on the country.
You should also consider whether your attributes are susceptible to update anomalies. If your statement of country in the animal table is in the form of the full official name of the country, then you might be at risk if, for example, "The Belgian Congo" suddenly becomes "The Democratic Republic of the Congo" - oh wait, that already happened!
The rules of normalization are not sacrosanct. They are pretty darn useful rules of thumb that are intended to keep you out of trouble, most of the time. Still, rules are made to be broken - but you should only break them knowingly and with a carefully considered understanding of the consequences.

ERD - Entity relationship diagram - complex and tricky relations

Here is the scenario.
Two completely different Entities are independently related to the third entity in the same way. How do we represent it in the ERD? or (Enhanced ER)
Ex:
Student "BORROWS" BOOK (from the library)
DEPARTMENT "BORROWS" BOOK (from the same library).
If I define 'BORROWS' relationship twice, it would be awkward and clumsy in terms of appearance in the diagram, and increase the complexity of implementation as well.
At the same time, I can not declare a ternary relationship since STUDENT and DEPARTMENT are not inter-related in a relationship-instance.
However, I couldn't find a better way.
How do I solve it?
If Wikipedia is to be believed, Enhanced ER permits inheritance. Why don't you have a BORROWER entity (with the appropriate relationship), and have STUDENT and DEPARTMENT subclass that?
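One way such a BORROWER supertype might map to tables, sketched with Python's sqlite3 module. The shared-primary-key layout, all column names and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE borrower (borrower_id INTEGER PRIMARY KEY);
-- Each subtype shares its primary key with the supertype row.
CREATE TABLE student (
    borrower_id INTEGER PRIMARY KEY REFERENCES borrower(borrower_id),
    name TEXT NOT NULL
);
CREATE TABLE department (
    borrower_id INTEGER PRIMARY KEY REFERENCES borrower(borrower_id),
    name TEXT NOT NULL
);
CREATE TABLE book (book_id INTEGER PRIMARY KEY, title TEXT NOT NULL);
-- BORROWS is defined once, against the supertype.
CREATE TABLE borrows (
    borrower_id INTEGER NOT NULL REFERENCES borrower(borrower_id),
    book_id INTEGER NOT NULL REFERENCES book(book_id),
    PRIMARY KEY (borrower_id, book_id)
);
""")

conn.executemany("INSERT INTO borrower VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO student VALUES (1, 'Alice')")
conn.execute("INSERT INTO department VALUES (2, 'Physics')")
conn.executemany("INSERT INTO book VALUES (?, ?)",
                 [(10, "Database Basics"), (11, "Lab Manual")])
conn.executemany("INSERT INTO borrows VALUES (?, ?)", [(1, 10), (2, 11)])

# Who borrowed what, regardless of whether the borrower is a student
# or a department.
borrowed = conn.execute(
    """SELECT COALESCE(s.name, d.name), bk.title
       FROM borrows b
       JOIN book bk ON bk.book_id = b.book_id
       LEFT JOIN student s ON s.borrower_id = b.borrower_id
       LEFT JOIN department d ON d.borrower_id = b.borrower_id
       ORDER BY b.borrower_id""").fetchall()
```

This sketch does not stop one borrower_id from appearing in both subtype tables. Enforcing disjoint subtypes would need an extra discriminator column or a trigger.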
I've been having a similar issue - where a company or a person can order a product.
You've got an order, that can belong to either a person, or a company - so what do you link the relationship to? I'm thinking orders will have a companyId, and a personId foreign key, but how do you make them exclusive? The data returned won't necessarily be the same - a company doesn't have a first name / last name field for example.
I guess it could be done by having a name returned: in the case of a person, build the string out of firstname / lastname, and in the case of a company, use the companyname field.
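One possible way to make the two foreign keys exclusive is a CHECK constraint that requires exactly one of them to be set. The sketch below uses Python's sqlite3 module; the table and column names and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE person (
    person_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL
);
CREATE TABLE company (
    company_id INTEGER PRIMARY KEY,
    company_name TEXT NOT NULL
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    person_id INTEGER REFERENCES person(person_id),
    company_id INTEGER REFERENCES company(company_id),
    -- Exactly one of the two owner columns must be filled in.
    CHECK ((person_id IS NULL) != (company_id IS NULL))
);
""")

conn.execute("INSERT INTO person VALUES (1, 'Jane', 'Doe')")
conn.execute("INSERT INTO company VALUES (1, 'Acme Ltd')")
conn.execute("INSERT INTO orders VALUES (1, 1, NULL)")   # ordered by a person
conn.execute("INSERT INTO orders VALUES (2, NULL, 1)")   # ordered by a company

# An order owned by both a person and a company fails the CHECK.
try:
    conn.execute("INSERT INTO orders VALUES (3, 1, 1)")
    both_rejected = False
except sqlite3.IntegrityError:
    both_rejected = True

# One display name per order: first/last name for a person, otherwise the
# company name (the concatenation is NULL when there is no person row).
order_names = conn.execute(
    """SELECT o.order_id,
              COALESCE(p.first_name || ' ' || p.last_name, c.company_name)
       FROM orders o
       LEFT JOIN person p ON p.person_id = o.person_id
       LEFT JOIN company c ON c.company_id = o.company_id
       ORDER BY o.order_id""").fetchall()
```

The same CHECK also rejects an order with neither owner set, since both IS NULL tests would then be true.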

Resources