How do I normalize data in a database?

I'm in an intro to database management course and we're learning about normalizing data (1NF, 2NF, 3NF, etc.), and I'm confused about how to actually do it. I've read up on this and consulted various sites and YouTube videos, but it still hasn't clicked. I am using Microsoft Access 2013, if that's of any help.
This is the data I'm working with.
Thanks.
Edit1: Alright, I think I have the tables set up correctly. But now I'm having trouble actually inputting data to go from one table to the next. Here's my relationship table.

On a very basic level, any repeating values in a table are candidates for normalization. Duplicated data is usually a bad idea. Say you needed to update a patient's surname - you now have to update all the occurrences in this table, and possibly many others throughout the rest of the database. Much better to store each patient's details in one place only.
This is where normalization comes in. Looking down the columns, you can see that there are repeating values for data about dentists, patients and surgeries, so we should normalize towards having tables for each of these entities, as well as the original table that contains appointments, giving you four tables in total.
Extract the entities out into their own tables, and give each row a primary (unique) key - just use an incrementing integer for now. (Edit: as suggested in the comment we could use the natural keys of PatientNo, StaffNo and SurgeryNo instead of creating surrogates.)
Then, instead of each patient's name and number appearing multiple times in the appointments table, we just reference the key of the master record in the Patient table. This is called a foreign key.
Then, do the same for Dentist and Surgery.
You will end up with tables looking something like this:
APPOINTMENT
AppointmentID  DentistID  PatientID  AppointmentTime  SurgeryID
----------------------------------------------------------------
1              1          1          12 Aug 03 10:00  1
2              1          2          ...              2
3              2          3          ...              1
4              2          3          ...              1
5              3          2          ...              2
6              3          4          ...              3
DENTIST
DentistID  Name           StaffNo
--------------------------------------
1          Tony Smith     S1011
2          Helen Pearson  S1024
3          Robin Plevin   S1032
PATIENT
PatientID  Name           PatientNo
---------------------------------------
1          Gillian White  P100
2          Jill Bell      P105
3          Ian MackKay    P108
4          John Walker    P110
SURGERY
SurgeryID  SurgeryNo
-------------------------
1          S10
2          S15
3          S13
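To make the foreign keys concrete, here is a minimal sketch of the four tables using Python's built-in sqlite3 module rather than Access (an assumption made so the example is runnable anywhere; the snake_case column names are mine). It loads the rows shown above, leaves the appointment times that are elided in the tables as NULL, and joins everything back into the flat shape of the original sheet.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE dentist (dentist_id INTEGER PRIMARY KEY, name TEXT, staff_no TEXT UNIQUE);
CREATE TABLE patient (patient_id INTEGER PRIMARY KEY, name TEXT, patient_no TEXT UNIQUE);
CREATE TABLE surgery (surgery_id INTEGER PRIMARY KEY, surgery_no TEXT UNIQUE);
CREATE TABLE appointment (
    appointment_id   INTEGER PRIMARY KEY,
    dentist_id       INTEGER NOT NULL REFERENCES dentist(dentist_id),
    patient_id       INTEGER NOT NULL REFERENCES patient(patient_id),
    appointment_time TEXT,
    surgery_id       INTEGER NOT NULL REFERENCES surgery(surgery_id)
);
""")

# Fill the parent ("master") tables first, then the table that references them.
conn.executemany("INSERT INTO dentist VALUES (?, ?, ?)", [
    (1, "Tony Smith", "S1011"), (2, "Helen Pearson", "S1024"), (3, "Robin Plevin", "S1032")])
conn.executemany("INSERT INTO patient VALUES (?, ?, ?)", [
    (1, "Gillian White", "P100"), (2, "Jill Bell", "P105"),
    (3, "Ian MackKay", "P108"), (4, "John Walker", "P110")])
conn.executemany("INSERT INTO surgery VALUES (?, ?)", [(1, "S10"), (2, "S15"), (3, "S13")])
conn.executemany("INSERT INTO appointment VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, "12 Aug 03 10:00", 1),
    (2, 1, 2, None, 2),  # None: the time is elided ("...") in the tables above
    (3, 2, 3, None, 1), (4, 2, 3, None, 1), (5, 3, 2, None, 2), (6, 3, 4, None, 3)])

# Join the keys back to their master records to rebuild the flat view.
rows = conn.execute("""
    SELECT a.appointment_id, d.name, p.name, s.surgery_no
    FROM appointment a
    JOIN dentist d ON d.dentist_id = a.dentist_id
    JOIN patient p ON p.patient_id = a.patient_id
    JOIN surgery s ON s.surgery_id = a.surgery_id
    ORDER BY a.appointment_id""").fetchall()

# An appointment pointing at a patient that does not exist is rejected.
try:
    conn.execute("INSERT INTO appointment VALUES (7, 1, 99, NULL, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

The order of the inserts matters: a row in the appointment table can only be entered once the dentist, patient and surgery rows it refers to already exist.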

The first step in data modelling and normalization is to understand your data. Study it and understand the domain "objects" or tables that exist within your model. That will give you an idea of how to start. Sometimes a single table or query sample is not enough to fully understand the database, but in your case, we can use the sample data and make some assumptions.
Secondly, look for repeated / redundant data. If you see copies of names, there is a good chance that is a candidate for a foreign key. Our assumption tells us that STAFF_NO is a primary key candidate for DENTIST because each unique STAFF_NO correlates to a unique DENTIST_NAME, so I see a good candidate DENTIST table (STAFF_NO, DENTIST_NAME).
Example in some table of SURGERY:
ID STAFF_NO DENTIST_NAME
1 1 Fred Sanford
2 1 Fred Sanford
3 3 Lamont Sanford
4 3 Lamont Sanford
Why store these over and over? What happens when Fred says "But my correct name is Fred G Sanford" and you have to update your database? In the current table, you have to update the name in many rows. If you had normalized it, you'd have a single location for the name, in the DENTIST table.
So I can take the unique dentists and store them in DENTIST
create table DENTIST(staff_no integer primary key, dentist_name varchar(100));
-- One possible way to populate our dentist table is to use a distinct query from surgery
insert into DENTIST
select distinct staff_no, dentist_name from surgery;
STAFF_NO DENTIST_NAME
1 Fred Sanford
3 Lamont Sanford
SURGERY table now points to DENTIST table
ID STAFF_NO
1 1
2 1
3 3
4 3
And you can now create a view, VIEW_SURGERY, that joins the DENTIST_NAME back in to satisfy the needs of typical queries.
create view view_surgery as
select s.id, d.staff_no, d.dentist_name
from surgery s join dentist d
on s.staff_no = d.staff_no; -- join here
So now an update to DENTIST by the dentist primary key will update a single row.
update dentist set dentist_name = 'Fred G Sanford' where staff_no = 1;
And querying the view will show the updated name for N rows:
select * from view_surgery;
ID STAFF_NO DENTIST_NAME
1 1 Fred G Sanford
2 1 Fred G Sanford
3 3 Lamont Sanford
4 3 Lamont Sanford
In short, you are removing redundancy.
This is just a sample, and one way to do it. Manual normalization like this is not as common when you have modelling tools, but the point is, we can look at data, spot redundancies and factor those redundancies into new tables, and relate those new tables by foreign keys and joins, then build views to represent the original data.
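For anyone who wants to run the whole sequence end to end, here is a rough sketch using Python's built-in sqlite3 module. The name surgery_flat for the original, redundant table is my own; the rest follows the statements above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The original table, with the dentist name repeated on every row
CREATE TABLE surgery_flat (id INTEGER PRIMARY KEY, staff_no INTEGER, dentist_name TEXT);
INSERT INTO surgery_flat VALUES
    (1, 1, 'Fred Sanford'), (2, 1, 'Fred Sanford'),
    (3, 3, 'Lamont Sanford'), (4, 3, 'Lamont Sanford');

-- Factor the repeated facts out into DENTIST
CREATE TABLE dentist (staff_no INTEGER PRIMARY KEY, dentist_name TEXT);
INSERT INTO dentist SELECT DISTINCT staff_no, dentist_name FROM surgery_flat;

-- SURGERY keeps only the key that points at DENTIST
CREATE TABLE surgery (id INTEGER PRIMARY KEY,
                      staff_no INTEGER REFERENCES dentist(staff_no));
INSERT INTO surgery SELECT id, staff_no FROM surgery_flat;

-- The view joins the name back in for everyday queries
CREATE VIEW view_surgery AS
SELECT s.id, d.staff_no, d.dentist_name
FROM surgery s JOIN dentist d ON s.staff_no = d.staff_no;
""")

cur = conn.execute(
    "UPDATE dentist SET dentist_name = 'Fred G Sanford' WHERE staff_no = 1")
rows_touched = cur.rowcount  # a single row is updated
view_rows = conn.execute("SELECT * FROM view_surgery ORDER BY id").fetchall()
```

Here rows_touched is 1, while view_rows shows the new name on both of Fred's surgery rows.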

Related

Simple database design - some columns have multiple values

Caveat: very new to database design/modeling, so bear with me :)
I'm trying to design a simple database that stores information about images in an archive. Along with file_name (which is one distinct string), I have fields like genre and starring where each field might contain multiple strings (if an image is associated with multiple genres, and/or if an image has multiple actors in it).
Right now the database is just a single table keyed on file_name, and the fields like starring and genre just have multiple comma-separated values stored. I can query it fine by using wildcards and like and in operators, but I'm wondering if there's a more elegant way to break out the data such that it is easier to use/query. For instance, I'd like to be able to find how many unique actors are represented in the archive, but I don't think that's possible with the current model.
I realize this is a pretty elementary question about data modeling, but any guidance anyone can provide or reading you can direct me to would be greatly appreciated!
Thanks!

You need to create extra tables to keep the design normalized. In your situation you need 4 extra tables to represent these n->m (many-to-many) relations (2 extra would be enough if the relations were 1->n).
Tables:
image(id, file_name)
genre(id, name)
image_genres(image_id, genre_id)
stars(id, name, ...)
image_stars(image_id, star_id)
And some data in tables:
image table
id  file_name
1   /users/home/song/empire.png
2   /users/home/song/promiscuous.png
genre table
id  name
1   pop
2   blues
3   rock
image_genres table
image_id  genre_id
1         2
1         3
2         1
stars table
id  name
1   Jay-Z
2   Alicia Keys
3   Nelly Furtado
4   Timbaland
image_stars table
image_id  star_id
1         1
1         2
2         3
2         4
For unique actor count in database you can simply run the sql query below
SELECT COUNT(name) FROM stars
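Getting from the comma-separated columns to these tables is mostly a matter of splitting each string and looking up (or creating) an id for every value. A minimal sketch with Python's sqlite3 follows; the old_rows list is a hypothetical stand-in for the existing single table, and get_or_create is a helper name I made up.

```python
import sqlite3

# Hypothetical rows in the old layout: (file_name, genre, starring)
old_rows = [
    ("/users/home/song/empire.png", "blues, rock", "Jay-Z, Alicia Keys"),
    ("/users/home/song/promiscuous.png", "pop", "Nelly Furtado, Timbaland"),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE image (id INTEGER PRIMARY KEY, file_name TEXT UNIQUE);
CREATE TABLE genre (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE stars (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
CREATE TABLE image_genres (image_id INTEGER REFERENCES image(id),
                           genre_id INTEGER REFERENCES genre(id),
                           PRIMARY KEY (image_id, genre_id));
CREATE TABLE image_stars (image_id INTEGER REFERENCES image(id),
                          star_id INTEGER REFERENCES stars(id),
                          PRIMARY KEY (image_id, star_id));
""")

def get_or_create(table, name):
    """Return the id of `name` in a lookup table, inserting it if missing.

    `table` is only ever one of the fixed strings below, never user input.
    """
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    return conn.execute(
        f"SELECT id FROM {table} WHERE name = ?", (name,)).fetchone()[0]

for file_name, genres, starring in old_rows:
    image_id = conn.execute(
        "INSERT INTO image (file_name) VALUES (?)", (file_name,)).lastrowid
    for g in genres.split(","):
        conn.execute("INSERT INTO image_genres VALUES (?, ?)",
                     (image_id, get_or_create("genre", g.strip())))
    for s in starring.split(","):
        conn.execute("INSERT INTO image_stars VALUES (?, ?)",
                     (image_id, get_or_create("stars", s.strip())))

# Each actor is stored once, so counting rows counts unique actors.
actor_count = conn.execute("SELECT COUNT(*) FROM stars").fetchone()[0]

# Finding images by genre becomes a join instead of a LIKE '%rock%' search.
rock_images = [r[0] for r in conn.execute("""
    SELECT i.file_name FROM image i
    JOIN image_genres ig ON ig.image_id = i.id
    JOIN genre g ON g.id = ig.genre_id
    WHERE g.name = 'rock'""")]
```

The UNIQUE constraint on name is what keeps an actor who appears in several images from being stored twice.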

At which step of normalization should I introduce the new primary key column?

I have a table with the following data.
I need to Normalize them as per Normalization Rules, but I am confused, on which step of normalization I should introduce CustomerId as new PrimaryKey column.
CustomerName Address ObjectRented objectCatetory
-------------------------------------------------------
Mr A Street 1 Obj1,Obj2 Cat1,Cat1
Mr B Street 2 Obj3,Obj4 Cat2,Cat2
MR B Street 3 Obj2 Cat1

In this particular example, you can introduce primary key column even before doing normalization. Your table could just look denormalized like this after adding primary key (simplified structure):
CustomerID CustomerName Address ObjectRented
---------- ------------- ------- ------------
1 Mr A Street 1 Obj1,Obj2
2 Mr B Street 2 Obj3,Obj4
2 Mr B Street 3 Obj2
I am writing this rather quickly, so please do read up on other answers and blogs about normal forms as well.
1NF - Remove repeating group
CustomerID CustomerName Address ObjectRented
---------- ------------- ------- ------------
1 Mr A Street 1 Obj1
1 Mr A Street 1 Obj2
2 Mr B Street 2 Obj3
2 Mr B Street 2 Obj4
2 Mr B Street 3 Obj2
2NF - Remove partial dependency
A CustomerID identifies a customer who lives at a particular location. Keep those details in a single table. A customer can rent whatever they like, so keep whatever they rented in a different table, like this:
Customers
CustomerID CustomerName Address
---------- ------------- -------
1 Mr A Street 1
2 Mr B Street 2
2 Mr B Street 3
ObjectRental
CustomerID ObjectRented
---------- ------------
1 Obj1
1 Obj2
2 Obj3
2 Obj4
2 Obj2
During this stage, you can also move Objects into its own table
Objects
ObjectID ObjectName
-------- ----------
1 Obj1
2 Obj2
3 Obj3
4 Obj4
ObjectRental becomes
CustomerID ObjectRentedID
---------- ------------
1 1
1 2
2 3
2 4
2 2
At this point I believe you have automatically gained 3NF. In 3NF, loosely speaking, you'd want to make sure that no non-key column depends on another non-key column; every non-key column should depend only on its table's primary key.

Normalization doesn't require you to introduce surrogate keys. It's not about eliminating duplicate values either. The purpose of normalization is to avoid recording the same fact more than once, since multiple instances of the same fact could be updated inconsistently and cause anomalies.
The 1st normal form is slightly different from the rest, since its purpose is to ensure that your data is in a regular arrangement of scalar values so that normalization can be applied systematically. For the 1st normal form, you need to ensure that every field of every row contains a single value. "Obj1,Obj2" looks like multiple values, so you might want to start there. Then, write out your functional dependencies before proceeding to the higher normal forms.
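As a rough illustration of that first step, the sketch below splits the comma-separated fields from the question into one row per rented object, then checks a candidate functional dependency against the result. It is plain Python, and it assumes the n-th category in a row belongs to the n-th object in the same row, which the question does not state. The helper names are hypothetical.

```python
# Rows as in the question: (CustomerName, Address, ObjectRented, ObjectCategory)
raw = [
    ("Mr A", "Street 1", "Obj1,Obj2", "Cat1,Cat1"),
    ("Mr B", "Street 2", "Obj3,Obj4", "Cat2,Cat2"),
    ("MR B", "Street 3", "Obj2", "Cat1"),
]

def to_first_normal_form(rows):
    """Emit one row per rented object so every field holds a single value."""
    out = []
    for name, address, objects, categories in rows:
        # Assumption: objects and categories are paired by position.
        for obj, cat in zip(objects.split(","), categories.split(",")):
            out.append((name, address, obj.strip(), cat.strip()))
    return out

def holds(rows, lhs, rhs):
    """Check whether column `lhs` determines column `rhs` in the sample rows."""
    seen = {}
    for row in rows:
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

flat = to_first_normal_form(raw)

# Candidate dependency: ObjectRented -> ObjectCategory
object_determines_category = holds(flat, 2, 3)
```

A dependency that holds in sample data is only a hint; whether an object always has exactly one category is a fact about the business that the data alone cannot prove.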

Best way to store results data in database? [duplicate]

I have results data like this:
1. account, name, #, etc
2. account, name, #, etc
...
10. account, name, #, etc
I have approximately 1 set of results data generated each week.
Currently it's stored like so:
DATETIME DATA_BLOB
Which is annoying because I can't query any of the data without parsing the BLOB into a custom object. I'm thinking of changing this.
I'm thinking of having one giant table:
DATETIME RANK ACCOUNT NAME NUMBER ... ETC
date1 1 user1 nn #
date1 2 user2 nn #
...
date1 10 userN nn #
date2 1 user5 nn #
date2 2 user12 nn #
...
date2 10 userX nn #
I don't know anything about database design principles, so can someone give me feedback on whether this is a good approach or there might be a better one?
Thanks

I think it is ok to have a table like that if there are no one-to-many relationships. If there are, it would be more efficient to have multiple tables, as in my example below. Here are some general tips as well:
Tip: Good practice - My professor told me that it's always good to have an "ID" column, which is a unique number identifier for each item in the table (1, 2, 3… etc.). (Perhaps that was the intent of your "Number" column.) SQLite, for one, gives ordinary tables an implicit rowid column anyway.
Tip: Saving storage space - Also, if there is a one-to-many relationship (example: one name has many accounts) then it might save space to have a separate table for the accounts, and then store the ID of the name in the first table- so that way you are storing many ints instead of duplicate strings.
Tip: Efficiency - Some databases have specific frameworks designed to handle relationships such as many-to-one or many-to-many, so if you use their framework for that (I don't remember exactly how to do it) it will probably work more efficiently.
Tip: Saving storage space - If you make your own ID column it might be a waste if the database automatically includes an "ID" column anyway, so you might want to check for that possibility.
Conceptual Example: (Storing multiple accounts for the same name)
Poor Solution:
Storing everything in 1 table (inefficient, because it duplicates Bob's name, rank, and datetime):
ID NAME RANK DATETIME ACCOUNT
1 Bob 1 date1 bob_account_1
2 Joe 2 date2 user2_joe
3 Bob 1 date1 bob_account_2
4 Bob 1 date1 bobs_third_account
Better Solution: Having 2 tables to prevent duplicated information (Also demonstrates the usefulness of ID's). I named the 2 tables "Account" and "Name."
Table 1: "Account" (Note that NAME_ID refers to the ID column of Table 2)
ID NAME_ID ACCOUNT
1 1 bob_account_1
2 2 user2_joe
3 1 bob_account_2
4 1 bobs_third_account
Table 2: "Name"
ID NAME RANK DATETIME
1 Bob 1 date1
2 Joe 2 date2
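The two-table layout can be tried out with a short sketch like the one below, written against Python's built-in sqlite3 module. The column names result_rank and result_date are my own renamings of RANK and DATETIME, chosen to stay clear of SQL function names; the data is the Bob and Joe example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE name (id INTEGER PRIMARY KEY, name TEXT,
                   result_rank INTEGER, result_date TEXT);
CREATE TABLE account (id INTEGER PRIMARY KEY,
                      name_id INTEGER NOT NULL REFERENCES name(id),
                      account TEXT);
INSERT INTO name VALUES (1, 'Bob', 1, 'date1'), (2, 'Joe', 2, 'date2');
INSERT INTO account VALUES
    (1, 1, 'bob_account_1'), (2, 2, 'user2_joe'),
    (3, 1, 'bob_account_2'), (4, 1, 'bobs_third_account');
""")

# Rebuild the single-table view on demand with a join.
flat = conn.execute("""
    SELECT a.id, n.name, n.result_rank, n.result_date, a.account
    FROM account a JOIN name n ON n.id = a.name_id
    ORDER BY a.id""").fetchall()

# Bob's details are stored once, however many accounts he has.
accounts_per_name = conn.execute("""
    SELECT n.name, COUNT(a.id)
    FROM name n JOIN account a ON a.name_id = n.id
    GROUP BY n.id ORDER BY n.id""").fetchall()
```

The join gives back exactly the rows of the one-table version, so nothing is lost by splitting it.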
I'm not a database expert so this is just some of what I learned in my internet programming class. I hope this helps lead you in the right direction in further research.

Query a Lookup Table in Excel or Access

I've got a mental block about what I'm sure is a common scenario:
I have some data in a csv file that I need to do some very basic reporting from.
The data is essentially a table with Resources as column headings and People as row headings, the rest of the table consists of Y/N flag, "Y" if the person has access to the resource, "N" if they don't. Both the resources and the people have unique names.
Sample data:
Res1 Res2 Res3
Bob Y Y N
Tom N N N
Jim Y N Y
The table is too large to simply view as a whole in Excel (say 300 resources and 600 people), so I need a way to easily query and display (a simple list would be ok) what resources a person has access to, given the person's name.
The person that will need to use this has MS Office, and not much else on their PC.
So, the question is: What is the best way to manipulate this data to get the report I need? My gut says MS Access would be the best, but I can't figure out how to automatically import data like this into a normal relational database. If not Access, perhaps there are some functions in Excel that could help me out?

You should normalize your data. This will make it easier to query against. For example:
table users:
UserID UserName
1 Bob
2 Tom
3 Jim
table resources:
ResourceID ResourceDesc
1 Printer #1
2 Fax Machine
3 Bowling Ball Wax
table users_resources:
LinkID UserID ResourceID
1 1 1
2 1 2
3 3 1
4 3 3
SELECT users_resources.ResourceID
FROM users_resources, users
WHERE users_resources.UserID = users.UserID
AND users.UserName="Bob"
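The import step, turning the wide Y/N grid into one row per person-resource pair, is the part that is awkward to do by hand. If Python happens to be available, a sketch like this one does it with the standard csv and sqlite3 modules. The csv_text string stands in for the real file, and the sketch assumes the first row holds the resource names with an empty first cell. It also uses the (UserID, ResourceID) pair as the key instead of a separate LinkID.

```python
import csv
import io
import sqlite3

# Stand-in for the real CSV file (assumed layout: header row, then one row per person)
csv_text = """,Res1,Res2,Res3
Bob,Y,Y,N
Tom,N,N,N
Jim,Y,N,Y
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (UserID INTEGER PRIMARY KEY, UserName TEXT UNIQUE);
CREATE TABLE resources (ResourceID INTEGER PRIMARY KEY, ResourceDesc TEXT UNIQUE);
CREATE TABLE users_resources (
    UserID     INTEGER REFERENCES users(UserID),
    ResourceID INTEGER REFERENCES resources(ResourceID),
    PRIMARY KEY (UserID, ResourceID));
""")

reader = csv.reader(io.StringIO(csv_text))
header = next(reader)

# One resources row per column heading, remembered in column order.
resource_ids = [
    conn.execute("INSERT INTO resources (ResourceDesc) VALUES (?)", (name,)).lastrowid
    for name in header[1:]
]

for person, *flags in reader:
    user_id = conn.execute(
        "INSERT INTO users (UserName) VALUES (?)", (person,)).lastrowid
    # Only the "Y" cells become rows in the link table.
    conn.executemany(
        "INSERT INTO users_resources VALUES (?, ?)",
        [(user_id, rid) for rid, flag in zip(resource_ids, flags)
         if flag.strip().upper() == "Y"])

def resources_for(user_name):
    """List the resources a person has access to, by name."""
    return [r[0] for r in conn.execute("""
        SELECT r.ResourceDesc
        FROM users u
        JOIN users_resources ur ON ur.UserID = u.UserID
        JOIN resources r ON r.ResourceID = ur.ResourceID
        WHERE u.UserName = ?
        ORDER BY r.ResourceID""", (user_name,))]
```

With the sample data, resources_for("Bob") gives ['Res1', 'Res2'] and resources_for("Tom") gives an empty list.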

Database relation many to many

alt text http://produits-lemieux.com/database.jpg
This is basically my database structure.
One product (let's say soap) will have many retail selling sizes:
1 liter
4 liters
20 liters
In my "produit" table I will have the soap item (id #1).
In the size table I will have the many sizes available:
1liter
4liter
20liter
How do I avoid duplicating the product 3 times, once per size? I'd like to have a check box in the product form for every size available in the database, and check yes or no (boolean) for each.
The answer I got is perfect, but how do I get an option like this:
soap [x] 1 liter , [ ] 4 liter , [x] 20 liter

I'm not sure I understand your exact scenario, but to create a many-to-many relationship, you simply create a "relationship table", in which you store id's for the two records you want to link.
Example:
Products
********
ProductID (PK)
Price
Retailers
*********
RetailerID (PK)
Name
ProductRetailerRelationships
****************************
ProductID
RetailerID

A many-to-many relationship is almost always modeled using an intermediate table. For your example,
Product
--------
prod_numero
...
Size
--------
size_numero
...
Product_Size
--------
prod_numero
size_numero
...
The Size table would contain particular sizes (say, 10 liter) and the Product_Size table creates a Product and Size pairing.
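For the check-box display asked about in the question, one possible approach is to LEFT JOIN every size to the Product_Size rows of a single product: a size that has a match is ticked, a size without one is not. Below is a small sketch with Python's sqlite3. The name and label columns, the sample rows and the checkbox_line helper are all my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (prod_numero INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE size (size_numero INTEGER PRIMARY KEY, label TEXT);
CREATE TABLE product_size (
    prod_numero INTEGER REFERENCES product(prod_numero),
    size_numero INTEGER REFERENCES size(size_numero),
    PRIMARY KEY (prod_numero, size_numero));
INSERT INTO product VALUES (1, 'soap');
INSERT INTO size VALUES (1, '1 liter'), (2, '4 liter'), (3, '20 liter');
-- soap is sold in the 1 liter and 20 liter sizes only
INSERT INTO product_size VALUES (1, 1), (1, 3);
""")

def checkbox_line(prod_numero):
    """Render every known size with a ticked or empty box for one product."""
    name = conn.execute(
        "SELECT name FROM product WHERE prod_numero = ?",
        (prod_numero,)).fetchone()[0]
    rows = conn.execute("""
        SELECT s.label, ps.prod_numero IS NOT NULL
        FROM size s
        LEFT JOIN product_size ps
               ON ps.size_numero = s.size_numero AND ps.prod_numero = ?
        ORDER BY s.size_numero""", (prod_numero,)).fetchall()
    boxes = " , ".join(
        f"[{'x' if checked else ' '}] {label}" for label, checked in rows)
    return f"{name} {boxes}"

line = checkbox_line(1)
```

Ticking a box then means inserting one row into product_size, and unticking it means deleting that row; the product itself is never duplicated.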

You will need an intermediary, or "join", table:
ProductSizes
.......................
ProductID
SizeID
One record for each product-size combination

Based on the answers, here is the database table layout as proposed. It looks complicated to me; are you sure this is the way to do it, the BEST solution?
alt text http://produits-lemieux.com/database2.jpg
