Am I Properly Normalizing this Data - database

I am completing normalization exercises from the web to test my abilities to normalize data. This particular problem was found at: https://cs.senecac.on.ca/~dbs201/pages/Normalization_Practice.htm (Exercise 1)
The table this problem is based of is as follows:
The unnormalized table that can be created from this table is:
To comply with First Normal form, I have to get rid of repeating fields in the table by moving visitdate, procedure_no, and procedure_name to their own respective tables:
This also complies with 2NF and 3NF which makes me question whether I have performed the process of normalization correctly. Please provide feedback if I did not properly move from UNF to 1NF.

In a first step you could create the following tables (assuming pet_id is unique in the table):
Pets: pet_id, pet_name, pet_type, pet_age, owner
Visits: pet_id, visit_date, procedure
Going further you could split procedure since the description is repeating:
Pets: pet_id, pet_name, pet_type, pet_age, owner
Visits: pet_id, visit_date, procedure_id
Procedures: procedure_id, description
Although there can be multiple procedures on the same visit_date for the same pet_id, I see no reason to split those further: a date could (in theory) be stored in 2 bytes, and splitting that data would create more overhead (plus an extra index).
You would also want to change pet_age to pet_birth_date since the age changes over time.
Since this is the first exercise in your list, the above will probably be more than enough.
Going even further:
An owner can have multiple pets, so another table could be created:
Pet_owners: owner_id, owner_name
and then only use owner_id in the Pets table. In a real system there would be customer_id, name, address, phone, email, etc. - so that should always be in a separate table.
You could even do the same for pet_type and store the id in 1 or 2 bytes, but it all depends on the type of queries you want to do later on the data.

The question is poorly presented. Look at the last two columns. The askers do not mean that each column's types are sets. They mean that pairs of values on the same line make an element of a set. They should have had one column whose values were triplets--date, number & name. That's what they did when they used just one column (the last one) for number & name. Notice that their solution in the pdf linked to by the page you link to has a table that has all three of date, number & name.
But how are you supposed to know that the values should be paired? After all if the date column gave the set of a pet's visit dates & the procedure column gave the set of procedure number & names a pet ever had then we wouldn't be supposed to take a pair of values on the same line as an element of a set. Unfortunately you are just supposed to magically guess correctly. (A hint is that the number of dates & number-name pairs for a pet are always the same.)
The above took the blank areas in the illustration to be there to make room for the vertical display of set-valued attributes; the portrayed table has 4 rows. But maybe they are there because you are supposed to get a relation from this illustration by interpreting a blank subrow as representing the most recent non-blank subrow. Then the table wouldn't have any set-valued columns; the portrayed table has 9 rows. It happens that this interpretation disagrees with the linked answer's UNF & 1NF sections.
If they weren't going to explain the table & were just relying on your guesses it would have been clearer if they put a visit's procedure date, number & name under one column--just as they put a procedure number & name in one column. But really, they should always tell you how to read the illustration. But really, you should always ask how read an illustration. If you have any interpretation conventions from a related course/textbook then you should have put it in your question for us to know.
Unfortunately "UNF" tables are almost always similarly poorly given without any description about how they are to be interpreted. Also "1NF" has no standard meaning & there is no standard notion of "normalizing to 1NF".

Related

One-To-Many relathionship task

I have a table Subject
It has many fields, two of them are code and flag.
Earlier those two fields was an idempotention key for rows in this table.
But, now I need one more option system.
There are tens of rows in Subject
And 4-7 systems.
What is a better way?
Create table System for systems and create cross-table of mapping sysytems on subjects (code and flag are still in table Subject)
Create one table of mapping without creating table System
Just add another column in table Subject
Create table System and add to the table Subject foreign key for table System?
So, It's all about database normalization.
And the third option is pretty bad.
As for me the better way is fourth option.
But, I can`t explain to yourself why this option is better than 1 and 2.
So, I read rules of database normalization. And as for me, the first option satisfies all rules too.
This is the reason why I am asking this question.
It is not typically a great idea to design a SQL schema to 1st or 2nd normal form; many databases use something at or near 3rd normal form however there many still be some relationships in 3rd normal form where dependencies exist where redundancy still exists. This can be addressed by Boyce–Codd Normal Form (Codd, 1974).
It is also not typical to see 4th normal form and less so 5th normal form and beyond due to challenges with data maintenance in a "living" database.
Let's put this another way.
If you find yourself creating NULLable values constraints on many columns consider a table to contain those in an organized fashion - for example an Address table for addresses with a linking table from say for example a Person to a PersonAddress linking table to that Address table where the PersonAddress linking table might even have an AdddressTypeId column which links to an AddressType table with rows for Address Type Postal and Address Type Street or Address Type Business. For another example consider email addresses where people have personal, family, business and other email address types; even multiple of the same type for different uses; a doctor with a business practice email and 2-3 hospital email addresses where the doctor practices.
Linking tables for those type scenarios are likely better than 3-4 or more email or postal address columns in one table where many after the first are nullable or perhaps redundant.
Review your data; consider if your Subject for example may link to multiple System or placing a SubjectId column may lead to duplicates of that ID for differing system rows. If it is always and forever a 1-1 relationship it may be OK but for a 1-n or n-n it may not be ok to have the id in the other table and a linking table may provide a good mechanism to link them.

Need help in developing DB logic

This is a mini-project of mine - Airline reservation system - lets call this airline FlyMi : I have a database (Not decided which one, friend of mine wants to go with MongoDB). Anyhoo, this is my requirement :
I have a table which has details of the flight - Flight number, schedule etc. I'm going to use this table to perform various operations - booking , cancellation , modification
This is where I'm stuck : For the desktop app and the web application - I'm offering an option to select seats. This means I've got to keep track of which seats are booked , which ones are not. And assume I have an UI , which shows seats as Red - Booked Green - Not Booked.And all of this - for each and every flight. My question is : What do you think would be the most efficient way to track seat bookings , for each flight in that airline?
Current Idea : Keep a table named passenger - with all the details such as name , address etc. which keep track of all passengers, and maintain a passenger ID such that , first 4 characters are flight ID, Last 2 character are seat numbers they have chosen, with random number in-between ( I say random because I think it is immaterial here). So, for any flight , If I have to find out number of un-booked seats, I will have to scan through every passenger , who has booked, and who has booked in that flight. I think this is really in-efficient. Provide me with the most efficient logic to do this.
Don't use "smart keys".
This is a bad idea called "smart keys" or "encoding information in keys".
See this answer which contains this excerpt:
Despite it now being easy to implement a Smart Key, it is hard to recommend that you create one of your own that isn't a natural key, because they tend to eventually run into trouble, whatever their advantages, because it makes the databases harder to refactor, imposes an order which is difficult to change and may not be optimal for your queries, requires a string comparison if the Smart Key includes non-numeric characters, and is less effective than a composite key in helping range-based aggregations. It also violates the basic relational guideline that every column should store atomic values
Smart Keys also tend to outgrow their original coding constraints
(Notice that seat locations are typically identified by smart keys in that they are a row number and a count across a row. But they are also typically visibly physically permanently bolted into that formation. And imagine if they were labelled and rearranged.)
Educate yourself about database design.
Just describe your business in the most straightforward terms. That is how relational model databases & DBMSs work.
Find enough fill-in-the-[named-]blanks sentence templates to describe your business situations:
"customer [cid] has name [firstname] [lastname]
AND customer [cid] has a phone number [phonenumber] of type [type] ..."
"customer [cid] can use credit card #[card_no]"
"seat [seatid] is at row [row] and column [column]"
"seat [seatid] is booked"
"seat [seatid] is temporarily committed to an unfinished booking"
...
For each such parameterized sentence template (aka predicate) have a base table where the names of the blanks/parameters are column names. Each row in a table states the statement (proposition) got from filling in the blanks per its column values; each row not in a table states NOT the statement from filling in the blanks per its column values.
Then for each table find every functional dependency (FD) that holds. (When a predicate can be expressed in the form "... AND column = F(column1,...)" then we say that column set {column1,...} functionally determines column column and that FD set → column holds.) Then identify every candidate key (CK). (A superkey is a column set that functionally determines every column. Ie that is unique, ie where each subrow of values for those columns appears only in one row of a table. A CK is a superkey that doesn't contain a smaller superkey.) Then find every join dependency (JD). (Some predicates say "... AND ..." for some number of ANDs & "..."s. There is a JD when the table for each predicate "..." would look like what you get from taking only its columns from the original table.) Note that every FD comes with an associated (binary) JD.
Then normalize your tables to fifth normal form (5NF). This means decomposing (ie replacing a table in which JD "... AND ..." holds by tables whose predicates are the "..."s) until each JD that holds is implied by the CKs (ie must hold when the JDs from the FDs from the CKs hold.) (For performance reasons one can also then denormalize by combining to base tables that aren't in 5NF.)
See this answer and this one.
Then we query by describing the rows we want. We do this by connecting base table predicates with logical operators (ie AND, OR, NOT, FOR SOME, FOR ALL etc) and function calls to give the predicates for the tables we want and/or by connecting base table names by relation operators (ie JOIN, UNION, MINUS/EXCEPT, PROJECT/SELECT, RENAME/AS) to give the values of the tables we want and/or both (eg RESTRICT/WHERE).
The JOIN of two tables holds the rows that make a true statement from, ie has as predicate, the AND of their predicates; and the UNION the OR, the MINUS/EXCEPT the AND NOT; and that PROJECT/SELECT columns of a table puts FOR SOME all-other-columns before its predicate; and RESTRICT/WHERE puts AND condition after its predicate; and the RENAME/AS of column renames that parameter in its predicate. So a table expression corresponds to a predicate: A table (base table or query result) value contains the rows that make a true statement from its (base table's or query expression's) predicate.
See this answer.
The same goes for constraints, which are true statements that collectively describe the application situations and database states than can arise given the situations that can arise and the base table predicates.
See this answer.

T-SQL: query which joins with all dependent tables and produce cartesian product

I have a bunch of tables which refer to some number of other tables (zero, one, two or more).
My example tables might contain following columns:
Id | StatementTable1Id | StatementTable2Id | Value
where StatementTable1 will contain following columns:
Id | Name | Label
I wish to get all possible combinations and join all of them.
I found this link very useful (query which produce information about dependencies).
I would imagine my code as follows:
Prepare list of tables which I wish to query.
Query link for all my tables and save results into temporary table.
Check maximum number of dependent tables. Prepare query template - for example if maximum number of dependent tables is equal two:
Select
Id, '%Table1Name%' as Table1Name,
'%StatementLabelTable1%' as StatementLabelTable1,
'%Table2Name%' as Table2Name,
'%StatementLabelTable2%' as StatementLabelTable2, Value"
Use cursor - for each dependent table replace appropriate part with dependent table name and label of elements within it.
When all dependent tables have been used - replace all remaining columns with empty string.
add "UNION ALL" and proceed to next table
Run query
Could you tell me if there's any easier or better way?
What you've listed there sounds like you'll need to do if you don't know the column details ahead of time. There's likely going to be some trial-and-error to get the details correct, but it's a good plan to start.
That being said, why on earth would you want to do such a thing? It sounds like you need to narrow down your requirements on what data is actually needed. Otherwise, as you add data to your database, this query and resulting data set is going to quickly become quite unwieldy (these data sets are the kinds you hear about becoming daily "door-stop reports"; no one uses them, but they never remember why it was created, so they keep running the report, and just use it as a door-stop).

The Relational Model & Queries That Naturally Return Duplicate Rows

It's commonly understood that in the relational model:
Every relational operation should yield a relation.
Relations, being sets, cannot contain duplicate rows.
Imagine a 'USERS' relation that contains the following data.
ID FIRST_NAME LAST_NAME
1 Mark Stone
2 Jane Stone
3 Michael Stone
If someone runs a query select LAST_NAME from USERS, a typical database will return:
LAST_NAME
Stone
Stone
Stone
Since this is not a relation - because it contains duplicate rows - what should an ideal RDBMS return?
"But some information is lost - that there are 3 users with that last name."
If the count of users with that name is what you are interested in, then the query of your example is not the question you should be asking.
The query of your example will provide the answer to the question "What are all the last names such that there exists a user that has that last name?".
If the question you want to ask is "How many users are there that are named 'Stone'", then the query you should submit is Select count(...) from users where last_name = 'Stone';
Projection always "loses" information: the information that is tied to the attributes that are projected away. I don't see how a known property of a useful relational operator can be explained as an argument against that operator.
In a RDBMS a relational projection on the last name column alone would return only a set of tuples with distinct values of last name. There would be no duplicate tuples.
In SQL it is true that you would get duplicates unless you specified the DISTINCT keyword. That's because SQL is not a truly relational language - among other things because SQL tables and table expressions are not proper relations. A SQL DBMS is not a RDBMS.
"what should an ideal RDBMS return?"
As David indicated, it should return (in your example) one single row.
An SQL DBMS is only a relational one if it treats every SELECT as if SELECT DISTINCT were requested. (But there are a few tiny additional conditions to be met too.)
The reason this is so is that the "meaning" of that single row is as follows : "There exists some user such that he has a first_name, he has an ID, and his last_name is 'Stone'".
There is never any logical need to repeat that statement a second time. The authoritative reference that you asked for, is Ted Codd himself : "If something is true, then saying it twice won't make it any truer.".
I'm not sure I see a problem with the returned values. There are three records that contain "Stone" for LAST_NAME. This would have been obvious if FIRST_NAME or ID had been included in the query, but it was not. Usually, the DISTINCT keyword is used to handle this and ensure that there will be no duplicates.
In fact, if my database started applying DISTINCT automatically (which it sounds like you think maybe it should), I'd be somewhat annoyed. Seeing duplicate rows when you don't expect to is often the needed break when debugging some weird data problem in a database.
I would argue that your original query did not return duplicate rows. It returned 3 separate rows of data from the database in which you only included the last name column. I would say that your question is not phrased correctly and hence why RDBMS function in the manner they do (which I also argue is the correct manner).
To translate your query:
select LAST_NAME from USERS
into English, it would be:
"tell me the last name of all the users"
If I went into a highschool gym class and asked the teacher "using your class list sheet, tell me the last name of all the students in your class", if there were twin brothers in the class, I would think he would list their last name twice (or he'd at least ask the question to you if he should). He would just go down the list of people in the class and read off their last names.
If you were wanting to ask the question, "what are the different last names of students in the class", he would not list the names duplicated. However that's what the "DISTINCT" key word exists.
So the query would be:
select distinct LAST_NAME from USERS
And if you were actually interested in the number of unique last names in English is "How many different last names are there of the students in the class" or using your example:
select count(distinct LAST_NAME) from USERS
whereas:
select count(LAST_NAME) from USERS
would mean in English:
"How many people in the class have a last name?"

How do you manage "pick lists" in a database

I have an application with multiple "pick list" entities, such as used to populate choices of dropdown selection boxes. These entities need to be stored in the database. How do one persist these entities in the database?
Should I create a new table for each pick list? Is there a better solution?
In the past I've created a table that has the Name of the list and the acceptable values, then queried it to display the list. I also include a underlying value, so you can return a display value for the list, and a bound value that may be much uglier (a small int for normalized data, for instance)
CREATE TABLE PickList(
ListName varchar(15),
Value varchar(15),
Display varchar(15),
Primary Key (ListName, Display)
)
You could also add a sortOrder field if you want to manually define the order to display them in.
It depends on various things:
if they are immutable and non relational (think "names of US States") an argument could be made that they should not be in the database at all: after all they are simply formatting of something simpler (like the two character code assigned). This has the added advantage that you don't need a round trip to the db to fetch something that never changes in order to populate the combo box.
You can then use an Enum in code and a constraint in the DB. In case of localized display, so you need a different formatting for each culture, then you can use XML files or other resources to store the literals.
if they are relational (think "states - capitals") I am not very convinced either way... but lately I've been using XML files, database constraints and javascript to populate. It works quite well and it's easy on the DB.
if they are not read-only but rarely change (i.e. typically cannot be changed by the end user but only by some editor or daily batch), then I would still consider the opportunity of not storing them in the DB... it would depend on the particular case.
in other cases, storing in the DB is the way (think of the tags of StackOverflow... they are "lookup" but can also be changed by the end user) -- possibly with some caching if needed. It requires some careful locking, but it would work well enough.
Well, you could do something like this:
PickListContent
IdList IdPick Text
1 1 Apples
1 2 Oranges
1 3 Pears
2 1 Dogs
2 2 Cats
and optionally..
PickList
Id Description
1 Fruit
2 Pets
I've found that creating individual tables is the best idea.
I've been down the road of trying to create one master table of all pick lists and then filtering out based on type. While it works, it has invariably created headaches down the line. For example you may find that something you presumed to be a simple pick list is not so simple and requires an extra field, do you now split this data into an additional table or extend you master list?
From a database perspective, having individual tables makes it much easier to manage your relational integrity and it makes it easier to interpret the data in the database when you're not using the application
We have followed the pattern of a new table for each pick list. For example:
Table FRUIT has columns ID, NAME, and DESCRIPTION.
Values might include:
15000, Apple, Red fruit
15001, Banana, yellow and yummy
...
If you have a need to reference FRUIT in another table, you would call the column FRUIT_ID and reference the ID value of the row in the FRUIT table.
Create one table for lists and one table for list_options.
# Put in the name of the list
insert into lists (id, name) values (1, "Country in North America");
# Put in the values of the list
insert into list_options (id, list_id, value_text) values
(1, 1, "Canada"),
(2, 1, "United States of America"),
(3, 1, "Mexico");
To answer the second question first: yes, I would create a separate table for each pick list in most cases. Especially if they are for completely different types of values (e.g. states and cities). The general table format I use is as follows:
id - identity or UUID field (I actually call the field xxx_id where xxx is the name of the table).
name - display name of the item
display_order - small int of order to display. Default this value to something greater than 1
If you want you could add a separate 'value' field but I just usually use the id field as the select box value.
I generally use a select that orders first by display order, then by name, so you can order something alphabetically while still adding your own exceptions. For example, let's say you have a list of countries that you want in alpha order but have the US first and Canada second you could say "SELECT id, name FROM theTable ORDER BY display_order, name" and set the display_order value for the US as 1, Canada as 2 and all other countries as 9.
You can get fancier, such as having an 'active' flag so you can activate or deactivate options, or setting a 'x_type' field so you can group options, description column for use in tooltips, etc. But the basic table works well for most circumstances.
Two tables. If you try to cram everything into one table then you break normalization (if you care about that). Here are examples:
LIST
---------------
LIST_ID (PK)
NAME
DESCR
LIST_OPTION
----------------------------
LIST_OPTION_ID (PK)
LIST_ID (FK)
OPTION_NAME
OPTION_VALUE
MANUAL_SORT
The list table simply describes a pick list. The list_ option table describes each option in a given list. So your queries will always start with knowing which pick list you'd like to populate (either by name or ID) which you join to the list_ option table to pull all the options. The manual_sort column is there just in case you want to enforce a particular order other than by name or value. (BTW, whenever I try to post the words "list" and "option" connected with an underscore, the preview window goes a little wacky. That's why I put a space there.)
The query would look something like:
select
b.option_name,
b.option_value
from
list a,
list_option b
where
a.name="States"
and
a.list_id = b.list_id
order by
b.manual_sort asc
You'll also want to create an index on list.name if you think you'll ever use it in a where clause. The pk and fk columns will typically automatically be indexed.
And please don't create a new table for each pick list unless you're putting in "relationally relevant" data that will be used elsewhere by the app. You'd be circumventing exactly the relational functionality that a database provides. You'd be better off statically defining pick lists as constants somewhere in a base class or a properties file (your choice on how to model the name-value pair).
Depending on your needs, you can just have an options table that has a list identifier and a list value as the primary key.
select optionDesc from Options where 'MyList' = optionList
You can then extend it with an order column, etc. If you have an ID field, that is how you can reference your answers back... of if it is often changing, you can just copy the answer value to the answer table.
If you don't mind using strings for the actual values, you can simply give each list a different list_id in value and populate a single table with :
item_id: int
list_id: int
text: varchar(50)
Seems easiest unless you need multiple things per list item
We actually created entities to handle simple pick lists. We created a Lookup table, that holds all the available pick lists, and a LookupValue table that contains all the name/value records for the Lookup.
Works great for us when we need it to be simple.
I've done this in two different ways:
1) unique tables per list
2) a master table for the list, with views to give specific ones
I tend to prefer the initial option as it makes updating lists easier (at least in my opinion).
Try turning the question around. Why do you need to pull it from the database? Isn't the data part of your model but you really want to persist it in the database? You could use an OR mapper like linq2sql or nhibernate (assuming you're in the .net world) or depending on the data you could store it manually in a table each - there are situations where it would make good sense to put it all in the same table but do consider this only if you feel it makes really good sense. Normally putting different data in different tables makes it a lot easier to (later) understand what is going on.
There are several approaches here.
1) Create one table per pick list. Each of the tables would have the ID and Name columns; the value that was picked by the user would be stored based on the ID of the item that was selected.
2) Create a single table with all pick lists. Columns: ID; list ID (or list type); Name. When you need to populate a list, do a query "select all items where list ID = ...". Advantage of this approach: really easy to add pick lists; disadvantage: a little more difficult to write group-by style queries (for example, give me the number of records that picked value X".
I personally prefer option 1, it seems "cleaner" to me.
You can use either a separate table for each (my preferred), or a common picklist table that has a type column you can use to filter on from your application. I'm not sure that one has a great benefit over the other generally speaking.
If you have more than 25 or so, organizationally it might be easier to use the single table solution so you don't have several picklist tables cluttering up your database.
Performance might be a hair better using separate tables for each if your lists are very long, but this is probably negligible provided your indexes and such are set up properly.
I like using separate tables so that if something changes in a picklist - it needs and additional attribute for instance - you can change just that picklist table with little effect on the rest of your schema. In the single table solution, you will either have to denormalize your picklist data, pull that picklist out into a separate table, etc. Constraints are also easier to enforce in the separate table solution.
This has served us well:
SQL> desc aux_values;
Name Type
----------------------------------------- ------------
VARIABLE_ID VARCHAR2(20)
VALUE_SEQ NUMBER
DESCRIPTION VARCHAR2(80)
INTEGER_VALUE NUMBER
CHAR_VALUE VARCHAR2(40)
FLOAT_VALUE FLOAT(126)
ACTIVE_FLAG VARCHAR2(1)
The "Variable ID" indicates the kind of data, like "Customer Status" or "Defect Code" or whatever you need. Then you have several entries, each one with the appropriate data type column filled in. So for a status, you'd have several entries with the "CHAR_VALUE" filled in.

Resources