Database Tables via Normalization vs Experience/Opinion

I am attempting to explain how the mock artist/soundtrack data (first image below) can be normalized from 1NF to 2NF to 3NF, step by step, to get the result that I think is best for the database. It's almost as if normalization is getting in the way of what I want to do, but am I just not understanding something in the normalization process? I can definitely see how this mock data can be normalized to 1NF by making each row unique and removing duplicates, but at which stage, for example, are we told to assign Composer ID as a foreign key for the tracks table or the movie table? Is that just something we do from experience? Is there no right or wrong?
In short, my question is: can anyone show or explain how the mock data here ... was turned into this using the first 3 stages of normalization?

Well, your 1NF would be to have a distinct record for each track name, so essentially the mock data with the first record split into two...
2NF is to take out the repeated keys, which to my mind is what you've displayed as the 3 separate tables, and potentially that might be as far as you need to go.
You could add a further table to allow for a track to feature in more than one movie, i.e. create a movie tracks table referencing the track id and movie id respectively (and removing movie id from the tracks table).
Similarly, you could go to the extreme of allowing for collaborative compositions by also having a track composers table, but that is probably not sufficiently common to make the effort worthwhile.
Normalization is something that definitely becomes easier with experience and can be taken as far as suits the purpose of the data as can be seen from the example.
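As a rough illustration only, the result described above could be expressed in SQL along these lines; the table and column names are assumptions, since the original images aren't reproduced here:

    CREATE TABLE composer (
        composer_id   INT PRIMARY KEY,
        composer_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE movie (
        movie_id   INT PRIMARY KEY,
        movie_name VARCHAR(200) NOT NULL
    );

    CREATE TABLE track (
        track_id    INT PRIMARY KEY,
        track_name  VARCHAR(200) NOT NULL,
        composer_id INT NOT NULL REFERENCES composer (composer_id),
        movie_id    INT NOT NULL REFERENCES movie (movie_id)  -- dropped if the junction table below is used
    );

    -- Optional further step: a junction table so a track can feature in more than one movie.
    CREATE TABLE movie_track (
        movie_id INT NOT NULL REFERENCES movie (movie_id),
        track_id INT NOT NULL REFERENCES track (track_id),
        PRIMARY KEY (movie_id, track_id)
    );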

Related

Designing two DB tables with same id? Good practices?

I have to take an online course on DB design once again, since I had a really lazy teacher who I thought had taught us everything, and I keep finding out he didn't.
I'm designing a small DB in which two particular tables brought up this question.
I have a table called "Athlete" which stores athlete info, and a second table called "EntryInfo" which stores an athlete's objectives and whether he was referred by another athlete.
There is no way an athlete could have more than one of these entry infos, so I thought idAthlete would apply to both "Athlete" and "EntryInfo", but I don't know if this is correct or not. Now I have these questions:
1) In trying to keep the "Athlete" table as clean as possible, I didn't include this "EntryInfo" in the "Athlete" table from the beginning, but it COULD be in the same table. Is this the best way to handle it? Regarding good practices in DB design, should they be in 1 or 2 tables?
2) If it's better to keep them in two separate tables, can I have idAthlete as the PK in the Athlete table (identity, incremental) and also have it as the PK in EntryInfo, where it is only an FK? Or would it be better practice to have an identity, incremental PK idEntryInfo on the EntryInfo table, with an FK idAthlete?
I know this is such a basic question and I know I should take a course on DB design and normalisation (and I will do).
When you have two tables with the same key it's called vertical partitioning and it's a valid design for various reasons.
However, I don't see any such reason in your explanation. I only see your statement about keeping the "Athlete" table as clean as possible, which has a pretty general meaning. If you're going to put different groups of fields into different tables, you could categorise them any number of ways.
If you had a zillion records and you had performance issues it might be worth considering.
It will be simpler for you if you keep it in one table; then you don't have to fiddle about synchronising keys between the tables.
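If you do keep two tables, the shared-key variant the question describes could look something like this sketch; everything except idAthlete is an assumed name, and the auto-increment syntax varies by RDBMS:

    CREATE TABLE Athlete (
        idAthlete INT IDENTITY(1,1) PRIMARY KEY,  -- IDENTITY is SQL Server syntax; other RDBMSes use SERIAL/AUTO_INCREMENT
        name      VARCHAR(100) NOT NULL
    );

    -- idAthlete is both the PK of EntryInfo and an FK back to Athlete,
    -- which enforces the "at most one EntryInfo per athlete" rule.
    CREATE TABLE EntryInfo (
        idAthlete  INT PRIMARY KEY REFERENCES Athlete (idAthlete),
        objectives VARCHAR(500) NULL,
        referredBy INT NULL REFERENCES Athlete (idAthlete)
    );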

Is it possible to normalize a one-to-many table schema further than entity level?

Note: this section is a little long-winded, so feel free to just read the summary below.
For example, take the schema below. Could the 2 "Statistic" tables at the bottom be normalized further?
My initial feeling is that it could be normalized further (having a Statistics table for each and every sport could result in a lot of tables), but if I were to do that, what mechanism would enforce the requirements for each respective sport's statistics?
For example, a Baseball statistic MUST include values for TotalHits and TotalHomeRuns. If I keep the BaseballStatistic table the way it is, then I can set these columns to Not Null.
Thoughts? Maybe there is even a better way than this current schema.
My basic problem is that I need to associate different types (BaseballStatistic and FootballStatistic) to the same parent type (Game). Is there any way to do this other than what I have done here?
If it matters, I am using SQL Server.
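Since the diagram isn't reproduced here, a rough reconstruction of the kind of layout being described (column names beyond those mentioned in the question are assumptions):

    CREATE TABLE Game (
        GameId   INT PRIMARY KEY,
        PlayedOn DATE NOT NULL
    );

    -- One statistic table per sport, each referencing the shared Game table;
    -- keeping them separate lets each sport enforce its own NOT NULL columns.
    CREATE TABLE BaseballStatistic (
        BaseballStatisticId INT PRIMARY KEY,
        GameId              INT NOT NULL REFERENCES Game (GameId),
        TotalHits           INT NOT NULL,
        TotalHomeRuns       INT NOT NULL
    );

    CREATE TABLE FootballStatistic (
        FootballStatisticId INT PRIMARY KEY,
        GameId              INT NOT NULL REFERENCES Game (GameId),
        TotalYards          INT NOT NULL  -- assumed example column
    );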

Parent child design to easily identify child type

In our database design we have a couple of tables that describe different objects but which are of the same basic type. As describing the actual tables and what each column does would take a long time, I'm going to try to simplify it by using a similarly structured example based on a job database.
So say we have the following tables:
These tables have no connections between each other but share identical columns. So the first step was to unify the identical columns and introduce a unique personId:
Now we have the "header" columns in Person, which are then linked to the more specific job tables in a 1-to-1 relation, using the personId PK as the FK. In our use case a person can only ever have one job, so the personId is also unique across the Taxi driver, Programmer and Construction worker tables.
While this structure works, we now have a use case where our application gets the personId and wants to fetch the data from the respective job table. This brings us to the problem that we can't immediately tell what kind of job the person with this personId is doing.
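For reference, a rough sketch of the kind of structure being described; the table and column names are assumptions:

    CREATE TABLE Person (
        personId INT PRIMARY KEY,
        name     VARCHAR(100) NOT NULL
    );

    -- Each child table shares the parent's key, so the relation is 1-to-1.
    CREATE TABLE TaxiDriver (
        personId      INT PRIMARY KEY REFERENCES Person (personId),
        licenceNumber VARCHAR(20)
    );

    CREATE TABLE Programmer (
        personId INT PRIMARY KEY REFERENCES Person (personId),
        language VARCHAR(50)
    );

    CREATE TABLE ConstructionWorker (
        personId      INT PRIMARY KEY REFERENCES Person (personId),
        certification VARCHAR(50)
    );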
A few options we came up with to solve this issue:
Deal with it in the backend
This means leaving the architecture as it is and looking for the right table in the backend code. This could mean looking through every table present and/or constructing a semi-complicated join select in which we have to sift through all columns to find the ones that are filled.
All in all: possible, but it means a lot of unnecessary selects. We would also like to keep such database-oriented logic in the actual database.
Using a Type Field
This means adding a type column to the Person table, filled for example with numbers, to determine the correct child table, like:
So you could put a 0 in Type if it's a taxi driver, a 1 if it's a programmer and so on...
While this greatly reduces the amount of backend logic, we then have to make sure that the numbers we use in the Type field are known in the backend and never change.
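A sketch of what that could look like, repeating the Person table from the earlier sketch with the extra column (the codes are only illustrative):

    CREATE TABLE Person (
        personId INT PRIMARY KEY,
        name     VARCHAR(100) NOT NULL,
        type     INT NOT NULL  -- 0 = taxi driver, 1 = programmer, 2 = construction worker
    );

A small lookup table for the type codes (similar to the Type table mentioned further down) would keep the meaning of the numbers inside the database rather than only in backend code.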
Use separate IDs for each table
That means every job gets its own ID column (which has to be nullable) in Person, like:
Now it's easy to find out which job each person has, because for every person the other job IDs are empty.
So my question is: which one of these designs is the best practice? Am I missing an obvious solution here?
Bill Karwin gave a good explanation of a problem similar to this one: https://stackoverflow.com/a/695860/7451039
We've now decided to go with the second option because it seems to come with the fewest drawbacks, as described by the other commenters and posters. As there was no actual answer presenting the second option as a solution, I will try to summarize our reasoning:
Against Option 1:
There is no way to distinguish the type by looking at the parent table. As a result, the backend would have to contain all the logic, including scanning all the tables to find the one that contains the id. While you can compress most of that logic into a single big join select, it would still be a lot more logic compared to the other options.
Against Option 3:
As #yuri-g said, this one is technically not possible, as the separate IDs could not be set up as primary keys. They would have to be nullable, and nullable columns can't be part of a primary key, essentially rendering the parent table useless, as one of the reasons for it was to have a unique personId across the tables.
Against a single table containing all columns:
For smaller use cases such as the one I described in the question this might be viable, but we are talking about a bunch of tables, each with roughly 2-6 columns. This option would turn into a column mess really quickly.
Against a flat design with a key-value table:
Our properties have completely different data types, different constraints and foreign key relations. All of this would be difficult or impossible in this design.
Against custom database objects containing the child-specific properties:
While this option, which #Matthew McPeak suggested, might be viable for a lot of people, our database design has never really used such objects, so introducing them into the mix would likely cause more confusion than it would help us.
In favor of the second option:
This option is easy to use in our table-oriented database structure, makes it easy to identify the proper child table, and does not need a lot of rework to introduce, especially since we already have something similar to a Type table that we can easily use for this purpose.
The third option, as you describe it, is impossible: no RDBMS (at least, none that I personally know of) would allow you to use NULLs in a PK (even a composite one).
The second is realistic.
And yes, the first would take up to N queries to poll the related tables in order to determine the actual type (where N is the number of types).
Although you won't escape with one query in the second case either: there would always be two of them, because you can't JOIN unless you know what exactly you should be joining.
So basically there are flaws in your design, and you should consider other options.
For example, denormalization: inline the non-shared attributes into the parent table anyway; those fields then become NULL for non-corresponding types.
Or a flexible, flat list of attribute-value pairs related through the primary key (yes, giving up schema enforcement is the trade-off).
Or switch to a column-oriented DB: this is a case for it.
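To make the "always two queries" point concrete, here is a hedged sketch of how a lookup would go with the type-column design, using the assumed names from the sketches above:

    -- 1) Find out which child table to look in:
    SELECT type FROM Person WHERE personId = 42;

    -- 2) Then query the matching child table, e.g. if type = 1 (programmer):
    SELECT p.name, pr.language
    FROM Person p
    JOIN Programmer pr ON pr.personId = p.personId
    WHERE p.personId = 42;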

Reverse index implementation with spring-data and postgres

I have a table, say orders, which has a column, say an alphanumeric 15-character-long itemId, and a bunch of other columns. The same itemId can be repeated up to, say, 900 times for very popular items, which means the data will be repeated about 900 times. Obviously, we need to separate it out. However, we need the lookup for the list of items to be very quick and efficient. I read up a bit and thought reverse indexing would be a good way to achieve this. However, I am a bit confused about the actual implementation. I couldn't find any examples online either, other than http://blog.frankel.ch/tag/spring-data , but it uses solr. I was thinking of creating an items-orders table, adding a repository class which will have a method to . However, since there is a many-to-many relation between items and orders, it will require a join table. This makes me think that I am on the wrong path, as I intended the items-orders table itself as a kind of join table, since it only has itemId and orderId in it.
I am pretty sure I am doing something wrong. Any pointers are greatly appreciated. Sorry for a basic question, but I could not find much information with samples online.
thanks,
Alice
You're on the right track with an item-orders link table. You will probably find that you end up using the table for additional columns you haven't considered yet (quantity, price, etc.).
The main thing to do to start with is make sure your database design is right; look up the basic normalization rules about making sure you don't duplicate information. Also, when you create your tables, make sure you're explicitly telling the database about the relationships between the tables using FOREIGN KEY and PRIMARY KEY constraints.
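A rough sketch of that layout; the names and the extra quantity column are assumptions, not anything taken from the question:

    CREATE TABLE item (
        item_id VARCHAR(15) PRIMARY KEY,   -- the 15-character alphanumeric itemId
        name    VARCHAR(200) NOT NULL
    );

    CREATE TABLE orders (
        order_id   BIGINT PRIMARY KEY,
        ordered_at TIMESTAMP NOT NULL
    );

    -- The link table: one row per (order, item) pair, with both sides declared as FKs.
    CREATE TABLE order_item (
        order_id BIGINT      NOT NULL REFERENCES orders (order_id),
        item_id  VARCHAR(15) NOT NULL REFERENCES item (item_id),
        quantity INT         NOT NULL DEFAULT 1,  -- example of an extra column you may want later
        PRIMARY KEY (order_id, item_id)
    );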
Once you have the correct logical structure in place you can see if you have any performance issues that require you to do anything clever.
Relational databases were designed to do exactly what you're contemplating, though, so the performance will probably be much better than you fear. Premature optimization would be a huge mistake.
You mentioned solr; this is a generic text search engine (sort of like Google). For your requirements you want to stick to a pure relational database. You want a database that delivers exact results based on exact criteria: exactly which products have been included in an order, and so on. You don't want any fuzzy matching or artificial intelligence guessing about what has been ordered.
You might also store the product catalogue in solr, so the user could look for products that mention pink, blue or purple in the description and come in a size 4, etc.; then, once the product has been chosen, use the itemId in the relational database.

What exactly does database normalization do?

Supposedly normalization reduces redundancy of data and increases performance. What is the reason for dividing the master table into several smaller tables, applying relationships between them, and retrieving the data using all possible unions, subqueries, joins, etc.? Why can't we have all the data in a single table and retrieve it as required?
The main reason is to eliminate repetition of data. For example, if you had a user with multiple addresses and you stored this information in a single table, the user information would be duplicated along with each address entry. Normalisation would separate the addresses into their own table and then link the two using keys. This way you wouldn't need to duplicate the user data, and your db structure becomes a little cleaner.
Full normalisation will generally not improve performance; in fact, it can often make it worse, but it will keep your data duplicate-free. In some special cases I've even denormalised specific data in order to get a performance increase.
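For instance, the user/address split could look roughly like this (column names are only illustrative):

    CREATE TABLE users (
        user_id INT PRIMARY KEY,
        name    VARCHAR(100) NOT NULL,
        email   VARCHAR(200) NOT NULL
    );

    -- One row per address; many addresses can point at the same user without repeating their details.
    CREATE TABLE addresses (
        address_id INT PRIMARY KEY,
        user_id    INT NOT NULL REFERENCES users (user_id),
        street     VARCHAR(200) NOT NULL,
        city       VARCHAR(100) NOT NULL
    );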
Normalization comes from the mathematical concept of being "normal." Another word would be "perpendicular." Imagine a regular two-axis coordinate system. Moving up just changes the y coordinate, moving to the side just changes the x coordinate. So every movement can be broken down into a sideways and an up-down movement. These two are independent of each other.
Normalization in a database essentially means the same thing: if you change a piece of data, this is supposed to change just one single piece of information in the database. Imagine a database of e-mails: if you store both the ID and the name of the recipient in the Mails table, but the Users table also associates the name with the ID, then changing a user name means changing it not only in the Users table but also in every single message that this user is involved with. So the axis "message" and the axis "user" are not "perpendicular" or "normal."
If on the other hand, the Mails table only has the user ID, any change to the user name will automatically apply to all the messages, because on retrieval of a message, all user information is gathered from the Users table (by means of a join).
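As a small sketch of that second arrangement, assuming a users(user_id, name) table already exists (all names here are illustrative):

    CREATE TABLE mails (
        mail_id      INT PRIMARY KEY,
        recipient_id INT NOT NULL REFERENCES users (user_id),  -- store only the ID, never the name
        subject      VARCHAR(200)
    );

    -- Renaming a user touches exactly one row in users; every message sees the change via the join.
    SELECT m.subject, u.name
    FROM mails m
    JOIN users u ON u.user_id = m.recipient_id;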
Database normalisation is, at its simplest, a way to minimise data redundancy. To achieve that, certain forms of normalisation exist.
First normal form can be summarised as:
no repeating groups in single tables.
separate tables for related information.
all items in a table related to the primary key.
Second normal form adds another restriction, basically that every column not part of a candidate key must be dependent on every candidate key (a candidate key being defined as a minimal set of columns which cannot be duplicated in the table).
And third normal form goes a little further, in that every column not part of a candidate key must not be dependent on any other non-candidate-key column. In other words, it can depend only on the candidate keys. This leads to the saying that 3NF depends on "the key, the whole key and nothing but the key, so help me Codd" [1].
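As a tiny textbook-style illustration of the 3NF rule (not taken from the question): a department name that depends on a department ID rather than on the employee key should be moved out.

    -- Violates 3NF: dept_name depends on dept_id, a non-key column, not on emp_id.
    -- CREATE TABLE employee (emp_id INT PRIMARY KEY, dept_id INT, dept_name VARCHAR(100));

    -- 3NF decomposition: every non-key column depends on the key alone.
    CREATE TABLE department (
        dept_id   INT PRIMARY KEY,
        dept_name VARCHAR(100) NOT NULL
    );

    CREATE TABLE employee (
        emp_id  INT PRIMARY KEY,
        dept_id INT NOT NULL REFERENCES department (dept_id)
    );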
Note that the above explanations are tailored toward your question rather than database theorists, so the descriptions are necessarily simplified (and I've used phrases like "summarised as" and "basically").
The field of database theory is a complex one and, if you truly wish to understand it, you'll eventually have to get to the science behind it. But, in terms of your question, hopefully this will be adequate.
Normalization is a valuable tool in ensuring we don't have redundant data (which becomes a real problem if the two redundant areas get out of sync). It doesn't generally increase performance.
In fact, although all databases should start in 3NF, it's sometimes acceptable to drop to 2NF for performance gains, provided you're aware of, and mitigate, the potential problems.
And be aware that there are also "higher" levels of normalisation such as (obviously) fourth, fifth and sixth, but also Boyce-Codd and some others I can't remember off the top of my head. In the vast majority of cases, 3NF should be more than enough.
[1] If you don't know who Edgar Codd (or Christopher Date, for that matter) is, you should probably research them; they're the fathers of relational database theory.
We use normalization to reduce the chances of anomalies that may arise as a result of data insertion, deletion and updates. Normalization doesn't necessarily increase performance.
There is much material on the internet, so I won't repeat the stuff here again. But you can have a look at:
Normalization rules
Anomalies
(others as well)
As well as all the above, it just makes a certain sense. Say you have a user and you want to record what kind of car they have.
Put that all in one table and then you're fine, until someone owns two cars... You're then going to need two rows for that person, and a way of making sure that you can link those two rows together...
And then what if you also want to record how many dogs they have? Same table with lots of confusing dups? Another table with your own custom logic to manage unique users?
Normalization keeps you away from a lot of these problems...
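A hedged sketch of where that usually ends up; the tables and columns are made up for illustration:

    CREATE TABLE users (
        user_id INT PRIMARY KEY,
        name    VARCHAR(100) NOT NULL
    );

    -- One row per car a user owns: two cars simply means two rows, with no duplicated user data.
    CREATE TABLE user_cars (
        car_id  INT PRIMARY KEY,
        user_id INT NOT NULL REFERENCES users (user_id),
        model   VARCHAR(100) NOT NULL
    );

    -- Dogs get their own table in the same way, instead of extra columns or confusing duplicate rows.
    CREATE TABLE user_dogs (
        dog_id   INT PRIMARY KEY,
        user_id  INT NOT NULL REFERENCES users (user_id),
        dog_name VARCHAR(100) NOT NULL
    );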
