Elegant normalization without adding fields, extra table. Best relationship - database

I have 2 tables I am trying to normalize. The problem is I don't want to create an offhand table with new fields, though a link table perhaps works. What is the most elegant way to convey that the "Nintendo" entry is BOTH a publisher and a developer? I don't want "Nintendo" to be duplicated. I am thinking a many-to-many relationship can be key here.
I want to stress that I absolutely want the developer and a publisher tables to remain. I don't mind creating a link between the 2 with a new relationship.
Here are the 2 tables I am trying to normalize:
Below is a solution I tried (I don't like it):

There is nothing wrong with your two tables.
In fact all you need is
developer(name) -- company [name] is a developer
publisher(name) -- company [name] is a publisher
Your changes have nothing to do with normalization. Normalization never creates new column names. 'I don't want "Nintendo" to be duplicated' is misconceived. There is nothing wrong per se with values appearing in multiple places. See the answers by sqlvogel & myself here.
BUT: Depending on what it means for a row to be in one of your tables there might be a better design to reduce errors because the two tables' values could be "constrained" ie depend on each other. That has something to do with "redundancy" but it is about constraints and does not involve normalization. And for us to address it you have to tell us exactly when a row goes into each table based on the world situation.
If you don't want to repeat the strings for implementation(-dependent) reasons (space taken or speed of operations at the expense of more joins) then add a table of name ids and strings (actually company ids and names) and replace your old name columns and values by company id columns and values. But that's not normalization, that's complicating your schema for the sake of implementation-dependent data optimization tradeoffs. (And you should demonstrate this is needed and works.)
The currently accepted answer (tables Game_Company, Company_Role & Game_Company_Role) just adds a lot of redundant data. Just like your question adds three redundant tables. The original two tables already say what companies are developers and which are publishers. The other tables are just views/queries on the two!
If you want a new table for "[id] identifies a company named [name] with ..." then this is a case of developers and publisher as subtypes of supertype company. Search on database subtypes. See this answer. Then you would use company id instead of name to identify companies. You could also then further simplify (!) by using company id as the only column in tables developer and publisher and also everywhere else instead of developer_id and publisher_id.
"Redundancy" is not about values appearing in multiple places. It is about multiple rows stating the same thing about the application. When using a design like that there are two basic problems: to say certain things multiple rows are involved (while the normalized version involves just one row); and there is no way to say just one of the things at a time (which normalization can help with). If you make two different independent statements about Nintendo then you need two tables and Nintendo mentioned in each one. Re rows making statements about the application see this. (And search my other answers re a table's "statement" or criterion".) Normalization helps because it replaces tables whose rows state things of the form "... AND ..." by other tables that state the "..." separately. See this and this. (Normalization is commonly erroneously thought to involve or include avoiding multiple similar columns, avoiding columns whose values have repetitive structure and/or replacing strings by ids, but although these can be good design ideas they're not normalization.)
In comments, chat and another answer you gave this starting point:
Here's the simplest design. (I'll assume game titles are not unique so you need game_ids.)
-- game [game_id] with title [title] released on [release_date] is rated [rating]
game(game_id,title,release_date,rating)
game_developer(game_id,name) -- game [game_id] is developed by company [name]
game_publisher(game_id,name) -- game [game_id] is published by company [name]
game_platform(game_id,name) -- game [game_id] is on platform [name]
Only if you want a separate list of companies so that a company can exist without developing or publishing and/or can have its own data do you need to add:
company(name,...) -- [name] identifies a company
Only if you want role-specific data for developers and publishers do you need to add:
developer(name,...) -- developer [name] has ...
publisher(name,...) -- publisher [name] has ...
The relevant foreign keys of the various options are straightforward.
None of your versions need _ids. Your versions 2 & 3 won't work because they don't say what companies develop a game or what companies publish a game. You don't need roles but if you have them (Version 2) then you need a table "game [game_id] has company [name] as [role]". Otherwise (Version 3) you need tables for "game [game_id] is developed by company [name]" and "game [game_id] is published by company [name]". Wherever you differ from my designs ask yourself why you have additional structure and why you can do without it and (possibly) why you would explicitly want it anyway.
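As a rough sketch of the simplest design above, here it is in SQLite, driven from Python's sqlite3 module. The choice of SQLite, the column types, the primary keys and the sample rows ('Game A', 'Game B') are assumptions for illustration only, not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- game [game_id] with title [title] released on [release_date] is rated [rating]
CREATE TABLE game (
    game_id      INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    release_date TEXT,
    rating       TEXT
);
-- game [game_id] is developed by company [name]
CREATE TABLE game_developer (
    game_id INTEGER NOT NULL REFERENCES game(game_id),
    name    TEXT NOT NULL,
    PRIMARY KEY (game_id, name)
);
-- game [game_id] is published by company [name]
CREATE TABLE game_publisher (
    game_id INTEGER NOT NULL REFERENCES game(game_id),
    name    TEXT NOT NULL,
    PRIMARY KEY (game_id, name)
);
""")

# Hypothetical sample rows: placeholder titles, no real release data.
conn.execute("INSERT INTO game VALUES (1, 'Game A', NULL, NULL)")
conn.execute("INSERT INTO game VALUES (2, 'Game B', NULL, NULL)")
conn.executemany("INSERT INTO game_developer VALUES (?, ?)",
                 [(1, "Nintendo"), (2, "Retro Studios")])
conn.executemany("INSERT INTO game_publisher VALUES (?, ?)",
                 [(1, "Nintendo"), (2, "Nintendo")])

# "Which companies are both developers and publishers?" is a query,
# not something that needs its own table.
both = [row[0] for row in conn.execute(
    "SELECT name FROM game_developer INTERSECT SELECT name FROM game_publisher")]
print(both)
```

"Nintendo" appears in both tables because two separate statements are being made about it. Neither row can be derived from the other, so neither is redundant.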

I think you want something like this:
Game_Company
ID  Name
1   Retro Studios
2   HAL Laboratories
3   Nintendo
...
Company_Role
ID  Name
1   Developer
2   Publisher
...
Game_Company_Role
CompanyID  RoleID
1          1
2          1
3          1
3          2
...
To get a list of all companies that have role 'Developer':
SELECT gc.name
FROM Game_Company gc JOIN Game_Company_Role gcr ON gcr.CompanyID=gc.ID
WHERE gcr.RoleID = 1
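For anyone who wants to try this, here is a small runnable sketch of the same three tables using SQLite through Python's sqlite3 module. The column types, the constraints and the helper name companies_with_role are my assumptions. The query also joins to Company_Role so the role can be looked up by name instead of hard-coding RoleID = 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Game_Company (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Company_Role (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE Game_Company_Role (
    CompanyID INTEGER NOT NULL REFERENCES Game_Company(ID),
    RoleID    INTEGER NOT NULL REFERENCES Company_Role(ID),
    PRIMARY KEY (CompanyID, RoleID)
);
INSERT INTO Game_Company VALUES (1, 'Retro Studios'), (2, 'HAL Laboratories'), (3, 'Nintendo');
INSERT INTO Company_Role VALUES (1, 'Developer'), (2, 'Publisher');
INSERT INTO Game_Company_Role VALUES (1, 1), (2, 1), (3, 1), (3, 2);
""")

def companies_with_role(role_name):
    """Return the names of all companies holding the given role (hypothetical helper)."""
    rows = conn.execute("""
        SELECT gc.Name
        FROM Game_Company gc
        JOIN Game_Company_Role gcr ON gcr.CompanyID = gc.ID
        JOIN Company_Role cr ON cr.ID = gcr.RoleID
        WHERE cr.Name = ?
        ORDER BY gc.ID
    """, (role_name,))
    return [r[0] for r in rows]

print(companies_with_role("Developer"))
print(companies_with_role("Publisher"))
```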

This is a somewhat generic approach to the problem, but it may be of interest. As @Dour High Arch pointed out in his solution, Developer and Publisher are just roles for a 'party'. Each party has 0, 1 or more roles with a given product, and roles may overlap. This is good and bad. For example, a product may be developed by 5 developers but published by at most 1 publisher.
I have chosen to introduce a serial_id as a system-generated PK, but this is not mandatory. You could use the 3 FKs as the PK and not use the serial_id.
Notice that having a party as a generalization of different entity types is not always good, since one or more columns will have to be made optional (nullable) if they are not common to all parties; however, this is very common in real applications.
Convention:
name_PK = Primary Key,
name_FK = Foreign Key

Here are three final solutions as proposed by the comments. You can see the table being broken down from the top "un-normalized" table.
The rules are as follows:
1 game can have 1 or many developers and 1 developer can have 1 or many games.
1 game can have 1 or many publishers and 1 publisher can have 1 or many games.
1 game can have 1 or many platforms and 1 platform can have 1 or many games.
Version 1
I left the 2 "Nintendo" entries in red. According to research and implementation, this is not technically redundant data. See my comments under philipxy's answer. This looks simple and elegant. 4 tables with a many-to-many relationship.
Here is the relationship diagram (4 tables and 3 link tables):
Version 2
Version 1 "repeats" "Nintendo" but Version 2 has a "Company" table instead. Compare the 2 different versions. What is the right way?
Version 3
Here is the subtyping philipxy was talking about. How is this version?


One-To-Many relationship task

I have a table Subject
It has many fields, two of them are code and flag.
Previously, those two fields together served as an idempotency key for rows in this table.
But now I need one more field: system.
There are tens of rows in Subject
And 4-7 systems.
What is a better way?
1. Create table System for systems and create a cross-table mapping systems to subjects (code and flag are still in table Subject)
2. Create one mapping table without creating table System
3. Just add another column in table Subject
4. Create table System and add a foreign key to table System in table Subject?
So, It's all about database normalization.
And the third option is pretty bad.
As for me the better way is fourth option.
But I can't explain to myself why this option is better than 1 and 2.
So, I read rules of database normalization. And as for me, the first option satisfies all rules too.
This is the reason why I am asking this question.
It is not typically a great idea to design a SQL schema to only 1st or 2nd normal form; many databases use something at or near 3rd normal form. However, even in 3rd normal form there may still be some relations where dependencies, and therefore redundancy, remain. This can be addressed by Boyce–Codd Normal Form (Codd, 1974).
It is also not typical to see 4th normal form and less so 5th normal form and beyond due to challenges with data maintenance in a "living" database.
Let's put this another way.
If you find yourself allowing NULLs on many columns, consider a table to hold those values in an organized fashion. For example, use an Address table for addresses, with a PersonAddress linking table between Person and Address. The PersonAddress linking table might even have an AddressTypeId column that links to an AddressType table with rows such as Postal, Street or Business. For another example, consider email addresses, where people have personal, family, business and other email address types, and even multiple addresses of the same type for different uses: a doctor with a business practice email and 2-3 hospital email addresses where the doctor practices.
Linking tables for those types of scenarios are likely better than 3-4 or more email or postal address columns in one table where many after the first are nullable or perhaps redundant.
Review your data; consider whether your Subject, for example, may link to multiple Systems, or whether placing a SubjectId column in the other table may lead to duplicates of that ID across differing system rows. If it is always and forever a 1-1 relationship it may be OK, but for a 1-n or n-n relationship it may not be OK to have the id in the other table, and a linking table may provide a good mechanism to link them.
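To make the linking-table option concrete, here is a minimal sketch in SQLite via Python's sqlite3 module. Everything beyond the names Subject, System, code and flag is assumed for illustration: the SubjectId/SystemId columns, the SubjectSystem table name and the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Subject (
    SubjectId INTEGER PRIMARY KEY,
    code      TEXT    NOT NULL,
    flag      INTEGER NOT NULL,
    UNIQUE (code, flag)            -- code and flag still identify a subject
);
CREATE TABLE System (
    SystemId INTEGER PRIMARY KEY,
    name     TEXT NOT NULL UNIQUE
);
-- Linking table: one row per (subject, system) pair, so n-n is possible
CREATE TABLE SubjectSystem (
    SubjectId INTEGER NOT NULL REFERENCES Subject(SubjectId),
    SystemId  INTEGER NOT NULL REFERENCES System(SystemId),
    PRIMARY KEY (SubjectId, SystemId)
);
-- Hypothetical sample data
INSERT INTO Subject VALUES (1, 'A', 0), (2, 'B', 1);
INSERT INTO System VALUES (1, 'sys1'), (2, 'sys2');
INSERT INTO SubjectSystem VALUES (1, 1), (1, 2), (2, 1);
""")

# How many systems is each subject linked to?
systems_per_subject = conn.execute("""
    SELECT s.code, COUNT(*)
    FROM Subject s
    JOIN SubjectSystem ss ON ss.SubjectId = s.SubjectId
    GROUP BY s.SubjectId
    ORDER BY s.SubjectId
""").fetchall()
print(systems_per_subject)
```

If a subject can only ever belong to one system, a plain SystemId foreign key column on Subject would be enough; the linking table only earns its keep once a subject can appear in several systems.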

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide what would be the better option, I asked my Databases teacher; unfortunately he couldn't tell me which of the two options is better than the other and why.
I designed the database for a dummy data generator. Since I want to generate multilingual data I thought of these tables. (But it's a simplification of the tables.)
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language; sometimes a name can have multiple origins (ex: Nick is both a Dutch and an English name).
Each language has multiple names and streets.
These two rules result in a Many-to-Many relationship. At the moment I've got only two tables, but I know I will get between 10 and 20 of these kinds of tables.
The regular way one would do this is just make 10 to 20 Many-to-Many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create enough good, realistic-looking working data for projects. This application will generate this data for you and return the needed SQL so you only have to run the queries.
The user comes to the site to get the data. He states his tablename, his columnnames and then he can link the columnnames to types of data, think of:
* Firstname
* Lastname
* Email address (which will be randomly generated from the name of the person)
* Address details (street, housenumber, zipcode, place, country)
* A lot more
Then, after linking columns with the types the user can set the number of rows he wants to make. The application will then choose a country at random and generate realistic looking data according to the country they live in.
That's actually an excellent question. This sort of thing leads to a genuine problem in database design and there is a real tradeoff. I don't know what rdbms you are using but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with one column per potential table and check constraints ensuring that only one fkey besides language is filled in. Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features in your rdbms for a solution. For example in postgresql this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
Unfortunately there is no single definite answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I was just working with one RDBMS, I would do the last one. But if not, I would probably do one table per relationship and focus on tooling to manage the problems that come up. But the former preference is about my level of knowledge and confidence, and the latter is a bit more of a personal opinion.
So I hope this helps you look at the tradeoffs and select what is right for you.
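To illustrate the referential-integrity tradeoff between options 2 and 3, here is a small sketch using SQLite through Python's sqlite3 module. The table names name_languages and item_languages, the item_table/item_id columns and the helper try_insert are made up for this example; only names, streets and languages come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite does not enforce FKs by default
conn.executescript("""
CREATE TABLE languages (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE names     (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE streets   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Option 2: one M-M table per relationship; both columns are real foreign keys
CREATE TABLE name_languages (
    name_id     INTEGER NOT NULL REFERENCES names(id),
    language_id INTEGER NOT NULL REFERENCES languages(id),
    PRIMARY KEY (name_id, language_id)
);

-- Option 3: one polymorphic M-M table; item_id cannot be declared as a
-- foreign key because the table it points at varies per row
CREATE TABLE item_languages (
    item_table  TEXT    NOT NULL CHECK (item_table IN ('names', 'streets')),
    item_id     INTEGER NOT NULL,
    language_id INTEGER NOT NULL REFERENCES languages(id),
    PRIMARY KEY (item_table, item_id, language_id)
);

INSERT INTO languages VALUES (1, 'Dutch'), (2, 'English');
INSERT INTO names VALUES (1, 'Nick');
INSERT INTO name_languages VALUES (1, 1), (1, 2);
""")

def try_insert(sql, params):
    """Return True if the insert is accepted, False if a constraint rejects it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False

# names.id = 999 does not exist in either case.
per_table_ok = try_insert("INSERT INTO name_languages VALUES (?, ?)", (999, 1))
polymorphic_ok = try_insert("INSERT INTO item_languages VALUES (?, ?, ?)", ("names", 999, 1))
print(per_table_ok, polymorphic_ok)
```

The per-relationship table rejects the dangling row on its own, while the polymorphic table accepts it, which is the gap that the triggers mentioned in option 3 would have to close.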

Linking an address table to multiple other tables

I have been asked to add a new address book table to our database (SQL Server 2012).
To simplify the related part of the database, there are three tables each linked to each other in a one to many fashion: Company (has many) Products (has many) Projects and the idea is that one or many addresses will be able to exist at any one of these levels. The thinking is that in the front-end system, a user will be able to view and select specific addresses for the project they specify and more generic addresses relating to its parent product and company.
The issue now is how best to model this in the database.
I have thought of two possible ideas so far so wonder if anyone has had a similar type of relationship to model themselves and how they implemented it?
Idea one:
The new address table will additionally contain three fields: companyID, productID and projectID. These fields will be related to the relevant tables and be nullable to represent company and product level addresses. e.g. companyID 2, productID 1, projectID NULL is a product level address.
My issue with this is that I am storing the relationship information in the table so if a project is ever changed to be related to a different product, the data in this table will be incorrect. I could potentially NULL all but the level I am interested in but this will make getting parent addresses a little harder to get
Idea two:
On the address table have a typeID and a genericID. genericID could contain the IDs from the Company, Product and Project tables with the typeID determining which table it came from. I am a little stuck how to set up the necessary constraints to do this though and wonder if this is going to get tricky to deal with in the future
Many thanks,
I suggest using Idea one and avoiding Idea two.
The second idea is known as the Polymorphic Association anti-pattern.
Objective: Reference Multiple Parents
Resulting side effect: Using a dual-purpose foreign key violates first normal form (atomicity) and loses referential integrity.
Solution: Simplify the Relationship
The simplification of the relationship could be obtained in two ways:
Having multiple nullable foreign keys (idea number 1): this is simple and applicable if the tables (product, project, ...) that use the relation are limited in number. (Think about what happens when they grow to more.)
Another, more generic solution is to use inheritance: define a new entity as the base table for (product, project, ...) to make them Addressable. Naming it organization_unit may be more rational. The primary key of this organization_unit table will be the primary key of (product, project, ...). Other collections such as the Address, Image, Contract ... tables will have a relation to this base table.
It sounds like you could use junction tables: http://en.wikipedia.org/wiki/Junction_table.
They will give you the flexibility you need to maintain your foreign key constraints, as well as share addresses between levels or entities if that is desired.
You would have one each: Company_Address, Product_Address and Project_Address.
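A minimal sketch of the junction-table idea, written against SQLite via Python's sqlite3 module instead of SQL Server 2012, so the syntax is only approximately what you would use. The Address columns, the sample rows and the helper addresses_for_project are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE Company (companyID INTEGER PRIMARY KEY);
CREATE TABLE Product (productID INTEGER PRIMARY KEY,
                      companyID INTEGER NOT NULL REFERENCES Company(companyID));
CREATE TABLE Project (projectID INTEGER PRIMARY KEY,
                      productID INTEGER NOT NULL REFERENCES Product(productID));
CREATE TABLE Address (addressID INTEGER PRIMARY KEY, label TEXT NOT NULL);

-- One junction table per level, each with two real foreign keys
CREATE TABLE Company_Address (
    companyID INTEGER NOT NULL REFERENCES Company(companyID),
    addressID INTEGER NOT NULL REFERENCES Address(addressID),
    PRIMARY KEY (companyID, addressID));
CREATE TABLE Product_Address (
    productID INTEGER NOT NULL REFERENCES Product(productID),
    addressID INTEGER NOT NULL REFERENCES Address(addressID),
    PRIMARY KEY (productID, addressID));
CREATE TABLE Project_Address (
    projectID INTEGER NOT NULL REFERENCES Project(projectID),
    addressID INTEGER NOT NULL REFERENCES Address(addressID),
    PRIMARY KEY (projectID, addressID));

-- Hypothetical sample data
INSERT INTO Company VALUES (2);
INSERT INTO Product VALUES (1, 2);
INSERT INTO Project VALUES (10, 1);
INSERT INTO Address VALUES (100, 'company HQ'), (101, 'product office'), (102, 'project site');
INSERT INTO Company_Address VALUES (2, 100);
INSERT INTO Product_Address VALUES (1, 101);
INSERT INTO Project_Address VALUES (10, 102);
""")

def addresses_for_project(project_id):
    """Project-level addresses plus those of the parent product and company."""
    return conn.execute("""
        SELECT 'project' AS level, a.label
        FROM Project_Address pa
        JOIN Address a ON a.addressID = pa.addressID
        WHERE pa.projectID = :p
        UNION ALL
        SELECT 'product', a.label
        FROM Project pj
        JOIN Product_Address pa ON pa.productID = pj.productID
        JOIN Address a ON a.addressID = pa.addressID
        WHERE pj.projectID = :p
        UNION ALL
        SELECT 'company', a.label
        FROM Project pj
        JOIN Product pd ON pd.productID = pj.productID
        JOIN Company_Address ca ON ca.companyID = pd.companyID
        JOIN Address a ON a.addressID = ca.addressID
        WHERE pj.projectID = :p
    """, {"p": project_id}).fetchall()

print(sorted(addresses_for_project(10)))
```

Because the parent addresses are found by walking Project to Product to Company at query time, moving a project to a different product does not leave stale relationship data in the address tables.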

Database Design: Explain this schema

Full disclosure...Trying feverishly here to learn more about databases so I am putting in the time and also tried to get this answer from the source to no avail.
Barry Williams from databaseanswers has this schema posted.
Clients and Fees Schema
I am trying to understand the split of address tables in this schema. It's clear to me that the Addresses table contains the details of a given address. The Client_Addresses and Staff_Addresses tables are what gets me.
1) I understand the use of Primary Foreign Keys as shown but I was under the assumption that when these are used you don't have a resident Primary Key in that same table (date_address_from in this case). Can someone explain the reasoning for both and put it into words how this actually works out?
2) Why would you use date_address_from as the primary key instead of something like client_address_id as the PK? What if someone enters two addresses in one day would there be conflicts in his design? If so or if not, what?
3) Along the lines of normalization...Since both date_address_from and date_address_to are the same in the Client_Addresses and Staff_Addresses table should those fields just not be included in the main Address table?
Evaluation
First an Audit, then the specific answers.
This is not a Data Model. This is not a Database. It is a bucket of fish, with each fish drawn as a rectangle, and where the fins of one fish are caught in the gills of another, there is a line. There are masses of duplication, as well as masses of missing elements. It is completely unworthy of using as an example to learn anything about database design from.
There is no Normalisation at all; the files are very incomplete (see Mike's answer, there are a hundred more problems like that). The other_details and eg.s crack me up. Each element needs to be identified and stored: StreetNo, ApartmentNo, StreetName, StreetType, etc. not line_1_number_street, which is a group.
Customer and Staff should be normalised into a Person table, with all the elements identified.
And yes, if Customer can be either a Person or an Organisation, then a supertype-subtype structure is required to support that correctly.
So what this really is, in technically accurate terms, is a bunch of flat files, with descriptions for groups of fields. Light years distant from a database, let alone a relational one. Not ready for evaluation or inspection, let alone building something with. In a Relational Data Model, that would be approximately 35 normalised tables, with no duplicated columns.
Barry has (wait for it) over 500 "schemas" on the web. The moment you try to use a second "schema", you will find that (a) they are completely different in terms of use and purpose (b) there is no commonality between them (c) let's say there was a customer file in both; they would be different forms of customer files.
He needs to Normalise the entire single "schema" first, then present the single normalised data model in 500 sections or subject areas.
I have written to him about it. No response.
It is important to note also, that he has used some unrecognisable diagramming convention. The problem with these nice interesting pictures is that they convey some things but they do not convey the important things about a database or a design. It is no surprise that a learner is confused; it is not clear to experienced database professionals. There is a reason why there is a standard for modelling Relational databases, and for the notation in Data Models: they convey all the details and subtleties of the design.
There is a lot that Barry has not read about yet: naming conventions; relations; cardinality; etc, too many to list.
The web is full of rubbish, anyone can "publish". There are millions of good- and bad-looking "designs" out there, that are not worth looking at. Or worse, if you look, you will learn completely incorrect methods of "design". In terms of learning about databases and database design, you are best advised to find someone qualified, with demonstrated capability, and learn from them.
Answer
He is using composite keys without spelling it out. The PK for client_addresses is (client_id, address_id, date_address_from). That is not a bad key; evidently he expects to record addresses forever.
The notion of keeping addresses in a separate file is a good one, but he has not provided any of the fields required to store normalised addresses, so the "schema" will end up with complete duplication of addresses; in which case, he could remove addresses, and put the lines back in the client and staff files, along with their other_details, and remove three files that serve absolutely no purpose other than occupying disk space.
You are thinking about Associative Tables, which resolve the many-to-many relations in Databases. Yes, there, the columns are only the PKs of the two parent tables. These are not Associative Tables or files; they contain data fields.
It is not the PK, it is the third element of the PK.
The notion of a person being registered at more than one address in a single day is not reasonable; just count the one address they slept the most at.
Others have answered that.
Do not expect to identify any evidence of databases or design or Normalisation in this diagram.
1) In each of those tables the primary key is a compound key consisting of three attributes: (staff_id, address_id, date_address_from) and (client_id, address_id, date_address_from). This presumably means that the mapping of clients/staff to addresses is expected to change over time and that the history of those changes is preserved.
2) There's no obvious reason to create a new "id" attribute in those tables. The compound key does the job adequately. Why would you want to create the same address twice for the same client on the same date? If you did then that might be a reason to modify the design but that seems like an unlikely requirement.
3) No. The apparent purpose is that they are the applicable dates for the mapping of address to client/staff - not dates applicable to the address alone.
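Here is a small sketch of how such a compound key behaves, using SQLite via Python's sqlite3 module. The Clients table name, the column types and the sample dates are assumptions; Addresses, Client_Addresses and the key columns follow the diagram as described above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Clients   (client_id  INTEGER PRIMARY KEY);
CREATE TABLE Addresses (address_id INTEGER PRIMARY KEY);
CREATE TABLE Client_Addresses (
    client_id         INTEGER NOT NULL REFERENCES Clients(client_id),
    address_id        INTEGER NOT NULL REFERENCES Addresses(address_id),
    date_address_from TEXT    NOT NULL,
    date_address_to   TEXT,
    -- all three columns together form the primary key
    PRIMARY KEY (client_id, address_id, date_address_from)
);
INSERT INTO Clients VALUES (1);
INSERT INTO Addresses VALUES (10), (20);
-- The same client can return to the same address in a later period
INSERT INTO Client_Addresses VALUES (1, 10, '2001-01-01', '2001-06-30');
INSERT INTO Client_Addresses VALUES (1, 20, '2001-07-01', '2001-12-31');
INSERT INTO Client_Addresses VALUES (1, 10, '2002-01-01', NULL);
""")

# Only an exact repeat of (client, address, start date) is rejected.
try:
    conn.execute("INSERT INTO Client_Addresses VALUES (1, 10, '2002-01-01', NULL)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

history_rows = conn.execute("SELECT COUNT(*) FROM Client_Addresses").fetchone()[0]
print(duplicate_rejected, history_rows)
```

The dates live in the mapping table because they describe when this client was at this address, not anything about the address itself.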
3) Along the lines of normalization...Since both date_address_from and date_address_to are the same in the Client_Addresses and Staff_Addresses table should those fields just not be included in the main Address table?
No. But you did find a problem.
The designer has decided that clients and staff are two utterly different things. By "utterly different", I mean they have no attributes in common.
That's not true, is it? Both clients and staff have addresses. I'm sure most of them have telephones, too.
Imagine that someone on staff is also a client. How many places is that person's name stored? That person's address? Can you hear Mr. Rogers in the background saying, "Can you spell 'update anomaly'? . . . I knew you could."
The problem is that the designer was thinking of clients and staff as different kinds of people. They're not. "Client" describes a business relationship between a service provider (usually, that is, not a retailer) and a customer, which might be either a person or a company. "Staff" describes an employment relationship between a company and a person. Not different kinds of people--different kinds of relationships.
Can you see how to fix that?
These 2 extra tables enable you to keep an address history per person.
You can have them both in one table, but since staff and client are separated, it is better to separate them as well (b/c client id = 1 and staff id = 1 can't both be used in the same address table).
There is no "single" solution to a design problem; you could use 1 person table and then add a column to differentiate between staff and client. BUT the major idea is that the DB should be clear, readable and efficient, not that it should save on tables.
About 2 - the PK is a combination of clientID, AddressID and from.
So if someone lives 6 months in the States, then 6 months in Israel, and then goes back to the States, to the same address, you need only 2 addresses in the address table and 3 rows in client_address.
The idea of having the from_Date as part of the key is right, although it doesn't guarantee data integrity, as you also need to check manually that there are no overlapping dates between records of the same person.
About 3 - no (look at 2).
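One way to do that manual overlap check is a self-join, sketched here in SQLite via Python's sqlite3 module. The sample rows are invented, dates are stored as ISO strings so they compare correctly as text, and an open-ended period (NULL end date) is treated as running to '9999-12-31'; all of that is assumption, not part of the schema being discussed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Client_Addresses (
    client_id         INTEGER NOT NULL,
    address_id        INTEGER NOT NULL,
    date_address_from TEXT    NOT NULL,
    date_address_to   TEXT,
    PRIMARY KEY (client_id, address_id, date_address_from)
);
INSERT INTO Client_Addresses VALUES
    (1, 10, '2001-01-01', '2001-06-30'),
    (1, 20, '2001-07-01', '2001-12-31'),
    (1, 10, '2002-01-01', NULL),
    (2, 30, '2001-01-01', '2001-08-31'),
    (2, 40, '2001-06-01', NULL);   -- starts before the previous period ends
""")

# Two periods overlap when each one starts on or before the other one ends.
# a.rowid < b.rowid reports each offending pair only once.
overlaps = conn.execute("""
    SELECT a.client_id, a.address_id, b.address_id
    FROM Client_Addresses a
    JOIN Client_Addresses b
      ON a.client_id = b.client_id
     AND a.rowid < b.rowid
     AND a.date_address_from <= COALESCE(b.date_address_to, '9999-12-31')
     AND b.date_address_from <= COALESCE(a.date_address_to, '9999-12-31')
""").fetchall()
print(overlaps)
```

The primary key accepts all five rows, which is the point: the key alone cannot see that client 2 has two addresses at once, so a check like this (or a trigger built on it) is still needed.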
Viewing the data model, I think:
1) PF means that the field is both part of the primary key of the table and a foreign key to another table.
2) In the same way, the primary key of Staff_Addresses is {staff_id, address_id, date_address_from}, not just date_address_from
3) The same as 2)
In reference to the Staff_Addresses table, the Primary Key on date_address_from basically prevents a record with the same staff_id/address_id from being entered more than once. Now, I'm no DBA, but I like my PKs to be integers or guids for performance reasons/faster indexing. If I were to do this I would make a new column, say, Staff_Address_Id and make it the PK column and put a unique constraint on staff_id/address_id/date_address_from.
As for your last concern, Addresses table is really a generic address storage structure. It shouldn't care about date ranges during which someone resided there. It's better to be left to specific implementations of an address such as Client/Staff addresses.
Hope this helps a little.

merging two duplicate contacts/ColdFusion

Having to do with data integrity - I maintain a coldfusion database at a small shop that keeps addresses of different contacts. These contacts sometimes contain notes in them.
When you are merging two duplicate contacts, one may be created in 2002 and one in 2008. If the contact in 2002 has notes prior to 2008, my question would be does it matter if you merge these contacts and keep the 2008 contact's ID number? Would that affect the data integrity or create any sort of issues with the notes earlier than 2008?
I hope I've accurately described my scenario, as I am not familiar with the proper technical terms.
I really appreciate the help sir!
I will say that the fact that the app is ColdFusion is pretty well irrelevant to your problem.
It seems like some of what you're asking depends on your business requirements. Do you need to retain older notes?
As other folks are saying, it depends in large part on your table structure. If, as I suspect, you've got just one table that has a NOTES column in it, you'll need to figure out how to concatenate the values in multiple rows that all refer to the same person.
It sounds like you have two tables - contacts and notes. The notes table has a foreign key back to the contacts table to record which contact a note belongs to.
So, imagine two Contacts - Bill (primary key 1, created in 2002) and William (primary key 2, created in 2008).
Imagine one Note with a foreign key 1 (ie this note belongs to Bill).
If you merge Bill and William, and only keep the William record, then you would need to update the foreign key from 1 (Bill - deleted) to 2 (William) on the note or it will not display on William's record.
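Assuming the two-table structure guessed at above, the merge could look roughly like this. This is a sketch in SQLite via Python's sqlite3 module, not ColdFusion, and the table names, column names and the merge_contacts helper are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE contacts (id INTEGER PRIMARY KEY, name TEXT NOT NULL, created TEXT);
CREATE TABLE notes (
    id         INTEGER PRIMARY KEY,
    contact_id INTEGER NOT NULL REFERENCES contacts(id),
    body       TEXT NOT NULL
);
INSERT INTO contacts VALUES (1, 'Bill', '2002'), (2, 'William', '2008');
INSERT INTO notes VALUES (1, 1, 'note written before 2008');
""")

def merge_contacts(keep_id, drop_id):
    """Re-point the dropped contact's notes at the kept contact, then delete it."""
    with conn:  # one transaction: both statements succeed or neither does
        conn.execute("UPDATE notes SET contact_id = ? WHERE contact_id = ?",
                     (keep_id, drop_id))
        conn.execute("DELETE FROM contacts WHERE id = ?", (drop_id,))

merge_contacts(keep_id=2, drop_id=1)

notes_for_william = conn.execute(
    "SELECT body FROM notes WHERE contact_id = 2").fetchall()
remaining_contacts = conn.execute("SELECT id FROM contacts").fetchall()
print(notes_for_william, remaining_contacts)
```

Which ID survives does not matter to the notes themselves, as long as every foreign key that pointed at the deleted contact is updated first.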
(If you add further details about your table structure we can probably help more.)
