Type tables and software hardcoded values - database

I have this DB model:
(TEXT is actually VARCHAR)
entity_group_type is not modifiable at runtime, but it will be modified in a near future to add more entries, several times, by the development team.
Now I need to retrieve all entries from entity that are of a given entity_group_type. How should the software handle this kind of queries? Should I hardcode entity_group_type _id/name in the software? If so, why do I even need this table then? And what's better, hardcode _id or name?
Or is this the wrong way to structure my data?
Thanks in advance!

Taking your questions one at a time:
How should you refer to the entity group in the software? Hard-code the id, or the name?
Refer to the entity groups in a way that makes your code the most readable. So, perhaps you use the name, or perhaps a constant that looks like the name but whose value is the id. Using a constant can avoid one join when you are looking up entities by group type, but usually this is not much of a concern.
Why do I even need that table in the DB then? Is this the wrong way to structure my data?
This is a perfectly acceptable way to structure your data. The most correct way depends on what you are doing with the data, but for most applications your structure would be correct. However, you certainly don't need that table in the database--you could instead have a "group_type" field on the entity_group table. Here are the pros and cons:
Advantages of the current structure:
Easy to add fields that describe the entity_group_type. For example, you might want to make some group types only viewable by admin users, or disabled, or whatever. If this is a possiblity for the future, it pretty much requires this database structure.
Ability to have your database software enforce referential integrity, meaning the data is kept consistent between the entity_group and entity_group_type tables.
Advantages of adding a group_type field on the entity_group table:
Possibly a simpler representation in your code. For exameple, if you use a MVC architecture, having the extra table might require another model object in your code. That's usually not a problem, and may have advantages, but sometimes simpler is better.
When you are looking up entities by entity group type, your SQL statements will be slightly simpler, since there will be one less table/join involved.
I think in most cases your current structure comes out ahead, although it does depend of how you use the data. Unless you had a good reason to structure the data differently, I would stick with your current structure.

I think you should definitely store entity_group_type in the database, in its own table, as per your design diagram above. Storing that information in the DB makes it possible to query against that information, which adds flexibility to your design.
Once you have this information in the DB, your question seems to be whether it should be broken out into its own table, or just be stored as column in the entity_type table. I think you should break it out into its own table, with a foreign key constraint from entity_type to entity_group_type.
Having entity_group_type in its own table allows for group types to be preserved even if all of the entity groups of that type are removed from the DB.
You can leverage the foreign key to ensure that the entity_group_type name is spelled consistently. Inserted entity_groups must satisfy the foreign key constraint, and so would have to either reference an existing entity_group_type having a properly spelled name, or insert an entity_group_type which will set the proper spelling for that new entity_group_type.
Use a synthetic key for your entity_group_type table, to reduce painful redundancy in the appearance of entity_group_type name appearance. This makes the data model more DRY, and allows updating the entity_group_type's name to be a simple update to one table.
As for storing the entity_group_type in your application logic, I would suggest storing the name, and looking up the id for the entity_group_type when it is needed. I think that would make the application codebase sowewhat more readable, and I think I've made a compelling argument for why I think this relevant information should have a representation in the database.

Given, that you know the name of your entity_group_type at runtime, then the correct way to find all entity_groups with that entity_group_type.name, you would use a join:
SELECT entity_group._id, entity_group.name, entity_group_type.name AS groupTypeName
FROM entity_group
JOIN entity_group_type ON entity_group.id_type = entity_group_type._id
WHERE entity_group_type.name = 'someGroupName';
This would result in all entity_group information for the given entity_group_type name.
And yes, this is a good or at least possible way to store that kind of information. Just imagine that eventually the entity_group_type gets a new attribute, e.g. disabled. Then it is still easily possible to find all entity_groups which are (not) in disabled types.

Related

Parent child design to easily identify child type

In our database design we have a couple of tables that describe different objects but which are of the same basic type. As describing the actual tables and what each column is doing would take a long time I'm going to try to simplify it by using a similar structured example based on a job database.
So say we have following tables:
These tables have no connections between each other but share identical columns. So the first step was to unify the identical columns and introduce a unique personId:
Now we have the "header" columns in person that are then linked to the more specific job tables using a 1 to 1 relation using the personId PK as the FK. In our use case a person can only ever have one job so the personId is also unique across the Taxi driver, Programmer and Construction worker tables.
While this structure works we now have the use case where in our application we get the personId and want to get the data of the respective job table. This gets us to the problem that we can't immediately know what kind of job the person with this personId is doing.
A few options we came up with to solve this issue:
Deal with it in the backend
This means just leaving the architecture as it is and look for the right table in the backend code. This could mean looking through every table present and/or construct a semi-complicated join select in which we have to sift through all columns to find the ones which are filled.
All in all: Possible but means a lot of unecessary selects. We also would like to keep such database oriented logic in the actual database.
Using a Type Field
This means adding a field column in the Person table filled for example with numbers to determine the correct child table like:
So you could add a 0 in Type if it's a taxi driver, a 1 if it's a programmer and so on...
While this greatly reduced the amount of backend logic we then have to make sure that the numbers we use in the Type field are known in the backend and don't ever change.
Use separate IDs for each table
That means every job gets its own ID (has to be nullable) in Person like:
Now it's easy to find out which job each person has due to the others having an empty ID.
So my question is: Which one of these designs is the best practice? Am i missing an obvious solution here?
Bill Karwin made a good explanation on a problem similar to this one. https://stackoverflow.com/a/695860/7451039
We've now decided to go with the second option because it seem to come with the least drawbacks as described by the other commenters and posters. As there was no actual answer portraying the second option as a solution i will try to summarize our reasoning:
Against Option 1:
There is no way to distinguish the type from looking at the parent table. As a result the backend would have to include all logic which includes scanning all tables for the that contains the id. While you can compress most of the logic into a single big Join select it would still be a lot more logic as opposed to the other options.
Against Option 3:
As #yuri-g said this one is technically not possible as the separate IDs could not setup as primary keys. They would have to be nullable and as a result can't be indexed, essentially rendering the parent table useless as one of the reasons for it was to have a unique personID across the tables.
Against a single table containing all columns:
For smaller use cases as the one i described in the question this might me viable but we are talking about a bunch of tables with each having roughly 2-6 columns. This would make this option turn into a column-mess really quickly.
Against a flat design with a key-value table:
Our properties have completly different data types, different constraints and foreign key relations. All of this would not be possible/difficult in this design.
Against custom database objects containt the child specific properties:
While this option that #Matthew McPeak suggested might be a viable option for a lot of people our database design never really used objects so introducing them to the mix would likely cause confusion more than it would help us.
In favor of the second option:
This option is easy to use in our table oriented database structure, makes it easy to distinguish the proper child table and does not need a lot of reworking to introduce. Especially since we already have something similar to a Type table that we can easily use for this purpose.
Third option, as you describe it, is impossible: no RDBMS (at least, of I personally know about) would allow you to use NULLs in PK (even composite).
Second is realistic.
And yes, first would take up to N queries to poll relatives in order to determine the actual type (where N is the number of types).
Although you won't escape with one query in second case either: there would always be two of them, because you cant JOIN unless you know what exactly you should be joining.
So basically there are flaws in your design, and you should consider other options there.
Like, denormalization: line non-shared attributes into the parent table anyway, then fields become nulls for non-correpondent types.
Or flexible, flat list of attribute-value pairs related through primary key (yes, schema enforcement is a trade-off).
Or switch to column-oriented DB: that's a case for it.

What is a good approach in creating a non-NoSQL, relational multi-schema database?

Consider a situation where the schema of a database table may change, that is, the fields, number of fields, and types of those fields may vary based on, say a client ID.
Take, for example a Users table. Typically we would represent this in a horizontal table with the following fields:
FirstName
LastName
Age
However, as I mentioned, each client may have different requirements.
I was thinking that to represent a multi-schema approach to Users in a relational database like SQL Server, this would be done with two tables:
UsersFieldNames - {FieldNameId, ClientId, FieldName, FieldType}
UsersValues - {UserValueId, FieldNameId, FieldValue}
To retrieve the data (using EntityFramework DB First), I'm thinking pivot table, and the use of something like LINQ Extentions - Pivot Extensions may be useful.
I would like to know of any other approaches that would satisfy this requirement.
I'm asking this question for my own curiosity, as I recall similar conversations coming up in the past, and in relation to this question posed.
Thanks.
While I think a NoSQL data base would work best for this, I once tried something like this.
Have a table named something like METATABLES, like this
METATABLE = {table_name, field name}
and another,
ACTUAL_DATA ={table_name, field_name, actual_data_id, float_value, string_value, double_value, varchar_value}
in actual_data, the fields table_name and field_name would be foreign keys, pointing to METATABLES. In METATABLES you define the specific fields each client requires. the ACTUAL_DATA table holds the actual values of those fields, stored in the appropiate value field, depending on the data type (if the field value is a string, it would be stored in the string_Value field).
This approach is probably not the most efficient, though. Hope it helps.
I think it would be a mistake to have the schema vary. It is typically something you want to be standard.
In this case you may have users that have different attributes. In the user table you store attributes that are common across all users:
USER {id(primary key), username, first, last, DOB, etc...}
Note: Age is something that should not be stored, it should be calculated.
Then you could have a USER_ATTRIBUTE table:
{userId,key,value}
So users can have multiple attributes that are unrelated to one another without the schema changing.
Changing the schema often breaks the application.

Database Is-a relationship

My problem relates to DB schema developing and is as follows.
I am developing a purchasing module, in which I want to use for purchasing items and SERVICES.
Following is my EER diagram, (note that service has very few specialized attributes – max 2)
My problem is to keep products and services in two tables or just in one table?
One table option –
Reduces complexity as I will only need to specify item id which refers to item table which will have an “item_type” field to identify whether it’s a product or a service
Two table option –
Will have to refer separate product or service in everywhere I want to refer to them and will have to keep “item_type” field in every table which refers to either product or service?
Currently planning to use option 1, but want to know expert opinion on this matter. Highly appreciate your time and advice. Thanks.
I'd certainly go to the "two tables" option. You see, you have to distinguish Products and Services, so you may either use switch(item_type) { ... } in your program or entirely distinct code paths for Product and for Service. And if a need for updating the DB schema arises, switch is harder to maintain.
The second reason is NULLs. I'd advise avoid them as much as you can — they create more problems than they solve. With two tables you can declare all fields non-NULL and forget about NULL-processing. With one table option, you have to manually write code to ensure that if item_type=product, then Product-specific fields are not NULL, and Service-specific ones are, and that if item_type=service, then Service-specific fields are not NULL, and Product-specific ones are. That's not quite pleasant work, and the DBMS can't do it for you (there is no NOT NULL IF another_field = value column constraint in SQL or anything like this).
Go with two tables. It's easier to support. I once saw a DB where everything, every single piece of data went in just two tables — there were pages and pages of code to make sure that necessary fields are not NULL.
If I were to implement I would have gone for the Two table option, It's kinda like the first rule of normalization of the schema. To remove multi-valued attributes. Using item_type is not recommended. Once you create separate tables you dont need to use the item_type you can just use the foreign key relationship.
Consider reading this article :
http://en.wikipedia.org/wiki/Database_normalization
It should help.

General database design: Is it ever considered "okay" to create a non-normalized table on purpose?

After-edit: Wow, this question go long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to your and your best judgement to do what you think is best.
I came up with a workable method to normalize these options down to
numeric identifiers by creating a primitives lookup table that stores
values such as Monthly, Semi-Annually, and Yearly. I then store the
IDs of these primitives in the table of record and use a view to join
that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
This means you could have hundreds of records with the string
'Monthly' stored in the table's ReviewPeriod column. The thought of
this happening has made me cringe since I've started working here, but
now I am starting to think that non-normalized data may be the best
option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method
while allowing it to be dynamic and support the constant adding of new
options to any dropdown list at any time is this: When saving the data
to the database, iterate through every single property of my business
object (.NET class in this case) and check for any string value that
exists in the primitives table. If it doesn't, add it and return the
auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.

Table "Inheritance" in SQL Server

I am currently in the process of looking at a restructure our contact management database and I wanted to hear peoples opinions on solving the problem of a number of contact types having shared attributes.
Basically we have 6 contact types which include Person, Company and Position # Company.
In the current structure all of these have an address however in the address table you must store their type in order to join to the contact.
This consistent requirement to join on contact type gets frustrating after a while.
Today I stumbled across a post discussing "Table Inheritance" (http://www.sqlteam.com/article/implementing-table-inheritance-in-sql-server).
Basically you have a parent table and a number of sub tables (in this case each contact type). From there you enforce integrity so that a sub table must have a master equivalent where it's type is defined.
The way I see it, by this method I would no longer need to store the type in tables like address, as the id is unique across all types.
I just wanted to know if anybody had any feelings on this method, whether it is a good way to go, or perhaps alternatives?
I'm using SQL Server 05 & 08 should that make any difference.
Thanks
Ed
I designed a database just like the link you provided suggests. The case was to store the data for many different technical reports. The number of report types is undefined and will probably grow to about 40 different types.
I created one master report table, that has an autoincrement primary key. That table contains all common information like customer, testsite, equipmentid, date etc.
Then I have one table for each report type that contains the spesific information relating to that report type. That table have the same primary key as the master and references the master as well.
My idea for splitting this into different tables with a 1:1 relation (which normally would be a no-no) was to avoid getting one single table with a huge number of columns, that gets very difficult to maintain as your constantly adding columns.
My design with table inheritance gave me segmented data and expandability without beeing difficult to maintain. The only thing I had to do was to write special a special save method to handle writing to two tables automatically. So far I'm very happy with the design and haven't really found any drawbacks, except for a little more complicated save method.
Google on "gen-spec relational modeling". You'll find a lot of articles discussing exactly this pattern. Some of them focus on table design, while others focus on an object oriented approach.
Table inheritance pops up in a few of them.
I know this won't help much now, but initially it may have been better to have an Entity table rather than 6 different contact types. Then each Entity could have as many addresses as necessary and there would be no need for type in the join.
You'll still have the problem that if you want the sub-type fields and you have only the master contact, you'll have to know what table to go looking at - or else join to all of them. But otherwise this is a workable solution to a common problem.
Another possibility (fairly similar in structure, but different in how you think of it) is to simply put all your contacts into one table. Then for the more specific fields (birthday say for people and department for position#company) create separate tables that are associated with that contact.
Contact Table
--------------
Name
Phone Number
Address Table
-------------
Street / state, etc
ContactId
ContactBirthday Table
--------------
Birthday
ContactId
Departments Table
-----------------
Department
ContactId
It requires a different way of thinking of things though - instead of thinking of people vs. companies, you think of the various functional requirements for the task at hand - if you want to send out birthday cards, get all the contacts that have birthdays associated with them, etc..
I'm going to go out on a limb here and suggest you should rethink your normalization strategy (as you seem to be lucky enough to be able to rethink your schema quite fundamentally). If you typically store an address for each contact, then your contact table should have the address fields in it. Alternatively if the address is stored per company then the address should be stored in the company table and your contacts linked to that company.
If your contacts only have one address, or one (or even 3, just not 'many') instance of the other fields, think about rationalizing them into a single table. In my experience having a few null fields is a far better alternative than needing left joins to data you aren't sure exists.
Fortunately for anyone who vehemently disagrees with me you did ask for opinions! :) IMHO you should only normalize when you really need to. Where you are rethinking schemas, denormalization should be considered at every opportunity.
When you have a 7th type, you'll have to create another table.
I'm going to try this approach. Yes, you have to create new tables when you have a new type, but since this table will probably have different columns, you'll end up doing this anyway if you don't use this scheme.
If the tables that inherit the master don't differentiate much from one another, I'd recommend you try another approach.
May I suggest that we just add a Type table. Ie a person has an address, name etc then the student, teacher as each use case presents its self we have a PersonType table that has an entry from the person table to n types and the subsequent new tables teacher, alien, singer as the system eveolves...

Resources