I am attempting to merge data from various sources into an existing data model. Each source uses different types of IDs (such as GUID, Salesforce IDs, etc.). For example, if I were to merge data from two different sources, the table may look like the following (where the first two SalesPersonIDs are GUID IDs and the second two are Salesforce IDs):
Is this a bad practice? I could also imagine a table where each ID type was its own column and could be left blank if it was not applicable. Something like the following:
I apologize, I am a bit new to this. Thanks in advance for any insight, I greatly appreciate it!
The big roles of an ID column are to act as a key connecting data in different tables, and to help indexing - quickly find rows so your queries run fast.
The second solution wouldn't work well for these purposes, and will lead to big headaches in queries: every time you want to group by the ID, you'll have to combine the info from 2 columns in some way, hopefully getting a correct unique result every time.
On the one hand, all you might ever need from an ID is for it to be unique. The first solution might be fine this respect - but are you sure you'll never, ever get data about one SalesPerson from more than one source?
I'd suggest keeping all the IDs in one column, and adding a column to say what kind of ID this is. At least this way, you won't lose any information and can do other things in the future.
One thing you might consider is making a separate table of SalesPerson with all their possible IDs, and have this keyed to other (Sales?) data by a unique ID used only in your database.
I have a table say Table1 which has following columns
1. Id
2. Name
3. TransportModeId
4. ParkingId
5. ActivityId
Column 3,4,5 are the foreign keys and all three are simple list tables which has following columns
1. Id
2. Item
For simplicity I have shown 3 tables otherwise my actual schema contains almost 25 List table.
What should be the best Practice
Option 1.
Keep all list table separate which will create 25 tables but on the other hand i will have a clean modular schema
Option 2.
Make a table with self join and add all the items in that table in which ParentId null will represent the name of the table and it can have more than one references in other tables as described above and it has to be kept in some kind of common module
thanks
Option 1 is the way how it is normally done when designing a system that is not supposed to be very configurable by end user/implementator. It has several important advantages, two of them:
when you need to add an extra attribute to any of the enumerations (e.g. parking location to the Parking enumeration), it is quite simple and does not produce extra problems.
It is optimized for speed using relation database engine's native algorithms for linking records.
As for Option 2:
It is something called Generalization. You take more types with similar attributes (methods) and create a class/table with a structure that fits different purposes.
The self reference, as you speak about it, is not a good idea for Option 2, rather make a reference to another EnumerationType table containing type names like Parking, Activity etc. with id.
Using this approach could make sense in case you need to enable end user to configure the attributes himself within your app. But otherwise it could cause you problems when you find out, that different enumeration tables need to have different structures.
In our database design we have a couple of tables that describe different objects but which are of the same basic type. As describing the actual tables and what each column is doing would take a long time I'm going to try to simplify it by using a similar structured example based on a job database.
So say we have following tables:
These tables have no connections between each other but share identical columns. So the first step was to unify the identical columns and introduce a unique personId:
Now we have the "header" columns in person that are then linked to the more specific job tables using a 1 to 1 relation using the personId PK as the FK. In our use case a person can only ever have one job so the personId is also unique across the Taxi driver, Programmer and Construction worker tables.
While this structure works we now have the use case where in our application we get the personId and want to get the data of the respective job table. This gets us to the problem that we can't immediately know what kind of job the person with this personId is doing.
A few options we came up with to solve this issue:
Deal with it in the backend
This means just leaving the architecture as it is and look for the right table in the backend code. This could mean looking through every table present and/or construct a semi-complicated join select in which we have to sift through all columns to find the ones which are filled.
All in all: Possible but means a lot of unecessary selects. We also would like to keep such database oriented logic in the actual database.
Using a Type Field
This means adding a field column in the Person table filled for example with numbers to determine the correct child table like:
So you could add a 0 in Type if it's a taxi driver, a 1 if it's a programmer and so on...
While this greatly reduced the amount of backend logic we then have to make sure that the numbers we use in the Type field are known in the backend and don't ever change.
Use separate IDs for each table
That means every job gets its own ID (has to be nullable) in Person like:
Now it's easy to find out which job each person has due to the others having an empty ID.
So my question is: Which one of these designs is the best practice? Am i missing an obvious solution here?
Bill Karwin made a good explanation on a problem similar to this one. https://stackoverflow.com/a/695860/7451039
We've now decided to go with the second option because it seem to come with the least drawbacks as described by the other commenters and posters. As there was no actual answer portraying the second option as a solution i will try to summarize our reasoning:
Against Option 1:
There is no way to distinguish the type from looking at the parent table. As a result the backend would have to include all logic which includes scanning all tables for the that contains the id. While you can compress most of the logic into a single big Join select it would still be a lot more logic as opposed to the other options.
Against Option 3:
As #yuri-g said this one is technically not possible as the separate IDs could not setup as primary keys. They would have to be nullable and as a result can't be indexed, essentially rendering the parent table useless as one of the reasons for it was to have a unique personID across the tables.
Against a single table containing all columns:
For smaller use cases as the one i described in the question this might me viable but we are talking about a bunch of tables with each having roughly 2-6 columns. This would make this option turn into a column-mess really quickly.
Against a flat design with a key-value table:
Our properties have completly different data types, different constraints and foreign key relations. All of this would not be possible/difficult in this design.
Against custom database objects containt the child specific properties:
While this option that #Matthew McPeak suggested might be a viable option for a lot of people our database design never really used objects so introducing them to the mix would likely cause confusion more than it would help us.
In favor of the second option:
This option is easy to use in our table oriented database structure, makes it easy to distinguish the proper child table and does not need a lot of reworking to introduce. Especially since we already have something similar to a Type table that we can easily use for this purpose.
Third option, as you describe it, is impossible: no RDBMS (at least, of I personally know about) would allow you to use NULLs in PK (even composite).
Second is realistic.
And yes, first would take up to N queries to poll relatives in order to determine the actual type (where N is the number of types).
Although you won't escape with one query in second case either: there would always be two of them, because you cant JOIN unless you know what exactly you should be joining.
So basically there are flaws in your design, and you should consider other options there.
Like, denormalization: line non-shared attributes into the parent table anyway, then fields become nulls for non-correpondent types.
Or flexible, flat list of attribute-value pairs related through primary key (yes, schema enforcement is a trade-off).
Or switch to column-oriented DB: that's a case for it.
After-edit: Wow, this question go long. Please forgive =\
I am creating a new table consisting of over 30 columns. These columns are largely populated by selections made from dropdown lists and their options are largely logically related. For example, a dropdown labeled Review Period will have options such as Monthly, Semi-Annually, and Yearly. I came up with a workable method to normalize these options down to numeric identifiers by creating a primitives lookup table that stores values such as Monthly, Semi-Annually, and Yearly. I then store the IDs of these primitives in the table of record and use a view to join that table out to my lookup table. With this view in place, the table of record can contain raw data that only the application understands while allowing external applications and admins to run SQL against the view and return data that is translated into friendly information.
It just got complicated. Now these dropdown lists are going to have non-logically-related items. For example, the Review Period dropdown list now needs to have options of NA and Manual. This blows my entire grouping scheme out of the water.
Similar constructs that have been used in this application have resorted to storing repeated string values across multiple records. This means you could have hundreds of records with the string 'Monthly' stored in the table's ReviewPeriod column. The thought of this happening has made me cringe since I've started working here, but now I am starting to think that non-normalized data may be the best option here.
The only other way I can think of doing this using my initial method while allowing it to be dynamic and support the constant adding of new options to any dropdown list at any time is this: When saving the data to the database, iterate through every single property of my business object (.NET class in this case) and check for any string value that exists in the primitives table. If it doesn't, add it and return the auto-generated unique identifier for storage in the table of record. It seems so complicated, but is this what one is to go through for the sake of normalized data?
Anything is possible. Nobody is going to haul you off to denormalization jail and revoke your DBA card. I would say that you should know the rules and what breaking them means. Once you have those in hand, it's up to your and your best judgement to do what you think is best.
I came up with a workable method to normalize these options down to
numeric identifiers by creating a primitives lookup table that stores
values such as Monthly, Semi-Annually, and Yearly. I then store the
IDs of these primitives in the table of record and use a view to join
that table out to my lookup table.
Replacing text with ID numbers has nothing at all to do with normalization. You're describing a choice of surrogate keys over natural keys. Sometimes surrogate keys are a good choice, and sometimes surrogate keys are a bad choice. (More often a bad choice than you might believe.)
This means you could have hundreds of records with the string
'Monthly' stored in the table's ReviewPeriod column. The thought of
this happening has made me cringe since I've started working here, but
now I am starting to think that non-normalized data may be the best
option here.
Storing the string "Monthly" in multiple rows has nothing to do with normalization. (Or with denormalization.) This seems to be related to the notion that normalization means "replace all text with id numbers". Storing text in your database shouldn't make you cringe. VARCHAR(n) is there for a reason.
The only other way I can think of doing this using my initial method
while allowing it to be dynamic and support the constant adding of new
options to any dropdown list at any time is this: When saving the data
to the database, iterate through every single property of my business
object (.NET class in this case) and check for any string value that
exists in the primitives table. If it doesn't, add it and return the
auto-generated unique identifier for storage in the table of record.
Let's think about this informally for a minute.
Foreign keys provide referential integrity. Their purpose is to limit the values allowed in a column. Informally, the referenced table provides a set of valid values. Values that aren't in that table aren't allowed in the referencing column of other tables.
But no matter what the user types in, you're going to add it to that table of valid values.
If you're going to accept everything the user types in the first place, why use a foreign key at all?
The main problem here is that you've been poorly served by the people who taught you (mis-taught you) the relational model. (And, probably, equally poorly by the people who taught you SQL.) I hope you can unlearn those mistaken notions quickly, and soon make real progress.
I have an application with multiple "pick list" entities, such as used to populate choices of dropdown selection boxes. These entities need to be stored in the database. How do one persist these entities in the database?
Should I create a new table for each pick list? Is there a better solution?
In the past I've created a table that has the Name of the list and the acceptable values, then queried it to display the list. I also include a underlying value, so you can return a display value for the list, and a bound value that may be much uglier (a small int for normalized data, for instance)
CREATE TABLE PickList(
ListName varchar(15),
Value varchar(15),
Display varchar(15),
Primary Key (ListName, Display)
)
You could also add a sortOrder field if you want to manually define the order to display them in.
It depends on various things:
if they are immutable and non relational (think "names of US States") an argument could be made that they should not be in the database at all: after all they are simply formatting of something simpler (like the two character code assigned). This has the added advantage that you don't need a round trip to the db to fetch something that never changes in order to populate the combo box.
You can then use an Enum in code and a constraint in the DB. In case of localized display, so you need a different formatting for each culture, then you can use XML files or other resources to store the literals.
if they are relational (think "states - capitals") I am not very convinced either way... but lately I've been using XML files, database constraints and javascript to populate. It works quite well and it's easy on the DB.
if they are not read-only but rarely change (i.e. typically cannot be changed by the end user but only by some editor or daily batch), then I would still consider the opportunity of not storing them in the DB... it would depend on the particular case.
in other cases, storing in the DB is the way (think of the tags of StackOverflow... they are "lookup" but can also be changed by the end user) -- possibly with some caching if needed. It requires some careful locking, but it would work well enough.
Well, you could do something like this:
PickListContent
IdList IdPick Text
1 1 Apples
1 2 Oranges
1 3 Pears
2 1 Dogs
2 2 Cats
and optionally..
PickList
Id Description
1 Fruit
2 Pets
I've found that creating individual tables is the best idea.
I've been down the road of trying to create one master table of all pick lists and then filtering out based on type. While it works, it has invariably created headaches down the line. For example you may find that something you presumed to be a simple pick list is not so simple and requires an extra field, do you now split this data into an additional table or extend you master list?
From a database perspective, having individual tables makes it much easier to manage your relational integrity and it makes it easier to interpret the data in the database when you're not using the application
We have followed the pattern of a new table for each pick list. For example:
Table FRUIT has columns ID, NAME, and DESCRIPTION.
Values might include:
15000, Apple, Red fruit
15001, Banana, yellow and yummy
...
If you have a need to reference FRUIT in another table, you would call the column FRUIT_ID and reference the ID value of the row in the FRUIT table.
Create one table for lists and one table for list_options.
# Put in the name of the list
insert into lists (id, name) values (1, "Country in North America");
# Put in the values of the list
insert into list_options (id, list_id, value_text) values
(1, 1, "Canada"),
(2, 1, "United States of America"),
(3, 1, "Mexico");
To answer the second question first: yes, I would create a separate table for each pick list in most cases. Especially if they are for completely different types of values (e.g. states and cities). The general table format I use is as follows:
id - identity or UUID field (I actually call the field xxx_id where xxx is the name of the table).
name - display name of the item
display_order - small int of order to display. Default this value to something greater than 1
If you want you could add a separate 'value' field but I just usually use the id field as the select box value.
I generally use a select that orders first by display order, then by name, so you can order something alphabetically while still adding your own exceptions. For example, let's say you have a list of countries that you want in alpha order but have the US first and Canada second you could say "SELECT id, name FROM theTable ORDER BY display_order, name" and set the display_order value for the US as 1, Canada as 2 and all other countries as 9.
You can get fancier, such as having an 'active' flag so you can activate or deactivate options, or setting a 'x_type' field so you can group options, description column for use in tooltips, etc. But the basic table works well for most circumstances.
Two tables. If you try to cram everything into one table then you break normalization (if you care about that). Here are examples:
LIST
---------------
LIST_ID (PK)
NAME
DESCR
LIST_OPTION
----------------------------
LIST_OPTION_ID (PK)
LIST_ID (FK)
OPTION_NAME
OPTION_VALUE
MANUAL_SORT
The list table simply describes a pick list. The list_ option table describes each option in a given list. So your queries will always start with knowing which pick list you'd like to populate (either by name or ID) which you join to the list_ option table to pull all the options. The manual_sort column is there just in case you want to enforce a particular order other than by name or value. (BTW, whenever I try to post the words "list" and "option" connected with an underscore, the preview window goes a little wacky. That's why I put a space there.)
The query would look something like:
select
b.option_name,
b.option_value
from
list a,
list_option b
where
a.name="States"
and
a.list_id = b.list_id
order by
b.manual_sort asc
You'll also want to create an index on list.name if you think you'll ever use it in a where clause. The pk and fk columns will typically automatically be indexed.
And please don't create a new table for each pick list unless you're putting in "relationally relevant" data that will be used elsewhere by the app. You'd be circumventing exactly the relational functionality that a database provides. You'd be better off statically defining pick lists as constants somewhere in a base class or a properties file (your choice on how to model the name-value pair).
Depending on your needs, you can just have an options table that has a list identifier and a list value as the primary key.
select optionDesc from Options where 'MyList' = optionList
You can then extend it with an order column, etc. If you have an ID field, that is how you can reference your answers back... of if it is often changing, you can just copy the answer value to the answer table.
If you don't mind using strings for the actual values, you can simply give each list a different list_id in value and populate a single table with :
item_id: int
list_id: int
text: varchar(50)
Seems easiest unless you need multiple things per list item
We actually created entities to handle simple pick lists. We created a Lookup table, that holds all the available pick lists, and a LookupValue table that contains all the name/value records for the Lookup.
Works great for us when we need it to be simple.
I've done this in two different ways:
1) unique tables per list
2) a master table for the list, with views to give specific ones
I tend to prefer the initial option as it makes updating lists easier (at least in my opinion).
Try turning the question around. Why do you need to pull it from the database? Isn't the data part of your model but you really want to persist it in the database? You could use an OR mapper like linq2sql or nhibernate (assuming you're in the .net world) or depending on the data you could store it manually in a table each - there are situations where it would make good sense to put it all in the same table but do consider this only if you feel it makes really good sense. Normally putting different data in different tables makes it a lot easier to (later) understand what is going on.
There are several approaches here.
1) Create one table per pick list. Each of the tables would have the ID and Name columns; the value that was picked by the user would be stored based on the ID of the item that was selected.
2) Create a single table with all pick lists. Columns: ID; list ID (or list type); Name. When you need to populate a list, do a query "select all items where list ID = ...". Advantage of this approach: really easy to add pick lists; disadvantage: a little more difficult to write group-by style queries (for example, give me the number of records that picked value X".
I personally prefer option 1, it seems "cleaner" to me.
You can use either a separate table for each (my preferred), or a common picklist table that has a type column you can use to filter on from your application. I'm not sure that one has a great benefit over the other generally speaking.
If you have more than 25 or so, organizationally it might be easier to use the single table solution so you don't have several picklist tables cluttering up your database.
Performance might be a hair better using separate tables for each if your lists are very long, but this is probably negligible provided your indexes and such are set up properly.
I like using separate tables so that if something changes in a picklist - it needs and additional attribute for instance - you can change just that picklist table with little effect on the rest of your schema. In the single table solution, you will either have to denormalize your picklist data, pull that picklist out into a separate table, etc. Constraints are also easier to enforce in the separate table solution.
This has served us well:
SQL> desc aux_values;
Name Type
----------------------------------------- ------------
VARIABLE_ID VARCHAR2(20)
VALUE_SEQ NUMBER
DESCRIPTION VARCHAR2(80)
INTEGER_VALUE NUMBER
CHAR_VALUE VARCHAR2(40)
FLOAT_VALUE FLOAT(126)
ACTIVE_FLAG VARCHAR2(1)
The "Variable ID" indicates the kind of data, like "Customer Status" or "Defect Code" or whatever you need. Then you have several entries, each one with the appropriate data type column filled in. So for a status, you'd have several entries with the "CHAR_VALUE" filled in.