This is a theoretical question which I ask due to a request that has come my way recently. I own the support of a master operational data store which maintains a set of data tables (with the master data), along with a set of lookup tables (which contain a list of reference codes along with their descriptions). There has recently been a push from the downstream applications to unite the two structures (data and lookup values) logically in the presentation layer so that it is easier for them to find out if there have been updates in the overall data.
While the request is understandable, my first thought is that it should be implemented at the interface level rather than at the source. Combining the two tables logically (last_update_date) at ODS level is close to de-normalizing the data and seems contrary to the idea of keeping lookups and data separate.
That said, I cannot think of any reason of why it should not be done at ODS level apart from the fact that it does not "seem" to be right... Does anyone have any thoughts around why such an approach should or should not be followed?
I am listing an example here for simplicity's sake.
Data table
ID   Name   Emp_typ_cd   Last_update_date
-----------------------------------------
1    X      E1           2014-08-01
2    Y      E2           2014-08-01
Code table
Emp_typ_cd   Emp_typ_desc   Last_Update_date
--------------------------------------------
E1           Employee_1     2014-08-23
E2           Employee_2     2013-09-01
The downstream request is to represent the data as
Data view
ID   Name   Emp_typ_cd   Last_update_date
-----------------------------------------
1    X      E1           2014-08-23
2    Y      E2           2014-08-01
or
Data view
ID   Name   Emp_typ_cd   Emp_typ_desc   Last_update_date
--------------------------------------------------------
1    X      E1           Employee_1     2014-08-23
2    Y      E2           Employee_2     2014-08-01
You are correct: it is denormalizing the database because someone wants to see the data in a certain way. The side effects, as you know, are that you are duplicating data, reducing flexibility, increasing table size, storing dissimilar objects together, etc. You are also correct that their problem should be solved somewhere or somehow else. They won’t get what they want if they change the database the way they want to change it. If they want to make it “easier for them to find out if there have been updates in the overall data” but they duplicate massive amounts of it, they’re just opening themselves up to errors. In your example, the Last_update_date value for an Emp_typ_cd must be updated for all employees with that emp type code. A good update statement will do that, but still, instead of updating a single row in the lookup table you’re updating every single employee that has the emp type.
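One way to solve it "somewhere else" is a view that derives the combined date at read time, so nothing is copied into the data table. This is only a sketch using Python's built-in sqlite3 module so that it runs anywhere; the table names employee and emp_type and the view name employee_view are my own assumptions, and SQLite's two-argument MAX() would have to be replaced by a CASE expression or an equivalent on other engines.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emp_type (
    emp_typ_cd       TEXT PRIMARY KEY,
    emp_typ_desc     TEXT NOT NULL,
    last_update_date TEXT NOT NULL
);
CREATE TABLE employee (
    id               INTEGER PRIMARY KEY,
    name             TEXT NOT NULL,
    emp_typ_cd       TEXT NOT NULL REFERENCES emp_type (emp_typ_cd),
    last_update_date TEXT NOT NULL
);
INSERT INTO emp_type VALUES ('E1', 'Employee_1', '2014-08-23'),
                            ('E2', 'Employee_2', '2013-09-01');
INSERT INTO employee VALUES (1, 'X', 'E1', '2014-08-01'),
                            (2, 'Y', 'E2', '2014-08-01');

-- The view computes the later of the two dates on every read.
-- ISO dates stored as text compare correctly as strings.
CREATE VIEW employee_view AS
SELECT e.id, e.name, e.emp_typ_cd, t.emp_typ_desc,
       MAX(e.last_update_date, t.last_update_date) AS last_update_date
FROM employee AS e
JOIN emp_type AS t ON t.emp_typ_cd = e.emp_typ_cd;
""")

rows = conn.execute("SELECT * FROM employee_view ORDER BY id").fetchall()
for row in rows:
    print(row)
```

Updating one row in emp_type changes what every matching employee shows in the view, without touching the employee table.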
We use lookup tables all the time. We can add a new value to a lookup table, add employees to the database with a fk to that new attribute, and any report that joins on that table now has the ID, Value, Sort Order, etc. Let’s say we add ‘Veteran’ to the lu_Work_Experience. We add an employee with the veteran fk_Id and now any existing query that joins on lu_Work_Experience has that value. They sort Work Experience alphabetically or by our pre-defined sort.
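To make the 'Veteran' case concrete, here is a small sketch with sqlite3. Only lu_Work_Experience and 'Veteran' come from the description above; the column names, the other lookup values and the employee rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lu_Work_Experience (
    id         INTEGER PRIMARY KEY,
    value      TEXT NOT NULL,
    sort_order INTEGER NOT NULL
);
CREATE TABLE employee (
    id                    INTEGER PRIMARY KEY,
    name                  TEXT NOT NULL,
    fk_work_experience_id INTEGER NOT NULL REFERENCES lu_Work_Experience (id)
);
INSERT INTO lu_Work_Experience VALUES (1, 'Junior', 10), (2, 'Senior', 20);
INSERT INTO employee VALUES (1, 'Ann', 1), (2, 'Bob', 2);
""")

# An existing report that joins on the lookup table.
REPORT = """
SELECT e.name, w.value
FROM employee AS e
JOIN lu_Work_Experience AS w ON w.id = e.fk_work_experience_id
ORDER BY w.sort_order, e.name
"""

before = conn.execute(REPORT).fetchall()

# Add the new lookup value and an employee that references it.
conn.execute("INSERT INTO lu_Work_Experience VALUES (3, 'Veteran', 30)")
conn.execute("INSERT INTO employee VALUES (3, 'Cy', 3)")

# The report text is unchanged, yet it now returns the new value.
after = conn.execute(REPORT).fetchall()
print(before)
print(after)
```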
There is a valid reason for flattening your data structure though, and that is speed. If you’re running a very large report it will be faster with no joins (and good indexing). If the business knows it’s going to run a very large report many times and is worried about end user wait times, then it is a good idea to build a single table for that one report. We do it all the time for calculated measures. If we know that a certain analytic report will have a ton of aggregation and joins we pre-aggregate the data into the data store. That being said, we don’t do that very often in SQL because we use cubes for analytics.
So why use lookup tables in a database? Logical separation of data. An employee has an employee code, but it does NOT have a date of when an employee code was updated. Reduce duplicate data. Minimize design complexity. To avoid building a table for a specific report and then having to build a different table for a different report even if it has similar data.
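Pre-aggregating for one heavy report could look roughly like the sketch below. The department and sale tables and the report_sales_by_department name are invented for the example, and a real job would also need a refresh strategy that is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sale (
    id            INTEGER PRIMARY KEY,
    department_id INTEGER NOT NULL REFERENCES department (id),
    amount        INTEGER NOT NULL
);
INSERT INTO department VALUES (1, 'North'), (2, 'South');
INSERT INTO sale VALUES (1, 1, 100), (2, 1, 250), (3, 2, 75);

-- Built once (for example by a scheduled job), then read many times.
CREATE TABLE report_sales_by_department AS
SELECT d.name        AS department,
       COUNT(*)      AS sale_count,
       SUM(s.amount) AS total_amount
FROM sale AS s
JOIN department AS d ON d.id = s.department_id
GROUP BY d.name;
""")

# The report query itself needs no join and no aggregation.
report = conn.execute(
    "SELECT department, sale_count, total_amount "
    "FROM report_sales_by_department ORDER BY department"
).fetchall()
print(report)
```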
Anyway, the rest of my argument would be comprised of facts from the Database Normalization wikipedia page so I’ll skip it.
The following shows two tables: one has employee details and the other has their address. If the two tables have a 'one to one' relationship (i.e. one employee has only one address and vice versa), then why not combine the two into one table, as shown below?
Two tables with 1 to 1 relation
Both tables combined into one.
Normalization is a quality decision while de-normalization is a performance decision.
If you merged the two tables and stored them as one, reads and writes of the combined data would be faster, because you would not have to run a join query to work with it.
However, if you keep the tables separate, then looking at individual tables' data may make more sense to you and also give you the freedom to modify one table's data without touching other's. But to work with the joined data of the two tables in 1 to 1 relation would force you to write a (maybe unnecessary) join query every time.
The decision is yours to make in the end. IMO, unless the performance of separated tables is below acceptance, leaving the data stored in a cleaner manner may be a better idea.
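If the tables do stay separate, the 1 to 1 rule can be enforced by the schema itself. A minimal sketch with sqlite3, assuming hypothetical employee and address tables: making employee_id the primary key of address means a second address for the same employee is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- employee_id is both primary key and foreign key:
-- at most one address row can exist per employee.
CREATE TABLE address (
    employee_id INTEGER PRIMARY KEY REFERENCES employee (id),
    street      TEXT NOT NULL,
    city        TEXT NOT NULL
);
INSERT INTO employee VALUES (1, 'X');
INSERT INTO address VALUES (1, '1 Main St', 'Springfield');
""")

# The join that the separate-tables design requires for combined data.
joined = conn.execute("""
    SELECT e.id, e.name, a.street, a.city
    FROM employee AS e
    JOIN address AS a ON a.employee_id = e.id
""").fetchall()

# A second address for employee 1 violates the primary key.
try:
    conn.execute("INSERT INTO address VALUES (1, '2 Side St', 'Shelbyville')")
    second_address_rejected = False
except sqlite3.IntegrityError:
    second_address_rejected = True

print(joined, second_address_rejected)
```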
There are two kinds of database analysis methods.
The first form is one that simplifies the schema; in the case of the tables used here, employees and address, they should be merged.
The other method is called the third form; it consists of making the tables as independent as possible. In the case of the employees and address tables, they should be separated. There is no right or wrong method; it is a choice to make.
However, if the database contains many tables, it is more sensible to simplify and go with the first form, but there is no obligation.
If you always have a 1-1 relationship then you have a mutual functional dependency so you get no normalisation benefit.
However there may be reasons to do this:
Easier management of difference in permission.
Easier management of null values
Faster aggregation within the table
Avoid running heavy triggers on unnecessary frequent updates
On the other hand aggregation across the join becomes more expensive.
What is the best way to store settings for certain objects in my database?
Method one: Using a single table
Table: Company {CompanyID, CompanyName, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
Method two: Using two tables
Table Company {CompanyID, CompanyName}
Table2 CompanySettings {CompanyID, AutoEmail, AutoEmailAddress, AutoPrint, AutoPrintPrinter}
I would take things a step further...
Table 1 - Company
CompanyID (int)
CompanyName (string)
Example
CompanyID 1
CompanyName "Swift Point"
Table 2 - Contact Types
ContactTypeID (int)
ContactType (string)
Example
ContactTypeID 1
ContactType "AutoEmail"
Table 3 Company Contact
CompanyID (int)
ContactTypeID (int)
Addressing (string)
Example
CompanyID 1
ContactTypeID 1
Addressing "name#address.blah"
This solution gives you extensibility as you won't need to add columns to cope with new contact types in the future.
SELECT
[company].CompanyID,
[company].CompanyName,
[contacttype].ContactTypeID,
[contacttype].ContactType,
[companycontact].Addressing
FROM
[company]
INNER JOIN
[companycontact] ON [companycontact].CompanyID = [company].CompanyID
INNER JOIN
[contacttype] ON [contacttype].ContactTypeID = [companycontact].ContactTypeID
This would give you multiple rows for each company. A row for "AutoEmail" a row for "AutoPrint" and maybe in the future a row for "ManualEmail", "AutoFax" or even "AutoTeleport".
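For anyone who wants to try it, here is the three-table design and the query above as a runnable sketch using sqlite3. The 'AutoPrint' row and its 'printer-01' value are made up for illustration, and the square brackets are dropped because they are SQL Server quoting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (
    CompanyID   INTEGER PRIMARY KEY,
    CompanyName TEXT NOT NULL
);
CREATE TABLE contacttype (
    ContactTypeID INTEGER PRIMARY KEY,
    ContactType   TEXT NOT NULL
);
CREATE TABLE companycontact (
    CompanyID     INTEGER NOT NULL REFERENCES company (CompanyID),
    ContactTypeID INTEGER NOT NULL REFERENCES contacttype (ContactTypeID),
    Addressing    TEXT NOT NULL,
    PRIMARY KEY (CompanyID, ContactTypeID)
);
INSERT INTO company VALUES (1, 'Swift Point');
INSERT INTO contacttype VALUES (1, 'AutoEmail'), (2, 'AutoPrint');
INSERT INTO companycontact VALUES (1, 1, 'name#address.blah'),
                                  (1, 2, 'printer-01');
""")

# Same joins as the query above: one result row per contact type.
rows = conn.execute("""
    SELECT c.CompanyName, t.ContactType, cc.Addressing
    FROM company AS c
    INNER JOIN companycontact AS cc ON cc.CompanyID = c.CompanyID
    INNER JOIN contacttype AS t ON t.ContactTypeID = cc.ContactTypeID
    ORDER BY t.ContactTypeID
""").fetchall()
for row in rows:
    print(row)
```

A new contact type is an INSERT into contacttype plus rows in companycontact; no column has to be added.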
Response to HLEM.
Yes, this is indeed the EAV model. It is useful where you want to have an extensible list of attributes with similar data. In this case, varying methods of contact with a string that represents the "address" of the contact.
If you didn't want to use the EAV model, you should next consider relational tables, rather than storing the data in flat tables. This is because this data will almost certainly extend.
Neither the EAV model nor the relational model significantly slows queries. Joins are actually very fast, compared with (for example) a sort. Returning a record for a company with all of its associated contact types, or indeed a specific contact type, would be very fast. I am working on a financial MS SQL database with millions of rows and similar data models and have no problem returning significant amounts of data in sub-second timings.
In terms of complexity, this isn't the most technical design in terms of database modelling and the concept of joining tables is most definitely below what I would consider to be "intermediate" level database development.
I would consider whether you need one or two tables based on the following criteria:
First, are you close to the record storage limit? Then two tables, definitely.
Second, will you usually be querying the information you plan to put in the second table most of the time you query the first table? Then one table might make more sense. If you usually do not need the extended information, a separate (and less wide) table should improve performance on the main data queries.
Third, how strong a possibility is it that you will ever need multiple values? If it is one to one now, but it is something like an email address or phone number that has a strong possibility of morphing into multiple rows, go ahead and make it a related table. If you know there is no chance or only a small chance, then it is OK to keep it in one table, assuming the table isn't too wide.
EAV tables look like they are nice and will save future work, but in reality they don't. Generally, if you need to add another type, you need to do future work to adjust queries etc. Writing a script to add a column takes all of five minutes; the other work will need to be there regardless of the structure. EAV tables are also very hard to query when you don't know how many records you will need to pull, because normally you want them on one line and will get the information by joining to the same table multiple times. This causes performance problems and locking, especially if this table is central to your design. Don't use this method.
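To show what "joining to the same table multiple times" means in practice, here is a sketch with sqlite3 that puts each company's contact values on one line. The second company, the 'printer-01' value and the contact type ids 1 and 2 are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE company (
    CompanyID   INTEGER PRIMARY KEY,
    CompanyName TEXT NOT NULL
);
CREATE TABLE companycontact (
    CompanyID     INTEGER NOT NULL REFERENCES company (CompanyID),
    ContactTypeID INTEGER NOT NULL,  -- assumed ids: 1 = AutoEmail, 2 = AutoPrint
    Addressing    TEXT NOT NULL,
    PRIMARY KEY (CompanyID, ContactTypeID)
);
INSERT INTO company VALUES (1, 'Swift Point'), (2, 'Acme');
INSERT INTO companycontact VALUES (1, 1, 'name#address.blah'),
                                  (1, 2, 'printer-01'),
                                  (2, 1, 'ops#acme.blah');
""")

# One extra join to companycontact per attribute that must become a column.
one_line_per_company = conn.execute("""
    SELECT c.CompanyName,
           email.Addressing   AS AutoEmailAddress,
           printer.Addressing AS AutoPrintPrinter
    FROM company AS c
    LEFT JOIN companycontact AS email
           ON email.CompanyID = c.CompanyID AND email.ContactTypeID = 1
    LEFT JOIN companycontact AS printer
           ON printer.CompanyID = c.CompanyID AND printer.ContactTypeID = 2
    ORDER BY c.CompanyID
""").fetchall()
for row in one_line_per_company:
    print(row)
```

Every additional attribute adds another join and another place where the query has to be edited, which is the maintenance cost described above.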
It depends on whether you will ever need more information about a company. If you notice yourself adding fields like companyphonenumber1, companyphonenumber2, etc., then method 2 is better, as you would separate your entities and just reference a company id. If you do not plan to make these changes and you feel that this table will never change, then method 1 is fine.
Usually, if you don't have data duplication then a single table is fine.
In your case you don't so the first method is OK.
I use one table if I estimate the data from the "second" table will be used in more than 50% of my queries. Use two tables if I need multiple copies of the data (i.e. multiple phone numbers, email addresses, etc)
This app I'm working on needs to store some meta data fields about an entity. The problem is that we can already foresee that these fields are going to change a lot in the future. Right now every entity's property is translated to one column in the entity table, but altering table columns later down the road will be costly and error-prone right?
Should I go for something like this (key-value store) instead?
MetaDataField
-----
metaDataFieldID (PK), name
FieldValue
----------
EntityID (PK, FK), metaDataFieldID (PK, FK), value [varchar(255)]
p.s. I also thought of using XML on SQL Server 05+. After talking to some ppl, it seems like it is not a viable solution 'cause it will be too slow for certain reporting queries.
You're right, you don't want to go changing your data schema any time a new parameter comes up!
I've seen two ways of doing something like this. One, just have a "meta" text field, and format the value to define both the parameter and the value. Joomla! does this, for example, to track custom article properties. It looks like this:
ProductTable
id name meta
--------------------------------------------------------------------------
1 prod-a title:'a product title',desc:'a short description'
2 prod-b title:'second product',desc:'n/a'
3 prod-c title:'3rd product',desc:'please choose sm med or large'
Another way of handling this is to use additional tables, like this:
ProductTable
product_id name
-----------------------
1 prod-a
2 prod-b
3 prod-c
MetaParametersTable
meta_id name
--------------------
1 title
2 desc
ProductMetaMapping
product_id meta_id value
-------------------------------------
1 1 a product title
1 2 a short description
2 1 second product
2 2 n/a
3 1 3rd product
3 2 please choose sm med or large
In this case, a query will need to join the tables, but you can optimize the tables better, can query for independent meta without returning all parameters, etc.
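Here is the three-table layout as a runnable sketch with sqlite3, querying a single meta parameter without pulling the others. The snake_case table names are my own spelling of the tables above, and the data is a subset of the sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    product_id INTEGER PRIMARY KEY,
    name       TEXT NOT NULL
);
CREATE TABLE meta_parameter (
    meta_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL UNIQUE
);
CREATE TABLE product_meta_mapping (
    product_id INTEGER NOT NULL REFERENCES product (product_id),
    meta_id    INTEGER NOT NULL REFERENCES meta_parameter (meta_id),
    value      TEXT NOT NULL,
    PRIMARY KEY (product_id, meta_id)
);
INSERT INTO product VALUES (1, 'prod-a'), (2, 'prod-b'), (3, 'prod-c');
INSERT INTO meta_parameter VALUES (1, 'title'), (2, 'desc');
INSERT INTO product_meta_mapping VALUES
    (1, 1, 'a product title'), (1, 2, 'a short description'),
    (2, 1, 'second product'),  (2, 2, 'n/a'),
    (3, 1, '3rd product');
""")

# Fetch only the 'title' parameter for every product.
titles = conn.execute("""
    SELECT p.name, m.value
    FROM product AS p
    JOIN product_meta_mapping AS m ON m.product_id = p.product_id
    JOIN meta_parameter AS mp ON mp.meta_id = m.meta_id
    WHERE mp.name = ?
    ORDER BY p.product_id
""", ("title",)).fetchall()
print(titles)
```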
Choosing between them will depend on complexity, whether data rows ever need to have differing meta, and how the data will be consumed.
The Key Value table is a good idea and it works much faster than the SQL Server 2005 XML indexes. I started the same type of solution with XML in a project and had to change it to an indexed Key Value table to gain performance. I think SQL Server 2008 XML Indexes are faster, but have not tried them yet.
The XML speed only factors in depending on the size of the data going into the xml column. We had a project that stuffed data into and processed data from an xml column. It was very fast.. until you hit around 64kb. 63KB and less took milliseconds to get the data out or insert into. 64KB and the operations jumped to a full minute. Go figure.
Other than that the main issue we had was complexity. Working with xml data in sql server is not for the faint of heart.
Regardless, your best bet is to have a table of name / value pairs tied to the entity in question. Then it's easy to support having entities with either different properties or dynamically adding / removing properties. This too has its caveats. For example, if you have more than say 10 properties, then it will be much faster to do pivots in code.
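"Pivoting in code" can be as simple as grouping the name / value rows per entity after fetching them. A minimal sketch in Python; the row layout (entity_id, name, value) and the sample data are assumptions for illustration.

```python
from collections import defaultdict


def pivot(rows):
    """Group (entity_id, name, value) rows into one dict per entity."""
    entities = defaultdict(dict)
    for entity_id, name, value in rows:
        entities[entity_id][name] = value
    return dict(entities)


# Rows as they would come back from a name / value table.
rows = [
    (1, "title", "a product title"),
    (1, "desc", "a short description"),
    (2, "title", "second product"),
    (2, "desc", "n/a"),
]

pivoted = pivot(rows)
print(pivoted[1]["title"])
```

This is a single pass over the rows, however many distinct properties there are, whereas a SQL pivot needs one join or one CASE branch per property.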
There is also a pattern for this to consider -- called the observation pattern.
See similar questions/answers: one, two, three.
The pattern is described in Martin Fowler's book Analysis Patterns, essentially it is an OO pattern, but can be done in DB schema too.
"altering table columns later down the road will be costly and error-prone right?"
A "table column", as you name it, has exactly two properties : its name and its data type. Therefore, "altering a table column" can refer only to two things : altering the name or altering the data type.
Wanting to alter the name is indeed a costly and error-prone operation, but fortunately there should never be a genuine business need for it. If a certain established column seems somewhat inappropriate in hindsight, and "it might have been given a better name", then it is still not the case that the business incurs losses from that fact! Just stick with the old name, even if, in hindsight, it was poorly chosen.
Wanting to alter the data type is indeed a costly operation, susceptible to breaking business operations that were running smoothly, but fortunately it is quite rare that a user comes round to tell you that "hey, I know I told you this attribute had to be a Date, but guess what, I was wrong, it has to be a Float.". And other changes of the same nature, but more likely to occur (e.g. from shortint to integer or so), can be avoided by being cautious when defining the database.
Other types of database changes (e.g. adding a new column) are usually not that dangerous and/or disruptive.
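As a small illustration of why adding a column is usually harmless, the sketch below (sqlite3, with a made-up entity table and colour column) runs the same query before and after the change. It assumes existing queries name their columns rather than using SELECT *.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
INSERT INTO entity VALUES (1, 'first'), (2, 'second');
""")

existing_query = "SELECT id, name FROM entity ORDER BY id"
before = conn.execute(existing_query).fetchall()

# The schema change: one statement, existing rows get NULL in the new column.
conn.execute("ALTER TABLE entity ADD COLUMN colour TEXT")

after = conn.execute(existing_query).fetchall()
new_values = conn.execute("SELECT colour FROM entity ORDER BY id").fetchall()
print(before == after, new_values)
```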
So don't let yourself be scared by those vague sloganesque phrases such as "changing a database is expensive and dangerous". They usually come from ignorants who know too little about database management to be involved in that particular field of our profession anyway.
Maintaining queries, constraints and constraint enforcement on an EAV database is very likely to turn out to be thousands of times more expensive than "regular" database structure changes.
I am using SQL Server 2005 Express and Visual Studio 2008.
I have a database which has a table with 400 Columns. Things were (just about manageable) until I had to perform bi-directional sync between several databases.
I am wondering what the arguments are for and against using a 400-column database or a 40-table database?
The table is not normalised and comprises mainly nvarchar(64) columns and some TEXT columns (there are no specific datatypes, as it was converted from text files).
There is one other table that links to this table in a 1-1 relationship (i.e. one entry relates to one entry in the 400-column table).
The table is a list of files containing parameters that are "plugged" into an application.
I look forward to your replies.
Thank you
Based on your process description I would start with something like this. The model is simplified, does not capture history, etc -- but, it is a good starting point. Note: parameter = property.
- Setup is a collection of properties. One setup can have many properties, one property belongs to one setup only.
- Machine can have many setups, one setup belongs to one machine only.
- Property is of a specific type (temperature, run time, spindle speed), there can be many properties of a certain type.
- Measurement and trait are types of properties. Measurement is a numeric property, like speed. Trait is a descriptive property, like color or some text.
For having a wide table:
Quick to report on as it's presumably denormalized and so no joins are needed.
Easy to understand for end-consumers as they don't need to hold a data model in their heads.
Against having a wide table:
Probably need to have multiple composite indexes to get good query performance
More difficult to maintain data consistency i.e. need to update multiple rows when data changes if that data is on multiple rows
As you're having to update multiple rows and maintain multiple indexes, concurrent performance for updates may become an issue as locks escalate.
You might end up with records with loads of nulls in columns if the attribute isn't relevant to the entity on that row which can make handling results awkward.
If lazy developers do a SELECT * from the table you end up dragging loads of data across the network, so you generally have to maintain suitable subset views.
So it all really depends on what you're doing. If the main purpose of the table is OLAP reporting and updates are infrequent and affect few rows then perhaps a wide, denormalized table is the right thing to have. In an OLTP environment then it's probably not and you should prefer narrower tables. (I generally design in 3NF and then denormalize for query performance as I go along.)
You could always take the approach of normalizing and providing a wide-view for readers if that's what they want to see.
Without knowing more about the situation it's not really possible to say more about the pros and cons in your particular circumstance.
Edit:
Given what you've said in your comments, have you considered just having a long & skinny name=value pair table so you'd just have UserId, PropertyName, PropertyValue columns? You might want to add in some other meta-attributes into it too; timestamp, version, or whatever. SQL Server is quite efficient at handling these sorts of tables so don't discount a simple solution like this out-of-hand.
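Moving from the wide layout to the long & skinny one is mostly an unpivot. A rough sketch in Python; the function name and the sample property names are hypothetical, and only the UserId, PropertyName, PropertyValue shape comes from the suggestion above.

```python
def unpivot(user_id, wide_row):
    """Turn one wide record into (UserId, PropertyName, PropertyValue) rows."""
    return [
        (user_id, name, value)
        for name, value in wide_row.items()
        if value is not None  # attributes that do not apply produce no row
    ]


# One record from the wide table, as a column-name -> value mapping.
wide_row = {"spindle_speed": "1200", "run_time": "45", "colour": None}

long_rows = unpivot(7, wide_row)
print(long_rows)
```

Columns that are NULL in the wide table simply have no row in the long one, which takes care of the "loads of nulls" problem mentioned earlier.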
In a recent project I have seen tables with 50 to 126 columns.
Should a table hold fewer columns, or is it better to separate them out into a new table and use relationships? What are the pros and cons?
Generally it's better to design your tables first to model the data requirements and to satisfy rules of normalization. Then worry about optimizations like how many pages it takes to store a row, etc.
I agree with other posters here that the large number of columns is a potential red flag that your table is not properly normalized. But it might be fine in this case. We can't tell from your description.
In any case, splitting the table up just because the large number of columns makes you uneasy is not the right remedy. Is this really causing any defects or performance bottleneck? You need to measure to be sure, not suppose.
A good rule of thumb that I've found is simply whether or not a table is growing columns as a project continues.
For instance:
On a project I'm working on, the original designers decided to include site permissions as columns in the user table.
So now, we are constantly adding more columns as new features are implemented on the site. Obviously this is not optimal. A better solution would be to have a table containing permissions and a join table between users and permissions to assign them.
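The permissions table plus join table could look something like this sketch (sqlite3; all table, column and permission names are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE app_user (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE permission (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
-- Join table: one row per permission granted to a user.
CREATE TABLE user_permission (
    user_id       INTEGER NOT NULL REFERENCES app_user (id),
    permission_id INTEGER NOT NULL REFERENCES permission (id),
    PRIMARY KEY (user_id, permission_id)
);
INSERT INTO app_user VALUES (1, 'alice');
INSERT INTO permission VALUES (1, 'edit_pages');
INSERT INTO user_permission VALUES (1, 1);
""")


def permissions_for(user_id):
    """Return the names of all permissions granted to one user."""
    rows = conn.execute("""
        SELECT p.name
        FROM user_permission AS up
        JOIN permission AS p ON p.id = up.permission_id
        WHERE up.user_id = ?
        ORDER BY p.name
    """, (user_id,)).fetchall()
    return [name for (name,) in rows]


# A new site feature becomes two INSERTs instead of an ALTER TABLE.
conn.execute("INSERT INTO permission VALUES (2, 'export_reports')")
conn.execute("INSERT INTO user_permission VALUES (1, 2)")

print(permissions_for(1))
```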
However, for other more archival information, or tables that simply don't have to grow or need to be cached/minimize pages/can be filtered effectively, having a large table doesn't hurt too much as long as it doesn't hamper maintenance of the project.
At least that is my opinion.
Usually excess columns point to improper normalization, but it is hard to judge without having some more details about your requirements.
I can picture times when it might be necessary to have this many, or more columns. Examples would be if you had to denormalize and cache data - or for a type of row with many attributes. I think the keys are to avoid select * and make sure you are indexing the right columns and composites.
If you had an object detailing the data in the database, would you have a single object with 120 fields, or would you be looking through the data to extract data that is logically distinguishable? You can inline Address data with Customer data, but it makes sense to remove it and put it into an Addresses table, even if it keeps a 1:1 mapping with the Person.
Down the line you might need to have a record of their previous address, and by splitting it out you've removed one major problem refactoring your system.
Are any of the fields duplicated over multiple rows? I.e., are the customer's details replicated, one per invoice? In which case there should be one customer entry in the Customers table, and n entries in the Invoices table.
One place where you need to not fix broken normalisation is where you have a facts table (for auditing, etc) where the purpose is to aggregate data to run analyses on. These tables are usually populated from the properly normalised tables however (overnight for example).
It sounds like you have potential normalization issues.
If you really want to, you can create a new table for each of those columns (a little extreme) or group of related columns, and join it on the ID of each record.
It could certainly affect performance if people are running around with a lot of "Select * from GiantTableWithManyColumns"...
Here are the official statistics for SQL Server 2005
http://msdn.microsoft.com/en-us/library/ms143432.aspx
Keep in mind these are the maximums, and are not necessarily the best for usability.
Think about splitting the 126 columns into sections.
For instance, if it is some sort of "person" table
you could have
Person
ID, AddressNum, AddressSt, AptNo, Province, Country, PostalCode, Telephone, CellPhone, Fax
But you could separate that into
Person
ID, AddressID, PhoneID
Address
ID, AddressNum, AddressSt, AptNo, Province, Country, PostalCode
Phone
ID, Telephone, CellPhone, Fax
In the second one, you could also save yourself from data replication by having all the people with the same address have the same addressId instead of copying the same text over and over.
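A quick sketch of the shared-address idea with sqlite3. The Phone table is left out for brevity and the sample address is made up; the point is only that two people reference one Address row, so a correction is made once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Address (
    ID         INTEGER PRIMARY KEY,
    AddressNum TEXT,
    AddressSt  TEXT,
    AptNo      TEXT,
    Province   TEXT,
    Country    TEXT,
    PostalCode TEXT
);
CREATE TABLE Person (
    ID        INTEGER PRIMARY KEY,
    AddressID INTEGER REFERENCES Address (ID)
);
INSERT INTO Address VALUES (10, '12', 'Elm St', NULL, 'ON', 'Canada', 'K1A 0B1');
-- Two people living at the same address share one Address row.
INSERT INTO Person VALUES (1, 10), (2, 10);
""")

# Correcting the address once is visible through every person that references it.
conn.execute("UPDATE Address SET PostalCode = 'K1A 0B2' WHERE ID = 10")

postal_codes = conn.execute("""
    SELECT p.ID, a.PostalCode
    FROM Person AS p
    JOIN Address AS a ON a.ID = p.AddressID
    ORDER BY p.ID
""").fetchall()
address_rows = conn.execute("SELECT COUNT(*) FROM Address").fetchone()[0]
print(postal_codes, address_rows)
```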
The UserData table in SharePoint has 201 fields but is designed for a special purpose.
Normal tables should not be this wide in my opinion.
You could probably normalize some more. And read some posts on the web about table optimization.
It is hard to say without knowing a little bit more.
Well, I don't know how many columns are possible in SQL, but one thing I am very sure of is that when you design tables, each table is an entity, meaning that each table should contain information about a person, a place, an event or an object. So far in my life I have not come across a thing that has that much data/information.
The second thing you should notice is that there is a method called normalization, which is basically used to divide data/information into sub-sections so that one can easily maintain the database. I think this will clear up your idea.
I'm in a similar position. Yes, there truly is a situation where a normalized table has, like in my case, about 90 columns: a work flow application that tracks many states that a case can have, in addition to variable attributes for each state. So as each case (represented by the record) progresses, eventually all columns are filled in for that case. Now in my situation there are 3 logical groupings (15 cols + 10 cols + 65 cols). So do I keep it in one table (index is CaseID), or do I split into 3 tables connected by one-to-one relationship?
Columns in a table (merge publication): 246
Columns in a table (SQL Server snapshot or transactional publication): 1,000
Columns in a table (Oracle snapshot or transactional publication): 995
In a table that is part of a merge publication, we can have a maximum of 246 columns.
http://msdn.microsoft.com/en-us/library/ms143432.aspx
A table should have as few columns as possible.....
In SQL Server, tables are stored on pages; 8 pages make an extent.
In SQL Server, a page can hold about 8060 bytes; the more data you can fit on a page, the fewer I/Os you have to make to return the data.
You probably want to normalize (AKA vertical partitioning) your database
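To get a feel for the page arithmetic, here is a back-of-the-envelope sketch in Python. It uses only the 8060-byte figure quoted above and ignores per-row overhead, so the results are rough estimates; the row sizes are made-up examples.

```python
PAGE_DATA_BYTES = 8060  # approximate usable bytes per page, as quoted above


def rows_per_page(row_bytes):
    """How many rows of a given size fit on one page (overhead ignored)."""
    if row_bytes <= 0:
        raise ValueError("row_bytes must be positive")
    return PAGE_DATA_BYTES // row_bytes


def pages_needed(row_count, row_bytes):
    """Pages that must be read to scan row_count rows of row_bytes each."""
    per_page = rows_per_page(row_bytes)
    if per_page == 0:
        raise ValueError("a row of this size does not fit on one page")
    return -(-row_count // per_page)  # ceiling division


# Hypothetical comparison: a wide 4000-byte row versus a narrow 200-byte row.
wide = pages_needed(1_000_000, 4000)
narrow = pages_needed(1_000_000, 200)
print(wide, narrow)
```

With these made-up sizes the narrow table needs a twentieth of the page reads for a full scan, which is the I/O argument for keeping frequently queried tables narrow.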