A debate has started at work regarding a table design and auditing changes. We have a stock table that contains the trucks we sell. The table has columns like mileage, location, price and stockdate, to name just a few.
The database is OLTP, so there are quite a few reads and updates whenever an item of stock changes.
I'm happy to leave the table alone and have a shadow table auditing any inserts and updates. However, it's been suggested that we move most of the stock columns into separate tables and make those tables slowly changing dimensions.
Personally I prefer all the data in one row. It seems a lot of hassle to join ten tables to bring back one stock record. And the updates would be a pain, because you'd have to check whether each property has changed its value, insert a new row if it has, and update the previous entry's end date. This can't be good for performance, can it?
Wouldn't it be better, if you wanted that level of auditing, to leave the table alone and move the data into an OLAP database?
I don't see the point in denormalizing your table just for auditing purposes.
You might want to look into change data capture (CDC) to see if it solves your issue without more drastic changes:
http://technet.microsoft.com/en-us/library/cc280519(v=sql.105).aspx
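For what it's worth, turning CDC on is just two system procedure calls (SQL Server 2008+, Enterprise edition until 2016 SP1). A minimal sketch, assuming your table is called dbo.Stock:

    -- Enable CDC at the database level (requires sysadmin).
    EXEC sys.sp_cdc_enable_db;

    -- Track all inserts/updates/deletes on the stock table.
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Stock',   -- placeholder table name
        @role_name     = NULL;       -- NULL = no gating role

    -- Changes are then readable from the generated function, e.g.:
    -- SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_Stock(@from_lsn, @to_lsn, N'all');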
I am a rookie at DW. I have a Customer table with basic columns that rarely change, like Name, JoinedOn, etc., and another set of columns that can change over time, like "Status", "CustomerType", "PublishStatus", "BusinessStatus", "CurrentOwner", etc. At the moment there is no history. In the DW I would like to track when the following columns change: "Status", "CustomerType", "PublishStatus", "BusinessStatus", "CurrentOwner". I feel it would be better if I create another table to track these; the table will have the following columns:
"CustomerId", "Status","CustomerType","PublishStatus","BusinessStatus","CurrentOwner", "ExpiredOn","IsCurrent"
Is this the right approach? And if so, is this new table a fact table or a slowly changing dimension? I would like to run queries like: when did the CustomerType change from A to B? When was it published? When the BusinessStatus changed, who was the owner?
Which way of modeling the customers is better really depends on how you are going to use the corresponding dimension. If you wanted, for example, to summarize some sales associated with the "Customer" dimension by the "CustomerType" at the time of the sale, you could do that only if you keep the historical details as part of a slowly changing dimension.
You probably can run a lot of customer reports right on that table that represents a slowly changing "Customer" dimension. But if the number of your customers gets into the millions, you'd be better off creating a separate fact table (or tables) for customer status changes.
So, to summarize: start off with a slowly changing dimension. If the number of customers grows too large and reports on customer status changes become too slow, add a fact table for them and don't worry about duplicating data.
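For illustration, a Type 2 slowly changing "Customer" dimension along the lines of the question might look like this (a sketch; the surrogate key and date columns are the usual SCD bookkeeping, the types are guesses, everything else follows the column list above):

    CREATE TABLE dbo.DimCustomer (
        CustomerKey    int IDENTITY(1,1) PRIMARY KEY, -- surrogate key
        CustomerId     int          NOT NULL,         -- business key
        Status         varchar(20)  NOT NULL,
        CustomerType   varchar(20)  NOT NULL,
        PublishStatus  varchar(20)  NOT NULL,
        BusinessStatus varchar(20)  NOT NULL,
        CurrentOwner   varchar(50)  NOT NULL,
        EffectiveOn    date         NOT NULL,
        ExpiredOn      date         NULL,             -- NULL while current
        IsCurrent      bit          NOT NULL DEFAULT 1
    );

    -- "When did CustomerType change from A to B?" (LAG needs SQL Server 2012+)
    DECLARE @CustomerId int = 42;  -- hypothetical business key
    WITH h AS (
        SELECT EffectiveOn, CustomerType,
               LAG(CustomerType) OVER (PARTITION BY CustomerId
                                       ORDER BY EffectiveOn) AS PrevType
        FROM dbo.DimCustomer
        WHERE CustomerId = @CustomerId
    )
    SELECT EffectiveOn
    FROM h
    WHERE PrevType = 'A' AND CustomerType = 'B';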
Which is the better practice:
Keep history records in a separate history table, or
Keep history records in the active table with a different status?
In my opinion, I would rather keep a separate table, to avoid creating one huge table with duplicate records, which may cause unwanted lag when querying.
My preference has always been to have a separate table with history, purely because it removes the need for a "WHERE Status = 'LIVE'" or "WHERE CurrentRecord = 1" clause to get the latest record (I won't get into one design that required an inline query on max(version) to get the latest). It also means the current-records table stays smaller, access times may improve, etc. In the worst-case scenario, I've seen an ad-hoc query against a table pick up the wrong version of a record, causing all sorts of problems later on.
Also, if you are already getting the history from another table, you could shard the data, so all history from one year is in one table/db and all history from another is in another table/db and so on.
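Within a single SQL Server database, table partitioning gets you much the same effect without separate tables (SQL Server 2008+ for the date type; Enterprise edition before 2016 SP1). A sketch, with all names illustrative:

    -- One partition per year; ALL TO ([PRIMARY]) keeps the sketch short,
    -- in practice you would map partitions to separate filegroups.
    CREATE PARTITION FUNCTION pfByYear (date)
        AS RANGE RIGHT FOR VALUES ('2012-01-01', '2013-01-01', '2014-01-01');

    CREATE PARTITION SCHEME psByYear
        AS PARTITION pfByYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.StockHistory (
        StockId   int  NOT NULL,
        ChangedOn date NOT NULL,
        -- ... the audited columns ...
        CONSTRAINT PK_StockHistory PRIMARY KEY (StockId, ChangedOn)
    ) ON psByYear (ChangedOn);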
Pro:
If you keep the history in a separate table, that data will be accessed only when you need to search the past. Most of the time the main table will be used far more than the historical one, so this means faster results.
Con:
On a project I worked on, I had one table with 350 columns (don't ask why...). The table became very large as data was entered, and at a specific moment records went from 'active' to 'closed' status. I was tempted to move all the closed records to a new (historical) table, but I realized it was actually slower: in a lot of queries I had to do UNIONs.
As a final opinion, I think it depends on the case, but I would always consider the separate table first.
I prefer to use one table and partitioning. I would also set up a view for the active records and use that instead of the base table when querying active records.
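A minimal sketch of that view, assuming a Stock table with a Status column (names are illustrative):

    CREATE VIEW dbo.ActiveStock
    AS
    SELECT StockId, Mileage, Location, Price, StockDate
    FROM dbo.Stock
    WHERE Status = 'ACTIVE';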
I would go for a separate table; otherwise setting up UNIQUE and FK constraints may still be doable, but it gets too involved.
I recently realized that I add some form of row-creation timestamp, and possibly an "updated on" field, to most of my tables. Suddenly I started thinking that perhaps every table in the database should have created and modified fields that are set in the model behind the scenes.
Does this sound correct? Are there any types of high-load tables (like sessions) or massive sized tables that this wouldn't be a good idea for?
I wouldn't put those fields (which I generally call audit fields) on every database table. If it's a low-traffic, high-value table (like Users, for instance), it goes on, no question. I'd also add creator and modifier. If it's a table that gets hit a lot (an operation history table, say), then maybe the benefit isn't worth the cost of increased insert time and storage space.
It's a call you'll need to make separately for each table.
Obviously, there isn't a single rule.
Most of my tables have date-related columns: DateCreated, DateModified, and occasionally a Revision to track changes, and so on. Do whatever makes sense. Clearly you can invent cases where it's appropriate and cases where it is not. If you're asking whether you should add them "by default" to most tables, I'd say "probably".
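As a sketch of the "by default" version, the fields cost one line of DDL each, plus a trigger if you want DateModified maintained inside the database (names are illustrative; datetime2/SYSUTCDATETIME need SQL Server 2008+, and this assumes the default RECURSIVE_TRIGGERS OFF setting):

    CREATE TABLE dbo.Widget (
        WidgetId     int IDENTITY(1,1) PRIMARY KEY,
        Name         nvarchar(100) NOT NULL,
        DateCreated  datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
        DateModified datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
        Revision     int       NOT NULL DEFAULT 1
    );
    GO

    -- Keep DateModified and Revision current on every update.
    CREATE TRIGGER dbo.trWidgetModified ON dbo.Widget AFTER UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        UPDATE w
        SET DateModified = SYSUTCDATETIME(),
            Revision     = w.Revision + 1
        FROM dbo.Widget w
        JOIN inserted i ON i.WidgetId = w.WidgetId;
    END;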
I am putting together a staff database, and I need to be able to revise the staff member information but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but still be able to query against the most recent revision? I am looking at information that changes rarely, like last name, but for which I will need to be able to query old values. So if Jenny Smith changes her name to Jenny James, I need to be able to find her current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.
A simple way of doing this is to add a deleted flag: instead of updating records, you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active record, query with WHERE deleted = 0; the speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful for getting previous versions and also for optimistic locking. The 'who updated this last and when' questions usually come up once the system is running, rather than during requirements gathering, and these are useful fields to put in any table containing 'master' data.
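Put together, an "update" to Jenny's record becomes a short transaction. A sketch (table and column names are assumed, and the other columns of the row are elided):

    BEGIN TRANSACTION;

    DECLARE @rev int;

    -- Grab the current revision number of staff member 42.
    SELECT @rev = revision
    FROM dbo.StaffMember
    WHERE person_id = 42 AND deleted = 0;

    -- Retire the current version...
    UPDATE dbo.StaffMember
    SET deleted = 1, expired_on = SYSUTCDATETIME()
    WHERE person_id = 42 AND deleted = 0;

    -- ...and insert the replacement as the new active version.
    INSERT INTO dbo.StaffMember (person_id, last_name, revision, deleted)
    VALUES (42, 'James', @rev + 1, 0);

    COMMIT;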
I would use the separate table because then you can have a unique identifier that points to all the other child records and is also the PK of the table, which I think makes it less likely you will have data integrity issues. For instance, you have Mary Jones, who has records in the address table, the email table, the performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have a non-autogenerated person ID and an autogenerated record ID.
You also have the possibility of people forgetting the WHERE deleted = 0 clause that is needed for almost every query. (If you do use the deleted-flag field, do yourself a favor: set up a view with WHERE deleted = 0 and require developers to use the view in queries, not the original table.)
With the deleted-flag field you will also need a trigger to ensure that one and only one record is marked as active.
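On SQL Server 2008 and later you can get that guarantee declaratively with a filtered unique index instead of a trigger (a sketch; person_id and deleted are the assumed column names):

    -- At most one non-deleted row per person; a second active row
    -- fails with a unique-violation error, no trigger needed.
    CREATE UNIQUE INDEX UX_StaffMember_OneActive
        ON dbo.StaffMember (person_id)
        WHERE deleted = 0;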
@Peter Tillemans' suggestion is a common way to accomplish what you're asking for, but I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table, obsolete_employee, and store there only the historical information that needs to be searched in the future. This way you keep your real employee table clean, retaining only the old data that is necessary. This approach also simplifies reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!
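The archive move itself is just two statements in a transaction. A sketch, assuming obsolete_employee mirrors employee's column list exactly and has no IDENTITY column of its own:

    DECLARE @employee_id int = 42;  -- hypothetical key

    BEGIN TRANSACTION;

    -- Copy the row into the archive, then remove it from the live table.
    INSERT INTO dbo.obsolete_employee
    SELECT * FROM dbo.employee WHERE id = @employee_id;

    DELETE FROM dbo.employee WHERE id = @employee_id;

    COMMIT;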
In a recent project I have seen tables with anywhere from 50 to 126 columns.
Should a table hold fewer columns, or is it better to separate them out into new tables and use relationships? What are the pros and cons?
Generally it's better to design your tables first to model the data requirements and to satisfy rules of normalization. Then worry about optimizations like how many pages it takes to store a row, etc.
I agree with other posters here that the large number of columns is a potential red flag that your table is not properly normalized. But it might be fine in this case. We can't tell from your description.
In any case, splitting the table up just because the large number of columns makes you uneasy is not the right remedy. Is this really causing any defects or performance bottleneck? You need to measure to be sure, not suppose.
A good rule of thumb that I've found is simply whether or not a table keeps growing new columns as the project continues.
For instance:
On a project I'm working on, the original designers decided to include site permissions as columns in the user table.
So now we are constantly adding more columns as new features are implemented on the site. Obviously this is not optimal. A better solution would be to have a table containing permissions and a join table between users and permissions to assign them, as in the sketch below.
However, for more archival information, or for tables that simply don't have to grow, or that need to be cached, kept to a minimum of pages, or filtered effectively, having a large table doesn't hurt too much, as long as it doesn't hamper maintenance of the project.
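Something like this sketch (dbo.Users with a UserId key is assumed to already exist):

    CREATE TABLE dbo.Permission (
        PermissionId int IDENTITY(1,1) PRIMARY KEY,
        Name         nvarchar(50) NOT NULL UNIQUE
    );

    -- Join table: a new site permission is now an INSERT here,
    -- not another ALTER TABLE on the user table.
    CREATE TABLE dbo.UserPermission (
        UserId       int NOT NULL REFERENCES dbo.Users (UserId),
        PermissionId int NOT NULL REFERENCES dbo.Permission (PermissionId),
        PRIMARY KEY (UserId, PermissionId)
    );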
At least that is my opinion.
Usually an excess of columns points to improper normalization, but it is hard to judge without more details about your requirements.
I can picture times when it might be necessary to have this many columns, or more. Examples would be if you had to denormalize and cache data, or a row type with many attributes. I think the keys are to avoid SELECT * and to make sure you are indexing the right columns and composites.
If you had an object detailing the data in the database, would you have a single object with 120 fields, or would you look through the data for what is logically distinguishable? You can inline address data with customer data, but it makes sense to pull it out into an Addresses table, even if it keeps a 1:1 mapping with the person.
Down the line you might need to keep a record of their previous address, and by splitting it out you've removed one major problem when refactoring your system.
Are any of the fields duplicated over multiple rows? I.e., are the customer's details replicated, once per invoice? In that case there should be one customer entry in the Customers table and n entries in the Invoices table.
One place where you should not fix broken normalisation is a fact table (for auditing, etc.) whose purpose is to aggregate data for analysis. These tables are usually populated from the properly normalised tables anyway (overnight, for example).
It sounds like you have potential normalization issues.
If you really want to, you can create a new table for each of those columns (a little extreme) or group of related columns, and join it on the ID of each record.
It could certainly affect performance if people are running around with a lot of "Select * from GiantTableWithManyColumns"...
Here are the official limits for SQL Server 2005:
http://msdn.microsoft.com/en-us/library/ms143432.aspx
Keep in mind these are the maximums, and are not necessarily the best for usability.
Think about splitting the 126 columns into sections.
For instance, if it is some sort of "person" table, you could have:
Person
ID, AddressNum, AddressSt, AptNo, Province, Country, PostalCode, Telephone, CellPhone, Fax
But you could separate that into:
Person
ID, AddressID, PhoneID
Address
ID, AddressNum, AddressSt, AptNo, Province, Country, PostalCode
Phone
ID, Telephone, Cellphone, fax
In the second one, you also save yourself from data replication: everyone with the same address shares the same AddressID instead of copying the same text over and over.
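As DDL, that split might look like this sketch (the types are guesses; the column names follow the example above):

    CREATE TABLE dbo.Address (
        ID         int IDENTITY(1,1) PRIMARY KEY,
        AddressNum varchar(10),
        AddressSt  varchar(100),
        AptNo      varchar(10),
        Province   varchar(50),
        Country    varchar(50),
        PostalCode varchar(10)
    );

    CREATE TABLE dbo.Phone (
        ID        int IDENTITY(1,1) PRIMARY KEY,
        Telephone varchar(20),
        CellPhone varchar(20),
        Fax       varchar(20)
    );

    CREATE TABLE dbo.Person (
        ID        int IDENTITY(1,1) PRIMARY KEY,
        AddressID int REFERENCES dbo.Address (ID),  -- shared by people at one address
        PhoneID   int REFERENCES dbo.Phone (ID)
    );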
The UserData table in SharePoint has 201 fields but is designed for a special purpose.
Normal tables should not be this wide in my opinion.
You could probably normalize some more. And read some posts on the web about table optimization.
It is hard to say without knowing a little bit more.
Well, I don't know how many columns are possible in SQL, but one thing I am very sure of is that when you design a table, each table is an entity: it should contain information about a person, a place, an event, or an object. So far I have never seen a single thing that needs that much data/information.
The second thing you should notice is that there is a method called normalization, which is used to divide data/information into subsections so that the database is easier to maintain. I think this will clear up the idea.
I'm in a similar position. Yes, there truly are situations where a normalized table has, as in my case, about 90 columns: a workflow application that tracks the many states a case can have, plus attributes specific to each state. As each case (represented by the record) progresses, eventually all the columns are filled in. Now, in my situation there are 3 logical groupings (15 cols + 10 cols + 65 cols). So do I keep it in one table (keyed on CaseID), or do I split it into 3 tables connected by one-to-one relationships?
From the SQL Server replication maximums:
Columns per table (merge publication): 246
Columns per table (SQL Server snapshot or transactional publication): 1,000
Columns per table (Oracle snapshot or transactional publication): 995
So a table published for merge replication can have at most 246 columns; the general SQL Server 2005 limit is 1,024 columns per table.
http://msdn.microsoft.com/en-us/library/ms143432.aspx
A table should have as few columns as possible...
In SQL Server, tables are stored on 8 KB pages, and 8 pages make up an extent.
A page can hold about 8,060 bytes of row data; the more rows you can fit on a page, the fewer I/Os you have to make to return the data.
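If you want to measure rather than guess, row size and page counts are visible through a DMV (SQL Server 2005+; dbo.Stock is a placeholder table name):

    SELECT index_id,
           page_count,
           avg_record_size_in_bytes,
           avg_page_space_used_in_percent
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID(N'dbo.Stock'),
             NULL, NULL, 'DETAILED');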
You probably want to normalize your database, or at least vertically partition the wide table.