I have a large database which will only be updated once a year. Every year of data will use the same schema (the data will not be adding any new variables). There's a 'main' table where most of the customer information lives. To keep track of what happens from year to year, is it better design to put a field in the main customer table that says what year it is, or have a 'year' table that relates to the main customer table?

I recommend having a year field in the customer table so that everything is kept together. You could even use a timestamp to automatically record the date of user sign-up.

To really answer, we'd need to see your schema, but it is almost never the right choice to make a new table for a new year. You probably want to relate years to customers.

Usually you would split off your archive data because you are doing OLTP work on your current data: you mostly work on current data and only sometimes look at old data. But it seems you have very few updates. I guess the main driver is your queries: what they 'usually' do, and what performance you need to get out of them. It's probably easier for you to have everything in one table, with a year column. But if most of your queries are for the current year and they are tight on performance, you may want to look at splitting the current data out, either using physical tables or by partitioning the table (depending on the DB, some can do this for you while it still remains a single table).
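As a minimal sketch of the single-table option, assuming SQLite through Python's sqlite3 module: the customer table, the data_year column and the sample rows are made up for illustration and are not the asker's schema.

```python
import sqlite3


def build_customer_db():
    """Create an in-memory customer table that carries a year column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE customer (
            customer_id INTEGER NOT NULL,
            data_year   INTEGER NOT NULL,  -- hypothetical year column
            name        TEXT    NOT NULL,
            PRIMARY KEY (customer_id, data_year)
        )
        """
    )
    # An index on the year keeps "current year only" queries cheap.
    conn.execute("CREATE INDEX idx_customer_year ON customer (data_year)")
    conn.executemany(
        "INSERT INTO customer (customer_id, data_year, name) VALUES (?, ?, ?)",
        [(1, 2009, "Acme"), (1, 2010, "Acme Ltd"), (2, 2010, "Globex")],
    )
    return conn


def customers_for_year(conn, year):
    """Return (customer_id, name) for one year of data."""
    rows = conn.execute(
        "SELECT customer_id, name FROM customer "
        "WHERE data_year = ? ORDER BY customer_id",
        (year,),
    )
    return rows.fetchall()
```

Each yearly load just inserts rows with a new data_year value, and year-over-year comparisons become a self-join on customer_id.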


How to deal with similar fields across SQL Server database tables in a dimension

I am working on a data warehouse solution, and I am trying to build a dimensional model from tables held in a SQL Server database. Some of the tables include but aren't limited to Customer, Customer Payments, Customer Address, etc.
All these tables in the DB have some fields that are repeated across each table, e.g. record update date, record creation date, active flag, closed flag and a few others. These tables all relate to the Customer in some way, but the tables can be updated independently.
I am in the process of building out a dimension(s) on the back of these tables, but I am struggling to see how best to deal with these repeated fields in an elegant way, as they are all used.
I'd appreciate any guidance from people who have experience with scenarios like this, as I am just starting out.
If more details are needed, I am happy to provide them.
Before you even consider how to include them, ask if those metadata fields even need to be in your dimensional model? If no one will use the Customer Payment Update Date (vs Created Date or Payment Date), don't bring it into your model. If the customer model includes the current address, you won't need the CustomerAddress.Active flag included as well. You don't need every OLTP field in your model.
Make notes about how you talk about the fields in conversation. How do you identify the current customer address? Check the CurrentAddress flag (CustomerAddress.IsActive). When was the Customer's payment? Check the Customer Payment Date (CustomerPayment.PaymentDate or possibly CustomerPayment.CreatedDate). Try to describe them in common language terms. This will provide the best success in making your model discoverable by your users and intuitive to use.
Naming the columns in the model and source as similar as possible will also help with maintenance and troubleshooting.
Also, make sure you delineate the entities properly. A customer payment would likely be in a separate dimension from the customer. The current address may be in customer, but if there is any value to historical address details, it may make sense to put it into its own dimension, with the Active flag as well.

Database table structure for price list

I have about 10 tables containing records with date ranges and some value belonging to each date range.
Each table has some meaning.
For example:
start_date DATE
end_date DATE
price DOUBLE
and
start_date DATE
end_date DATE
availability INT
and then a dates table:
day DATE
which holds a row for each day for 2 years ahead.
The final result is produced by joining these 10 tables to the dates table.
The query takes a bit longer, because there are some other joins and subqueries.
I have been thinking about creating one bigger table containing all the 10 tables data for each day, but final table would have about 1.5M - 2M records.
From testing it seems to be quicker (0.2s instead of about 1s) to search in this table instead of joining tables and searching in the joined result.
Is there any real reason why it would be a bad idea to have a table with that many records?
The final table would look like:
day DATE
price DOUBLE
availability INT
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
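The reorganization could be sketched roughly like this, assuming SQLite through Python's sqlite3 module, ISO-formatted date strings, inclusive end dates and non-overlapping ranges within each table. The table names price, availability and daily are placeholders, and only two of the ten range tables are shown.

```python
import sqlite3


def build_daily_table():
    """Materialise one row per day from two date-range tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dates (day TEXT PRIMARY KEY);
        CREATE TABLE price (start_date TEXT, end_date TEXT, price REAL);
        CREATE TABLE availability (
            start_date TEXT, end_date TEXT, availability INTEGER
        );

        -- Sample data, invented for the example.
        INSERT INTO dates VALUES
            ('2024-01-01'), ('2024-01-02'), ('2024-01-03'), ('2024-01-04');
        INSERT INTO price VALUES
            ('2024-01-01', '2024-01-02', 100.0),
            ('2024-01-03', '2024-01-04', 120.0);
        INSERT INTO availability VALUES
            ('2024-01-01', '2024-01-03', 5),
            ('2024-01-04', '2024-01-04', 0);

        -- The range joins run once here instead of in every search.
        CREATE TABLE daily AS
        SELECT d.day, p.price, a.availability
        FROM dates d
        LEFT JOIN price p
               ON d.day BETWEEN p.start_date AND p.end_date
        LEFT JOIN availability a
               ON d.day BETWEEN a.start_date AND a.end_date;

        -- Fails loudly if overlapping ranges produced duplicate days.
        CREATE UNIQUE INDEX idx_daily_day ON daily (day);
        """
    )
    return conn
```

The daily table has to be rebuilt (or patched) whenever a source range changes, which is the maintenance cost to weigh against the faster reads.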
I went down this road once and regretted it.
The fact that you project millions of rows tells me that the dates from one table don't line up with the dates from another. That forces extra boundaries for some attributes because, in a single table, all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date - my "super" table was calculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours, except that I had only 3 tables: availability, pricing and margin. The 3 were unrelated, so date ranges never aligned, leading to lots of artificial rows in the big table.

Where should I break up my user records to keep track of revisions

I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but that I will need to be able to query for out of date values. So if Jenny Smith changes her name to Jenny James I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.
A simple way of doing this is to add a deleted flag and instead of updating records you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active record, query with 'where deleted = 0'. The speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.
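Here is a minimal sketch of that flag-plus-revision approach, assuming SQLite through Python's sqlite3 module. The staff table and its column names are invented for the example, and only one of the roughly 40 fields is shown.

```python
import sqlite3


def create_staff_db():
    """Create a staff table where every revision is its own row."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE staff (
            record_id  INTEGER PRIMARY KEY AUTOINCREMENT,
            staff_id   INTEGER NOT NULL,   -- stable id of the person
            revision   INTEGER NOT NULL,
            last_name  TEXT    NOT NULL,
            updated_by TEXT    NOT NULL,
            updated_at TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
            deleted    INTEGER NOT NULL DEFAULT 0
        )
        """
    )
    conn.execute("CREATE INDEX idx_staff_deleted ON staff (deleted)")
    return conn


def save_revision(conn, staff_id, last_name, updated_by):
    """Flag the current row as deleted and insert the new revision."""
    with conn:  # both statements commit or roll back together
        latest = conn.execute(
            "SELECT COALESCE(MAX(revision), 0) FROM staff WHERE staff_id = ?",
            (staff_id,),
        ).fetchone()[0]
        conn.execute(
            "UPDATE staff SET deleted = 1 WHERE staff_id = ? AND deleted = 0",
            (staff_id,),
        )
        conn.execute(
            "INSERT INTO staff (staff_id, revision, last_name, updated_by) "
            "VALUES (?, ?, ?, ?)",
            (staff_id, latest + 1, last_name, updated_by),
        )


def find_current_by_any_name(conn, last_name):
    """Match on any revision's last name, but return the current row."""
    return conn.execute(
        """
        SELECT cur.staff_id, cur.last_name, cur.revision
        FROM staff AS cur
        WHERE cur.deleted = 0
          AND cur.staff_id IN (
              SELECT staff_id FROM staff WHERE last_name = ?
          )
        ORDER BY cur.staff_id
        """,
        (last_name,),
    ).fetchall()
```

Searching for the old name "Smith" then returns the current "James" row, because the subquery matches on any revision while the outer query keeps only the active one.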
I would use the separate table, because then you can have a unique identifier that is also the PK of the table and that all the other child records point to, which I think makes data integrity issues less likely. For instance, you have Mary Jones who has records in the address table and the email table and performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have a non-autogenerated person id and an autogenerated record id.
You also have the possibility of people forgetting to use the where deleted = 0 where clause that is needed for almost every query. (If you do use the deleted flag field, do yourself a favor and set up a view with the where deleted = 0 clause and require developers to use the view in queries, not the original table.)
With the deleted flag field you will also need a trigger to ensure one and only one record is marked as active.
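A rough sketch of the view, assuming SQLite through Python's sqlite3 module. Note that it swaps the trigger for a partial unique index, which is a different technique: it guarantees at most one active row per person, not exactly one. The person table and active_person view are invented names.

```python
import sqlite3


def create_schema():
    """Create the flagged table, a guard index and the 'active only' view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE person (
            record_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- autogenerated
            person_id INTEGER NOT NULL,                   -- not autogenerated
            last_name TEXT    NOT NULL,
            deleted   INTEGER NOT NULL DEFAULT 0
        );

        -- Partial unique index: rejects a second active row per person.
        CREATE UNIQUE INDEX idx_one_active_per_person
            ON person (person_id) WHERE deleted = 0;

        -- Developers query this view instead of the base table.
        CREATE VIEW active_person AS
            SELECT record_id, person_id, last_name
            FROM person
            WHERE deleted = 0;
        """
    )
    return conn
```

Inserting a second active row for the same person_id raises sqlite3.IntegrityError until the previous row has been flagged as deleted.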
@Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!

Best approach to views on archive data with change logs

(Sorry about the vagueness of the title; I can't think how to really say what I'm looking for without writing a book.)
So in our app, we allow users to change key pieces of data. I'm keeping records of who changed what when in a log schema, but now the problem presents itself: how do I best represent that data in a view for reporting?
An example will help: a customer's data (say, billing address) changed on 4/4/09. Let's say that today, 10/19/09, I want to see all of their 2009 orders, before and after the change. I also want each order to display the billing address that was current as of the date of the order.
So I have 4 tables:
Orders (with order data)
Customers (with current customer data)
CustomerOrders (linking the two)
CustomerChange (which holds the date of the change, who made the change (employee id), what the old billing address was, and what they changed it to)
How do I best structure a view to be used by reporting so that the proper address is returned? Or am I better served by creating a reporting database and denormalizing the data there, which is what the reports group is requesting?
There is no need for a separate DB if this is the only thing you are going to do. You could just create a de-normalized table/cube...and populate and retrieve from it. If your data is voluminous apply proper indexes on this table.
Personally I would design this so you don't need the change table for the report. It is bad practice to store an order without also storing all the data as of the date of the order. You look up the address from the address table and store it with the order (the same goes for part numbers, company names and anything else that changes over time). You never get information on an order by joining to the customer, address, part number or price tables.
Audit tables are more for fixing bad changes or looking up who made them than for reporting.
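The "store it with the order" idea might look something like this, assuming SQLite through Python's sqlite3 module. The schema is simplified for illustration: the order carries customer_id directly instead of going through CustomerOrders, and all column names are guesses.

```python
import sqlite3


def create_order_db():
    """Create customers plus an orders table that snapshots the address."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (
            customer_id     INTEGER PRIMARY KEY,
            billing_address TEXT NOT NULL        -- current value only
        );
        CREATE TABLE orders (
            order_id        INTEGER PRIMARY KEY,
            customer_id     INTEGER NOT NULL
                            REFERENCES customers (customer_id),
            order_date      TEXT NOT NULL,
            billing_address TEXT NOT NULL        -- copied at order time
        );
        """
    )
    return conn


def place_order(conn, order_id, customer_id, order_date):
    """Insert an order, copying the customer's address as it is right now."""
    with conn:
        conn.execute(
            """
            INSERT INTO orders
                (order_id, customer_id, order_date, billing_address)
            SELECT ?, customer_id, ?, billing_address
            FROM customers
            WHERE customer_id = ?
            """,
            (order_id, order_date, customer_id),
        )
```

A report over orders then reads billing_address straight from the order row, so a later change to customers does not rewrite history.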

How do you avoid adding timestamp fields to your tables?

I have a question regarding the two additional columns (timeCreated, timeLastUpdated) for each record that we see in many solutions. My question: Is there a better alternative?
Scenario: You have a huge DB (in terms of tables, not records), and then the customer comes and asks you to add "timestamping" to 80% of your tables.
I believe this can be accomplished by using a separate table (TIMESTAMPS). This table would have, in addition to the obvious timestamp column, the table name and the primary key for the table being updated. (I'm assuming here that you use an int as primary key for most of your tables, but the table name would most likely have to be a string).
To picture this suppose this basic scenario. We would have two tables:
PAYMENT :- (your usual records)
TIMESTAMP :- {current timestamp} + {TABLE_UPDATED, id_of_entry_updated, timestamp_type}
Note that in this design you don't need those two "extra" columns in your native payment object (which, by the way, might make it thru your ORM solution) because you are now indexing by TABLE_UPDATED and id_of_entry_updated. In addition, timestamp_type will tell you if the entry is for insertion (e.g "1"), update (e.g "2"), and anything else you may want to add, like "deletion".
I would like to know what do you think about this design. I'm most interested in best practices, what works and scales over time. References, links, blog entries are more than welcome. I know of at least one patent (pending) that tries to address this problem, but it seems details are not public at this time.
While you're at it, also record the user who made the change.
The flaw with the separate-table design (in addition to the join performance highlighted by others) is that it makes the assumption that every table has an identity column for the key. That's not always true.
If you use SQL Server, the new 2008 version supports something they call Change Data Capture that should take away a lot of the pain you're talking about. I think Oracle may have something similar as well.
Update: Apparently Oracle calls it the same thing as SQL Server. Or rather, SQL Server calls it the same thing as Oracle, since Oracle's implementation came first ;)
I have used a design where each table to be audited had two tables:
create table NAME (
    name_id int,
    first_name varchar,
    last_name varchar
    -- any other table/column constraints
);
create table NAME_AUDIT (
    name_audit_id int,
    name_id int,
    first_name varchar,
    last_name varchar,
    update_type char(1), -- 'U', 'D', 'C'
    update_date datetime
    -- no table constraints really, outside of name_audit_id as PK
);
A database trigger is created that populates NAME_AUDIT every time anything is done to NAME. This way you have a record of every single change made to the table, and when. The application has no real knowledge of this, since it is maintained by a database trigger.
It works reasonably well and doesn't require any changes to application code to implement.
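For illustration, the trigger side could be sketched like this, assuming SQLite through Python's sqlite3 module rather than whatever database was originally used. The trigger names are invented, and SQLite needs one trigger per operation.

```python
import sqlite3


def create_audited_name_table():
    """Create NAME, NAME_AUDIT and the triggers that keep the audit filled."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE name (
            name_id    INTEGER PRIMARY KEY,
            first_name TEXT,
            last_name  TEXT
        );
        CREATE TABLE name_audit (
            name_audit_id INTEGER PRIMARY KEY AUTOINCREMENT,
            name_id       INTEGER,
            first_name    TEXT,
            last_name     TEXT,
            update_type   TEXT,                           -- 'U', 'D', 'C'
            update_date   TEXT DEFAULT CURRENT_TIMESTAMP
        );

        CREATE TRIGGER name_after_insert AFTER INSERT ON name
        BEGIN
            INSERT INTO name_audit (name_id, first_name, last_name, update_type)
            VALUES (NEW.name_id, NEW.first_name, NEW.last_name, 'C');
        END;

        CREATE TRIGGER name_after_update AFTER UPDATE ON name
        BEGIN
            INSERT INTO name_audit (name_id, first_name, last_name, update_type)
            VALUES (NEW.name_id, NEW.first_name, NEW.last_name, 'U');
        END;

        -- On delete only the OLD row exists, so that is what gets logged.
        CREATE TRIGGER name_after_delete AFTER DELETE ON name
        BEGIN
            INSERT INTO name_audit (name_id, first_name, last_name, update_type)
            VALUES (OLD.name_id, OLD.first_name, OLD.last_name, 'D');
        END;
        """
    )
    return conn
```

Application code keeps writing to name as before, and name_audit fills up on its own.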
I think I prefer adding the timestamps to the individual tables. Joining on your timestamp table on a composite key -- one of which is a string -- is going to be slower and if you have a large amount of data it will eventually be a real problem.
Also, a lot of the time when you are looking at timestamps, it's when you're debugging a problem in your application and you'll want the data right there, rather than always having to join against the other table.
One nightmare with your design is that every single insert, update or delete would have to hit that table. This can cause major performance and locking issues. It is a bad idea to generalize a table like that (not just for timestamps). It would also be a nightmare to get the data out of.
If your code would break at the GUI level from adding fields you don't want the user to see, you are incorrectly writing the code to your GUI which should specify only the minimum number of columns you need and never select *.
The advantage of the method you suggest is that it gives you the option of adding other fields to your TIMESTAMP table, like tracking the user who made the change. You can also track edits to sensitive fields, for example who repriced this contract?
Logging record changes in a separate file means you can show multiple changes to a record, like:
mm/dd/yy hh:mm:ss Added by XXX
mm/dd/yy hh:mm:ss Field PRICE Changed by XXX
mm/dd/yy hh:mm:ss Record deleted by XXX
One disadvantage is the extra code that will perform inserts into your TIMESTAMPS table to reflect changes in your main tables.
If you set up the time-stamp stuff to run off of triggers, then any action that can set off a trigger (Reads?) can be logged. Also there might be some locking advantages.
(Take all that with a grain of salt, I'm no DBA or SQL guru)
Yes, I like that design, and use it with some systems. Usually, some variant of:
LogID int
Action varchar(1) -- ADDED (A)/UPDATED (U)/DELETED (D)
UserID varchar(20) -- UserID of culprit :)
Timestamp datetime -- Date/Time
TableName varchar(50) -- Table Name or Stored Procedure ran
UniqueID int -- Unique ID of record acted upon
Notes varchar(1000) -- Other notes Stored Procedure or Application may provide
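A log table along these lines could be exercised as follows. This is only a sketch, assuming SQLite through Python's sqlite3 module; the snake_case column names and the helpers log_change and history are made up for the example.

```python
import sqlite3


def create_change_log(conn):
    """Create one generic log table shared by all audited tables."""
    conn.execute(
        """
        CREATE TABLE change_log (
            log_id     INTEGER PRIMARY KEY AUTOINCREMENT,
            action     TEXT    NOT NULL CHECK (action IN ('A', 'U', 'D')),
            user_id    TEXT    NOT NULL,
            logged_at  TEXT    NOT NULL DEFAULT CURRENT_TIMESTAMP,
            table_name TEXT    NOT NULL,
            unique_id  INTEGER NOT NULL,
            notes      TEXT
        )
        """
    )
    # Lookups are always "everything that happened to this record".
    conn.execute(
        "CREATE INDEX idx_change_log_target "
        "ON change_log (table_name, unique_id)"
    )


def log_change(conn, action, user_id, table_name, unique_id, notes=""):
    """Append one entry: 'A' added, 'U' updated, 'D' deleted."""
    with conn:
        conn.execute(
            "INSERT INTO change_log "
            "(action, user_id, table_name, unique_id, notes) "
            "VALUES (?, ?, ?, ?, ?)",
            (action, user_id, table_name, unique_id, notes),
        )


def history(conn, table_name, unique_id):
    """Return (action, user_id, notes) for one record, oldest first."""
    return conn.execute(
        "SELECT action, user_id, notes FROM change_log "
        "WHERE table_name = ? AND unique_id = ? ORDER BY log_id",
        (table_name, unique_id),
    ).fetchall()
```

The composite index on (table_name, unique_id) is what keeps the per-record lookup from scanning the whole log as it grows.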
I think the extra joins you will have to perform to get the Timestamps will be a slight performance hit and a pain in the neck. Other than that I see no problem.
We did exactly what you did. It is great for the object model and for the ability to add new stamps and different types of stamps to our model with minimal code. We were also tracking the user that made the change, and a lot of our logic was heavily based on these stamps. It worked very well.
One drawback is reporting, and/or showing a lot of different stamps on one screen. Done the way we did it, this caused a lot of joins. Also, back-ending changes was a pain.
Our solution is to maintain a "Transaction" table, in addition to our "Session" table. UPDATE, INSERT and DELETE instructions are all managed through a "Transaction" object, and each of these SQL instructions is stored in the "Transaction" table once it has been successfully executed on the database. This "Transaction" table has other fields such as transactionType (I for INSERT, D for DELETE, U for UPDATE), transactionDateTime, etc., and a foreign key "sessionId" that finally tells us who sent the instruction. It is even possible, through some code, to identify who did what and when (Gus created the record on Monday, Tim changed the Unit Price on Tuesday, Liz added an extra discount on Thursday, etc.).
Pros for this solution are:
you're able to tell "what who and when", and to show it to your users! (you'll need some code to analyse SQL statements)
if your data is replicated, and replication fails, you can rebuild your database through this table
Cons are
100 000 data updates per month mean 100 000 records in Tbl_Transaction
Finally, this table tends to be 99% of your database volume
Our choice: all records older than 90 days are automatically deleted every morning
Don't simply delete those older than 90 days. Move them first to a separate DB or write them to a text file; do something to preserve them, just move them out of the main production DB.
If it ever comes down to it, most often it is a case of "he with the most documentation wins"!
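A sketch of "move, then delete", assuming SQLite through Python's sqlite3 module. To stay self-contained it uses an archive table in the same database; a separate database file could be attached with ATTACH DATABASE instead. The table and column names are invented.

```python
import sqlite3


def create_tables(conn):
    """Create the live transaction table and an identical archive table."""
    conn.executescript(
        """
        CREATE TABLE tbl_transaction (
            transaction_id       INTEGER PRIMARY KEY,
            transaction_type     TEXT NOT NULL,   -- 'I', 'U' or 'D'
            transaction_datetime TEXT NOT NULL,   -- 'YYYY-MM-DD HH:MM:SS'
            sql_text             TEXT NOT NULL
        );
        CREATE TABLE tbl_transaction_archive (
            transaction_id       INTEGER PRIMARY KEY,
            transaction_type     TEXT NOT NULL,
            transaction_datetime TEXT NOT NULL,
            sql_text             TEXT NOT NULL
        );
        """
    )


def archive_old_transactions(conn, days=90):
    """Copy rows older than `days` to the archive, then delete them."""
    # Compute the cutoff once so both statements see the same boundary.
    cutoff = conn.execute(
        "SELECT datetime('now', ?)", (f"-{int(days)} days",)
    ).fetchone()[0]
    with conn:  # copy and delete succeed or fail as one transaction
        conn.execute(
            "INSERT INTO tbl_transaction_archive "
            "SELECT * FROM tbl_transaction WHERE transaction_datetime < ?",
            (cutoff,),
        )
        deleted = conn.execute(
            "DELETE FROM tbl_transaction WHERE transaction_datetime < ?",
            (cutoff,),
        ).rowcount
    return deleted
```

Running the copy and the delete in one transaction means a failure halfway through cannot lose rows.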
