Best approach to views on archive data with change logs - sql-server

(Sorry about the vagueness of the title; I can't think how to really say what I'm looking for without writing a book.)
So in our app, we allow users to change key pieces of data. I'm keeping records of who changed what when in a log schema, but now the problem presents itself: how do I best represent that data in a view for reporting?
An example will help: a customer's data (say, billing address) changed on 4/4/09. Let's say that today, 10/19/09, I want to see all of their 2009 orders, before and after the change. I also want each order to display the billing address that was current as of the date of the order.
So I have 4 tables:
Orders (with order data)
Customers (with current customer data)
CustomerOrders (linking the two)
CustomerChange (which holds the date of the change, who made the change (employee id), what the old billing address was, and what they changed it to)
How do I best structure a view to be used by reporting so that the proper address is returned? Or am I better served by creating a reporting database and denormalizing the data there, which is what the reports group is requesting?
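For reference, the as-of lookup described above can be sketched as a view over the four tables. This is only a minimal sketch: it uses Python's built-in sqlite3 so it can be run anywhere, and the column names (CustomerId, OrderDate, OldBillingAddress and so on) are assumptions, since the question does not give them. The idea is that the address in effect on the order date is the old value recorded by the first change made after the order, or the current address if there was no later change. On SQL Server the correlated subquery would typically use TOP 1 or OUTER APPLY instead of LIMIT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical column names; the question only names the tables.
CREATE TABLE Customers (CustomerId INTEGER PRIMARY KEY, BillingAddress TEXT);
CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, OrderDate TEXT);
CREATE TABLE CustomerOrders (CustomerId INTEGER, OrderId INTEGER);
CREATE TABLE CustomerChange (
    CustomerId INTEGER, ChangeDate TEXT, EmployeeId INTEGER,
    OldBillingAddress TEXT, NewBillingAddress TEXT
);

-- Address as of the order date: the old value from the first change made
-- after the order, or the current address if no later change exists.
CREATE VIEW OrderBillingAddress AS
SELECT o.OrderId, o.OrderDate,
       COALESCE(
           (SELECT ch.OldBillingAddress
              FROM CustomerChange ch
             WHERE ch.CustomerId = co.CustomerId
               AND ch.ChangeDate > o.OrderDate
             ORDER BY ch.ChangeDate
             LIMIT 1),
           c.BillingAddress) AS BillingAddressAtOrder
  FROM Orders o
  JOIN CustomerOrders co ON co.OrderId = o.OrderId
  JOIN Customers c ON c.CustomerId = co.CustomerId;

-- Sample data: one address change on 2009-04-04, one order on each side of it.
INSERT INTO Customers VALUES (1, '2 New Avenue');
INSERT INTO Orders VALUES (1, '2009-02-10'), (2, '2009-06-01');
INSERT INTO CustomerOrders VALUES (1, 1), (1, 2);
INSERT INTO CustomerChange VALUES (1, '2009-04-04', 42, '1 Old Street', '2 New Avenue');
""")

rows = conn.execute(
    "SELECT OrderId, BillingAddressAtOrder FROM OrderBillingAddress ORDER BY OrderId"
).fetchall()
print(rows)
```

With this sample data the February order reports the old address and the June order the new one. An order placed on the change date itself is treated as using the new address, which is a choice the sketch makes rather than something the question specifies.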

There is no need for a separate DB if this is the only thing you are going to do. You could just create a denormalized table or cube in the existing database, populate it, and report from it. If the data is voluminous, apply proper indexes to this table.

Personally, I would design this so that you don't need the change table for the report. It is bad practice to store an order without also storing all of its data as of the order date. You look up the address from the address table and store it with the order (the same goes for part numbers, company names, and anything else that changes over time). You never get information on an order by joining to the customer, address, part number, or price tables.
Audit tables are more for fixing bad changes or looking up who made them than for reporting.
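A minimal sketch of that "store it with the order" approach, using Python's built-in sqlite3 so it is runnable as-is. The table layout, column names and the place_order helper are assumptions made for illustration, not the asker's actual schema. The billing address is copied onto the order row at insert time, so a later change to the customer does not alter what the order reports.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical schema: the order carries its own copy of the address.
CREATE TABLE Customers (
    CustomerId     INTEGER PRIMARY KEY,
    BillingAddress TEXT NOT NULL
);
CREATE TABLE Orders (
    OrderId        INTEGER PRIMARY KEY,
    CustomerId     INTEGER NOT NULL REFERENCES Customers(CustomerId),
    OrderDate      TEXT NOT NULL,
    BillingAddress TEXT NOT NULL   -- snapshot taken when the order is placed
);
""")


def place_order(conn, customer_id, order_date):
    """Insert an order, copying the customer's current billing address."""
    conn.execute(
        "INSERT INTO Orders (CustomerId, OrderDate, BillingAddress) "
        "SELECT CustomerId, ?, BillingAddress FROM Customers WHERE CustomerId = ?",
        (order_date, customer_id),
    )


conn.execute("INSERT INTO Customers VALUES (1, '1 Old Street')")
place_order(conn, 1, "2009-02-10")

# The customer moves; only the Customers row changes.
conn.execute("UPDATE Customers SET BillingAddress = '2 New Avenue' WHERE CustomerId = 1")
place_order(conn, 1, "2009-06-01")

snapshot = conn.execute(
    "SELECT OrderDate, BillingAddress FROM Orders ORDER BY OrderDate"
).fetchall()
print(snapshot)
```

The report then reads the address straight from Orders, with no join to Customers or to the change log.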

Related

How to deal with similar fields across SQL Server database tables in a dimension

I am working on a data warehouse solution, and I am trying to build a dimensional model from tables held in a SQL Server database. Some of the tables include but aren't limited to Customer, Customer Payments, Customer Address, etc.
All these tables in the DB have some fields that are repeated across each table, e.g. record update date, record creation date, active flag, closed flag, and a few others. These tables all relate to the Customer in some way, but the tables can be updated independently.
I am in the process of building out a dimension(s) on the back of these tables, but I am struggling to see how best to deal with these repeated fields in an elegant way, as they are all used.
I'd appreciate any guidance from people who have experience with scenarios like this, as I am just starting out.
If more details are needed, I am happy to provide them.
Thanks

Before you even consider how to include them, ask whether those metadata fields need to be in your dimensional model at all. If no one will use the Customer Payment Update Date (vs Created Date or Payment Date), don't bring it into your model. If the customer model includes the current address, you won't need the CustomerAddress.Active flag included as well. You don't need every OLTP field in your model.
Make notes about how you talk about the fields in conversation. How do you identify the current customer address? Check the CurrentAddress flag (CustomerAddress.IsActive). When was the Customer's payment? Check the Customer Payment Date (CustomerPayment.PaymentDate or possibly CustomerPayment.CreatedDate). Try to describe them in common language terms. This will provide the best success in making your model discoverable by your users and intuitive to use.
Naming the columns in the model and source as similar as possible will also help with maintenance and troubleshooting.
Also, make sure you delineate the entities properly. A customer payment would likely be in a separate dimension from the customer. The current address may be in customer, but if there is any value to historical address details, it may make sense to put it into its own dimension, with the Active flag as well.

Database design for a yearly updated database (once a year)

I have a large database which will only be updated once a year. Every year of data will use the same schema (the data will not be adding any new variables). There's a 'main' table where most of the customer information lives. To keep track of what happens from year to year, is it better design to put a field in the main customer table that says what year it is, or have a 'year' table that relates to the main customer table?

I recommend having a year field in the customer table; that way it is all together. You could even use a timestamp to automatically record the date of user sign-up.

To really answer, we'd need to see your schema, but it is almost never the right choice to make a new table for a new year. You probably want to relate years to customers.

Usually you would split off your archive data because you are doing OLTP work on your current data: you mostly work on current data and only sometimes look at old data. But it seems you have very few updates. I guess the main driver is your queries: what they usually do, and what performance you need from them. It's probably easier for you to have everything in one table with a year column. But if most of your queries are for the current year and they are tight on performance, you may want to look at splitting the current data out, either using physical tables or by partitioning the table (depending on the DB, some can do this for you while still presenting a single table).
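The single-table option can be sketched roughly as follows, again with Python's built-in sqlite3. The table name customer_year, its columns, and the customer_current view are all hypothetical; the point is only that the year becomes part of the key, and "current year" queries go through a view rather than a separate table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical layout: one row per customer per yearly load.
CREATE TABLE customer_year (
    customer_id INTEGER NOT NULL,
    data_year   INTEGER NOT NULL,
    segment     TEXT NOT NULL,
    PRIMARY KEY (customer_id, data_year)
);
CREATE INDEX ix_customer_year_year ON customer_year (data_year);

-- Most queries want the latest load only, so give them a view.
CREATE VIEW customer_current AS
SELECT customer_id, data_year, segment
  FROM customer_year
 WHERE data_year = (SELECT MAX(data_year) FROM customer_year);

INSERT INTO customer_year VALUES
    (1, 2008, 'basic'), (1, 2009, 'premium'), (2, 2009, 'basic');
""")

# Current-year data for everyone.
current = conn.execute(
    "SELECT customer_id, segment FROM customer_current ORDER BY customer_id"
).fetchall()

# Year-over-year history for one customer.
history = conn.execute(
    "SELECT data_year, segment FROM customer_year "
    "WHERE customer_id = 1 ORDER BY data_year"
).fetchall()
print(current, history)
```

If the current-year queries later turn out to be too slow, the view can be repointed at a partitioned or separate table without changing the queries that use it.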

Indicating primary/default record in database

I've been struggling with how I should indicate that a certain record in a database is the "fallback" or default entry. I've also been struggling with how to reduce my problem to a simple problem statement. I'm going to have to provide an example.
Suppose that you are building a very simple shipping application. You'll take orders and will need to decide which warehouse to ship them from.
Let's say that you have a few cities that have their own dedicated warehouses*; if an order comes in from one of those cities, you'll ship from that city's warehouse. If an order comes in from any other city, you want to ship from a certain other warehouse. We'll call that certain other warehouse the fallback warehouse.
You might decide on a schema like this:
Warehouses
    WarehouseId
    Name
WarehouseCities
    WarehouseId
    CityName
The solution must enforce zero or one fallback warehouses.
You need a way to indicate which warehouse should be used if there aren't any warehouses specified for the city in question. If it really matters, you're doing this on SQL Server 2008.
EDIT: To be clear, all valid cities are NOT present in the WarehouseCities table. It is possible for an order to be received for a City not listed in WarehouseCities. In such a case, we need to be able to select the fallback warehouse.
If any number of default warehouses were allowed, or if I was assigning default warehouses to, say, states, I would use a DefaultWarehouse table. I could use such a table here, but I would need to limit it to exactly one row, which doesn't feel right.
How would you indicate the fallback warehouse?
*Of course, in this example we discount the possibility that there might be multiple cities with the same name. The country you are building this application for rigorously enforces a uniqueness constraint on all city names.

I understand your problem, but have questions about parts of it, so I'll be a bit more general.
If at all possible, I would store warehouse/backup warehouse data with your inventory data (either hanging directly off warehouses or, if it's product specific, off the inventory tables).
If the setup has to be calculated through your business logic, then the records should hang off the order/order_item table.
In terms of how to implement the structure in SQL, I'll assume that all orders ship out of a single warehouse and that the shipping must be hung off the orders table (but the ideas should be applicable elsewhere):
The older way to enforce zero/one backup warehouses would be to hang a Warehouse_Source record off the Orders table and include an "IsPrimary" or "ShippingPriority" field, then add a composite unique index that includes OrderID and IsPrimary/ShippingPriority.
If you will only ever have one backup warehouse, you could add ShippingSource_WareHouseID and ShippingSource_Backup_WareHouseID fields to the order, although this isn't the route I would go.
In SQL 2008 and up we have the wonderful addition of Filtered Indexes. These allow you to add a WHERE clause to your index -- resulting in a more compact index. It also has the added benefit of allowing you to accomplish some things that could only be done through triggers in the past.
You could put a Unique filtered index on OrderID & IsPrimary/ShippingPriority (WHERE IsPrimary = 0).
Add a comment or such if you want me to explain further.
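As a rough illustration of the filtered-index idea, here is a sketch applied to the question's Warehouses table rather than to an orders table. It runs on Python's built-in sqlite3, whose partial indexes play the same role as SQL Server filtered indexes here; the IsFallback column and the index name are assumptions, not part of the original schema. The unique index covers only rows where IsFallback = 1, so any number of ordinary warehouses are allowed but at most one fallback.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Warehouses (
    WarehouseId INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL,
    -- Hypothetical flag column: 1 marks the fallback warehouse.
    IsFallback  INTEGER NOT NULL DEFAULT 0 CHECK (IsFallback IN (0, 1))
);

-- Unique only over the flagged rows: zero or one fallback, never two.
CREATE UNIQUE INDEX ux_warehouses_one_fallback
    ON Warehouses (IsFallback) WHERE IsFallback = 1;

INSERT INTO Warehouses (Name, IsFallback) VALUES
    ('Springfield Depot', 0), ('Capital City Depot', 0), ('Central Depot', 1);
""")

# A second fallback row violates the partial unique index.
try:
    conn.execute("INSERT INTO Warehouses (Name, IsFallback) VALUES ('Rogue Depot', 1)")
    second_fallback_rejected = False
except sqlite3.IntegrityError:
    second_fallback_rejected = True

fallback_count = conn.execute(
    "SELECT COUNT(*) FROM Warehouses WHERE IsFallback = 1"
).fetchone()[0]
print(second_fallback_rejected, fallback_count)
```

On SQL Server 2008 the equivalent would be a CREATE UNIQUE INDEX statement with a WHERE clause on the same column.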

re: how I should indicate that a certain record in a database is the "fallback" or default entry
use another column, isFallback, holding a binary value. I'm assuming your fallback warehouse won't have any cities associated with it.
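A minimal sketch of how a lookup might use such a flag, written against Python's built-in sqlite3. The warehouse_for_city helper and the sample rows are hypothetical. It first looks for a warehouse dedicated to the city and only falls back to the flagged warehouse when none is listed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Warehouses (
    WarehouseId INTEGER PRIMARY KEY,
    Name        TEXT NOT NULL,
    isFallback  INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE WarehouseCities (
    WarehouseId INTEGER NOT NULL REFERENCES Warehouses(WarehouseId),
    CityName    TEXT NOT NULL UNIQUE
);
INSERT INTO Warehouses VALUES (1, 'Springfield Depot', 0), (2, 'Central Depot', 1);
INSERT INTO WarehouseCities VALUES (1, 'Springfield');
""")


def warehouse_for_city(conn, city_name):
    """Return the dedicated warehouse name for a city, else the fallback."""
    row = conn.execute(
        "SELECT w.Name FROM Warehouses w "
        "JOIN WarehouseCities wc ON wc.WarehouseId = w.WarehouseId "
        "WHERE wc.CityName = ?",
        (city_name,),
    ).fetchone()
    if row is None:
        # City not listed in WarehouseCities: use the flagged warehouse, if any.
        row = conn.execute(
            "SELECT Name FROM Warehouses WHERE isFallback = 1"
        ).fetchone()
    return row[0] if row else None


dedicated = warehouse_for_city(conn, "Springfield")
fallback = warehouse_for_city(conn, "Shelbyville")
print(dedicated, fallback)
```

The flag on its own does not stop two rows from being marked as fallback, so it still needs a uniqueness rule of some kind alongside it.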

As I see it, the fallback warehouse after all is just another warehouse and if I understood, every record in WarehouseCities has a reference to one record in Warehouses:
WarehouseCities(*)...(1)Warehouses
Which means that if there are a hundred cities without a dedicated warehouse, they will all reference the id of a specific fallback warehouse. So I don't see any problem (which makes me think I didn't understand the problem); the model even looks well defined.
Now you could identify whether a warehouse is the fallback warehouse with an attribute like type_warehouse on Warehouses.
EDIT after comment
Assuming there is only one fallback warehouse for the cities not present in WarehouseCities, I suggest keeping the fallback warehouse as just another warehouse and keeping its Id (WarehouseId) as an application parameter (in a table for parameters, maybe?). Of course, this solution is programmatic and not tied to your database platform.

Where should I break up my user records to keep track of revisions

I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but that I will need to be able to query for out of date values. So if Jenny Smith changes her name to Jenny James I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.

A simple way of doing this is to add a deleted flag and instead of updating records you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active record, query with 'where deleted = 0'. The speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.
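The pattern above might look roughly like this, sketched with Python's built-in sqlite3. The staff table, its columns, and both helper functions are assumptions for illustration. A revision marks the existing row as deleted and inserts a new row with the next revision number, and a search on any past name resolves to the person's current row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical table: one row per revision of a staff member.
CREATE TABLE staff (
    record_id  INTEGER PRIMARY KEY,      -- autogenerated, one per revision
    staff_id   INTEGER NOT NULL,         -- stable id shared by all revisions
    revision   INTEGER NOT NULL,
    last_name  TEXT NOT NULL,
    updated_by TEXT NOT NULL,
    updated_at TEXT NOT NULL,
    deleted    INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX ix_staff_active ON staff (staff_id, deleted);
""")


def revise_last_name(conn, staff_id, new_last_name, updated_by, updated_at):
    """Retire the active row and insert the next revision in one transaction."""
    with conn:
        (revision,) = conn.execute(
            "SELECT revision FROM staff WHERE staff_id = ? AND deleted = 0",
            (staff_id,),
        ).fetchone()
        conn.execute(
            "UPDATE staff SET deleted = 1 WHERE staff_id = ? AND deleted = 0",
            (staff_id,),
        )
        conn.execute(
            "INSERT INTO staff (staff_id, revision, last_name, updated_by, updated_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (staff_id, revision + 1, new_last_name, updated_by, updated_at),
        )


def find_current_by_any_name(conn, last_name):
    """Match against every revision, but return only the active row."""
    return conn.execute(
        "SELECT staff_id, last_name, revision FROM staff "
        "WHERE deleted = 0 "
        "AND staff_id IN (SELECT staff_id FROM staff WHERE last_name = ?)",
        (last_name,),
    ).fetchall()


conn.execute(
    "INSERT INTO staff (staff_id, revision, last_name, updated_by, updated_at) "
    "VALUES (1, 1, 'Smith', 'hr_admin', '2009-01-15')"
)
revise_last_name(conn, 1, "James", "hr_admin", "2010-06-01")

found = find_current_by_any_name(conn, "Smith")
print(found)
```

Searching for the old name Smith returns the current James row, and no earlier row is ever removed.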

I would use the separate table because then you can have a unique identifier that points to all the other child records that is also the PK of the table which I think makes it less likely you will have data integrity issues. For instance, you have Mary Jones who has records in the address table and the email table and performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have a non-autogenerated person id and an autogenerated record id.
You also have the possibility of people forgetting the where deleted = 0 clause that is needed for almost every query. (If you do use the deleted flag field, do yourself a favor: set up a view with where deleted = 0 and require developers to use the view in queries, not the original table.)
With the deleted flag field you will also need a trigger to ensure one and only one record is marked as active.
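A sketch of the view suggested above, using Python's built-in sqlite3, with hypothetical table and view names. It also shows one possible alternative to a trigger, which is a swapped-in technique rather than what this answer proposes: a unique partial index (the SQLite counterpart of a SQL Server filtered index) over the active rows. Note that the index only guarantees at most one active row per person; it cannot guarantee that one exists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE staff_revision (
    record_id INTEGER PRIMARY KEY,
    staff_id  INTEGER NOT NULL,
    last_name TEXT NOT NULL,
    deleted   INTEGER NOT NULL DEFAULT 0
);

-- Developers query this view, so the deleted = 0 filter cannot be forgotten.
CREATE VIEW current_staff AS
SELECT staff_id, last_name FROM staff_revision WHERE deleted = 0;

-- At most one active row per staff member.
CREATE UNIQUE INDEX ux_staff_one_active
    ON staff_revision (staff_id) WHERE deleted = 0;

INSERT INTO staff_revision (staff_id, last_name) VALUES (1, 'Smith');
UPDATE staff_revision SET deleted = 1 WHERE staff_id = 1 AND deleted = 0;
INSERT INTO staff_revision (staff_id, last_name) VALUES (1, 'James');
""")

active = conn.execute("SELECT staff_id, last_name FROM current_staff").fetchall()

# Inserting a second active row for the same person is refused by the index.
try:
    conn.execute("INSERT INTO staff_revision (staff_id, last_name) VALUES (1, 'Jones')")
    duplicate_active_rejected = False
except sqlite3.IntegrityError:
    duplicate_active_rejected = True
print(active, duplicate_active_rejected)
```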

@Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!

Database design query

I'm trying to work out a sensible approach for designing a database where I need to store a wide range of continuously changing information about pets. The categories of data can be broken down into, for example, behaviour, illness etc. Data will be submitted on a regular basis relating to these categories, so I need to find a good way to design the db to efficiently accommodate this. A simple approach would be just to store multiple records for each pet within each relevant table, e.g. the behaviour table would store the behaviour data and would simply have a timestamp for each record along with the identifier for that pet. When querying the db, it would be straightforward to query the one table with the pet id, using the timestamps to output the correct history of submissions. Is there a more sensible way around this, or does that make sense?

I would use a combination of lookup tables with a strong use of foreign keys. I think what you are suggesting is very common. For example, "get me all the reported illnesses for a particular pet during this date range" would look something like:
-- @pet_id, @range_start and @range_end are parameters supplied by the caller
SELECT *
FROM table_illness
WHERE table_illness.pet_id = @pet_id
  AND table_illness.start_date <= @range_end      -- illness began before the range ends
  AND table_illness.finish_date >= @range_start   -- and ended after the range begins
You could do that for any of the tables. The lookup tables will be a link between, for example, table_illness.illness_type and illness_types.illness_type. The illness_types table is where you would store the details on the types of illnesses.

When designing a database you should build your tables to mimic real-life objects or concepts. So in that sense the design you suggest makes sense. Each pet should have its own record in a pet table which doesn't change. Changing information should then be placed into the appropriate table which has the pet's id. The time stamp method you suggest is probably what I would do -- unless of course this is for a vet or something. Then I'd create an appointment table with the date and connect the illness or behavior to the appointment as well.
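The timestamp approach discussed here can be sketched as follows with Python's built-in sqlite3. The pet and behaviour tables and their columns are assumptions; the question does not give a schema. Each submission is a new row tied to the pet by a foreign key, and the history is simply the pet's rows in timestamp order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
conn.executescript("""
-- One unchanging row per pet.
CREATE TABLE pet (
    pet_id INTEGER PRIMARY KEY,
    name   TEXT NOT NULL
);

-- One row per submission; nothing is ever updated in place.
CREATE TABLE behaviour (
    behaviour_id INTEGER PRIMARY KEY,
    pet_id       INTEGER NOT NULL REFERENCES pet(pet_id),
    recorded_at  TEXT NOT NULL,
    notes        TEXT NOT NULL
);
CREATE INDEX ix_behaviour_pet_time ON behaviour (pet_id, recorded_at);

INSERT INTO pet VALUES (1, 'Rex');
INSERT INTO behaviour (pet_id, recorded_at, notes) VALUES
    (1, '2009-03-12', 'anxious at night'),
    (1, '2009-01-05', 'calm');
""")

# Full submission history for one pet, oldest first.
history = conn.execute(
    "SELECT recorded_at, notes FROM behaviour WHERE pet_id = ? ORDER BY recorded_at",
    (1,),
).fetchall()
print(history)
```

An illness table, or an appointment table that the illness and behaviour rows point to, would follow the same shape.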