database archiving vs timeperiod based tables/fields - database

I am working on an employee objectives web application.
Lead/Manager sets objectives for team members after discussing with them. This is an yearly/half-yearly/quarterly depending on appraisal cycle the organization follows.
Now question is is better approach to add time period based fields or archive previous quarter's/year's data. When a user want to see previous objectives (not so frequent activity), the archive that belongs to that date may be restored in some temp table and shown to employee.
Points to start with
archiving: reduces db size, results in simpler db queries, adds an overhead when someone tried to see old data.
time-period based field/tables: one or more extra joins in queries, previous data is treated similar to current data so no overhead in retrieving old data.
PS: it is not space cost, my point is if we can achieve some optimization in terms of performance, as this is a web app and at peak times all the employees in an organization will be looking/updating it. so removing time period makes my queries a lot simpler.
Thanks

Assuming you're talking about data that changes over time, as opposed to logging-type data, then my preferred approach is to keep only the "latest" version of the data in your primary table(s), and to automatically copy the previous version of the data into a archive table. This archive table would mirror the primary, with the addition of versioned fields, such as timestamps. This archiving can be done with a trigger.
The main benefit that I see with this approach is that it doesn't compromise your database design. In particular, you don't have to worry about using composite keys that incorporate the version fields (in fact using time-based fields as keys may not even be permitted by your database).
If you need to go and look at the old data, you can run a select against the archive table and add version constraints to the query.

I would start off adding your time period fields and waiting until size becomes an issue. The kind of data you are describing does not sound like it is going to consume a lot of storage space.
Should it grow uncontrollably you can always look at the archive approach later - but the coding is going to take much longer than simply storing the relevant period with your data.

It seems to me that if you have the requirement that a user can look arbitrarily far back in the past, then you really must keep the data accessible.
This just won't be sustainable:
the archive that belongs to that date may be restored in some temp table and shown to employee.
My recommendation would be to periodically (read when absolutely necessary) move 'very old' data to another table for this purpose. Disk space is extremely cheap at this point, so keeping that data around is not nearly as expensive as implementing the system that can go back to an arbitrary time and restore an archive.

Related

Keeping history of data revisions - best practice?

Consider a database with several (3-4) tables with a lot of columns (from 15 to 40). In each table we have several thousand records generated per year and about a dozen of changes made for each record.
Right now we need to add a following functionality to our system: every time user makes a change to the record of one of our tables, the system needs to keep track of it - we need to have complete history of changes and also be able to restore row data to selected point.
For some reasons we cannot keep "final" and "historic" data in the same table (so we cannot add some columns to our tables to keep some kind of versioning information, i.e. like wordpress does when it comes to keeping edit history of posts).
What would be best approach to this problem? I was thinking about two solutions:
For each tracked table we have a mirror table with the same columns, and with additional columns where we keep information about versions (i.e. timestamps, id of "original" row etc...)
Pros:
we have data stored exactly in the same way it was in original tables
whenever we need to add a new column to the original table, we can do the same to mirror table
Cons:
we need to create one additional mirror table for each tracked table.
We create one table for "history" revisions. We keep some revisioning information like timestamps etc., and also we keep the track from which table the data originates. But the original data row is being stored in large text column in JSON.
Pros:
we have only one history table for all tracked tables
we don't need to create new mirror tables every time we add new tracked table,
Cons:
there can be some backward compatibility issues while trying to restore data after structure of the original table was changed (i.e. new column was added)
Maybe some other solution?
What would be the best way of keeping the history of versions in such system?
Additional information:
each of the tracked tables can change in the future (i.e. new columns added),
number of tracked tables can change in the future (i.e. new tables added).
FYI: we are using laravel 5.3 and mysql database.
How often do you need access to the auditing data? Is cost of storage ever a concern? Do you need it in the same system that you need the normal data?
Basically, having a table called foo and a second table called foo_log isn't uncommon. It also lets you store foo_log somewhere differently, even possibly a secondary DB. If foo_log is on a spindle disk and foo is on flash, you still get fast reads, but you get somewhat cheaper storage of the backups.
If you don't ever need to display this data, and just need it for legal reasons, or to figure out how something went wrong, the single-table isn't a terrible plan.
But if the issue is backups, which it sounds like it might be, why not just backup the MySQL database on a regular basis and store the backups elsewhere?

What type of fact table / loading solution for a reservation system?

Background
I am designing a Data Warehouse with SQL Server 2012 and SSIS. The source system handles hotel reservations. The reservations are split between two tables, header and header line item. The Fact table would be at the line item level with some data from the header.
The issue
The challenge I have is that the reservation (and its line items) can change over time.
An example would be:
The booking is created.
A room is added to the booking (as a header line item).
The customer arrives and adds food/drinks to their reservation (more line items).
A payment is added to the reservation (as a line item).
A room could be subsequently cancelled and removed from the booking (a line item is deleted).
The number of people in a room can change, affecting that line item.
The booking status changes from "Provisional" to "Confirmed" at a point in its life cycle.
Those last three points are key, I'm not sure how I would keep that line updated without looking up the record and updating it. The business would like to keep track of the updates and deletions.
I'm resisting updating because:
From what I've read about Fact tables, its not good practice to revisit rows once they've been written into the table.
I could do this with a look-up component but with upward of 45 million rows, is that the best approach?
The questions
What type of Fact table or loading solution should I go for?
Should I be updating the records, if so how can I best do that?
I'm open to any suggestions!
Additional Questions (following answer from ElectricLlama):
The fact does have a 1:1 relationship with the source. You talk about possible constraints on future development. Would you be able to elaborate on the type of constraints I would face?
Each line item will have a modified (and created date). Are you saying that I should delete all records from the fact table which have been modified since the last import and add them again (sounds logical)?
If the answer to 2 is "yes" then for auditing purposes would I write the current fact records to a separate table before deleting them?
In point one, you mention deleting/inserting the last x days bookings based on reservation date. I can understand inserting new bookings. I'm just trying to understand why I would delete?
If you effectively have a 1:1 between the source line and the fact, and you store a source system booking code in the fact (no dimensional modelling rules against that) then I suggest you have a two step load process.
delete/insert the last x days bookings based on reservation date (or whatever you consider to be the primary fact date),
delete/insert based on all source booking codes that have changed (you will of course have to know this beforehand)
You just need to consider what constraints this puts on future development, i.e. when you get additional source systems to add, you'll need to maintain the 1:1 fact to source line relationship to keep your load process consistent.
I've never updated a fact record in a dataload process, but always delete/insert a certain data domain (i.e. that domain might be trailing 20 days or source system booking code). This is effectively the same as an update but also takes cares of deletes.
With regards to auditing changes in the source, I suggest you write that to a different table altogether, not the main fact, as it's purpose will be audit, not analysis.
The requirement to identify changed records in the source (for data loads and auditing) implies you will need to create triggers and log tables in the source or enable native SQL Server CDC if possible.
At all costs avoid using the SSIS lookup component as it is totally ineffective and would certainly be unable to operate on 45 million rows.
Stick with the 'insert/delete a data portion' approach as it lends itself to SSIS ability to insert/delete (and its inability to efficiently update or lookup)
In answer to the follow up questions:
1:1 relationship in fact
What I'm getting at is you have no visibility on any future systems that need to be integrated, or any visibility on what future upgrades to your existing source system might do. This 1:1 mapping introduces a design constraint (its not really a constraint, more a framework). Thinking about it, any new system does not need to follow this particular load design, as long as it's data arrive in the fact consistently. I think implementing this 1:1 design is a good idea, just trying to consider any downside.
If your source has a reliable modified date then you're in luck as you can do a differential load - only load changed records. I suggest you:
Load all recently modified records (last 5 days?) into a staging table
Do a DELETE/INSERT based on the record key. Do the delete inside SSIS in an execute SQL task, don't mess about with feeding data flows into row-by-row delete statements.
Audit table:
The simplest and most accurate way to do this is simply implement triggers and logs in the source system and keep it totally separate to your star schema.
If you do want this captured as part of your load process, I suggest you do a comparison between your staging table and the existing audit table and only write new audit rows, i.e. reservation X last modified date in the audit table is 2 Apr, but reservation X last modified date in the staging table is 4 Apr, so write this change as a new record to the audit table. Note that if you do a daily load, any changes in between won't be recorded, that's why I suggest triggers and logs in the source.
DELETE/INSERT records in Fact
This is more about ensuring you have an overlapping window in your load process, so that if the process fails for a couple of days (as they always do), you have some contingency there, and it will seamlessly pick the process back up once it's working again. This is not so important in your case as you have a modified date to identify differential changes, but normally for example I would pick a transaction date and delete, say 7 trailing days. This means that my load process can be borken for 6 days, and if I fix it by the seventh day everything will reload properly without needing extra intervention to load the intermediate days.
I would suggest having a deleted flag and update that instead of deleting. Your performance will also be better.
This will enable you to perform an analysis on how the reservations are changing over a period of time. You will need to ensure that this flag is used in all the analysis to ensure that there is no confusion.

Am i right to sacrifice database design fundamentals in this case for the sake of speed?

I work in a company that uses single table Access database for its outbound cms, which I moved to a SQL server based system. There's a data list table (not normalized) and a calls table. This has about one update per second currently. All call outcomes along with date, time, and agent id are stored in the calls table. Agents have a predefined set of records that they will call each day (this comprises records from various data lists sorted to give an even spread throughout their set). Note a data list record is called once per day.
In order to ensure speed, live updates to this system are stored in a duplicate of the calls table fields in the data list table. These are then copied to the calls table in a batch process at the end of the day.
The reason for this is not obviously the speed at which a new record could be added to the calls table live, but when the user app is closed/opened and loads the user's data set again I need to check which records have not been called today - I would need to run a stored proc on the server that picked the last most call from the calls table and check if its calldate didn't match today's date. I believe a more expensive query than checking if a field in the data list table is NULL.
With this setup I only run the expensive query at the end of each day.
There are many pitfalls in this design, the main limitation is my inexperience. This is my first SQL server system. It's pretty critical, and I had to ensure it would work and I could easily dump data back to access db during a live failure. It has worked for 11 months now (and no live failure, less downtime than the old system).
I have created pretty well normalized databases for other things (with far fewer users), but I'm hesitant to implement this for the calling database.
Specifically, I would like to know your thoughts on whether the duplication of the calls fields in the data list table is necessary in my current setup or whether I should be able to use the calls table. Please try and answer this from my perspective. I know you DBAs may be cringing!
Redesigning an already working Database may become the major flaw here. Rather try to optimize what you have got running currently instead if starting from scratch. Think of indices, referential integrity, key assigning methods, proper usage of joins and the like.
In fact, have a look here:
Database development mistakes made by application developers
This outlines some very useful pointers.
The thing the "Normalisation Nazis" out there forget is that database design typically has two stages, the "Logical Design" and the "Physical Design". The logical design is for normalisation, and the physical design is for "now lets get the thing working", considering among other things the benefits of normalisation vs. the benefits of breaking nomalisation.
The classic example is an Order table and an Order-Detail table and the Order header table has "total price" where that value was derived from the Order-Detail and related tables. Having total price on Order in this case still make sense, but it breaks normalisation.
A normalised database is meant to give your database high maintainability and flexibility. But optimising for performance is one of the considerations that physical design considers. Look at reporting databases for example. And don't get me started about storing time-series data.
Ask yourself, has my maintainability or flexibility been significantly hindered by this decision? Does it cause me lots of code changes or data redesign when I change something? If not, and you're happy that your design is working as required, then I wouldn't worry.
I think whether to normalize it depends on how much you can do, and what may be needed.
For example, as Ian mentioned, it has been working for so long, is there some features they want to add that will impact the database schema?
If not, then just leave it as it is, but, if you need to add new features that change the database, you may want to see about normalizing it at that point.
You wouldn't need to call a stored procedure, you should be able to use a select statement to get the max(id) by the user id, or the max(id) in the table, depending on what you need to do.
Before deciding to normalize, or to make any major architectural changes, first look at why you are doing it. If you are doing it just because you think it needs to be done, then stop, and see if there is anything else you can do, perhaps add unit tests, so you can get some times for how long operations take. Numbers are good before making major changes, to see if there is any real benefit.
I would ask you to be a little more clear about the specific dilemma you face. If your system has worked so well for 11 months, what makes you think it needs any change?
I'm not sure you are aware of the fact that "Database design fundamentals" might relate to "logical database design fundamentals" as well as "physical database design fundamentals", nor whether you are aware of the difference.
Logical database design fundamentals should not (and actually cannot) be "sacrificed" for speed precisely because speed is only determined by physical design choices, the prime desision factor in which is precisely speed and performance.

How do I model data that slowly changes over time?

Let's say I'm getting a large (2 million rows?) amount of data that's supposed to be static and unchanging. Supposed to be. And this data gets republished monthly. What methods are available to 1) be aware of what data points have changed from month to month and 2) consume the data given a point in time?
Solution 1) Naively save every snapshot of data, annotated by date. Diff awareness is handled by some in-house program, but consumption of the data by date is trivial. Cons, space requirements balloon by an order of magnitude.
Solution 2A) Using an in-house program, track when the diffs happen and store them in an EAV table, annotated by date. Space requirements are low, but consumption integrated with the original data becomes unwieldly.
Solution 2B) Using an in-house program, track when the diffs happen and store them in a sparsely filled table that looks much like the original table, filled only with the data that's changed and the date when changed. Cons, model is sparse and consumption integrated with the original data is non-trivial.
I guess, basically, how do I integrate the dimension of time into a relational database, keeping in mind both the viewing of the data and awareness of differences between time periods?
Does this relate to data warehousing at all?
Smells like... Slowly changing dimension?
I had a similar problem - big flat files imported to the database once per day. Most of the data is unchanging.
Add two extra columns to the table, starting_date and ending_date. The default value for ending_date should be sometime in the future.
To compare one file to the next, sort them both by the key columns, then read one row from each file.
If the keys are equal: compare the rest of the columns to see if the data has changed. If the row data is equal, the row is already in the database and there's nothing to do; if it's different, update the existing row in the database with an ending_date of today and insert a new row with a starting_date of today. Read a new row from both files.
If the key from the old file is smaller: the row was deleted. Update ending_date to today. Read a new row from the old file.
If the key from the new file is smaller: a row was inserted. Insert the row into the database with a starting_date of today. Read a new row from the new file.
Repeat until you've read everything from both files.
Now to query for the rows that were valid at any date, just select with a where clause test_date between start_date and end_date.
You could also take a leaf from the datawarehousing book. There are basically three ways of of dealing with changing data.
Have a look at this wikipedia article for SCD's but it is in essence tables:
http://en.wikipedia.org/wiki/Slowly_changing_dimension
A lot of this depends on how you're storing the data. There are two factors to consider:
How oftne does the data change?
How much does the data change?
The distinction is important. If it changes often but not much then annotated snapshots are going to be extremely inefficient. If it changes infrequently but a lot then they're a better solution.
It also depends on if you need to see what the data looked like at a specific point in time.
If you're using Oracle, for example, you can use flashback queries to see a consistent view of the data at some arbitrary point.
Personally I think you're better off storing it incrementally and, at a minimum, using some form of auditing to track changes so you can recover an historic snapshot if it's ever required. But like I said, this depends on many factors.
If it was me, I'd save the whole thing every month (not necessarily in a database, but as a data file or text file off-line) - you will be glad you did. Even at a row size of 4096 bytes (wild ass guess), you are only talking about 8G of disk per month. You can save a LOT of months on a 300G drive. I did something similar for years, when I was getting over 1G per day in downloads to a datawarehouse.
This sounds to me rather like the problem faced by source code version control systems. These store patches which are used to create the changes as they occur. So if a file does not change, or only a few lines change, the patch that needs to be stored is relatively very small. The system also stores which version each patch contributes to. So, when viewing a particular version of a particular file, the initial version is recovered and all the patches, up to the version requested are applied.
In your, very general, situation, you need to divide up your data into chunks. Hopefully there are natural divisions you can use, but if this division has to be arbitrary that's should be OK. Whenever a change occurs, store the patch for the affected chunk and record a new version. Now, when you want to view a particular date, find the last version that predates the view date, apply the patches for the chunk that has been requested, and display.
Could you do the following:
1. Each month BCP all data into a temporary table
2. Run a script or stored procedure to update the primary table
(which has an additional DateTime column as part of a composite key),
with any changes made.
3. Repeat each month.
This should give you a table, which you can query data for at a particular date.
In addition each change will be logged, and the size of the table shouldn't change dramatically over time.
However, as a backup to this, I would store each data file as Brennan suggests.

How to gain performance when maintaining historical and current data?

I want to maintain last ten years of stock market data in a single table. Certain analysis need only data of the last one month data. When I do this short term analysis it takes a long time to complete the operation.
To overcome this I created another table to hold current year data alone. When I perform the analysis from this table it 20 times faster than the previous one.
Now my question is:
Is this the right way to have a separate table for this kind of problem. (Or we use separate database instead of table)
If I have separate table Is there any way to update the secondary table automatically.
Or we can use anything like dematerialized view or something like that to gain performance.
Note: I'm using Postgresql database.
You want table partitioning. This will automatically split the data between multiple tables, and will in general work much better than doing it by hand.
I'm working on near the exact same issue.
Table partitioning is definitely the way to go here. I would segment by more than year though, it would give you a greater degree of control. Just set up your partitions and then constrain them by months (or some other date). In your postgresql.conf you'll need to turn constraint_exclusion=on to really get the benefit. The additional benefit here is that you can only index the exact tables you really want to pull information from. If you're batch importing large amounts of data into this table, you may get slightly better results a Rule vs a Trigger and for partitioning, I find rules easier to maintain. But for smaller transactions, triggers are much faster. The postgresql manual has a great section on partitioning via inheritance.
I'm not sure about PostgreSQL, but I can confirm that you are on the right track. When dealing with large data volumes partitioning data into multiple tables and then using some kind of query generator to build your queries is absolutely the right way to go. This approach is well established in Data Warehousing, and specifically in your case stock market data.
However, I'm curious why do you need to update your historical data? If you're dealing with stock splits, it's common to implement that using a seperate multiplier table that is used in conjunction with the raw historical data to give an accurate price/share.
it is perfectly sensible to use separate table for historical records. It's much more problematic with separate database, as it's not simple to write cross-database queries
automatic updates - it's a tool for cronjob
you can use partial indexes for such things - they do wonderful job
Frankly, you should check your execution plans and try fixing your queries or indexing before taking more radical steps.
Indexing comes at very little cost (unless you do a lot of insertions) and your existing code will be faster (if you index properly) without modifying it.
Other measures such as partioning come after that...

Resources