At what scale of database growth does archiving become a necessity, and are there guidelines to show when it is required?
I manage an intranet which provides short news articles via about 40 targeted news groups. I have been asked to remove browsing access to articles older than 2 years, but to maintain access to these by an existing search interface.
One proposal is to hide records by using scheduled overnight tasks to remove out old news items to parallel archive tables. Given that the entire database is only about 5Gb, the entire set of 13000 news articles take up 17Mb, and there are indexes on the publication dates, is this approach advisable or will WHERE clauses based on dates suffise? Is there a rule of thumb here?
The db in question is SQL 2008, we add maybe 2000 news items per year, and there are no reported performance issues at present - this is purely 'future proofing'.
This definitely is a candidate for doing the simplest possible thing, because the data involved is quite manageable. A where clause should be enough. You should have an index on the date column you want to use in the where clause, as that will probably be done in an online fashion.
IDK about the setup you have, but 5GB is enough to load in memory and still have room to spare. So you're well within what the system can handle.
A) Is it possible to have Grandfather-Father-Son archiving? For example we would like to have the following precalculated at all times, and nothing else precalculated:
Daily totals of last week
Weekly totals of previous 5-6 weeks
Monthly totals of all previous months
Note that we don't want daily totals of a day that was 2 months ago for example. We want that daily total to be deleted.
Will indexed views be good enough for this purpose? We want all the fields to be precalculated and stored.
B) We would like to have some features like StackExchanges (and generally Wikis) versioning. Are there ways to archive the older versions somehow on the production environment, and keep the newer versions of stuff more readily available? We have been looking into Partitioning but it doesn't seem to handle such intricate scenario (we don't want ALL posts prior to X date to be partitioned, rather than we need all versions that are older than the newest version)
What are the best practices on these matter?
A: You are describing the storage of time-series data with a varying retention period/graularity, features SQL Server does not provide in any form natively. You can of course build this strategy yourself fairly easily, and luckily you have a few great examples in open source projects for guidance:
RRD Tool - time series database and graphing library.
Graphite - inspired by RRD Tool, built by
B: Its hard to speak in such general terms, but perhaps you can maintain an Version column on your table, where 0 always reflects the latest version. Each time you promote a new version you demote all the other versions by incrementing this value. Having versions increase as they get older allows you to create a filtered index on a deterministic value (Version = 0) making it very performant to get the current version.
You can then purge based on number of versions ago rather than date. Just delete yourTable where Version > 5. A partitioning scheme could work using this method as well.
I am working on a product (ASP.NET Web site) developed for educational institutions. There are around 20 educational inst. in my site. For each of them academic year start and end date varies. There are huge number of records in the database for attendance and results.
Now I need to show all previous years data (like attendance, results etc) whenever a student, teacher logs in. There are some reports which compares student performance in various academic years.
Now my problem is how to maintain that huge data ?
I wanted to go with 2 databases. 1 for current academic year, another for all previous yrs.
But my current year DB schema may change for enhancement. So whenever I move the current year data to archive database then it creates problem for me. Please suggest a good way to implement this.
Thanks,
seshu.
Have you thought about table partitioning? It allows you to rapidly move data through sliding windows - so that at the start of a new year, you slide last year's details into an archive partition. (You need to check the SQL Server edition you have to see whether it is enabled)
MSDN details:
http://msdn.microsoft.com/en-us/library/ms345146(SQL.90).aspx
If you want to keep two databases in sync, schema-wise, there are plenty of tools available for that. Here is mine, here is Red Gate's and here is Apex's. There are many more available, including one which comes with Visual Studio Team System Database edition (if you have that already - if you don't then one of the ones I have previously mentioned will be a lot cheaper).
I have a project involving a web voting system. The current values and related data is stored in several tables. Historical data will be an important aspect of this project so I've also created Audit Tables to which current data will be moved to on a regular basis.
I find this strategy highly inefficient. Even if I only archive data on a daily basis, the number of rows will become huge even if only 1 or 2 users make updates on a given day.
The next alternative I can think of is only storing entries that have changed. This will mean having to build logic to automatically create a view of a given day. This means less stored rows, but considerable complexity.
My final idea is a bit less conventional. Since the historical data will be for reporting purposes, there's no need for web users to have quick access. I'm thinking that my db could have no historical data in it. DB only represents current state. Then, daily, the entire db could be loaded into objects (number of users/data is relatively low) and then serialized to something like XML or JSON. These files could be diffed with the previous day and stored. In fact, SVN could do this for me. When I want the data for a given past day, the system has to retrieve the version for that day and deserialize into objects. This is obviously a costly operation but performance is not so much a concern here. I'm considering using LINQ for this which I think would simplify things. The serialization procedure would have to be pretty organized for the diff to work well.
Which approach would you take?
Thanks
If you're basically wondering how revisions of data are stored in relational databases, then I would look into how wikis do it.
Wikis are all about keeping detailed revision history. They use simple relational databases for storage.
Consider Wikipedia's database schema.
All you've told us about your system is that it involves votes. As long as you store timestamps for when votes were cast you should be able to generate a report describing the vote state tally at any point in time... no?
For example, say I have a system that tallies favorite features (eyes, smile, butt, ...). If I want to know how many votes there were for a particular feature as of a particular date, then I would simply tally all the votes for the feature with a timestamp smaller or equal to that date.
If you want to have a history of other things, then you would follow a similar approach.
I think this is the way it is done.
Have you considered using a real version control system rather than trying to shoehorn a database in its place? I myself am quite partial to git, but there are many options. They all have good support for differences between versions, and they tend to be well optimised for this kind of workload.
I was reading about temporal databases and it seems they have built in time aspects. I wonder why would we need such a model?
How different is it from a normal RDBMS? Can't we have a normal database i.e. RDBMS and say have a trigger which associates a time stamp with each transaction that happens? May be there would be a performance hit. But I'm still skeptical on temporal databases having a strong case in the market.
Does any of the present databases support such a feature?
Consider your appointment/journal diary - it goes from Jan 1st to Dec 31st. Now we can query the diary for appointments/journal entries on any day. This ordering is called the valid time. However, appointments/entries are not usually inserted in order.
Suppose I would like to know what appointments/entries were in my diary on April 4th. That is, all the records that existed in my diary on April 4th. This is the transaction time.
Given that appointments/entries can be created and deleted etc. A typical record has a beginning and end valid time that covers the period of the entry and a beginning and end transaction time that indicates the period during which the entry appeared in the diary.
This arrangement is necessary when the diary may undergo historical revision. Suppose on April 5th I realise that the appointment I had on Feb 14th actually occurred on February 12th i.e. I discover an error in my diary - I can correct the error so that the valid time picture is corrected, but now, my query of what was in the diary on April 4th would be wrong, UNLESS, the transaction times for appointments/entries are also stored. In that case if I query my diary as of April 4th it will show an appointment existed on February 14th but if I query as of April 6th it would show an appointment on February 12th.
This time travel feature of a temporal database makes it possible to record information about how errors are corrected in a database. This is necessary for a true audit picture of data that records when revisions were made and allows queries relating to how data have been revised over
time.
Most business information should be stored in this bitemporal scheme in order to provide a true audit record and to maximise business intelligence - hence the need for support in a relational database. Notice that each data item occupies a (possibly unbounded) square in the two dimensional time model which is why people often use a GIST index to implement bitemporal indexing. The problem here is that a GIST index is really designed for geographic data and the requirements for temporal data are somewhat different.
PostgreSQL 9.0 exclusion constraints should provide new ways of organising temporal data e.g. transaction and valid time PERIODs should not overlap for the same tuple.
A temporal database efficiently stores a time series of data, typically by having some fixed timescale (such as seconds or even milliseconds) and then storing only changes in the measured data. A timestamp in an RDBMS is a discretely stored value for each measurement, which is very inefficient. A temporal database is often used in real-time monitoring applications like SCADA. A well-established system is the PI database from OSISoft (http://www.osisoft.com/).
As I understand it (and over-simplifying enormously), a temporal database records facts about when the data was valid as well as the the data itself, and permits you to query on the temporal aspects. You end up dealing with 'valid time' and 'transaction time' tables, or 'bitemporal tables' involving both 'valid time' and 'transaction time' aspects. You should consider reading either of these two books:
Darwen, Date and Lorentzos "Temporal Data and the Relational Model" (out of print),
and (at a radically different extreme) "Developing Time-Oriented Database Applications in SQL", Richard T. Snodgrass, Morgan Kaufmann Publishers, Inc., San Francisco, July, 1999, 504+xxiii pages, ISBN 1-55860-436-7. That is out of print but available as PDF on his web site at cs.arizona.edu (so a Google search makes it pretty easy to find).
Temporal databases are often used in the financial services industry. One reason is that you are rarely (if ever) allowed to delete any data, so ValidFrom - ValidTo type fields on records are used to provide an indication of when a record was correct.
Besides "what new things can I do with it", it might be useful to consider "what old things does it unify?". The temporal database represents a particular generalization of the "normal" SQL database. As such, it may give you a unified solution to problems that previously appeared unrelated. For example:
Web Concurrency When your database has a web UI that lets multiple users perform standard Create/Update/Delete (CRUD) modifications, you have to face the concurrent web changes problem. Basically, you need to check that an incoming data modification is not affecting any records that have changed since that user last saw those records. But if you have a temporal database, it quite possibly already associates something like a "revision ID" with each record (due to the difficulty of making timestamps unique and monotonically ascending). If so, then that becomes the natural, "already built-in" mechanism for preventing the clobbering of other users' data during database updates.
Legal/Tax Records The legal system (including taxes) places rather more emphasis on historical data than most programmers do. Thus, you will often find advice about schemas for invoices and such that warns you to beware of deleting records or normalizing in a natural way--which can lead to an inability to answer basic legal questions like "Forget their current address, what address did you mail this invoice to in 2001?" With a temporal framework base, all the machinations to those problems (they usually are halfway steps to having a temporal database) go away. You just use the most natural schema, and delete when it make sense, knowing that you can always go back and answer historical questions accurately.
On the other hand, the temporal model itself is half-way to complete revision control, which could inspire further applications. For example, suppose you roll your own temporal facility on top of SQL and allow branching, as in revision control systems. Even limited branching could make it easy to offer "sandboxing" -- the ability to play with and modify the database with abandon without causing any visible changes to other users. That makes it easy to supply highly realistic user training on a complex database.
Simple branching with a simple merge facility could also simplify some common workflow problems. For example, a non-profit might have volunteers or low-paid workers doing data entry. Giving each worker their own branch could make it easy to allow a supervisor to review their work or enhance it (e.g., de-duplification) before merging it into the main branch where it would become visible to "normal" users. Branches could also simplify permissions. If a user is only granted permission to use/see their unique branch, you don't have to worry about preventing every possible unwanted modification; you'll only merge the changes that make sense anyway.
Apart from reading the Wikipedia article? A database that maintains an "audit log" or similar transaction log will have some properties of being "temporal". If you need answers to questions about who did what to whom and when then you've got a good candidate for a temporal database.
You can imagine a simple temporal database that just logs your GPS location every few seconds. The opportunities for compressing this data is great, a normal database you would need to store a timestamp for every row. If you have a great deal of throughput required, knowing the data is temporal and that updates and deletes to a row will never be required permits the program to drop a lot of the complexity inherit in a typical RDBMS.
Despite this, temporal data is usually just stored in a normal RDBMS. PostgreSQL, for example has some temporal extensions, which makes this a little easier.
Two reasons come to mind:
Some are optimized for insert and read only and can offer dramatic perf improvements
Some have better understandings of time than traditional SQL - allowing for grouping operations by second, minute, hour, etc
Just an update, Temporal database is coming to SQL Server 2016.
To clear all your doubts why one need a Temporal Database, rather than configuring with custom methods, and how efficiently & seamlessly SQL Server configures it for you, check the in-depth video and demo on Channel9.msdn here: https://channel9.msdn.com/Shows/Data-Exposed/Temporal-in-SQL-Server-2016
MSDN link: https://msdn.microsoft.com/en-us/library/dn935015(v=sql.130).aspx
Currently with the CTP2 (beta 2) release of SQL Server 2016 you can play with it.
Check this video on how to use Temporal Tables in SQL Server 2016.
My understanding of temporal databases is that are geared towards storing certain types of temporal information. You could simulate that with a standard RDBMS, but by using a database that supports it you have built-in idioms for a lot of concepts and the query language might be optimized for these sort of queries.
To me this is a little like working with a GIS-specific database rather than an RDBMS. While you could shove coordinates in a run-of-the-mill RDBMS, having the appropriate representations (e.g., via grid files) may be faster, and having SQL primitives for things like topology is useful.
There are academic databases and some commercial ones. Timecenter has some links.
Another example of where a temporal database is useful is where data changes over time. I spent a few years working for an electricity retailer where we stored meter readings for 30 minute blocks of time. Those meter readings could be revised at any point but we still needed to be able to look back at the history of changes for the readings.
We therefore had the latest reading (our 'current understanding' of the consumption for the 30 minutes) but could look back at our historic understanding of the consumption. When you've got data that can be adjusted in such a way temporal databases work well.
(Having said that, we hand carved it in SQL, but it was a fair while ago. Wouldn't make that decision these days.)