In what cases are bitemporal tables actually used?

I am trying to collect information about temporal databases. I know it is not a modern technology, but I have seen that many people who work with databases don't even know how the temporal approach works (I asked some senior programmers and system analysts about temporal databases and they answered something like "Huh?").
I know there are valid-time state tables and transaction-time state tables, along with bitemporal tables. I think that bitemporal tables are too complex for most usages, because nowadays space is no longer a problem, and it is more efficient to write the same information to two different tables, even if the data is redundant. However, I searched online extensively trying to see where bitemporal tables are actually used, but I didn't find anything useful.
Are there cases where using a bitemporal table is really more convenient than using valid-time and transaction-time state tables separately? Are there real-world examples?

Of course! Take, for example, balance sheet data. You will find that this information will change from WD1 (Working Day 1) to WD x due to late-arriving data, adjustments, manual errors and suchlike.
In order to enable repeatable reporting, an audit trail and temporal comparisons, a record must be kept of 'old' (invalid?) results. Bitemporal is a great way to manage such updates, especially on an intraday basis. I don't think it's that complicated from a user perspective - just another filter on the WHERE clause.
I admit that the loading process is complicated, but it's not that bad. I literally just finished writing a generic transform (in SAS, coping with all scenarios for a unique business key) and it took a single day.
Coming back to use cases.. Having both valid (business) time and transaction (version) time on the same table enables:
Repeatable results (having separate tables and corresponding updates could mean getting different results for the same query on two different days)
Comparable results (can answer questions such as "what was the value of X, as we knew it at time Y?")
Rapid results (only dealing with a single table, updated in a transparent and easy-to-query way).
In this sense it is an appropriate structure to use on many, if not all, tables in a DWH (a minimal sketch follows).
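Here is what such a table and an "as we knew it" query might look like. The table and column names are illustrative only, and the open-ended periods are closed with a sentinel date, which is just one common convention:

    -- Hypothetical bitemporal balance table.
    -- valid_from / valid_to = business (valid) time: when the balance was true in the real world.
    -- tx_from / tx_to       = transaction (version) time: when this row was the recorded version.
    CREATE TABLE balance_bitemporal (
        account_id  INT           NOT NULL,
        balance     DECIMAL(18,2) NOT NULL,
        valid_from  DATE          NOT NULL,
        valid_to    DATE          NOT NULL,  -- '9999-12-31' = until further notice
        tx_from     TIMESTAMP     NOT NULL,
        tx_to       TIMESTAMP     NOT NULL   -- '9999-12-31' = current version
    );

    -- "What was the balance of account 42 on 2020-03-31, as we knew it on 2020-04-02?"
    SELECT balance
    FROM   balance_bitemporal
    WHERE  account_id = 42
      AND  valid_from <= '2020-03-31' AND valid_to > '2020-03-31'
      AND  tx_from    <= '2020-04-02' AND tx_to    > '2020-04-02';

Re-running the same query with a later transaction-time cutoff surfaces any late adjustments, which is exactly the repeatable and comparable behaviour described above.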
UPDATE 2020: A bitemporal data transform for SAS (both SAS 9 and Viya) is available with Data Controller for SAS. A demo version is available: https://docs.datacontroller.io/dcc-tables/#var_busfrom-var_busto

I think your question raises more issues but it all comes down to how much is enough.
I developed a Bi_Temporal SQL Server engine that supports object versioning and relationships over time, as well as all the other beautiful parts of temporal DBs.
This was because the project needed to be able to be rewound to a place in time and see everything as it was at that time.
I mean everything including data, relationships and User access.
It was the most complex thing I have built, but in the end it was so complex that no one else could maintain it or understand what was happening.
So there was a real world use case and a deliverable.
It's not everyone's cup of tea, as you have to be able to think in the time dimension as well as in terms of the object version changes that all DBs undergo.
Hope this helps someone. I know the post is old, but as it was the first I found when searching for temporal DBs, it might be of interest to someone.


Database with "Open Schema" - Good or Bad Idea?

The co-founder of Reddit gave a presentation on issues they had while scaling to millions of users. A summary is available here.
What surprised me is point 3:
Instead, they keep a Thing Table and a Data Table. Everything in Reddit is a Thing: users, links, comments, subreddits, awards, etc. Things keep common attributes like up/down votes, a type, and creation date. The Data table has three columns: thing id, key, value. There's a row for every attribute. There's a row for title, url, author, spam votes, etc. When they add new features they didn't have to worry about the database anymore. They didn't have to add new tables for new things or worry about upgrades.
This seems like a terrible idea to me, but it seems to have worked out for Reddit. Is it a good idea in general, though? Or is it a peculiarity of Reddit that happened to work out for them?
This is a data model known as EAV for entity-attribute-value. It has its uses. A prime example is patient test data which is naturally sparse since there are hundreds of thousands of tests which might be run, but typically only a handful are present for a patient. A table with hundreds of thousands of columns is silly, but a table with EAV makes good sense.
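As a rough sketch (illustrative names only, not Reddit's actual schema), the layout described above looks something like this:

    -- "Thing" table holds the attributes common to every entity.
    CREATE TABLE thing (
        thing_id    BIGINT PRIMARY KEY,
        thing_type  VARCHAR(20) NOT NULL,   -- 'user', 'link', 'comment', ...
        ups         INT NOT NULL DEFAULT 0,
        downs       INT NOT NULL DEFAULT 0,
        created_at  TIMESTAMP NOT NULL
    );

    -- "Data" table holds one row per attribute per thing.
    CREATE TABLE thing_data (
        thing_id    BIGINT      NOT NULL,   -- no foreign key, in the "keep the DB simple" spirit
        attr_key    VARCHAR(50) NOT NULL,
        attr_value  VARCHAR(4000),
        PRIMARY KEY (thing_id, attr_key)
    );

    -- Reading an attribute is a lookup per key rather than a column per attribute.
    SELECT attr_value FROM thing_data WHERE thing_id = 12345 AND attr_key = 'title';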
Most of the really big web sites end up using something incredibly simple on the database side of things. This has the advantage that it's fast and scalable. It has the disadvantage that all the relationships you'd normally get the database to enforce automatically (via triggers and such) you need to enforce yourself in your client code instead. Maintaining consistency is a pain in the neck, and there's almost always at least some chance that your data will be inconsistent, at least for short periods of time.
For a social networking site, it's a worthwhile compromise. Data that's mostly right most of the time is adequate (e.g., who really cares if the number of up-votes you receive for an item is really 20 milliseconds out of date when it's sent), and keeping costs reasonable while scaling to support a gazillion users matters a lot.
I noted that they did not mention anything about the ease or difficulty of creating reports against that data. When used in a narrow set of circumstances, EAVs can be beneficial. As a central part of most systems, they become a nightmare when you hit reporting. The problem with EAVs is that most of the benefit is at the outset of the project and most of the pain comes later, in analysis and reporting, especially due to the severe lack of data integrity. "Not having to worry about foreign keys" to me sounds like a nightmare of orphaned rows. Add in the use of surrogate keys for everything and you have a tangled morass which generally ends in a complete rewrite.
We worked on a similar problem not long ago. I could say that at first it wasn't easy or fun, but after you get used to it, it has its own benefits. It's like developing another database within your tables; in some areas it's overkill, but once you pass those levels it provides you with lots of functionality. Basically, after one point we didn't create any new tables, and we just created dynamic forms for everything, even for our own programming tasks.
As for performance, the system didn't get millions of rows, so it isn't a fair comparison, but for daily usage I never noticed any difference.
Some problems I want to share:
1. We didn't delete any rows; we just hid them by setting a flag, and a nightly (weekly) service cleaned up the physical rows (sketched below).
2. Orphan rows: we basically didn't care about cleaning up children; we just set "IsDeleted" on the parent, and the nightly service would clean every row that was orphaned or no longer needed.
3. You should keep your indexes up to date, but you should defer rebuilding them whenever possible (again, the nightly service kept the indexes up to date).
4. We prepared our reporting data ahead of time (AOT), which means we were behind the actual data here :))
5. We tried hard not to jump into an ocean of rows to calculate values on user demand. If we prepared it, you can use it; if not, you can't.
In the end, there are many unique challenges to this approach that you have to find ways to solve, problems you never faced before in your routine job, but after all of that you earn flexibility that you can spend.
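A rough sketch of the kind of nightly cleanup mentioned in points 1 and 2, with hypothetical parent/child tables and an IsDeleted flag (the scheduling itself would live outside the database):

    -- Step 1: physically remove children whose parent is flagged as deleted.
    DELETE FROM child
    WHERE  parent_id IN (SELECT parent_id FROM parent WHERE is_deleted = 1);

    -- Step 2: remove orphaned children whose parent no longer exists at all.
    DELETE FROM child
    WHERE  NOT EXISTS (SELECT 1 FROM parent p WHERE p.parent_id = child.parent_id);

    -- Step 3: finally remove the flagged parents themselves.
    DELETE FROM parent
    WHERE  is_deleted = 1;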

What should every developer know about databases? [closed]

Whether we like it or not, many if not most of us developers either regularly work with databases or may have to work with one someday. And considering the amount of misuse and abuse in the wild, and the volume of database-related questions that come up every day, it's fair to say that there are certain concepts that developers should know - even if they don't design or work with databases today.
What is one important concept that developers and other software professionals ought to know about databases?
The very first thing developers should know about databases is this: what are databases for? Not how do they work, nor how do you build one, nor even how do you write code to retrieve or update the data in a database. But what are they for?
Unfortunately, the answer to this one is a moving target. In the heyday of databases, the 1970s through the early 1990s, databases were for the sharing of data. If you were using a database and you weren't sharing data, you were either involved in an academic project or you were wasting resources, including yourself. Setting up a database and taming a DBMS were such monumental tasks that the payback, in terms of data exploited multiple times, had to be huge to match the investment.
Over the last 15 years, databases have come to be used for storing the persistent data associated with just one application. Building a database for MySQL, or Access, or SQL Server has become so routine that databases have become almost a routine part of an ordinary application. Sometimes, that initial limited mission gets pushed upward by mission creep, as the real value of the data becomes apparent. Unfortunately, databases that were designed with a single purpose in mind often fail dramatically when they begin to be pushed into a role that's enterprise wide and mission critical.
The second thing developers need to learn about databases is the whole data centric view of the world. The data centric world view is more different from the process centric world view than anything most developers have ever learned. Compared to this gap, the gap between structured programming and object oriented programming is relatively small.
The third thing developers need to learn, at least in an overview, is data modeling, including conceptual data modeling, logical data modeling, and physical data modeling.
Conceptual data modeling is really requirements analysis from a data centric point of view.
Logical data modeling is generally the application of a specific data model to the requirements discovered in conceptual data modeling. The relational model is used far more than any other specific model, and developers need to learn the relational model for sure. Designing a powerful and relevant relational model for a nontrivial requirement is not a trivial task. You can't build good SQL tables if you misunderstand the relational model.
Physical data modeling is generally DBMS specific, and doesn't need to be learned in much detail, unless the developer is also the database builder or the DBA. What developers do need to understand is the extent to which physical database design can be separated from logical database design, and the extent to which producing a high speed database can be accomplished just by tweaking the physical design.
The next thing developers need to learn is that while speed (performance) is important, other measures of design goodness are even more important, such as the ability to revise and extend the scope of the database down the road, or simplicity of programming.
Finally, anybody who messes with databases needs to understand that the value of data often outlasts the system that captured it.
Whew!
Good question. The following are some thoughts in no particular order:
Normalization, to at least the second normal form, is essential.
Referential integrity is also essential, with proper cascading delete and update considerations.
Good and proper use of check constraints. Let the database do as much work as possible.
Don't scatter business logic in both the database and middle tier code. Pick one or the other, preferably in middle tier code.
Decide on a consistent approach for primary keys and clustered keys.
Don't over index. Choose your indexes wisely.
Consistent table and column naming. Pick a standard and stick to it.
Limit the number of columns in the database that will accept null values.
Don't get carried away with triggers. They have their use but can complicate things in a hurry.
Be careful with UDFs. They are great but can cause performance problems when you're not aware of how often they might get called in a query.
Get Celko's book on database design. The man is arrogant but knows his stuff.
First, developers need to understand that there is something to know about databases. They're not just magic devices where you put in the SQL and get out result sets, but rather very complicated pieces of software with their own logic and quirks.
Second, that there are different database setups for different purposes. You do not want a developer making historical reports off an on-line transactional database if there's a data warehouse available.
Third, developers need to understand basic SQL, including joins.
Past this, it depends on how closely the developers are involved. I've worked in jobs where I was developer and de facto DBA, where the DBAs were just down the aisle, and where the DBAs are off in their own area. (I dislike the third.) Assuming the developers are involved in database design:
They need to understand basic normalization, at least the first three normal forms. Anything beyond that, get a DBA. For those with any experience with US courtrooms (and random television shows count here), there's the mnemonic "Depend on the key, the whole key, and nothing but the key, so help you Codd."
They need to have a clue about indexes, by which I mean they should have some idea what indexes they need and how they're likely to affect performance. This means not having useless indices, but not being afraid to add them to assist queries. Anything further (like the balance) should be left for the DBA.
They need to understand the need for data integrity, and be able to point to where they're verifying the data and what they're doing if they find problems. This doesn't have to be in the database (where it will be difficult to issue a meaningful error message for the user), but has to be somewhere.
They should have the basic knowledge of how to get a plan, and how to read it in general (at least enough to tell whether the algorithms are efficient or not).
They should know vaguely what a trigger is, what a view is, and that it's possible to partition pieces of databases. They don't need any sort of details, but they need to know to ask the DBA about these things.
They should of course know not to meddle with production data, or production code, or anything like that, and they should know that all source code goes into a VCS.
I've doubtless forgotten something, but the average developer need not be a DBA, provided there is a real DBA at hand.
Basic Indexing
I'm always shocked to see a table or an entire database with no indexes, or arbitrary/useless indexes. Even if you're not designing the database and just have to write some queries, it's still vital to understand, at a minimum:
What's indexed in your database and what's not;
The difference between types of scans, how they're chosen, and how the way you write a query can influence that choice;
The concept of coverage (why you shouldn't just write SELECT *);
The difference between a clustered and non-clustered index;
Why more/bigger indexes are not necessarily better;
Why you should try to avoid wrapping filter columns in functions.
Designers should also be aware of common index anti-patterns, for example:
The Access anti-pattern (indexing every column, one by one)
The Catch-All anti-pattern (one massive index on all or most columns, apparently created under the mistaken impression that it would speed up every conceivable query involving any of those columns).
The quality of a database's indexing - and whether or not you take advantage of it with the queries you write - accounts for by far the most significant chunk of performance. 9 out of 10 questions posted on SO and other forums complaining about poor performance invariably turn out to be due to poor indexing or a non-sargable expression.
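As a small illustration of the coverage and function-wrapping points, using a hypothetical orders table (the INCLUDE syntax shown is SQL Server style; other engines differ):

    -- Non-sargable: wrapping the filter column in a function defeats an index seek.
    SELECT order_id, total
    FROM   orders
    WHERE  YEAR(order_date) = 2020;

    -- Sargable rewrite: an index on order_date can now be used for a range seek.
    SELECT order_id, total
    FROM   orders
    WHERE  order_date >= '2020-01-01' AND order_date < '2021-01-01';

    -- A covering index for that query, so the lookup never touches the base table.
    CREATE INDEX ix_orders_date ON orders (order_date) INCLUDE (order_id, total);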
Normalization
It always depresses me to see somebody struggling to write an excessively complicated query that would have been completely straightforward with a normalized design ("Show me total sales per region.").
If you understand this at the outset and design accordingly, you'll save yourself a lot of pain later. It's easy to denormalize for performance after you've normalized; it's not so easy to normalize a database that wasn't designed that way from the start.
At the very least, you should know what 3NF is and how to get there. With most transactional databases, this is a very good balance between making queries easy to write and maintaining good performance.
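For instance, with a hypothetical normalized design, "total sales per region" stays a straightforward join and aggregate:

    -- Hypothetical tables: region(region_id, name), customer(customer_id, region_id),
    -- sale(sale_id, customer_id, amount).
    SELECT r.name        AS region,
           SUM(s.amount) AS total_sales
    FROM   sale s
    JOIN   customer c ON c.customer_id = s.customer_id
    JOIN   region   r ON r.region_id   = c.region_id
    GROUP  BY r.name;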
How Indexes Work
It's probably not the most important, but for sure the most underestimated topic.
The problem with indexing is that SQL tutorials usually don't mention them at all and that all the toy examples work without any index.
Even more experienced developers can write fairly good (and complex) SQL without knowing more about indexes than "An index makes the query fast".
That's because SQL databases do a very good job working as black-box:
Tell me what you need (gimme SQL), I'll take care of it.
And that works perfectly to retrieve the correct results. The author of the SQL doesn't need to know what the system is doing behind the scenes--until everything becomes sooo slooooow.....
That's when indexing becomes a topic. But that's usually very late and somebody (some company?) is already suffering from a real problem.
That's why I believe indexing is the No. 1 topic not to forget when working with databases. Unfortunately, it is very easy to forget it.
Disclaimer
The arguments are borrowed from the preface of my free eBook "Use The Index, Luke". I am spending quite a lot of my time explaining how indexes work and how to use them properly.
I just want to point out an observation: it seems that the majority of responses assume "database" is interchangeable with "relational database". There are also object databases and flat file databases. It is important to assess the needs of the software project at hand. From a programmer's perspective, the database decision can be delayed until later. Data modeling, on the other hand, can be done early on and lead to much success.
I think data modeling is a key component and is a relatively old concept yet it is one that has been forgotten by many in the software industry. Data modeling, especially conceptual modeling, can reveal the functional behavior of a system and can be relied on as a road map for development.
On the other hand, the type of database required can be determined based on many different factors, including environment, user volume, and available local hardware such as hard drive space.
Avoiding SQL injection and how to secure your database
Every developer should know that this is false: "Profiling a database operation is completely different from profiling code."
There is a clear Big-O in the traditional sense. When you do an EXPLAIN PLAN (or the equivalent) you're seeing the algorithm. Some algorithms involve nested loops and are O( n ^ 2 ). Other algorithms involve B-tree lookups and are O( n log n ).
This is very, very serious. It's central to understanding why indexes matter. It's central to understanding the speed-normalization-denormalization tradeoffs. It's central to understanding why a data warehouse uses a star-schema which is not normalized for transactional updates.
If you're unclear on the algorithm being used, do the following. Stop. Explain the query execution plan. Adjust indexes accordingly.
Also, the corollary: More Indexes are Not Better.
Sometimes an index focused on one operation will slow other operations down. Depending on the ratio of the two operations, adding an index may have good effects, no overall impact, or be detrimental to overall performance.
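As a sketch of how you would actually check this: PostgreSQL and MySQL accept EXPLAIN in front of a query (SQL Server has SET SHOWPLAN_TEXT and the graphical estimated plan, Oracle has EXPLAIN PLAN); the table names below are hypothetical.

    -- Ask the optimizer what it intends to do.
    EXPLAIN
    SELECT o.order_id, c.name
    FROM   orders o
    JOIN   customers c ON c.customer_id = o.customer_id
    WHERE  o.order_date >= '2020-01-01';

    -- A nested loop driven by a full table scan behaves roughly like O(n^2);
    -- the same join using an index lookup behaves more like O(n log n).
    -- If the plan shows a table scan where you expected a seek, adjust indexes and re-check.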
I think every developer should understand that databases require a different paradigm.
When writing a query to get at your data, a set-based approach is needed. Many people with an iterative background struggle with this. And yet, when they embrace it, they can achieve far better results, even though the solution may not be the one that first presented itself in their iterative-focused minds.
Excellent question. Let's see: first, no one should consider querying a database who does not thoroughly understand joins. That's like driving a car without knowing where the steering wheel and brakes are. You also need to know data types and how to choose the best one.
Another thing that developers should understand is that there are three things you should have in mind when designing a database:
Data integrity - if the data can't be relied on, you essentially have no data. This means you cannot put the required logic only in the application, as many other sources may touch the database. Constraints, foreign keys and sometimes triggers are necessary for data integrity. Don't fail to use them because you don't like them or don't want to be bothered to understand them.
Performance - it is very hard to refactor a poorly performing database and performance should be considered from the start. There are many ways to do the same query and some are known to be faster almost always, it is short-sighted not to learn and use these ways. Read some books on performance tuning before designing queries or database structures.
Security - this data is the life-blood of your company, it also frequently contains personal information that can be stolen. Learn to protect your data from SQL injection attacks and fraud and identity theft.
When querying a database, it is easy to get the wrong answer. Make sure you understand your data model thoroughly. Remember that actual decisions are often made based on the data your query returns. When it is wrong, the wrong business decisions are made. You can kill a company with bad queries or lose a big customer. Data has meaning; developers often seem to forget that.
Data almost never goes away, think in terms of storing data over time instead of just how to get it in today. That database that worked fine when it had a hundred thousand records, may not be so nice in ten years. Applications rarely last as long as data. This is one reason why designing for performance is critical.
Your database will probably need fields that the application doesn't need to see: things like GUIDs for replication, date-inserted fields, etc. You also may need to store a history of changes and who made them and when, and be able to restore bad changes from this storehouse. Think about how you intend to do this before you come ask a web site how to fix the problem where you forgot to put a WHERE clause on an update and updated the whole table.
Never develop in a newer version of a database than the production version. Never, never, never develop directly against a production database.
If you don't have a database administrator, make sure someone is making backups and knows how to restore them and has tested restoring them.
Database code is code, there is no excuse for not keeping it in source control just like the rest of your code.
Evolutionary Database Design. http://martinfowler.com/articles/evodb.html
These agile methodologies make database change process manageable, predictable and testable.
Developers should know what it takes to refactor a production database in terms of version control, continuous integration and automated testing.
The Evolutionary Database Design process has administrative aspects; for example, a column is to be dropped after some lifetime period in all databases of this codebase.
At least know that the Database Refactoring concept and methodologies exist:
http://www.agiledata.org/essays/databaseRefactoringCatalog.html
The classification and process descriptions make it possible to implement tooling for these refactorings too.
About the following comment to Walter M.'s answer:
"Very well written! And the historical perspective is great for people who weren't doing database work at that time (i.e. me)".
The historical perspective is in a certain sense absolutely crucial. "Those who forget history are doomed to repeat it." Cf. XML repeating the hierarchical mistakes of the past, graph databases repeating the network mistakes of the past, OO systems forcing the hierarchical model upon users while everybody with even just a tenth of a brain should know that the hierarchical model is not suitable for general-purpose representation of the real world, etcetera, etcetera.
As for the question itself:
Every database developer should know that "relational" is not equal to "SQL". Then they would understand why they are being let down so abysmally by the DBMS vendors, and why they should be telling those same vendors to come up with better stuff (e.g. DBMSs that are truly relational) if they want to go on sucking hilarious amounts of money out of their customers for such crappy software.
And every database developer should know everything about the relational algebra. Then there would no longer be a single developer left who had to post these stupid "I don't know how to do my job and want someone else to do it for me" questions on Stack Overflow anymore.
From my experience with relational databases, every developer should know:
- The different data types:
Using the correct type for the correct job will make your DB design more robust, your queries faster and your life easier.
- Learn about 1xM and MxM:
This is the bread and butter of relational databases. You need to understand one-to-many and many-to-many relations and apply them when appropriate.
- "K.I.S.S." principle applies to the DB as well:
Simplicity always works best. Provided you have studied how DBs work, you will avoid unnecessary complexity, which leads to maintenance and speed problems.
- Indices:
It's not enough to know what they are. You need to understand when to use them and when not to.
also:
Boolean algebra is your friend
Images: Don't store them on the DB. Don't ask why.
Test DELETE with SELECT
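A sketch of the last point, with a hypothetical orders table: run the predicate as a SELECT first, then reuse the identical WHERE clause for the DELETE.

    -- Eyeball exactly which rows will go away...
    SELECT * FROM orders WHERE status = 'cancelled' AND order_date < '2010-01-01';

    -- ...and only then delete with the very same predicate.
    DELETE FROM orders WHERE status = 'cancelled' AND order_date < '2010-01-01';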
I would like everyone, both DBAs and developers/designers/architects, to better understand how to properly model a business domain, and how to map/translate that business domain model into a normalized database logical model, an optimized physical model, and an appropriate object-oriented class model, each of which is (or can be) different, for various reasons, and to understand when, why, and how they are (or should be) different from one another.
I would say strong basic SQL skills. I've seen a lot of developers so far who know a little about databases but are always asking for tips about how to formulate a quite simple query. Queries are not always that easy and simple. You do have to use multiple joins (inner, left, etc.) when querying a well normalized database.
I think a lot of the technical details have been covered here and I don't want to add to them. The one thing I want to say is more social than technical: don't fall for the "DBA knows best" trap as an application developer.
If you are having performance issues with a query, take ownership of the problem too. Do your own research and push for the DBAs to explain what's happening and how their solutions are addressing the problem.
Come up with your own suggestions too after you have done the research. That is, I try to find a cooperative solution to the problem rather than leaving database issues to the DBAs.
Simple respect.
It's not just a repository
You probably don't know better than the vendor or the DBAs
You won't support it at 3 a.m. with senior managers shouting at you
Consider Denormalization as a possible angel, not the devil, and also consider NoSQL databases as an alternative to relational databases.
Also, I think the entity-relationship model is a must-know for every developer, even if you don't design databases. It will let you understand thoroughly what your database is all about.
Never insert data with the wrong text encoding.
Once your database becomes polluted with multiple encodings, the best you can do is apply some combination of heuristics and manual labor.
Aside from syntax and conceptual options they employ (such as joins, triggers, and stored procedures), one thing that will be critical for every developer employing a database is this:
Know how your engine is going to perform the query you are writing with specificity.
The reason I think this is so important is simply production stability. You should know how your code performs so you're not stopping all execution in your thread while you wait for a long function to complete, so why would you not want to know how your query will affect the database, your program, and perhaps even the server?
This is actually something that has hit my R&D team more times than missing semicolons or the like. The presumption is that the query will execute quickly because it does on their development system with only a few thousand rows in the tables. Even if the production database is the same size, it is more than likely going to be used a lot more, and thus suffer from other constraints like multiple users accessing it at the same time, or something going wrong with another query elsewhere, thus delaying the result of this query.
Even simple things like how joins affect performance of a query are invaluable in production. There are many features of many database engines that make things easier conceptually, but may introduce gotchas in performance if not thought of clearly.
Know your database engine execution process and plan for it.
For a middle-of-the-road professional developer who uses databases a lot (writing/maintaining queries daily or almost daily), I think the expectation should be the same as any other field: You wrote one in college.
Every C++ geek wrote a string class in college. Every graphics geek wrote a raytracer in college. Every web geek wrote interactive websites (usually before we had "web frameworks") in college. Every hardware nerd (and even software nerds) built a CPU in college. Every physician dissected an entire cadaver in college, even if she's only going to take my blood pressure and tell me my cholesterol is too high today. Why would databases be any different?
Unfortunately, they do seem different, today, for some reason. People want .NET programmers to know how strings work in C, but the internals of your RDBMS shouldn't concern you too much.
It's virtually impossible to get the same level of understanding from just reading about them, or even working your way down from the top. But if you start at the bottom and understand each piece, then it's relatively easy to figure out the specifics for your database. Even things that lots of database geeks can't seem to grok, like when to use a non-relational database.
Maybe that's a bit strict, especially if you didn't study computer science in college. I'll tone it down some: You could write one today, completely, from scratch. I don't care if you know the specifics of how the PostgreSQL query optimizer works, but if you know enough to write one yourself, it probably won't be too different from what they did. And you know, it's really not that hard to write a basic one.
The order of columns in a non-unique index is important.
The first column should be the column that has the most variability in its content (i.e., the highest cardinality).
This is to aid SQL Server's ability to create useful statistics on how to use the index at runtime.
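A sketch with hypothetical names, following the advice above:

    -- Hypothetical table: customers(customer_id, last_name, state, is_active).
    -- last_name has far more distinct values than is_active, so it leads the composite index.
    CREATE INDEX ix_customers_name_active
        ON customers (last_name, is_active);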
Understand the tools that you use to program the database!!!
I wasted so much time trying to understand why my code was mysteriously failing.
If you're using .NET, for example, you need to know how to properly use the objects in the System.Data.SqlClient namespace. You need to know how to manage your SqlConnection objects to make sure they are opened, closed, and when necessary, disposed properly.
You need to know that when you use a SqlDataReader, it is necessary to close it separately from your SqlConnection. You need to understand how to keep connections open when appropriate and how to minimize the number of hits to the database (because they are relatively expensive in terms of computing time).
Basic SQL skills.
Indexing.
Deal with different incarnations of DATE/TIME/TIMESTAMP.
JDBC driver documentation for the platform you are using.
Deal with large object data types (CLOB, BLOB, etc.).
For some projects, an object-oriented model is better.
For other projects, a relational model is better.
The impedance mismatch problem, and the common deficiencies of ORMs.
RDBMS Compatibility
Check whether the application needs to run on more than one RDBMS. If so, it might be necessary to:
avoid RDBMS SQL extensions
eliminate triggers and stored procedures
follow strict SQL standards
convert field data types
change transaction isolation levels
Otherwise, these questions should be treated separately and different versions (or configurations) of the application would be developed.
Don't depend on the order of rows returned by an SQL query.
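In other words, if an order matters, state it explicitly; the engine is otherwise free to return rows however it likes (hypothetical table):

    SELECT customer_id, name
    FROM   customers
    ORDER  BY name;   -- the only guaranteed ordering is the one you ask for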
Three (things) is the magic number:
Your database needs version control too.
Cursors are slow and you probably don't need them.
Triggers are evil*
*almost always

How many tables/sprocs/functions in a database is too many?

I'm interested in database refactoring. I deal with several databases that don't have a large amount of data, just a few GB with at most a few hundred thousand rows. However, they have hundreds -- sometimes many hundreds -- of tables, views, sprocs and functions. In some places a divide-and-rule strategy using schemas has been implemented which has helped some problems of seeing ownership/usage of tables. However, it hasn't really helped object coupling.
We all read that integration via shared database isn't A Good Thing, but we also know that it is, at least for a while, a very productive thing, as everything is in the database. We just don't apply the Single Responsibility Principle to databases like we do to objects.
Edit: I should add that I have no database performance issues. The tables are not large; the biggest has only a few hundred thousand rows. There is no real database performance issue except when the database schema/logic/implementation is grotesquely inefficient (say, requiring a cursor to execute a sproc for each row in a result set in order to pre-process data for a report). Before you say I should change these, that is the whole point: I can't, because the database is no longer in a state where the impact of changes can be assessed.
Clearly at some point you say "Enough!" and divide into multiple databases connected by messages, ETL, application tiers, etc.
The question is: how many is too many? What is the absolute upper limit of the number of sprocs/tables/functions that you can have before you go insane?
First, stop trying to think of databases in object oriented terms. Principles of object oriented programming simply do NOT apply to relational databases.
Shared databases are a very good thing from a business perspective. Multiple databases storing information that has to be transferred between them quickly becomes way more complex than your piddly many hundreds of objects. Data that is consistent between enterprise applications is priceless. Trying to reconcile if GE Corp and General Electric Corporation are really the same entity between two databases can be a nightmare.
Refactoring databases is a nice goal, but it is very complex in reality. Don't do it unless you have a major performance issue that needs to be addressed or unless you are willing to commit to a process of identifying all the code that might be affected by a change. Even then, consider whether you can know all the code that might change (this is one reason why database people hate, hate, hate dynamic code!).
Often the best way to refactor is to add your change and start switching over to using your new field, sp, etc. while leaving the old one in place until a set expiration date. Since you are on an annual cycle, you will need to manage those dates over a long period of time. To see whether sps are being used, you can identify the ones you aren't sure of and add some code to them to insert into a table every time they are run (a sketch follows). If, after your whole year-long cycle, they haven't been run, you can safely eliminate them. The cycle may be shorter depending on the sp.
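A sketch of that instrumentation, in T-SQL-flavoured syntax with hypothetical names:

    -- One-off logging table.
    CREATE TABLE dbo.sproc_usage_log (
        sproc_name  sysname   NOT NULL,
        run_at      datetime  NOT NULL DEFAULT GETDATE()
    );

    -- Add a line like this at the top of each stored procedure you suspect is dead:
    --   INSERT INTO dbo.sproc_usage_log (sproc_name) VALUES ('usp_annual_attendance_report');

    -- After a full business cycle, whatever never appears in the log is a removal candidate.
    SELECT DISTINCT sproc_name FROM dbo.sproc_usage_log;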
If I'm writing something that will only be run annually, I would normally put the word annual in the sp name. That may not be true where you are; however, the function of the sp should give you an idea of whether it is something that should only be run periodically. I wouldn't expect a usp_send_email proc to run only once a year, but I might expect that a usp_attendance_report might not be run often. Of course, as I said, I would have named it something more like usp_annual_attendance_report, and you can consider doing that sort of thing moving forward.
But be aware that any refactoring you do will have to take place on a long cycle to ensure that you don't delete something you need. If your code is in a source control system (and all database tables, sp, views, UDFs, triggers, etc should be), you can probably eliminate some things knowing that if they fail you can pretty instantly put them back. Again, I'd examine the object to determine the possible risk eliminating them would have.
Of course if you have good automated tests in place, eliminating something on dev and running the tests can help you find out if something is still being referenced.
If you are looking for an easy way to refactor, I don't know of one. Refactoring databases is a time-consuming, risky activity and one which may not show enough improvement for the powers that be to be willing to pay for it.
A good book on refactoring databases is: http://www.amazon.com/Refactoring-Databases-Evolutionary-Addison-Wesley-Signature/dp/0321293533
I'm not sure there is a magical limit for any of the things you mentioned. I prefer to keep things in one place so I don't have to remember that some records are in one place and other records are in another.
I'd be more interested to know whether all of this is impacting your performance. And if it's not, then why change it? Unless it's impacting performance in some horrible way, your customers won't see any benefit from your work, and then what's the point?
Your customers might be better served if you just bought a new machine or upgraded your database server software.

Version Controlled Database with efficient use of diff

I have a project involving a web voting system. The current values and related data is stored in several tables. Historical data will be an important aspect of this project so I've also created Audit Tables to which current data will be moved to on a regular basis.
I find this strategy highly inefficient. Even if I only archive data on a daily basis, the number of rows will become huge even if only 1 or 2 users make updates on a given day.
The next alternative I can think of is only storing entries that have changed. This will mean having to build logic to automatically create a view of a given day. This means less stored rows, but considerable complexity.
My final idea is a bit less conventional. Since the historical data will be for reporting purposes, there's no need for web users to have quick access. I'm thinking that my db could have no historical data in it. DB only represents current state. Then, daily, the entire db could be loaded into objects (number of users/data is relatively low) and then serialized to something like XML or JSON. These files could be diffed with the previous day and stored. In fact, SVN could do this for me. When I want the data for a given past day, the system has to retrieve the version for that day and deserialize into objects. This is obviously a costly operation but performance is not so much a concern here. I'm considering using LINQ for this which I think would simplify things. The serialization procedure would have to be pretty organized for the diff to work well.
Which approach would you take?
Thanks
If you're basically wondering how revisions of data are stored in relational databases, then I would look into how wikis do it.
Wikis are all about keeping detailed revision history. They use simple relational databases for storage.
Consider Wikipedia's database schema.
All you've told us about your system is that it involves votes. As long as you store timestamps for when votes were cast you should be able to generate a report describing the vote state tally at any point in time... no?
For example, say I have a system that tallies favorite features (eyes, smile, butt, ...). If I want to know how many votes there were for a particular feature as of a particular date, then I would simply tally all the votes for the feature with a timestamp smaller or equal to that date.
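A sketch of that tally, assuming a hypothetical vote table with a cast_at timestamp:

    -- "How many votes did feature 7 have as of the end of 2010-06-30?"
    SELECT COUNT(*) AS votes_as_of
    FROM   vote
    WHERE  feature_id = 7
      AND  cast_at   <= '2010-06-30 23:59:59';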
If you want to have a history of other things, then you would follow a similar approach.
I think this is the way it is done.
Have you considered using a real version control system rather than trying to shoehorn a database in its place? I myself am quite partial to git, but there are many options. They all have good support for differences between versions, and they tend to be well optimised for this kind of workload.

Why do we need a temporal database?

I was reading about temporal databases and it seems they have built-in time aspects. I wonder why we would need such a model.
How different is it from a normal RDBMS? Can't we have a normal database, i.e. an RDBMS, and say have a trigger which associates a time stamp with each transaction that happens? Maybe there would be a performance hit, but I'm still skeptical about temporal databases having a strong case in the market.
Do any of the present databases support such a feature?
Consider your appointment/journal diary - it goes from Jan 1st to Dec 31st. Now we can query the diary for appointments/journal entries on any day. This ordering is called the valid time. However, appointments/entries are not usually inserted in order.
Suppose I would like to know what appointments/entries were in my diary on April 4th. That is, all the records that existed in my diary on April 4th. This is the transaction time.
Given that appointments/entries can be created and deleted, etc., a typical record has a beginning and end valid time that covers the period of the entry, and a beginning and end transaction time that indicates the period during which the entry appeared in the diary.
This arrangement is necessary when the diary may undergo historical revision. Suppose on April 5th I realise that the appointment I had on Feb 14th actually occurred on February 12th i.e. I discover an error in my diary - I can correct the error so that the valid time picture is corrected, but now, my query of what was in the diary on April 4th would be wrong, UNLESS, the transaction times for appointments/entries are also stored. In that case if I query my diary as of April 4th it will show an appointment existed on February 14th but if I query as of April 6th it would show an appointment on February 12th.
This time travel feature of a temporal database makes it possible to record information about how errors are corrected in a database. This is necessary for a true audit picture of data that records when revisions were made and allows queries relating to how data have been revised over time.
Most business information should be stored in this bitemporal scheme in order to provide a true audit record and to maximise business intelligence - hence the need for support in a relational database. Notice that each data item occupies a (possibly unbounded) square in the two-dimensional time model, which is why people often use a GiST index to implement bitemporal indexing. The problem here is that a GiST index is really designed for geographic data, and the requirements for temporal data are somewhat different.
PostgreSQL 9.0 exclusion constraints should provide new ways of organising temporal data e.g. transaction and valid time PERIODs should not overlap for the same tuple.
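A sketch of what that looks like in PostgreSQL (exclusion constraints arrived in 9.0, native range types in 9.2; the table and column names here are illustrative):

    -- btree_gist lets plain equality on account_id share the GiST index with the range.
    CREATE EXTENSION IF NOT EXISTS btree_gist;

    CREATE TABLE account_balance_history (
        account_id  INT     NOT NULL,
        balance     NUMERIC NOT NULL,
        valid_time  TSRANGE NOT NULL,
        -- No two rows for the same account may have overlapping valid-time periods.
        EXCLUDE USING gist (account_id WITH =, valid_time WITH &&)
    );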
A temporal database efficiently stores a time series of data, typically by having some fixed timescale (such as seconds or even milliseconds) and then storing only changes in the measured data. A timestamp in an RDBMS is a discretely stored value for each measurement, which is very inefficient. A temporal database is often used in real-time monitoring applications like SCADA. A well-established system is the PI database from OSISoft (http://www.osisoft.com/).
As I understand it (and over-simplifying enormously), a temporal database records facts about when the data was valid as well as the data itself, and permits you to query on the temporal aspects. You end up dealing with 'valid time' and 'transaction time' tables, or 'bitemporal tables' involving both 'valid time' and 'transaction time' aspects. You should consider reading either of these two books:
Darwen, Date and Lorentzos "Temporal Data and the Relational Model" (out of print),
and (at a radically different extreme) "Developing Time-Oriented Database Applications in SQL", Richard T. Snodgrass, Morgan Kaufmann Publishers, Inc., San Francisco, July, 1999, 504+xxiii pages, ISBN 1-55860-436-7. That is out of print but available as PDF on his web site at cs.arizona.edu (so a Google search makes it pretty easy to find).
Temporal databases are often used in the financial services industry. One reason is that you are rarely (if ever) allowed to delete any data, so ValidFrom - ValidTo type fields on records are used to provide an indication of when a record was correct.
Besides "what new things can I do with it", it might be useful to consider "what old things does it unify?". The temporal database represents a particular generalization of the "normal" SQL database. As such, it may give you a unified solution to problems that previously appeared unrelated. For example:
Web Concurrency When your database has a web UI that lets multiple users perform standard Create/Update/Delete (CRUD) modifications, you have to face the concurrent web changes problem. Basically, you need to check that an incoming data modification is not affecting any records that have changed since that user last saw those records. But if you have a temporal database, it quite possibly already associates something like a "revision ID" with each record (due to the difficulty of making timestamps unique and monotonically ascending). If so, then that becomes the natural, "already built-in" mechanism for preventing the clobbering of other users' data during database updates.
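A sketch of that guard, with hypothetical names, using a revision column carried through the web form:

    -- The UPDATE succeeds only if the row is still at the revision the user last saw.
    UPDATE customer
    SET    address  = '12 New Street',
           revision = revision + 1
    WHERE  customer_id = 42
      AND  revision    = 7;   -- the revision shown on the edit form

    -- Zero rows affected means someone else changed the record first:
    -- reload it and ask the user to re-apply their edit.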
Legal/Tax Records The legal system (including taxes) places rather more emphasis on historical data than most programmers do. Thus, you will often find advice about schemas for invoices and such that warns you to beware of deleting records or normalizing in a natural way, which can lead to an inability to answer basic legal questions like "Forget their current address, what address did you mail this invoice to in 2001?" With a temporal framework as a base, all the machinations around those problems (they usually are halfway steps to having a temporal database) go away. You just use the most natural schema, and delete when it makes sense, knowing that you can always go back and answer historical questions accurately.
On the other hand, the temporal model itself is half-way to complete revision control, which could inspire further applications. For example, suppose you roll your own temporal facility on top of SQL and allow branching, as in revision control systems. Even limited branching could make it easy to offer "sandboxing" -- the ability to play with and modify the database with abandon without causing any visible changes to other users. That makes it easy to supply highly realistic user training on a complex database.
Simple branching with a simple merge facility could also simplify some common workflow problems. For example, a non-profit might have volunteers or low-paid workers doing data entry. Giving each worker their own branch could make it easy to allow a supervisor to review their work or enhance it (e.g., de-duplication) before merging it into the main branch where it would become visible to "normal" users. Branches could also simplify permissions. If a user is only granted permission to use/see their unique branch, you don't have to worry about preventing every possible unwanted modification; you'll only merge the changes that make sense anyway.
Apart from reading the Wikipedia article? A database that maintains an "audit log" or similar transaction log will have some properties of being "temporal". If you need answers to questions about who did what to whom and when then you've got a good candidate for a temporal database.
You can imagine a simple temporal database that just logs your GPS location every few seconds. The opportunities for compressing this data are great; in a normal database you would need to store a timestamp for every row. If you need a great deal of throughput, knowing that the data is temporal and that updates and deletes to a row will never be required permits the program to drop a lot of the complexity inherent in a typical RDBMS.
Despite this, temporal data is usually just stored in a normal RDBMS. PostgreSQL, for example has some temporal extensions, which makes this a little easier.
Two reasons come to mind:
Some are optimized for insert and read only and can offer dramatic perf improvements
Some have better understandings of time than traditional SQL - allowing for grouping operations by second, minute, hour, etc
Just an update: temporal tables are coming to SQL Server 2016.
To clear up your doubts about why you need a temporal database rather than configuring custom methods, and to see how efficiently and seamlessly SQL Server configures it for you, check the in-depth video and demo on Channel 9 here: https://channel9.msdn.com/Shows/Data-Exposed/Temporal-in-SQL-Server-2016
MSDN link: https://msdn.microsoft.com/en-us/library/dn935015(v=sql.130).aspx
Currently with the CTP2 (beta 2) release of SQL Server 2016 you can play with it.
Check this video on how to use Temporal Tables in SQL Server 2016.
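For reference, the DDL and a point-in-time query look roughly like this (names are illustrative; the syntax follows the SQL Server 2016 documentation linked above):

    -- System-versioned temporal table: SQL Server maintains the history table automatically.
    CREATE TABLE dbo.Employee
    (
        EmployeeId  INT           NOT NULL PRIMARY KEY CLUSTERED,
        Salary      DECIMAL(10,2) NOT NULL,
        ValidFrom   DATETIME2 GENERATED ALWAYS AS ROW START NOT NULL,
        ValidTo     DATETIME2 GENERATED ALWAYS AS ROW END   NOT NULL,
        PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
    )
    WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.EmployeeHistory));

    -- The state of the table as of a given instant.
    SELECT *
    FROM   dbo.Employee
    FOR SYSTEM_TIME AS OF '2016-06-01T00:00:00';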
My understanding of temporal databases is that they are geared towards storing certain types of temporal information. You could simulate that with a standard RDBMS, but by using a database that supports it you have built-in idioms for a lot of concepts, and the query language might be optimized for these sorts of queries.
To me this is a little like working with a GIS-specific database rather than an RDBMS. While you could shove coordinates in a run-of-the-mill RDBMS, having the appropriate representations (e.g., via grid files) may be faster, and having SQL primitives for things like topology is useful.
There are academic databases and some commercial ones. Timecenter has some links.
Another example of where a temporal database is useful is where data changes over time. I spent a few years working for an electricity retailer where we stored meter readings for 30 minute blocks of time. Those meter readings could be revised at any point but we still needed to be able to look back at the history of changes for the readings.
We therefore had the latest reading (our 'current understanding' of the consumption for the 30 minutes) but could look back at our historic understanding of the consumption. When you've got data that can be adjusted in such a way temporal databases work well.
(Having said that, we hand-carved it in SQL, but it was a fair while ago. I wouldn't make that decision these days.)
