Are there any database implementations which keep all history?

Using a version control system for your source code (like subversion) makes sense because it allows you to back out of mistakes, audit changes, make painless snapshots, discover exactly where something went wrong so that you can improve your process etc. For the same reasons it makes sense to do change tracking of business data, and many systems do so.
There are already a few questions on how to implement this on top of a normal database:
Database structure to track change history
Maintain history in a database
Database history for client usage
How to version control a record in a database
...
For a feature that is so useful and popular, it seems strange that we all need to reinvent the wheel. Are there any existing database implementations which already solved this problem? I'm imagining that such a system would extend the SQL syntax to allow easy querying of the history.

Take a look at temporal databases, such as TimeDB.
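TimeDB implements the TSQL2 proposal on top of a conventional RDBMS. To give a feel for the kind of extended SQL the question imagines, here is a minimal sketch using the SQL:2011 system-versioning syntax that some engines expose (SQL Server calls these temporal tables); the DSN, the dbo.orders table, and the period column names are hypothetical, and the exact syntax varies by vendor.

    import pyodbc

    # Sketch only: assumes a SQL Server instance with a system-versioned
    # (temporal) table named dbo.orders; DSN, credentials and columns are
    # hypothetical.
    conn = pyodbc.connect("DSN=sales;UID=reporting;PWD=secret")
    cur = conn.cursor()

    # Current state of a row.
    cur.execute("SELECT total FROM dbo.orders WHERE order_id = ?", 42)
    print(cur.fetchone())

    # The same row as it looked at a point in the past.
    cur.execute(
        "SELECT total FROM dbo.orders "
        "FOR SYSTEM_TIME AS OF '2010-06-01T00:00:00' "
        "WHERE order_id = ?", 42)
    print(cur.fetchone())

    # Every retained version of the row.
    cur.execute(
        "SELECT total, valid_from, valid_to FROM dbo.orders "
        "FOR SYSTEM_TIME ALL WHERE order_id = ? ORDER BY valid_from", 42)
    for row in cur.fetchall():
        print(row)

The point is that the engine keeps the history and the SELECT syntax is extended to reach it, rather than every application rolling its own audit tables.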

Not a relational database (you didn't say it had to be), but CouchDB has versioning built-in.

The space requirements would be prohibitive, so this is why you typically roll your own.
There are different solutions, depending on your toolkit:
Hibernate Envers, which plugs into Hibernate;
HBase, which has limited versioning built-in.

As far as the data goes, I believe it's called "change data capture".
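In SQL Server, change data capture is switched on with system stored procedures rather than hand-written triggers. A hedged sketch follows (dbo.orders is a hypothetical table, and the feature needs an edition and permissions that support CDC):

    import pyodbc

    # Sketch only: requires an edition/permissions that support CDC;
    # dbo.orders is a hypothetical table.
    conn = pyodbc.connect("DSN=sales;UID=dba;PWD=secret", autocommit=True)
    cur = conn.cursor()

    # Enable CDC for the database (once per database)...
    cur.execute("EXEC sys.sp_cdc_enable_db")

    # ...then start capturing changes for a specific table.
    cur.execute(
        "EXEC sys.sp_cdc_enable_table "
        "@source_schema = N'dbo', @source_name = N'orders', @role_name = NULL")

    # Changes then accumulate in the generated cdc.dbo_orders_CT table and can
    # be read via cdc.fn_cdc_get_all_changes_dbo_orders.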

Given that most countries require that all accounting transactions are logged, pretty well every database lets you record the history for auditing.

Related

Using LDAP server as a storage base, how practical is it?

I want to learn how practical it is to use an LDAP server (say AD) as a storage base. To be more clear: how much sense does it make to use an LDAP server instead of an RDBMS to store data?
I can guess that most of you might just say "it doesn't", but there might be some reasons that make it meaningful (especially business-wise).
A few points first;
Each table becomes a container entity and each row becomes a new entity as a child. Row entities contain attributes for the columns. So you represent your data in this way. (This seems like the most meaningful representation to me; suggestions are welcome.)
So storing data the way a DB server does is possible, but the lack of FK support (and possibly PK support, I'm not sure) is an issue. On the other hand, LDAP supports indexing on attributes (which correspond to columns), though I'm not sure how efficient it is. So consistency of the data is the responsibility of the application layer.
Why would somebody do this ever?
The data the application uses/stores closely matches the existing data in AD (users, machines, department info, etc.). (Some customization of the existing entity schema is still required, though, and new schema definitions are needed for data that isn't closely related.)
(I think the strongest reason is this business-related one.) Most mid-sized companies have very well configured AD servers (replicated, backed up, etc.), but they don't have a comparable DB setup (comment on this as much as you want). When you sell software that requires a DB setup to these companies, they have to manage that DB setup; but if you can say "you don't need DB setup and management; you can just use your existing AD", it sounds appealing.
Obviously there are many disadvantages to giving up the DB; feel free to mention them, but let's assume they are acceptable. (I can elaborate if the question is not clear enough.)
LDAP is a terrible tool for maintaining most business data.
Think about a typical one-to-many relationship - say, customer and orders. One customer has many orders.
There is no good way to represent this data in an LDAP directory.
You could try having a mock "foreign key" by making every entry of that given object class have a "foreign key" attribute, but your referential integrity just went out the window. Cascade deletes are impossible.
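To make that concrete, here is a minimal sketch using the Python ldap3 library; the directory layout, object classes, and credentials are all hypothetical. Note that deleting the customer succeeds even though an order still points at it.

    from ldap3 import Server, Connection, ALL

    # Sketch only: hypothetical directory layout, attributes and credentials.
    server = Server("ldap.example.com", get_info=ALL)
    conn = Connection(server, user="cn=admin,dc=example,dc=com",
                      password="secret", auto_bind=True)

    customer_dn = "o=Acme Ltd,ou=customers,dc=example,dc=com"
    conn.add(customer_dn, "organization", {"o": "Acme Ltd"})

    # The "foreign key" is just a DN stored as a string attribute on the order.
    conn.add("cn=order-1001,ou=orders,dc=example,dc=com",
             "device",   # stand-in structural class; a custom one in practice
             {"cn": "order-1001", "seeAlso": customer_dn,
              "description": "2 widgets"})

    # Nothing enforces referential integrity: this delete succeeds even though
    # order-1001 still "references" the customer, and nothing cascades.
    conn.delete(customer_dn)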
You could try having a "customer" object that has "order" children. However, you've just introduced a specific hierarchy, and you're now tied to it.
And that's the simplest use case. Once you start getting into more complex relationships, you're basically re-inventing an RDBMS in a system explicitly designed for a different purpose. The clue's in the name: directory.
If you're storing a phonebook, then sure, use LDAP. For anything else, use a real database.
For relatively small, flexible data sets I think an LDAP solution is workable. However, an RDBMS provides a number of significant advantages:
Backup and recovery: just about any database will provide ACID properties, and RDBMS backups are generally easy to script and provide several options (e.g. full vs. differential). I just don't know with LDAP, but I imagine these qualities are not as widespread.
Reporting: AFAIK LDAP doesn't offer a way to JOIN values easily, much less do things like calculate summations. So you would put a lot of effort into application code to reproduce those behaviors when you do need reporting. And what application doesn't, ultimately?
Indexing: it looks like LDAP solutions have indexing, but again, it seems hit or miss, whereas seemingly every database out there has put real effort into getting this right.
I think any serious business system's storage should be backed up in the same fashion you believe LDAP is in most environments. If what you're really after is its flexibility in terms of representing hierarchy and ability to define dynamic schemas I'd suggest looking into NoSQL solutions or the Java Content Repository.
LDAP is perfectly usable for storing that kind of information, and if you want to, you may use it. An RDBMS is just more comfortable to work with through ORM systems; your persistence logic with LDAP will be much more complex.
It's also worth mentioning that this is not a standard approach, so the people who will support the project will spend more time on analysis.
I've used this approach for fun (I generate a phonebook from Active Directory), but I don't think it's a good idea to use LDAP as a store for business applications.
In short: Use the right tool for the right job.
When people see LDAP, you have already set an expectation for your system. Don't forget what the L stands for: Lightweight. LDAP was designed for accessing directories over a network.
With a "directory database" you can build a certain type of application. If you can map your data to a tree-like data structure, it will work. I surely would not want to stream videos from LDAP! You could probably hack something together, but I would prefer a streaming server.
There might be some hidden gotchas down the line if you use a tool for something it was not designed to do. So the downside is you'll have to test things that would otherwise have been a given.
It's not just a technical concern, either. Your operational support team might "frown" on your application, as they will have certain expectations/preconceptions based on its architectural nature. Imagine their surprise if you give them a CRM system (website + files and popped email etc.) with an LDAP server as its database to maintain.
If I was in your position, I would steer towards one of the NoSQL db solutions rather than trying to use LDAP. LDAP is fine for things like storing user and employee information, but is terrible to interact with when you need to make changes. A NoSQL db will allow you to store your data how you want without the RDBMS overhead you would like to avoid.
The answer is actually easy. Think of CRUD (Create, Read, Update, Delete). If your system will mostly perform reads, you can consider using LDAP, because LDAP is quick at read operations and was designed for them. If the other operations dominate, an RDBMS would be a better option.

CouchDB Versioning / Auditing

I'm attempting to use CouchDB for a system that requires full auditing of all data operations. Because of its built in revision-tracking, couch seemed like an ideal choice. But then I read in the O'Reilly textbook that "CouchDB does not guarantee that older versions are kept around."
I can't seem to find much more documentation on this point, or how couch deals with its revision-tracking internally. Is there any way to configure couch either on a per-database, or per-document level to keep all versions around forever? If so, how?
The revisions in CouchDB are not revisions in the way you are thinking of them. They are an artifact of the way it appends updated data to the database, and are cleaned up upon compaction. This is a common misunderstanding.
You need to implement the revision-tracking as part of the schema/document design of your application.
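One way to do that (a sketch only; the database name, document IDs, and fields are made up) is to write an immutable snapshot document before every update, talking to CouchDB over plain HTTP:

    import requests, time

    # Sketch only: keep your own immutable history documents instead of relying
    # on CouchDB's internal _rev, which compaction is free to discard.
    COUCH = "http://localhost:5984"
    DB = "invoices"                      # hypothetical database name

    def update_with_history(doc_id, changes):
        # Fetch the current version of the document.
        current = requests.get(f"{COUCH}/{DB}/{doc_id}").json()

        # Write a snapshot under a new, never-overwritten ID. These audit
        # documents are only ever created, so compaction cannot lose them.
        snapshot_id = f"{doc_id}::v{int(time.time() * 1000)}"
        snapshot = {k: v for k, v in current.items() if not k.startswith("_")}
        snapshot.update({"type": "audit", "source": doc_id})
        requests.put(f"{COUCH}/{DB}/{snapshot_id}", json=snapshot)

        # Now apply the change to the live document (must include _rev).
        current.update(changes)
        requests.put(f"{COUCH}/{DB}/{doc_id}", json=current)

    update_with_history("invoice-17", {"status": "paid"})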
Couch does store all versions, but if you click the "compact database" link in Futon, all previous versions will be deleted. So if you never click "compact database", all versions will be preserved, I think. :)
There are two situations when the previous versions of documents in CouchDB are removed:
Replication
Compaction
Thus, if you do not need to store lots of data (meaning terabytes), you probably do not need replication. Anyway, master-to-master replication is one of CouchDB's most important features. The on-disk size of CouchDB is larger than that of a traditional database, so you will probably need compaction in the future.
As mentioned above: you need to implement the revision-tracking as part of the schema/document design of your application.

Where are all the native revisioned databases?

I've read all the SO questions, the Coding Horror articles, and Googled my brains off searching for the best ways to revision control data. They all work and they all have their appropriate implementations based on use cases and so on. What I really want to know is why hasn't a database been written to natively support revisioning on the data-level?
What I am baffled with is that the API is already practically in place with transactions. We start a transaction, change some data, and commit. We are authenticating against the database too so blame is present. My company stores end of month versions of our entire database for accounting purposes which equate to tags. Does this not scream RCS?
Branching is something that databases could benefit from greatly too in regards to schema more than data. Since I really only care about data and this would increase the difficulty of implementation by a massive degree I'll stick to just tags and commits.
Now I know that databases are incredibly time-critical applications so any unnecessary overhead is shunned into oblivion and some databases are epic-level huge and revisions will only exponentiate that size. A per-table, opt-in revision control undoubtedly has a place in small to medium scale environments where there are milliseconds to spare and data history has a degree of importance. I want commits, I want logs, I want reverts, I want diffs, I want blame, I want tags, and I want checkouts. I want MF-ing revision control.
I have a question in there somewhere...
One native solution is Oracle's Flashback Data Archive (aka Total Recall). It is a chargeable extra to the Enterprise Edition, but it is pretty cool. It transparently stores versions of the data for as long as we want to retain them, and supplies syntax to query old versions of the data. It can be enabled on a table-by-table basis.
Essentially Flashback is like using triggers to store records in tracking tables, but slick, performant, and invisible to normal operation.
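For a rough idea of what that query syntax looks like from application code, here is a hedged sketch using the python-oracledb driver; the connection details and the orders table are hypothetical, and the history has to fall inside the configured retention window.

    import oracledb

    # Sketch only: assumes Flashback has been enabled for the (hypothetical)
    # orders table and the retention window covers the timestamps queried.
    conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
    cur = conn.cursor()

    # How the order looked 24 hours ago.
    cur.execute("""
        SELECT status, total
        FROM   orders AS OF TIMESTAMP (SYSTIMESTAMP - INTERVAL '1' DAY)
        WHERE  order_id = :id""", id=101)
    print(cur.fetchone())

    # Every version of the row over the last week, with change metadata.
    cur.execute("""
        SELECT versions_starttime, versions_operation, status, total
        FROM   orders VERSIONS BETWEEN TIMESTAMP
               (SYSTIMESTAMP - INTERVAL '7' DAY) AND SYSTIMESTAMP
        WHERE  order_id = :id""", id=101)
    for row in cur:
        print(row)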
You could read about temporal databases.
In "Temporal Data & the Relational Model" by Date, Darwen, and Lorentzos, the authors introduce a sixth normal form to account for issues in tracking temporal data.
Richard Snodgrass proposed TSQL2 as an extension to SQL to handle temporal data.
Implementations include:
Oracle Workspace Manager
TimeDB
Several DBMSs implement engine-level versioning mechanisms. Unfortunately there is no vendor-independent standard for this so they are all proprietary. Oracle flashback has already been mentioned. Microsoft's Change Data Capture feature in SQL Server is another one.
You forgot I want performance. A DBMS is a pretty low level data storage mechanism, and in systems with billions of rows, performance can be important. Therefore, if you want this sort of auditing system, you can build it yourself using the tools available to you (eg. triggers).
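For completeness, a minimal sketch of that roll-your-own trigger approach, issued through pyodbc; the orders and orders_history tables and their columns are hypothetical.

    import pyodbc

    # Sketch only: a shadow-table audit trigger. Every UPDATE/DELETE on
    # dbo.orders copies the old row into dbo.orders_history with a timestamp.
    conn = pyodbc.connect("DSN=sales;UID=dba;PWD=secret", autocommit=True)
    cur = conn.cursor()

    cur.execute("""
        CREATE TABLE dbo.orders_history (
            order_id   INT,
            status     NVARCHAR(20),
            total      DECIMAL(10, 2),
            changed_at DATETIME2 DEFAULT SYSUTCDATETIME(),
            changed_by SYSNAME   DEFAULT SUSER_SNAME()
        )
    """)

    cur.execute("""
        CREATE TRIGGER dbo.trg_orders_audit
        ON dbo.orders
        AFTER UPDATE, DELETE
        AS
        BEGIN
            INSERT INTO dbo.orders_history (order_id, status, total)
            SELECT order_id, status, total FROM deleted;
        END
    """)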
Just as in a filesystem, not all files are appropriate for version control, in a database not all rows would be appropriate for version control either.

Defining the database schema in the application or in the database?

I know that the title might sound a little contradictory, but what I'm asking is with regards to ORM frameworks (SQLAlchemy in this case, but I suppose this would apply to any of them) that allow you to define your schema within your application.
Is it better to change the database schema directly and then update the column types in your program manually, or does it make more sense to define the tables in your application and then use the ORM framework's table generation functions to make the schema and then build the tables on the database side for you?
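For context, the "define it in the application" route looks roughly like this in SQLAlchemy (a sketch; the table and columns are made up). The alternative is to write and maintain the DDL by hand and keep the classes in step with it.

    from sqlalchemy import create_engine, Column, Integer, String, Numeric
    from sqlalchemy.orm import declarative_base

    Base = declarative_base()

    # Declare the schema in Python and let SQLAlchemy emit the DDL.
    class Order(Base):
        __tablename__ = "orders"
        id       = Column(Integer, primary_key=True)
        customer = Column(String(100), nullable=False)
        total    = Column(Numeric(10, 2))

    # Generates and executes CREATE TABLE statements for anything missing.
    engine = create_engine("sqlite:///app.db")
    Base.metadata.create_all(engine)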
Bear in mind that applications and databases tend to live in an M:M relationship in any but the most trivial cases. If your application is at all likely to have interfaces to other systems, reports, data extracts or loads, or data migrated onto or off it from another system, then the database has more than one stakeholder.
Be nice to the other stakeholders in your application. Take the time and get the schema right and put some thought into data quality in the design of your application. Keep an eye on anyone else using the application and make sure you don't break bits of the schema that they depend on without telling them. This means that the database has a life of its own to a greater or lesser extent. The more integration, the more independent the database.
Of course, if nobody else uses or cares about the data, feel free to ignore my advice.
My personal belief is that you should design the database on its own merits. The database is the best place to handle modeling your domain data. It is also the biggest source of slowdowns in applications, and letting your ORM design your database seems like a bad idea to me. :)
Of course, I've only got a couple of big projects behind me. I'm still learning daily. :)
The best way to define your database schema is to start with modeling your application domain (domain driven design anyone?) and seeing what tables take shape based on the domain objects you define.
I think this is the best way because the database is really just a place to persist information from the application; it should never lead the design. It's not the only place to persist information, either: we have users that want to work from flat files or the database, for instance, and they could also use XML files. So starting with your domain objects and then generating tables (or flat-file or XML schemas, or whatever) from there will lead to a much better design in the end.
While this may depend on you using an object-oriented language, an ORM tool like Hibernate/NHibernate, SubSonic, etc. can really make this transition easy for you, up to and including generating the database creation scripts.
As for performance, it should be one of the last things you look at in an application; it should never drive the design. After you get a good schema up and running based on your domain, you can always make tweaks to improve its performance.
A lot depends on your skill level with the specific database product you're going to use. Think of it as the difference between a manual and an automatic transmission car. ORMs give you that "automatic" transmission: just start designing your classes and let the ORM worry about getting them stored in the database somehow.
Sounds good. The problem with most ORMs is that in their quest to be PI ("persistence ignorant"), they often don't take advantage of specific database features that can provide elegant solutions for a given task. Notice, I didn't say ALL ORMs, just most.
My take is to design the conceptual data model first yourself. Then you can go in either direction, up towards the application space, or down towards the physical database. But remember, only YOU know if it's more advantageous to use a view instead of a table, should you normalize or de-normalize a table, what non-clustered index(es) make sense with this table, is a natural or surrogate key more appropriate for this table, etc... Of course, if you feel that these questions are beyond your grasp, then let the ORM help you out.
One more thing: you really need to separate the application design from the database design. They are almost never the same. How important is that data? Could another application be designed to use that data? It's a lot easier to refactor an application than it is to refactor a database with a billion rows of data spread across thousands of tables.
Well, if you can get away with it, doing it in the application is probably the best way, since it's a perfect example of the DRY principle.
Having said that, getting away with it is always going to be hard to pull off, since you're practically choosing to give up most database-specific optimizations (more so with querying, but it still applies to schemas: indexes, etc.).
You'll probably end up changing the schema by hand anyway, and then you'll be stuck with a brittle database schema that's going to be the source of your worst nightmares :)
My 2 Cents
Design each based on its own requirements as much as possible. Trying to keep them too rigidly in sync is a good illustration of increased coupling / decreased cohesion.
Come to think of it, ORMs can easily be used to spread coupling (even though it can be avoided to some degree).

The best technology to synchronize data between different database schemas?

I have an existing SQL Server 2005 database that runs our accounting/inventory application. We are looking at using a new on-line ordering framework, which has its own database.
If we use this new framework, we will need to transfer the on-line ordering data (inventory, prices, orders, customers), almost in real time, to and from our existing inventory database. The transfer of data doesn't have to be real-time, but it has to be quick. Both databases will be in SQL Server.
So my question is... what is the best way to transfer data back and forth between two databases that have different schemas?
Replication? SSIS? What would you suggest, and why?
Any help would be appreciated!
The Business Rules are the Hard Part
One-way sync? Two-way sync? Real-time push? Nightly updates? Dump and reload? Compare and update? Conflict resolution? Which side wins? Push read-only info one way, and order info the other way? What about changes/cancellations/etc? Do order statuses get pushed back?
You can see where I'm going here. Technology is a secondary question.
Because of the business rules issue, and because the two systems have different schemas (and different purposes), this isn't a standard data move, and most of the "standard" answers (replication, log shipping, etc) are off the table.
There are frameworks out there designed to help with this, like Microsoft BizTalk or Scribe Insight. These are cumbersome and expensive, though.
It isn't too difficult to create a custom queueing system based either on SQL triggers or on scheduled pushes (depending on your needs), in C# or your favorite language. That's probably the route I would go. It would probably involve a third "transfer" database to hold the queue of changes made by one side, and a module to apply the business rules and push the data to the other.
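A bare-bones sketch of that scheduled-push idea (every table, column, and connection string here is hypothetical): a trigger on the source side fills a queue table in the "transfer" database, and a scheduled job drains it, applies the mapping between the two schemas, and writes to the other side.

    import pyodbc

    # Sketch only: a trigger on the source system fills transfer.dbo.change_queue;
    # this scheduled job drains it, maps between the two schemas and pushes the
    # rows to the on-line ordering database. All names are hypothetical.
    src = pyodbc.connect("DSN=transfer;UID=sync;PWD=secret")
    dst = pyodbc.connect("DSN=ordering;UID=sync;PWD=secret")

    read = src.cursor()
    mark = src.cursor()
    write = dst.cursor()

    read.execute(
        "SELECT queue_id, sku, new_quantity FROM dbo.change_queue "
        "WHERE processed = 0 ORDER BY queue_id")

    for queue_id, sku, new_quantity in read.fetchall():
        # The business-rule mapping lives here: different schema, different names.
        write.execute(
            "UPDATE web_inventory SET qty_available = ? WHERE product_code = ?",
            new_quantity, sku)
        mark.execute(
            "UPDATE dbo.change_queue SET processed = 1 WHERE queue_id = ?",
            queue_id)

    dst.commit()
    src.commit()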
Personally, I would run away from this nightmare as fast as I could. Since you have not yet bought this online ordering framework, I would suggest that keeping the data in sync with the existing application is a valid reason for not doing such a thing. If you buy this, you will eternally regret how mucked up your data will become and how much time and money you spend trying to get things to work properly. This is a disaster waiting to happen: you'll end up having people order items supposedly in inventory when there are none in the warehouse. Do not do this. It is a guarantee of angry customers and angry managers. It is far, far cheaper over time to hire some developers to put together your own online ordering that accesses your database. If they go ahead over your objections, I'd update my resume.
From personal experience, I would only use replication if there was no other choice. You have to tear it down for any schema change and it has a tendency to just blow up.
For this, I'd most likely use SSIS. It's fairly easy to build a transformation package and fairly simple to maintain.
Replication works well, and if it's two way, it might be your only viable option, since conflict resolution is built in.
If you're going one-way, SSIS or triggers on tables would be fine, and would push the data in real time (for triggers) or at whatever interval you want (SSIS). The upside of SSIS is that it's a background process, whereas triggers could potentially hold up transactions on the supply side while they push the data.
If you're looking to move massive amounts of data, there are other products out there that can do it for you, but if it's not too much data, a solution using SQL Server's tools should do all you need.
