I'm attempting to use CouchDB for a system that requires full auditing of all data operations. Because of its built-in revision tracking, CouchDB seemed like an ideal choice. But then I read in the O'Reilly book that "CouchDB does not guarantee that older versions are kept around."
I can't seem to find much more documentation on this point, or on how CouchDB handles its revision tracking internally. Is there any way to configure CouchDB, either on a per-database or per-document level, to keep all versions around forever? If so, how?
The revisions in CouchDB are not revisions in the way you are thinking of them. They are an artifact of the way it appends updated data to the database, and are cleaned up upon compaction. This is a common misunderstanding.
You need to implement the revision-tracking as part of the schema/document design of your application.
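One common way to do this is to copy the document's current state somewhere safe before every update, so the history survives compaction. Here is a minimal sketch of that idea using Python and the requests library; it assumes a CouchDB server at http://localhost:5984 and an audit database named <db>_audit, both of which are placeholders:

```python
import requests

COUCH = "http://localhost:5984"

def update_with_audit(db, doc_id, changes):
    # Fetch the current state of the document (includes _id and _rev).
    current = requests.get(f"{COUCH}/{db}/{doc_id}").json()

    # Archive a copy of the current state in a separate audit database,
    # stripped of _id/_rev so CouchDB assigns it a fresh id.
    archived = {k: v for k, v in current.items() if not k.startswith("_")}
    archived["source_id"] = doc_id
    archived["source_rev"] = current["_rev"]
    requests.post(f"{COUCH}/{db}_audit", json=archived).raise_for_status()

    # Apply the changes and write the new revision of the live document.
    current.update(changes)
    resp = requests.put(f"{COUCH}/{db}/{doc_id}", json=current)
    resp.raise_for_status()
    return resp.json()
```

The audit copies are ordinary documents, so they survive compaction of the live database and can be queried with views like anything else.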
CouchDB does store all versions, but if you click the "compact database" link in Futon, all previous versions will be deleted. So as long as you never compact the database, all versions will be preserved, I think. :)
There are two situations when the previous versions of documents in CouchDB are removed:
Replication
Compaction
So if you do not need to store lots of data (meaning terabytes), you probably do not need replication. That said, master-to-master replication is one of CouchDB's most important features. CouchDB's size on disk is greater than that of a traditional database, so you will probably need compaction at some point.
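For what it's worth, compaction is an explicit, admin-triggered operation; Futon's "compact database" link just calls the _compact endpoint. A quick sketch (the credentials are placeholders):

```python
import requests

# Old revisions stay on disk until compaction runs -- but you still should
# not rely on _rev history for auditing, as explained above.
resp = requests.post(
    "http://localhost:5984/mydb/_compact",
    auth=("admin", "password"),  # placeholder admin credentials
    headers={"Content-Type": "application/json"},
)
resp.raise_for_status()
print(resp.json())  # {'ok': True}
```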
As mentioned above: you need to implement revision tracking as part of the schema/document design of your application.
I am new to YugaByte, but impressed by the things I have read about it so far.
One thing which might be beneficial for both the YugaByte and Postgres communities would be to use YugaByte as one of the pluggable storage engines for PostgreSQL (>= version 12), thus taking advantage of this API and potentially making even more PG extensions work with YugaByte.
I am not sure if my understanding is correct, but if this integration is possible, I think it would make YugaByte even more interesting for large organizations.
Thanks for your interest in YugabyteDB! We did explore making it a Pluggable Store for PostgreSQL, however we found that this might not work because of the following points:
YugabyteDB replicates the system catalog as well, which is not pluggable in PostgreSQL. Without making this change, the system catalog will become a single point of failure.
We've had to change some internal workings because the data could be on a remote node, and those changes needed to be made at a layer higher than where the pure pluggable storage would live.
There are probably other reasons as well. But we're interested in making this possible over the long term, if it's possible to provide enough "hooks" into PostgreSQL. For now, we plan to pull in newer PostgreSQL features as needed.
We are overhauling our product by moving completely from the Microsoft/.NET family to open source (one of the reasons is cost cutting, along with an exponential increase in data).
We plan to move our data model completely from SQL Server (relational data) to Hadoop (the famous key-value ecosystem).
In the beginning, we want to support both versions (say v1.0 and the new v2.0). In order to maintain data consistency, we plan to sync the data between both systems, which is a fairly challenging and error-prone task, but we don't have any other option.
A bit confused about where to start, I am looking to the community of experts.
Any strategy, existing literature, or other guidance in this direction would be greatly appreciated.
I am not entirely sure how your code is structured, but if you currently have a data or persistence layer, or at least a database access class that all your SQL is executed through, you could override the save functions to write changes to both databases. If you do not have a data layer, you may want to consider writing one before starting the transition.
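As a rough illustration of that dual-write idea, here is a sketch of a repository class; sql_client and hadoop_client are hypothetical stand-ins for whatever SQL Server and Hadoop/HBase clients you actually use:

```python
class CustomerRepository:
    """Data-access layer that mirrors every change to both stores."""

    def __init__(self, sql_client, hadoop_client):
        # Both clients are placeholders, e.g. a pyodbc connection wrapper
        # and an HBase/Hive client wrapper.
        self.sql = sql_client
        self.hadoop = hadoop_client

    def save(self, customer):
        # Write to the current system of record (SQL Server) first...
        self.sql.upsert("Customers", customer)
        # ...then mirror the change to Hadoop. If the mirror write fails,
        # queue it for later reconciliation instead of failing the request.
        try:
            self.hadoop.put("customers", customer["id"], customer)
        except Exception:
            self._queue_for_retry(customer)

    def _queue_for_retry(self, customer):
        # Hypothetical: persist the failed write so a background job can retry it.
        pass
```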
Otherwise, you could add triggers in MSSQL to update Hadoop; I am not sure what you could do on the Hadoop side to keep MSSQL in sync.
Or you could have a process that runs every x minutes and manually syncs the two databases.
Personally, I would try to avoid maintaining two databases of record. Moving changes from a new, experimental database to your stable database seems risky; you stand the chance of corrupting your stable system. Instead, I would write a converter to move data from your relational DB to Hadoop. Then every night or so, copy your data into Hadoop and use it for the development and testing of your new system. I think test users would understand if you said your beta version is just a test playground and won't affect your live product. If you plan on making major changes to your UI and fear some users will not want to transition to 2.0, then you might be trying to tackle too much at once.
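A converter like that can start out as a simple batch job that reads from SQL Server and writes newline-delimited JSON for Hadoop to ingest. A rough sketch using pyodbc; the connection string, table, and output path are all placeholders:

```python
import json
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
cursor = conn.cursor()
cursor.execute("SELECT Id, Name, UpdatedAt FROM dbo.Customers")

columns = [col[0] for col in cursor.description]
with open("customers.json", "w") as out:
    for row in cursor:
        # One JSON object per line; push the file into HDFS afterwards,
        # e.g. with `hdfs dfs -put customers.json /staging/`.
        out.write(json.dumps(dict(zip(columns, row)), default=str) + "\n")
```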
Those are the solutions I came up with... Good luck!
Consider using a queuing tool like Flume (http://www.cloudera.com/blog/2010/07/whats-new-in-cdh3b2-flume/) to split your input between both systems.
I've read all the SO questions, the Coding Horror articles, and Googled my brains off searching for the best ways to revision control data. They all work and they all have their appropriate implementations based on use cases and so on. What I really want to know is why hasn't a database been written to natively support revisioning on the data-level?
What I am baffled by is that the API is already practically in place with transactions. We start a transaction, change some data, and commit. We are authenticating against the database too, so blame is present. My company stores end-of-month versions of our entire database for accounting purposes, which equate to tags. Does this not scream RCS?
Branching is something that databases could benefit from greatly too, with regard to schema more than data. Since I really only care about data, and branching would increase the difficulty of implementation massively, I'll stick to just tags and commits.
Now I know that databases are incredibly time-critical applications, so any unnecessary overhead is shunned into oblivion, and some databases are epic-level huge, so revisions would only multiply that size. Still, a per-table, opt-in revision control undoubtedly has a place in small to medium scale environments where there are milliseconds to spare and data history has a degree of importance. I want commits, I want logs, I want reverts, I want diffs, I want blame, I want tags, and I want checkouts. I want MF-ing revision control.
I have a question in there somewhere...
One native solution is Oracle's Flashback Data Archive (aka Total Recall). It is a chargeable extra for the Enterprise Edition, but it is pretty cool. It transparently stores versions of the data for as long as we want to retain them, and supplies syntax to query old versions of the data. It can be enabled on a table-by-table basis.
Essentially, Flashback Data Archive is like using triggers to store records in tracking tables, but slick, performant, and invisible to normal working.
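To give a flavour of it: once a table is covered by a flashback archive, you can query its state as of an earlier point in time with the AS OF TIMESTAMP clause. A minimal sketch using cx_Oracle, where the table, archive name, and connection details are placeholders:

```python
import cx_Oracle

conn = cx_Oracle.connect("scott", "tiger", "localhost/ORCLPDB1")
cur = conn.cursor()

# Enable history tracking for one table (a flashback archive must already exist).
cur.execute("ALTER TABLE orders FLASHBACK ARCHIVE orders_archive")

# Query the table as it looked one hour ago.
cur.execute("""
    SELECT order_id, status
    FROM orders AS OF TIMESTAMP SYSTIMESTAMP - INTERVAL '1' HOUR
""")
for row in cur:
    print(row)
```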
You could read about temporal databases.
In "Temporal Data & the Relational Model" by Date, Darwen, and Lorentzos, the authors introduce a sixth normal form to account for issues in tracking temporal data.
Richard Snodgrass proposed TSQL2 as an extension to SQL to handle temporal data.
Implementations include:
Oracle Workspace Manager
TimeDB
Several DBMSs implement engine-level versioning mechanisms. Unfortunately there is no vendor-independent standard for this, so they are all proprietary. Oracle Flashback has already been mentioned; Microsoft's Change Data Capture feature in SQL Server is another one.
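For example, Change Data Capture is switched on per database and then per table with system stored procedures, after which SQL Server maintains a queryable change table for you. A sketch via pyodbc (server, database, and table names are placeholders):

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# Enable CDC at the database level, then for a single table.
cur.execute("EXEC sys.sp_cdc_enable_db")
cur.execute("""
    EXEC sys.sp_cdc_enable_table
         @source_schema = N'dbo',
         @source_name   = N'Orders',
         @role_name     = NULL
""")

# From now on, inserts/updates/deletes on dbo.Orders are recorded in a
# change table (cdc.dbo_Orders_CT) that can be queried like any other table.
```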
You forgot "I want performance." A DBMS is a pretty low-level data storage mechanism, and in systems with billions of rows, performance can be important. Therefore, if you want this sort of auditing system, you can build it yourself using the tools available to you (e.g. triggers).
Just as in a filesystem, not all files are appropriate for version control, in a database not all rows would be appropriate for version control either.
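If you do roll your own with triggers, the usual pattern is a history table populated by an AFTER UPDATE/DELETE trigger. A minimal T-SQL sketch, run here through pyodbc; all table and column names are made up:

```python
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes",
    autocommit=True,
)
cur = conn.cursor()

# History table: one row per superseded version, plus who changed it and when.
cur.execute("""
    CREATE TABLE dbo.Orders_History (
        OrderId   INT,
        Status    NVARCHAR(50),
        ChangedAt DATETIME2 DEFAULT SYSUTCDATETIME(),
        ChangedBy SYSNAME   DEFAULT SUSER_SNAME()
    )
""")

# Trigger that copies the previous version of each touched row into history.
cur.execute("""
    CREATE TRIGGER dbo.trg_Orders_Audit
    ON dbo.Orders
    AFTER UPDATE, DELETE
    AS
    BEGIN
        INSERT INTO dbo.Orders_History (OrderId, Status)
        SELECT OrderId, Status FROM deleted;
    END
""")
```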
Using a version control system for your source code (like subversion) makes sense because it allows you to back out of mistakes, audit changes, make painless snapshots, discover exactly where something went wrong so that you can improve your process etc. For the same reasons it makes sense to do change tracking of business data, and many systems do so.
There are already a few questions on how to implement this on top of a normal database:
Database structure to track change history
Maintain history in a database
Database history for client usage
How to version control a record in a database
...
For a feature that is so useful and popular, it seems strange that we all need to reinvent the wheel. Are there any existing database implementations which already solved this problem? I'm imagining that such a system would extend the SQL syntax to allow easy querying of the history.
Take a look at temporal databases, such as TimeDB.
Not a relational database (you didn't say it had to be), but CouchDB has versioning built-in.
The space requirements would be prohibitive, so this is why you typically roll your own.
There are different solutions, depending on your toolkit:
Hibernate Envers, for plugging into Hibernate;
HBase has limited versioning built in (see the sketch below);
As far as the data goes, I believe it's called "change data capture".
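HBase, for instance, keeps a configurable number of timestamped versions per cell, which you can read back explicitly. A quick sketch with the happybase client; table and column names are placeholders, and the HBase Thrift server is assumed to be running locally:

```python
import happybase

connection = happybase.Connection("localhost")
table = connection.table("customers")

# Write two versions of the same cell.
table.put(b"row-1", {b"info:email": b"old@example.com"})
table.put(b"row-1", {b"info:email": b"new@example.com"})

# Read back up to 3 stored versions, newest first, with their timestamps.
for value, timestamp in table.cells(
    b"row-1", b"info:email", versions=3, include_timestamp=True
):
    print(timestamp, value)
```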
Given that most countries require that all accounting transactions are logged, pretty well every database lets you record the history for auditing.
I'm using SQL Server to drive a WPF application. I'm currently using NHibernate and pre-read all the data so it's cached for performance reasons. That works for a single-client app, but I was wondering if there's an in-memory database I could use so I can share the information across multiple apps on the same machine. Ideally this would sit below my NHibernate stack, so my code wouldn't have to change. Effectively, I'm looking to move my DB from its traditional format on the server to an in-memory DB on the client.
Note I only need select functionality.
I would be incredibly surprised if you even need to load all your information in memory. I say this because, just as one example, I'm working on a Web app at the moment that (for various reasons) loads thousands of records on many pages. This is PHP + MySQL. And even so it can do it and render a page in well under 100ms.
Before you go down this route, make sure that you have to. First, make your database as performant as possible. Now obviously this includes things like having appropriate indexes and tuning your database, but even that is putting the cart before the horse.
First and foremost you need to make sure you have a good relational data model: one that lends itself to performant queries. This is as much art as it is science.
Also, you may like NHibernate, but ORMs are not always the best choice. There are some corner cases, for example, in which hand-coded SQL will be vastly superior.
Now, assuming you have a good data model, and assuming you've then optimized your indexes and database parameters and properly configured NHibernate, then and only then should you consider storing data in memory, and only if performance is still an issue.
To put this in perspective, the only times I've needed to do this are on systems that need to perform millions of transactions per day.
One reason to avoid in-memory caching is because it adds a lot of complexity. You have to deal with issues like cache expiry, independent updates to the underlying data store, whether you use synchronous or asynchronous updates, how you give the client a consistent (if not up-to-date) view of your data, how you deal with failover and replication and so on. There is a huge complexity cost to be paid.
Assuming you've done all the above and you still need it, it sounds to me like what you need is a cache or grid solution. Here is an overview of Java grid/cluster solutions, but many of them (e.g. Coherence, memcached) apply to .Net as well. Another choice for .Net is Velocity.
It needs to be pointed out and stressed that something like NHibernate's caching is only consistent as long as nothing updates the database externally and there is exactly one NHibernate-enabled process (barring clustered solutions). If two desktop apps on two different PCs are both updating the same database with NHibernate, the caching simply won't work, because each persistence unit won't be aware of the changes the other is making.
http://www.db4o.com/ can be your friend!
Velocity is an out-of-process object caching server designed by Microsoft to do pretty much what you want, although it's only in CTP form at the moment.
I believe there are also wrappers for memcached, which can also be used to cache objects.
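For example, with the pymemcache client a read-through cache over your existing data access code looks roughly like this; the key scheme and the load_from_db callback are placeholders:

```python
import json
from pymemcache.client.base import Client

cache = Client(("localhost", 11211))

def get_customer(customer_id, load_from_db):
    """Read-through cache: check memcached first, fall back to the database."""
    key = f"customer:{customer_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)

    customer = load_from_db(customer_id)  # your existing NHibernate/SQL path
    cache.set(key, json.dumps(customer).encode("utf-8"), expire=300)  # 5-minute TTL
    return customer
```

The usual caveat applies: anything cached this way can be stale, so it really only suits the read-only scenario described in the question.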
You can use SAP HANA, express edition. You can download it for free; it's in-memory and columnar, and offers further analytics capabilities such as text analytics, geospatial, and predictive. You can also access it via ODBC, JDBC, the node.js hdb library, and REST APIs, among others.
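For example, from Python you can connect with SAP's hdbcli driver; the host, port, and credentials below are placeholders (the express edition exposes a SQL port in the 39xxx range):

```python
from hdbcli import dbapi

conn = dbapi.connect(
    address="hxehost",         # placeholder hostname
    port=39015,                # placeholder SQL port
    user="SYSTEM",
    password="YourPassword1",  # placeholder password
)
cur = conn.cursor()
cur.execute("SELECT CURRENT_TIMESTAMP FROM DUMMY")  # DUMMY is HANA's built-in one-row table
print(cur.fetchone())
conn.close()
```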