How do you implement version control in a database application? - database

I'm working on a web-based Java project that stores end-user data in a MySQL database. I'd like to implement something that gives the user functionality similar to what I have for my source code version control (e.g. Subversion). In other words, I'd like to implement code that allows the user to commit and roll back work and return to an existing branch. Is there an existing framework for this? It seems like putting the database data into version control and exposing the version control functionality to the end user (i.e. writing code that allows the user to commit, roll back, etc.) could be a reasonable approach, but it also seems there might be some problems with it. For example, how would you allow one user to view a rolled-back version of the data (i.e. you can't just replace the data the database is pointing to if one user wants to look at a rolled-back version of the data)? If given the choice of completely rebuilding the system using any persistence architecture, what could be used to store the data that would make this type of functionality easy to implement?

There are 2 very common solutions for what you need:
http://www.liquibase.org/
https://flywaydb.org/

Branching and merging the user data
Your question is about solutions for versioning the user data in an application, to give your users capabilities such as branching and merging. You considered exposing a real version control system such as SVN.
The side-effects I can foresee are:
You will have to index things by directory and filename. Maybe using an abstraction of directories as entities and filenames as the primary key.
Operating systems (Linux, Mac and Windows alike) do not handle directories with millions of files well. You will have to partition the entity, usually by hashing the ID (MD5 for example) and taking the beginning of the hash to create a subdirectory. The number of digits to take from the hash depends on the expected size of the entity.
Operating systems (Linux, Mac and Windows alike) are not prepared for huge quantities of files. I did a test on that: it took me days to back up and finally remove a file tree with hundreds of millions of files.
You will not be able to have additional indexes beyond the primary key; however, you can work around that by creating a data mart, as I will describe below.
You will not have database constraints, but similar functionality can be implemented through git/svn/cvs triggers.
You will not have strong transactions, but similar functionality can be implemented through git/svn/cvs triggers.
You will have a working copy for each user, which will consume space depending on the size of the repositories. That way each user will be at a single point in time.
Git is fast enough to switch from one branch to another, so going back in time and forward again will take only seconds (unless the user data is big, of course).
I saw a Linus interview where he warned about low performance in huge Git repositories. Maybe it is best to have a repository per user, or some other means to avoid your application having a single humongous repository.
Resolution of the changes. I bet that if you create gazillions of versions, any version control system will complain. I do not know what "gazillions" means here; you will have to test it.
Query database
A version control working copy will be limited to primary-key lookups using the "=" operator and sequential scans. This is not enough to make good reports and statistics for any usage pattern I can think of. That is why you need to build a data mart from your application data, and you have two ways of doing that:
A batch process that reads the whole repository history and builds cubes and other views to allow easier querying.
GIT/SVN/CVS triggers that can call programs written by you on file addition, modification, deletion, branch creation and merging. These could be used to update the database when a change happens.
The batch process is easier to implement but takes time for the reports and statistics to be synchronized with the activity. You will probably want to go that way for version 1.0 and in time move to triggers to make things more dynamic.
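As a rough illustration of the batch option, here is a minimal Java sketch (the JDBC URL, the datamart_documents table and the working-copy path are all invented) that walks a working copy and loads the current content of each entity file into a reporting table, where it can be indexed and queried with normal SQL:

import java.io.IOException;
import java.nio.file.*;
import java.sql.*;

// Hypothetical batch loader: each file in the git/svn working copy is one entity
// record; its raw content is upserted into a data-mart table for normal SQL querying.
public class DataMartBatchLoader {
    public static void main(String[] args) throws Exception {
        Path workingCopy = Paths.get(args[0]); // e.g. /srv/repos/checkout
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/datamart", "user", "pass");
             PreparedStatement upsert = conn.prepareStatement(
                     "REPLACE INTO datamart_documents (path, content) VALUES (?, ?)")) {
            Files.walk(workingCopy)
                 .filter(Files::isRegularFile)
                 .filter(p -> !p.toString().contains(".git")) // skip repository metadata
                 .forEach(p -> {
                     try {
                         upsert.setString(1, workingCopy.relativize(p).toString());
                         upsert.setString(2, Files.readString(p));
                         upsert.executeUpdate();
                     } catch (SQLException | IOException e) {
                         throw new RuntimeException("Failed to load " + p, e);
                     }
                 });
        }
    }
}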
Simulating constraints and transactions
Git, SVN and CVS support triggers that execute programs when a new version is submitted. The relationships and consistency can then be checked to accept or reject the change.
Alternative Solutions
Since you did not specify the kind of application you want, I will talk about blogs, content portals and online stores. For those kinds of applications I see little reason to reinvent the wheel and build a custom database. Most of the versioning necessary can be predicted in the database model. A good event-oriented database design will be enough.
For example, a revision in a blog post could be modeled by marking the end date/time of the post and creating a new row for the revised post, increasing the version number and setting the previous version id. The same strategy can be used with the sales and catalog of an online store. If you model your application with good logs, you do not need version control.
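A minimal JDBC sketch of that revision idea (the post table and its columns are invented, not from the question): end-date the current row and insert the new revision in one transaction.

import java.sql.*;

// Hypothetical revision scheme: rows are never updated in place; the active row is
// end-dated and a new row with version + 1 points back at the previous version.
public class PostRevisions {
    public static void revise(Connection conn, long postId, int currentVersion,
                              String newBody) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement endDate = conn.prepareStatement(
                     "UPDATE post SET valid_to = NOW() " +
                     "WHERE post_id = ? AND version = ? AND valid_to IS NULL");
             PreparedStatement insert = conn.prepareStatement(
                     "INSERT INTO post (post_id, version, previous_version, body, valid_from) " +
                     "VALUES (?, ?, ?, ?, NOW())")) {
            endDate.setLong(1, postId);
            endDate.setInt(2, currentVersion);
            endDate.executeUpdate();                 // close the currently active revision
            insert.setLong(1, postId);
            insert.setInt(2, currentVersion + 1);
            insert.setInt(3, currentVersion);
            insert.setString(4, newBody);
            insert.executeUpdate();                  // create the new revision row
            conn.commit();
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}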
Some developers also use a row-level trigger that records everything that has changed in the database. This is a bit harder for an auditor, who would need to reconstruct the past from badly designed logs. I personally do not like this approach because it is very difficult to index these kinds of queries. I prefer to build my whole application around a well-designed and meaningful log.
For example:
History Table
10/10/2010 [new process] process_id=1; name=john
11/10/2010 [change name] process_id=1; old_name=john; new_name=john doe
12/10/2010 [change name] process_id=1; old_name=john doe; new_name=john doe junior
Process Table after 12/10/2010.
proc_id=1 name=john doe junior
That way I can reconstruct almost everything in the past and still have my operational data in an easy-to-use format.
However, this is not close to the usage pattern you want (branching and merging).
Conclusion
The applicability of version control as a database seems to me very powerful on one hand and very limited and dangerous on the other. It is very inspiring for auditing and error-correction purposes, but my main concerns would be scale and reliability.

It seems like you want version control for your data rather than the database schema. I could find two databases that implement most of the version control features such as fork, clone, branch, merge, push, and pull:
https://github.com/dolthub/dolt - SQL based
https://github.com/terminusdb/terminusdb - graph based

You mentioned Subversion, which is a Centralized Version Control System. But let us focus on Git, because of reasons. Git is a Decentralized Version Control System. A local copy of a Git repository is the same as a remote copy of the repository, if a remote copy exists at all (services such as GitLab and GitHub provide the remote housing and managing of Git projects). With Git you can have version control in an arbitrary directory in your machine. You can do whatever you are accustomed to doing with SVN, and more, in this arbitrary directory.
What I am getting at, is that you could possibly create per user directories/repositories in your server programmatically, and apply version control in these directories/repositories, keeping a separate repository per user (the specifics of the architecture would be decided later, though, depending on the structure of the user's "work"). Your application would be in charge of adding and removing files on behalf of the user (e.g. Biography, My Sample Project, etc.), editing files, committing the changes, presenting a file history, etc., essentially issuing Git commands. Your application would, thus, interface with the Git repository, exploiting the advanced version control that Git provides. Your database would just make sure that the user is linked to the directory/repository that contains their "work".
To provide a critical analogy, the GitLab project is an open source web-based Git repository manager with wiki and issue tracking features. GitLab is written in Ruby and uses PostgreSQL (preferably). It is a typical (as in Code - Database - Data directories and files) multiuser web-based application. Its purpose is to manage Git repositories. These Git repositories are stored in a designated directory in the server. Part of the code is responsible for accessing the Git repositories that the logged-in user is authorized to access (as the owner or as a collaborator). An interesting use case is of a user editing a file online, which will result in a commit in some branch in some repository. Another interesting use case is of a user checking the history of a file. A final interesting use case is of a user reverting a specific commit. All of these actions are performed online, via a web browser.
To provide an interesting real-world use case, Atlas by O'Reilly is an online platform for publishing-related collaboration using GitLab as the backend.
For Java there is JGit, a lightweight, pure Java library implementing the Git version control system. JGit is used by Eclipse for all actions related to managing Git repositories. Maybe you could look into it. It is an extremely active project, supported by many, Google included.
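To make that concrete, here is a minimal JGit sketch (the repository location and file names are invented) of the kind of calls your application could issue on a user's behalf: write a file, commit it, and list its history.

import java.io.File;
import java.nio.file.*;
import org.eclipse.jgit.api.Git;
import org.eclipse.jgit.revwalk.RevCommit;

// Hypothetical per-user repository: the application edits files and issues Git
// commands on the user's behalf through JGit.
public class UserWorkRepository {
    public static void main(String[] args) throws Exception {
        File repoDir = new File("/srv/userdata/user42");     // invented location
        try (Git git = repoDir.exists()
                ? Git.open(repoDir)
                : Git.init().setDirectory(repoDir).call()) {

            // The user "saves" their work: write the file, stage it, commit it.
            Path biography = repoDir.toPath().resolve("biography.txt");
            Files.writeString(biography, "Updated biography text...");
            git.add().addFilepattern("biography.txt").call();
            git.commit()
               .setAuthor("user42", "user42@example.com")
               .setMessage("Update biography")
               .call();

            // The user asks for the file's history.
            for (RevCommit commit : git.log().addPath("biography.txt").call()) {
                System.out.println(commit.getName() + " " + commit.getShortMessage());
            }
        }
    }
}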
All of the above make sense, if the "work" you refer to is more than some fields in a database table, which the user will fill in and may later change the values of. For instance, it would make sense for structured text, HTML, etc.
If this "work" is not so large-scale, maybe doing something like what is described above is overkill. In that case, you could employ some of the version control concepts in your database design, such as calculating diffs and applying patches (also in reverse, for viewing past versions / rolling back). Your tables should allow for a tree-like structure, to store the diffs, so you could allow for branches. You could have the active version of a file readily available, as well as the active index (what Git calls HEAD), and navigate to another indexed/hashed/tagged version in the file's history by applying all patches sequentially, if moving forward, or applying patches in reverse, and in the reverse chronological order, if moving backwards. If this "work" is really small-scale, you could even ditch the diff concept, and store the whole version of the "work" in the tree-like structure.
Pure fun.

Related

How to handle database source in repositories of other applications?

This may be more opinionated than I like, but please forgive me. I'm searching for a definitive answer.
I am using GIT, JIRA (Issue Management), Bitbucket (online GIT projects) and SourceTree (GIT GUI client) for a project that involves multiple cross-code and cross-platform segments.
My issue is specifically with how to handle database source control in relation to the applications that utilize said database and its objects.
For example, let's say you have a web based tool that was developed to pull data from said database using stored procedures. Would the database stored procedures be stored in the same repository as the web application?
In another example, let's say the same web application just used basic SQL queries, but the systems that prepared the data, such as a complex ETL system, helped make it happen. Would the ETL system source code be in the repository too?
(Note: I am not referring to database changes to the data types, indexes or schema. I'm referring to SQL scripts, stored procedures, SSIS packages, SSRS sources and possibly even OLAP cube frameworks that are stored in the service, but which, of course, are not members of a DRC or CSM system and so sit outside of any control beyond the developers themselves.)
I hope this isn't too broad. There is just very little documentation out there on handling relational database objects in relation to an application or related systems. Databases by themselves do not seem to be that popular for DRC and CSM systems, even though they are a critical part of the puzzle.
Unfortunately, there is no definitive answer to this question. Where to store part of your system (or where to draw the line between systems) is highly context dependent.
Some considerations that might help you to decide are:
Who is developing the code?
If everything is created and maintained by the same team then it might be best to store everything together
At what rate is the code changing?
If the code to show the data is changing rapidly, while the code to create the data is changing only occasionally, it might be best to separate the two code bases
How is the code deployed/run?
If the deployment and running of various parts of a system are vastly different, then it could make sense to store and handle them in different ways.
To link this to your examples: in the first situation I would probably suggest keeping everything together, based on considerations 1 and 3. For the second example, all considerations together suggest moving the ETL system to a separate repository.

Databases and "branch"

We are currently developing an application which uses a database.
Every time we update the database structure, we have to provide a script to update the database from the previous version to the current one.
So the database currently has a number that gives us its version, and our software runs an update when we want to use an "old" database.
The issue we are encountering is when we have branches:
When we create a new big feature, one that will not be available to users (and not included in releases), we create a branch.
The main branch (trunk) is merged regularly into it to ensure that the created branch has the latest bug fixes.
The issue is with our update scripts. They update from the previous version to the current one, then update the version number of the database.
Imagine that we have the DB version 17 when creating the branch.
We then create the branch and make changes on the trunk DB. The trunk DB is now at version 18.
Then we make a db change on the branch. Since we know there has already been a new version "18", we create the version 19 and the updater 18->19.
Then the trunk is merged on the branch.
At this very moment we may have some updaters that will never run.
If someone updated their database before the merge, their database will be flagged as having version 19, so the update 17->18 will never be applied.
We want to change this behavior but we can't find how:
Our constraints are:
We are unable to make all changes on the same branch
Sometimes we have more than just 2 branches, and we can only merge from the trunk to the feature branch until the feature is finished
What can we do to ensure continuity between our database branches?
I think the easiest way is to use the Ruby-on-rails approach. Every DB change is a separate script file, no matter how small. Each script file is numbered, and when you do an upgrade you simply run each script from the number your DB currently is to the last one.
What this means in practice is that your DB version system stops being v18 to v19, and starts being v18.0 to v18.01, then v18.02 etc. What you release to the customer may get rolled up into a big v19 upgrade script, but as you develop, you will be making many, many small upgrades.
You'll have to modify this slightly to work for your system. Each script will either have to be renumbered as it gets merged to the branch, or you will have to ensure the upgrade mechanism doesn't simply track the last upgrade number but tracks each upgrade number individually, so missing holes still get filled in as scripts get merged across.
You will also have to roll up these little upgrades into the next major number as you create the release tag (on the trunk first) to keep things sane.
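A minimal sketch of such a runner (the JDBC URL, the schema_migrations table and the db/migrations directory are invented): it records every script it has applied, so scripts that arrive later via a branch merge still get picked up even though their numbers are "in the past".

import java.nio.file.*;
import java.sql.*;
import java.util.*;
import java.util.stream.Collectors;

// Hypothetical upgrader: applies every .sql file under db/migrations that is not yet
// recorded in schema_migrations, in name (i.e. number) order, and tracks each one
// individually so holes left by branch merges are filled in.
public class MigrationRunner {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/app", "user", "pass");
             Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name VARCHAR(255) PRIMARY KEY)");

            Set<String> applied = new HashSet<>();
            try (ResultSet rs = st.executeQuery("SELECT name FROM schema_migrations")) {
                while (rs.next()) applied.add(rs.getString(1));
            }

            List<Path> scripts;
            try (var stream = Files.list(Paths.get("db/migrations"))) {
                scripts = stream.filter(p -> p.toString().endsWith(".sql"))
                                .sorted()                         // e.g. 18.01_add_col.sql, 18.02_...
                                .collect(Collectors.toList());
            }

            try (PreparedStatement record = conn.prepareStatement(
                         "INSERT INTO schema_migrations (name) VALUES (?)")) {
                for (Path script : scripts) {
                    String name = script.getFileName().toString();
                    if (applied.contains(name)) continue;         // already run against this DB
                    st.execute(Files.readString(script));         // assumes one statement per file
                    record.setString(1, name);
                    record.executeUpdate();
                    System.out.println("Applied " + name);
                }
            }
        }
    }
}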
edit: so fundamentally you first have to get rid of the notion of using an upgrade script to go from version to version. For example, if you start with a table, and trunk adds column A and the branch adds column B, then you merge trunk to branch - you cannot realistically "upgrade" to the version with both, unless the branch version number is always greater than the trunk's upgrade script, and that doesn't work if you subsequently merge trunk to the branch. So you must therefore scrap the idea of a "version" that applies to development branches. The only way round that is to apply each change independently, and track each change individually. Then you can say you need the "last main release plus colA plus colB" (admittedly, if you merge trunk in, you can take the current main release from trunk whether it's v18 or v19, but you still need to apply each branch update individually).
So you start with trunk at DB v18. Branch and make changes. Then you merge trunk later, when the DB is at v19. Your earlier branch changes still need to be applied (or should already be applied, but you may need to write a branch-update script with all branch changes in it, if you re-create your DB). Note the branch does not have a "v20" version number at all, and the branch's changes are not made to a single update script like you have on trunk. You can add the changes you make on the branch as a single script if you like (or one script of 'since the last trunk merge' changes) or as many little scripts.
When the branch is complete, the very last task is to take all the DB changes made for the branch and roll them up into a script that can be applied to the master upgrader; when the branch is merged onto trunk, that script is merged into the current upgrade script and the DB version number is bumped.
There is an alternative that may work for you, but I found it to be a little flaky when you try to update DBs with data, sometimes it just couldn't manage to do the update and the DB had to be wiped and re-created (which, to be fair, is probably what would have had to happen if I used SQL scripts at the time). That's to use Visual Studio Database project. This stores every part of the schema as a file, so you'll have 1 script per table. These will be hidden from you by Visual Studio itself that will show you designers instead of scripts but they're stored as files in version control. VS can deploy the project and will try to upgrade your DB if it already exists. Be careful of the options, many defaults say "drop and create" instead of using alter to update an existing table.
These projects can generate a (largely machine-readable) SQL script for deployment, we used to generate these and deliver them to a DBA team who didn't use VS and only accepted SQL.
And lastly, there's Roundhouse, which is not something I've used, but it might become the new upgrader "script" for you. It's a free project and I've read it's more powerful and easier to use than VS DB projects. It's a DB versioning and change management tool, integrates with VS, and uses SQL scripts.
We have been using the following procedure for about 1.5 years now. I don't know if this is the best solution, but we haven't had any trouble with it (except some human errors in a delta file, like forgetting a USE statement).
It has some similarities with the answer that Krumia gave, but differs in that in this approach only new change scripts/delta files are executed. This makes it a lot easier to write those files.
Delta files
Write all the DB changes you make for a feature in a delta file. You can have multiple statements in one delta file or split them up into multiple files. Once that file is committed it's best (and once it's merged it's necessary) to start a new one and leave the old one untouched.
Put all the delta files in one directory and give them a name pattern like YYYY-MM-DD-HH.mm.description.sql. It's essential that you can sort them in time (hence the timestamp) so you know which file needs to be executed first. Besides that, you don't want merge conflicts with those files, so the names should be unique (across all branches).
Merging/pulling
Create a merge script (for example a bash script) that performs the following actions:
Note the current commit-hash
Do the actual merge (or pull)
Get a list of all the delta-files that are added with this merge (git diff --stat $old_hash..HEAD -- path/to/delta-files)
Execute those delta-files, in the order specified by the timestamp
By using git to determine which files are new (and thus which database actions haven't been executed yet on the current branch) you are no longer bound to version numbering.
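Here is a rough Java version of that post-merge step (paths and connection details are invented; it shells out to git just like the bash script would): find the delta files added since the recorded commit and execute them in timestamp order.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.file.*;
import java.sql.*;
import java.util.*;

// Hypothetical post-merge step: given the commit hash noted before the merge/pull,
// find the delta files the merge added and run them in timestamp (= name) order.
public class DeltaRunner {
    public static void main(String[] args) throws Exception {
        String oldHash = args[0];                                 // noted before the merge

        Process proc = new ProcessBuilder(
                "git", "diff", "--name-only", "--diff-filter=A",
                oldHash + "..HEAD", "--", "deltas/").start();
        List<String> newDeltas = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) newDeltas.add(line);
        }
        proc.waitFor();

        Collections.sort(newDeltas);   // the YYYY-MM-DD-HH.mm prefix makes lexical order = execution order

        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/app", "user", "pass");
             Statement st = conn.createStatement()) {
            for (String delta : newDeltas) {
                st.execute(Files.readString(Paths.get(delta)));   // assumes one statement per file
                System.out.println("Executed " + delta);
            }
        }
    }
}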
Alternating delta-files
It might happen that within one merge delta-files from different branches may be 'new to execute' and that those files alternate like this:
2014-08-04-delta-from-feature_A.sql
2014-08-05-delta-from-feature_B.sql
2014-08-06-delta-from-feature_A.sql
As the timestamp determines the execution order, something from feature A will be applied first, then feature B, then feature A again. As long as you write proper delta files that are executable by themselves / stand-alone, that shouldn't be a problem.
We have recently started using SQL Server Data Tools (SSDT), which replaced the Visual Studio Database Project type, to version control our SQL databases. It creates a project for each database, with items for views and stored procedures and the ability to create Data-Tier Applications (DACPAC) that can be deployed to SQL Server instances. SSDT also supports unit testing and static data, and offers developers the option of quick sandbox testing using a LocalDB instance. There is a good TechEd video overview of the SSDT tools and a lot more resources online.
In your situation you would use SSDT to manage your database objects in version control alongside your application code, using the same merging process to push features between branches. When it comes time to upgrade an existing install, you would create the DACPACs and use the Data-Tier Application upgrade process to apply the changes. Alternatively, you could also use database synchronization tools such as DBGhost or RedGate to apply updates to the existing schema.
You want database migrations. Many frameworks have plugins for this; for instance, CakePHP uses a plugin from CakeDC to manage them. Here are some generic tools: http://en.wikipedia.org/wiki/Schema_migration#Available_Tools.
If you want to roll your own, perhaps instead of keeping the current DB version in the database, you keep a list of which patches have been applied. So instead of version table with one row with value 19, you instead have a patches table with multiple rows:
Patches
1
2
3
4
5
8
Looking at this you need to apply patches 6 and 7.
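A small sketch of how the application could work out those gaps (the patches table is from the example above; the column name is invented):

import java.sql.*;
import java.util.*;

// Reads the applied patch numbers and returns the gaps that still need applying.
public class MissingPatches {
    public static List<Integer> findMissing(Connection conn) throws SQLException {
        TreeSet<Integer> applied = new TreeSet<>();
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery("SELECT patch_no FROM patches")) {
            while (rs.next()) applied.add(rs.getInt(1));
        }
        List<Integer> missing = new ArrayList<>();
        if (applied.isEmpty()) return missing;
        for (int i = applied.first(); i <= applied.last(); i++) {
            if (!applied.contains(i)) missing.add(i);   // e.g. 6 and 7 in the example above
        }
        return missing;
    }
}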
I just stumbled upon an older article written in 2008 by Jeff Atwood; hopefully it is still relevant to your problem.
Get Your Database Under Version Control
It mentions a five-part series written by K. Scott Allen:
Three rules for database work
The Baseline
Change Scripts
Views, Stored Procedures and the Like
Branching and Merging
There are tools specifically designed to deal with this type of problem.
One is DBSourceTools
DBSourceTools is a GUI utility to help developers bring SQL Server
databases under source control. A powerful database scripter, code
editor, sql generator, and database versioning tool. Compare Schemas,
create diff scripts, edit T-SQL with ease. Better than Management
Studio.
Another one:
neXtep Designer
NeXtep designer is an Integrated Development Environment for database
developers. The main concept behind the product is to take advantage
of versioning in order to compute the incremental SQL scripts you need
to deliver your developments.
This project aims at building a development platform that provides all
tools which a database developer needs while automating the tasks of
generating the deliveries (= SQL resulting from a development).
To learn more about the problematic of delivering database updates, we
invite you to read the Delivering database updates article which will
present you our vision of best and worst practices.
I think an approach which will satisfy most of your requirements is to embrace the "Database Refactoring" concept.
There is a good book on this topic Refactoring Databases: Evolutionary Database Design
A database refactoring is a small change to your database schema which
improves its design without changing its semantics (e.g. you don't add
anything nor do you break anything). The process of database
refactoring is the evolutionary improvement of your database schema so
as to improve your ability to support the new needs of your customers,
support evolutionary software development, and to fix existing legacy
database design problems.
The book describes database refactoring from the point of view of:
Technology. It includes full source code for how to implement each refactoring at the database level and for most refactorings we
show how the application would change to reflect the change in the
database. Our code examples are in Oracle, Java, and Hibernate
meta-data (the refactorings are easy to translate to other
environments, and sometimes we discuss vendor-specific features which
simplify some refactorings).
Process. It describes in detail the process of database refactoring in both the simple situation of a single application
accessing the database as well as the situation of the database being
accessed by many programs, many of which are out of the scope of your
authority. The technical examples assume the latter situation, so if
you're in the simple situation you may find some of our solutions to
be a little more complicated than you need (lucky you!).
Culture. Although it is technically simple to implement individual refactorings, and clearly possible (albeit a little
complicated) to adapt your internal processes to support database
refactoring, the fact is that cultural challenges within your
organization will likely prove to be the most difficult hurdle to
overcome.
This idea may or may not work, but reading about your work so far and the previous answer, it looks like reinventing the wheel. The "wheel" is source control, with its branch, merge and version tracking features.
At the moment, for each DB schema change, you have a SQL file containing the changes from the previous one. You already mention the significant issues you have with this approach.
Replace your method with this one: maintain ONE (and only ONE!) SQL file, which stores all the DDL commands for creating tables, indexes, and so on from scratch. Need to add a new field? Add an "ALTER TABLE" line to your SQL file. This way your source control tool will in effect manage your database schema, and each branch can have a different one.
All of a sudden, the source code is in sync with the database schema, branching and merging works, and so on.
Note: just to clarify, the purpose of the script mentioned here is to recreate the database from scratch up to a specific version, every single time.
EDIT: I spent some time looking for material to support this approach. Here is one that looks particularly good, with a proven track record:
Database Schema Versioning Management 101
Have you seen this situation before?
Your team is writing an enterprise application around a database
Since everyone is building around the same database, the schema of the database is in flux
Everyone has their own "local" copies of the database
Every time someone changes the schema, all of these copies need the latest schema to work with the latest build of the code
Every time you deploy to a staging or production database, the schema needs to work with the latest build of the code
Factors such as schema dependencies, data changes, configuration changes, and remote developers muddy the water
How do you currently address this problem of keeping the database
versions in working order? Do you suspect this is taking more time
than necessary? There are many ways to approach this problem, and the
answer depends on the workflow in your environment. The following
article describes a distilled and simplistic methodology you can use
as a starting point.
Since it can be implemented with ANSI SQL, it is database agnostic
Since it depends on scripting, it requires negligible storage management, and it can fit in your current code version management
program
The database versioning method you are using is certainly wrong, in my opinion. If anything has to have versions, it should be the source code. The source code has versions. Your live environment is only an instance of the source code.
The answer is to apply database changes using redeployable change scripts.
All changes, no matter which branch it is on (even in master/trunk) should be done in a separate script.
Sequence your scripts, so that newer ones will not get executed first. Having a date prefix in the format YYYYMMDD in the filename has worked for us.
With this approach, the change is made to the source code, not to the database. You can have as many instances/builds for various tags/branches in the VCS as you like, for example separate live builds for each branch.
Then you only have to do the build for each instance (probably every day). The build should fetch the files from the relevant branch and perform compiling/deploying. Since the scripts are redeployable, old scripts make no effect on the database. Only the recent changes are deployed to the database.
But, how to make redeployable scripts?
This is a question that is hard to answer, since you have not specified which database you are using. So I will give you an example about how my organization does it.
Let me take a simple example: if we need to add a column to a particular table, we do not just write ALTER TABLE ... ADD COLUMN .... We write code to add a column, if and only if that column does not exist in the given table.
Now, we have a separate API to handle all that existence-checking boilerplate code, so our scripts are simply calls to those APIs. You will have to write your own. These APIs are not actually that hard (we're using Oracle RDBMS), but they give us a huge gain in version control and deployment.
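I don't know what that Oracle API looks like internally, but the existence-checking idea can be sketched with plain JDBC metadata (names are invented; note that Oracle stores identifiers in upper case, so the table/column arguments may need to match that):

import java.sql.*;

// Redeployable "add column" step: the ALTER is issued only if the column is not
// already there, so running the script a second time is harmless.
public class RedeployableChange {
    public static void addColumnIfMissing(Connection conn, String table,
                                          String column, String definition) throws SQLException {
        DatabaseMetaData meta = conn.getMetaData();
        try (ResultSet rs = meta.getColumns(null, null, table, column)) {
            if (rs.next()) {
                return;                       // column already exists: nothing to do
            }
        }
        try (Statement st = conn.createStatement()) {
            st.execute("ALTER TABLE " + table + " ADD " + column + " " + definition);
        }
    }
}

Usage would be something like addColumnIfMissing(conn, "ORDERS", "SHIPPED_AT", "DATE") - a hypothetical example, not the actual API.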
But, that's only one scenario; there are a gazillion ways a schema definition can change
Yes indeed. The data type of a column can change; a new table can be added; an attribute column can be merged into a primary key (very rare); sequences can change; constraints; foreign keys; they all can change.
But it turns out that all of this can be handled by APIs with special privileges to read metadata tables. I am not saying it's easy, but I am saying that it is a one-time cost.
But, how do you rollback a database change?
My personal experience is that if you put some real effort into designing before banging the keyboard to write ALTER TABLE statements, this scenario is extremely rare. And if there ever is a rollback, you should handle it manually (e.g. manually remove the added column).
Normally, changes to views and stored procedures are rather common, and changes to table definitions are rare.
Building the Database
As I said before, building the database can be done by running all the redeployable scripts. Previously deployed scripts have no effect.
Your database deployment script should not start with DROP DATABASE. Your database has lots of data which was used for unit tests. Unless you make a really, really simple system, this data will be valuable in the future for testing. Your testers will not be too happy about adding ten thousand records to various tables every time a database is upgraded.
Testers aside, how are you planning to upgrade your clients'/customers' production databases without annihilating all their production data? This is why you must use redeployable change scripts.
You can try version number schemes such as 18.1-branchname etc., but they are really going to fail utterly, because you can merge your source, not its instances.
I think that the way you pose the problem makes it impossible to solve, but if you change part of your process there is a solution. Let's start with the first part: why it is impossible to solve using just deltas. In the following I assume you have the main trunk and two branches, dev-a and dev-b; both branches stem from the same point in time.
Why it cannot work
Say Alice adds a delta script to dev-a:
ALTER TABLE t1 ALTER COLUMN col5 char(4)
and Bob adds another script in dev-b:
ALTER TABLE t1 ALTER COLUMN col5 int
The two scripts are clearly incompatible, and you end up breaking the code in main when you merge back from either of the two. The merge tool cannot help if the script files have different names.
Possible solution
My suggestion is to describe your database in terms of both baseline and deltas: the delta scripts must always refer to a specific baseline, so you are able to compute a new baseline schema resulting from the application of successive deltas to a specific baseline.
An example
dev-a           *--B.A1--D.1#A1--D2#A1--------B.A2--*--B.A3--
               /                                   /
main -- B.0 --*--------------------------*--B.1---*----------
               \                        /
dev-b           *--B.B1--D.1#B1--B.B2--*
Note that after branching you immediately spin off a new baseline, and the same happens before every merge. This way you can check that the baselines are compatible.
Final comment
Managing deltas in version control is kind of reinventing the wheel, as each delta script is functionally equivalent to saving a different version of the baseline script. That said, I agree with you that in practice they convey more value and force people to think about what happens in production when you change the database.
If you opt to store only baselines, you have plenty of tools to support you.
Another option is to serialize work on the database, either as a whole or by partitioning the schema into separate areas with unique owners.

Using git repository as a database backend

I'm doing a project that deals with a structured document database. I have a tree of categories (~1000 categories, up to ~50 categories on each level); each category contains several thousand (up to, say, ~10000) structured documents. Each document is several kilobytes of data in some structured form (I'd prefer YAML, but it may just as well be JSON or XML).
Users of this system perform several types of operations:
retrieving of these documents by ID
searching for documents by some of the structured attributes inside them
editing documents (i.e. adding/removing/renaming/merging); each edit operation should be recorded as a transaction with some comment
viewing a history of recorded changes for a particular document (including viewing who changed the document, when and why; getting an earlier version; and probably reverting to it if requested)
Of course, the traditional solution would be to use some sort of document database (such as CouchDB or Mongo) for this problem - however, the version control (history) requirement tempted me to a wild idea - why shouldn't I use a git repository as the database backend for this application?
At first glance, it could be solved like this:
category = directory, document = file
getting document by ID => changing directories + reading a file in a working copy
editing documents with edit comments => making commits by various users + storing commit messages
history => normal git log and retrieval of older transactions
search => that's a slightly trickier part; I guess it would require periodic export of a category into a relational database, with indexing of the columns that we'll allow searching by
Are there any other common pitfalls in this solution? Has anyone tried to implement such a backend already (i.e. for any popular frameworks - RoR, node.js, Django, CakePHP)? Does this solution have any possible implications on performance or reliability - i.e. is it proven that git would be much slower than traditional database solutions, or are there any scalability/reliability pitfalls? I presume that a cluster of such servers that push/pull each other's repositories should be fairly robust and reliable.
Basically, tell me if this solution will work, and why it will or won't.
Answering my own question is not the best thing to do, but, as I ultimately dropped the idea, I'd like to share the rationale that worked in my case. I'd like to emphasize that this rationale might not apply to all cases, so it's up to the architect to decide.
Generally, the first main point my question misses is that I'm dealing with a multi-user system where users work in parallel, concurrently, using my server with a thin client (i.e. just a web browser). This way, I have to maintain state for all of them. There are several approaches to this, but all of them are either too hard on resources or too complex to implement (and thus kind of kill the original purpose of offloading all the hard implementation stuff to git in the first place):
"Blunt" approach: 1 user = 1 state = 1 full working copy of a repository that the server maintains for the user. Even if we're talking about a fairly small document database (for example, 100s of MiBs) with ~100K users, maintaining a full repository clone for all of them makes disc usage run through the roof (i.e. 100K users times 100 MiB ~ 10 TiB). What's even worse, cloning a 100 MiB repository takes several seconds each time, even if done in a fairly efficient manner (i.e. not by git itself, avoiding the unpacking-repacking stuff), which is not acceptable, IMO. And even worse - every edit that we apply to the main tree would have to be pulled to every user's repository, which is (1) a resource hog, and (2) might lead to unresolved edit conflicts in the general case.
Basically, it might be as bad as O(number of edits × data × number of users) in terms of disc usage, and such disc usage automatically means pretty high CPU usage.
"Only active users" approach: maintain working copy only for active users. This way, you generally store not a full-repo-clone-per-user, but:
As user logs in, you clone the repository. It takes several seconds and ~100 MiB of disc space per active user.
As user continues to work on the site, he works with the given working copy.
As user logs out, his repository clone is copied back to main repository as a branch, thus storing only his "unapplied changes", if there are any, which is fairly space-efficient.
Thus, disc usage in this case peaks at O(number of edits × data × number of active users), which is usually ~100..1000 times less than the total number of users, but it makes logging in/out more complicated and slower, as it involves cloning a per-user branch on every login and pulling these changes back on logout or session expiration (which should be done transactionally => adds another layer of complexity). In absolute numbers, it drops the 10 TiB of disc usage down to 10..100 GiB in my case; that might be acceptable, but, yet again, we're now talking about a fairly small database of 100 MiB.
"Sparse checkout" approach: making a "sparse checkout" instead of a full-blown repo clone per active user doesn't help a lot. It might save ~10x of disc space usage, but at the expense of much higher CPU/disc load on history-involving operations, which kind of kills the purpose.
"Workers pool" approach: instead of doing full-blown clones every time for an active person, we might keep a pool of "worker" clones, ready to be used. This way, every time a user logs in, he occupies one "worker", pulling his branch from the main repo there, and, as he logs out, he frees the "worker", which does a clever git hard reset to become yet again just a main repo clone, ready to be used by another user logging in. This does not help much with disc usage (it's still pretty high - a full clone per active user), but at least it makes logging in/out faster, at the expense of even more complexity.
That said, note that I intentionally calculated numbers of fairly small database and user base: 100K users, 1K active users, 100 MiBs total database + history of edits, 10 MiBs of working copy. If you'd look at more prominent crowd-sourcing projects, there are much higher numbers there:
│ │ Users │ Active users │ DB+edits │ DB only │
├──────────────┼───────┼──────────────┼──────────┼─────────┤
│ MusicBrainz │ 1.2M │ 1K/week │ 30 GiB │ 20 GiB │
│ en.wikipedia │ 21.5M │ 133K/month │ 3 TiB │ 44 GiB │
│ OSM │ 1.7M │ 21K/month │ 726 GiB │ 480 GiB │
Obviously, for that amounts of data/activity, this approach would be utterly unacceptable.
Generally, it would have worked if one could use the web browser as a "thick" client, i.e. issuing git operations and storing pretty much the full checkout on the client's side, not on the server's side.
There are also other points that I've missed, but they're not that bad compared to the first one:
The very pattern of having "thick" user's edit state is controversial in terms of normal ORMs, such as ActiveRecord, Hibernate, DataMapper, Tower, etc.
As much as I've searched, there's no existing free codebase that implements this approach to git for popular frameworks.
There is at least one service that somehow manages to do that efficiently — that is obviously github — but, alas, their codebase is closed source and I strongly suspect that they do not use normal git servers / repo storage techniques inside, i.e. they basically implemented alternative "big data" git.
So, bottom line: it is possible, but for most current use cases it won't be anywhere near the optimal solution. Rolling your own document-edit-history-to-SQL implementation or trying to use an existing document database would probably be a better alternative.
My 2 pence worth. A bit long, but... I had a similar requirement in one of my incubation projects. Similar to yours, my key requirements were a document database (XML in my case) with document versioning. It was for a multi-user system with a lot of collaboration use cases. My preference was to use available open source solutions that support most of the key requirements.
To cut to the chase, I could not find any one product that provided both in a way that was scalable enough (number of users, usage volumes, storage and compute resources). I was biased towards git for all its promising capabilities, and the (probable) solutions one could craft out of it. As I toyed with the git option more, moving from a single-user perspective to a multi-user perspective became an obvious challenge. Unfortunately, I did not get to do a substantial performance analysis like you did (... lazy / quit early ... the "for version 2" mantra). Power to you! Anyway, my biased idea has since morphed to the next (still biased) alternative: a mash-up of tools that are each the best in their separate spheres, databases and version control.
While still a work in progress (... and slightly neglected), the morphed version is simply this:
On the frontend (user-facing), use a database for the first-level storage (interfacing with the user applications).
On the backend, use a version control system (VCS, like git) to perform versioning of the data objects in the database.
In essence it would amount to adding a version control plugin to the database, with some integration glue which you may have to develop, but which may be a lot easier to build.
How it is supposed to work is that the primary multi-user data exchanges go through the database. The DBMS will handle all the fun and complex issues such as multi-user access, concurrency, atomic operations, etc. On the backend, the VCS performs version control on a single set of data objects (no concurrency or multi-user issues). For each effective transaction on the database, version control is performed only on the data records that have effectively changed.
As for the interfacing glue, it will be in the form of a simple interworking function between the database and the VCS. In terms of design, a simple approach would be an event-driven interface, with data updates from the database triggering the version control procedures (hint: assuming MySQL, use of triggers and sys_exec() blah blah...). In terms of implementation complexity, it will range from the simple and effective (e.g. scripting) to the complex and wonderful (some programmed connector interface). It all depends on how crazy you want to go with it, and how much sweat capital you are willing to spend. I reckon simple scripting should do the magic. And to access the end result, the various data versions, a simple alternative is to populate a clone of the database (more a clone of the database structure) with the data referenced by the version tag/id/hash in the VCS. Again, this bit will be a simple query/translate/map job for an interface.
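As a very rough sketch of that glue (all names are invented, and it uses JGit in the application layer instead of sys_exec() from a trigger): after an effective change, serialize just the changed record into the backend repository and commit it.

import java.io.File;
import java.nio.file.*;
import org.eclipse.jgit.api.Git;

// Hypothetical glue between the DBMS and the VCS backend: called after an effective
// transaction, it writes only the changed record to the repo and commits it.
public class VcsGlue {
    private final Git git;
    private final Path repoDir;

    public VcsGlue(File repoDir) throws Exception {
        this.repoDir = repoDir.toPath();
        this.git = Git.open(repoDir);
    }

    public void recordChange(String table, String recordId, String serializedXml,
                             String changedBy) throws Exception {
        Path file = repoDir.resolve(table).resolve(recordId + ".xml");  // XML, as above
        Files.createDirectories(file.getParent());
        Files.writeString(file, serializedXml);

        git.add().addFilepattern(table + "/" + recordId + ".xml").call();
        git.commit()
           .setAuthor(changedBy, changedBy + "@example.com")
           .setMessage("Update " + table + "/" + recordId)
           .call();
    }
}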
There are still some challenges and unknowns to be dealt with, but I suppose the impact and relevance of most of these will largely depend on your application requirements and use cases. Some may just end up being non-issues. Some of the issues include: performance matching between the two key modules (the database and the VCS) for an application with high-frequency data update activity, and scaling of resources (storage and processing power) over time on the git side as the data and users grow - steadily, exponentially or eventually plateauing.
Of the cocktail above, here is what I'm currently brewing
using Git for the VCS (I initially considered good old CVS due to its use of only changesets or deltas between two versions)
using MySQL (due to the highly structured nature of my data: XML with strict XML schemas)
toying around with MongoDB (to try a NoSQL database, which closely matches the native database structure used in git)
Some fun facts
- git actually does clever things to optimize storage, such as compression, and storage of only deltas between revisions of objects
- YES, git does store only changesets or deltas between revisions of data objects, where it is applicable (it knows when and how). Reference: packfiles, deep in the guts of Git internals
- a review of git's object storage (a content-addressable filesystem) shows striking similarities (from a conceptual perspective) with NoSQL databases such as MongoDB. Again, at the expense of sweat capital, it may provide more interesting possibilities for integrating the two, and for performance tweaking
If you got this far, let me know if the above may be applicable to your case, and, assuming it would be, how it would square up with some of the aspects in your last comprehensive performance analysis.
An interesting approach indeed. I would say that if you need to store data, use a database, not a source code repository, which is designed for a very specific task. If you could use Git out of the box, then it's fine, but you would probably need to build a document repository layer over it. So you could build it over a traditional database as well, right? And if it's built-in version control that you're interested in, why not just use one of the open source document repository tools? There are plenty to choose from.
Well, if you decide to go for Git backend anyway, then basically it would work for your requirements if you implemented it as described. But:
1) You mentioned "cluster of servers that push/pull each other" - I've thought about it for a while and still I'm not sure. You can't push/pull several repos as an atomic operation. I wonder if there could be a possibility of some merge mess during concurrent work.
2) Maybe you don't need it, but an obvious piece of document repository functionality you did not list is access control. You could possibly restrict access to some paths (= categories) via submodules, but you probably won't be able to grant access at the document level easily.
I implemented a Ruby library on top of libgit2 which makes this pretty easy to implement and explore. There are some obvious limitations, but it's also a pretty liberating system since you get the full git toolchain.
The documentation includes some ideas about performance, tradeoffs, etc.
As you mentioned, the multi-user case is a bit trickier to handle. One possible solution would be to use user-specific Git index files resulting in
no need for separate working copies (disk usage is restricted to changed files)
no need for time-consuming preparatory work (per user session)
The trick is to combine Git's GIT_INDEX_FILE environment variable with the tools to create Git commits manually:
git hash-object
git update-index
git write-tree
git commit-tree
A solution outline follows (actual SHA1 hashes omitted from the commands):
# Initialize the index
# N.B. Use the commit hash since refs might change during the session.
$ GIT_INDEX_FILE=user_index_file git reset --hard <starting_commit_hash>
#
# Change data and save it to `changed_file`
#
# Save changed data to the Git object database. Returns a SHA1 hash to the blob.
$ cat changed_file | git hash-object -t blob -w --stdin
da39a3ee5e6b4b0d3255bfef95601890afd80709
# Add the changed file (using the object hash) to the user-specific index
# N.B. When adding new files, --add is required
$ GIT_INDEX_FILE=user_index_file git update-index --cacheinfo 100644 <changed_data_hash> path/to/the/changed_file
# Write the index to the object db. Returns a SHA1 hash to the tree object
$ GIT_INDEX_FILE=user_index_file git write-tree
8ea32f8432d9d4fa9f9b2b602ec7ee6c90aa2d53
# Create a commit from the tree. Returns a SHA1 hash to the commit object
# N.B. Parent commit should be the same commit as in the first phase.
$ echo "User X updated their data" | git commit-tree <new_tree_hash> -p <starting_commit_hash>
3f8c225835e64314f5da40e6a568ff894886b952
# Create a ref to the new commit
$ git update-ref refs/heads/users/user_x_change_y <new_commit_hash>
Depending on your data you could use a cron job to merge the new refs to master but the conflict resolution is arguably the hardest part here.
Ideas to make it easier are welcome.

database versioning

I work as an SCM developer and I am currently tasked with an activity which involves database versioning. Although I have done source code management, I am quite new to this. Hence I would like to hear different views and experiences on how to implement this.
What I mean by database (Oracle/Sybase) versioning is capturing the changes which happen to the database schema/triggers/etc. and storing them as revisions. Basically, in our company there are changes made to the customer databases which we are not aware of, or at least we are not able to identify when and by whom a particular change was made. We are just trying to create a record of the changes which happen in the DB.
Note: I am not a DB guy.
The usual practice is to allow changes to go through a build process. Basically, have a version control tool like CVS where users check in the changes that have to go to the QA and Prod environments.
So.. let's say, there are a couple of columns added to a table, the developer would check in a .ddl script with the "Alter table ..." command and that will be "applied" to the database the next time you do a build.
Unless you restrict users (in this case.. Developers) from directly making changes and instead use a standard build-process, tracking changes to objects is almost impossible over time.
Consider necessary details like the user who made the change, the time of the change and the reason (check-in comments, bug number, new feature request etc.), which you'd need later to understand why a change was made. All the changes are usually compiled using a standard user like "APPOWNER", and in the absence of a version control system you only have access to the latest change (last_ddl_change).
If your concern is to track changes to data, you can use triggers or use an application like GoldenGate that will read through the redo logs and get you the change capture records. From your question, it looks like you are looking for a way to track object changes.
The best way to do it is to have some kind of DB revision software which manages all changes and allows you to easily apply them to multiple databases (upgrade/downgrade).
It requires saving all changes through the revision software, with no direct DB changes.
Maybe similar tools for PostgreSQL will help:
depesz scripts http://www.depesz.com/index.php/projects/.
Python tool: https://code.google.com/p/sqlalchemy-migrate/

How should you build your database from source control?

There has been some discussion on the SO community wiki about whether database objects should be version controlled. However, I haven't seen much discussion about the best-practices for creating a build-automation process for database objects.
This has been a contentious point of discussion for my team - particularly since developers and DBAs often have different goals, approaches, and concerns when evaluating the benefits and risks of an automation approach to database deployment.
I would like to hear some ideas from the SO community about what practices have been effective in the real world.
I realize that it is somewhat subjective which practices are really best, but I think a good dialog about what works could be helpful to many folks.
Here are some of my teaser questions about areas of concern in this topic. These are not meant to be a definitive list - rather a starting point for people to help understand what I'm looking for.
Should both test and production environments be built from source control?
Should both be built using automation - or should production be built by copying objects from a stable, finalized test environment?
How do you deal with potential differences between test and production environments in deployment scripts?
How do you test that the deployment scripts will work as effectively against production as they do in test?
What types of objects should be version controlled?
Just code (procedures, packages, triggers, java, etc)?
Indexes?
Constraints?
Table Definitions?
Table Change Scripts? (eg. ALTER scripts)
Everything?
Which types of objects shouldn't be version controlled?
Sequences?
Grants?
User Accounts?
How should database objects be organized in your SCM repository?
How do you deal with one-time things like conversion scripts or ALTER scripts?
How do you deal with retiring objects from the database?
Who should be responsible for promoting objects from development to test level?
How do you coordinate changes from multiple developers?
How do you deal with branching for database objects used by multiple systems?
What exceptions, if any, can reasonably be made to this process?
Security issues?
Data with de-identification concerns?
Scripts that can't be fully automated?
How can you make the process resilient and enforceable?
To developer error?
To unexpected environmental issues?
For disaster recovery?
How do you convince decision makers that the benefits of DB-SCM truly justify the cost?
Anecdotal evidence?
Industry research?
Industry best-practice recommendations?
Appeals to recognized authorities?
Cost/Benefit analysis?
Who should "own" database objects in this model?
Developers?
DBAs?
Data Analysts?
More than one?
Here are some answers to your questions:
Should both test and production environments be built from source control? YES
Should both be built using automation - or should production be built by copying objects from a stable, finalized test environment?
Automation for both. Do NOT copy data between the environments
How do you deal with potential differences between test and production environments in deployment scripts?
Use templates, so that you actually produce a different set of scripts for each environment (e.g. references to external systems, linked databases, etc.)
How do you test that the deployment scripts will work as effectively against production as they do in test?
You test them on a pre-production environment: test the deployment on an exact copy of the production environment (database and potentially other systems)
What types of objects should be version controlled?
Just code (procedures, packages, triggers, java, etc)?
Indexes?
Constraints?
Table Definitions?
Table Change Scripts? (eg. ALTER scripts)
Everything?
Everything, and:
Do not forget static data (lookup lists etc), so you do not need to copy ANY data between environments
Keep only current version of the database scripts (version controlled, of course), and
Store ALTER scripts: 1 BIG script (or a directory of scripts named like 001_AlterXXX.sql, so that running them in natural sort order will upgrade from version A to B)
Which types of objects shouldn't be version controlled?
Sequences?
Grants?
User Accounts?
see 2. If your users/roles (or technical user names) are different between environments, you can still script them using templates (see 1.)
How should database objects be organized in your SCM repository?
How do you deal with one-time things like conversion scripts or ALTER scripts?
see 2.
How do you deal with retiring objects from the database?
deleted from DB, removed from source control trunk/tip
Who should be responsible for promoting objects from development to test level?
dev/test/release schedule
How do you coordinate changes from multiple developers?
try NOT to create a separate database for each developer. you use source-control, right? in this case developers change the database and check-in the scripts. to be completely safe, re-create the database from the scripts during nightly build
How do you deal with branching for database objects used by multiple systems?
tough one: try to avoid at all costs.
What exceptions, if any, can be reasonably made to this process?
Security issues?
do not store passwords for test/prod. you may allow it for dev, especially if you have automated daily/nightly DB rebuilds
Data with de-identification concerns?
Scripts that can't be fully automated?
document and store with the release info/ALTER script
How can you make the process resilient and enforceable?
To developer error?
Test with a daily build from scratch and compare the results to the incremental upgrade (from version A to B using ALTER). Compare both the resulting schema and the static data
To unexpected environmental issues?
use version control and backups
Compare the PROD database schema to what you think it is, especially before deployment. A SuperDuperCool DBA may have fixed a bug that was never in your ticket system :) (a comparison sketch follows)
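One low-tech way to do that comparison on MySQL is to dump the column definitions of both the expected schema (freshly built from source control) and the actual PROD schema from information_schema, then diff the two outputs; the schema name is a placeholder:

-- run against both databases and diff the results outside the database
SELECT table_name, column_name, column_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_schema = 'myapp'
ORDER BY table_name, ordinal_position;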
For disaster recovery?
How do you convince decision makers that the benefits of DB-SCM truly justify the cost?
Anecdotal evidence?
Industry research?
Industry best-practice recommendations?
Appeals to recognized authorities?
Cost/Benefit analysis?
If developers and DBAs agree, you do not need to convince anyone, I think (unless you need money to buy software like dbGhost for MSSQL)
Who should "own" database objects in this model?
Developers?
DBAs?
Data Analysts?
More than one?
Usually DBAs approve the model (before check-in or after, as part of code review). They definitely own performance-related objects. But in general the team owns it [and the employer, of course :)]
I treat the SQL as source-code when possible
If I can write it in standards-compliant SQL then it generally goes in a file in my source control. The file defines as much as possible, such as stored procedures and table CREATE statements.
I also include dummy data for testing in source control:
proj/sql/setup_db.sql
proj/sql/dummy_data.sql
proj/sql/mssql_specific.sql
proj/sql/mysql_specific.sql
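A minimal sketch of what setup_db.sql and dummy_data.sql might contain (the table and rows here are hypothetical):

-- proj/sql/setup_db.sql
CREATE TABLE country_code (
  code VARCHAR(2)  NOT NULL PRIMARY KEY,
  name VARCHAR(64) NOT NULL
);

-- proj/sql/dummy_data.sql
INSERT INTO country_code (code, name) VALUES ('US', 'United States');
INSERT INTO country_code (code, name) VALUES ('DE', 'Germany');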
And then I abstract out all my SQL queries so that I can build the entire project for MySQL, Oracle, MSSQL or anything else.
Build and test automation uses these build scripts, as they are as important as the app source, and tests everything from integrity through triggers, procedures and logging.
We use continuous integration via TeamCity. At each checkin to source control, the database and all the test data is re-built from scratch, then the code, then the unit tests are run against the code. If you're using a code-generation tool like CodeSmith, it can also be placed into your build process to generate your data access layer fresh with each build, making sure that all your layers "match up" and do not produce errors due to mismatched SP parameters or missing columns.
Each build has its own collection of SQL scripts that are stored in the $project\SQL\ directory in source control, assigned a numerical prefix and executed in order. That way, we're practicing our deployment procedure at every build.
Depending on the lookup table, most of our lookup values are also stored in scripts and run to make sure the configuration data is what we expect for, say, "reason_codes" or "country_codes". This way we can make a lookup data change in dev, test it out and then "promote" it through QA and production, instead of using a tool to modify lookup values in production, which can be dangerous for uptime.
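On MySQL, one way to keep such lookup scripts safe to re-run in every environment is an idempotent upsert; the table and values are illustrative, not from the original answer:

-- lookup/reason_codes.sql - safe to run repeatedly in dev, QA and production
INSERT INTO reason_codes (code, description)
VALUES ('RET', 'Returned by customer'),
       ('DMG', 'Damaged in transit')
ON DUPLICATE KEY UPDATE description = VALUES(description);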
We also create a set of "rollback" scripts that undo our database changes, in case a build to production goes screwy. You can test the rollback scripts by running them, then re-running the unit tests for the build one version below yours, after its deployment scripts run.
+1 for Liquibase:
LiquiBase is an open source (LGPL), database-independent library for tracking, managing and applying database changes. It is built on a simple premise: All database changes (structure and data) are stored in an XML-based descriptive manner and checked into source control.
The good point is that DML changes are stored semantically, not just as diffs, so you can track the purpose of the changes.
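Besides the XML format, Liquibase also accepts plain-SQL changelogs, which keeps changesets readable in any SQL tool; a minimal sketch (author, id and table are made up):

--liquibase formatted sql

--changeset jane:add-invoice-table
CREATE TABLE invoice (
  id    BIGINT        NOT NULL PRIMARY KEY,
  total DECIMAL(10,2) NOT NULL
);
--rollback DROP TABLE invoice;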
It could be combined with Git version control for better interaction. I'm going to configure our dev-prod environment to try it out.
You could also use Maven or Ant build systems for building production code from scripts.
The minus is that LiquiBase doesn't integrate into widespread SQL IDEs, so you have to do basic operations yourself.
In addition to this, you could use DBUnit for DB testing - this tool allows data generation scripts to be used for testing your production environment, with cleanup afterwards.
IMHO:
Store DML in files so that you can version them.
Automate the schema build process from source control.
For testing purposes, a developer can use a local DB built from source control via the build system, and load test data with scripts or DBUnit scripts (from source control).
LiquiBase allows you to provide a "run sequence" of scripts to respect dependencies.
There should be a DBA team that checks the master branch with ALL changes before production use. I mean they check the trunk/branch from other DBAs before committing into the MASTER trunk, so that master is always consistent and production ready.
We faced all the mentioned problems with code changes, merging and rewriting in our billing production database. This topic is great for discovering all that stuff.
By asking "teaser questions" you seem to be more interested in a discussion than someone's opinion of final answers. The active (>2500 members) mailing list agileDatabases has addressed many of these questions and is, in my experience, a sophisticated and civil forum for this kind of discussion.
I basically agree with every answer given by van. For more insight, my baseline for database management is the K. Scott Allen series (a must read, IMHO, and Jeff's opinion too, it seems).
Database objects can always be rebuilt from scratch by launching a single SQL file (that can itself call other SQL files): Create.sql. This can include static data insertion (lists...).
The SQL scripts are parameterized so that no environment-dependent and/or sensitive information is stored in plain files.
I use a custom batch file to launch Create.sql: Create.cmd. Its goal is mainly to check for prerequisites (tools, environment variables...) and send parameters to the SQL script. It can also bulk-load static data from CSV files for performance reasons.
Typically, system user credentials would be passed as a parameter to the Create.cmd file.
IMHO, dynamic data loading should require another step, depending on your environment. Developers will want to load their database with test, junk or no data at all, while at the other end production managers will want to load production data. I would consider storing test data in source control as well (to ease unit testing, for instance).
Once the first version of the database has been put into production, you will need not only build scripts (mainly for developers), but also upgrade scripts (based on the same principles):
There must be a way to retrieve the version from the database (I use a stored procedure, but a table would do as well).
Before releasing a new version, I create an Upgrade.sql file (that can call other ones) that allows upgrading version N-1 to version N (N being the version being released). I store this script under a folder named N-1.
I have a batch file that does the upgrade: Upgrade.cmd. It can retrieve the current version (CV) of the database via a simple SELECT statement, launch the Upgrade.sql script stored under the CV folder, and loop until no folder is found. This way, you can automatically upgrade from, say, N-3 to N.
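A sketch of the versioning part, assuming a single-row version table rather than a stored procedure (all names here are illustrative):

-- Create.sql establishes the version marker
CREATE TABLE db_version (version INT NOT NULL);
INSERT INTO db_version (version) VALUES (12);

-- Upgrade.cmd reads the current version (CV) with a simple query...
SELECT version FROM db_version;

-- ...and the Upgrade.sql stored under folder 12 ends by bumping it
UPDATE db_version SET version = 13;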
Problems with this are:
It is difficult to automatically compare database schemas, depending on database vendors. This can lead to incomplete upgrade scripts.
Every change to the production environment (usually by DBAs for performance tuning) should find its way to source control as well. To make sure of this, it is usually possible to log every modification to the database via a trigger (a sketch of such a trigger follows this list). This log is reset after every upgrade.
More ideally, though, DBA initiated changes should be part of the release/upgrade process when possible.
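On Oracle (which comes up below with respect to schemas and users), the logging trigger mentioned in the second point might be sketched like this; the log table name and columns are assumptions:

-- records every direct DDL change so it can be reconciled with source control
CREATE TABLE ddl_audit_log (
  changed_on  DATE,
  db_user     VARCHAR2(30),
  event_type  VARCHAR2(30),
  object_type VARCHAR2(30),
  object_name VARCHAR2(128)
);

CREATE OR REPLACE TRIGGER trg_audit_ddl
AFTER DDL ON SCHEMA
BEGIN
  INSERT INTO ddl_audit_log (changed_on, db_user, event_type, object_type, object_name)
  VALUES (SYSDATE, USER, ora_sysevent, ora_dict_obj_type, ora_dict_obj_name);
END;
/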
As to what kind of database objects you want to have under source control? Well, I would say as much as possible, but not more ;-) If you want to create users with passwords, give them a default password (login/login, practical for unit testing purposes), and make the password change a manual operation. This happens a lot with Oracle, where schemas are also users...
We have our Silverlight project with an MSSQL database in Git version control. The easiest way is to make sure you've got a slimmed-down database (content-wise) and do a complete dump from, for example, Visual Studio. Then you can run 'sqlcmd' from your build script to recreate the database on each dev machine.
For deployment this is not possible since the databases are too large: that's the main reason for having them in a database in the first place.
I strongly believe that a DB should be part of source control and to a large degree part of the build process. If it is in source control then I have the same coding safeguards when writing a stored procedure in SQL as I do when writing a class in C#. I do this by including a DB scripts directory under my source tree. This script directory doesn't necessarily have one file for one object in the database. That would be a pain in the butt! I develop in my db just as I would in my code project. Then when I am ready to check in, I do a diff between the last version of my database and the current one I am working on. I use SQL Compare for this and it generates a script of all the changes. This script is then saved to my db_update directory with a specific naming convention, 1234_TasksCompletedInThisIteration, where the number is the next number in the set of scripts already there, and the name describes what is being done in this check-in.

I do it this way because, as part of my build process, I start with a fresh database that is then built up programmatically using the scripts in this directory. I wrote a custom NAnt task that iterates through each script, executing its contents on the bare db. Obviously if I need some data to go into the db then I have data insert scripts too.

This has many benefits to it. One, all of my stuff is versioned. Two, each build is a fresh build, which means that there won't be any sneaky stuff eking its way into my development process (such as dirty data that causes oddities in the system). Three, when a new guy is added to the dev team, they simply need to get latest and their local dev is built for them on the fly. Four, I can run test cases (I didn't call it a "unit test"!) on my database as the state of the database is reset with each build (meaning I can test my repositories without worrying about adding test data to the db).
This is not for everyone.
This is not for every project. I usually work on green field projects which allows me this convenience!
Rather than get into ivory tower arguments, here's a solution that has worked very well for me on real-world problems.
Building a database from scratch can be summarised as managing SQL scripts.
DBdeploy is a tool that will check the current state of a database - e.g. which scripts have been previously run against it, which scripts are available to be run and, therefore, which scripts need to be run.
It will then collate all the needed scripts together and run them. It then records which scripts have been run.
It's not the prettiest tool or the most complex - but with careful management it can work very well. It's open source and easily extensible. Once the running of the scripts is handled nicely, adding some extra components, such as a shell script that checks out the latest scripts and runs dbdeploy against a particular instance, is easily achieved.
See a good introduction here:
http://code.google.com/p/dbdeploy/wiki/GettingStarted
You might find that Liquibase handles a lot of what you're looking for.
Every developer should have their own local database, and use source code control to publish to the team. My solution is here: http://dbsourcetools.codeplex.com/
Have fun,
- Nathan

Resources