How to tap into the PostgreSQL DDL parser?

The DDL / SQL scripts used to create my PostgreSQL database are under version control. In theory, any change to the database model is tracked in the source code repository.
In practice however, it happens that the structure of a live database is altered 'on the fly' and if any client scripts fail to insert / select / etc. data, I am put in charge of fixing the problem.
It would help me enormously if I could run a quick test to verify the database still corresponds to the creation scripts in the repo, i.e. is still the 'official' version.
I started using pgTAP for that purpose and so far, it works great. However, whenever a controlled, approved change is done to the DB, the test scripts need changing, too.
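For reference, this is roughly what such a structural pgTAP test looks like; the table and column names below are made up for illustration:

-- a minimal pgTAP sketch; table/column names are hypothetical
BEGIN;
SELECT plan(3);

SELECT has_table('customer');
SELECT has_column('customer', 'email');
SELECT col_type_is('customer', 'email', 'text');

SELECT * FROM finish();
ROLLBACK;

Every approved schema change means touching assertions like these, which is what motivates generating them.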
Therefore, I considered creating the test scripts automatically. One general approach could be to
run the scripts to create the DB
access DB metadata on the server
use that metadata to generate test code
I would prefer, though, not having to create the DB, but instead to read the DB creation scripts directly. I tried to google a way to tap into the DDL parser and get some kind of metadata representation I could use; so far I have learned a lot about PostgreSQL internals, but couldn't really find a solution to the issue.
Can someone think of a way to have a PostgreSQL DDL script parsed?

Here is my method for ensuring that the live database schema matches the schema definition under version control: As part of the "build" routine of your database schema, set up a temporary database instance, load in all the schema creation scripts the way they are intended to be run, then run a pg_dump -s off that, and compare it with a schema dump of your production database. Depending on your exact circumstances, you might need to run a little bit of sed over the final product to get an exact match, but it's usually possible.
You can automate this procedure completely. Run the database "build" on SCM check-in (using a build bot, continuous integration server, or similar), and get the dumps from the live instance via a cron job. Of course, this way you'd get an alert every time someone checks in a database change, so you'll have to tweak the specifics a little.
There is no pgTAP involved there. I love pgTAP and use it for unit testing database functions and the like (also done on the CI server), but not for verifying schema properties, because the above procedure makes that unnecessary. (Also, generating tests automatically from what they are supposed to test seems a little bit like the wrong way around.)

There is a lot of database metadata to be concerned about here. I've been poking around the relevant database internals for a few years, and I wouldn't consider the project you're considering feasible to build without sinking a few man-months of programming time into it just to get a rough alpha-quality tool that handles some specific subset of the changes you're concerned about supporting. If this were easy, there wouldn't be a long-standing (as in: people have wanted it for a decade) open item to build DDL Triggers into the database, which is exactly the thing you'd like to have here.
In practice, there are two popular techniques people use to make this class of problem easier to deal with:
Set log_statement to 'ddl' and try to parse the changes it records.
Use pg_dump --schema-only to make a regular snapshot of the database structure. Put that under version control, and use changes in its diff to find the information you're looking for.
Actually taking either of these and deriving the pgTAP scripts you want directly is its own challenge. If the change made is small enough, you might be able to automate that to some degree. At least you'd be starting with a reasonably sized problem to approach from that angle.
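If the structure is available in a live database at all, one way to bootstrap those pgTAP scripts is to generate assertions straight from information_schema. A rough sketch (the format() call needs PostgreSQL 9.1 or later, quote_literal() works on older versions, and the choice of checks is only a starting point):

-- emit one has_column() assertion per column in the public schema;
-- the output still needs a plan()/finish() wrapper around it
SELECT format('SELECT has_column(%L, %L);', table_name, column_name)
FROM information_schema.columns
WHERE table_schema = 'public'
ORDER BY table_name, ordinal_position;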

Related

Databases and "branch"

We are currently developing an application which uses a database.
Every time we update the database structure, we have to provide a script to update the database from the previous version to the current one.
So the database currently carries a number that gives us its version, and our software runs the updates when we want to use an "old" database.
The issue we are encountering is with branches:
When we create a big new feature that will not be available to users (and not included in releases), we create a branch.
The main branch (trunk) is merged into it regularly to ensure that the created branch has the latest bug fixes.
The issue is with our update scripts. They update from the previous version to the current one, then update the version number of the database.
Imagine that we have the DB version 17 when creating the branch.
We then do the branch, and make changes on the Trunk DB. The DB has now the version 18.
Then we make a db change on the branch. Since we know there has already been a new version "18", we create the version 19 and the updater 18->19.
Then the trunk is merged on the branch.
At this very moment we may have some updaters that will never run.
If someone updated his database before the merge, his database will be flagged as having version 19, so the update 17->18 will never be applied.
We want to change this behavior but we can't find how:
Our constraints are:
We are unable to make all changes on the same branch
Sometimes we have more than just 2 branches, and we can only merge from the trunk to the feature branch until the feature is finished
What can we do to ensure continuity between our database branches?
I think the easiest way is to use the Ruby-on-rails approach. Every DB change is a separate script file, no matter how small. Each script file is numbered, and when you do an upgrade you simply run each script from the number your DB currently is to the last one.
What this means in practice is that your DB version system stops being v18 to v19, and starts being v18.0 to v18.01, then v18.02 etc. What you release to the customer may get rolled up into a big v19 upgrade script, but as you develop, you will be making many, many small upgrades.
You'll have to modify this slightly to work for your system: each script will either have to be renumbered as it gets merged to the branch, or you will have to ensure the upgrade scripts don't simply track the last upgrade number but track each upgrade number, so missing holes will still get filled in as the scripts get merged across.
You will also have to roll up these little upgrades into the next major number as you create the release tag (on the trunk first) to keep things sane.
edit: so fundamentally you first have to get rid of the notion of using an upgrade script to go from version to version. For example, if you start with a table, and trunk adds column A and the branch adds column B, then you merge trunk to branch - you cannot realistically "upgrade" to the version with both, unless the branch version number is always greater than the trunk's upgrade script, and that doesn't work if you subsequently merge trunk into the branch. So you must therefore scrap the idea of a "version" that applies to development branches. The only way round that is to update each change independently, and track each change individually. Then you can say you need the "last main release plus colA plus colB" (admittedly if you merge trunk in, you can take the current main release from trunk whether it's v18 or v19, but you still need to apply each branch update individually).
So you start with trunk at DB v18. Branch and make changes. Then you merge trunk later, when the DB is at v19. Your earlier branch changes still need to be applied (or should already be applied, but you may need to write a branch-update script with all branch changes in it, if you re-create your DB). Note the branch does not have a "v20" version number at all, and the branch's changes are not made to a single update script like you have on trunk. You can add the changes you make on the branch as a single script if you like (or 1 script of 'since the last trunk merge' changes) or as many little scripts.
When the branch is complete, the very last task is to take all the DB changes made for the branch and roll them up into a script that can be applied to the master upgrader; when it is merged onto trunk, that script is merged into the current upgrade script and the DB version number bumped.
There is an alternative that may work for you, but I found it to be a little flaky when you try to update DBs with data; sometimes it just couldn't manage to do the update and the DB had to be wiped and re-created (which, to be fair, is probably what would have had to happen if I had used SQL scripts at the time). That's to use a Visual Studio Database project. This stores every part of the schema as a file, so you'll have 1 script per table. These will be hidden from you by Visual Studio itself, which will show you designers instead of scripts, but they're stored as files in version control. VS can deploy the project and will try to upgrade your DB if it already exists. Be careful of the options: many defaults say "drop and create" instead of using ALTER to update an existing table.
These projects can generate a (largely machine-readable) SQL script for deployment; we used to generate these and deliver them to a DBA team who didn't use VS and only accepted SQL.
And lastly, there's Roundhouse, which is not something I've used, but it might serve as your new upgrade "script". It's a free project and I've read it's more powerful and easier to use than VS DB projects. It's a DB versioning and change management tool, integrates with VS, and uses SQL scripts.
We have been using the following procedure for about 1.5 years now. I don't know if this is the best solution, but we haven't had any trouble with it (except some human errors in a delta file, like forgetting a USE statement).
It has some similarities with the answer that Krumia gave, but differs in that only new change scripts/delta files are executed. This makes it a lot easier to write those files.
Delta files
Write all the DB changes you make for a feature in a delta file. You can have multiple statements in one delta file or split them up into multiple files. Once a file is committed it's best (and once merged it's necessary) to start a new one and leave the old one untouched.
Put all the delta files in one directory and give them a name pattern like YYYY-MM-DD-HH.mm.description.sql. It's essential that you can sort them in time (hence the timestamp) so you know which file needs to be executed first. Besides that, you don't want a merge conflict on those files, so the names should be unique (across all branches).
Merging/pulling
Create a merge script (for example a bash script) that performs the following actions:
Note the current commit-hash
Do the actual merge (or pull)
Get a list of all the delta-files that are added with this merge (git diff --stat $old_hash..HEAD -- path/to/delta-files)
Execute those delta-files, in the order specified by the timestamp
By using git to determine which files are new (and thus which database actions haven't been executed yet on the current branch) you are no longer bound to version numbering.
Alternating delta-files
It might happen that within one merge, delta files from different branches are 'new to execute' and that those files alternate like this:
2014-08-04-delta-from-feature_A.sql
2014-08-05-delta-from-feature_B.sql
2014-08-06-delta-from-feature_A.sql
As the timestamp determines the execution order, something from feature A will be applied first, then feature B, then feature A again. When you write proper delta files that are stand-alone and executable by themselves, that shouldn't be a problem.
We recently started using SQL Server Data Tools (SSDT), which replaced the Visual Studio Database Project type, to version control our SQL databases. It creates a project for each database, with items for views and stored procedures and the ability to create Data-Tier Applications (DACPAC) that can be deployed to SQL Server instances. SSDT also supports unit testing and static data, and offers developers the option of quick sandbox testing using a LocalDB instance. There is a good TechEd video overview of the SSDT tools and a lot more resources online.
In your situation you would use SSDT to manage your database objects in version control alongside your application code, using the same merging process to push features between branches. When it comes time to upgrade an existing install, you would create the DACPACs and use the Data-Tier Application upgrade process to apply the changes. Alternatively you could also use database synchronization tools such as DBGhost or RedGate to apply updates to the existing schema.
You want database migrations. Many frameworks have plugins for this. For instance, CakePHP uses a plugin from CakeDC to manage them. Here are some generic tools: http://en.wikipedia.org/wiki/Schema_migration#Available_Tools.
If you want to roll your own, perhaps instead of keeping the current DB version in the database, you keep a list of which patches have been applied. So instead of version table with one row with value 19, you instead have a patches table with multiple rows:
Patches
1
2
3
4
5
8
Looking at this you need to apply patches 6 and 7.
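A minimal sketch of that bookkeeping (table and column names are invented, and generate_series is PostgreSQL-specific; other databases need an equivalent numbers source):

CREATE TABLE applied_patches (
    patch_no   integer PRIMARY KEY,
    applied_at timestamp NOT NULL DEFAULT current_timestamp
);

-- list the patches that still need to be applied (6 and 7 in the example above)
SELECT s.n AS missing_patch
FROM generate_series(1, (SELECT max(patch_no) FROM applied_patches)) AS s(n)
WHERE s.n NOT IN (SELECT patch_no FROM applied_patches)
ORDER BY s.n;

Each upgrade script then inserts its own number into applied_patches as its last statement, so the table always reflects what has actually run.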
I just stumbled upon an older article written in 2008 by Jeff Atwood; hopefully it is still relevant to your problem.
Get Your Database Under Version Control
It mentions a five-part series written by K. Scott Allen:
Three rules for database work
The Baseline
Change Scripts
Views, Stored Procedures and the Like
Branching and Merging
There are tools specifically designed to deal with this type of problem.
One is DBSourceTools
DBSourceTools is a GUI utility to help developers bring SQL Server
databases under source control. A powerful database scripter, code
editor, sql generator, and database versioning tool. Compare Schemas,
create diff scripts, edit T-SQL with ease. Better than Management
Studio.
Another one:
neXtep Designer
NeXtep designer is an Integrated Development Environment for database
developers. The main concept behind the product is to take advantage
of versioning in order to compute the incremental SQL scripts you need
to deliver your developments.
This project aims at building a development platform that provides all
tools which a database developer needs while automating the tasks of
generating the deliveries (= SQL resulting from a development).
To learn more about the problematic of delivering database updates, we
invite you to read the Delivering database updates article which will
present you our vision of best and worst practices.
I think an approach which will satisfy most of your requirements is to embrace the "Database Refactoring" concept.
There is a good book on this topic Refactoring Databases: Evolutionary Database Design
A database refactoring is a small change to your database schema which
improves its design without changing its semantics (e.g. you don't add
anything nor do you break anything). The process of database
refactoring is the evolutionary improvement of your database schema so
as to improve your ability to support the new needs of your customers,
support evolutionary software development, and to fix existing legacy
database design problems.
The book describes database refactoring from the point of view of:
Technology. It includes full source code for how to implement each refactoring at the database level and for most refactorings we show how the application would change to reflect the change in the database. Our code examples are in Oracle, Java, and Hibernate meta-data (the refactorings are easy to translate to other environments, and sometimes we discuss vendor-specific features which simplify some refactorings).
Process. It describes in detail the process of database refactoring in both the simple situation of a single application accessing the database as well as the situation of the database being accessed by many programs, many of which are out of the scope of your authority. The technical examples assume the latter situation, so if you're in the simple situation you may find some of our solutions to be a little more complicated than you need (lucky you!).
Culture. Although it is technically simple to implement individual refactorings, and clearly possible (albeit a little complicated) to adapt your internal processes to support database refactoring, the fact is that cultural challenges within your organization will likely prove to be the most difficult hurdle to overcome.
This idea may or may not work, but reading about your work so far and the previous answers, it looks like you are reinventing the wheel. The "wheel" is source control, with its branch, merge and version tracking features.
At the moment, for each DB schema change, you have a SQL file containing the changes from the previous one. You already mention the significant issues you have with this approach.
Replace your method with this one: maintain ONE (and only ONE!) SQL file, which stores all the DDL commands for creating tables, indexes, and so on from scratch. Need to add a new field? Add an "ALTER TABLE" line to your SQL file. This way your source control tool will in effect manage your database schema, and each branch can have a different one.
All of a sudden, the source code is in sync with the database schema, branching and merging works, and so on.
Note: Just to clarify the purpose of the script mentioned here is to recreate the database from scratch up to a specific version, every single time.
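A tiny sketch of what that one file can look like over time (the table is invented); earlier statements are never edited and new ones are only appended, so merges behave like any other source file:

-- create.sql: the single schema file, always replayed from scratch
CREATE TABLE customer (
    id   integer PRIMARY KEY,
    name varchar(100) NOT NULL
);

-- appended later, on whichever branch needed it
ALTER TABLE customer ADD COLUMN email varchar(255);
CREATE INDEX customer_email_idx ON customer (email);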
EDIT: I spent some time looking for material to support this approach. Here is one that looks particularly good, with a proven track record:
Database Schema Versioning Management 101
Have you seen this situation before?
Your team is writing an enterprise application around a database
Since everyone is building around the same database, the schema of the database is in flux
Everyone has their own "local" copies of the database
Every time someone changes the schema, all of these copies need the latest schema to work with the latest build of the code
Every time you deploy to a staging or production database, the schema needs to work with the latest build of the code
Factors such as schema dependencies, data changes, configuration changes, and remote developers muddy the water
How do you currently address this problem of keeping the database
versions in working order? Do you suspect this is taking more time
than necessary? There are many ways to approach this problem, and the
answer depends on the workflow in your environment. The following
article describes a distilled and simplistic methodology you can use
as a starting point.
Since it can be implemented with ANSI SQL, it is database agnostic
Since it depends on scripting, it requires negligible storage management, and it can fit in your current code version management program
The database versioning method you are using is certainly wrong, in my opinion. If anything has to have versions, it should be the source code. The source code has versions. Your live environment is only an instance of the source code.
The answer is to apply database changes using redeployable change scripts.
All changes, no matter which branch it is on (even in master/trunk) should be done in a separate script.
Sequence your scripts so that newer ones will not get executed first. Having a date prefix in the format YYYYMMDD in the filename has worked for us.
When a change is needed, it is made to the source code, not the database. You can have as many instances/builds for various tags/branches in the VCS as you like; for example, separate live builds for each branch.
Then you only have to do the build for each instance (probably every day). The build should fetch the files from the relevant branch and perform compiling/deploying. Since the scripts are redeployable, old scripts have no effect on the database. Only the recent changes are deployed to the database.
But, how to make redeployable scripts?
This is a question that is hard to answer, since you have not specified which database you are using. So I will give you an example of how my organization does it.
Let me take a simple example: if we need to add a column to a particular table, we do not just write ALTER TABLE ... ADD COLUMN .... We write code to add the column if, and only if, that column does not exist in the given table.
Now, we have a separate API to handle all that existence-checking boilerplate code, so our scripts are simply calls to those APIs. You will have to write your own. These APIs are not actually that hard to write (we're using Oracle RDBMS), but they give us a huge gain in version control and deployment.
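The in-house API itself isn't shown here, but a stripped-down sketch of the underlying idea in Oracle PL/SQL (table and column names are hypothetical) looks roughly like this:

-- add a column only if it is not already there, so the script can be re-run safely
DECLARE
  v_count INTEGER;
BEGIN
  SELECT COUNT(*)
    INTO v_count
    FROM user_tab_columns
   WHERE table_name  = 'CUSTOMER'
     AND column_name = 'EMAIL';

  IF v_count = 0 THEN
    EXECUTE IMMEDIATE 'ALTER TABLE customer ADD (email VARCHAR2(255))';
  END IF;
END;
/

Wrapping that pattern in a reusable procedure is what turns each change script into a one-line, redeployable call.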
But, that's only one scenario; there are a gazillion ways a schema definition can change
Yes indeed. The data type of a column can change; a new table can be added; an attribute column can be merged into a primary key (very rare); sequences can change; constraints; foreign keys; they all can change.
But it turns out that all this can be handled by API's with special privileges to read metadata tables. I am not saying it's easy, but I am saying that it is a one time cost.
But, how do you roll back a database change?
My personal experience is that if you put some real effort into designing before banging the keyboard to write ALTER TABLE statements, this scenario is extremely rare. And if there ever is a rollback, you should handle it manually (e.g. manually remove the added column).
Normally, changes to views and stored procedures are rather common, and changes to table definitions are rare.
Building the Database
As I said before, building the database can be done by running all the redeployable scripts. Previously deployed scripts have no effect.
Your database deployment script should not start with DROP DATABASE. Your database has lots of data which was used for unit tests. Unless you make a really, really simple system, this data will be valuable in the future for testing. Your testers will not be too happy about adding ten thousand records to various tables every time a database is upgraded.
Putting testers aside, how are you planning to upgrade your clients'/customers' production database without annihilating all their production data? This is why you must use redeployable change scripts.
You can try version number schemes such as 18.1-branchname etc., but they are really going to fail utterly, because you can merge your source, not its instances.
I think that the way you pose the problem is impossible to solve, but if you change part of your process there is a solution. Let's start with the first part: why it is impossible to solve using just deltas. In the following I assume you have the main trunk and two branches, dev-a and dev-b; both branches stem from the same point in time.
Why it cannot work
Say Alice adds a delta script to dev-a:
ALTER TABLE t1 ALTER COLUMN col5 char(4)
and Bob adds another script in dev-b
ALTER TABLE t1 ALTER COLUMN col5 int
The two scripts are clearly incompatible and you end up in breaking code in main when you merge back from any of the two. The merge tool cannot be of help if the script files have different names.
Possible solution
My suggestion is to describe your database in terms of both baseline and deltas: the delta scripts must always refer to a specific baseline, so you are able to compute a new baseline schema resulting from the application of successive deltas to a specific baseline.
An example
dev-a         *--B.A1--D.1#A1--D2#A1------B.A2--*--B.A3--
             /                                 /
main  --B.0-*------------------------------*--B.1---*----------
             \                            /
dev-b         *--B.B1--D.1#B1--B.B2------*
Note that after branching you immediately spin off a new baseline, and you do the same before every merge. This way you can check that the baselines are compatible.
Final comment
Managing deltas in version control is kind of reinventing the wheel, as each delta script is functionally equivalent to saving a different version of the baseline script. That said, I agree with you that in practice they convey more value and force people to think about what happens in production when you change the database.
If you opt to store only baselines, you have plenty of tools to support that.
Another option is to serialize work on the database, either as a whole or by partitioning the schema into separate areas with unique owners.

Database Insert into script preparation for tests (logging data state?)

It is hard to formulate the question in the title, so I'll try to explain.
I need to write automated UI tests for an application which is already written. This application has a big testing database loaded with a lot of data. Sometimes it is hard to understand the relationships between tables, as they are not trivial and some foreign keys are missing (the logic is implemented in the Java application, and some logic is in stored procedures).
The problem is that I can run my tests only once: after they have finished, some data has been moved and some has been deleted by the application. So I need to prepare scripts and run INSERT INTO statements before every test.
Is there a way to make preparing such scripts easier? Of course, a good solution would be to investigate the whole DB structure and its dependencies (or review the application logic in Java), but that would take a lot of time. I cannot save the database data and look for the changes after the tests are finished, as the data export/import would take a lot of time in SQL Developer. Maybe DB administrators have other Oracle tools for doing this?
I suggest you obtain and use a good testing framework. They're usually free. For Java the standard is JUnit. For PL/SQL, I like utPlsql but there are several others available. Using the testing framework you can insert the needed data into the database at the start of the test, run the test, and make sure the end results are as expected, then clean up the data so you can run the test again cleanly. This requires a fair amount of coding to make it all work right, but it's much easier to do this coding once than to perform the same tasks manually many times over.
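Whatever framework you pick, the shape of each test's data handling is the same: a setup script plus a matching teardown script. A minimal sketch in plain SQL (tables, ids and values are all invented) would be:

-- setup: insert the rows one UI test needs, using ids reserved for tests
INSERT INTO customers (id, name) VALUES (9001, 'Test Customer');
INSERT INTO orders (id, customer_id, status) VALUES (9501, 9001, 'NEW');
COMMIT;

-- teardown: remove everything keyed on the known test ids, so real data is untouched
DELETE FROM orders WHERE customer_id = 9001;
DELETE FROM customers WHERE id = 9001;
COMMIT;

The testing framework's job is mainly to run the setup before each test and guarantee the teardown runs afterwards, even when the test fails.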
Share and enjoy.

YAGNI and database creation scripts

Right now, I have code which creates the database (just a few CREATE queries on a SQLite database) in my main database access class. This seems unnecessary as I have no intention of ever using the code. I would just need it if something went wrong and I needed to recreate the database. Should I...
Leave things as they are, even though the database creation code is about a quarter of my file size.
Move the database-creation code to a separate script. It's likely I'll be running it manually if I ever need to run it again anyway, and that would put it out-of-sight-out-of-mind while working on the main code.
Delete the database-creation code and rely on revision control if I ever find myself needing it again.
I think it is best to keep the code. Even more importantly, you should maintain this code (or generate it) every time the database schema changes.
It is important for the following reasons.
You will probably be surprised how many times you need it. If you need to migrate your server, or setup another environment (e.g. TEST or DEMO), and so on.
I also find that I refer to the DDL SQL quite often when coding, particularly if I have not touched the system for a while.
You have a reference for the decisions you made, such as the indexes you created, unique keys, etc etc.
If you do not have a disciplined approach to this, I have found that the database schema can drift over time as ad-hoc changes are made, and this can cause obscure issues that are not found until you hit the database. Even worse without a disciplined approach (i.e. a reference definition of the schema) you may find that different databases have subtly different schema.
I would just need it if something went wrong and I needed to recreate the database.
Recreating the database is absolutely not an exceptional case. That code is part of your deployment process on a new / different system, and it represents the DB structure your code expects to work with. You should actually have integration tests that confirm this. Working indefinitely with a single DB server whose schema was created incrementally via manually dispatched SQL statements during development is not something you should rely on.
But yes, it should be separated from the access code; so option 2 is correct. The separate script can then be used by tests as well as for deployment.

How should you build your database from source control?

There has been some discussion on the SO community wiki about whether database objects should be version controlled. However, I haven't seen much discussion about the best-practices for creating a build-automation process for database objects.
This has been a contentious point of discussion for my team - particularly since developers and DBAs often have different goals, approaches, and concerns when evaluating the benefits and risks of an automation approach to database deployment.
I would like to hear some ideas from the SO community about what practices have been effective in the real world.
I realize that it is somewhat subjective which practices are really best, but I think a good dialog about what works could be helpful to many folks.
Here are some of my teaser questions about areas of concern in this topic. These are not meant to be a definitive list - rather a starting point for people to help understand what I'm looking for.
Should both test and production environments be built from source control?
Should both be built using automation - or should production be built by copying objects from a stable, finalized test environment?
How do you deal with potential differences between test and production environments in deployment scripts?
How do you test that the deployment scripts will work as effectively against production as they do in test?
What types of objects should be version controlled?
Just code (procedures, packages, triggers, java, etc)?
Indexes?
Constraints?
Table Definitions?
Table Change Scripts? (eg. ALTER scripts)
Everything?
Which types of objects shouldn't be version controlled?
Sequences?
Grants?
User Accounts?
How should database objects be organized in your SCM repository?
How do you deal with one-time things like conversion scripts or ALTER scripts?
How do you deal with retiring objects from the database?
Who should be responsible for promoting objects from development to test level?
How do you coordinate changes from multiple developers?
How do you deal with branching for database objects used by multiple systems?
What exceptions, if any, can reasonably be made to this process?
Security issues?
Data with de-identification concerns?
Scripts that can't be fully automated?
How can you make the process resilient and enforceable?
To developer error?
To unexpected environmental issues?
For disaster recovery?
How do you convince decision makers that the benefits of DB-SCM truly justify the cost?
Anecdotal evidence?
Industry research?
Industry best-practice recommendations?
Appeals to recognized authorities?
Cost/Benefit analysis?
Who should "own" database objects in this model?
Developers?
DBAs?
Data Analysts?
More than one?
Here are some answers to your questions:
Should both test and production environments be built from source control? YES
Should both be built using automation - or should production be built by copying objects from a stable, finalized test environment?
Automation for both. Do NOT copy data between the environments
How do you deal with potential differences between test and production environments in deployment scripts?
Use templates, so that you would actually produce a different set of scripts for each environment (e.g. references to external systems, linked databases, etc.)
How do you test that the deployment scripts will work as effectively against production as they do in test?
You test them on pre-production environment: test deployment on exact copy of production environment (database and potentially other systems)
What types of objects should be version controlled?
Just code (procedures, packages, triggers, java, etc)?
Indexes?
Constraints?
Table Definitions?
Table Change Scripts? (eg. ALTER scripts)
Everything?
Everything, and:
Do not forget static data (lookup lists etc), so you do not need to copy ANY data between environments
Keep only current version of the database scripts (version controlled, of course), and
Store ALTER scripts: 1 BIG script (or a directory of scripts named like 001_AlterXXX.sql, so that running them in natural sort order will upgrade from version A to B)
Which types of objects shouldn't be version controlled?
Sequences?
Grants?
User Accounts?
see 2. If your users/roles (or technical user names) are different between environments, you can still script them using templates (see 1.)
How should database objects be organized in your SCM repository?
How do you deal with one-time things like conversion scripts or ALTER scripts?
see 2.
How do you deal with retiring objects from the database?
deleted from DB, removed from source control trunk/tip
Who should be responsible for promoting objects from development to test level?
dev/test/release schedule
How do you coordinate changes from multiple developers?
Try NOT to create a separate database for each developer. You use source control, right? In that case developers change the database and check in the scripts. To be completely safe, re-create the database from the scripts during the nightly build.
How do you deal with branching for database objects used by multiple systems?
tough one: try to avoid at all costs.
What exceptions, if any, can reasonably be made to this process?
Security issues?
do not store passwords for test/prod. you may allow it for dev, especially if you have automated daily/nightly DB rebuilds
Data with de-identification concerns?
Scripts that can't be fully automated?
document and store with the release info/ALTER script
How can you make the process resilient and enforceable?
To developer error?
tested with daily build from scratch, and compare the results to the incremental upgrade (from version A to B using ALTER). compare both resulting schema and static data
To unexpected environmental issues?
use version control and backups
compare the PROD database schema to what you think it is, especially before deployment. SuperDuperCool DBA may have fixed a bug that was never in your ticket system :)
For disaster recovery?
How do you convince decision makers that the benefits of DB-SCM truly justify the cost?
Anecdotal evidence?
Industry research?
Industry best-practice recommendations?
Appeals to recognized authorities?
Cost/Benefit analysis?
if developers and DBAs agree, you do not need to convince anyone, I think (unless you need money to buy software like DBGhost for MSSQL)
Who should "own" database objects in this model?
Developers?
DBAs?
Data Analysts?
More than one?
Usually DBAs approve the model (before check-in, or afterwards as part of code review). They definitely own performance-related objects. But in general the team owns it [and the employer, of course :)]
I treat the SQL as source-code when possible
If I can write it in standards-compliant SQL then it generally goes in a file in my source control. The file will define as much as possible, such as SPs and table CREATE statements.
I also include dummy data for testing in source control:
proj/sql/setup_db.sql
proj/sql/dummy_data.sql
proj/sql/mssql_specific.sql
proj/sql/mysql_specific.sql
And then I abstract out all my SQL queries so that I can build the entire project for MySQL, Oracle, MSSQL or anything else.
Build and test automation uses these build-scripts as they are as important as the app source and tests everything from integrity through triggers, procedures and logging.
We use continuous integration via TeamCity. At each checkin to source control, the database and all the test data is re-built from scratch, then the code, then the unit tests are run against the code. If you're using a code-generation tool like CodeSmith, it can also be placed into your build process to generate your data access layer fresh with each build, making sure that all your layers "match up" and do not produce errors due to mismatched SP parameters or missing columns.
Each build has its own collection of SQL scripts that are stored in the $project\SQL\ directory in source control, assigned a numerical prefix and executed in order. That way, we're practicing our deployment procedure at every build.
Depending on the lookup table, most of our lookup values are also stored in scripts and run to make sure the configuration data is what we expect for, say, "reason_codes" or "country_codes". This way we can make a lookup data change in dev, test it out and then "promote" it through QA and production, instead of using a tool to modify lookup values in production, which can be dangerous for uptime.
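A sketch of what one of those lookup scripts can look like so it is safe to run in every environment (the MERGE below is SQL Server flavoured, and the table and values are invented):

-- upsert a reason code so the same script can be promoted through dev, QA and production
MERGE reason_codes AS target
USING (VALUES (42, 'Customer requested cancellation')) AS source (code, description)
   ON target.code = source.code
WHEN MATCHED THEN
    UPDATE SET description = source.description
WHEN NOT MATCHED THEN
    INSERT (code, description) VALUES (source.code, source.description);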
We also create a set of "rollback" scripts that undo our database changes, in case a build to production goes screwy. You can test the rollback scripts by running them, then re-running the unit tests for the build one version below yours, after its deployment scripts run.
+1 for Liquibase:
LiquiBase is an open source (LGPL), database-independent library for tracking, managing and applying database changes. It is built on a simple premise: All database changes (structure and data) are stored in an XML-based descriptive manner and checked into source control.
The good point is that DML changes are stored semantically, not just as diffs, so you can track the purpose of the changes.
It can be combined with Git version control for better interaction. I'm going to configure our dev-prod environment to try it out.
You could also use Maven or Ant build systems for building production code from the scripts.
The minus is that LiquiBase doesn't integrate into widespread SQL IDEs, so you have to do basic operations yourself.
In addition to this you could use DBUnit for DB testing - this tool allows data generation scripts to be used for testing your production environment, with cleanup afterwards.
IMHO:
Store DML in files so that you can version them.
Automate the schema build process from source control.
For testing purposes a developer can use a local DB built from source control via the build system, and load test data with scripts or DBUnit scripts (from source control).
LiquiBase allows you to provide a "run sequence" for scripts to respect dependencies.
There should be a DBA team that checks the master branch with ALL changes before production use. I mean they check the trunk/branch from other DBAs before committing into the MASTER trunk. So that master is always consistent and production ready.
We faced all mentioned problems with code changes, merging, rewriting in our billing production database. This topic is great for discovering all that stuff.
By asking "teaser questions" you seem to be more interested in a discussion than someone's opinion of final answers. The active (>2500 members) mailing list agileDatabases has addressed many of these questions and is, in my experience, a sophisticated and civil forum for this kind of discussion.
I basically agree with every answer given by van. For more insight, my baseline for database management is K. Scott Allen's series (a must-read, IMHO, and Jeff's opinion too, it seems).
Database objects can always be rebuilt from scratch by launching a single SQL file (that can itself call other SQL files) : Create.sql. This can include static data insertion (lists...).
The SQL scripts are parameterized so that no environment-dependent and/or sensitive information is stored in plain files.
I use a custom batch file to launch Create.sql: Create.cmd. Its goal is mainly to check for prerequisites (tools, environment variables...) and send parameters to the SQL script. It can also bulk-load static data from CSV files for performance reasons.
Typically, system user credentials would be passed as a parameter to the Create.cmd file.
IMHO, dynamic data loading should require another step, depending on your environment. Developers will want to load their database with test, junk or no data at all, while at the other end production managers will want to load production data. I would consider storing test data in source control as well (to ease unit testing, for instance).
Once the first version of the database has been put into production, you will need not only build scripts (mainly for developers), but also upgrade scripts (based on the same principles) :
There must be a way to retrieve the version from the database (I use a stored procedure, but a table would do as well).
Before releasing a new version, I create an Upgrade.sql file (that can call other ones) that allows upgrading version N-1 to version N (N being the version being released). I store this script under a folder named N-1.
I have a batch file that does the upgrade : Upgrade.cmd. It can retrieve the current version (CV) of the database via a simple SELECT statement, launch the Upgrade.sql script stored under the CV folder, and loop until no folder is found. This way, you can automatically upgrade from, say, N-3 to N.
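A sketch of the versioning piece this relies on, shown as a table rather than the stored procedure mentioned above (names and the version value are illustrative):

-- one row per installed version; Upgrade.cmd reads the latest one
CREATE TABLE db_version (
    version_no   integer   NOT NULL PRIMARY KEY,
    installed_at timestamp NOT NULL
);

-- the SELECT the upgrade batch runs to find the current version (CV)
SELECT MAX(version_no) AS current_version FROM db_version;

-- each Upgrade.sql records, as its last statement, the version it just installed
INSERT INTO db_version (version_no, installed_at) VALUES (12, CURRENT_TIMESTAMP);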
Problems with this are :
It is difficult to automatically compare database schemas, depending on database vendors. This can lead to incomplete upgrade scripts.
Every change to the production environment (usually by DBAs for performance tuning) should find its way into source control as well. To make sure of this, it is usually possible to log every modification to the database via a trigger (see the sketch after this list). This log is reset after every upgrade.
Ideally, though, DBA-initiated changes should be part of the release/upgrade process when possible.
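A sketch of such a change log, written as an Oracle DDL trigger since the answer leans on Oracle elsewhere (the log table, trigger name and column sizes are invented):

-- table receiving one row per schema change made outside the normal upgrade scripts
CREATE TABLE ddl_log (
    performed_at TIMESTAMP DEFAULT SYSTIMESTAMP,
    db_user      VARCHAR2(128),
    operation    VARCHAR2(32),
    object_type  VARCHAR2(64),
    object_name  VARCHAR2(128)
);

CREATE OR REPLACE TRIGGER trg_log_ddl
AFTER DDL ON SCHEMA
BEGIN
  INSERT INTO ddl_log (db_user, operation, object_type, object_name)
  VALUES (USER, ora_sysevent, ora_dict_obj_type, ora_dict_obj_name);
END;
/

Truncating ddl_log as the final step of each upgrade gives the "reset after every upgrade" behaviour described above.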
As to what kind of database objects you want to have under source control? Well, I would say as much as possible, but not more ;-) If you want to create users with passwords, give them a default password (login/login, practical for unit testing purposes), and make the password change a manual operation. This happens a lot with Oracle, where schemas are also users...
We have our Silverlight project with an MSSQL database in Git version control. The easiest way is to make sure you've got a slimmed-down database (content-wise), and do a complete dump from e.g. Visual Studio. Then you can run 'sqlcmd' from your build script to recreate the database on each dev machine.
For deployment this is not possible since the databases are too large: that's the main reason for having them in a database in the first place.
I strongly believe that a DB should be part of source control and, to a large degree, part of the build process. If it is in source control then I have the same coding safeguards when writing a stored procedure in SQL as I do when writing a class in C#. I do this by including a DB scripts directory under my source tree. This script directory doesn't necessarily have one file for one object in the database. That would be a pain in the butt! I develop in my db just as I would in my code project. Then when I am ready to check in, I do a diff between the last version of my database and the current one I am working on. I use SQL Compare for this and it generates a script of all the changes. This script is then saved to my db_update directory with a specific naming convention 1234_TasksCompletedInThisIteration, where the number is the next number in the set of scripts already there, and the name describes what is being done in this check-in.
I do it this way because, as part of my build process, I start with a fresh database that is then built up programmatically using the scripts in this directory. I wrote a custom NAnt task that iterates through each script, executing its contents on the bare db. Obviously if I need some data to go into the db then I have data insert scripts too.
This has many benefits to it. One, all of my stuff is versioned. Two, each build is a fresh build, which means that there won't be any sneaky stuff creeping its way into my development process (such as dirty data that causes oddities in the system). Three, when a new guy is added to the dev team, they simply need to get latest and their local dev is built for them on the fly. Four, I can run test cases (I didn't call it a "unit test"!) on my database, as the state of the database is reset with each build (meaning I can test my repositories without worrying about adding test data to the db).
This is not for everyone.
This is not for every project. I usually work on green field projects which allows me this convenience!
Rather than get into ivory tower arguments, here's a solution that has worked very well for me on real-world problems.
Building a database from scratch can be summarised as managing sql scripts.
DBdeploy is a tool that will check the current state of a database - e.g. what scripts have been previously run against it, what scripts are available to be run and therefore what scripts are needed to be run.
It will then collate all the needed scripts together and run them. It then records which scripts have been run.
It's not the prettiest tool or the most complex - but with careful management it can work very well. It's open source and easily extensible. Once the running of the scripts is handled nicely adding some extra components such as a shell script that checks out the latest scripts and runs dbdeploy against a particular instance is easily achieved.
See a good introduction here:
http://code.google.com/p/dbdeploy/wiki/GettingStarted
You might find that Liquibase handles a lot of what you're looking for.
Every developer should have their own local database, and use source code control to publish to the team. My solution is here : http://dbsourcetools.codeplex.com/
Have fun,
- Nathan

Database source control with Oracle

I have been looking for hours for a way to check a database into source control. My first idea was a program for calculating database diffs and asking all the developers to implement their changes as new diff scripts. Now I find that if I can dump a database into a file, I could check it in and use it as just another type of file.
The main conditions are:
Works for Oracle 9R2
Human readable so we can use diff to see the differences. (.dmp files don't seem readable)
All tables in a batch. We have more than 200 tables.
It stores BOTH STRUCTURE AND DATA
It supports CLOB and RAW Types.
It stores procedures, packages and their bodies, functions, tables, views, indexes, constraints, sequences and synonyms.
It can be turned into an executable script to rebuild the database into a clean machine.
Not limited to really small databases (supports at least 200,000 rows)
It is not easy. I have downloaded a lot of demos that fail in one way or another.
EDIT: I wouldn't mind alternative approaches provided that they allow us to check a working system against our release DATABASE STRUCTURE AND OBJECTS + DATA in a batch mode.
By the way, our project has been developed for years. Some approaches can be easily implemented when you make a fresh start but seem hard at this point.
EDIT: To understand the problem better, let's say that some users can sometimes make changes to the config data in the production environment. Or developers might create a new field or alter a view without notice in the release branch. I need to be aware of these changes or it will be complicated to merge them into production.
So many people try to do this sort of thing (diff schemas). My opinion is
Source code goes into a version control tool (Subversion, CVS, Git, Perforce ...). Treat it as if it were Java or C code; it's really no different. You should have an install process that checks it out and applies it to the database.
DDL IS SOURCE CODE. It goes into the version control tool too.
Data is a grey area - lookup tables should probably be in a version control tool; application-generated data certainly should not.
The way I do things these days is to create migration scripts similar to Ruby on Rails migrations. Put your DDL into scripts and run them to move the database between versions. Group changes for a release into a single file or set of files. Then you have a script that moves your application from version x to version y.
One thing I never ever do anymore (and I used to do it until I learned better) is use any GUI tools to create database objects in my development environment. Write the DDL scripts from day 1 - you will need them anyway to promote the code to test, production etc. I have seen so many people who use the GUIs to create all the objects, and come release time there is a scramble to produce scripts to create/migrate the schema correctly, scripts that are often not tested and fail!
Everyone will have their own preference to how to do this, but I have seen a lot of it done badly over the years which formed my opinions above.
Oracle SQL Developer has a "Database Export" function. It can produce a single file which contains all DDL and data.
I use PL/SQL Developer with a VCS plug-in that integrates with Team Foundation Server, but it only has support for database objects, not the data itself, which is usually left out of source control anyway.
Here is the link: http://www.allroundautomations.com/bodyplsqldev.html
It may not be as slick as detecting the diffs, however we use a simple ant build file. In our current CVS branch, we'll have the "base" database code broken out into the ddl for tables and triggers and such. We'll also have the delta folder, broken out in the same manner. Starting from scratch, you can run "base" + "delta" and get the current state of the database. When you go to production, you'll simply run the "delta" build and be done. This model doesn't work uber-well if you have a huge schema and you are changing it rapidly. (Note: At least among database objects like tables, indexes and the like. For packages, procedures, functions and triggers, it works well.) Here is a sample ant task:
<target name="buildTables" description="Build Tables with primary keys and sequences">
<sql driver="${conn.jdbc.driver}" password="${conn.user.password}"
url="${conn.jdbc.url}" userid="${conn.user.name}"
classpath="${app.base}/lib/${jdbc.jar.name}">
<fileset dir="${db.dir}/ddl">
<include name="*.sql"/>
</fileset>
</sql>
</target>
I think this is a case of,
You're trying to solve a problem
You've come up with a solution
You don't know how to implement the solution
so now you're asking for help on how to implement the solution
The better way to get help,
Tell us what the problem is
ask for ideas for solving the problem
pick the best solution
I can't tell what the problem you're trying to solve is. Sometimes it's obvious from the question; this one certainly isn't. But I can tell you that this 'solution' will turn into its own maintenance nightmare. If you think developing the database and the app that uses it is hard, this idea of versioning the entire database in a human-readable form is nothing short of insane.
Have you tried Oracle's Workspace Manager? Not that I have any experience with it in a production database, but I found some toy experiments with it promising.
Don't try to diff the data. Just write a trigger to store whatever you want to capture when the data is changed.
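A minimal sketch of that idea for the Oracle setup described in the question (the config table, audit table and columns are all invented):

-- history table; assumes an app_config table with config_key / config_value columns
CREATE TABLE app_config_audit (
    changed_at TIMESTAMP,
    changed_by VARCHAR2(128),
    config_key VARCHAR2(128),
    old_value  VARCHAR2(4000),
    new_value  VARCHAR2(4000)
);

CREATE OR REPLACE TRIGGER trg_config_audit
AFTER INSERT OR UPDATE OR DELETE ON app_config
FOR EACH ROW
BEGIN
  INSERT INTO app_config_audit (changed_at, changed_by, config_key, old_value, new_value)
  VALUES (SYSTIMESTAMP, USER,
          COALESCE(:NEW.config_key, :OLD.config_key),
          :OLD.config_value, :NEW.config_value);
END;
/

Diffing or exporting the audit table then tells you exactly which configuration rows drifted from the release branch.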
Expensive though it may be, a tool like TOAD for Oracle can be ideal for solving this sort of problem.
That said, my preferred solution is to start with all of the DDL (including Stored Procedure definitions) as text, managed under version control, and write scripts that will create a functioning database from source. If someone wants to modify the schema, they must, must, must commit those changes to the repository, not just modify the database directly. No exceptions! That way, if you need to build scripts that reflect updates between versions, it's a matter of taking all of the committed changes, and then adding whatever DML you need to massage any existing data to meet the changes (adding default values for new columns for existing rows, etc.) With all of the DDL (and prepopulated data) as text, collecting differences is as simple as diffing two source trees.
At my last job, I had NAnt scripts that would restore test databases, run all of the upgrade scripts that were needed, based upon the version of the database, and then dump the end result to DDL and DML. I would do the same for an empty database (to create one from scratch) and then compare the results. If the two were significantly different (the dump program wasn't perfect) I could tell immediately what changes needed to be made to the update / creation DDL and DML. While I did use database comparison tools like TOAD, they weren't as useful as hand-written SQL when I needed to produce general scripts for massaging data. (Machine-generated code can be remarkably brittle.)
Try RedGate's Source Control for Oracle. I've never tried the Oracle version, but the MSSQL version is really great.
