What are the best practices for database scripts under code control - database

We are currently reviewing how we store our database scripts (tables, procs, functions, views, data fixes) in subversion and I was wondering if there is any consensus as to what is the best approach?
Some of the factors we'd need to consider include:
Should we check in 'Create' scripts, or check in incremental changes with 'Alter' scripts?
How do we keep track of the state of the database for a given release?
It should be easy to build a database from scratch for any given release version.
Should a table exist in the database listing the scripts that have run against it, or the version of the database, etc.?
Obviously it's a pretty open ended question, so I'm keen to hear what people's experience has taught them.

After a few iterations, the approach we took was roughly like this:
One file per table and per stored procedure. Also separate files for other things like setting up database users, populating look-up tables with their data.
The file for a table starts with the CREATE command and a succession of ALTER commands added as the schema evolves. Each of these commands is bracketed in tests for whether the table or column already exists. This means each script can be run in an up-to-date database and won't change anything. It also means that for any old database, the script updates it to the latest schema. And for an empty database the CREATE script creates the table and the ALTER scripts are all skipped.
We also have a program (written in Python) that scans the directory full of scripts and assembles them into one big script. It parses the SQL just enough to deduce dependencies between tables (based on foreign-key references) and order them appropriately. The result is a monster SQL script that gets the database up to spec in one go. The script-assembling program also calculates the MD5 hash of the input files, and uses that to update a version number that is written into a special table in the last script in the list.
Barring accidents, the result is that the database script for a given version of the source code creates the schema this code was designed to interoperate with. It also means that there is a single (somewhat large) SQL script to give to the customer to build new databases or update existing ones. (This was important in this case because there would be many instances of the database, one for each of their customers.)
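A minimal sketch of the assembling step described above, not the answerer's actual program. The table scripts, the `assemble` function, and the regular expression are assumptions made for illustration; a real tool would read one file per table and would need a more careful SQL parser.

```python
import hashlib
import re
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical per-table scripts keyed by table name (in practice: one file each).
SCRIPTS = {
    "orders": "CREATE TABLE IF NOT EXISTS orders (id INTEGER PRIMARY KEY, "
              "customer_id INTEGER REFERENCES customers(id));",
    "customers": "CREATE TABLE IF NOT EXISTS customers (id INTEGER PRIMARY KEY);",
}

# Parses "just enough" SQL: only the table named after REFERENCES.
FK_PATTERN = re.compile(r"REFERENCES\s+(\w+)", re.IGNORECASE)


def assemble(scripts):
    """Return (combined_sql, version_hash) with referenced tables created first."""
    # Map each table to the set of known tables its script references.
    deps = {
        name: {ref for ref in FK_PATTERN.findall(sql) if ref != name and ref in scripts}
        for name, sql in scripts.items()
    }
    # static_order() yields every table after the tables it depends on.
    order = list(TopologicalSorter(deps).static_order())
    combined = "\n".join(scripts[name] for name in order)
    # Hash the inputs in sorted order so the value changes only when content changes.
    digest = hashlib.md5()
    for name in sorted(scripts):
        digest.update(scripts[name].encode("utf-8"))
    return combined, digest.hexdigest()
```

The returned hash could then be written to the version table by the last statement of the combined script.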

There is an interesting article at this link:
https://blog.codinghorror.com/get-your-database-under-version-control/
It advocates a baseline 'create' script followed by checking in 'alter' scripts and keeping a version table in the database.

The upgrade script option
Store each change in the database as a separate sql script. Store each group of changes in a numbered folder. Use a script to apply changes a folder at a time and record in the database which folders have been applied.
Pros:
Fully automated, testable upgrade path
Cons:
Hard to see full history of each individual element
Have to build a new database from scratch, going through all the versions

I tend to check in the initial create script. I then have a DbVersion table in my database and my code uses that to upgrade the database on initial connection if necessary. For example, if my database is at version 1 and my code is at version 3, my code will apply the ALTER statements to bring it to version 2, then to version 3. I use a simple fallthrough switch statement for this.
This has the advantage that when you deploy a new version of your application, it will automatically upgrade old databases and you never have to worry about the database being out of sync with the software. It also maintains a very visible change history.
This isn't a good idea for all software, but variations can be applied.
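A rough sketch of the upgrade-on-connect idea, using SQLite from Python's standard library so it is self-contained. The `MIGRATIONS` statements, the `upgrade` function, and the convention that an empty database counts as version 0 are assumptions for illustration. A loop over a dictionary stands in for the fallthrough switch, which Python does not have.

```python
import sqlite3

# Hypothetical upgrade steps: MIGRATIONS[n] takes the schema from version n to n + 1.
MIGRATIONS = {
    0: "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)",
    1: "ALTER TABLE customer ADD COLUMN email TEXT",
    2: "CREATE INDEX idx_customer_email ON customer (email)",
}
CODE_VERSION = 3  # the schema version this build of the application expects


def upgrade(conn):
    """Bring the database up to CODE_VERSION one step at a time; return the version."""
    conn.execute("CREATE TABLE IF NOT EXISTS DbVersion (version INTEGER NOT NULL)")
    row = conn.execute("SELECT version FROM DbVersion").fetchone()
    if row is None:
        # Assumption: a database without a version row is treated as version 0.
        conn.execute("INSERT INTO DbVersion (version) VALUES (0)")
        version = 0
    else:
        version = row[0]
    while version < CODE_VERSION:
        conn.execute(MIGRATIONS[version])
        version += 1
        conn.execute("UPDATE DbVersion SET version = ?", (version,))
        conn.commit()  # record each step so a failure leaves a known version
    return version
```

Calling `upgrade(conn)` right after opening the connection is a no-op when the database is already current.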

You could get some hints by reading how this is done with Ruby On Rails' migrations.
The best way to understand this is probably to just try it out yourself, and then inspecting the database manually.
Answers to each of your factors:
Store CREATE scripts. If you want to checkout version x.y.z then it'd be nice to simply run your create script to setup the database immediately. You could add ALTER scripts as well to go from the previous version to the next (e.g., you commit version 3 which contains a version 3 CREATE script and a version 2 → 3 alter script).
See the Rails migration solution. Basically they keep the table version number in the database, so you always know.
Use CREATE scripts.
Using version numbers would probably be the most generic solution — script names and paths can change over time.
My two cents!

We create a branch in Subversion and all of the database changes for the next release are scripted out and checked in. All scripts are repeatable so you can run them multiple times without error.
We also link the change scripts to issue items or bug ids so we can hold back a change set if needed. We then have an automated build process that looks at the issue items we are releasing and pulls the change scripts from Subversion and creates a single SQL script file with all of the changes sorted appropriately.
This single file is then used to promote the changes to the Test, QA and Production environments. The automated build process also creates database entries documenting the version (branch plus build id.) We think this is the best approach with enterprise developers. More details on how we do this can be found HERE

The create script option:
Use create scripts that will build you the latest version of the database from scratch, which is empty except the default lookup data.
Use standard version control techniques to store, branch, and tag versions and to view the histories of your objects.
When upgrading a live database (where you don't want to lose data), create a blank second copy of the database at the new version and use a third-party comparison tool, such as one of Red Gate's, to do the actual upgrade.
Pros:
Changes to files are tracked in a standard source-code like manner
Cons:
Reliance on manual use of a 3rd party tool to do actual upgrades (no/little automation)

Our company checks them in simply because someone decided to put it in some SOX document that we do. It makes no sense to me at all, except possibly as a reference document. I can't see a time we'd pull them out and try to use them again, and if we did we'd have to know which one ran first and which one to run after which. Backing up the database is much more important than keeping the Alter scripts.

For every release we need to deliver one update.sql file which contains all the new table scripts, ALTER statements, new/modified packages, roles, etc. This file is used to upgrade the database from version 1 to version 2.
Whatever we include in the update.sql file above must also go into the respective individual files. For example, an ALTER statement that adds a column goes into the table script as a new column (the table script itself has to be modified; the ALTER statement is not appended after the CREATE TABLE script in the file). New tables, roles, etc. are handled the same way.
So whenever a user wants to upgrade, he uses the update.sql file.
If he wants to build from scratch, he uses build.sql, which already contains all of the above statements, so the databases stay in sync.
sriRamulu
Sriramis4u#yahoo.com

In my case, I built a shell script for this work: https://github.com/reduardo7/db-version-updater

How is an open question
In my case I am trying to create something simple that is easy to use for developers and I do it under the following scheme
Things I tested:
File-based script handling in git using GitlabCI
It does not work, collisions are created and the Administration part has to be done by hand in case of disaster and the development part is too complicated
Use of permissions and access via mysql clients
There is no traceability on changes to the database and the transition to production is manual
Use of programs mentioned here
They require uploading the structures and many adaptations and usually you end up with change control just like the word
Repository usage
Could not control the DRP part
I could not properly control the backups
I don't think it is a good idea to have the backups on the same server, and you generate high lags for the process
This was what worked best
Manage permissions per user and generate traceability of everything that is sent to the database
Multi platform
Use of development-Production-QA database
Always back up before each modification
Manage an open repository for change control
Multi-server
Deactivate / Activate access to the web page or App through Endpoints
the initial project is in:
https://hub.docker.com/r/arelis/gitdb
I hope this reaches you since I see that several

There is an interesting article with new URL at: https://blog.codinghorror.com/get-your-database-under-version-control/
It's a bit old, but the concepts still apply. Good read!

Related

How to handle multiple db alter scripts coming from different Git feature branches?

A bit complex to describe, but I'll do my best. Basically we're using the Git workflow, meaning we have the following branches:
production, which is the live branch. Everything in production is running in the live web environment.
integration, in which all new functionality is integrated. This branch is merged to production every week.
one or more feature branches, in which developers or development teams develop new functionality. After this is done, developers merge their feature branch to integration.
So, nothing really complex here. But, since our application is a web application running against a MySQL database, new functionality often requires changes to the database schema. To automate this, we're using dbdeploy, which allows us to create numbered alter scripts, e.g. 00001.sql, 00002.sql, etc. Upon merging to the integration branch, dbdeploy will check which alter scripts have a higher number than the latest one executed on that specific database, and will execute those.
Now assume the following.
- integration has alter scripts up until 00200.sql. All of these are executed on the integration database.
- developer John has a feature branch featureX, which was created when integration still had 00199.sql as the highest alter script.
John creates 00200.sql because of some required db schema changes.
Now, at some point John will merge his modifications back to the integration branch. John will get a merge conflict and will see that his 00200.sql already exists in integration. This means he needs to open the conflicting file, extract his contents, reset that file back to 'mine' (the original state as in integration) and put his own contents in a new file.
Now, since we're working with ten developers, we get this situation daily. And while we do understand the reasons behind this, it's sometimes very cumbersome. John renames his script, does a merge commit to integration, and pushes the changes upstream, only to see that somebody else already created a 00201.sql, requiring John to do the process again.
Surely there must be more teams using the Git workflow and using a database change management tool for automating database schema changes?
So, in short, my questions are:
How to automate database schema changes, when working on different feature branches, that operate on different instances of the same db?
How to prevent merge conflicts all the time, while still having the option to have a fixed order in the executed alter scripts? E.g. 00199.sql must be executed before 00200.sql, because 00200.sql might be depending on something done in 00199.sql.
Any other tips are most welcome, of course.
Rails used to do this, with exactly the problems you describe. They changed to the following scheme: the files (rails calls them migrations) are labelled with a utc timestamp of when the file was created, eg
20140723069701_add_foo_to_bar
(The second part of the name doesn't contribute to the ordering).
Rails records the timestamps of all the migrations that have been run. When you ask it to run pending migrations it selects all the migration files whose timestamp isn't in the list of already run migrations and runs them in numerical order.
You'll no longer get merge conflicts unless two people create one at exactly the same point in time.
Files still get executed in the order you wrote them, but possibly interleaved with someone else's work. In theory you can still have problems, e.g. developer A decides to rename a table that I had decided to add a column to. That is much less common than two developers both making any change to the db, and you would have problems even without considering the schema changes: presumably I have just written code that queries a table that no longer exists. At some point developers working on related stuff will have to talk to each other!
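The selection rule described above can be sketched in a few lines. This is not Rails code; `pending_migrations` and the file names in the usage note are hypothetical, and the set of applied timestamps stands in for the table where Rails records them.

```python
def pending_migrations(filenames, applied_timestamps):
    """Return the not-yet-applied migration files, ordered by numeric timestamp prefix."""
    pending = []
    for name in filenames:
        # Only the part before the first underscore takes part in the ordering.
        timestamp = name.split("_", 1)[0]
        if timestamp not in applied_timestamps:
            pending.append((int(timestamp), name))
    return [name for _, name in sorted(pending)]
```

For example, with files `20140723101500_add_foo_to_bar`, `20140724120000_create_baz` and `20140725090000_drop_qux`, and only `20140724120000` recorded as applied, the function returns the first and third file in that order. A migration merged late from a feature branch is therefore still run, even though a newer one has already been applied.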
A few suggestions:
1 - have a look at Liquibase, each version gets a file that references the changes that need to happen, then the change files can be named using a meaningful string rather than by number.
2 - have a central location for getting the next available number, then people use the latest number.
I've used Liquibase in the past, pretty successfully, and we didn't have the problem you describe.
As Frederick Cheung suggested, use timestamps rather than a serial number. Applying schema changes by order of datestamp should work, because schema changes can only depend on changes of a prior date.
In addition, include the name of the developer in the name of the alter script. This will prevent merge conflicts 100%.
Your merge hook should just look for newly added alter scripts (present in the merged branch but not in the upstream branch) and execute them by order of timestamp.
I've used two different approaches to overcome your problem in the past.
The first is to use an ORM which can handle the schema updates.
The other approach is to create a script which incrementally builds the database schema. This way, if a developer needs an additional column in a table, he adds the appropriate SQL statement after the statement that creates the table. Likewise, if he needs a new table, he adds the SQL statement for that. Then merging becomes a question of making sure things happen in the correct order. This is basically what the database update process in an ORM does. Such a script needs to be coded very defensively, and each statement should check whether its prerequisites exist.
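A small sketch of what "coded very defensively" can look like, using SQLite from Python's standard library. The `account` table, the `email` column, and both function names are made up for illustration; other engines expose column metadata differently (for example through `information_schema`).

```python
import sqlite3


def column_exists(conn, table, column):
    """Look the column up in the table's metadata (PRAGMA table_info is SQLite-specific)."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


def ensure_schema(conn):
    """Safe to run repeatedly: each statement first checks whether it is still needed."""
    conn.execute("CREATE TABLE IF NOT EXISTS account (id INTEGER PRIMARY KEY)")
    # A column added later in the project's life goes after the CREATE statement.
    if not column_exists(conn, "account", "email"):
        conn.execute("ALTER TABLE account ADD COLUMN email TEXT")
```

Running `ensure_schema` against an empty, an old, or an up-to-date database ends in the same schema, which is what makes merging such scripts mostly a question of ordering.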
For the dbvc commandline tool, I use git log to determine the order of the update scripts.
git log -c --no-merges --pretty="format:" --name-status -p dev/db/updates/ | \
grep '^A' | awk '{print $2}' | tac
In this case the order of your commits determines the sequence in which the updates are run, which is most likely what you want.
If you run git merge b, the updates from master will be run first and then those from b.
If you run git rebase b, the updates from b will run first and then those from master.

How can I put a database under git (version control)?

I'm doing a web app, and I need to make a branch for some major changes, the thing is, these changes require changes to the database schema, so I'd like to put the entire database under git as well.
How do I do that? is there a specific folder that I can keep under a git repository? How do I know which one? How can I be sure that I'm putting the right folder?
I need to be sure, because these changes are not backward compatible; I can't afford to screw up.
The database in my case is PostgreSQL
Edit:
Someone suggested taking backups and putting the backup file under version control instead of the database. To be honest, I find that really hard to swallow.
There has to be a better way.
Update:
OK, so there's no better way, but I'm still not quite convinced, so I will change the question a bit:
I'd like to put the entire database under version control, what database engine can I use so that I can put the actual database under version control instead of its dump?
Would sqlite be git-friendly?
Since this is only the development environment, I can choose whatever database I want.
Edit2:
What I really want is not to track my development history, but to be able to switch from my "new radical changes" branch to the "current stable branch" and be able for instance to fix some bugs/issues, etc, with the current stable branch. Such that when I switch branches, the database auto-magically becomes compatible with the branch I'm currently on.
I don't really care much about the actual data.
Take a database dump, and version control that instead. This way it is a flat text file.
Personally I suggest that you keep both a data dump, and a schema dump. This way using diff it becomes fairly easy to see what changed in the schema from revision to revision.
If you are making big changes, you should have a secondary database that you make the new schema changes to and not touch the old one since as you said you are making a branch.
I'm starting to think of a really simple solution, don't know why I didn't think of it before!!
Duplicate the database, (both the schema and the data).
In the branch for the new-major-changes, simply change the project configuration to use the new duplicate database.
This way I can switch branches without worrying about database schema changes.
EDIT:
By duplicate, I mean create another database with a different name (like my_db_2); not doing a dump or anything like that.
Use something like Liquibase; this lets you keep your Liquibase files under revision control. You can tag changes for production only, and have Liquibase keep your DB up to date for either production or development (or whatever scheme you want).
Irmin (branching + time travel)
Flur.ee (immutable + time travel + graph query)
XTDB (formerly called 'CruxDB') (time travel + query)
TerminusDB (immutable + branching + time travel + Graph Query!)
DoltDB (branching + time-travel + SQL query)
Quadrable (branching + remote state verification)
EdgeDB (no real time travel, but migrations derived by the compiler after schema changes)
Migra (diffing for Postgres schemas/data. Auto-generate migration scripts, auto-sync db state)
ImmuDB (immutable + time-travel)
I've come across this question as I've got a similar problem, where something approximating a DB-based directory structure stores 'files', and I need git to manage it. It's distributed across a cloud using replication, hence its access point will be via MySQL.
The gist of the above answers, seem to similarly suggest an alternative solution to the problem asked, which kind of misses the point, of using Git to manage something in a Database, so I'll attempt to answer that question.
Git is a system which, in essence, stores a database of deltas (differences) that can be reassembled, in order, to reproduce a context. The normal usage of git assumes that the context is a filesystem and that those deltas are diffs in that filesystem, but really all git is, is a hierarchical database of deltas (hierarchical because in most cases each delta is a commit with at least one parent, arranged in a tree).
As long as you can generate a delta, in theory, git can store it. The problem is that git normally expects the context on which it's generating deltas to be a filesystem, and similarly, when you check out a point in the git hierarchy, it expects to generate a filesystem.
If you want to manage change in a database, you have two discrete problems, and I would address them separately (if I were you). The first is schema, the second is data (although in your question you state that data isn't something you're concerned about). A problem I had in the past was a Dev and a Prod database, where Dev could take incremental changes to the schema, and those changes had to be documented in CVS and propagated to live, along with additions to one of several 'static' tables. We did that by having a third database, called Cruise, which contained only the static data. At any point the schemas from Dev and Cruise could be compared, and we had a script to take the diff of those two files and produce an SQL file containing ALTER statements to apply it. Similarly, any new data could be distilled to an SQL file containing INSERT commands. As long as fields and tables are only added, and never deleted, the process could automate generating the SQL statements to apply the delta.
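The add-only case mentioned above is simple enough to sketch. This is not the script from that project; `additive_alters` and the dictionary layout (table name mapped to column name and type) are assumptions chosen to keep the example short, and a real comparison would read the schemas from dumps or from the catalog.

```python
def additive_alters(old_schema, new_schema):
    """Emit SQL for tables and columns present in new_schema but missing from old_schema.

    Each schema maps table name -> {column name: column type}.
    Removed tables and columns are deliberately ignored (add-only assumption).
    """
    statements = []
    for table, columns in sorted(new_schema.items()):
        if table not in old_schema:
            cols = ", ".join(f"{name} {ctype}" for name, ctype in columns.items())
            statements.append(f"CREATE TABLE {table} ({cols});")
            continue
        for name, ctype in columns.items():
            if name not in old_schema[table]:
                statements.append(f"ALTER TABLE {table} ADD COLUMN {name} {ctype};")
    return statements
```

Renames and deletions cannot be recovered from such a comparison, which is why the approach only automates cleanly while changes stay additive.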
The mechanism by which git generates deltas is diff and the mechanism by which it combines 1 or more deltas with a file, is called merge. If you can come up with a method for diffing and merging from a different context, git should work, but as has been discussed you may prefer a tool that does that for you. My first thought towards solving that is this https://git-scm.com/book/en/v2/Customizing-Git-Git-Configuration#External-Merge-and-Diff-Tools which details how to replace git's internal diff and merge tool. I'll update this answer, as I come up with a better solution to the problem, but in my case I expect to only have to manage data changes, in-so-far-as a DB based filestore may change, so my solution may not be exactly what you need.
There is a great project called Migrations under Doctrine that was built just for this purpose.
It's still in alpha state and built for PHP.
http://docs.doctrine-project.org/projects/doctrine-migrations/en/latest/index.html
Take a look at RedGate SQL Source Control.
http://www.red-gate.com/products/sql-development/sql-source-control/
This tool is a SQL Server Management Studio snap-in which will allow you to place your database under Source Control with Git.
It's a bit pricey at $495 per user, but there is a 28 day free trial available.
NOTE
I am not affiliated with RedGate in any way whatsoever.
I've released a tool for SQLite that does what you're asking for. It uses a custom diff driver leveraging the SQLite project's tool 'sqldiff', uses UUIDs as primary keys, and leaves off the SQLite rowid. It is still in alpha, so feedback is appreciated.
Postgres and mysql are trickier, as the binary data is kept in multiple files and may not even be valid if you were able to snapshot it.
https://github.com/cannadayr/git-sqlite
I want to make something similar, add my database changes to my version control system.
I am going to follow the ideas in this post from Vladimir Khorikov, "Database versioning best practices". In summary, I will
store both its schema and the reference data in a source control system.
for every modification we will create a separate SQL script with the changes
In case it helps!
You can't do it without atomicity, and you can't get atomicity without either using pg_dump or a snapshotting filesystem.
My postgres instance is on zfs, which I snapshot occasionally. It's approximately instant and consistent.
I think X-Istence is on the right track, but there are a few more improvements you can make to this strategy. First, use:
$ pg_dump --schema-only ...
to dump the tables, sequences, etc. and place this file under version control. You'll use this to separate the compatibility changes between your branches.
Next, perform a data dump for the set of tables that contain configuration required for your application to operate (you should probably skip user data, etc.), like form defaults and other non-user-modifiable data. You can do this selectively by using:
$ pg_dump --table=.. <or> --exclude-table=..
This is a good idea because the repo can get really clunky when a full data dump of your database gets to 100 MB or more. A better idea is to back up a more minimal set of data that you require to test your app. If your default data is very large, though, this may still cause problems.
If you absolutely need to place full backups in the repo, consider doing it in a branch outside of your source tree. An external backup system with some reference to the matching svn rev is likely best for this though.
Also, I suggest using text format dumps over binary for revision purposes (for the schema at least) since these are easier to diff. You can always compress these to save space prior to checking in.
Finally, have a look at the postgres backup documentation if you haven't already. The way you're commenting on backing up 'the database' rather than a dump makes me wonder if you're thinking of file system based backups (see section 23.2 for caveats).
What you want, in spirit, is perhaps something like Post Facto, which stores versions of a database in a database. Check this presentation.
The project apparently never really went anywhere, so it probably won't help you immediately, but it's an interesting concept. I fear that doing this properly would be very difficult, because even version 1 would have to get all the details right in order to have people trust their work to it.
This question is pretty much answered but I would like to complement X-Istence's and Dana the Sane's answer with a small suggestion.
If you need revision control with some degree of granularity, say daily, you could couple the text dump of both the tables and the schema with a tool like rdiff-backup which does incremental backups. The advantage is that instead of storing snapshots of daily backups, you simply store the differences from the previous day.
With this you have both the advantage of revision control and you don't waste too much space.
In any case, using git directly on big flat files which change very frequently is not a good solution. If your database becomes too big, git will start to have some problems managing the files.
Here is what I am trying to do in my projects:
Separate data, schema, and default data.
The database configuration is stored in a configuration file that is not under version control (.gitignore).
The database defaults (for setting up new projects) are a simple SQL file under version control.
For the database schema, create a database schema dump under version control.
The most common way is to have update scripts that contain SQL statements (ALTER TABLE ... or UPDATE). You also need a place in your database where you save the current version of your schema.
Take a look at other big open source database projects (Piwik, or your favorite CMS); they all use update scripts (1.sql, 2.sql, 3.sh, 4.php, 5.sql).
But this is a very time-intensive job: you have to create and test the update scripts, and you need to run a common update script that compares the versions and runs all necessary update scripts.
So theoretically (and that's what I am looking for) you could
dump the database schema after each change (manually, cron job, git hooks (maybe before commit))
(and only in some very special cases create update scripts).
After that, in your common update script, run the normal update scripts for the special cases, then compare the schemas (the dump and the current database) and automatically generate the necessary ALTER statements. There are some tools that can do this already, but I haven't found a good one yet.
What I do in my personal projects is store my whole database in Dropbox and then point my MAMP/WAMP workflow to use it right from there. That way the database is always up to date wherever I need to do some developing. But that's just for dev! Live sites use their own server for that, of course! :)
Storing each level of database changes under git versioning control is like pushing your entire database with each commit and restoring your entire database with each pull.
If your database is so prone to crucial changes and you cannot afford to lose them, you can just update your pre_commit and post_merge hooks.
I did the same with one of my projects and you can find the directions here.
That's how I do it:
Since you have a free choice of DB type, use a file-based DB, e.g. Firebird.
Create a template DB which has the schema that fits your actual branch and store it in your repository.
When executing your application programmatically create a copy of your template DB, store it somewhere else and just work with that copy.
This way you can put your DB schema under version control without the data. And if you change your schema you just have to change the template DB
We used to run a social website, on a standard LAMP configuration. We had a Live server, Test server, and Development server, as well as the local developers machines. All were managed using GIT.
On each machine, we had the PHP files, but also the MySQL service, and a folder with Images that users would upload. The Live server grew to have some 100K (!) recurrent users, the dump was about 2GB (!), the Image folder was some 50GB (!). By the time that I left, our server was reaching the limit of its CPU, Ram, and most of all, the concurrent net connection limits (We even compiled our own version of network card driver to max out the server 'lol'). We could not (nor should you assume with your website) put 2GB of data and 50GB of images in GIT.
To manage all this under GIT easily, we would ignore the binary folders (the folders containing the Images) by inserting these folder paths into .gitignore. We also had a folder called SQL outside the Apache documentroot path. In that SQL folder, we would put our SQL files from the developers in incremental numberings (001.florianm.sql, 001.johns.sql, 002.florianm.sql, etc). These SQL files were managed by GIT as well. The first sql file would indeed contain a large set of DB schema. We don't add user-data in GIT (eg the records of the users table, or the comments table), but data like configs or topology or other site specific data, was maintained in the sql files (and hence by GIT). Mostly its the developers (who know the code best) that determine what and what is not maintained by GIT with regards to SQL schema and data.
When it got to a release, the administrator logged on to the dev server, merged the live branch with all developers' and other needed branches on the dev machine into an update branch, and pushed it to the test server. On the test server, he checked whether the updating process for the Live server was still valid, and in quick succession pointed all traffic in Apache to a placeholder site, created a DB dump, pointed the working directory from 'live' to 'update', executed all new sql files in mysql, and repointed the traffic back to the correct site. When all stakeholders agreed after reviewing the test server, the administrator did the same thing from the Test server to the Live server. Afterwards, he merged the live branch on the production server into the master branch across all servers, and rebased all live branches. The developers were themselves responsible for rebasing their branches, but they generally know what they are doing.
If there were problems on the test server, eg. the merges had too many conflicts, then the code was reverted (pointing the working branch back to 'live') and the sql files were never executed. The moment that the sql files were executed, this was considered as a non-reversible action at the time. If the SQL files were not working properly, then the DB was restored using the Dump (and the developers told off, for providing ill-tested SQL files).
Today, we maintain both a sql-up and a sql-down folder, with equivalent filenames, where the developers have to test that every upgrading sql file can equally be downgraded. This could ultimately be executed with a bash script, but it's a good idea to have human eyes monitoring the upgrade process.
It's not great, but it's manageable. Hope this gives an insight into a real-life, practical, relatively high-availability site. It may be a bit outdated, but it is still followed.
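One check that is easy to automate for the sql-up / sql-down layout is that every upgrade file has a downgrade file with the same name. The sketch below only compares file names; the function name is hypothetical, and it does not verify that a downgrade really undoes its upgrade, which still needs testing by the developer.

```python
from pathlib import Path


def unpaired_migrations(up_dir, down_dir):
    """Return the sorted names of .sql files in up_dir with no same-named file in down_dir."""
    up_names = {path.name for path in Path(up_dir).glob("*.sql")}
    down_names = {path.name for path in Path(down_dir).glob("*.sql")}
    return sorted(up_names - down_names)
```

A non-empty result from `unpaired_migrations("sql-up", "sql-down")` could be used to fail a pre-release check.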
Update Aug 26, 2019:
Netlify CMS is doing it with GitHub, an example implementation can be found here with all information on how they implemented it netlify-cms-backend-github
I say don't. Data can change at any given time. Instead you should only commit data models in your code, schema and table definitions (create database and create table statements) and sample data for unit tests. This is kinda the way that Laravel does it, committing database migrations and seeds.
I would recommend neXtep (Link removed - Domain was taken over by a NSFW-Website) for version controlling the database. It has a good set of documentation and forums that explain how to install it and how to deal with the errors encountered. I have tested it with PostgreSQL 9.1 and 9.3; I was able to get it working for 9.1, but it doesn't seem to work for 9.3.
Use a tool like iBatis Migrations (manual, short tutorial video) which allows you to version control the changes you make to a database throughout the lifecycle of a project, rather than the database itself.
This allows you to selectively apply individual changes to different environments, keep a changelog of which changes are in which environments, create scripts to apply changes A through N, rollback changes, etc.
I'd like to put the entire database under version control, what
database engine can I use so that I can put the actual database under
version control instead of its dump?
This is not database engine dependent. For Microsoft SQL Server there are lots of version control programs. I don't think the problem can be solved with git alone; you would have to use a pgsql-specific schema version control system. I don't know whether such a thing exists or not...
Use a version-controlled database, of which there are now several.
https://www.dolthub.com/blog/2021-09-17-database-version-control/
These products don't apply version control on top of another type of database -- they are their own database engines that support version control operations. So you need to migrate to them or start building on them in the first place.
I write one of them, DoltDB, which combines the interfaces of MySQL and Git. Check it out here:
https://github.com/dolthub/dolt
I wish it were simpler. Checking in the schema as a text file is a good start to capture the structure of the DB. For the content, however, I have not found a cleaner, better method for git than CSV files. One per table. The DB can then be edited on multiple branches and merges extremely well.
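A minimal sketch of the one-CSV-per-table idea, using Python's built-in `sqlite3` and `csv` modules as a stand-in for whatever database is actually in use. The function name `dump_tables_to_csv` is hypothetical, and sorting rows by the first column is an assumption made here so that repeated dumps produce stable diffs:

```python
import csv
import sqlite3
from pathlib import Path


def dump_tables_to_csv(conn, out_dir):
    """Write one CSV file per user table and return the table names."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    for table in tables:
        # Sort by the first column so row order does not change between dumps.
        cur = conn.execute(f'SELECT * FROM "{table}" ORDER BY 1')
        header = [col[0] for col in cur.description]
        with open(out / f"{table}.csv", "w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(cur)
    return tables
```

The resulting files are plain text, so git can diff and merge them line by line.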

What is a good way to implement an agile database process, which is in synch with the code base, especially in regards to continuous integration? [closed]

On the project I am working on, we are trying to come up with a solution that lets the database and the code be agile and be built and deployed together.
Since the application is a combination of code plus the database schema, and database code tables, you can not truly have a full build of the application unless you have a database that is versioned along with the code.
We have not yet been able to come up with a good agile method of doing the database development along with the code in an agile/scrum environment.
Here are some of my requirements:
I want to be able to have a svn revision # that corresponds to a complete build of the system.
I do not want to check in binary files into source control for the database.
Developers need to be able to commit code to the continuous integration server and build the entire system and database together.
Must be able to automate deployment to different environments without doing a rebuild other than the original build on the build server.
(Update)
I'll add some more info here to explain a bit further.
No OR/M tool, since it's a legacy project with a huge amount of code.
I have read the agile database design information, and that process in isolation seems to work, but I am talking about combining it with active code development.
Here are two scenarios:
Developer checks in a code change, that requires a database change. The developer should be able to check in a database change at the same time, so that the automated build doesn't fail.
Developer checks in a DB change, that should break code. The automated build needs to run and fail.
The biggest problem is how these things sync up. There is no such thing as "checking in a database change". Right now, applying the DB changes is a manual process someone has to do, while code changes are constantly being made. They need to be made together and checked in together, and the build system needs to be able to build the entire system.
(Update 2)
One more add here:
You can't bring down production; you must patch it. It's not acceptable to rebuild the entire production database.
You need a build process that constructs the database schema and adds any necessary bootstrapping data. If you're using an O/R tool that supports schema generation, most of that work is done for you. Whatever is not tool-generated, keep in scripts.
For continuous integration, ideally a "build" should include a complete rebuild of the database and a reload of static testing data.
I just saw that you have no ORM tool... here's what we had at a company I used to work for
db/
db/Makefile (run `make` to rebuild db from scratch, `make clean` to close db)
db/01_type.sql
db/02_table.sql
db/03_function.sql
db/04_view.sql
db/05_index.sql
db/06_data.sql
Arrange however necessary... each of those *.sql scripts would be run in order to generate the structure. Developers each had local copies of the DB, and any DB change was just another code change, nothing special.
If you're working on a project that already has a build process (Java, C, C++), this is second nature. If you're using scripts in such a way that there is no build process at all, this'll be a bit of extra work.
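The numbered layout above only needs a driver that runs the files in filename order. The original used a Makefile; the following is a rough Python equivalent, assuming SQLite via the standard `sqlite3` module and a hypothetical helper name `build_database`:

```python
import sqlite3
from pathlib import Path


def build_database(conn, script_dir):
    """Run every *.sql file in script_dir in filename order."""
    applied = []
    # Prefixes such as 01_, 02_ make lexical order equal dependency order.
    for path in sorted(Path(script_dir).glob("*.sql")):
        conn.executescript(path.read_text())
        applied.append(path.name)
    return applied
```

Because the order comes purely from the filenames, adding a new kind of object is just a matter of picking the right numeric prefix.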
"There is no such thing as "checking in a database change"."
Actually, I think you can check in database change. The trick is to stop using simple -- unversioned -- schema and table names.
If you have a version number attached to a schema as a whole (or a table), then you can easily have a version check-in.
Note that database versions don't have a fancy major-minor-release scheme. The "major" revision in application software usually reflects a basic level of compatibility. That basic level of compatibility should be defined as "uses the same data model".
So app versions 2.23 and 2.24 use version 2 of the database schema.
The version check-in has two parts.
The new table. For example, MyTable_8 is version 8 of a given table.
The migration script. For example MyTable_8 includes a MyTable_7 to MyTable_8 script which moves the data, providing defaults or whatever is required.
There are several ways this is used.
Compatible upgrades. When merely altering a table to add a column that permits nulls, the version number stays the same.
Incompatible upgrades. When adding non-null columns (that need initial values) or changing the fundamental shape of tables or data types of columns, you're making a big change and you have a migration script.
Note that the old data stays in place until explicitly dropped at the end of the change procedure. You have to run tests to assure that everything worked.
You might have a two-part drop: first rename, then (a week later) finally drop.
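To make the incompatible-upgrade case concrete, here is a small sketch using SQLite through Python's `sqlite3` module. The table names follow the MyTable_7 / MyTable_8 example above; the columns, the default value `'active'` and the `_retired` suffix are assumptions made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Existing version 7 of the table (columns are invented for the example).
    CREATE TABLE MyTable_7 (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    INSERT INTO MyTable_7 (id, name) VALUES (1, 'alice'), (2, 'bob');

    -- Version 8 adds a NOT NULL column, which is an incompatible change.
    CREATE TABLE MyTable_8 (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        status TEXT NOT NULL
    );

    -- The MyTable_7 to MyTable_8 migration: move the data, supplying a default.
    INSERT INTO MyTable_8 (id, name, status)
    SELECT id, name, 'active' FROM MyTable_7;
""")

# The old data is still in place, so the copy can be verified before any drop.
old_count = conn.execute("SELECT COUNT(*) FROM MyTable_7").fetchone()[0]
new_count = conn.execute("SELECT COUNT(*) FROM MyTable_8").fetchone()[0]
assert old_count == new_count

# First half of the two-part drop: rename now, drop for real later.
conn.execute("ALTER TABLE MyTable_7 RENAME TO MyTable_7_retired")
```

A row-count comparison is only the cheapest possible check; real tests would also compare the migrated values.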
Make sure that your O/R-Mapping tool is able to build the necessary tables out of the default configuration it has and also add missing columns. This should cover 90% of your cases.
The other 10% are
coping with missing values for columns that were added after the data was inserted
writing data-migration scripts for the rare cases where you need to make more fundamental changes between versions
See the DBDeploy open source project. http://dbdeploy.com/
It allows you to check in database change scripts. It will then produce a consolidated change script including all changes that have not been applied.
The site describes the process pretty well.
This project is based on the techniques in the Martin Fowler article that was mentioned before. I was on the project that Martin based the article on. DbDeploy is a pretty good implementation of the process we used.
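The core idea of a consolidated change script can be sketched in a few lines. This is not DbDeploy's code or API, only an illustration of the concept; the function name `consolidate` and the use of integer change numbers are assumptions:

```python
def consolidate(change_scripts, applied_numbers):
    """Concatenate the change scripts whose number is not yet applied.

    change_scripts: dict mapping change number -> SQL text
    applied_numbers: set of change numbers already recorded for the target
    """
    parts = []
    for number in sorted(change_scripts):
        if number in applied_numbers:
            continue
        # Label each change so the combined script stays readable.
        parts.append(f"-- change {number}\n{change_scripts[number].strip()}\n")
    return "\n".join(parts)
```

In a real setup the set of applied numbers would come from a changelog table in the target database, which is why the same set of scripts can produce a different consolidated script for each environment.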
The migrations facility of Ruby on Rails was developed to handle exactly this need. If you're not using Rails for your application, you might see if this same concept has been ported to the framework of your choice, or read up on it and determine whether you could write some quick scripts that implement the same sort of functionality.

How to keep Stored Procedures and other scripts in SVN/Other repository?

Can anyone provide some real examples as to how best to keep script files for views, stored procedures and functions in a SVN (or other) repository.
Obviously one solution is to have the script files for all the different components in a directory or more somewhere and simply use TortoiseSVN or the like to keep them in SVN. Then, whenever a change is to be made, I load the script up in Management Studio etc. I don't really want this.
What I'd really prefer is some kind of batch script that I can run periodically (nightly?) that would export all the stored procedures / views etc that had changed in a given timeframe and then commit them to SVN.
Ideas?
Sounds like you're not wanting to use Revision Control properly, to me.
Obviously one solution is to have the
script files for all the different
components in a directory or more
somewhere and simply using TortoiseSVN
or the like to keep them in SVN
This is what should be done. You would have your local copy you are working on (Developing new, Tweaking old, etc) and as single components/procedures/etc get finished, you would commit them individually until you have to start the process over.
Committing half-done code just because it's been 'X' time since it was last committed is sloppy and guaranteed to cause anyone else using the repository grief.
I find it best to treat Stored Procedures just like any other compilable code: Code lives in the repository, you check it out to make changes and load it in your development tool to compile or deploy the code.
You can create a batch file and schedule it:
delete the contents of your scripts directory
using something like ExportSQLScript to export all objects to script/scripts
svn commit
Please note that although you'll have the objects under source control, you'll not have the data or its progression (is that a renamed field, or 1 new field and 1 deleted?).
This approach is fine for maintaining change history. But, of course, you should never be automatically committing to the "production build" (unless you like broken builds).
Although you didn't ask for it: This approach also won't produce a set of scripts that will upgrade a current DB. You'll only have initial creation scripts. Recording data progression and creating upgrade scripts is beyond basic source control systems.
I'd recommend Redgate SQL Compare for this - it allows you to compare database versions and generate change scripts - it's also fairly easily scriptable.
Based on your expanded question, you really want to use DDL triggers. Check out this article that details how to create a changelog system for your database.
Not sure on your price range, however DB Ghost could be an option for you.
I don't work for this company (or own the product) but in my researching of the same issue, this product looked quite promising.
I should've been a little more descriptive. The database in question is for an internal ERP system and thus we don't have many versions of our database, just Production/Testing/Development. When we've done a change request, some new fancy feature or something, we simply execute a script or series of scripts to update the procedures in question on the Testing database, if that is all good, then we do the same to Production.
So I'm not really after a full schema script per se, just something that can keep track of the various edits to the stored procedures over time. For example, PROCESS_INVOICE does stuff. It gets updated in some minor way in March. Some time later, say in May, it is discovered that in a rare case customers get double invoiced (or some other crazy corner case). I'd like to be able to see what has happened to this procedure over time. The way the development environment is currently set up here, I don't have that, which I'm trying to change.
I can recommend DBPro which is part of Visual Studio Team Edition. Have been using it for a few months for storing all parts of the database in Team Foundation Server as well as for deployment and database compares, etc.
Of course, as someone else mentioned, it does depend on your environment and price range.
I wrote a utility for dumping all of the relevant parts of my db into a directory structure that I use SVN on. I never got around to trying to incorporate it into the Manager but, if you're interested, it's here: http://www.reluctantdba.com/dbas-and-programmers/sqltools/svnforsql2005.aspx
It's free and, since I regularly run it, you know any bugs get fixed quickly.
You can always try integrating SourceSafe with SQL Server. Here's a quick start: link. To work with it you've got to have Management Studio Developers Edition.

Re-Running Database Development Scripts

In our current database development environment we have automated build processes that check all the SQL code out of svn, create database scripts and apply them to the various development/QA databases.
This is all well and good, and is a tremendous improvement over what we did in the past, but we have a problem with rerunning scripts. Obviously this isn't a problem with some scripts, like those altering procedures, because you can run them over and over without adversely affecting the system. Right now, to add metadata and run statements like create/alter table, we add code that checks whether the objects exist and, if they do, doesn't run the statements.
Our problem is that we really only get one shot to run the script, because once the script has been run, the objects are in the environment and the system won't run the script again. If something needs to change once it's been deployed, we have a difficult process of running update scripts against the update scripts and hoping that everything falls in the correct order and all of the PKs line up between the environments (the databases are, shall we say, "special").
Short of dropping the database and starting the process from scratch (the last most current release), does anyone have a more elegant solution to this?
I'm not sure how best to approach the problem in your specific environment, but I'd suggest reading up on Rails' migrations feature for some inspiration on how to get started.
http://wiki.rubyonrails.org/rails/pages/UnderstandingMigrations
We address this - or at least a similar problem to this - as follows:
The schema has a version number. This is represented by a table with one row per version which, as well as the version number, carries boring things like a date/time stamp for when that version came into existence.
The schema create/modify DDL is wrapped in code that performs the changes for us.
In the context above one would build the schema change code as part of the build process then run it and it would only apply schema changes that haven't already been applied.
In our experience (which is bound not to be representative) in most cases the schema changes are sufficiently small/fast that they can safely be run in a transaction which means that if it fails we get a rollback and the db is "safe" - although one would always recommend taking backups before applying schema updates if practicable.
I evolved this out of nasty painful experience. It's not a perfect system (or an original idea), but as a result of working this way we have a high degree of confidence that if there are two instances of one of our databases with the same version, then the schema for those two databases will be the same in almost all respects, and that we can safely bring any db up to the current schema for that application without ill effects. (That last isn't 100% true unfortunately - there's always an exception - but it's not too far from the truth!)
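A minimal sketch of the version-table approach described in this answer, using SQLite through Python's `sqlite3` module. The table name `schema_version`, the `CHANGES` list and the function name `upgrade` are all assumptions made for illustration; the answer does not say how its own code is structured:

```python
import sqlite3

# Hypothetical change list: (version, DDL). In practice this would be built
# from the scripts in source control as part of the build.
CHANGES = [
    (1, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    (2, "ALTER TABLE customers ADD COLUMN status TEXT"),
]


def upgrade(conn, changes=CHANGES):
    """Apply each change whose version is not yet in schema_version."""
    conn.isolation_level = None  # manage transactions explicitly
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_version ("
        "version INTEGER PRIMARY KEY, applied_at TEXT NOT NULL)")
    done = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    applied = []
    for version, ddl in sorted(changes):
        if version in done:
            continue
        conn.execute("BEGIN")
        try:
            conn.execute(ddl)
            conn.execute(
                "INSERT INTO schema_version VALUES (?, datetime('now'))",
                (version,))
            conn.execute("COMMIT")
        except Exception:
            conn.execute("ROLLBACK")  # leave the db at the last good version
            raise
        applied.append(version)
    return applied
```

Running `upgrade` a second time applies nothing, which is what makes it safe to call from every build. Note that transactional DDL works in SQLite but not in every engine (MySQL, for instance, commits implicitly on most DDL), so the rollback guarantee depends on the database.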
Do you keep your existing data in the database? If not, you may want to look at something similar to what Matt mentioned for .NET called RikMigrations
http://www.rikware.com/RikMigrations.html
I use that on my projects to update my database on the fly, while keeping track of revisions. Also, it makes it very simple to move database schema to different servers, etc.
If you want re-runnability in your scripts, then you can't write them as definitions. What I mean by this is that you need to focus on change scripts rather than on "here is my table" scripts.
let's say you have a table Customers:
create table Customers (
id int identity(1,1) primary key,
first_name varchar(255) not null,
last_name varchar(255) not null
)
and later you want to add a status column. Don't modify your original table script, that one has already run (and can have the if(! exists) syntax to prevent it from causing errors while running again).
Instead, have a new script, called add_customer_status.sql
in this script you'll have something like:
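-- T-SQL: ADD takes no COLUMN keyword
alter table Customers
add status varchar(50) null
go
-- separate batch, so the new column exists when the update is compiled
update Customers set status = 'Silver' where status is null
go
alter table Customers
alter column status varchar(50) not null
go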
Again you can wrap this with an if(! exists) block to allow re-running, but here we've leveraged the notion that this is a change script, and we adapt the database accordingly. If there is data already in the customers table then we're still okay, since we add the column, seed it with data, then add the not null constraint.
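The existence check around the change can be sketched as follows. This uses SQLite through Python's `sqlite3` module as a stand-in for SQL Server, so the check is a `PRAGMA table_info` lookup rather than T-SQL, and the final NOT NULL step is left out because SQLite has no ALTER COLUMN. The helper names are hypothetical:

```python
import sqlite3


def column_exists(conn, table, column):
    """True if `table` already has a column named `column`."""
    rows = conn.execute(f"PRAGMA table_info({table})").fetchall()
    return any(row[1] == column for row in rows)  # row[1] is the column name


def add_customer_status(conn):
    """Re-runnable version of the add_customer_status change script."""
    if not column_exists(conn, "Customers", "status"):
        conn.execute("ALTER TABLE Customers ADD COLUMN status VARCHAR(50)")
    # Seeding is safe to repeat: only rows without a status are touched.
    conn.execute("UPDATE Customers SET status = 'Silver' WHERE status IS NULL")
    conn.commit()
```

Running the function twice leaves the table in the same state as running it once, which is the property the question is after.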
Both of the migration frameworks mentioned above are good, I've also had excellent experience with MigratorDotNet.
Scott named a couple of other SQL tools that address the problem of change management. But I'm still rolling my own.
I would like to second this question, and add my puzzlement that there is still no free, community-based tool for this problem. Obviously, scripts are not a satisfactory way to maintain a database schema; neither are instances. So, why don't we keep metadata in a separate (and while we're at it, platform-neutral) format?
That's what I'm doing now. My master database schema is a version-controlled XML file, created initially from a simple web service. A simple javascript program compares instances against it, and a simple XSL transform yields the CREATE or ALTER statements. It has limits, like RikMigrations; for instance, it doesn't always sequence inter-dependent objects correctly. (But guess what, neither does Microsoft's SQL Server Database Publication tool.) Really, it's too simple. I simply didn't include objects (roles, users, etc.) that I wasn't using.
So, my view is that this problem is indeed inadequately addressed, and that sooner or later we'll have to get together and tackle the devilish details.
We went the 'drop and recreate the schema' route. We had some classes in our JUnit test package which parameterized the scripts to create all the objects in the schema for the developer executing the code. This allowed all the developers to share one test database and everyone could simultaneously create/test/drop their test tables without conflicts.
Did it take a long time to run? Yes. At first we used the setup method for this, which meant the tables were dropped/created for every test, and that took way too long. Then we created a TestSuite which could be run once before all the tests for a class and then cleaned up when all the class tests were complete. This still meant that the db setup ran many times when we ran our 'AllTests' class, which included all the tests in all our packages. I solved it by adding a semaphore to the OracleTestSuite code, so that when the first test requested the database to be set up it would do that, but any subsequent call would just increment a counter. As each tearDown() method was called, the counter would be decremented until it reached 0, at which point the OracleTestSuite code would drop everything. One issue this leaves is whether the tests assume that the database is empty. It can be convenient to let database tests know the order in which they run, so they can take advantage of the state of the database, because it can reduce the duplication of DB setup.
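The counting scheme can be sketched independently of JUnit. The original was Java code in an OracleTestSuite class; the Python class below only illustrates the idea, and its name, its methods and the `create`/`drop` callables are assumptions. It is not thread-safe as written:

```python
class SharedSchema:
    """Create the test schema on first acquire, drop it on last release."""

    def __init__(self, create, drop):
        self._create = create  # callable that builds the schema
        self._drop = drop      # callable that drops it again
        self._users = 0

    def acquire(self):
        # Only the first caller pays for the expensive setup.
        if self._users == 0:
            self._create()
        self._users += 1

    def release(self):
        if self._users == 0:
            raise RuntimeError("release() called without matching acquire()")
        self._users -= 1
        # The last caller to leave tears everything down.
        if self._users == 0:
            self._drop()
```

Each suite's setup would call `acquire()` and each teardown `release()`, so nested suites share a single create/drop cycle.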
We used the concept of ObjectMothers to solve a similar problem with creating complex domain objects for testing purposes. Mock objects might be a better answer but we hadn't heard about them at the time. After all this time, I'd recommend creating test helper methods that could create standardized datasets for the typical scenarios. Plus that would help document the important edge cases from a data perspective.

Resources