Adding primary keys to a production database - sql-server

I just inherited a relatively small SQL Server database. We have a decentralized system operating on about ten sites, with each site being pounded all day by between sixty and one hundred clients. Upon inspecting the system, a couple of things jumped out at me: there are no maintenance plans or keys defined.
I have dozens of different applications that are already accessing the database. The majority of them are written in C with inline SQL. Part of what I was brought in to do was write stored procedures for everything and have our applications move to that. Before I do this, however, I really think I should be focusing on these seemingly glaring issues.
Also, we'll eventually be looking into replication to a central site, so I really think these things should be addressed before we even think of that.
Figuring out a redesign scheme and maintenance plan will be time-consuming but not problematic - I've done it before at single sites. But, how am I going to go about implementing these major changes to the database across ten (or more) production sites while ensuring data integrity and not breaking the applications?

I suspect that with no keys officially defined, this database probably has tons of data integrity problems. Lucky you.
For replication you will need GUIDs. I would do this: add the GUIDs and PK definitions in the dev environment and test, test, test. You'll probably find a lot of crap where people did SELECT * and adding the columns causes problems, or causes things to show up on reports that you don't want. Find and fix all of these. Be sure to script all the changes to the data and put them in source control, along with any code changes you need to make to the applications. Then schedule downtime for maintenance of the database during the lowest-usage hours, and let the users know ahead of time that the application will be down. During the downtime: have the application show a down message; change the database to single-user mode so no one except the team making this change can affect the database; make a full backup; run the scripts to make the changes to the database; deploy the application code changes; test; take the database out of single-user mode; and turn the application back on.
Under no circumstances would I try to make a change this major without going to single user mode.

First ensure you have valid backups of every db, and test-restore them to make sure they restore OK.
Consider using Ola Hallengren's maintenance scripts instead of Maintenance Plans if you need to deploy identical, consistent, scripted solutions to all your sites (see Ola Hallengren's site).
Then I'd look at getting some basic indexing in place, starting with the heavy-hitter tables. You can identify them in various ways. I presume you know how, but just to throw out a few thoughts: code review, SQL Trace, query plan analysis, and third-party tools such as Idera SQLdm, Confio Ignite, and Quest's Spotlight on SQL Server or Foglight Performance Analysis for SQL Server.
I think this will get you rolling.

Some additional ideas.
One of the first things I'd check is: are all the database instances alike, as far as database objects are concerned? Do they all have the exact same tables, columns (and their order in the tables), nullability, and so on? Be sure to check pretty much everything listed in sys.objects. Once you know that the database structures are all in sync, you know that any database modification scripts you generate will work on all the instances.
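A minimal sketch of that comparison, using Python's built-in sqlite3 module as a stand-in for the real servers. The helper names (schema_fingerprint, make_site) and the sample tables are made up for illustration; on SQL Server you would read the same facts from the catalog views instead of PRAGMA table_info. The idea is to boil each site's structure down to one hash so that any drift in column order, type, or nullability shows up as a mismatch.

```python
import hashlib
import sqlite3


def schema_fingerprint(conn):
    """Hash every table's columns (position, name, type, nullability, default, pk)."""
    digest = hashlib.sha256()
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
    for table in tables:
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        for column in conn.execute(f'PRAGMA table_info("{table}")'):
            digest.update(repr((table,) + tuple(column)).encode("utf-8"))
    return digest.hexdigest()


def make_site(ddl):
    """Build an in-memory database that plays the role of one site (illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    return conn


site_a = make_site("CREATE TABLE orders (id INTEGER, total REAL NOT NULL);")
site_b = make_site("CREATE TABLE orders (id INTEGER, total REAL NOT NULL);")
# Site C has drifted: the same column is nullable there.
site_c = make_site("CREATE TABLE orders (id INTEGER, total REAL);")

fingerprints = {
    name: schema_fingerprint(conn)
    for name, conn in [("a", site_a), ("b", site_b), ("c", site_c)]
}
print(fingerprints["a"] == fingerprints["b"])  # sites A and B agree
print(fingerprints["a"] == fingerprints["c"])  # site C does not
```

A hash only tells you that two sites differ, not where; once a mismatch shows up, diff the underlying column lists to find the actual drift.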
Once you modify your test environment with your planned changes, you have to ensure that they don't break existing functionality. Can you accurately emulate "...being pounded all day by between sixty and one hundred clients" on your test environment? If you can't, then you of course cannot know if your changes will break anything until they go live. (An assumption I'd avoid: just because a given instance has no duplicates in the columns you wish to build a primary key on does not mean that there are never any duplicates present...)
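Checking for duplicates before building a key could look roughly like the sketch below. It uses Python's sqlite3 module with a made-up customer table; find_duplicate_keys is a hypothetical helper, and the table and column names are interpolated directly into the SQL, so they must come from your own scripts and never from user input. As noted above, a clean result at one site on one day proves nothing about the other sites or about tomorrow, so run it everywhere and rerun it right before the change.

```python
import sqlite3


def find_duplicate_keys(conn, table, key_columns):
    """Return (key values..., count) for every candidate-key value that occurs more than once."""
    cols = ", ".join(key_columns)
    query = (
        f"SELECT {cols}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {cols} HAVING COUNT(*) > 1 ORDER BY {cols}"
    )
    return conn.execute(query).fetchall()


# Hypothetical sample data: customer_no 100 appears twice at site 1.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (site_id INTEGER, customer_no INTEGER, name TEXT);
    INSERT INTO customer VALUES (1, 100, 'Ann'), (1, 101, 'Bob'), (1, 100, 'Ann again');
""")

duplicates = find_duplicate_keys(conn, "customer", ["site_id", "customer_no"])
print(duplicates)  # [(1, 100, 2)]
```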


Altering database tables on updating website

This seems to be an issue that keeps coming back in every web application; you're improving the back-end code and need to alter a table in the database in order to do so. No problem doing manually on the development system, but when you deploy your updated code to production servers, they'll need to automatically alter the database tables too.
I've seen a variety of ways to handle these situations; all come with their own benefits and problems. Roughly, I've come to the following two possibilities:
Dedicated update script. Requires manually initiating the update. Requires all table alterations to be done in a predefined order (rigid release planning, no easy quick fixes on the database). Typically requires maintaining a separate updating process and some way to record and manage version numbers. Benefit is that it doesn't impact running code.
Checking table properties at runtime and altering them if needed. No manual interaction required and table alters may happen in any order (so a quick fix on the database is easy to deploy). Another benefit is that the code is typically a lot easier to maintain. Obvious problem is that it requires checking table properties a lot more than it needs to.
Are there any other general possibilities or ways of dealing with altering database tables upon application updates?
I'll share what I've seen work best. It's just expanding upon your first option.
The steps I've usually seen when updating schemas in production:
1. Take down the front end applications. This prevents any data from being written during a schema update. We don't want writes to fail because relationships are messed up or a table is suddenly out of sync with the application.
2. Potentially disconnect the database so no connections can be made. Sometimes there is code out there using your database you don't even know about!
3. Run the scripts as you described in your first option. It definitely takes careful planning. You're right that you need a pre-defined order to apply the changes. Also I would note often times you need two sets of scripts, one for schema updates and one for data updates. As an example, if you want to add a field that is not nullable, you might add a nullable field first, and then run a script to put in a default value.
4. Have rollback scripts on hand. This is crucial because you might make all the changes you think you need (since it all worked great in development) and then discover the application doesn't work before you bring it back online. It's good to have an exit strategy so you aren't in that horrible place of "oh crap, we broke the application and we've been offline for hours and hours and what do we do?!"
5. Make sure you have backups ready to go in case (4) goes really bad.
6. Coordinate the application update with the database updates. Usually you do the database updates first and then roll out the new code.
7. (Optional) A lot of companies do partial roll outs to test. I've never done this, but if you have 5 application servers and 5 database servers, you can first roll out to 1 application/1 database server and see how it goes. Then if it's good you continue with the rest of the production machines.
It definitely takes time to find out what works best for you. From my experience doing lots of production database updates, there is no silver bullet. The most important thing is taking your time and being disciplined in tracking changes (versioning like you mentioned).
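To make the "pre-defined order" and "versioning" points concrete, here is a rough sketch of a tiny migration runner, again using Python's sqlite3 module as a stand-in. Everything in it is an assumption for illustration: the schema_version table, the MIGRATIONS list, and the product table are invented names, and a real deployment would more likely use an existing migration tool. It also shows the split between a schema step (add a nullable column) and a data step (backfill a value).

```python
import sqlite3

# Ordered (version, script) pairs. Hypothetical changes, for illustration only.
MIGRATIONS = [
    (1, "CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"),
    # Schema step: add the new column as nullable first.
    (2, "ALTER TABLE product ADD COLUMN currency TEXT"),
    # Data step: backfill a value for rows that existed before the change.
    (3, "UPDATE product SET currency = 'GBP' WHERE currency IS NULL"),
]


def current_version(conn):
    """Return the highest applied version, or 0 for a brand-new database."""
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER NOT NULL)")
    row = conn.execute("SELECT MAX(version) FROM schema_version").fetchone()
    return row[0] or 0


def migrate(conn, migrations=MIGRATIONS):
    """Apply every pending migration in version order; return the versions applied."""
    applied = []
    for version, script in sorted(migrations):
        if version <= current_version(conn):
            continue  # already applied on this database
        conn.execute("BEGIN")
        try:
            conn.execute(script)
            conn.execute("INSERT INTO schema_version (version) VALUES (?)", (version,))
            conn.execute("COMMIT")
        except Exception:
            # The change and its version record are undone together.
            conn.execute("ROLLBACK")
            raise
        applied.append(version)
    return applied


# isolation_level=None puts the connection in autocommit mode so that the
# explicit BEGIN/COMMIT/ROLLBACK statements above control the transactions.
conn = sqlite3.connect(":memory:", isolation_level=None)
first_run = migrate(conn)
second_run = migrate(conn)
print(first_run, second_run)  # [1, 2, 3] []
```

Because each database records its own version, running the same script against every production machine is safe: servers that are already up to date simply skip the steps they have.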

Should we start with multiple small-grained databases for an app that may scale massively

We're developing a new eCommerce website and are using NHibernate for the first time. At present we are splitting our data into multiple SQL Server databases, divided per area of functionality. So we have one for UserInfo, one for Orders, one for ProductCatalogue and so on...
Our justification for this decision is twofold really:
the website has the potential to be HUGE (it is a new website for one of the largest online brands in the UK) and we feel that by partitioning our data along functional lines we will be able to move the databases onto their own servers which would give us an easy scaling route should we need it;
my team has always worked this way - partly as a consequence of following the MS Commerce Server pattern from previous projects.
However, reading up on this decision on the internet, we find that the normal response to this sort of model is extremely scathing. "Creating more work for the devs now in order to create more work for the devs later" is one sample comment from Stack Overflow!
In addition, NHibernate is much easier to use with only one database (just one SessionFactory needed). And knowing that Stack Overflow ran off just one box for a long time makes me think that maybe we should not try to be so clever.
So, my question is, "are we correct in thinking that using fine-grained databases might increase our ability to scale or should we sacrifice this for easier development"?
Why don't you just design your database properly and put the files on appropriate disk? Use a cluster if necessary. Creating multiple databases is not an inherently scaling solution. Also - cross database referential integrity? Good luck.
What's your definition of "HUGE"? SQL Server can handle massive databases, but one thing I've learnt is that people often have no idea what constitutes a lot of data.
I've never worked on a project like this. I'm used to databases with several hundred tables, which has never been a problem.
Therefore I can't say if your idea is a good idea, I never tried it. The "my team has always worked this way"-argument is a major driver for many decisions, and I can't even say that it is always wrong.
With NHibernate you organize your data in classes. They can be in different namespaces and assemblies. You usually don't work much with the database directly, you don't need this kind of structure there.
About the scalability argument: I'm not sure if it is really scaling well when you need to access several databases every time. I mean: you always need users and orders and probably more. Then you need to get all this data from several databases.
Agree fully with starskythehutch - keep your related tables together in the same DB. BUT, you may want to consider having separate databases for things that are not related or non-critical to your main product; but that are a part of the app.
For example: if you decide to log every visit/hit to the site in a DB, you should probably keep that in a separate DB.
The reasons you should consider this:
1. Huge number of transactions - say hundreds of thousands / sec. Keeping non-critical, unrelated stuff in a separate DB ensures that transaction log contention caused by it is avoided.
2. Restore, DBCC CHECKDB, and backup times. If you stuff your unrelated, non-critical data into your main DB, you are increasing the size of your DB, which slows down these operations. Keeping it in a separate DB improves their performance.

YAGNI and database creation scripts

Right now, I have code which creates the database (just a few CREATE queries on a SQLite database) in my main database access class. This seems unnecessary as I have no intention of ever using the code. I would just need it if something went wrong and I needed to recreate the database. Should I...
1. Leave things as they are, even though the database creation code is about a quarter of my file size.
2. Move the database-creation code to a separate script. It's likely I'll be running it manually if I ever need to run it again anyway, and that would put it out-of-sight-out-of-mind while working on the main code.
3. Delete the database-creation code and rely on revision control if I ever find myself needing it again.
I think it is best to keep the code. Even more importantly, you should maintain this code (or generate it) every time the database schema changes.
It is important for the following reasons.
You will probably be surprised how many times you need it. If you need to migrate your server, or setup another environment (e.g. TEST or DEMO), and so on.
I also find that I refer to the DDL SQL quite often when coding, particularly if I have not touched the system for a while.
You have a reference for the decisions you made, such as the indexes you created, unique keys, etc etc.
If you do not have a disciplined approach to this, I have found that the database schema can drift over time as ad-hoc changes are made, and this can cause obscure issues that are not found until you hit the database. Even worse without a disciplined approach (i.e. a reference definition of the schema) you may find that different databases have subtly different schema.
"I would just need it if something went wrong and I needed to recreate the database."
Recreating the database is absolutely not an exceptional case. That code is part of your deployment process on a new / different system, and it represents the DB structure your code expects to work with. You should actually have integration tests that confirm this. Working indefinitely with a single DB server whose schema was created incrementally via manually dispatched SQL statements during development is not something you should rely on.
But yes, it should be separated from the access code; so option 2 is correct. The separate script can then be used by tests as well as for deployment.
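A small sketch of what option 2 might look like with SQLite and Python. The note table, the SCHEMA_SQL constant, and both function names are invented for illustration; in a real project the DDL would probably live in its own file under version control. The point is that one definition of the schema serves both deployment and tests.

```python
import sqlite3

# The single reference definition of the schema (hypothetical example table).
SCHEMA_SQL = """
CREATE TABLE IF NOT EXISTS note (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body  TEXT
);
CREATE INDEX IF NOT EXISTS idx_note_title ON note (title);
"""


def create_database(path):
    """Create (or open) a database at `path` and make sure the schema exists."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA_SQL)
    return conn


def check_schema_round_trip():
    """Integration-style check: the access code's queries work against a fresh schema."""
    conn = create_database(":memory:")
    conn.execute("INSERT INTO note (title, body) VALUES (?, ?)", ("hello", "world"))
    return conn.execute("SELECT title, body FROM note").fetchall()


print(check_schema_round_trip())  # [('hello', 'world')]
```

Deployment would call create_database with a real file path, while tests use ":memory:" and get a fresh database on every run.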

Database Refresh

How often do you refresh your development database from the production database? Since there are many types of projects (targeting different domains), I would like to know how it is being done and at what intervals (days/months/years).
While working at Callaway Golf we had an automated build that would completely refresh the database from a baseline. This baseline would be updated (from production) almost daily. We had a set of scripts (DTS) that would do this for us. So if there was some new and interesting information we could easily do it a couple of times a day, once a week, etc. The key here is automation to perform the task. If it is easy to do, then when it is done really depends only on how the task impacts the load on the production database and the network, and on how long it takes to complete. This could of course be set up as a scheduled task to run at off-peak hours, before the dev team gets in in the morning.
The key things in refreshing your development database are:
(1) Automate the refresh through a script.
(2) Obfuscate the data in your development database, since you do not want your developers to see the real data. Alternatively, work from a sample of your production database.
(3) Decide the frequency of the refresh -- I usually do it once a week.
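Points (1) and (2) can be combined in one scripted step. The sketch below uses Python's sqlite3 module with a made-up customer table; pseudonymize and refresh_dev are hypothetical names, and the salted hash is only a placeholder technique. Real masking rules depend on your data and your legal requirements, and a short, well-known salt like the one here would not stop a determined attacker from guessing values.

```python
import hashlib
import sqlite3


def pseudonymize(value, salt="dev-refresh"):
    """Map a real value to a stable placeholder (same input gives the same output)."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "user_" + digest[:12]


def refresh_dev(prod, dev):
    """Rebuild the dev copy of `customer` from production, masking personal fields."""
    dev.execute("DROP TABLE IF EXISTS customer")
    dev.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
    for cust_id, _name, email in prod.execute("SELECT id, name, email FROM customer"):
        masked = pseudonymize(email)
        # Keys are preserved so related rows still line up; names and emails are replaced.
        dev.execute(
            "INSERT INTO customer VALUES (?, ?, ?)",
            (cust_id, masked, masked + "@example.invalid"),
        )
    dev.commit()


# Hypothetical production data for the demonstration.
prod = sqlite3.connect(":memory:")
prod.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
prod.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "Ann Example", "ann@shop.test"), (2, "Bob Example", "bob@shop.test")],
)

dev = sqlite3.connect(":memory:")
refresh_dev(prod, dev)
dev_rows = dev.execute("SELECT id, name, email FROM customer ORDER BY id").fetchall()
print(dev_rows)
```

Keeping the placeholder stable across refreshes means a developer's saved test cases still point at the same (masked) customer after the next refresh.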
Depends on what kind of work you're doing. If you're debugging issues that are closely related to the data, then frequent updates are good.
If you're doing data Quality Assurance (which often involves writing code to detect and repair it, that you have to develop and test away from the production server), then you need extremely fresh data. The bad data that is the most valuable to fix is the data that was just inserted or updated today.
If you are writing client code, then infrequent updates are good. Often when I'm writing C# UI code, I couldn't care less what the data is; I just care whether it shows up in the right box on the screen.
If you have data with any security issues, you should stop using production data--i.e. never update from production--and get a good sample data generator. Writing a good sample data generator is hard, so 3rd party products are the way to go. MS Data Dude comes to mind, and I recommend Red Gate's SQL Data Generator.
And finally, how hard is it to get a copy of the production data? If it is cheap and automatable, just get a new copy every night. If it is expensive (requires the attention of a very busy DBA), well, resource constraints might answer the question for you regardless of these other concerns.
We tend to refresh every couple of days, or perhaps once a week or so if things are "normal," though if we're investigating something amiss we may do so much more often.
Our production database is about 1GB, so it's not a trivial thing to copy around. Also, for us there's generally no burning need to get current data from production into the dev systems.
The "how" is simply a MySQL "backup" and "restore".
In a lot of cases, refreshing the dev database really isn't that important. Often production systems have far more data than is required for development, and working with such a large dataset can be a hassle for several reasons. Examples include development on the interface, where it's more important to have some data than any specific data. In this case, it's more customary to thin out the production database to a smaller subset of real data. After you do this once, it's not really that important to update, as long as all schema changes are pushed through the dev database(s).
On the other hand, performance bugs may often require production-sized databases to be able to reproduce and identify bottlenecks, so in this scenario it is extremely useful to have an almost-realtime database. Many issues may only present themselves with the exact data used in production.
We tend to always go back to an on-demand schedule. We have many different databases that are used in a suite of applications. We stay away from automatic DEV databases b/c many of our code changes involve database changes and I don't want anything overwritten.
Not at all. The dev databases (one per developer) get set up by a script or similar a couple of times a day, possibly a couple hundred times when running db tests locally. This includes a couple of records for playing around.
Of course we still need a database with at least parts of production in it, for integration and scalability tests. We are aiming for a daily automated refresh. But we aren't there yet.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database per client is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client and move it to a different server, or to restore a backup of that client, using the built-in database mechanisms, when they call up to say "We've deleted some key data!"
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to do easy reporting on all clients at once are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
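Here is a rough sketch of that routing idea, using SQLite files as stand-ins for database servers, so the "connection string" is just a file path. All names (customer_location, orders, open_customer_db, move_customer) are assumptions made up for the example. Every row carries a customer_id even in a customer's own database, which is what makes the later move a simple copy.

```python
import os
import sqlite3
import tempfile

ORDERS_DDL = ("CREATE TABLE IF NOT EXISTS orders "
              "(customer_id INTEGER NOT NULL, item TEXT NOT NULL)")


def open_customer_db(master, customer_id):
    """Look up where this customer's data lives and open that database."""
    row = master.execute(
        "SELECT connection_string FROM customer_location WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    if row is None:
        raise LookupError(f"unknown customer {customer_id}")
    conn = sqlite3.connect(row[0])
    conn.execute(ORDERS_DDL)
    return conn


def move_customer(master, customer_id, new_path):
    """Copy one customer's rows to a new database, then repoint the connection string."""
    source = open_customer_db(master, customer_id)
    target = sqlite3.connect(new_path)
    target.execute(ORDERS_DDL)
    rows = source.execute(
        "SELECT customer_id, item FROM orders WHERE customer_id = ?", (customer_id,)
    ).fetchall()
    target.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    target.commit()
    source.execute("DELETE FROM orders WHERE customer_id = ?", (customer_id,))
    source.commit()
    master.execute(
        "UPDATE customer_location SET connection_string = ? WHERE customer_id = ?",
        (new_path, customer_id),
    )
    master.commit()
    source.close()
    target.close()


# Demonstration: two customers start out sharing one database.
workdir = tempfile.mkdtemp()
shared_path = os.path.join(workdir, "shared.db")
master = sqlite3.connect(":memory:")
master.execute("CREATE TABLE customer_location "
               "(customer_id INTEGER PRIMARY KEY, connection_string TEXT NOT NULL)")
master.executemany("INSERT INTO customer_location VALUES (?, ?)",
                   [(1, shared_path), (2, shared_path)])

db = open_customer_db(master, 1)
db.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "widget"), (2, "gadget")])
db.commit()
db.close()

# Customer 2 outgrows the shared database and gets its own.
move_customer(master, 2, os.path.join(workdir, "customer2.db"))
```

A production version would need to block writes for that customer during the move and verify the copy before deleting anything; the sketch skips both.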
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but not theirs. I don't care how well everything is indexed and optimized there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article):
Another discussion of multi-tenancy as it relates to your data model here:
Scalability. Security. Our company uses the one-DB-per-customer approach as well. It also makes the code a bit easier to maintain.
In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have many smaller databases to distribute over multiple machines, you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster, but the majority probably don't).
I'd be interested in hearing a little more context from you on this, because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world (Google Bigtable, etc.), but those systems solve a different set of problems and lose some of the useful features of an RDBMS.
There are a couple of meanings of "database":
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case of an errant WHERE clause that showed me your customer data as well as my own.
