Should we start with multiple small-grained databases for an app that may scale massively - sql-server

We're developing a new eCommerce website and are using NHibernate for the first time. At present we are splitting our data into multiple SQL Server databases, divided per area of functionality. So we have one for UserInfo, one for Orders, one for ProductCatalogue and so on...
Our justification for this decision is twofold really:
the website has the potential to be HUGE (it is a new website for one of the largest online brands in the UK) and we feel that by partitioning our data along functional lines we will be able to move the databases onto their own servers which would give us an easy scaling route should we need it;
my team has always worked this way - partly as a consequence of following the MS Commerce Server pattern from previous projects.
However, reading up on this decision on the internet, we find that the normal response to this sort of model is extremely scathing. "Creating more work for the devs now in order to create more work for the devs later" is one sample comment from Stack Overflow!
In addition, NHibernate is much easier to use with only one database (just one SessionFactory needed). And knowing that Stack Overflow ran off just one box for a long time makes me think that maybe we should not try to be so clever.
So, my question is, "are we correct in thinking that using fine-grained databases might increase our ability to scale or should we sacrifice this for easier development"?

Why don't you just design your database properly and put the files on appropriate disks? Use a cluster if necessary. Creating multiple databases is not inherently a scaling solution. Also, cross-database referential integrity? Good luck.
What's your definition of "HUGE"? SQL Server can handle massive databases, but one thing I've learnt is that people often have no idea what constitutes a lot of data.

I've never worked on a project like this. I'm used to databases with several hundred tables, which has never been a problem.
Therefore I can't say whether your idea is a good one; I've never tried it. The "my team has always worked this way" argument is a major driver for many decisions, and I can't even say that it is always wrong.
With NHibernate you organize your data in classes. They can be in different namespaces and assemblies. You usually don't work much with the database directly, you don't need this kind of structure there.
About the scalability argument: I'm not sure if it is really scaling well when you need to access several databases every time. I mean: you always need users and orders and probably more. Then you need to get all this data from several databases.

Agree fully with starskythehutch - keep your related tables together in the same DB. BUT, you may want to consider having separate databases for things that are not related or non-critical to your main product; but that are a part of the app.
For example, if you decide to log every visit/hit to the site in a DB, you should probably keep that in a separate DB.
The reasons to consider this:
1. Huge number of transactions, say hundreds of thousands per second. Having non-critical, unrelated stuff in a separate DB ensures that transaction log contention caused by it is avoided.
2. Restore, DBCC CHECKDB and backup times. If you stuff your unrelated, non-critical data into your main DB, you are essentially increasing the size of that DB, and it will affect these operations. Keeping it in a separate DB will help you improve the performance of these operations.

What arguments to use to explain why SQL Server is far better than a flat file

The higher-ups in my company were told by good friends that flat files are the way to go, and that we should switch from SQL Server to them for everything we do. We have over 300 servers and hundreds of different databases. From just the few I'm involved with we have > 10 billion records in quite a few of them, with upwards of 100k new records a day and who knows how many updates... A couple of others and I need to come up with a response saying why we shouldn't do this. Most of our stuff is ASP.NET with some legacy ASP. We thought of making a simple console app that tests and times the same interactions against a flat file (stored on the network) and against SQL Server over the network: large inserts, searches, updates and so on, along with things like random network disconnects. This would show them how bad flat files can be, especially when you are dealing with millions of records.
What things should I use in my response? What should I do with my demo code to illustrate this?
My short list so far:
Security
Concurrent access
Performance with large amounts of data
Amount of time to do such a massive rewrite/switch and huge $ cost
Lack of transactions
PITA to map relational data to flat files
NTFS doesn't support tons of files in a directory well
Lack of ad hoc data searching/manipulation
Enforcing data integrity
Recovery from network outage
Client delay while waiting for other clients changes to commit
Most everybody stopped using flat files for this type of storage long ago for good reason
Load balancing/replication
I fear that this will be a great post on the Daily WTF someday if I can't stop it now.
Additionally
Does anyone know if anything about HIPAA could be used in this fight? Many of our records are patient records...
1. Data integrity. First, you can enforce it in a database and cannot in a flat file. Second, you can ensure you have referential integrity between different entities to prevent orphaning rows.
2. Efficiency in storage, depending on the nature of the data. If the data is naturally broken into entities, then a database will be more efficient than lots of flat files, given the additional code that would have to be written to join data across flat files.
3. Native query capabilities. You can query against a database natively, whereas you cannot with a flat file. With a flat file you have to load the file into some other environment (e.g. a C# application) and use its capabilities to query against it.
4. Format integrity. The database format is more rigid, which means more consistent. A flat file can easily change in a way that breaks the code that reads it. The difference is related to #3. In a database, if the schema changes, you can still query against it using native tools. If the flat file format changes, you effectively have to do a search, because the code that reads it will likely be broken.
5. "Universal" language. SQL is somewhat ubiquitous, whereas the structure of a flat file is far more malleable.
I'd also mention data corruption. Most modern SQL databases can have the power killed on the server, or have the server instance crash, and you won't (shouldn't) lose data. Flat files aren't really that way.
Also I'd mention search times. Perhaps even write a simple flat file database with 1mil entries and show search times vs MS SQL. With indexes you should be able to search a SQL database thousands of times faster.
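A demo along those lines could be sketched roughly as follows. This is only a minimal illustration under stated assumptions: SQLite (from the Python standard library) stands in for MS SQL, the file and table names are made up, and the row count is kept small so it runs in a moment. Absolute timings will vary by machine, so none are quoted here.

```python
import os
import sqlite3
import tempfile
import time

ROWS = 200_000          # assumed size; raise it to make the gap more obvious
TARGET_ID = ROWS - 1    # worst case for a linear scan: the last record


def build_flat_file(path):
    """Write one 'id,name' record per line (hypothetical layout)."""
    with open(path, "w", encoding="utf-8") as f:
        for i in range(ROWS):
            f.write(f"{i},customer-{i}\n")


def find_in_flat_file(path, wanted_id):
    """Linear scan: every line before the match has to be read and parsed."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            record_id, name = line.rstrip("\n").split(",", 1)
            if int(record_id) == wanted_id:
                return name
    return None


def build_database(path):
    """Same data in a table; the primary key is backed by an index."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        ((i, f"customer-{i}") for i in range(ROWS)),
    )
    conn.commit()
    return conn


def find_in_database(conn, wanted_id):
    """Keyed lookup: the engine walks the index instead of scanning."""
    row = conn.execute(
        "SELECT name FROM customers WHERE id = ?", (wanted_id,)
    ).fetchone()
    return row[0] if row else None


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    flat_path = os.path.join(tmp, "customers.txt")
    db_path = os.path.join(tmp, "customers.db")
    build_flat_file(flat_path)
    conn = build_database(db_path)

    flat_result, flat_seconds = timed(find_in_flat_file, flat_path, TARGET_ID)
    db_result, db_seconds = timed(find_in_database, conn, TARGET_ID)
    conn.close()

print(f"flat file scan : {flat_seconds:.6f}s -> {flat_result}")
print(f"indexed lookup : {db_seconds:.6f}s -> {db_result}")
```

Both lookups return the same record; only the time taken differs. For a fair comparison in front of management, run it against realistic data volumes and over the network, as the question proposes.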
I'd also be careful how quickly you write off flat files. I'd go so far as saying "it's a good idea for many cases, but in our case....". This way you won't sound like you're not listening to the other views. Tact in situations like this is a major thing to consider. They may be horribly wrong, but you have to convince your boss of that.
What do they gain from using flat files? The conversion process will be hundreds of hours - hours they pay for. How quickly can flat files generate a positive return on that investment? Provide a rough cost estimate. Translate the technical considerations into money (costs), and it puts the problem in their perspective.
On top of just the data conversion, add in the hidden costs for duplicating a database's capabilities...
Indexing
Transaction processing
Logging
Access control
Performance
Security
Databases allow you to easily index your data so that you can find particular records, or groups of records, by searching on any number of different columns.
With flat files you have to write your own indexing mechanisms. There is no need to do all that work again when the database does it for you already.
If you use "text files", you'll need to build an interface on top of it which Microsoft has already done for you and called it SQL Server.
Ask your managers if it makes sense to your company to spend all these resources building a home-made database system (because really that's what it is), or would these resources be better spent focusing on the business.
Performance: SQL Server is built for storing conveniently searchable data. It has optimized data structures in memory built with searching/inserting/deleting in mind. Usage of the disk is lowered, as data regularly queried is kept in memory.
Business partners: if you ever plan to do B2B with 3rd party companies, SQL Server has built-in functionality for it called Linked Servers. If you have only a bunch of files, your business partner will give up on you as no data interconnection is possible. Unless you want to re-invent the wheel again, and build an interface for each business partner you have.
Clustering: you can easily cluster servers in SQL Server for high availability and speed, a lot more than what's possible with text based solution.
You have a nice start to your list. The items I would add include:
Data integrity - SQL engines provide built-in mechanisms (relationships, constraints, triggers, etc.) that make it very simple to reduce the amount of "bad" data in your system. You would need to hand-code every data constraint yourself if you use flat files.
Ad hoc data retrieval - SQL engines, through the use of SELECT statements, provide a means of filtering and summarizing your data with very little code. If you are using flat files, considerably more code is needed to get the same results.
These items can be replicated if you want to take the time to build a data engine, but what would be the point? SQL engines already provide these benefits.
I don't think I can even start to list the reasons. I think my head is going to explode. I'll take the risk though to try to help you...
Simulate a network outage and show what happens to one of the files at that point
Demo the horrors of a half-committed transaction because text files don't pass the ACID test
If it's a multi-user application, show how long a client has to wait when 500 connections are all trying to update the same text file
Try to politely explain why the best approach to making business decisions is to listen to the professionals who you are paying money and who know the domain (in this case, IT) and not your buddy who doesn't have a clue (maybe leave out that last bit)
Mention the fact that 99% (made up number) of the business world uses relational databases for their important data, not text files and there's probably a reason for that
Show what happens to your application when someone goes into the text file and types in "haha!" for a column that's supposed to be an integer
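Two of the demos in that list (the half-committed transaction and the "haha!" in an integer column) can be sketched in a few lines. This is a hedged illustration, not the poster's code: SQLite stands in for SQL Server, the accounts table and the transfer function are invented for the example, and the "outage" is just a raised exception.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE accounts (
        id      INTEGER PRIMARY KEY,
        balance INTEGER NOT NULL
                CHECK (typeof(balance) = 'integer' AND balance >= 0)
    )
    """
)
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 0)")
conn.commit()


def transfer(conn, src, dst, amount, fail_midway=False):
    """Move money between two rows as one all-or-nothing unit."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE id = ?",
            (amount, src),
        )
        if fail_midway:
            # Stand-in for a network outage between the two writes.
            raise ConnectionError("simulated network outage")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE id = ?",
            (amount, dst),
        )


def balances(conn):
    return conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()


# 1. Failure halfway through: the debit is rolled back, nothing is half-done.
try:
    transfer(conn, 1, 2, 40, fail_midway=True)
except ConnectionError:
    pass
balances_after_outage = balances(conn)

# 2. Someone tries to store "haha!" where an integer belongs.
try:
    with conn:
        conn.execute("UPDATE accounts SET balance = ? WHERE id = 1", ("haha!",))
    bad_write_accepted = True
except sqlite3.IntegrityError:
    bad_write_accepted = False

# 3. A normal transfer still goes through in full.
transfer(conn, 1, 2, 40)
balances_after_transfer = balances(conn)

print(balances_after_outage, bad_write_accepted, balances_after_transfer)
```

With a text file, the equivalent of step 1 leaves the first write on disk and the second one missing, and nothing stops step 2 unless every program that touches the file validates it by hand.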
If you are a public company, the shareholders would be well served to know this is being seriously contemplated. "We" all know this is a ridiculous suggestion given the size and scope of your operation. Patient records must be protected, not only from security breaches but from irresponsible exposure to loss - lives may depend upon the data. If the executives care at all about the patients, THIS should be their highest concern.
I worked with IBM 370 mainframes from '74 onwards and the day that DB2 took over from plain old flat files, VSAM and ISAM was a milestone day. Haven't looked back to flat-file storage, except for streaming data, in my 25 years with RDBMSs of 4 flavors.
If I owned stock in "you", dumping it in a hurry the moment the project took off would seem appropriate...
Your list is a great start of reasons for sticking with a database.
However, I would recommend that if you're talking to a technical person, to shy away from technical reasons in a recommendation because they might come across as biased.
Here are my 2 points against flat file data storage:
1) Security - HIPAA audits require that patient data remain in a secure environment. The common database systems (Oracle, Microsoft SQL Server, MySQL) have methods for implementing HIPAA-compliant security access. Doing so on a flat file would be difficult, at best.
Side note: I've also seen medical practices that encrypt the patient name in the database to add extra layers of protection & compliance to ensure even if their DB is compromised that the patient records are not at risk.
2) Reporting - Reporting from any structured database system is simple and common. There are hundreds of thousands of developers who can perform this task. Reporting from flat files will require an above-average developer. And, because there is no generally accepted method for reporting off a flat-file database, one developer might do things differently from another. This could shrink the talent pool able to work on a home-grown flat-file system, and ultimately drive up the cost of supporting that type of system.
I hope that helps.
How do you create a relational model with plain text files?
Or are you planning to use a different file for each entity?
Pro file system:
Stable (less lines of code = less bugs, easier to understand, more reliable)
Faster with huge data blobs
Searching/sorting is somewhat slow (but sort can be faster than SQL's order by)
So you'd choose a filesystem to create log files, for example. Logging into a DB is useless unless you need to do complex analysis of the data.
Pro DB:
Transactions (which includes concurrent access)
It can search through huge amounts of records (but not through huge blobs of data)
Chopping the data in all kinds of ways with queries is easy (well, if you know your SQL and the special "oddities" of your DB)
So if you need to add data rarely but search it often, select parts of it by certain criteria or aggregate values, a DB is for you.
NTFS does not support mass amounts of .txt files well. Depending on how a flat file system is developed, the health of a hard drive can become an issue. A lot of older file systems use a mass of small .txt files to store data. It's bad design, but it tends to happen as a flat file system gets older.
Fragmentation becomes an issue, and you lose a text file here and there, causing you to lose small amounts of data. Health of a hard drive should not be an issue when it comes to database design.
This is indeed, on the part of your employer, a MAJOR WTF if he's seriously proposing flat files for everything...
You already know the reasons (oh - add Replication / Load Balancing to your list) - what you need to do now is to convince him of them. My approach to this would be twofold.
First of all, I would write a script in whatever tool you currently use to perform a basic operation using SQL, and have it timed. I would then write another script in which you sincerely try to get a flat text solution working, and then highlight the difference in performance. Give him both sets of code so he knows you aren't cheating.
Point out that technology evolves, and that just because someone was successful 20 years ago, this does not automatically entitle them to a credible opinion now.
You might also want to mention the scope for errors in decoding / encoding information in text files, that it would be trivial for someone to steal them, and the costs (justify your estimate) in adapting the current code base to use text files.
I would then ask serious questions of management - foremost amongst them, and I would ask this DIRECTLY, is "Why are you prepared to overrule your technical staff on technical matters" based on one other individual's opinion - especially when said individual is not as familiar with our set up as we are...
I'd also then use the phrase "I do not mean to belittle you, but I seriously feel I have to intervene at this point for the good of the company..."
Another approach - turn the tables - have Mr. Wonderful supply arguments as to why text files are the way forward. You'll then either a) Learn something (not likely), or b) Be in a position to utterly destroy his arguments.
Good luck with this - I feel your pain...
Martin
I suggest you get your retaliation in first: post on Daily WTF now.
As to your question: a business reason would be to ask why your boss wants to rewrite all your systems from scratch, since you would, effectively, have to write your own database system.
For a development reason, you would lose access to the SQL server ecosystem, all the libraries, tools, utilities.
Perhaps the guy that suggested this is actually thinking of going into competition with your company.
Simplest way to refute this argument - name a Fortune 500 company that processes data on this scale using flat files.
Now name a Fortune 500 company that doesn't use a relational database...
Case closed.
Something is really fishy here. For someone to get the terminology right ("flat file") but not know how overwhelmingly stupid an idea that is just doesn't add up. I would be willing to bet your manager is non-technical, but the person your manager is talking to is technical. This sounds more like a lost-in-translation problem.
Are you sure they don't mean NoSQL? If you are in a document-centric environment, moving away from a relational database actually does make sense in some regards, while still keeping many of the positives of a traditional RDBMS.
So, instead of justifying why SQL is better than flat files, I would invert the problem and ask what problems flat files are meant to solve. I would put odds on money that this is a communication problem.
If it's not, and your company is actually considering replacing its DB with a home-grown flat file system on the recommendation of "a friend", convincing your manager that he is wrong is the least of your worries. Instead, dust off your resume and start circulating it.
• Amount of time to do such a massive rewrite/switch and huge $ cost
It's not just the amount of time, it is the introduction of new bugs. A rewrite of these proportions would cause things that currently work to break.
I'd suggest giving him a cost estimate of the hours to do such a rewrite for just one system, and then the number of systems that would need to change. Once they have a cost estimate, they will run from this as fast as they can.
Managers like numbers, so do a formal written decision analysis. Compare the two proposals by benefits and risks, side by side with numeric values. When you get to cost 0 to maintain and 100,000,000 to convert they will get the point.
People who don't distinguish between flat files and SQL won't understand any of the arguments listed above.
The explanation must be as simple as possible, something like this:
SQL is a kind of search/concurrency wrapper around flat files.
All the problems that exist today will remain, even if the company writes that wrapper from scratch.
You must also offer some other way to resolve the current problems; use smart words like advanced BLL or install/uninstall scripting environment. :)
You have to speak executive. Without saying it, make them realize they're in way over their heads here. Here's some ammunition:
Database theory is hardcore computer science. We're talking about building a scalable system that can handle millions of records and tolerate disasters without putting everyone out of business.
This is the work of PhD-level specialists. They've been refining the field for a good 20 years now, and the great thing about that is this: it allows us to specialize in building business systems.
If you have to, come right out and say that this just isn't done in the enterprise. It would be costly and the result would be inferior. It's exactly the kind of wheel that developers love to reinvent, and in my opinion the only time you should is if the result is going to be a product or service that you can sell. And it won't be.

How to modernize an enormous legacy database?

I have a question, just looking for suggestions here.
So, my application is 'modernizing' a desktop application by converting it to the web, with an ICEFaces UI and server side written in Java. However, they are keeping around the same Oracle database, which at current count has about 700-900 tables and probably a billion total records in the tables. Some individual tables have 250 million rows, many have over 25 million.
Needless to say, the database is not scaling well. As a result, the performance of the application is looking to be abysmal. The architects / decision makers-that-be have all either refused or are unwilling to restructure the persistence. So, basically we are putting a fresh coat of paint on a functional desktop application that currently serves most user needs and does so with relative ease. The actual database performance is pretty slow in the desktop app now. The quick performance I referred to earlier was non-database related stuff (sorry I misspoke there). I am having trouble sleeping at night thinking of how poorly this application is going to perform and how difficult it is going to be for everyday users to do their job.
So, my question is, what options do I have to mitigate this impending disaster? Is there some type of intermediate layer I can put in between the database and the Java code to speed up performance while at the same time keeping the database structure intact? Caching is obviously an option, but I don't see that as being a cure-all. Is it possible to layer a NoSQL DB in between or something?
I don't understand how to reconcile two things you said.
Needless to say, the database is not scaling well
and
currently serves most user needs and does so with relative ease and quick performance.
You don't say you are adding new users or new function, just making the same function accessible via a web interface.
So why is there a problem? Your web app will be doing more or less the same database work as before.
In fact introducing a web tier could well give new caching opportunities so reducing the work the DB is doing.
If your early pieces of web app development are showing poor performance then I would start by trying to understand how the queries you are doing in the web app differ from those done by the existing app. Is it possible that you are using some tooling which is taking a somewhat naive approach to generating queries?
If the current app performs well and your new java app doesn't, the problem is not in the database layer, but in your application layer. If performance is as bad as you say, they should notice fairly early and have the option of going back to the Desktop application.
The DBA should be able to readily identify the additional workload on the database from your application. Assuming the logic hasn't changed it is unlikely to be doing more writes. It could be reads or it could be 'chattier' (moving the same amount of information but in smaller parcels). Chatty applications can use a lot of CPU. A lot of architects try to move processing from the database layer into the application layer because "work on the database is expensive" but actually make things worse due to the overhead of the "to-and-fro".
PS.
There's nothing 'bad' about having 250 million rows in a table. Generally you access a table through an index. There are typically 2 or 3 hops from the top of an index to the bottom (and then one more to the table). I've got a 20 million row table with a BLEVEL of 2 and a 120+ million row table with a BLEVEL of 3.
Indexing means that you rarely hit more than a small proportion of your data blocks. The frequently used index blocks (and data blocks) get cached in the database server's memory. The DBA would be able to see if this memory area is too small for the workload (ie a lot of physical disk IO).
If your app is getting a lot of information that it doesn't really need, this can put pressure on the memory space. Don't be greedy. If you only need three columns from a row, don't grab the whole row.
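To see why index depth grows so slowly with row count, here is a back-of-the-envelope sketch. The function name and the figure of 300 entries per index block are assumptions picked purely for illustration; real fan-out depends on key size, block size and how full the blocks are, so treat the result as an order-of-magnitude estimate, not a prediction of any particular index's BLEVEL.

```python
def branch_levels(row_count, entries_per_block=300):
    """Estimate branch levels above the leaf blocks of a B-tree index.

    entries_per_block is an assumed, illustrative fan-out.
    """
    # Ceiling division with integers avoids floating-point rounding surprises.
    blocks = -(-row_count // entries_per_block)   # number of leaf blocks
    levels = 0
    while blocks > 1:
        blocks = -(-blocks // entries_per_block)  # blocks one level up
        levels += 1
    return levels


for rows in (20_000_000, 120_000_000, 1_000_000_000):
    print(f"{rows:>13,} rows -> about {branch_levels(rows)} branch level(s)")
```

Under that assumed fan-out, each extra level multiplies the number of rows the index can address by a few hundred, which is why a table can grow from millions of rows to hundreds of millions and cost only one more hop per lookup.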
What you describe is something that Oracle should be capable of handling very easily if you have the right equipment and database design. It should scale well if you get someone on your team who is a specialist in performance tuning large applications.
Redoing the database from scratch would cost a fortune and would introduce new bugs and the potential for loss of critical information is huge. It almost never is a better idea to rewrite the database at this point. Usually those kinds of projects fail miserably after costing the company thousands or even millions of dollars. Your architects made the right choice. Learn to accept that what you want isn't always the best way. The data is far more important to the company than the app. There are many reasons why people have learned not to try to redesign the database from scratch.
Now there are ways to improve database performance. The first thing I would consider with a database this size is partitioning the data. I would also consider archiving old data to a data warehouse and doing most reporting from that. Other things to consider would be upgrading your servers to higher-performing models, profiling to find the slowest-running queries and fixing them individually, looking at indexing, and updating statistics and indexes (not sure if this is what you do on Oracle, I'm a SQL Server gal, but your DBAs would know). There are some good books on refactoring old legacy databases. The one below is not database specific.
http://www.amazon.com/Refactoring-Databases-Evolutionary-Database-Design/dp/0321293533/ref=sr_1_1?ie=UTF8&s=books&qid=1275577997&sr=8-1
There are also some good books on performance tuning (look for ones specific to Oracle; what works for SQL Server or MySQL is not what is best for Oracle).
Personally I would get those and read them from cover to cover before designing a plan for how you are going to fix the poor performance. I would also include the DBAs in all your planning, they know things that you do not about the database and why some things are designed the way they are.
If you have a lot of lookups that are for items not in the database you can reduce the number by using a bloom filter. Add everything in the database to the bloom filter then before you do a lookup check the bloom first. Only if the bloom reports it present do you need to bother the database. The bloom will result in false positives but you can design it to the 'size vs false positive' trade off that best suits you.
The strategy is used by Google in their Bigtable database, and they have reported that it significantly improves performance.
http://en.wikipedia.org/wiki/Bloom_filter
Good luck, working on tasks you don't believe in is tough.
So you put a fresh coat of paint on a functional and quick desktop application and then the system becomes slow?
And then you say that "it is needless to say that the database isn't scaling well"?
I don't get it. I think that there is something wrong with your fresh coat of paint, not with the database.
Don't be put down by this sort of thing. See it as a challenge, rather than something to be losing sleep over! I know it's tempting as a programmer to want to rip everything out and start over again, but from a business perspective, it's just not always viable. For example, by using the same database, the business can continue to use the old application while the new one is being developed and switch over customers in groups, rather than having to switch everyone over at the same time.
As for what you can do about performance, it depends a lot on the usage pattern. Caching can help greatly with mostly read-only databases. Even with read/write database, it can still be a boon if correctly designed. A NoSQL database might help with write-heavy stuff, but it might also be more trouble than it's worth if the data has to end up in a regular database anyway.
In the end, it all depends greatly on your application's architecture and usage patterns.
Good luck!
Well, without knowing too much about what kinds of queries are mostly done (I would expect lookups to be more common), perhaps you should try caching first. Cache at different layers: at the layer before the app server if possible and, of course, as you suggested, at the layer between the app server and the database.
Caching works well for read data and it might not be as bad as you think.
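As a rough sketch of what caching between the app server and the database might look like, here is a minimal read-through cache with a time-to-live. Everything in it (the class name, the loader callback, the 60-second TTL) is an assumption made for the example; in a Java stack you would more likely reach for an existing caching library than write your own.

```python
import time


class ReadThroughCache:
    """Serve repeated reads from memory; fall back to a loader on a miss."""

    def __init__(self, loader, ttl_seconds=60.0, clock=time.monotonic):
        self._loader = loader        # function that actually hits the database
        self._ttl = ttl_seconds      # assumed freshness window
        self._clock = clock
        self._entries = {}           # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: no database work
        value = self._loader(key)    # miss or stale: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call this after a write so readers do not see stale data."""
        self._entries.pop(key, None)


# Stand-in for an expensive query; counts how often it is really executed.
query_count = 0


def load_customer(customer_id):
    global query_count
    query_count += 1
    return {"id": customer_id, "name": f"customer-{customer_id}"}


cache = ReadThroughCache(load_customer)
first = cache.get(7)
second = cache.get(7)            # served from memory
queries_after_two_reads = query_count

cache.invalidate(7)              # e.g. after the row was updated
third = cache.get(7)             # reloaded from the "database"
queries_after_invalidate = query_count

print(queries_after_two_reads, queries_after_invalidate)
```

The hard part in practice is invalidation: while the old desktop app keeps writing to the same database, the web tier will not see those writes, so the TTL has to be short enough for the business to tolerate.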
Have you looked at Terracotta? They do have some caching and scaling stuff that might be relevant to you.
Take it as a challenge!
The way to 'mitigate this impending disaster' is to do what you should be doing anyway. If you follow best practices the pain of switching out your persistence layer at a later stage will be minimal.
Up until the time that you have valid performance benchmarks and identified bottlenecks in the system talk of performance is premature. In any case I would be surprised if many of the 'intermediate layer' strategies aren't already implemented at the database level.
If the database is legacy and enormous, then
1) it cannot be changed in a way that will change the interface, as this will break too many existing applications. Or, if you change the interface, this has to be coordinated with modifying multiple applications with associated testing.
2) If the issue is performance, then there are probably many changes that can be made to optimize the database without changing the interface.
3) Views can be used to maintain the existing interfaces while restructuring tables for more efficiency, or possibly to allow more efficient access in the future.
4) Standard database optimizations, such as performance analysis, indexing, caching can probably greatly increase efficiency and performance without changing the interface.
There's a lot more that can be done, but you get the idea. It can't really be updated in one single big change. Changes have to be incremental, or transparent to the applications that use it.
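Point 3 can be illustrated with a small sketch. The table and column names are invented, and SQLite stands in for the real database; the only claim is the general pattern: split a table, then recreate its old name as a view so that existing read queries keep working unchanged.

```python
import sqlite3

# A query that existing applications are assumed to run as-is.
LEGACY_QUERY = "SELECT id, name, city FROM customer ORDER BY id"

conn = sqlite3.connect(":memory:")

# The original, wide table (hypothetical schema).
conn.executescript(
    """
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY, name TEXT, street TEXT, city TEXT
    );
    INSERT INTO customer VALUES
        (1, 'Ann', '1 High St', 'Leeds'),
        (2, 'Bob', '2 Low Rd', 'York');
    """
)
rows_before = conn.execute(LEGACY_QUERY).fetchall()

# Restructure: split the table in two, then put a view with the old name
# and the old column list in its place.
conn.executescript(
    """
    BEGIN;
    CREATE TABLE customer_core (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE customer_address (
        customer_id INTEGER PRIMARY KEY REFERENCES customer_core(id),
        street TEXT,
        city TEXT
    );
    INSERT INTO customer_core SELECT id, name FROM customer;
    INSERT INTO customer_address SELECT id, street, city FROM customer;
    DROP TABLE customer;
    CREATE VIEW customer AS
        SELECT c.id, c.name, a.street, a.city
        FROM customer_core AS c
        JOIN customer_address AS a ON a.customer_id = c.id;
    COMMIT;
    """
)
rows_after = conn.execute(LEGACY_QUERY).fetchall()

print(rows_before == rows_after)
```

This covers reads only. Writes through a join view usually need extra work (INSTEAD OF triggers or similar), which is one more reason to make such changes incrementally and with the DBAs involved.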
The database is PART of the application. Don't consider them to be separate; they aren't.
As a developer, you need to be free to make schema changes as necessary, and suggest data changes to improve performance / functionality in production (for example archiving old data).
Your development system presumably does not have that much data, but has the exact same schema.
In order to do performance testing, you will need a system with the same hardware and same size data (same data if possible) as production. You should explain to management that performance testing is absolutely necessary as you feel the app isn't going to perform.
Of course making schema changes (adding / removing indexes, splitting tables out etc) may affect other parts of the system - which you should consider as parts of a SYSTEM - and hence do the necessary regression testing and fixing.
If you need to modify the database schema, and make changes to the desktop client accordingly, to make the web app perform, that is what you have to do - justify your design decision to the management.

How many tables/sprocs/functions in a database is too many?

I'm interested in database refactoring. I deal with several databases that don't have a large amount of data, just a few GB with at most a few hundred thousand rows. However, they have hundreds -- sometimes many hundreds -- of tables, views, sprocs and functions. In some places a divide-and-rule strategy using schemas has been implemented which has helped some problems of seeing ownership/usage of tables. However, it hasn't really helped object coupling.
We all read that integration via a shared database isn't A Good Thing, but we also know that it is, at least for a while, a very productive thing, as everything is in the database. We just don't apply the Single Responsibility Principle to databases like we do to objects.
Edit: I should add that I have no database performance issues. The tables are not large, the biggest has only a few hundred thousand rows. There is no real database performance issue; except when the database schema/logic/implementation is grotesquely inefficient (say requiring a cursor to do a sproc execution for each row in a result set in order to pre-process data for a report). Before you say I should change these, that is the whole point: I can't because the database is no longer in a state where the impact of changes can be assessed.
Clearly at some point you say "Enough!" and divide into multiple databases connected by messages, ETL, application tiers etc.
The question is: how many is too many? What is the absolute upper limit of the number of sprocs/tables/functions that you can have before you go insane?
First, stop trying to think of databases in object oriented terms. Principles of object oriented programming simply do NOT apply to relational databases.
Shared databases are a very good thing from a business perspective. Multiple databases storing information that has to be transferred between them quickly becomes way more complex than your piddly many hundreds of objects. Data that is consistent between enterprise applications is priceless. Trying to reconcile if GE Corp and General Electric Corporation are really the same entity between two databases can be a nightmare.
Refactoring databases is a nice goal, but it is very complex in reality. Don't do it unless you have a major performance issue that needs to be addressed or unless you are willing to commit to a process of identifying all the code that might be affected by a change. Even then, consider whether you can know all the code that might change (this is one reason why database people hate, hate, hate dynamic code!).
Often the best way to refactor is to add your change and start changing over to using your new field, sp etc while leaving the old one in place until a set expiration date. Since you are on an annual cycle, you will need to manage those dates over a long period of time. To see if sps are being used, you can identify the ones you aren't sure of and add some code to them to insert a row into a table every time they are run. If, after your whole year cycle, they haven't been run, you can safely eliminate them. The cycle may be shorter depending on the sp.
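The usage-logging idea above can be sketched roughly as follows. This is only an illustration in Python with an in-memory SQLite database standing in for SQL Server; the table name proc_usage_log, the decorator and the two procedure names are assumptions made up for the example. In a real SQL Server database the equivalent would be an INSERT statement added to the body of each suspect stored procedure.

```python
import sqlite3

# In-memory SQLite stands in for the real database (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE proc_usage_log ("
    " proc_name TEXT NOT NULL,"
    " run_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)


def log_usage(proc_name):
    """Record one row in proc_usage_log every time the wrapped 'procedure' runs."""
    def wrap(fn):
        def inner(*args, **kwargs):
            conn.execute(
                "INSERT INTO proc_usage_log (proc_name) VALUES (?)", (proc_name,)
            )
            return fn(*args, **kwargs)
        return inner
    return wrap


# Hypothetical procedures we are unsure about.
@log_usage("usp_send_email")
def usp_send_email(to):
    return f"sent to {to}"


@log_usage("usp_annual_attendance_report")
def usp_annual_attendance_report():
    return "report"


def never_run(candidates):
    """Return the candidates that have no row in the usage log."""
    used = {row[0] for row in conn.execute(
        "SELECT DISTINCT proc_name FROM proc_usage_log")}
    return sorted(set(candidates) - used)


usp_send_email("a@example.com")
unused = never_run(["usp_send_email", "usp_annual_attendance_report"])
print(unused)  # ['usp_annual_attendance_report']
```

Anything still listed by never_run after the full cycle has passed is a candidate for removal, subject to the risk review described in this answer.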
If I'm writing something that will only be run annually, I would normally put the word annual in the sp name. That may not be the convention where you are, but the function of the sp should give you an idea of whether it is something that is only run periodically. I wouldn't expect a usp_send_email proc to run only once a year, but I might expect that a usp_attendance_report is not run often. Of course, as I said, I would have named it something more like usp_annual_attendance_report, and you can consider doing that sort of thing moving forward.
But be aware that any refactoring you do will have to take place on a long cycle to ensure that you don't delete something you need. If your code is in a source control system (and all database tables, sp, views, UDFs, triggers, etc should be), you can probably eliminate some things knowing that if they fail you can pretty instantly put them back. Again, I'd examine the object to determine the possible risk eliminating them would have.
Of course if you have good automated tests in place, eliminating something on dev and running the tests can help you find out if something is still being referenced.
If you are looking for an easy way to refactor, I don't know of one. Refactoring databases is a time-consuming, risky activity and one which may not show enough improvement for the powers that be to be willing to pay for it.
A good book on refactoring databases is: http://www.amazon.com/Refactoring-Databases-Evolutionary-Addison-Wesley-Signature/dp/0321293533
I'm not sure there is a magical limit for any of the things you mentioned. I prefer to keep things in one place so I don't have to remember that some records are in place and other records are in another.
I'd be more interested to know if all this work is impacting your performance? And if it's not then why change it? Unless it's impacting performance in some horrible way your customers won't see any benefit from your work and then what's the point?
Your customers might be better served if you just bought a new machine or upgraded your database server software.

What are the pros/cons of and best practices for using a single database?

Here at work (a multi-billion dollar manufacturing company with a 12 person Windows development team) we are about to go to a single master database for all new applications and will have it broken up with schemas for what we normally would have had databases for before. There will also be a few common schemas with stuff like employee directory and branch directory and so on...
I'm still not sure how I feel about this move, but we're about to have a meeting on this in a few hours to discuss pros, cons, best practices, pitfalls and so on... so I'm looking for your thoughts on this... Is it good? Is it bad? What problems are we going to run into a year from now?
Any thoughts, tips, or advice is welcome. Thanks
EDIT
In response to a comment on this question, we are using SQL Server 2005 and we are actually talking about moving what would have been separate databases on the same instance into a single database. The driving issue is the complete lack of referential integrity across databases, as the majority of our applications need access to common data such as an employee record or branch information.
UPDATE
Several people requested that I update this question with the results from our meeting so here it is. We debated back and forth the pros and cons of doing this (I even showed them this question using the projector) and by the time we were done we had pretty much covered the pros and cons covered here. About half of us thought we could get it done with the right resources and commitment, and about half thought we couldn't do it (or that it wouldn't work out well). We decided to use some time with Microsoft to get their thoughts and platform specific advice. I will be sure to update this question and my blog after we've talked to them. Thanks for all the help and helpful answers.
Larger databases are harder to maintain due to sheer size: backups take longer and disaster recovery is slower, which in turn requires more frequent backups. You can address these by creating filegroups and using filegroup-level backups in your maintenance plans, and on crash recovery you can use the 'piecemeal restore' strategy to speed things up.
Proper use of filegroups will make most of the 'cons' cited by previous replies go away: they can distribute the I/O, they can sanitize your maintenance plans and backup/restore strategy, and they offer availability by taking offline only the damaged portion of the db in case of a crash. So I'd say that while those 'cons' are legitimate concerns, they can be mitigated by a proper deployment strategy. It's true, though, that these mitigation actions require a truly experienced DBA at the helm, as they go beyond the comfort zone of a developer turned DBA by need.
Some of the pros I can think of quickly:
Consistency. You can have a backup-restore so that all data is consistent. Separate dbs don't allow this because you cannot coordinate a consistent set of backups unless you take them all offline, or make them r/o, during the backup.
Dirt cheap high availability: you can deploy database mirroring for disaster recoverability and high availability. Multiple databases have problems because one cannot coordinate a simultaneous failover, and apps are faced with the dilemma of finding each database's current location.
Security. While most other posts see one database as harder to secure, I'd say it is easier. Multiple databases seem harder to secure properly simply because what everyone does is make one login and add it to that database's db_owner role. Having one database will make things harder (unless you end up making everyone dbo, very bad), but once you start doing the right thing (granular access) then one db is not harder than multiple dbs; it is actually easier because you won't have to copy/maintain some common groups/rights across multiple dbs.
Control. It will be easier to impose certain policies and good practices on a single db rather than multiple ones (no data access for developers, app data access only through execute rights on the schema to enforce access through procedures, etc).
There are also some cons I did not see in other posts:
This will be much harder to pull off than you think right now
Increased coupling between formerly separated applications will impose development restrictions: you can't simply alter your schema, you will have to coordinate it with the rest of the apps (you can argue that this was also the case before, but was brushed under the carpet by having separate dbs, and you're right)
Log writes that are now distributed across multiple db logs will be consolidated into one single log file. If your writes are significant, this may turn out to be a serious bottleneck and force you to buy some expensive fast drives for the new, consolidated log file. In general this can be addressed by making the log drive a striped array across as many stripes as needed to make it fast enough (usually RAID 10).
GAM/SGAM/PFS allocations will also be consolidated, but again this will be alleviated by proper use of file groups.
Pros:
You only need to remember one connection string
When users report that access is slow, you know which DB is causing the trouble
Cons:
Backups of The One DB will take a long time and will get progressively longer over time.
Restoring data from a backup will get increasingly difficult.
Performance Tuning (SQL Profiler, Execution Plan estimation) for a feature for one app will slow down every app.
Restricting access to a single application's data is cumbersome, if possible at all, which will likely mean in practice that all devs and DBAs will be given keys to the ENTIRE kingdom.
New developers/DBAs have a much larger learning curve as they need to navigate a large and mostly useless (to them) database structure which means higher costs for training/ramp up.
When The One database goes down, everyone in your organization plays solitaire until it is restored.
Creating test instances for app development means copying your entire db
The only "Pro" I can think of is that all of your systems will be in the one database and therefore a single place to backup, store, etc. However, I would consider this to also be one of the biggest "Cons".
Some other general Cons:
Much harder to move an application to a different location/server in the future.
Possible locking issues if any applications make use of tempdb.
Possible unrelated performance degradation on one application when another application is being used.
Much harder to implement an application level security model if all tables are in the same database.
It sounds to me as though your company is transitioning between two completely distinct motives for using database technology. The first is application support. The second is data integration. If I'm right about this, the process will open up a huge can of worms, and many of the issues won't even be addressed by putting all the data in one big database.
Consider two of the points you made. The first is the complete lack of referential integrity across different databases. The second is the idea that each application will have its own schema. What this permits to happen is complete lack of referential integrity across schemas, putting you back in the quicksand you are in now.
Fixing the data so that referential integrity is present, and fixing the schemas so that referential integrity is enforced, and fixing the applications so that the applications agree with the new schemas will turn out to be a monumental task.
Here's what your company really needs to do: Have one single CONCEPTUAL database that contains all "enterprise data", and defined in such a way that both referential integrity and entity integrity are enforced. Revise existing schemas so that they conform to the CONCEPTUAL database except for data that is both purely local to that schema and undocumented in the unified conceptual database. Use constraints wherever needed to guarantee that the data covered by these schemas doesn't lose integrity.
Make the decision about whether these schemas belong in one database or many databases based on database administration, fail soft, security, and performance requirements and NOT on the need to integrate data. Whether you use one platform or multiple platforms is a separable decision.
Where necessary, maintain synchronized copies of the same data in separate databases. Include the overhead of doing this in your performance considerations above.
Document the conceptual database out the gazoo. Don't just settle for definitions of the FORM of data. Insist on definitions of the semantics of the data as well.
Notice that if you use ID fields instead of natural keys to enforce referential integrity, you will have to generate each ID field in one schema, and let the association between ID and dependent data propagate by means of synonyms, views, and synchronized replication.
This is not going to be easy.
As the DB gets bigger, backing it up gets more difficult because of its size.
This could mean a serious scalability problem if you want to add high-traffic applications in the future, since it is much easier to add new database servers which run separate dbs than it is to parallelize a single DB. At least in SQL Server.
Pros:
The convenience of having everything in one place
Thinking less about good database design
Cons:
Even unrelated things are in one place
Less thinking about good database design leading to poorly normalized data
To me this just sounds like laziness and a belief that all this "fancy ivory tower database stuff" is worthless.
I can see that being scary, but considering the number of businesses that use Oracle EBS, or SAP, or other systems that are, in essence, this same configuration, I don't see it being a Bad Thing™. It's a big move, and will be tough to get correct, but it can really improve integration across the enterprise in the long run.
I've never heard of this approach and would like to know how the meeting goes. I see no real benefit in combining multiple applications into a single database when the data doesn't relate to each other.
I'm thinking you might have issues if you decide that an application requires its own database server at some point.
Ah, the old EggsInOneBasket design pattern. It's not a favourite.
You're just compounding any problems caused by damage to that database. Spread the risk!
For the referential integrity issue, you can make copies of those shared tables in the subsidiary databases. You can't use real replication, but what you do is deny everything but select on these to most users.
On the same server, you can either push or pull data from the official repository of the master data and insert any new rows/update any changed rows. You can even do this with a trigger in the master database (I don't recommend it, though).
If it's different instances or servers, you can use linked servers or SSIS.
You can put the common data into a "core" schema in each database. Then you can have tools to check that all your core tables in every subsidiary database are consistent. The worst that can happen is that an application is not seeing a new employee because the core isn't updated. And keeping your databases separate gives you the ability to decouple and gives you maintenance windows. (You can even decouple and run "standalone" if your master is down for maintenance).
I expect you'll only be seeing a few dozen of these core entity tables in even a largish enterprise.
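A tool to check that the core tables are consistent, as suggested above, could look something like this minimal sketch. It uses Python with in-memory SQLite databases in place of the master and subsidiary SQL Server databases; the core_employee table, its columns and the sample rows are invented for the illustration.

```python
import sqlite3


def make_db(rows):
    """Create a throwaway database holding one hypothetical core table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE core_employee (id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    db.executemany("INSERT INTO core_employee VALUES (?, ?)", rows)
    return db


def core_diff(master, copy, table="core_employee"):
    """Compare a core table in the master with its read-only copy.

    A row that changed in the master shows up in both lists: the new
    version as missing, the stale version as extra.
    """
    query = f"SELECT id, name FROM {table}"
    master_rows = set(master.execute(query))
    copy_rows = set(copy.execute(query))
    return {
        "missing_in_copy": sorted(master_rows - copy_rows),
        "extra_in_copy": sorted(copy_rows - master_rows),
    }


master = make_db([(1, "Ann"), (2, "Bob"), (3, "Cy")])
orders_db = make_db([(1, "Ann"), (2, "Bob")])  # copy has not seen employee 3 yet

diff = core_diff(master, orders_db)
print(diff)  # {'missing_in_copy': [(3, 'Cy')], 'extra_in_copy': []}
```

Pulling whole tables into memory is only reasonable because these core entity tables are expected to be few and small; for anything larger a checksum or timestamp comparison would be a better fit.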
There are many other ways to solve the referential integrity (RI) issue. I am not as familiar with SQL Server as other DBs. In Informix you can use synonyms to point to objects in other DBs and use these for your RI. In Oracle you can make DB links to one or more DBs to accomplish the same thing.
These approaches have the issue that if any of the DBs are down, the RI will fail, causing issues in the dependent DBs. Selects would work, but inserts would fail.
Consolidation can be a good idea, depending upon the size of the schemas and other issues with scalability. SQL Server has serious scalability issues. Other DB platforms allow horizontal scaling with either a share everything approach (Oracle's RAC, latest Informix release) or a partitioned share nothing approach (DB2's DPF, Informix XPS, Netezza, Teradata).
I am with some of the others here interested to hear the results of your meeting.

What are the advantages of using a single database for EACH client?

In a database-centric application that is designed for multiple clients, I've always thought it was "better" to use a single database for ALL clients - associating records with proper indexes and keys. In listening to the Stack Overflow podcast, I heard Joel mention that FogBugz uses one database per client (so if there were 1000 clients, there would be 1000 databases). What are the advantages of using this architecture?
I understand that for some projects, clients need direct access to all of their data - in such an application, it's obvious that each client needs their own database. However, for projects where a client does not need to access the database directly, are there any advantages to using one database per client? It seems that in terms of flexibility, it's much simpler to use a single database with a single copy of the tables. It's easier to add new features, it's easier to create reports, and it's just easier to manage.
I was pretty confident in the "one database for all clients" method until I heard Joel (an experienced developer) mention that his software uses a different approach -- and I'm a little confused with his decision...
I've heard people cite that databases slow down with a large number of records, but any relational database with some merit isn't going to have that problem - especially if proper indexes and keys are used.
Any input is greatly appreciated!
Assume there's no scaling penalty for storing all the clients in one database; for most people, and well configured databases/queries, this will be fairly true these days. If you're not one of these people, well, then the benefit of a single database is obvious.
In this situation, benefits come from the encapsulation of each client. From the code perspective, each client exists in isolation - there is no possible situation in which a database update might overwrite, corrupt, retrieve or alter data belonging to another client. This also simplifies the model, as you don't need to ever consider the fact that records might belong to another client.
You also get benefits of separability - it's trivial to pull out the data associated with a given client, and move it to a different server. Or restore a backup of that client when they call up to say "We've deleted some key data!", using the builtin database mechanisms.
You get easy and free server mobility - if you outscale one database server, you can just host new clients on another server. If they were all in one database, you'd need to either get beefier hardware, or run the database over multiple machines.
You get easy versioning - if one client wants to stay on software version 1.0, and another wants 2.0, where 1.0 and 2.0 use different database schemas, there's no problem - you can migrate one without having to pull them out of one database.
I can think of a few dozen more, I guess. But all in all, the key concept is "simplicity". The product manages one client, and thus one database. There is never any complexity from the "But the database also contains other clients" issue. It fits the mental model of the user, where they exist alone. Advantages like being able to do easy reporting on all clients at once are minimal - how often do you want a report on the whole world, rather than just one client?
Here's one approach that I've seen before:
Each customer has a unique connection string stored in a master customer database.
The database is designed so that everything is segmented by CustomerID, even if there is a single customer on a database.
Scripts are created to migrate all customer data to a new database if needed, and then only that customer's connection string needs to be updated to point to the new location.
This allows for using a single database at first, and then easily segmenting later on once you've got a large number of clients, or more commonly when you have a couple of customers that overuse the system.
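The approach above might be sketched like this. It is a rough illustration only: the master customer database is an in-memory SQLite database, and the customer table, its columns and the connection-string values are placeholders invented for the example.

```python
import sqlite3

# Master customer database: one row per customer with its connection string.
catalog = sqlite3.connect(":memory:")
catalog.execute(
    "CREATE TABLE customer ("
    " customer_id INTEGER PRIMARY KEY,"
    " conn_str TEXT NOT NULL)"
)
# Everyone starts out on the same shared database.
catalog.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "shared.db"), (2, "shared.db")],
)


def connection_string_for(customer_id):
    """Look up where this customer's data currently lives."""
    row = catalog.execute(
        "SELECT conn_str FROM customer WHERE customer_id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        raise KeyError(f"unknown customer {customer_id}")
    return row[0]


def move_customer(customer_id, new_conn_str):
    """Repoint one customer after the migration scripts have copied its data."""
    with catalog:  # commits on success, rolls back on error
        catalog.execute(
            "UPDATE customer SET conn_str = ? WHERE customer_id = ?",
            (new_conn_str, customer_id),
        )


move_customer(2, "dedicated_2.db")  # customer 2 outgrew the shared database
print(connection_string_for(1), connection_string_for(2))
```

Because every table is already segmented by CustomerID, the application code does not change when a customer moves; only the row in the master database does.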
I've found that restoring specific customer data is really tough when all the data is in the same database, but managing upgrades is much simpler.
When using a single database per customer, you run into a huge problem of keeping all customers running at the same schema version, and that doesn't even consider backup jobs on a whole bunch of customer-specific databases. Naturally restoring data is easier, but if you make sure not to permanently delete records (just mark with a deleted flag or move to an archive table), then you have less need for database restore in the first place.
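The "deleted flag" idea mentioned above can be illustrated with a small sketch. It uses Python and in-memory SQLite rather than SQL Server, and the orders table and is_deleted column are names assumed for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL,"
    " is_deleted INTEGER NOT NULL DEFAULT 0)"
)
db.executemany(
    "INSERT INTO orders (id, customer_id) VALUES (?, ?)", [(1, 10), (2, 10)]
)


def soft_delete(order_id):
    """Mark the row as deleted instead of removing it."""
    db.execute("UPDATE orders SET is_deleted = 1 WHERE id = ?", (order_id,))


def undelete(order_id):
    """Recover a row for one customer without restoring any backup."""
    db.execute("UPDATE orders SET is_deleted = 0 WHERE id = ?", (order_id,))


def live_orders(customer_id):
    """Every normal query has to filter on the flag."""
    return [row[0] for row in db.execute(
        "SELECT id FROM orders"
        " WHERE customer_id = ? AND is_deleted = 0 ORDER BY id",
        (customer_id,),
    )]


soft_delete(2)
after_delete = live_orders(10)    # [1]
undelete(2)
after_undelete = live_orders(10)  # [1, 2]
print(after_delete, after_undelete)
```

The cost is that every query must remember the is_deleted filter, which is usually hidden behind a view or the data access layer.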
To keep it simple. You can be sure that your client is only seeing their data. The client with fewer records doesn't have to pay the penalty of having to compete with hundreds of thousands of records that may be in the database but are not theirs. I don't care how well everything is indexed and optimized, there will be queries that determine that they have to scan every record.
Well, what if one of your clients tells you to restore to an earlier version of their data due to some botched import job or similar? Imagine how your clients would feel if you told them "you can't do that, since your data is shared between all our clients" or "Sorry, but your changes were lost because client X demanded a restore of the database".
As for the pain of upgrading 1000 database servers at once, some fairly simple automation should take care of that. As long as each database maintains an identical schema, then it won't really be an issue. We also use the database per client approach, and it works well for us.
Here is an article on this exact topic (yes, it is MSDN, but it is a technology independent article): http://msdn.microsoft.com/en-us/library/aa479086.aspx.
Another discussion of multi-tenancy as it relates to your data model here: http://www.ayende.com/Blog/archive/2008/08/07/Multi-Tenancy--The-Physical-Data-Model.aspx
Scalability. Security. Our company uses the 1 DB per customer approach as well. It also makes code a bit easier to maintain.
In regulated industries such as health care it may be a requirement of one database per customer, possibly even a separate database server.
The simple answer to updating multiple databases when you upgrade is to do the upgrade as a transaction, and take a snapshot before upgrading if necessary. If you are running your operations well then you should be able to apply the upgrade to any number of databases.
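A minimal sketch of that kind of automation, assuming each client database records its own schema version: the upgrade runs inside a transaction per database, so a failure in one database rolls back cleanly and leaves it on the old version. In-memory SQLite databases stand in for the per-client databases, and the schema_version table, the tenant names and the migration statements are all made up for the illustration.

```python
import sqlite3


def new_tenant_db():
    """Create a throwaway per-client database at schema version 1."""
    # isolation_level=None: we issue BEGIN/COMMIT/ROLLBACK ourselves.
    db = sqlite3.connect(":memory:", isolation_level=None)
    db.execute("CREATE TABLE schema_version (version INTEGER NOT NULL)")
    db.execute("INSERT INTO schema_version VALUES (1)")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    return db


def version_of(db):
    return db.execute("SELECT version FROM schema_version").fetchone()[0]


def upgrade(db, statements, target_version):
    """Apply all statements in one transaction; return True on success."""
    if version_of(db) >= target_version:
        return True  # already upgraded
    try:
        db.execute("BEGIN")
        for sql in statements:
            db.execute(sql)
        db.execute("UPDATE schema_version SET version = ?", (target_version,))
        db.execute("COMMIT")
        return True
    except sqlite3.Error:
        if db.in_transaction:
            db.execute("ROLLBACK")  # undo any partial work
        return False


tenants = {"acme": new_tenant_db(), "globex": new_tenant_db()}
# Simulate schema drift: someone hand-edited one client's database.
tenants["globex"].execute("ALTER TABLE orders ADD COLUMN note TEXT")

v2 = [
    "CREATE TABLE order_notes (order_id INTEGER, note TEXT)",
    "ALTER TABLE orders ADD COLUMN note TEXT",  # fails where the column exists
]
results = {name: upgrade(db, v2, 2) for name, db in tenants.items()}
versions = {name: version_of(db) for name, db in tenants.items()}
print(results, versions)
```

The drifted database fails on the second statement, so the first statement is rolled back as well and the database stays at version 1, ready for a fix and a retry. This is also why the identical-schema condition mentioned earlier matters so much.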
Clustering is not really a solution to the problem of indices and full table scans. If you move to a cluster, very little changes. If you have many smaller databases to distribute over multiple machines you can do this more cheaply without a cluster. Reliability and availability are considerations but can be dealt with in other ways (some people will still need a cluster but the majority probably don't).
I'd be interested in hearing a little more context from you on this because clustering is not a simple topic and is expensive to implement in the RDBMS world. There is a lot of talk/bravado about clustering in the non-relational world (Google Bigtable etc.), but they are solving a different set of problems, and lose some of the useful features of an RDBMS.
There are a couple of meanings of "database"
the hardware box
the running software (e.g. "the oracle")
the particular set of data files
the particular login or schema
It's likely Joel means one of the lower layers. In this case, it's just a matter of software configuration management... you don't have to patch 1000 software servers to fix a security bug, for example.
I think it's a good idea, so that a software bug doesn't leak information across clients. Imagine the case with an errant where clause that showed me your customer data as well as my own.
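To make the "errant where clause" case concrete, here is a small hypothetical example in Python with in-memory SQLite. The tickets table and its data are invented; the bug is a missing pair of parentheses, which is easy to write because AND binds more tightly than OR in SQL.

```python
import sqlite3

# One shared table holding rows for every customer.
db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE tickets ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER,"
    " status TEXT,"
    " priority TEXT)"
)
db.executemany("INSERT INTO tickets VALUES (?, ?, ?, ?)", [
    (1, 1, "open", "low"),
    (2, 1, "closed", "high"),
    (3, 2, "open", "high"),   # belongs to a different customer
])

# Parsed as (customer_id = ? AND status = 'open') OR priority = 'high'.
BUGGY = ("SELECT id FROM tickets"
         " WHERE customer_id = ? AND status = 'open' OR priority = 'high'"
         " ORDER BY id")
FIXED = ("SELECT id FROM tickets"
         " WHERE customer_id = ? AND (status = 'open' OR priority = 'high')"
         " ORDER BY id")


def ids(sql, customer_id):
    return [row[0] for row in db.execute(sql, (customer_id,))]


leaky = ids(BUGGY, 1)  # [1, 2, 3] - ticket 3 is customer 2's data
safe = ids(FIXED, 1)   # [1, 2]
print(leaky, safe)
```

With one database per client, the buggy query would still return the wrong tickets, but only ever the wrong tickets of the same client.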

Resources