Note: I have no intention of implementing this; it's more of a thought experiment.
Suppose I had multiple services available through a web interface, at least two of which required user registration and some data in a database. A single registration would grant access to all services, much like a Google account grants access to GMail, Google Docs, etc.
Would all of these services, which relate to registered users, live within a single database, perhaps with table prefixes indicating which service each table belongs to?
Or would each service have its own database? The only plus I can see to doing this is that it would make table names cleaner. The downside is that any user interaction would require touching at least two different databases, which would needlessly complicate the SQL queries.
Would this suggest that the 'big boys' use only a single database, and load it with tons of different (and perhaps completely unrelated) tables?
If you use the right DBMS, you can have the best of both strategies. In PostgreSQL, within a 'database' you can have separate schemas. The authentication service would access a single schema and provide the other services a key which is used as a reference for data in the other schemas. You can still deal with the entire database as a single entity (see the sketch after this list), i.e.:
query across schemas without using dblink
store personally identifiable information separately (schemas can have separate per-user permissions to further protect data)
DBMS-managed foreign key constraints (I believe)
consistent (re the data) backup and restore
You get these advantages at the cost of a more complex DAL (schemas may not be supported by your favorite DAL framework) and less portability between DBMSs.
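For illustration, a minimal PostgreSQL sketch of this layout; the schema, table, column, and role names are hypothetical:

    -- One schema per concern inside a single database
    CREATE SCHEMA auth;
    CREATE SCHEMA mail;

    CREATE TABLE auth.users (
        user_id  bigserial PRIMARY KEY,
        email    text NOT NULL UNIQUE
    );

    -- A service table referencing the shared auth schema;
    -- the foreign key works across schemas without dblink
    CREATE TABLE mail.mailboxes (
        mailbox_id bigserial PRIMARY KEY,
        user_id    bigint NOT NULL REFERENCES auth.users (user_id),
        quota_mb   integer NOT NULL DEFAULT 1024
    );

    -- A single query can join across schemas
    SELECT u.email, m.quota_mb
    FROM auth.users u
    JOIN mail.mailboxes m ON m.user_id = u.user_id;

    -- Per-schema permissions: a service role that can use its own
    -- schema but is never granted access to the auth schema's internals
    CREATE ROLE mail_service LOGIN;
    GRANT USAGE ON SCHEMA mail TO mail_service;
    GRANT SELECT, INSERT, UPDATE, DELETE ON ALL TABLES IN SCHEMA mail TO mail_service;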
I do not think it is a good idea to make multiple services dependent on a single database. If you need to restore some service from a backup, you'll have to restore all.
You would probably also be overloading a single database server.
I would do that only if it is likely they will share a lot of data at some future point.
You might also consider a smaller database holding only the shared user data.
I would consider having one user/role repository with separate databases for the services.
I've never done this, but I think it would depend on performance. If there's almost no overhead to using separate databases, that might be the answer. Separate DBs may also make it easier to split them across machines.
Complexity is also an issue. Hopefully your schema would be defined in such a way that you wouldn't need to dip into several different databases for different queries.
There's always a risk of overloading databases and access to them; replication is one potentially good solution.
There are several strategies.
When you move to multiple databases (or multiple servers), things get more complex. Your core user information could be in a single database. The individual services could be in other databases. The problem with that is that the database is the outer unit of referential integrity, so you cannot design in foreign keys across databases. One way around this is to distribute changes to the core master tables (additions and updates only, obviously, since deletions would be forbidden due to a foreign-key constraint) to separate databases on a regular basis, and then enforce RI against these copies of the core master database tables within the service databases. This also means that the service databases and their services can run while the other databases are down for maintenance. Obviously this is an increased architectural complexity for an improvement to your service windows and reduced coupling.
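A rough T-SQL sketch of that pattern, with all database, table, and column names hypothetical: the service database keeps a local copy of the core users table, refreshed periodically, and enforces its foreign keys against that copy.

    -- In the service database: a local copy of the core master table
    CREATE TABLE dbo.CoreUsers (
        UserId   int           NOT NULL PRIMARY KEY,
        Email    nvarchar(256) NOT NULL
    );

    -- Service tables enforce RI against the local copy
    CREATE TABLE dbo.Orders (
        OrderId  int IDENTITY PRIMARY KEY,
        UserId   int NOT NULL REFERENCES dbo.CoreUsers (UserId),
        Placed   datetime2 NOT NULL DEFAULT SYSUTCDATETIME()
    );

    -- Periodic refresh job: pull additions and updates from the core database
    -- (deletes are not propagated, matching the "no deletions" rule above)
    MERGE dbo.CoreUsers AS target
    USING CoreDb.dbo.Users AS source
        ON target.UserId = source.UserId
    WHEN MATCHED AND target.Email <> source.Email
        THEN UPDATE SET Email = source.Email
    WHEN NOT MATCHED BY TARGET
        THEN INSERT (UserId, Email) VALUES (source.UserId, source.Email);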
I would recommend starting with a single database. If your RDBMS supports it, I would organize components according to SCHEMAs which would allow you to at least maintain a logical separation by design. You can more easily refactor later.
Many databases have tables which can be considered unrelated. Sometimes in a system you have multiple entity networks that hardly connect (sometimes not at all). You can use SCHEMAs in these cases too.
Related
We've designed a multitenant system (let's say hundreds of tenants, not thousands). There is no shared data. The database is PostgreSQL. Is it better to create a separate database per tenant, or a separate schema per tenant?
What are the pros and cons? What is the impact on the filesystem and on the DB engine's tables/views (locks, object privileges, etc.) - will they be much bigger in a multiple-schema solution? Separate databases should be easier to back up and restore.
I know there are lots of similar questions, but most relate to cases with shared data, which is a major drawback to multiple databases, and we do not have such a requirement.
If you never need to use tables from multiple tenants in a single query, I'd go for separate databases.
The raw speed of queries is not really affected by this and neither is the impact on the filesystem or memory.
However, the query optimizer tends to get slower with many schemas and many tables (but we are talking hundreds of thousands here). Tab-completion in psql is also not as efficient in that case.
It's also a tad easier to use pg_dump/pg_restore with separated databases than with separated schemas.
But it's a blurry line and the actual answer can very well be based on personal opinion and preferences.
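To make the schema-per-tenant option concrete, a minimal PostgreSQL sketch with hypothetical tenant and table names:

    -- One schema per tenant, identical table structure in each
    CREATE SCHEMA tenant_acme;
    CREATE TABLE tenant_acme.invoices (
        invoice_id bigserial PRIMARY KEY,
        amount     numeric(12,2) NOT NULL
    );

    CREATE SCHEMA tenant_globex;
    CREATE TABLE tenant_globex.invoices (LIKE tenant_acme.invoices INCLUDING ALL);

    -- The application picks the tenant per connection/session
    SET search_path TO tenant_acme;
    SELECT count(*) FROM invoices;   -- resolves to tenant_acme.invoices

    -- Dumping a single tenant is still possible, just less direct than
    -- dumping a whole database:  pg_dump --schema=tenant_acme mydb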
I have a large application which contains "modules" such as Finance, HR, Sales, Customer Service.
To make the application manageable and to distribute the load, I have decided to give each module its own database on a single server. There is also going to be a Master database for holding master information such as information about users, some global lookup tables, and some security stuff.
I am now trying to decide whether to place module-specific stored procedures in their corresponding databases, or whether to keep them all in the Master database. For example, there is a stored procedure named dbo.sales_customer_orders that selects data only from the Sales database tables, and of course this SP is going to be executed many times by users. Should it be in the Sales database, or is it okay to keep it in the Master database in terms of performance/scalability/reliability/security?
Does it matter that a stored procedure resides in a different database from the one it selects from?
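For reference, the cross-database variant might look something like this in T-SQL; the database names (MasterDb, Sales), columns, and parameter are hypothetical, with only the procedure name taken from the question:

    USE MasterDb;
    GO
    CREATE PROCEDURE dbo.sales_customer_orders
        @CustomerId int
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Three-part name: Database.Schema.Table, reaching into the Sales database
        SELECT o.OrderId, o.OrderDate, o.Total
        FROM Sales.dbo.Orders AS o
        WHERE o.CustomerId = @CustomerId;
    END;
    GO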
In my experience you would not see an immediate performance penalty from sharding the data across multiple databases, and this is actually a common practice in large n-tier applications. You would obviously see some minor penalty upon moving the databases to different servers.
You could look at this blog post, as well as several others on the site, which talk about the correct way to shard data and the importance of using separate connection strings for reads and writes to facilitate scaling and possibly a caching layer later on.
How do you actually plan to develop all these databases? If you want to use SSDT, you will be drowning in cross-database dependencies. Besides, having the procedure in question reside in the head database makes no sense if, for example, a particular customer decides not to buy the Sales module (so there is no Sales database anywhere around). In that case, calling it will lead to some very unpleasant and unexpected consequences, such as the batch being aborted and (possibly) a transaction left open.
Keep similar things together; otherwise, there will be no modularity in your approach.
Performance-wise, there is usually no difference for cross-database calls within the same SQL Server instance. If your shards are located on different instances, however, the result might be anywhere from "slightly noticeable" to "detrimental" - it depends on many factors, and not all of them can be mitigated by a DBA.
I am administering a rather large database that has grown in complexity and design from a single application database. Now there is a plan to add a fifth application that carries with it its own schema and specific data. I have been researching SSO solutions but that is not really what I am after. My goal is to have one point of customer registration, logins and authorization.
Ideally, each application would request authentication and be given authorization to multiple applications, where the applications would then connect to the appropriate database for operations. I do not have first hand experience dealing with this degree of separation as the one database has been churning flawlessly for years. Any best practice papers would be appreciated :)
I would envision a core database that maintained shared data - Customer/Company/Products
Core tables and primary keys – To maintain referential integrity, should I have a smaller replicated table in each "application" database? What are some ways to share keys among the various databases and ensure referential integrity?
Replication – Two subscribers currently pull data from the production database where data is later batched into a DW solution for reporting. Am I going down a road that can lead to frustration?
Data integrity – How can I ensure, for example, that:
DATABASE_X.PREFERENCES.USER_ID always references a valid CORE_DATABASE.USERS.USER_ID? (See the sketch after this list.)
Reporting – What kind of hurdles would I face replicating/transforming data from multiple databases into one reporting database?
White Papers - Can anyone point me to good references to this strategy in practice?
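Since SQL Server will not enforce a foreign key across databases, one low-tech safety net (not a full answer to the question) is a scheduled check that reports orphaned rows. A sketch using the table names from the question and assuming the dbo schema:

    -- Run on a schedule (e.g. a SQL Agent job).
    -- Any rows returned are PREFERENCES entries whose USER_ID no longer
    -- exists in the core database and needs attention.
    SELECT p.USER_ID, COUNT(*) AS orphaned_rows
    FROM DATABASE_X.dbo.PREFERENCES AS p
    WHERE NOT EXISTS (
        SELECT 1
        FROM CORE_DATABASE.dbo.USERS AS u
        WHERE u.USER_ID = p.USER_ID
    )
    GROUP BY p.USER_ID;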
A few URLs for you. Scale-out implementations can vary wildly to suit requirements, but hopefully these can help you.
http://blogs.msdn.com/b/sqlcat/archive/2008/06/12/sql-server-scale-out.aspx
This one is 2005-centric but is VERY good:
http://msdn.microsoft.com/en-us/library/aa479364.aspx#scaloutsql_topic4
This one is a good solution for reporting:
http://msdn.microsoft.com/en-us/library/ms345584.aspx
And I've given you an Analysis Services one too :)
http://sqlcat.com/whitepapers/archive/2010/06/08/scale-out-querying-for-analysis-services-with-read-only-databases.aspx
I created something like this a few years ago, using views and stored procedures to bring the data from the Master database into the subordinate databases. That lets you fairly easily join those master tables against the subordinate tables.
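A minimal sketch of that view-based approach; the database, view, table, and column names are hypothetical:

    -- In the subordinate (application) database: a view over the core table.
    -- Local queries can then join dbo.CoreUsers as if it were a local table.
    CREATE VIEW dbo.CoreUsers
    AS
    SELECT USER_ID, EMAIL
    FROM CORE_DATABASE.dbo.USERS;
    GO

    -- Example join from a subordinate table to the master data
    SELECT p.USER_ID, u.EMAIL
    FROM dbo.PREFERENCES AS p
    JOIN dbo.CoreUsers AS u ON u.USER_ID = p.USER_ID;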
Have you looked into using RAC? You can have multiple physical databases but only one logical database. This would solve all of your integrity issues. And you can set aside nodes just for reporting.
Don't throw out the idea of having separate applications and linking the logon/logoff functions via web-service-esque requests. I have seen billing/user-registration systems separated this way, though at extremely large scale this might not be a good idea.
I was looking at godaddy.com, which says they offer up to 10 MySQL DBs, but I don't know why you would ever need more than one, since a DB can have multiple tables. Can't multiple DBs be integrated into a single DB? Is there an example case where it's better, or even necessary, to have multiple ones? And how do you differentiate between them when you want to query them - by their directory or by name?
I guess separation of concerns would be the most obvious answer. Just as you shouldn't put all of your functionality in one humongous class in object-oriented programming, it's a good idea to keep unrelated information separate. It's easier to wrap your head around smaller chunks of data, and future developers might otherwise start to think unrelated tables are related and aggregate data in ways they were never meant to.
Imagine that you're doing two different projects with two different teams. Maybe you don't want one team to access the other team's tables.
There can also be a space limit on each database, and each one can be configured with specific parameters to optimize performance.
On the other hand, two different end users can each be assigned to back up one entire database, and you don't want one user to be able to back up the other DB, because they could then restore it somewhere else and access the other database's data.
I'm sure there are some pretty good DBAs on the forum who can answer this in detail.
Storing tables in different databases makes sense because you are able to back them up individually. Furthermore, you will be able to control access to each database under different NT groups (e.g. admins vs. users). Although this can be done at the individual table level, sometimes it makes sense to grant or deny access to an entire database for a particular group.
When you need to query them in SQL Server, you append the database name to the table reference, like this: SELECT * FROM [MyDatabase].[dbo].[MyTable].
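As a sketch of the database-level access control mentioned above, assuming a hypothetical Windows group and database name:

    -- Server level: create a login for the Windows (NT) group
    CREATE LOGIN [MYDOMAIN\ReportUsers] FROM WINDOWS;

    USE MyDatabase;
    GO
    -- Database level: map the group to a database user...
    CREATE USER [MYDOMAIN\ReportUsers] FOR LOGIN [MYDOMAIN\ReportUsers];
    -- ...and give the whole group read access to everything in this database
    ALTER ROLE db_datareader ADD MEMBER [MYDOMAIN\ReportUsers];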
One other reason to use separate databases relates to whether you need full transactional recovery or not. For instance, if I have a bunch of tables that are populated on a schedule through import processes and never by the users, putting them in a separate database allows me to set the recovery model to simple, which reduces the logging (a good thing when you are loading millions of records at once). I also don't need to run a transaction log backup every fifteen minutes the way I do for the database with the user-entered data. It can also make recovery faster when needed, since the databases are smaller and thus individually take less time to restore. That won't help much when the whole server crashes, but it could help a lot if only one database gets corrupted for some reason.

If the data relates to different applications, keeping it in separate databases simplifies the security as well. And of course sometimes we have commercial databases that we can't add tables to, so we may need a separate database to handle the things we want to add to that data. We do this, for instance, with our project management software: we have a separate database where we extract and summarize data from the PM system for reporting, and then write all our custom reports off that.
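A rough T-SQL sketch of that setup, with hypothetical database names and backup paths:

    -- The bulk-load/import database: minimal logging, no log backups needed
    ALTER DATABASE ImportStaging SET RECOVERY SIMPLE;

    -- The user-facing database: full recovery, point-in-time restore possible
    ALTER DATABASE UserData SET RECOVERY FULL;
    BACKUP DATABASE UserData TO DISK = N'D:\Backups\UserData.bak';
    -- Scheduled every fifteen minutes (e.g. via SQL Agent):
    BACKUP LOG UserData TO DISK = N'D:\Backups\UserData_log.trn';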
I am working with a half dozen DBs. The DBs all have the same schemas, the same SPs, etc. Speaking to the person who originally designed them, a big part of the motivation for using many DBs was efficiency; the alternative would have been to add a column to pretty much every table and SP in the database indicating which set of data was being worked on, resulting in one giant (and thus slower) DB instead of several small DBs. In place of a column indicating which set of data is being queried, the connection string is used to select which database is hit.
The only reason I really dislike this organization is that it involves a lot of code duplication and thus hurts maintenance. For example, every time I wish to change a stored procedure, I need to run the alter statement on every database.
One solution I have considered is to combine all of the data into one big database, adding an extra column all over the place to indicate which database the data would be in if I had not combined it. Then, I could partition all of the tables by this column's value. In theory, the result of all of this is that the underlying representation of the data itself will be essentially the same as it is now, but without the redundancies in the indexes, schemas, SPs, etc.
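A rough T-SQL sketch of that consolidation, assuming SQL Server and using hypothetical names: a SourceDb discriminator column plus a partition function/scheme keyed on it.

    -- Partition the consolidated tables by the new discriminator column
    CREATE PARTITION FUNCTION pfSourceDb (tinyint)
        AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5);   -- six partitions, one per former DB

    CREATE PARTITION SCHEME psSourceDb
        AS PARTITION pfSourceDb ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Orders (
        SourceDb tinyint NOT NULL,                  -- which former database this row came from
        OrderId  int     NOT NULL,
        Total    money   NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY (SourceDb, OrderId)
    ) ON psSourceDb (SourceDb);

    -- Queries filter on the discriminator instead of relying on the connection string
    SELECT OrderId, Total
    FROM dbo.Orders
    WHERE SourceDb = 3;     -- partition elimination keeps this to one partition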
My questions are these:
Is this a good idea? Is there a better way to accomplish this?
Are there any gotchas in doing this?
Will this have any impact on performance?
Everyone will deal with this at some point. My own personal opinion is that multiple databases are a pain in the backside and are not faster. They are a pain because of the maintenance headaches. Adding an extra column in each table as necessary will not slow your process down that much, if indexing is set properly. And your maintenance will be much easier. Plus, doing transactions across multiple DBs can be a hassle and involve MSDTC.
BTW, using a single database this way is often called a multi-tenant database. You might want to research this a bit. But I would avoid multiple DBs like this if possible.
I'm of a different mind than Randy.
The multi-tenant model has its advantages.
For one, maintenance is not really much different whether you have 5 databases or 500. At some point you stop looking at maintenance of individual databases and look at the set. Yes, you must serialize backups, and you can't perform index reorgs/rebuilds across all databases at once.
But for code changes across multiple more-or-less identical databases, there are easy ways to script a lot of things to be done to multiple databases without really lifting an extra finger. I use a tool called SQLFarms Combine (now sold by JNetDirect), but there are other offerings such as RedGate MultiScript that I haven't played with.
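If you don't want a third-party tool, one way to script a change against every tenant database from plain T-SQL looks roughly like this (assumes SQL Server 2016 SP1+ for CREATE OR ALTER; the Tenant_% naming pattern and the procedure body are hypothetical):

    -- Apply the same change to every tenant database on the instance
    DECLARE @db  sysname, @proc nvarchar(300);
    DECLARE @change nvarchar(max) = N'
        CREATE OR ALTER PROCEDURE dbo.GetRowCounts
        AS
        SELECT t.name, SUM(p.rows) AS row_count
        FROM sys.tables t
        JOIN sys.partitions p ON p.object_id = t.object_id AND p.index_id IN (0, 1)
        GROUP BY t.name;';

    DECLARE dbs CURSOR LOCAL FAST_FORWARD FOR
        SELECT name FROM sys.databases WHERE name LIKE N'Tenant_%';

    OPEN dbs;
    FETCH NEXT FROM dbs INTO @db;
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- Run the change in the context of each tenant database
        SET @proc = QUOTENAME(@db) + N'.sys.sp_executesql';
        EXEC @proc @change;
        FETCH NEXT FROM dbs INTO @db;
    END
    CLOSE dbs;
    DEALLOCATE dbs;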
What I like most about the multi-tenant model is that when you grow and scale and suddenly need a new database server, it is very easy to move one of the tenants (say, the busiest or fastest growing) to the new server. If everybody is jammed into the same database, extracting only their data becomes quite difficult, especially if downtime is to be minimized. In the multi-tenant model, you can set up mirroring for just their database, and then switch the primary when you're ready.
I'd be in favor of combining these databases. There are facilities built into SQL Server to account for the potential performance downsides of a very large database, like additional indexing on a second physical disk, partitioning, clustering, etc. The headache and overhead involved in deploying schema updates to that many different databases can be time-consuming, when it's easily handled in a single database. I think SQL Server scales really well in cases like this - let the database server do what it's designed to do and provide responsive access to your data. You can focus on application design and leave the storage model to SQL Server.
Also, though this isn't mentioned above, I suspect there's some level of dynamic SQL involved in applications that use this "many database" model, because you've got to switch between databases based on something you only know at runtime, so it can't be hard-coded into the application or a configuration file. That means either connection strings or actual SQL statements have to be generated on the fly, which can be a really big security risk (read about "SQL injection" if you're unfamiliar with the potential risks of dynamic SQL).