Understanding "multiple, logically interrelated databases" in the context of distributed databases? - database

From the following definition:
A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
Sometimes "distributed database system" is used to refer jointly to the distributed database and the distributed DBMS.
I do not understand the phrase "multiple, logically interrelated databases". I have heard of tables being logically related, as in "relational" databases.
Can anyone please give a simple yet clear example of "multiple, logically interrelated databases"?

The databases would be logically related, but not actually related in the way you think of tables being related (foreign keys).
One way of doing this is to put some tables from your schema into one database and other tables into another database. For instance, you might put your read-heavy data into a database optimized for reading and your write-heavy data into a database optimized for writing. Those tables might still be logically related, but you wouldn't be able to use foreign keys, since they are in different databases.
Another way of doing this would be to have a single table split across multiple databases. For instance, if you have a large site with an international presence and several data centers around the world, you might have a users table that is partitioned across those databases, with users from a given country residing in the users table on the database closest to them geographically.
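As a rough illustration of that second point, here is a minimal sketch assuming PostgreSQL with the postgres_fdw extension; the host name, database names, table and column names are all made up, and a real geographically partitioned users table would need more machinery (write routing, replication, etc.):

    -- Local portion of the logically single "users" table.
    CREATE TABLE users (
        user_id bigint PRIMARY KEY,
        country text,
        email   text
    );

    -- Make the physically separate European database reachable.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER eu_site
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'db-eu.example.com', dbname 'app_eu');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER eu_site
        OPTIONS (user 'app', password 'secret');

    -- Remote portion of the same logical users table.
    CREATE FOREIGN TABLE users_eu (
        user_id bigint,
        country text,
        email   text
    ) SERVER eu_site OPTIONS (schema_name 'public', table_name 'users');

    -- One logical result set drawn from two physical databases.
    SELECT * FROM users
    UNION ALL
    SELECT * FROM users_eu;

The two databases stay logically interrelated (they hold pieces of the same users data) even though no foreign key or shared storage ties them together.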

Related

postgresql multitenant schemas vs databases

We've designed a multitenant system (let's say hundreds of tenants, not thousands). There is no shared data. The database is PostgreSQL. Is it better to create a separate database per tenant, or a separate schema per tenant?
What are the pros and cons? What is the impact on the filesystem and on DB engine tables/views such as locks, object privileges, etc. - will they be much bigger in a multiple-schema solution? Separate databases should be easier to back up/restore.
I know there are lots of similar questions, but most relate to cases with shared data, which is a major drawback to multiple databases, and we do not have such a requirement.
If you never need to use tables from multiple tenants in a single query, I'd go for separate databases.
The raw speed of queries is not really affected by this and neither is the impact on the filesystem or memory.
However, the query optimizer tends to get slower with many schemas and many tables (but we are talking hundreds of thousands there). Tab completion in psql is also not as efficient in that case.
It's also a tad easier to use pg_dump/pg_restore with separate databases than with separate schemas.
But it's a blurry line and the actual answer can very well be based on personal opinion and preferences.
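For what it's worth, here is a minimal sketch of both layouts; the tenant, table and database names are made up, and the pg_dump commands assume a database called appdb:

    -- Option 1: schema per tenant, everything in one database (appdb).
    CREATE SCHEMA tenant_acme;
    CREATE TABLE tenant_acme.invoices (
        invoice_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        amount     numeric NOT NULL
    );
    -- Back up / restore a single tenant:
    --   pg_dump -n tenant_acme appdb > tenant_acme.sql

    -- Option 2: database per tenant.
    CREATE DATABASE tenant_acme_db;
    -- Back up / restore a single tenant:
    --   pg_dump tenant_acme_db > tenant_acme_db.sql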

Creating postgres schemas in specific tablespaces

According to Simple Configuration Recommendations for PostgreSQL, the recommended best practice for setting up the most flexible and manageable environment is to create an application-specific tablespace that has its own mount point at /pgdata-app_tblspc, and "For every schema there should be a minimum of two tablespaces. One for tables and one for indexes".
I am able to create these mount points and tablespaces, but I am wondering how to assign schemas to specific tablespaces. As far as I can tell, tablespaces are pegged to databases through the CREATE DATABASE ... TABLESPACE ... command, but there is no TABLESPACE directive in the CREATE SCHEMA command.
Following the logic of the Simple Configuration Recommendation document, it appears the implicit recommendation is to create one database per application, with each database mapped to two tablespaces: one for data and the other for indexes.
However, the same document goes on to say that application-specific databases are not the preferred way of maintaining data separation between applications; having one database with multiple schemas is the way to go.
What am I missing here? Appreciate any pointers.
Why does CREATE SCHEMA not have a tablespace clause?
Schemas provide a logical separation of data, while tablespaces provide a physical separation. Only objects that hold data, like tables and indexes, have a tablespace clause in their CREATE statement. A schema does not have an associated data file.
If you want the tables that live in different schemas to reside in different tablespaces, you'll have to add a tablespace clause to each CREATE TABLE and CREATE INDEX statement.
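A minimal sketch of what that looks like, using made-up tablespace names and mount points (creating tablespaces requires superuser rights, and the directories must already exist):

    CREATE TABLESPACE app_tbl LOCATION '/pgdata-app_tblspc/tables';
    CREATE TABLESPACE app_idx LOCATION '/pgdata-app_tblspc/indexes';

    CREATE SCHEMA app;

    -- Each data-holding object names its tablespace explicitly;
    -- the schema itself never does.
    CREATE TABLE app.orders (
        order_id   bigint GENERATED ALWAYS AS IDENTITY,
        created_at timestamptz NOT NULL DEFAULT now(),
        PRIMARY KEY (order_id) USING INDEX TABLESPACE app_idx
    ) TABLESPACE app_tbl;

    CREATE INDEX orders_created_at_idx
        ON app.orders (created_at)
        TABLESPACE app_idx;

You can also set default_tablespace (per session, per role or per database) so that new objects land in a given tablespace without an explicit clause on every statement.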
Should you use two tablespaces per application, one for tables and one for indexes?
I would say that this depends on your performance requirements and the amount of data.
If you are dealing with a multi-terabyte data warehouse and you want to optimize performance by distributing your data over different storage systems, using tablespaces will be an interesting option.
For a smallish database, I'd say that this is not worth the trouble, and you'd be better off buying enough RAM to fit the database in memory.
Are different databases or different schemas the better way of separating data for different applications?
If the applications need to access each other's data, put them in different schemas in one database. Otherwise use two databases to ensure that they cannot mess with each other's data.
Overall, tablespaces are good if you want to limit the growth of a table or the tablespaces are on different storage systems for load distribution.

What are the problems with a join between two tables in two different databases?

I am interested in your thoughts about the pitfalls of joining two or more tables from different databases. I'll try to give an example.
Suppose Table1 is located in database DatabaseA and Table2 is located in database DatabaseB.
Let's say I have a view in DatabaseA that pulls some data from Table1 and some other tables in DatabaseA.
This view is used to push data to another database; let's call this one, unimaginatively, DatabaseC.
If I need some data from Table2, my instinct is to join Table2 directly in this view, something like: Table1 INNER JOIN DatabaseB..Table2 ON [some columns]
Doing this is pretty simple and quick, but I have a nagging voice in my head that keeps telling me not to do it. My worry is about not being able to track down all the objects depending on Table2, so if I change something there, I have to be very careful and remember everywhere I use this table. It is sort of like breaking the SRP for this view (and two databases), because this view can change due to two different actions performed on two different databases: changing Table1 or changing Table2.
I am interested in your opinions. Is this a good or bad idea? What would be the problems with this approach (performance-wise, maintenance-wise and so on), and do you have real-world experience where this approach was either a big mistake or a lifesaver?
P.S.: I've searched this topic on Google and SO, but could not find anything related to it. I will gladly take the downvotes, duplicate votes and other 'reprimands' from SO users just to get a different view on this problem.
P.P.S: I am using SQL Server 2005.
Thank you, and I hope I made myself clear :)
If they are on the same server, there is no real problem pulling from separate databases. In fact, you may want to separate them for good reasons, for instance if you have a combination of transactional tables and lookup tables that are imported from files. The transactional data needs full recovery and frequent transaction log backups to be able to restore properly; the lookup data does not, and can benefit from being in a database in simple recovery mode.
We have many different databases our applications use, and we cross databases in queries all the time. As long as the indexing is done properly, there has been no noticeable performance difference. The biggest potential issue is data integrity, as you can't set up foreign keys across databases. This can be handled in triggers if need be, though.
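As a rough sketch of that trigger approach (SQL Server; it assumes Table1 carries a hypothetical Table2Id column referencing Table2 in DatabaseB on the same instance):

    -- Emulates a foreign key from DatabaseA.dbo.Table1 to DatabaseB.dbo.Table2,
    -- since cross-database foreign keys are not supported.
    CREATE TRIGGER dbo.trg_Table1_CheckTable2
    ON dbo.Table1
    AFTER INSERT, UPDATE
    AS
    BEGIN
        IF EXISTS (
            SELECT 1
            FROM inserted AS i
            WHERE i.Table2Id IS NOT NULL
              AND NOT EXISTS (
                    SELECT 1
                    FROM DatabaseB.dbo.Table2 AS t2
                    WHERE t2.Table2Id = i.Table2Id
                  )
        )
        BEGIN
            RAISERROR('Referenced row not found in DatabaseB.dbo.Table2.', 16, 1);
            ROLLBACK TRANSACTION;
        END
    END;

Bear in mind a trigger like this only checks one direction; deletes on the DatabaseB side would need their own trigger, and the whole scheme silently breaks if someone bypasses it or restores the two databases to inconsistent points in time.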
Now when the databases are on different servers, there can be a performance problem and getting the data is more complicated.
Like everything else in SQL, it depends.
At my job, we do this a LOT. We have very large data sets, and separate DBs for header and detail level records, then additional DBs for reports or tables that we build off of other data, etc etc.
There's not really a performance issue from joining across DBs, and in some cases depending on your hardware setup it may be FASTER. If DatabaseA and DatabaseB are on separate physical drives with different controllers, it will likely be faster to run a query joining those than if they were in the same DB on the same volume.
Maintenance can be an issue but no more than for any other database/tables. It's not like you have different versions of the same tables, you just have those tables in different DBs.
The only major drawback is that SQL Server does a poor job of showing cross-database dependencies, so you will need to keep track of these yourself. There are some scripts for this and also third-party utilities, and I have heard that SQL Server Denali will add additional support for this, but I'm not sure if that's accurate.
Your nagging voice is probably right.
Not least of the problems will be how to enforce declarative referential integrity, since you cannot create foreign keys between databases; sooner or later you will have to cope with inconsistent, mismatched or incomplete data.
But if you don't care about that, I don't see a problem :-)
Some general themes re cross-database joins:
Foreign keys
As others have pointed out, in the absence of foreign keys, you'll need to roll your own referential integrity. Not a problem in itself, but issues can surface when you're not in control of the data in one or more of the databases.
A related issue is the use of CASE tools. When reverse-engineering a schema, they will overlook links between tables where a FK->PK relationship doesn't exist.
Performance
If the databases are on different servers, then you're exposed to the vagaries of whatever else is running on those servers, as well as the cost of running the join operation itself. Again, if the servers are all within your control this is something you can monitor, but that may not be the case.
Coupling
If your solution relies on other databases you have multiple points of failure. If a database goes down, this could cascade to one or more systems.
Data modification
Your solution may be coupled to what you believe to be static data in tables on another database. However, what if that data were accidentally (or purposefully) amended, duplicated or deleted? Again, if the databases in question are out of your remit, other teams/departments may not be aware of how your system operates.
All this being true, there are many cases where cross-database joins are the norm. A few examples I've seen:
Mart-Repository
Performant operations take place on the mart whilst the master data stash is kept on the repository. CRUD operations take place between the two on a frequent or infrequent basis (nightly update, real-time etc).
Legacy DB
You might expose a legacy database for data migration and/or reporting/auditing purposes.
Lookup
One or more of your databases may contain static lookup information which can be re-used.
So to answer your question - it depends on what exactly you're doing and whether the risk is acceptable. Other solutions exist such as replication but again, how feasible this is will depend on the structure of your department/company.
The answer to your questions is...it depends.
I have noticed that there is no serious degradation in performance when you keep the queries nice and simple (fewer joins, etc.).
The more complex the queries, the more chance that the optimizer will produce a suboptimal execution plan.
The optimizer ultimately gets to decide how to execute the query. The more complex the query, the more opportunity for the optimizer to get the order of operations "wrong".
I recently experimented with this problem...
I ran a query with roughly 8 joins on a single database. I then put up a copy of that database on the same server under a different name, and modified the query so that it would join to a couple of tables in the second copy of the database.
As a single-database query, it ran in under 3 seconds, as expected given the volume of data.
The cross-database joined query ran in just under 3 minutes.

One User database servicing multiple application databases

I am administering a rather large database that has grown in complexity and design from a single application database. Now there is a plan to add a fifth application that carries with it its own schema and specific data. I have been researching SSO solutions but that is not really what I am after. My goal is to have one point of customer registration, logins and authorization.
Ideally, each application would request authentication and be given authorization to multiple applications, where the applications would then connect to the appropriate database for operations. I do not have first hand experience dealing with this degree of separation as the one database has been churning flawlessly for years. Any best practice papers would be appreciated :)
I would envision a core database that maintains the shared data – Customer/Company/Products.
Core tables and primary keys – to maintain referential integrity, should I have a smaller replicated table in each "application" database? What are some ways to share keys among various databases and ensure referential integrity?
Replication – two subscribers currently pull data from the production database, where data is later batched into a DW solution for reporting. Am I going down a road that can lead to frustration?
Data integrity – how can I ensure, for example, that:
DATABASE_X.PREFERENCES.USER_ID always references CORE_DATABASE.USERS.USER_ID
Reporting – what type of hurdles would I cross to replicate/transform data from multiple databases into one reporting database?
White papers – can anyone point to good references to this strategy in practice?
Thanks
A few URLs for you. Scale-out implementations can vary wildly to suit requirements, but hopefully these can help you.
http://blogs.msdn.com/b/sqlcat/archive/2008/06/12/sql-server-scale-out.aspx
This one is 2005-centric but is VERY good:
http://msdn.microsoft.com/en-us/library/aa479364.aspx#scaloutsql_topic4
This one is a good solution for reporting:
http://msdn.microsoft.com/en-us/library/ms345584.aspx
And I've given you an Analysis Services one too :)
http://sqlcat.com/whitepapers/archive/2010/06/08/scale-out-querying-for-analysis-services-with-read-only-databases.aspx
I created something like this a few years ago using views and stored procedures to bring in the data from the Master database into the subordinate databases. This would allow you to fairly easily join those master tables into the other subordinate tables.
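A minimal sketch of the view part (SQL Server syntax; the column names other than USER_ID are made up): the view lives in the application database and is the only object that names the core database, so application code and other views join against the view instead.

    CREATE VIEW dbo.vw_CoreUsers
    AS
    SELECT u.USER_ID,
           u.UserName,   -- hypothetical column
           u.Email       -- hypothetical column
    FROM CORE_DATABASE.dbo.USERS AS u;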
Have you looked into using RAC? You can have multiple physical databases but only one logical database. This would solve all of your integrity issues. And you can set aside nodes just for reporting.
Don't throw out the idea of having separate applications and linking the logon/logoff functions via web-service(-esque) requests. I have seen billing/user-registration systems separated this way, though at extremely large scales this might not be a good idea.

User Table in Separate DB

Note: I have no intention of implementing this, it's more of a thought experiment.
Suppose I had multiple services available through a web interface. At least two of which required user registration and some data in a database. A single registration would grant access to all services. Like Google (GMail, Google Docs, etc.).
Would all of these services, which are related to registered users, be located within a single database, perhaps with table-prefixes for what service they were for?
Or would each service have its own database? The only plus I can see to doing this is that it would make table names cleaner. But any time user data were needed, at least two different databases would have to be involved, which would needlessly complicate SQL queries.
Would this suggest that the 'big boys' use only a single database, and load it with tons of different (and perhaps completely unrelated) tables?
If you use the right DBMS, you can have the best of both strategies. In PostgreSQL, within a 'database' you can have separate schemas. The authentication service would access a single schema and provide the other services a key which is used as a reference for data in the other schemas. You can still deal with the entire database as a single entity, i.e. you can:
query across schemas without using dblink
store personally identifiable information separately (schemas can have separate per-user permissions to further protect data)
DBMS-managed foreign key constraints (I believe)
consistent (re the data) backup and restore
You get these advantages at the cost of a more complex DAL (may not be supported by your favorite DAL framework) and less portability between DBMS's.
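A minimal sketch of that layout (schema, table and column names are invented):

    -- Shared authentication schema plus one service-specific schema.
    CREATE SCHEMA auth;
    CREATE SCHEMA mail;

    CREATE TABLE auth.users (
        user_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        email   text NOT NULL UNIQUE
    );

    -- Cross-schema foreign keys work because everything is one database.
    CREATE TABLE mail.mailboxes (
        mailbox_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        user_id    bigint NOT NULL REFERENCES auth.users (user_id)
    );

    -- And a cross-schema join needs no dblink.
    SELECT u.email, m.mailbox_id
    FROM auth.users AS u
    JOIN mail.mailboxes AS m USING (user_id);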
I do not think it is a good idea to make multiple services dependent on a single database. If you need to restore some service from a backup, you'll have to restore all.
You would probably be overloading a single database server, too.
I would do that only if it is likely they will share a lot of data at some future point.
You might also consider a smaller database holding only the shared user data.
I would consider having 1 user / role repository with a separate database for services.
I've never done this, but I think it would depend on performance. If there's almost no overhead to do separate databases, that might be the answer. Doing separate DBs may also make it easy to split DBs across machines.
Complexity is also an issue. Hopefully your schema would be defined in such a way that you wouldn't need to dip into several different databases for different queries.
There's always a problem with potentially overloading databases and access thereof; replication is one potential good solution.
There are several strategies.
When you move to multiple databases (or multiple servers), things get more complex. Your core user information could be in a single database. The individual services could be in other databases. The problem with that is that the database is the outer unit of referential integrity, so you cannot design in foreign keys across databases. One way around this is to distribute changes to the core master tables (additions and updates only, obviously, since deletions would be forbidden due to a foreign-key constraint) to separate databases on a regular basis, and then enforce RI against these copies of the core master database tables within the service databases. This also means that the service databases and their services can run while the other databases are down for maintenance. Obviously this is an increased architectural complexity for an improvement to your service windows and reduced coupling.
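A rough sketch of that pattern in SQL Server syntax (all database, table and column names are hypothetical): the service database keeps a periodically refreshed local copy of the core users table and enforces its own foreign keys against that copy.

    -- Inside the service database.
    CREATE TABLE dbo.CoreUsersCopy (
        UserId int NOT NULL PRIMARY KEY
    );

    CREATE TABLE dbo.Preferences (
        PreferenceId int IDENTITY PRIMARY KEY,
        UserId       int NOT NULL
    );

    -- Local foreign key against the local copy, since a key against
    -- CoreDb itself is impossible.
    ALTER TABLE dbo.Preferences
        ADD CONSTRAINT FK_Preferences_CoreUsersCopy
        FOREIGN KEY (UserId) REFERENCES dbo.CoreUsersCopy (UserId);

    -- Periodic refresh: additions (and updates) only, as noted above.
    INSERT INTO dbo.CoreUsersCopy (UserId)
    SELECT u.UserId
    FROM CoreDb.dbo.Users AS u
    WHERE NOT EXISTS (
        SELECT 1 FROM dbo.CoreUsersCopy AS c WHERE c.UserId = u.UserId
    );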
I would recommend starting with a single database. If your RDBMS supports it, I would organize components according to SCHEMAs which would allow you to at least maintain a logical separation by design. You can more easily refactor later.
Many databases have tables which can be considered unrelated. Sometimes in a system you have multiple entity networks that hardly connect (sometimes not at all). You can use SCHEMAs in these cases too.
