Creating postgres schemas in specific tablespaces

According to Simple Configuration Recommendations for PostgreSQL, the recommended best practice for setting up the most flexible and manageable environment is to create an application-specific tablespace that has its own mountpoint at /pgdata-app_tblspc, and "For every schema there should be a minimum of two tablespaces. One for tables and one for indexes".
I am able to create these mount points and tablespaces, but I am wondering how to assign schemas to specific tablespaces. As far as I can tell, tablespaces are pegged to databases through the CREATE DATABASE ... TABLESPACE ... command, but there is no TABLESPACE directive in the CREATE SCHEMA command.
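For reference, a minimal sketch of the commands in question (tablespace, database, and schema names are hypothetical):

    CREATE TABLESPACE app_data  LOCATION '/pgdata-app_tblspc/data';
    CREATE TABLESPACE app_index LOCATION '/pgdata-app_tblspc/index';

    -- A database can be placed in a tablespace...
    CREATE DATABASE appdb TABLESPACE app_data;

    -- ...but CREATE SCHEMA accepts no such clause:
    CREATE SCHEMA app_schema;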
Following the logic of the Simple Configuration Recommendations document, the implicit recommendation appears to be one database per application, with each database mapped to two tablespaces: one for data and the other for indexes.
However, the same document goes on to say that application-specific databases are not the preferred way of maintaining data separation between applications; having one database with multiple schemas is the way to go.
What am I missing here? Appreciate any pointers.

Why does CREATE SCHEMA not have a tablespace clause?
Schemas provide a logical separation of data, while tablespaces provide a physical separation. Only objects that hold data, like tables and indexes, have a tablespace clause in their CREATE statement. A schema does not have an associated data file.
If you want the tables that live in different schemas to reside in different tablespaces, you'll have to add a tablespace clause to each CREATE TABLE and CREATE INDEX statement.
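For example, a rough sketch assuming the tablespaces from the question already exist (schema, table, and column names are hypothetical):

    CREATE TABLE app_schema.orders (
        order_id   bigint PRIMARY KEY USING INDEX TABLESPACE app_index,
        created_at timestamptz NOT NULL
    ) TABLESPACE app_data;

    CREATE INDEX orders_created_at_idx
        ON app_schema.orders (created_at)
        TABLESPACE app_index;

    -- default_tablespace controls where new objects land
    -- when no explicit TABLESPACE clause is given.
    SET default_tablespace = app_data;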
Should you use two tablespaces per application, one for tables and one for indexes?
I would say that this depends on your performance requirements and the amount of data.
If you are dealing with a multi-terabyte data warehouse and you want to optimize performance by distributing your data over different storage systems, using tablespaces will be an interesting option.
For a smallish database I'd say that this is not worth the trouble, and you'd be better off buying enough RAM to fit the database in memory.
Are different databases or different schemas the better way of separating data for different applications?
If the applications need to access each other's data, put them in different schemas in one database. Otherwise use two databases to ensure that they cannot mess with each other's data.
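A minimal sketch of the shared-database variant (role and schema names are hypothetical, and the owner roles are assumed to exist already):

    CREATE SCHEMA app_a AUTHORIZATION app_a_owner;
    CREATE SCHEMA app_b AUTHORIZATION app_b_owner;

    -- Let application B read application A's data.
    GRANT USAGE ON SCHEMA app_a TO app_b_owner;
    GRANT SELECT ON ALL TABLES IN SCHEMA app_a TO app_b_owner;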
Overall, tablespaces are useful if you want to limit the growth of a table or if the tablespaces are on different storage systems for load distribution.

Related

postgresql multitenant schemas vs databases

We've designed a multitenant system (let's say hundreds of tenants, not thousands). There is no shared data. The database is PostgreSQL. Is it better to create a separate database per tenant, or a separate schema?
What are the pros and cons? What is the impact on the filesystem and on DB engine tables/views like locks, object privileges, etc. - will they be much bigger in a multiple-schema solution? Separate databases should be easier to backup/restore.
I know there are lots of similar questions, but most relate to cases with shared data, which is a major drawback to multiple databases, and we do not have such a requirement.
If you never need to use tables from multiple tenants in a single query, I'd go for separate databases.
The raw speed of queries is not really affected by this and neither is the impact on the filesystem or memory.
However, the query optimizer tends to get slower with many schemas and many tables (but we are talking hundreds of thousands here). Tab-completion in psql is also not as efficient in that case.
It's also a tad easier to use pg_dump/pg_restore with separate databases than with separate schemas.
But it's a blurry line and the actual answer can very well be based on personal opinion and preferences.

Performance issue when separating a small number of tables into different schemas

Assume that we have a database on MS SQL Server 2008 with 20-30 tables at the core of our distributed system. Permissions to read and write these tables can vary for each layer of our system.
For example, we have three types of clients that can connect to our database directly or via some intermediate layer. To eliminate the possibility of incorrect operations, we have to correctly set the permissions for each type of client.
The obvious solution is to separate our tables into different SQL Server schemas and set permissions to access the objects in each schema as a whole. Now we must decide how justified this solution is for a relatively small number of tables, and how it will impact performance (it seems that we will very often have to join tables from different schemas).
Joining tables from different schemas will not affect performance.
But, actually, it is better to grant permissions on procedures rather than on tables.
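For example, in T-SQL (schema, role, and procedure names are hypothetical):

    -- Grant access at the schema level...
    GRANT SELECT ON SCHEMA::reporting TO client_readonly;

    -- ...or, preferably, expose only stored procedures.
    GRANT EXECUTE ON SCHEMA::app_api TO client_app;
    GRANT EXECUTE ON OBJECT::app_api.usp_PlaceOrder TO client_app;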

Understanding "multiple, logically interrelated databases" in the context of distributed databases?

From the following definition:
A distributed database is a collection of multiple, logically interrelated databases distributed over a computer network.
Sometimes "distributed database system" is used to refer jointly to the distributed database and the distributed DBMS.
I do not understand the phrase "multiple, logically interrelated databases". I have heard of tables being logically related ("relational").
Can anyone please give a simple yet clear example of "multiple, logically interrelated databases"?
The databases would be logically related, but not actually related in the way you think of tables being related (foreign keys).
One way of doing this is to put some tables from your schema into one database and other tables into another database. For instance, you might put your read-heavy data into one database optimized for reading, and your write-heavy data in another database optimized for writing. Those tables might still be logically related, but you wouldn't be able to use foreign keys since they are in different databases.
Another way of doing this would be to have a single table split across multiple databases. For instance if you have a large site with an international presence and several data centers around the world, you might have a users table that is partitioned across those databases with users from a given country residing in the users table on the database closest to them geographically.
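A rough sketch of that second case (all names hypothetical); the same table definition is deployed to each regional database, but each copy holds only its own users:

    -- Identical DDL in db_us, db_eu, db_apac, ...
    CREATE TABLE users (
        user_id      bigint PRIMARY KEY,
        country_code char(2) NOT NULL,
        email        varchar(256) NOT NULL
    );
    -- db_eu holds only rows for European users, db_us only rows
    -- for North American users, and so on.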

Synchronizing 2 databases with different schemas

We have a normalized SQL Server 2008 database designed using generic tables. So, instead of having a separate table for each entity (e.g. Products, Orders, OrderItems, etc), we have generic tables (Entities, Instances, Relationships, Attributes, etc).
We have decided to have a separate denormalized database for quick retrieval of data. Could you please advise me of various technologies out there to synchronize these 2 databases, assuming they have different schemas?
When two databases have such radically different schemas, you should be looking at techniques for data migration or replication, not synchronization. SQL Server provides two technologies for this, SSIS and Replication, or you can write your own script to do it.
Replication will take new or modified data from a source database and copy it to a target database. It provides mechanisms for scheduling, packaging and distributing changes, and can handle both real-time and batch updates. To work, it needs to add enough information to both databases to track modifications and match rows. In your case it would be hard to identify which "Products" have changed, as you would have to identify all relevant modified rows in 4 or more different tables. It can be done, but it will require some effort. In any case, you would have to create views that match the target schema, as replication doesn't allow any transformation of the source data.
SSIS will pull data from one source, transform it and push it to a target. It has no built-in mechanisms for tracking changes so you will have to add fields to your tables to track changes. It is strictly a batch process that can run according to a schedule. The main benefit is that you can perform a wide variety of transformations while replication allows almost none (apart from drawing the data from a view). You could create dataflows that modify only the relevant Product field when a Product related Attribute record changes, or simply reconstitute an entire Product record and overwrite the target record.
Finally, you can create your own triggers or stored procedures that will run when the data changes and copy it from one database to the other.
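As a very rough sketch of the trigger approach (database, table, and column names are hypothetical, and the real mapping from the generic tables to a denormalized Product row would be more involved):

    CREATE TRIGGER trg_attributes_sync
    ON dbo.Attributes
    AFTER INSERT, UPDATE
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Push changed attribute values into the denormalized copy
        -- living in another database on the same server.
        UPDATE p
        SET    p.Price = i.Value
        FROM   DenormDb.dbo.Products AS p
        JOIN   inserted AS i ON i.InstanceId = p.ProductId
        WHERE  i.AttributeName = 'Price';
    END;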
I should also point out that you have probably over-normalized your database. In all three cases you will have some performance penalty when you join all the tables to reconstitute an entity, resulting in more locking than necessary and inefficient use of indexes. You are sacrificing performance and scalability for the sake of ease of change.
Perhaps you should take a look at the Sparse Column feature of SQL Server 2008 for a way to support flexible schemas while maintaining performance and scalability.
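In case it helps, a brief sketch of what sparse columns look like (hypothetical table and columns):

    CREATE TABLE dbo.Products (
        ProductId  int PRIMARY KEY,
        Name       nvarchar(200) NOT NULL,
        -- Rarely populated attributes cost almost no space when NULL.
        Voltage    decimal(9,2) SPARSE NULL,
        ScreenSize decimal(9,2) SPARSE NULL
    );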

User Table in Separate DB

Note: I have no intention of implementing this, it's more of a thought experiment.
Suppose I had multiple services available through a web interface. At least two of which required user registration and some data in a database. A single registration would grant access to all services. Like Google (GMail, Google Docs, etc.).
Would all of these services, which are related to registered users, be located within a single database, perhaps with table-prefixes for what service they were for?
Or would each service have its own database? The only plus I can see to doing this is that it would make table names cleaner. But any time user interaction was needed, at least two different databases would have to be involved, which would needlessly complicate SQL queries.
Would this suggest that the 'big boys' use only a single database, and load it with tons of different (and perhaps completely unrelated) tables?
If you use the right DBMS, you can have the best of both strategies. In PostgreSQL, within a 'database' you can have separate schemas. The authentication service would access a single schema and provide the other services a key which is used as a reference for data in the other schemas. You can still deal with the entire database as a single entity, i.e.:
query across schemas without using dblink
store personally identifiable information separately (schemas can have separate per-user permissions to further protect data)
DBMS-managed foreign key constraints (I believe)
consistent (re the data) backup and restore
You get these advantages at the cost of a more complex DAL (may not be supported by your favorite DAL framework) and less portability between DBMS's.
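A rough sketch of what that schema-per-service layout might look like in PostgreSQL (schema, table, and column names are hypothetical):

    CREATE SCHEMA auth;
    CREATE SCHEMA mail;

    CREATE TABLE auth.users (
        user_id bigint PRIMARY KEY,
        email   text NOT NULL
    );

    CREATE TABLE mail.mailboxes (
        mailbox_id bigint PRIMARY KEY,
        user_id    bigint NOT NULL REFERENCES auth.users (user_id),  -- FK across schemas
        quota_mb   integer NOT NULL
    );

    -- Query across schemas without dblink:
    SELECT u.email, m.quota_mb
    FROM   auth.users u
    JOIN   mail.mailboxes m USING (user_id);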
I do not think it is a good idea to make multiple services dependent on a single database. If you need to restore one service from a backup, you'll have to restore all of them.
You would probably also be overloading a single database server.
I would do that only if it is likely they will share a lot of data at some future point.
Also, you might consider a smaller database holding only the shared user data.
I would consider having one user/role repository with separate databases for the services.
I've never done this, but I think it would depend on performance. If there's almost no overhead to using separate databases, that might be the answer. Separate DBs may also make it easier to split them across machines.
Complexity is also an issue. Hopefully your schema would be defined in such a way that you wouldn't need to dip into several different databases for different queries.
There's always the potential problem of overloading databases and access to them; replication is one good solution.
There are several strategies.
When you move to multiple databases (or multiple servers), things get more complex. Your core user information could be in a single database. The individual services could be in other databases. The problem with that is that the database is the outer unit of referential integrity, so you cannot design in foreign keys across databases. One way around this is to distribute changes to the core master tables (additions and updates only, obviously, since deletions would be forbidden due to a foreign-key constraint) to separate databases on a regular basis, and then enforce RI against these copies of the core master database tables within the service databases. This also means that the service databases and their services can run while the other databases are down for maintenance. Obviously this is an increased architectural complexity for an improvement to your service windows and reduced coupling.
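One way to picture the core-table copies (generic SQL, hypothetical names): the service database keeps a periodically refreshed copy of the core users table, and service-owned tables enforce referential integrity against that copy:

    -- In the service database: a read-only, regularly refreshed copy.
    CREATE TABLE core_users_copy (
        user_id integer PRIMARY KEY,
        email   varchar(256) NOT NULL
    );

    -- Service data can now reference it with a local foreign key.
    CREATE TABLE service_orders (
        order_id integer PRIMARY KEY,
        user_id  integer NOT NULL REFERENCES core_users_copy (user_id)
    );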
I would recommend starting with a single database. If your RDBMS supports it, I would organize components according to SCHEMAs which would allow you to at least maintain a logical separation by design. You can more easily refactor later.
Many databases have tables which can be considered unrelated. Sometimes in a system you have multiple entity networks that hardly connect (sometimes not at all). You can use SCHEMAs in these cases too.
