Why use database schemas? - sql-server

I'm working on a single database with multiple database schemas,
e.g.
[Baz].[Table3],
[Foo].[Table1],
[Foo].[Table2]
I'm wondering why the tables are separated this way besides organisation and permissions.
How common is this, and are there any other benefits?

The main benefit is the one you named: logically grouping objects together and allowing permissions to be set at the schema level.
It does add complexity to programming, in that you must always know which schema you intend to get something from - or rely on the user's default schema being correct. Equally, you can use this to allow the same object name in different schemas, so that the code is written against a single object name, whilst the schema the user defaults to decides which object that actually is.
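A tiny sketch of that "same name, different schema" idea, reusing the Foo and Baz schema names from the question (the table name is made up):

CREATE SCHEMA Foo;
GO
CREATE SCHEMA Baz;
GO
CREATE TABLE Foo.Settings (val INT);
CREATE TABLE Baz.Settings (val INT);
GO
-- A user whose default schema is Foo resolves the unqualified name to Foo.Settings;
-- a user defaulted to Baz gets Baz.Settings from the exact same statement.
SELECT val FROM Settings;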
I wouldn't say it was that common, anecdotally most people still drop everything in the dbo schema.

I'm not aware of any other possible reasons besides organization and permissions. Are these not good enough? :)
For the record - I always use a single schema - but then I'm creating web applications and there is also just a single user.
Update, 10 years later!
There's one more reason, actually. You can have "copies" of your schema for different purposes. For example, imagine you are creating a blog platform. People can sign up and create their own blogs. Each blog needs a table for posts, tags, images, settings etc. One way to do this is to add a column blog_id to each table and use that to differentiate between blogs. Or... you could create a new schema for each blog and fresh new tables for each of them. This has several benefits:
Programming is easier. You just select the appropriate schema at the beginning and then write all your queries without worrying about forgetting to add where blog_id=#currentBlog somewhere.
You avoid a whole class of potential bugs where a foreign key in one blog points to an object in another blog (accidental data disclosure!)
If you want to wipe a blog, you just drop the schema with all the tables in it. Much faster than seeking and deleting records from dozens of different tables (in the right order, no less!)
Each blog's performance depends only (well, mostly anyway) on how much data there is in that blog.
Exporting data is easier - just dump all the objects in the schema.
There are also drawbacks, of course.
When you update your platform and need to perform schema changes, you need to update each blog separately. (Added yet later: this could actually be a feature! You can do "rolling updates" where instead of updating ALL the blogs at the same time, you update them in batches, seeing if there are any bugs or complaints before updating the next batch.)
The same goes for fixing corrupted data, if that happens for whatever reason.
Statistics for the platform as a whole are harder to calculate.
All in all, this is a pretty niche use case, but it can be handy!
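A rough T-SQL sketch of what provisioning (and later wiping) one blog's schema could look like; the schema and table names here are made up for illustration:

CREATE SCHEMA blog_1042;   -- e.g. blog_<blogId>
GO
CREATE TABLE blog_1042.Posts (
    post_id INT IDENTITY PRIMARY KEY,
    title   NVARCHAR(200) NOT NULL,
    body    NVARCHAR(MAX) NULL
);
CREATE TABLE blog_1042.Tags (
    tag_id INT IDENTITY PRIMARY KEY,
    name   NVARCHAR(100) NOT NULL
);
GO
-- Wiping the blog later: in SQL Server you drop the tables first and then the
-- (now empty) schema; databases with a cascading schema drop do it in one statement.
DROP TABLE blog_1042.Posts;
DROP TABLE blog_1042.Tags;
DROP SCHEMA blog_1042;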

To me, they can cause more problems because they break ownership chaining.
Example:
Stored procedure tom.uspFoo can use table tom.bar without extra grants, but extra rights would be needed on dick.AnotherTable. This means I have to grant SELECT rights on dick.AnotherTable to the callers of tom.uspFoo... which exposes direct table access.
Unless I'm completely missing something...
Edit, Feb 2012
I asked a question about this: SQL Server: How to permission schemas?
The key is "same owner": so if dbo owns both dick and tom schema, then ownership chaining does apply. My previous answer was wrong.

There can be several reasons why this is beneficial:
share data between several (instances of) an application. This could be the case if you have a group of reference data that is shared between applications, and a group of data that is specific to the instance. Be careful not to have circular references between entities in different schemas, i.e. don't have a foreign key from an entity in schema 1 to an entity in schema 2 AND another foreign key from schema 2 back to schema 1 in other entities.
data partitioning: it allows data to be stored on different servers more easily.
as you mentioned, access control at the DB level

Related

How to handle different organizations with a single DB?

Background
I'm building an online information system which users can access through any computer. I don't want to replicate the DB and code for every university or organization.
I just want users to hit a domain like www.example.com, sign in, and use it.
A second user will also hit the same domain www.example.com, sign in, and use it, but the data for them is different.
Scenario
Suppose one university has 200 employees, a second university has 150, and so on.
Question
Do I need a separate employee table for each university, or is it OK to have a single table with a column that holds a University ID?
I assume the second is best, but suppose I have 20 universities or organizations and a total of thousands of employees.
What is the best approach?
The same applies to every table; this is just to give you an example.
Thanks
The approach will depend upon the data, usage, and client requirements/restrictions.
Use an integrated model, as suggested by duffymo. This may be appropriate if each organization is part of a larger whole (i.e. all colleges are part of a state college board) and security concerns about cross-query access are minimal [2]. This approach has a minimal amount of separation between each organization as the same schema [1] and relations are "openly" shared. It leads to a very simple model initially, but it can become very complicated (with compound FKs and correct usage of such) if needing relations for organization-specific values, because it adds another dimension of data.
Implement multi-tenancy. This can be achieved with implicit filters on the relations (perhaps hidden behinds views and store procedures), different schemas, or other database-specific support. Depending upon implementation this may or may not share schema or relations even though all data may reside in the same database. With implicit isolation, some complicated keys or relationships can be hidden/eliminated. Multi-tenancy isolation also generally makes it harder/impossible to cross-query.
Silo the databases entirely. Each customer or "organization" has a separate database. This implies separate relations and schema groups. I have found this approach to be relatively simple with automated tooling, but it does require managing multiple databases. Direct cross-querying is impossible, although "linked databases" can be used if there is a need.
Even though it's not "a single DB", in our case, we had the following restrictions 1) not allowed to ever share/expose data between organizations, and 2) each organization wanted their own local database. Thus, our product ended up using a silo approach. Make sure that the approach chosen meets customer requirements.
None of these approaches will have any issue with "thousands", "hundreds of thousands", or even "millions" of records as long as the indices and queries are correctly planned. However, switching from one to another can violate many assumed constraints and so the decision should be made earlier on.
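As a rough sketch of the multi-tenancy option (filtering behind a view), assuming an Employee table that already carries a university_id column; the view name and SESSION_CONTEXT key are illustrative, and SESSION_CONTEXT needs SQL Server 2016 or later:

CREATE VIEW dbo.vMyEmployees
AS
SELECT employee_id, university_id, full_name
FROM dbo.Employee
WHERE university_id = CAST(SESSION_CONTEXT(N'university_id') AS INT);
GO
-- The application sets the tenant once per connection, then queries only the view.
EXEC sys.sp_set_session_context @key = N'university_id', @value = 42;
SELECT * FROM dbo.vMyEmployees;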
[1] In this response I am using "schema" to refer to the security grouping of database objects (e.g. tables, views) and not the database model itself. The actual database model used can be common/shared, as we do even when using separate databases.
[2] An integrated approach is not necessarily insecure - but it doesn't inherently have some of the built-in isolation of other designs.
I would normalize it to have UNIVERSITY and EMPLOYEE tables, with a one-to-many relationship between them.
You'll have to take care to make sure that only people associated with a given university can see their data. Role based access will be important.
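A minimal sketch of that normalized one-to-many layout (column names are assumptions):

CREATE TABLE University (
    university_id INT IDENTITY PRIMARY KEY,
    name          NVARCHAR(200) NOT NULL
);

CREATE TABLE Employee (
    employee_id   INT IDENTITY PRIMARY KEY,
    university_id INT NOT NULL REFERENCES University (university_id),
    full_name     NVARCHAR(200) NOT NULL
);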
This is called a multi-tenant architecture. You should read this:
http://msdn.microsoft.com/en-us/library/aa479086.aspx
I would go with Tenant Per Schema, which means copying the structure across different schemas. However, as you should keep all your SQL DDL in source control, this is very easy to script.
It's easy to screw up and "leak" information between tenants if doing it all in the same table.

Automatic database schema generation system?

I'm working with a client who has a piece of custom website software that has something I haven't seen before. It has a MySQL Database backend, but most of the tables are auto-generated by the php code. This allows end-users to create tables and fields as they see fit. So it's a database within a database, but obviously without all the features available in the 'outermost' database. There are a couple tables that are basically mappings of auto-generated table names and fields to user-friendly table names and fields.* This makes queries feel very unintuitive :P
They are looking for some additional features, ones that are immediately available when you use the database directly, such as data type enforcement, foreign keys, unique indexes, etc. But since this a database within a database, all those features have to be added into the php code that runs the database. The first thing that came to my mind is Inner Platform Effect* -- but I don't see a way to get out of database emulation and still provide them with the features they need!
I'm wondering, could I create a system that gives users nerfed ability to create 'real' tables, thus gaining all the relational features for free? In the past, it's always been the developer/admin who made the tables, and then the users did CRUD operations through the application. I just have an uncomfortable feeling about giving users access to schema operations, even when it is through the application. I'm in uncharted territory.
Is there a name for this kind of system? Internally, in the code, this is called a 'collection' system. The name of 'virtual' tables and fields within the database is called a 'taxonomy'. Is this similar to CCK or the taxonomy modules in Drupal? I'm looking for models of software that do this kind of thing, so I can see what the pitfalls and benefits are. Basically I'm looking for more outside information about this kind of system.
Note this is not a simple key-value mapping, as the wikipedia article on inner-platform effect references. These work like actual tuples of multiple cells -- like simple database tables.
I've done this; you can make it pretty simple or go completely nuts with it. You do run into problems, though, when you put it into customers' hands: are we going to ask them to figure out primary keys, unique constraints and foreign keys?
So assuming you want to go ahead with that in mind, you need some type of data dictionary, aka meta-data repository. You have a start, but you need to add the ideas that columns are collected into tables, then specify primary and foreign keys.
After that, generating DDL is fairly trivial. Loop through tables, loop through columns, build a CREATE TABLE command. The only hitch is that you need to sequence the tables so that parents are created before children. That is not hard; implement a topological ordering (http://en.wikipedia.org/wiki/Topological_ordering).
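A rough sketch of that metadata-repository idea in T-SQL; table and column names are made up, and keys plus the topological-ordering step are omitted for brevity (STRING_AGG needs SQL Server 2017+):

-- The data dictionary itself
CREATE TABLE meta_table (
    table_id   INT IDENTITY PRIMARY KEY,
    table_name SYSNAME NOT NULL
);
CREATE TABLE meta_column (
    column_id   INT IDENTITY PRIMARY KEY,
    table_id    INT NOT NULL REFERENCES meta_table (table_id),
    column_name SYSNAME NOT NULL,
    data_type   NVARCHAR(128) NOT NULL,   -- e.g. 'INT', 'NVARCHAR(200)'
    is_nullable BIT NOT NULL DEFAULT 1
);

-- Generate one CREATE TABLE statement per metadata entry
SELECT 'CREATE TABLE ' + QUOTENAME(t.table_name) + ' ('
     + STRING_AGG(QUOTENAME(c.column_name) + ' ' + c.data_type
                  + CASE WHEN c.is_nullable = 1 THEN ' NULL' ELSE ' NOT NULL' END, ', ')
     + ');' AS ddl
FROM meta_table t
JOIN meta_column c ON c.table_id = t.table_id
GROUP BY t.table_name;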
At the second level, you first have to examine the existing database and then sometimes only issue ALTER TABLE ADD COLUMN... commands. So it starts to get complicated.
Then things continue to get more complicated as you consider allowing DEFAULTS, specifying indexes, and so on. The task is finite, but can be much larger than it seems.
You may wish to consider how much of this you really want to support, and then make a value judgment about coding it up.
My triangulum project does this: http://code.google.com/p/triangulum-db/ but it is only at Alpha 2 and I would not recommend using it in a Production situation just yet.
You may also look at Doctrine, http://www.doctrine-project.org/, they have some sort of text-based dictionary to build databases out of, but I'm not sure how far they've gone with it.

Few database design questions relating to user content site

Designing a user content website (kind of similar to Yelp but for a different market and with photo sharing) and had a few database questions:
Does each user get their own set of tables, or are we storing multiple users' data in common tables? Since this is also something of a social network, databases are usually partitioned off for scalability when the user base grows, with different sets of users served separately, so what is the best approach? I guess some data like user accounts can be in common tables, but for wall posts, photos etc. each user will get their own table? If so, then if we have 10 million users, that means 10 million x whatever number of tables per user? This is currently being designed in MySQL.
How do the user tables know what to create each time a user joins the site? I am assuming there may be a system table template from which the fields are pulled?
In addition to the above question, if tomorrow we modify tables or add/remove features, how do we roll the changes down to all the live user accounts/tables? I know from a page point of view we have the master template, but for the database, how will the user tables be updated? Is that something we do manually, or will each table keep checking, say every 24 hours, against the system tables for updates to its structure?
If the above is all true, that means we are maintaining 1 master set of tables with system default values, and each user gets the same values copied to their tables? Take a field like the maximum failed login attempts before the system locks an account: we have a system default of 5 login attempts within 30 minutes, but I want to allow users to specify their own number to customize their own security, so that means they can overwrite the system default in their own table?
Thanks.
Users should not get their own set of tables. It will most likely not perform as well as one table (properly indexed), and schema changes will have to be deployed to all user tables.
You could have default values specified on the table for things that are optional.
With difficulty. With one set of tables it will be a lot easier, and probably faster.
That sort of data should be stored in a User Preferences table that stores all preferences for all users. Again, don't duplicate the schema for all users.
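A small sketch of the "system default plus per-user override" idea (table and column names are assumptions):

CREATE TABLE SystemDefault (
    setting_name  VARCHAR(100) PRIMARY KEY,
    setting_value VARCHAR(100) NOT NULL          -- e.g. ('max_failed_logins', '5')
);

CREATE TABLE UserPreference (
    user_id       INT NOT NULL,
    setting_name  VARCHAR(100) NOT NULL REFERENCES SystemDefault (setting_name),
    setting_value VARCHAR(100) NOT NULL,
    PRIMARY KEY (user_id, setting_name)
);

-- Effective value: the user's override if present, otherwise the system default.
SELECT d.setting_name,
       COALESCE(p.setting_value, d.setting_value) AS effective_value
FROM SystemDefault d
LEFT JOIN UserPreference p
       ON p.setting_name = d.setting_name AND p.user_id = 123;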
Generally, creating separate tables for each entity (in this case each user) is not a good idea. If each user's data is in separate tables, querying becomes cumbersome.
If your table is large you should optimize the table with indexes. If it gets very large, you also may want to look into partitioning tables.
This allows you to see the table as 1 object, though it is logically split up - the DBMS handles most of the work and presents you with 1 object. This way you SELECT, INSERT, UPDATE, ALTER etc as normal, and the DB figures out which partition the SQL refers to and performs the command.
Not splitting the tables up by user, and instead using indexes and partitions, will deal with scalability while maintaining performance. If you don't split up the tables manually, this also makes points 2, 3, and 4 moot.
Here's a link to partitioning tables (SQL Server-specific):
http://databases.about.com/od/sqlserver/a/partitioning.htm
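Since that link is SQL Server-specific, here is a hedged sketch of what range partitioning by user could look like there; the boundary values, filegroup choice and names are all illustrative:

CREATE PARTITION FUNCTION pfUserRange (INT)
    AS RANGE RIGHT FOR VALUES (1000000, 2000000, 3000000);

CREATE PARTITION SCHEME psUserRange
    AS PARTITION pfUserRange ALL TO ([PRIMARY]);

-- One logical table, physically split by user_id ranges.
CREATE TABLE dbo.WallPost (
    post_id BIGINT IDENTITY NOT NULL,
    user_id INT NOT NULL,
    body    NVARCHAR(MAX) NULL,
    CONSTRAINT PK_WallPost PRIMARY KEY (user_id, post_id)
) ON psUserRange (user_id);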
It doesn't make any kind of sense to me to create a set of tables for each user. If you have a common set of tables for all users then I think that avoids all the issues you are asking about.
It sounds like you need to locate a primer on relational database design basics. Regardless of the type of application you are designing, you should start there. Learn how joins work, indices, primary and foreign keys, and so on. Learn about basic database normalization.
It's not customary to create new tables on-the-fly in an application; it's usually unnecessary in a properly designed schema. Usually schema changes are done at deployment time. The only time "users" get their own tables is an artifact of a provisioning decision, wherein each "user" is effectively a tenant in a walled-off garden; this only makes sense if each "user" (more likely, a company or organization) never needs access to anything that other users in the system have stored.
There are mechanisms for dealing with loosely structured types of information in databases, but if you find yourself reaching for this often (the most common method is called Entity-Attribute-Value), your problem is either not quite correctly modeled, or you may not actually need a relational database, in which case you might be better off with a document-oriented database like CouchDB/MongoDB.
Adding, based on your updated comments/notes:
Your concerns about the number of records in a particular table are most likely premature. Get something working first. Most modern DBMSes, including newer versions of MySql, support mechanisms beyond indices and clustered indices that can help deal with large numbers of records. To wit, in MS Sql Server you can create a partition function on fields on a table; MySql 5.1+ has a few similar partitioning options based on hash functions, ranges, or other mechanisms. Follow well-established conventions for database design modeling your domain as sensibly as possible, then adjust when you run into problems. First adjust using the tools available within your choice of database, then consider more drastic measures only when you can prove they are needed. There are other kinds of denormalization that are more likely to make sense before you would even want to consider having something as unidiomatic to database systems as a "table per user" model; even if I were to look at that route, I'd probably consider something like materialized views first.
I agree with the comments above that say that a table per user is a bad idea. Also, while it's a good idea to have strategies in mind now for how you can cope when things get really big, I'd concentrate on getting things right for a small number of users first - if no-one wants to / is able to use your service, then unfortunately you won't be faced with the problem of lots of users.
A common approach among very large sites is database sharding. The summary is: you have N instances of your database in parallel (on separate machines), and each holds 1/N of the total data. There's some shared way of knowing which instance holds a given bit of data. To access some data you have 2 steps, rather than the 1 you might expect:
Work out which shard holds the data
Go to that shard for the data
There are problems with this, such as: you set up e.g. 8 shards and they all fill up, so you want to spread the data over e.g. 20 shards, which means migrating data between shards.

How would you design your database to allow user-defined schema

If you have to create an application like - let's say a blog application, creating the database schema is relatively simple. You have to create some tables, tblPosts, tblAttachments, tblComments, tblBlaBla… and that's it (OK, I know, that's a bit simplified, but you understand what I mean).
What if you have an application where you want to allow users to define parts of the schema at runtime. Let's say you want to build an application where users can log any kind of data. One user wants to log his working hours (startTime, endTime, project Id, description), the next wants to collect cooking recipes, others maybe stock quotes, the weekly weight of their babies, monthly expenses they spent for food, the results of their favorite football teams or whatever stuff you can think about.
How would you design a database to hold all those very, very different kinds of data? Would you create a generic schema that can hold all kinds of data, would you create new tables reflecting the user-defined data schema, or do you have another great idea to do that?
If it's important: I have to use SQL Server / Entity Framework
Let's try again.
If you want them to be able to create their own schema, then why not build the schema using, oh, I dunno, the CREATE TABLE statement? You have a full-blown, fully functional, powerful database that can do amazing things like define schemas and store data. Why not use it?
If you were just going to do some ad-hoc properties, then sure.
But if it's "carte blanche, they can do whatever they want", then let them.
Do they have to know SQL? Umm, no. That's your UI's task. Your job as a tool and application designer is to hide the implementation from the user. So present lists of fields, lines and arrows if you want relationships, etc. Whatever.
Folks have been making "end user", "simple" database tools for years.
"What if they want to add a column?" Then add a column, databases do that, most good ones at least. If not, create the new table, copy the old data, drop the old one.
"What if they want to delete a column?" See above. If yours can't remove columns, then remove it from the logical view of the user so it looks like it's deleted.
"What if they have eleventy zillion rows of data?" Then they have a eleventy zillion rows of data and operations take eleventy zillion times longer than if they had 1 row of data. If they have eleventy zillion rows of data, they probably shouldn't be using your system for this anyway.
The fascination of "Implementing databases on databases" eludes me.
"I have Oracle here, how can I offer less features and make is slower for the user??"
Gee, I wonder.
There's no way you can predict how complex their data requirements will be. Entity-Attribute-Value is one typical solution many programmers use, but it might not be sufficient, for instance if the user's data would conventionally be modeled with multiple tables.
I'd serialize the user's custom data as XML or YAML or JSON or similar semi-structured format, and save it in a text BLOB.
You can even create inverted indexes so you can look up specific values among the attributes in your BLOB. See http://bret.appspot.com/entry/how-friendfeed-uses-mysql (the technique works in any RDBMS, not just MySQL).
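A hedged sketch of that "semi-structured blob plus inverted index" layout; all names are illustrative:

CREATE TABLE user_entry (
    entry_id BIGINT IDENTITY PRIMARY KEY,
    user_id  INT NOT NULL,
    payload  NVARCHAR(MAX) NOT NULL        -- JSON/XML/YAML blob holding the user's custom fields
);

-- One row per (entry, attribute) that should be searchable.
CREATE TABLE entry_attr_index (
    entry_id   BIGINT NOT NULL REFERENCES user_entry (entry_id),
    attr_name  NVARCHAR(100) NOT NULL,
    attr_value NVARCHAR(400) NOT NULL,
    PRIMARY KEY (entry_id, attr_name)
);
CREATE INDEX IX_entry_attr ON entry_attr_index (attr_name, attr_value);

-- Find entries whose custom "projectId" attribute equals '42'.
SELECT e.*
FROM user_entry e
JOIN entry_attr_index i ON i.entry_id = e.entry_id
WHERE i.attr_name = 'projectId' AND i.attr_value = '42';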
Also consider using a document store such as Solr or MongoDB. These technologies do not need to conform to relational database conventions. You can add new attributes to any document at runtime, without needing to redefine the schema. But it's a tradeoff -- having no schema means your app can't depend on documents/rows being similar throughout the collection.
I'm a critic of the Entity-Attribute-Value anti-pattern.
I've written about EAV problems in my book, SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming.
Here's an SO answer where I list some problems with Entity-Attribute-Value: "Product table, many kinds of products, each product has many parameters."
Here's a blog I posted the other day with some more discussion of EAV problems: "EAV FAIL."
And be sure to read this blog "Bad CaRMa" about how attempting to make a fully flexible database nearly destroyed a company.
I would go for a Hybrid Entity-Attribute-Value model, so like Antony's reply, you have EAV tables, but you also have default columns (and class properties) which will always exist.
Here's a great article on what you're in for :)
As an additional comment, I knocked up a prototype for this approach using Linq2Sql in a few days, and it was a workable solution. Given that you've mentioned Entity Framework, I'd take a look at version 4 and their POCO support, since this would be a good way to inject a hybrid EAV model without polluting your EF schema.
On the surface, a schema-less or document-oriented database such as CouchDB or SimpleDB for the custom user data sounds ideal. But I guess that doesn't help much if you can't use anything but SQL and EF.
I'm not familiar with the Entity Framework, but I would lean towards the Entity-Attribute-Value (http://en.wikipedia.org/wiki/Entity-Attribute-Value_model) database model.
So, rather than creating tables and columns on the fly, your app would create attributes (or collections of attributes) and then your end users would complete the values.
But, as I said, I don't know what the Entity Framework is supposed to do for you, and it may not let you take this approach.
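For reference, a minimal sketch of the classic EAV layout described above; names and types are assumptions, and it shows how all values collapse to text:

CREATE TABLE eav_entity (
    entity_id   BIGINT IDENTITY PRIMARY KEY,
    entity_type NVARCHAR(100) NOT NULL        -- e.g. 'recipe', 'workLog'
);

CREATE TABLE eav_attribute (
    attribute_id   INT IDENTITY PRIMARY KEY,
    entity_type    NVARCHAR(100) NOT NULL,
    attribute_name NVARCHAR(100) NOT NULL
);

CREATE TABLE eav_value (
    entity_id    BIGINT NOT NULL REFERENCES eav_entity (entity_id),
    attribute_id INT NOT NULL REFERENCES eav_attribute (attribute_id),
    value_text   NVARCHAR(MAX) NULL,          -- everything stored as text: one of EAV's weaknesses
    PRIMARY KEY (entity_id, attribute_id)
);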
Not as a critical comment, but it may help save some of your time to point out that this is one of those Don Quixote / Holy Grail type issues. There's been an eternal quest, going for probably over 50 years now, to make a user-friendly database design interface.
The only quasi-successful ones that have gained any significant traction that I can think of are 1. Excel (and its predecessors), 2. Filemaker (the original, not its current flavor), and 3. (possibly, but doubtfully) Access. Note that the first two are limited to basically one table.
I'd be surprised if our collective conventional wisdom is going to help you break the barrier. But it would be wonderful.
Rather than re-implement SQL Server's CREATE TABLE statement, which was done many years ago by a team of programmers who were probably better than you or I, why not work on exposing SQL Server in a limited way to the users -- let them create their own schema in a limited way and leverage the power of SQL Server to do it properly.
I would just give them a copy of SQL Server Management Studio, and say, "go nuts!" Why reinvent a wheel within a wheel?
Check out this post: you can do it, but it's a lot of hard work :) If performance is not a concern, an XML solution could work too, though that is also a lot of work.

Is it acceptable to cross between databases?

I'm not sure what this practice is actually called, so perhaps someone can edit the title to more accurately reflect my question.
Let's say we have a site that stores objects of different types. Each type of object has its own database (a database of books and assorted information with its tables, a database of CDs and information with its tables, and so on). However, all of the objects have keywords, and the keywords should be uniform across all objects, regardless of type. A new database with a few tables is made to store keywords; however, each object database is responsible for mapping the object ID to a keyword.
Is that a good practice?
Is there a reason to have separate databases for each type of object? You would be better off using multiple tables, and joining them. For example, you may have a table GENERIC_OBJECT which holds things that are common across all types, and then a table called BOOK_OBJECT where BOOK_OBJECT.ID = GENERIC_OBJECT.ID for a given book. Another table would be CD_OBJECT where CD_OBJECT.ID = GENERIC_OBJECT.ID for a given CD. Then things like keywords that are common across all objects would be stored in the GENERIC_OBJECT table, and things that are specific to the item would go in the item's corresponding table.
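A rough sketch of that supertype/subtype layout in a single database, following the answer's naming (the non-key columns are made up):

CREATE TABLE GENERIC_OBJECT (
    ID       INT IDENTITY PRIMARY KEY,
    KEYWORDS NVARCHAR(400) NULL       -- attributes common to every object type
);

CREATE TABLE BOOK_OBJECT (
    ID   INT PRIMARY KEY REFERENCES GENERIC_OBJECT (ID),
    ISBN NVARCHAR(20) NULL            -- book-specific attributes (illustrative)
);

CREATE TABLE CD_OBJECT (
    ID    INT PRIMARY KEY REFERENCES GENERIC_OBJECT (ID),
    LABEL NVARCHAR(100) NULL          -- CD-specific attributes (illustrative)
);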
By separating them into different databases, you lose:
the ability to do ACID transactions (assuming you aren't using a two-phase commit solution).
the ability to have referential integrity.
JOINs across tables.
Thomas, what you're missing in your comment responses to our concerns about referential integrity is that you can't create a foreign key across two databases. If the two tables are in one database, then you can use foreign key constraints to ensure that when you delete an object, anything that relies upon its object ID is also deleted, and other similar things.
While it is possible to do joins across databases, I wouldn't generally split the data across databases just because they are of slightly different categories. Others have also mentioned the inability to use referential integrity across databases.
On the other hand, if each type of product has radically different front-end applications, or if you expect each database to become massively large, those might be reasons to consider leaving them in separate databases. (Although scaling isn't a problem for most modern databases).
Syntax example for cross-database joins:
SELECT *
FROM books b
INNER JOIN KeywordDB.dbo.Keywords k
ON b.keywordID = k.keywordID
In this example, you are performing the query from the local database that contains the books table, and you are joining to the other database. (This is a MS SQL syntax example)
No, it's a bad idea. By separating them into different databases, you significantly impair your ability to do JOIN queries.
It does seem a little too separated, but with some well-designed views it could work, especially if the views are simply lookups.
Why such separation in the first place?
As everyone has mentioned, this is, in general, not a good idea. However, to play devil's advocate, I've seen other developers do this. I'm sure there are some reasons one might want to accomplish it. If it's absolutely needed (not sure if you're asking for solutions, but) you might want to use some sort of synchronization to keep the data in sync, and have all (or whatever is needed) of the data in both databases.
This also isn't an ideal solution, but if you must use two different database types, this might be a better way to go about such a thing.
It could at least solve the issues that everyone has been outlining – though keep in mind that it does present a new problem… Is everything in sync?
Good luck, Frank
If the decision about whether to use one database or two is yours, I recommend going with just one database. The data in the two tables appears closely related, judging from your question. The size and complexity doesn't seem to merit splitting into two databases.
What's your DBMS? If it's Oracle, DB2, SQL Server, or even MS Access, you shouldn't have any trouble administering a single database with keyword data and object data in logically related tables.

Resources