Loosely Coupled Database Design - How To? - sql-server

I'm implementing a web-based application using Silverlight, with a SQL Server database on the back end for all the data the application will display. I want to ensure the application can scale easily, and I feel the way to go is to make the database loosely coupled and not tie everything up with foreign keys. I've tried searching for some examples, but to no avail.
Does anyone have any information or good starting points/samples/examples to help me get off the ground with this?
Help greatly appreciated.
Kind regards,

I think you're mixing up your terminology a bit. "Loosely coupled" refers to the desirability of having software components that aren't so dependent upon each other that they can't function or even compile without being together in the same program. I've never seen the term used to describe the relationships between tables in the same database.
I think if you search on the terms "normalization" and "denormalization" you'll get better results.

Unless you're doing massive amounts of inserts at a time, like with a data warehouse, use foreign keys. Normalization scales like crazy, and you should take advantage of that. Foreign keys are fast, and the constraint really only holds you back if you're inserting millions upon millions of records at a time.
Make sure that you're using integer keys with a clustered index on them. This should make joining tables very fast. The issues you can get yourself wrapped around without foreign keys are many and frustrating. I just spent all weekend untangling some, and we made a conscious choice to not have foreign keys (we have terabytes of data, though).
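As a minimal sketch (table and column names are hypothetical), this is the shape being described: narrow integer surrogate keys carrying the default clustered primary key index, with a foreign key supporting the join:

    CREATE TABLE dbo.Customers (
        CustomerId int IDENTITY(1,1) NOT NULL,
        Name       nvarchar(100)     NOT NULL,
        CONSTRAINT PK_Customers PRIMARY KEY CLUSTERED (CustomerId)
    );

    CREATE TABLE dbo.Orders (
        OrderId    int IDENTITY(1,1) NOT NULL,
        CustomerId int               NOT NULL,  -- FK column, joins on a narrow int
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId),
        CONSTRAINT FK_Orders_Customers FOREIGN KEY (CustomerId)
            REFERENCES dbo.Customers (CustomerId)
    );

    -- Joins on the clustered integer keys stay fast:
    SELECT c.Name, o.OrderId
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId;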

Before you even think of such a thing, you need to think about data integrity. Foreign keys exist so that you cannot put records into tables if the primary data they are based on is not there. If you do not use foreign keys, you will sooner or later (probably sooner) end up with worthless data, because you don't really know, for instance, which customer an order is attached to. Foreign keys are data protection; you should never consider not using them.
And even though you think all your data will come from your application, in real life this is simply not true. Data gets in from multiple applications, from imports of large amounts of data, from the query window (when someone decides to update all the prices, they aren't going to do it one price at a time from the user interface). Data can get into the database from many sources and must be protected at the database level. To do less is to put your entire application and data at risk.
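To make that concrete against the hypothetical Customers/Orders tables sketched earlier: with the foreign key in place, a stray statement run from the query window fails immediately (SQL Server raises error 547, the constraint-violation error) instead of silently creating an orphan:

    -- Rejected if no Customers row with CustomerId = 999 exists:
    INSERT INTO dbo.Orders (CustomerId)
    VALUES (999);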

Interesting comment about database security when data is input through external sources like database scripts.

Related

Do you absolutely need foreign keys in a database?

I was wondering how useful foreign keys really are in a database. Essentially, if the developers know what keys the different tables depend on, they can write the queries just as though there was a foreign key, right?
Also, I do see how foreign-key constraints help prevent all sorts of bugs with data integrity, but say, for example, the programmers do a good job of preserving data integrity, how necessary are foreign keys really?
If you don't care about referential integrity then you are right. But.... you should care about referential integrity.
The problem is that people make mistakes. Computers do not.
Regarding your comment:
but say for example, the programmers do a good job of preserving data integrity
Someone will eventually make a mistake. No one is perfect. Also if you bring someone new in you aren't always sure of their ability to write "perfect" code.
In addition to that you lose the ability to do cascading deletes and a number of other features that having defined foreign keys allow.
I think that assuming that programmers will always preserve data integrity is a risky assumption.
There's no reason why you wouldn't create foreign keys, and being able to guarantee integrity instead of just hoping for integrity is reason enough.
Not using referential integrity in a database is like not using seatbelts in cars. It will provide you with measurable improvements in taking you from A->B, but it will make "real" difference only in the most extreme cases. Why take the "risk" unless you really have to?
The underlying reason people ask this question is always performance.
Foreign keys give the optimizer much more information to work with, and it will potentially produce better execution plans. It's not that a specific query will be X percent faster with constraints enabled; it's more that you effectively eliminate entire classes of problems due to bad execution plans. You also enable the optimizer to rewrite queries in ways that just aren't possible without the constraints (join elimination, for example).
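A hedged illustration of join elimination, using hypothetical tables (this behavior requires the FK to be trusted and the FK column to be NOT NULL):

    -- No column from Customers is actually used, and the trusted FK
    -- guarantees every order has a matching customer, so SQL Server
    -- can remove the join from the plan and scan only Orders:
    SELECT o.OrderId, o.CustomerId
    FROM dbo.Orders AS o
    JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId;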
Starting right here, I would like to start a myth that referential integrity always increases performance in databases. I'm fairly confident that if 100 people designed their databases with full integrity checking, fewer than 5 would actually have to consider spending a whopping 1 second on disabling them for performance reasons. Out of those 5, there will be close to 0 who find that they need to disable 100% of the constraints.
Foreign keys are invaluable as a means of ensuring integrity, and even if you trust your developers to never (!) make errors the cost of having them is usually well worth it.
Foreign keys also serve as documentation, in that you can see what relates to what. This information is typically also used by tools, such as for generating reports, creating data sets from table definitions, object-relational mappers, etc. Even if you do not use any of these today, having FKs will make it easier to tread that path later.
Foreign keys also allow you to define cascade rules, which can, for example, be used to delete associated records in related tables when a row in one table is deleted.
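A minimal sketch of a cascade rule, using hypothetical Customers/Orders tables:

    ALTER TABLE dbo.Orders
        ADD CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId)
        ON DELETE CASCADE;

    -- Deleting the parent row now removes its orders automatically:
    DELETE FROM dbo.Customers WHERE CustomerId = 42;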
Only if you have ridiculously high loads should you consider bypassing FKs.
Edit: updated answer to include points from other answers (reports, cascades).
You said
but say for example, the programmers do a good job of preserving data integrity
The expression you were looking for is, "I'm 100% certain that every programmer and every database administrator will manually preserve data integrity perfectly no matter what application touches this database, no matter how complex the database becomes, from now until the time it's decommissioned."
You don't have to use them but why wouldn't you?
They are there to help. From making life easier with cascade updates and cascade deletes, to guaranteeing that constraints aren't violated.
Maybe the application honors the constraints, but isn't it useful to have them clearly specified? You could document them, or you could put them in the database where most programmers expect to find constraints they are supposed to conform to (a better idea I think!).
Finally, if you ever need to import data into this database which doesn't go via the front end, you may accidentally import data which violates the constraints and breaks the application.
I'd definitely not recommend skipping the relationships in a database.
Foreign keys make life so much easier when using report builders and data analysis tools. Just select one table, check the "include related tables" box, and BAM! you've got your report built. OK, OK, it's not that easy, but they certainly save time in that respect.
Use constraints rather than application logic to enforce integrity because it is generally easier, cheaper and more reliable to maintain constraints in one place (the database) rather than in every application.
I understand from one of your comments that your motivation for asking the question is that you think leaving out the keys may make it easier to evolve the database design during development. In my experience you are wrong about that. I find that it's actually better to be more restrictive with constraints in the early stages of development. If in doubt, create the constraint because it's much easier to remove constraints later than it is to create them. Removing a constraint will tend to break fewer things than adding one and generally requires less testing and fewer code changes to achieve.
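In SQL Server terms, the asymmetry looks like this (hypothetical constraint and tables):

    -- Removing a constraint later is a one-liner:
    ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Customers;

    -- Adding one later means every existing row must already comply,
    -- or the WITH CHECK validation step fails:
    ALTER TABLE dbo.Orders WITH CHECK
        ADD CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId);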
Another point to make is that when you scrap your current user interface and use a new one with shiny new tools, you won't lose your referential integrity because the new devs have no idea what should be related to what. Databases are generally in use much much longer than user interfaces. They are also often used by more than one application interface and then you have the problem of different interfaces trying to enforce different integrity rules.
I will also point out that I have had occasion to look at the data in, quite literally, hundreds of databases, and I have not found one yet that has good data if they didn't set up FKs. This bad data complicates reporting, and it complicates imports and exports to and from clients and other third-party vendors who need or provide the data. And if the bad data is in a financial area, it could also have legal and accounting implications. I can even remember one time when a company had thousands of bad inventory records in which the actual product being stored (and its location) was no longer identifiable, which also created issues with defining the value of the inventory necessary for financial reporting. This is not only bad from the perspective of not knowing what parts you have on hand; it enables people to steal parts without being caught, simply by deleting the part number from the part table (this particular place didn't have auditing in place either).
Folks have offered up some good answers above. However, one important point I didn't see mentioned is that foreign keys make your entity relationship diagrams (ERDs) easier to generate and much more meaningful. Without FKs, you either need to depict the FK relationships on your ERD manually (painful for you) or not at all (painful for others, and perhaps even for yourself once your memory of the implied FK relationships starts to fade over time). With FKs explicitly defined, most tools that automatically generate ERDs from database object definitions will automatically detect and depict the FK relationships.
Perhaps the question should be "How bad are orphan records?". In many cases orphaned records aren't really going to hurt anything. Yes, these records may persist until the end of time, but how bad is this really? Cascading updates or deletes are rarely useful features. Referential integrity sounds nice, but I think it is not as important as we have been led to believe. The biggest benefit of FKs is the documentation they provide. In my experience, FKs for referential integrity are way more trouble than they are worth.
I am having the same question today, and found many articles online talking about why you don't have to use foreign keys. But so far, 10 of the 11 answers here say you should have FKs.
I am not a db expert and just want to share some points I found online about when and why you don't have FKs:
Some points from "9 reasons why there are no foreign keys constraints":
Performance
Legacy data
Full table reload
Higher level framework
Cross database relations
Database platform agnostic
Open for change
Lazy architect
Keep model a secret
Some points from "At GitHub we do not use foreign keys, ever, anywhere":
FKs are in your way to shard your database.
FKs are a performance impact.
FKs don't work well with online schema migrations.
Note: I don't have any opinions. Just sharing some online articles to provide a different answer to most of the current ones.

Automatic database schema generation system?

I'm working with a client who has a piece of custom website software that has something I haven't seen before. It has a MySQL database backend, but most of the tables are auto-generated by the PHP code. This allows end users to create tables and fields as they see fit. So it's a database within a database, but obviously without all the features available in the 'outermost' database. There are a couple of tables that are basically mappings of auto-generated table names and fields to user-friendly table names and fields. This makes queries feel very unintuitive :P
They are looking for some additional features, ones that are immediately available when you use the database directly, such as data type enforcement, foreign keys, unique indexes, etc. But since this is a database within a database, all those features have to be added into the PHP code that runs the database. The first thing that came to my mind is the Inner-Platform Effect -- but I don't see a way to get out of database emulation and still provide them with the features they need!
I'm wondering, could I create a system that gives users a nerfed ability to create 'real' tables, thus gaining all the relational features for free? In the past, it's always been the developer/admin who made the tables, and then the users did CRUD operations through the application. I just have an uncomfortable feeling about giving users access to schema operations, even when it is through the application. I'm in uncharted territory.
Is there a name for this kind of system? Internally, in the code, this is called a 'collection' system. The naming of 'virtual' tables and fields within the database is called a 'taxonomy'. Is this similar to CCK or the taxonomy modules in Drupal? I'm looking for models of software that do this kind of thing, so I can see what the pitfalls and benefits are. Basically, I'm looking for more outside information about this kind of system.
Note this is not a simple key-value mapping, as the Wikipedia article on the inner-platform effect references. These work like actual tuples of multiple cells -- like simple database tables.
I've done this; you can make it pretty simple or go completely nuts with it. You do run into problems, though, when you put it into customers' hands: are we going to ask them to figure out primary keys, unique constraints and foreign keys?
So assuming you want to go ahead with that in mind, you need some type of data dictionary, aka meta-data repository. You have a start, but you need to add the ideas that columns are collected into tables, then specify primary and foreign keys.
After that, generating DDL is fairly trivial. Loop through tables, loop through columns, build a CREATE TABLE command. The only hitch is that you need to sequence the tables so that parents are created before children. That is not hard: implement a topological ordering (http://en.wikipedia.org/wiki/Topological_ordering).
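A sketch of that ordering in T-SQL, against hypothetical dictionary tables MetaTables(TableName) and MetaRelations(ChildTable, ParentTable); it assigns each table a level so parents sort before children (and assumes no circular references):

    WITH Ordered AS (
        -- Level 0: tables that are nobody's child
        SELECT t.TableName, 0 AS Lvl
        FROM MetaTables AS t
        WHERE NOT EXISTS (SELECT 1 FROM MetaRelations AS r
                          WHERE r.ChildTable = t.TableName)
        UNION ALL
        -- Each child sits at least one level below its parent
        SELECT r.ChildTable, o.Lvl + 1
        FROM Ordered AS o
        JOIN MetaRelations AS r ON r.ParentTable = o.TableName
    )
    SELECT TableName, MAX(Lvl) AS CreateOrder
    FROM Ordered
    GROUP BY TableName
    ORDER BY CreateOrder;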
At the second level, you first have to examine the existing database and then sometimes issue only ALTER TABLE ADD COLUMN... commands. So it starts to get complicated.
Then things continue to get more complicated as you consider allowing DEFAULTS, specifying indexes, and so on. The task is finite, but can be much larger than it seems.
You may wish to consider how much of this you really want to support, and then make a value judgment about coding it up.
My triangulum project does this: http://code.google.com/p/triangulum-db/ but it is only at Alpha 2 and I would not recommend using it in a Production situation just yet.
You may also look at Doctrine, http://www.doctrine-project.org/, they have some sort of text-based dictionary to build databases out of, but I'm not sure how far they've gone with it.

Database Designing: An art or headache (Managing relationships)

I have seen in my past experience that most people don't use physical relationships in tables; they try to remember them and apply them through code only.
Here 'Physical Relationships' refer to Primary Key, Foreign Key, Check constraints, etc.
While designing a database, people try to normalize the database on paper and keep things documented. Like, if I have to create a database for a marketing company, I will try to understand its requirements.
For example, what fields are mandatory, what fields will contain only (a or b or c) etc.
When all the things are clear, then why are most people afraid of the constraints?
Don't they want to manage things?
Do they lack the knowledge (which I don't think is the case)?
Are they not confident about future problems?
Is it really a tough job managing all these entities?
What is the reason, in your opinion?
I always have the DBMS enforce both primary key and foreign key constraints; I often add check constraints too. As far as I am concerned, the data is too important to run the risk of inaccurate data being stored.
If you think of the database as a series of stored true logical propositions, you will see that if the database contains a false proposition - an error - then you can argue to any conclusion you want. Given a false premise, any conclusion is true.
Why don't other people use PK and FK constraints, etc?
Some are unaware of their importance (so lack of knowledge is definitely a factor, even a major factor). Others are scared that they will cost too much in performance, forgetting that one error that has to be fixed may easily use up all the time saved by not having the DBMS do the checking for you. I take the view that if the current DBMS can't handle them well, it might be (probably is) time to change DBMS.
Many developers will check the constraints in code above the database before they actually go to perform an operation. Sometimes, this is driven by user experience considerations (we don't want to present choices/options to users that can't be saved to the database). In other cases, it may be driven by the pain associated with executing a statement, determining why it failed, and then taking corrective action. Most people would consider code more maintainable if it did the check upfront, along with other business logic that might be at play, rather than taking corrective action through an exception handler. (Not that this is necessarily an ideal line of thinking, but it is a prevalent one.)
In any case, if you are doing the check in advance of issuing the statement, and not particularly conscious of the fact that the database might get touched by applications/users who are not coming in through your integrity-enforcing code, then you might conclude that database constraints are unnecessary, especially given the performance hit that could be incurred from their use. Also, if you are checking integrity in the application code above the database, one might consider it a violation of DRY (Don't Repeat Yourself) to implement logically equivalent checks in the database itself. The two manifestations of integrity rules (those in database constraints and those in application code above the database) could in principle become out of sync if not managed carefully.
Also, I would not discount option 2, that many developers don't know much about database constraints, too readily.
Well, I mean, everyone is entitled to their own opinion and development strategy I suppose, but in my humble opinion these people are almost certainly wrong :)
The reason, however, someone may wish to avoid constraints is efficiency. Not because constraints are slow, but because storing redundant data (i.e. caching) is a very effective way of speeding up (well, avoiding) an expensive calculation. This is an acceptable approach, when implemented properly (i.e. the cache is updated at regular/appropriate intervals; generally I do this with a trigger).
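A hedged sketch of that trigger-maintained cache (hypothetical tables and columns; a real version would also handle UPDATE and DELETE):

    CREATE TRIGGER trg_OrderLines_CacheTotal ON dbo.OrderLines
    AFTER INSERT
    AS
    BEGIN
        -- Recompute the cached total only for orders touched by this insert
        UPDATE o
        SET CachedTotal = (SELECT SUM(ol.Amount)
                           FROM dbo.OrderLines AS ol
                           WHERE ol.OrderId = o.OrderId)
        FROM dbo.Orders AS o
        JOIN inserted AS i ON i.OrderId = o.OrderId;
    END;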
As to the motivation to not use FKs absent a caching motivation, I can't imagine it. Perhaps they aim to be 'flexible' in their DB structure. If so, fine, but then don't use a relational DB, because it's pointless. Non-relational DBs (OO DBs) certainly have their place, and may even arguably be better (quite arguable, but interesting to argue), but it's a mistake to use a relational DB and not use its core properties.
I would always define PK and FK constraints, especially when using an ORM. It really makes life easy for everybody to let the ORM reverse-engineer the database instead of manually configuring it to use some PKs and FKs.
There are several reasons for not enforcing relationships in descending order of importance:
People-friendly error handling.
Your program should check constraints and send an intelligible message to the user. For some reason, normal people don't like "SQL exception code -100013 goble rule violated for table gook".
Operational flexibility.
You don't really want your operators trying to figure out which order you must load your tables in at 3 a.m., nor do you want your testers pulling their hair out because they cannot reset the database back to its starting position.
Efficiency.
Checking constraints does consume I/O and CPU.
Functionality.
It's a cheap way to save details for later recovery. For instance, in an online order system you could leave the detail item rows in the table when the user kills a parent order; if he later reinstates the order, the details reappear as if by a miracle -- you achieve this extra feature by deleting lines of code. (Of course you need some housekeeping process, but it is trivial!)
As things get more complex and more tables and relationships are needed in the database, how can you ensure the database developer remembers to check all of them? When you make a change to the schema that adds a new "informal" relationship, how can you ensure all the application code which might be affected gets changed?
Suddenly you could be deleting records that should stay, because they have related data the developer forgot to check when writing the delete process, or because that process was in place before the last ten related tables were added to the schema.
It is foolhardy in the extreme not to formally set up PK/FK relationships. I process data received from many different vendors and databases. You can tell which ones have data integrity problems, most likely caused by a failure to explicitly define relationships, by the poor quality of their data.
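The classic check for that kind of problem is an orphan query; a sketch against hypothetical Orders/Customers tables:

    -- Child rows whose parent no longer exists:
    SELECT o.OrderId, o.CustomerId
    FROM dbo.Orders AS o
    LEFT JOIN dbo.Customers AS c ON c.CustomerId = o.CustomerId
    WHERE c.CustomerId IS NULL;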

Pros and cons of programmatically enforcing foreign key than in database

It is causing so much trouble in development just to let the database enforce foreign keys. Especially during unit tests, I can't drop tables due to foreign key constraints, and I need to create tables in such an order that foreign key constraint warnings won't get triggered. In reality, I don't see much point in letting the database enforce foreign key constraints. If the application has been properly designed, there should not be any manual database manipulation other than select queries. I just want to make sure that I am not digging myself into a hole by not having foreign key constraints in the database and leaving it solely to the application's responsibility. Am I missing anything?
P.S. My real unit tests (not those that use mocking) will drop existing tables if the structure of the underlying domain object has been modified.
In my experience, if you don't enforce foreign keys in a database, then eventually (assuming the database is relatively large and heavily used) you will end up with orphaned records. This can happen in many ways, but it always seems to happen.
If you index properly, there should not be any performance advantages to foreign keys.
So the question is, does the potential damage/hassle/support cost/financial cost of having orphaned records in your database outweigh the development and testing hassle?
In my experience, for business applications I always use foreign keys. It should just be a one-time setup cost to get your build scripts working correctly, and the data stability will more than pay for that over the life of an application.
The point of enforcing the rules in the database is that it's declarative - e.g. you do not have to write a ton of code to handle it.
As far as your unit tests, just delete tables in the proper order. You just have to write a function to do it right once.
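Two common ways to handle the teardown, sketched with hypothetical tables (sp_MSforeachtable is undocumented but ships with SQL Server):

    -- Option 1: delete in child-before-parent order
    DELETE FROM dbo.Orders;     -- child first
    DELETE FROM dbo.Customers;  -- then parent

    -- Option 2: disable all FK checks, clean up in any order, then
    -- re-enable WITH CHECK so the constraints are validated and trusted again
    EXEC sp_MSforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL';
    -- ... deletes here ...
    EXEC sp_MSforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL';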
Your issues in development should not drive the DB design. Constantly rebuilding a DB is a developer use case, not a customer use case.
Also, the DB constraints help beyond your application. You never know what your customer might try to do. Don't overdo it, but you need a few.
It might seem like you can rely on your applications to follow implied rules, but unless you enforce them eventually someone will make a mistake.
Or maybe 5 years from now someone will do a tidy-up of old records "which are no longer needed" and not realise that there is data in other tables still referencing them. Then a few days/weeks later you or your successor gets the fun job of trying to repair the mess that the database has gotten into. :-)
Here's a nice discussion of that in a previous question on SO: What's wrong with foreign keys?. [Edit]: The argument there is to create non-enforced foreign keys, so you get some of the pros if any of the cons apply.
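For reference, a non-enforced foreign key in SQL Server looks like this (hypothetical tables): the relationship is declared for documentation and tooling, but is never checked and is not trusted by the optimizer:

    -- WITH NOCHECK skips validating existing rows...
    ALTER TABLE dbo.Orders WITH NOCHECK
        ADD CONSTRAINT FK_Orders_Customers
        FOREIGN KEY (CustomerId) REFERENCES dbo.Customers (CustomerId);

    -- ...and disabling it stops enforcement for future changes too
    ALTER TABLE dbo.Orders NOCHECK CONSTRAINT FK_Orders_Customers;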
If the application has been properly designed there should not be any manual database manipulation other than select queries
What? What kind of koolaid are you drinking? Most databases applications exist to manipulate the data in the database not just to see it. Generally the whole purpose of the application is to add the new orders or create the new customer records or document the customer service calls etc.
Foreign keys are for data integrity. Data integrity is critical to being able to use the data with any reliability. Databases without data integrity are useless and can cause companies to lose money. This trumps your self-centered view that FKs aren't needed because they make development more complicated for you. The data is far more important than your convenience in running tests (which can be written to account for the FKs).

GUIDs as Primary Keys - Offline OLTP

We are working on designing an application that is typically OLTP (think: purchasing system). However, this one in particular has the need that some users will be offline, so they need to be able to download the DB to their machine, work on it, and then sync back once they're on the LAN.
I would like to note that I know this has been done before, I just don't have experience with this particular model.
One idea I thought about was using GUIDs as table keys. So for example, a Purchase Order would not have a number (auto-numeric) but a GUID instead, so that every offline client can generate those, and I don't have clashes when I connect back to the DB.
Is this a bad idea for some reason?
Will access to these tables through the GUID key be slow?
Have you had experience with these type of systems? How have you solved this problem?
Thanks!
Daniel
Using Guids as primary keys is acceptable and is considered a fairly standard practice for the same reasons that you are considering them. They can be overused which can make things a bit tedious to debug and manage, so try to keep them out of code tables and other reference data if at all possible.
The thing that you have to concern yourself with is the human-readable identifier. GUIDs cannot be exchanged by people - can you imagine trying to confirm your order number over the phone if it is a GUID? So in an offline scenario you may still have to generate something - like a publisher (workstation/user) ID and some sequence number - so the order number may be 123-5678.
However, this may not satisfy business requirements of having a sequential number. In fact, regulatory requirements can be an influence - some regulations (SOX, maybe) require that invoice numbers be sequential. In such cases it may be necessary to generate a sort of proforma number which is fixed up later when the systems synchronise. You may land up with tables having OrderId (GUID), OrderNo (int), ProformaOrderNo (varchar) - some complexity may creep in.
At least having guids as primary keys means that you don't have to do a whole lot of cascading updates when the sync does eventually happen - you simply update the human readable number.
#SqlMenace
There are other problems with GUIDs. You see, GUIDs are not sequential, so inserts will be scattered all over the place; this causes page splits and index fragmentation.
Not true. Primary key != clustered index.
If the clustered index is another column ("inserted_on" springs to mind) then the inserts will be sequential and no page splits or excessive fragmentation will occur.
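A sketch of that layout (hypothetical table): the GUID primary key is declared NONCLUSTERED, and the clustered index goes on an ever-increasing datetime column so inserts land at the end of the table:

    CREATE TABLE dbo.PurchaseOrders (
        OrderId    uniqueidentifier NOT NULL,
        InsertedOn datetime NOT NULL
            CONSTRAINT DF_PurchaseOrders_InsertedOn DEFAULT GETDATE(),
        CONSTRAINT PK_PurchaseOrders PRIMARY KEY NONCLUSTERED (OrderId)
    );

    CREATE CLUSTERED INDEX IX_PurchaseOrders_InsertedOn
        ON dbo.PurchaseOrders (InsertedOn);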
This is a perfectly good use of GUIDs. The only drawbacks would be a slight complexity in working with GUIDs over INTs and the slight size difference (16 bytes vs 4 bytes).
I don't think either of those are a big deal.
Will access to these tables through the GUID key be slow?
There are other problems with GUIDs. You see, GUIDs are not sequential, so inserts will be scattered all over the place; this causes page splits and index fragmentation.
In SQL Server 2005, MS introduced NEWSEQUENTIALID() to fix this; the only problem for you might be that you can only use NEWSEQUENTIALID() as a default value in a table.
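That restriction looks like this in practice (hypothetical table); the function is only valid in a column DEFAULT:

    CREATE TABLE dbo.Orders (
        OrderId uniqueidentifier NOT NULL
            CONSTRAINT DF_Orders_OrderId DEFAULT NEWSEQUENTIALID(),
        OrderDate datetime NOT NULL,
        CONSTRAINT PK_Orders PRIMARY KEY CLUSTERED (OrderId)
    );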
You're correct that this is an old problem, and it has two canonical solutions:
Use unique identifiers as the primary key. Note that if you're concerned about readability you can roll your own unique identifier instead of using a GUID. A unique identifier will use information about the date and the machine to generate a unique value.
Use a composite key of 'Actor' + identifier. Every user gets a numeric actor ID, and the keys of newly inserted rows use the actor ID as well as the next available identifier. So if two actors both insert a new row with ID "100", the primary key constraint will not be violated.
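A minimal sketch of this composite-key shape (hypothetical names):

    -- Each client gets an ActorId; LocalId is the next available number
    -- on that client, so offline inserts can never collide.
    CREATE TABLE dbo.PurchaseOrder (
        ActorId   int NOT NULL,
        LocalId   int NOT NULL,
        OrderDate datetime NOT NULL,
        CONSTRAINT PK_PurchaseOrder PRIMARY KEY (ActorId, LocalId)
    );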
Personally, I prefer the first approach, as I think composite keys are really tedious as foreign keys. I think the human-readability complaint is overstated -- end users shouldn't have to know anything about your keys, anyway!
Make sure to utilize guid.comb - it takes care of the indexing stuff. If you are dealing with performance issues after that, then you will be, in short order, an expert on scaling.
Another reason to use GUIDs is to enable database refactoring. Say you decide to apply polymorphism or inheritance or whatever to your Customers entity. You now want Customers and Employees to derive from Person and have them share a table. Having really unique identifiers makes data migration simple. There are no sequences or integer identity fields to fight with.
I'm just going to point you to What are the performance improvement of Sequential Guid over standard Guid?, which covers the GUID talk.
For human readability, consider assigning machine IDs and then using sequential numbers from those machines as a possibility. This will require managing the assignment of machine IDs, though. Could be done in one or two columns.
I'm personally fond of the SGUID answer, though.
Guids will certainly be slower (and use more memory) than standard integer keys, but whether or not that is an issue will depend on the type of load your system will see. Depending on your backend DB there may be issues with indexing guid fields.
Using GUIDs simplifies a whole class of problems, but you pay for it partly with performance and also debuggability - typing GUIDs into those test queries will get old real fast!
The backend will be SQL Server 2005.
Frontend / application logic will be .NET.
Besides GUIDs, can you think of other ways to resolve the "merge" that happens when the offline computer syncs the new data back into the central database?
I mean, if the keys are INTs, I'll basically have to renumber everything when importing. GUIDs will spare me that.
Using GUIDs saved us a lot of work when we had to merge two databases into one.
If your database is small enough to download to a laptop and work with it offline, you probably don't need to worry too much about the performance differences between ints and Guids. But do not underestimate how useful ints are when developing and troubleshooting a system! You will probably need to come up with some fairly complex import/synch logic regardless of whether or not you are using Guids, so they might not help as much as you think.
#Simon,
You raise very good points. I was already thinking about the "temporary" "human-readable" numbers I'd generate while offline, which I'd recreate on sync. But I wanted to avoid having to do that with foreign keys, etc.
I would start by looking at SQL Server Compact Edition for this! It helps with all of your issues.
Data Storage Architecture with SQL Server 2005 Compact Edition
It is specifically designed for field force applications (FFAs). FFAs usually share one or more of the following attributes:
They allow the user to perform their job functions while disconnected from the back-end network - on-site at a client location, on the road, in an airport, or from home.
FFAs are usually designed for occasional connectivity, meaning that when users are running the client application, they do not need to have a network connection of any kind. FFAs often involve multiple clients that can concurrently access and use data from the back-end database, both in a connected and disconnected mode.
FFAs must be able to replicate data from the back-end database to the client databases for offline support. They also need to be able to replicate modified, added, or deleted data records from the client to the server when the application is able to connect to the network.
First thought that comes to mind: Hasn't MS designed the DataSet and DataAdapter model to support scenarios like this?
I believe I read that MS changed their ADO recordset model to the current DataSet model so it works great offline too. And there's also Sync Services for ADO.NET.
I believe I have seen code that utilizes the DataSet model which also uses foreign keys, and they still sync perfectly when using the DataAdapter. I haven't tried out the Sync Services, though, but I think you might be able to benefit from them too.
Hope this helps.
#Portman By default, PK == clustered index: creating a primary key constraint will automatically create a clustered index. You need to specify NONCLUSTERED if you don't want it clustered.
