Tips for building foreign keys into a legacy database - database

I've got a database that doesn't have any foreign keys. I've done some checks and there are a fair few orphaned records.
It's a pretty large database (500+ tables) and I'm looking at the possibility of building the foreign keys back in.
Is there a way to do this other than trawling through every single table over time?
Has anybody been through this process before who can offer some insights or tips on how to make it a little easier?
Any help or advice appreciated.

I assume you mean "doesn't have any foreign key constraints"...if there were no foreign keys, you wouldn't know which records matched at all.
Do the primary and foreign key fields have the same name? As in, the PK table has a "CustomerId" field and the FK table(s) also have a "CustomerId" field? If so, you might be able to query the column properties (perhaps using INFORMATION_SCHEMA, you didn't mention an RDBMS) to figure out some implied relationships. Just query for all the tables that have a field called "CustomerId" that is not a PK and there's a good (but not certain) bet that those tables should have an FK constraint to the Customer table. You could even use the output of the query to generate the DDL to create the constraints.
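As a rough sketch of that name-matching idea, here is a small Python script using the standard sqlite3 module as a stand-in, since no RDBMS was mentioned. The Customer, Orders and Invoice tables and the candidate_foreign_keys helper are made up for illustration. On SQL Server or similar you would read the column metadata from INFORMATION_SCHEMA instead of PRAGMA table_info, and the generated ALTER TABLE text is only output to review, not something SQLite itself would run.

```python
import sqlite3

def candidate_foreign_keys(conn):
    """Guess (child_table, column, parent_table) triples from matching column names."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name")]
    columns = {}   # table name -> rows of PRAGMA table_info
    pk_owner = {}  # lower-cased single-column PK name -> table that owns it
    for table in tables:
        info = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        columns[table] = info
        pk_cols = [row[1] for row in info if row[5] > 0]  # row[5] is the pk position
        if len(pk_cols) == 1:
            pk_owner[pk_cols[0].lower()] = table
    guesses = []
    for table in tables:
        for row in columns[table]:
            name, is_pk = row[1], row[5] > 0
            parent = pk_owner.get(name.lower())
            # A non-PK column named like another table's PK is a likely FK.
            if parent and parent != table and not is_pk:
                guesses.append((table, name, parent))
    return guesses

# Hypothetical legacy schema with no declared foreign keys.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER, Total REAL);
    CREATE TABLE Invoice (InvoiceId INTEGER PRIMARY KEY, CustomerId INTEGER, OrderId INTEGER);
""")

guesses = candidate_foreign_keys(conn)
# Generate DDL text for review; the constraint naming scheme is an assumption.
ddl = [
    f"ALTER TABLE {child} ADD CONSTRAINT FK_{child}_{col} "
    f"FOREIGN KEY ({col}) REFERENCES {parent} ({col});"
    for child, col, parent in guesses
]
for statement in ddl:
    print(statement)
```

The guesses are only as good as the naming conventions, so treat the output as a review list rather than something to run blindly. If two tables share the same single-column PK name, this sketch just keeps the last one it sees.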

You can work from the largest to smallest tables, or start with the least performant area of the database. Adding keys should help your performance significantly, but you'll have to resolve the orphan rows first. You may need input from the business for that. Expect them to be very confused about what's going on.
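To find the orphan rows before adding a constraint, an anti-join per suspected relationship is usually enough. This is a minimal sketch in Python with sqlite3. The tables, the sample data and the find_orphans helper are invented for illustration, and the same LEFT JOIN query should carry over to most RDBMSs with minor quoting changes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (CustomerId INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Orders (OrderId INTEGER PRIMARY KEY, CustomerId INTEGER);
    INSERT INTO Customer VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Orders VALUES (10, 1), (11, 2), (12, 99), (13, NULL);
""")

def find_orphans(conn, child, fk_col, parent, pk_col):
    """Return child rows whose non-NULL fk_col has no matching parent row."""
    sql = f"""
        SELECT c.*
        FROM "{child}" AS c
        LEFT JOIN "{parent}" AS p ON c."{fk_col}" = p."{pk_col}"
        WHERE c."{fk_col}" IS NOT NULL   -- NULLs would not violate an FK
          AND p."{pk_col}" IS NULL       -- no parent row matched
    """
    return conn.execute(sql).fetchall()

orphans = find_orphans(conn, "Orders", "CustomerId", "Customer", "CustomerId")
print(orphans)
```

Here only order 12 comes back, since customer 99 does not exist. Order 13 has a NULL CustomerId, which a foreign key constraint would allow, so it is not reported. What to do with the rows that are reported (delete, re-parent, or park under a placeholder parent) is the part that needs the business.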

Related

Creating a SQL database without defining primary key

So in my work environment we don't use a 'primary key' as defined by SQL Server. In other words, we don't right click a column and select "set as primary key".
We do however still have primary keys, we just use a unique ID column. In stored procedures we use these to access the data like you would in any relational database.
My question is: other than the built-in functionality that comes with defining a primary key in SQL Server (Entity Framework support, etc.), is there a good reason to use the 'primary key' functionality over just using a unique ID column and accessing your tables with that in your own stored procedures?
The biggest drawback I see (again, other than not being able to use Entity Framework and things like that) is that you have to keep track, mentally or otherwise, of which ID relates to which table.
There is nothing "special" about the PRIMARY KEY constraint. It's just a uniqueness constraint and you can achieve the same results by using the UNIQUE NOT NULL syntax to define your keys instead.
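A quick way to convince yourself of this is to declare a key with UNIQUE NOT NULL and watch it reject the same rows a primary key would. The sketch below uses Python's sqlite3 module rather than SQL Server, and the Widget table and try_insert helper are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No PRIMARY KEY keyword anywhere: the key is declared with NOT NULL UNIQUE.
conn.execute("CREATE TABLE Widget (WidgetId INTEGER NOT NULL UNIQUE, Name TEXT)")
conn.execute("INSERT INTO Widget VALUES (1, 'first')")

def try_insert(conn, widget_id, name):
    """Return True if the row was accepted, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO Widget VALUES (?, ?)", (widget_id, name))
        return True
    except sqlite3.IntegrityError:
        return False

duplicate_accepted = try_insert(conn, 1, "duplicate id")   # violates UNIQUE
null_accepted = try_insert(conn, None, "missing id")       # violates NOT NULL
new_accepted = try_insert(conn, 2, "second")               # fine

print(duplicate_accepted, null_accepted, new_accepted)
```

The duplicate and the NULL are both rejected and only the genuinely new row goes in, which is the behaviour you would expect from a primary key.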
However, uniqueness constraints (i.e. keys in general, not "primary" keys specifically) are very important for data integrity reasons. They ensure that your data is unique which means that sensible, meaningful results can be derived from your data. It's extremely difficult to get accurate results from a database that contains duplicate data. Also, uniqueness constraints are required to enforce referential integrity between tables, which is another very important aspect of data integrity. Poor data integrity is a data management problem that costs businesses billions of dollars every year and that's the bottom line of why keys are important.
There is a further reason why unique indexes are important: query optimization and performance. Unique indexes can improve query performance. If your data is supposed to be unique then creating a unique index on it will give the query optimizer the best chance of picking a good execution plan for your queries.
I think the drawback is not using the primary key at all and using a unique key constraint for something it wasn't intended to do.
Unique keys: You can have many of them. They are meant to offer a way to determine uniqueness among rows.
Primary key: like the Highlander, there can only be one. Its intended use is to identify the rows of the table.
I can't think of any good reason not to use a primary key. My opinion is that without a primary key, your table isn't actually a table. It's just a lump of data.
Follow Up: If you don't believe me, check out this guy who asked a bunch of DBAs if it was OK not to use a primary key.
Is it OK not to use a Primary Key When I don't Need one
There are philosophical and practical answers to your question.
The practical answer is that using the primary key constraint enforces "not null", and "unique". This protects you from application-level bugs.
The philosophical answer is that you want developers to operate at the highest possible level of abstraction, so that they don't have to stuff their brain full of detail when trying to solve problems.
Primary and foreign keys are abstractions that allow us to make assumptions about the underlying data model. We can think in terms of (business) entities, and their relationships.
In your workplace, you're forcing developers to think in terms of tables and indexes and conventions. You no longer think about "customers" and "orders" and "line items", but about software artefacts that represent those business entities, and the "we always represent uniqueness by a combination of a GUID and unique index" rule. That mental model is already complicated enough in most applications; you're just making it harder for yourselves, especially when bringing new developers into the team.

Dropping and recreating unexpected primary keys

I have a tool which uses SQL scripts to apply changes to a customer database. Often this involves changing a column definition (datatype etc). The problem is that often there are primary keys applied by the user that we don't know about (and they don't remember), which trips up the process (e.g. when changing columns belonging to the indexes or primary keys).
The requirement given to me is that this update process should be 'seamless', with no human involvement to prepare the ground. I have also researched this on this forum, and as far as I can see my particular question has not yet been asked.
I know how to disable and then later rebuild all indexes on a database, and even those only in certain tables, but if the index is on a primary key I still can't change any column that is part of the primary key unless I explicitly drop the PK by name, and later recreate it explicitly, which means I have to know about it at code-time. I can probably write a query to find the name of the primary key on a table if one is there, but how to know how to recreate it?
How can I, using Transact-SQL (or PL/SQL), detect, drop and then recreate the primary keys on given tables, without knowing at code time what they are or what columns belong to them? The key is that the tool cannot know in advance what the primary keys are on any given table, nor what they comprise. The SQL code must handle this itself.
Better still would be to detect if a known column belongs to a primary key, then drop and later recreate that after I have changed the column.
This needs to be done in both Oracle and SQL Server, ideally purely with SQL code.
TIA
I really don't understand why a customer would define their own primary keys for the tables. Moreover, I don't understand why you would let them. In my world, if a customer changes the schema in any way, this automatically means the end of support for them.
I would strongly advise against dropping and recreating primary keys on a production database. Any number of bad things can happen, leading to data loss.
And it's not just the PKs: you will have to drop the foreign key constraints first. FKs may reference not only PKs but unique constraints as well, so you have to deal with those too.
Your best bet would be to create a new table with the required schema, copy the data, drop the original table and rename the new one. Of course, you will have to handle the FKs, but it's easier. Check this link for an example:
http://sqlblog.com/blogs/john_paul_cook/archive/2009/09/17/script-to-create-all-foreign-keys.aspx
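The create/copy/drop/rename sequence suggested above can be sketched as follows. This uses Python's sqlite3 module only so the example is runnable. The Item table, the Item_new name and the INTEGER-to-TEXT change are all hypothetical, foreign keys pointing at the table are not handled here, and the rename syntax differs between Oracle and SQL Server.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode so the
# transaction boundaries below are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE Item (ItemCode INTEGER PRIMARY KEY, Label TEXT);
    INSERT INTO Item VALUES (1, 'bolt'), (2, 'nut');
""")

conn.execute("BEGIN")
try:
    # 1. New table with the required schema (ItemCode changes from INTEGER to TEXT).
    conn.execute(
        "CREATE TABLE Item_new (ItemCode TEXT NOT NULL PRIMARY KEY, Label TEXT)")
    # 2. Copy the data, converting the changed column on the way.
    conn.execute(
        "INSERT INTO Item_new (ItemCode, Label) "
        "SELECT CAST(ItemCode AS TEXT), Label FROM Item")
    # 3. Drop the original and 4. rename the new table into its place.
    conn.execute("DROP TABLE Item")
    conn.execute("ALTER TABLE Item_new RENAME TO Item")
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise

rows = conn.execute("SELECT ItemCode, Label FROM Item ORDER BY ItemCode").fetchall()
print(rows)
```

The point of the approach is that the script never needs to know the name of the old primary key constraint: the new table simply declares the key the tool wants.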

Database normalization - How not OK is it to have a table with no relationships?

I'm really new to database design, as I will now demonstrate:
I have an MS SQL database that I need to add a table to. The table contains information that pertains to another table. However, there are no candidates for primary keys (all fields can contain duplicates). The only thing the table will ever be used for is to keep records that may be required for a certain kind of query, and they can be retrieved super-easily using a field that my other tables also contain (but never uniquely).
Specifically, my main table has a bunch of chemistry records. Each chemistry record is associated with another set of records called quality-control records (in my second table). They are associated by a field called "BatchID". The super-easy part is that I can say, "get all records with this BatchID" and get exactly what I need. But there can be multiple instances of any BatchID in both tables (in fact, there usually are), so I'd need to jump through hoops to link them. In a more general sense, in theory, is it OK to have a table floating around not attached to anything?
The overwhelmingly simple solution is to just put the quality control in the db with no relationships to the chemistry table. I'd need to insert at least one other table to relate it to anything else, maybe more, and the only reason for complicating my life like that is that I don't want to violate some important precept of database design.
My question is, is it ever OK to just have a free-floating table in a database? Or is that right out?
Thanks for any help.
In theory, it's ok to have a table that doesn't have any foreign key constraints. But the table you describe (both tables you describe) should probably have a foreign key that references the table of batches. We'd expect the table of batches to have "BatchID" as its primary key.
The relational model requires tables to have at least one candidate key. It's almost always a bad idea to have a SQL table that doesn't have a candidate key.
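One possible shape for the batch table suggested above, sketched in Python with sqlite3 rather than SQL Server. The table and column names other than BatchID are assumptions, as are the surrogate ChemistryID and QualityControlID columns, which are added here only so that each table has a candidate key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE Batch (BatchID INTEGER PRIMARY KEY);
    CREATE TABLE Chemistry (
        ChemistryID INTEGER PRIMARY KEY,             -- assumed surrogate key
        BatchID INTEGER NOT NULL REFERENCES Batch (BatchID),
        Analyte TEXT
    );
    CREATE TABLE QualityControl (
        QualityControlID INTEGER PRIMARY KEY,        -- assumed surrogate key
        BatchID INTEGER NOT NULL REFERENCES Batch (BatchID),
        Note TEXT
    );
    INSERT INTO Batch VALUES (7);
    INSERT INTO Chemistry (BatchID, Analyte) VALUES (7, 'Pb'), (7, 'Cd');
    INSERT INTO QualityControl (BatchID, Note) VALUES (7, 'blank'), (7, 'duplicate');
""")

# "Get all records with this BatchID" stays as easy as before.
qc_notes = conn.execute(
    "SELECT Note FROM QualityControl WHERE BatchID = ? ORDER BY QualityControlID",
    (7,)).fetchall()

# A quality-control row for a batch that does not exist is now rejected.
try:
    conn.execute("INSERT INTO QualityControl (BatchID, Note) VALUES (8, 'stray')")
    stray_rejected = False
except sqlite3.IntegrityError:
    stray_rejected = True

print(qc_notes, stray_rejected)
```

Both tables can still repeat a BatchID as often as they like. The only new rule is that the batch has to exist.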

Relations in SQL Server and optimization

I was a developer on a project built with SQL Server and .NET. The team doesn't use physical relations between tables; they use logical ones ("logical foreign keys").
I asked them why they do that, and they said "it is more optimal".
What I really want to know is: is it really more optimal, or is it just a myth?
When it comes to reads from a database, whether foreign keys are defined or not doesn't come into it. There is no relationship between having foreign keys and the performance of reads.
Things that will affect performance are how the tables are stored, what indexes are defined on them and the stored statistics (just to name a few).
This is a bad justification for not having referential integrity in the database (in particular as it can be trivial to test).
Assuming that "logical foreign keys" are just values that reference a key in another table without a physical link between them in terms of constraints, I can tell you what the benefits of the physical link are.
First of all, a "physical" foreign key is a constraint, and it enforces referential integrity between the two values. If, for example, you try to use a foreign key value that doesn't exist in the other table, you will receive an error. The same thing will happen if you try to delete a key that is referenced by a foreign key constraint in another table.
Secondly, it is arguably more optimal, since you can index the foreign key columns and benefit from that, for example, when you use joins.
More on this: http://msdn.microsoft.com/en-us/library/ff647793.aspx
There is actually no physical difference between a "real" foreign key and a "logical" foreign key. They're both just columns in a table and don't affect the way that a table is stored on disk. This actually surprised me too when I first learned.
The only difference is that when you have a "real" foreign key, whenever a delete, update, or insert statement is run on a table, the database server has to check that the value being written is legitimate. If you look at the execution plan for an update, insert, delete, or merge statement, you'll see it has to scan or seek on all tables related through a foreign key.
This can be quite a performance overhead if there are a lot of foreign keys or there aren't helpful indexes.
Picture you have a table for Companies, and then another table for Employees. Your employees table will likely have a column called companyId.
When you run:
delete from Companies where companyId = 123;
The database server needs to make sure that there aren't any employees for that companyId. The same applies when you run:
insert into Employees (companyId, name) values (123, 'John');
The database server needs to search the companies table to make sure that the companyId 123 exists.
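Those two checks are easy to see in action. The sketch below reruns the Companies/Employees example in Python with sqlite3 instead of SQL Server. The employeeId column, the sample rows and the is_rejected helper are additions for illustration, and SQLite needs foreign key enforcement switched on explicitly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
conn.executescript("""
    CREATE TABLE Companies (companyId INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Employees (
        employeeId INTEGER PRIMARY KEY,
        companyId INTEGER NOT NULL REFERENCES Companies (companyId),
        name TEXT
    );
    INSERT INTO Companies VALUES (123, 'Acme');
    INSERT INTO Employees (companyId, name) VALUES (123, 'John');
""")

def is_rejected(conn, sql):
    """Return True if the statement fails a constraint check."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

# The delete has to look in Employees for rows that still point at company 123.
delete_rejected = is_rejected(conn, "DELETE FROM Companies WHERE companyId = 123")
# The insert has to look in Companies for companyId 456, which does not exist.
insert_rejected = is_rejected(
    conn, "INSERT INTO Employees (companyId, name) VALUES (456, 'Jane')")

print(delete_rejected, insert_rejected)
```

Both statements are refused. With only a "logical" foreign key, both would succeed and leave John pointing at nothing, or Jane pointing at a company that was never there.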
Yes it is faster to have only "logical" foreign keys. However, it comes at the cost of possible data corruption and might cost more time finding bugs and other sources of data corruption. Whether it's worth it is up to you. One thing to consider is that it doesn't affect read-only queries.
Edit: As Martin Smith pointed out and I had left out, there are some cases where the foreign key would be faster. If there is an inner join on a table with a foreign key, and no columns from the second table are referenced, then the query doesn't have to hit the second table since it can trust the foreign key.

Should a database table always have primary keys?

Should I always have a primary key in my database tables?
Let's take the SO tagging. You can see the tags in any revision; they're likely stored in a tag_rev table with the postID and revision number. Would I need a PK for that?
Also, since it is in a rev table and not currently in use, should the tags be a blob of tagIDs instead of multiple post_id/tagid pair entries?
A table should have a primary key so that you could identify each row uniquely with it.
Technically, you can have tables without a primary key, but you'll be breaking good database design rules.
You should strive to have a primary key in any non-trivial table where you're likely to want to access (or update or delete) individual records by that key. Primary keys can consist of multiple columns, and formally speaking, will be the shortest available superkey; that is, the shortest available group of columns which, together, uniquely identify any row.
I don't know what the Stack Overflow database schema looks like (and from some of the things I've read on Jeff's blog, I don't want to), but in the situation you describe, it's entirely possible there is a primary key across the post identifier, revision number and tag value; certainly, that would be the shortest (and only) superkey available.
With regards to your second point, while it may be reasonable to argue in favour of aggregating values in archive tables, it does go against the principle that each row/column intersection in a table ought to contain one single value. While it may slightly simplify development, there is no reason you can't keep to a normalised table with versioned metadata, even for something as trivial as tags.
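For what it's worth, a composite key over post, revision and tag might look like the sketch below. It uses Python's sqlite3 module, and the tag_rev layout and column names are guesses based on the question, not the real Stack Overflow schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tag_rev (
        post_id  INTEGER NOT NULL,
        revision INTEGER NOT NULL,
        tag      TEXT    NOT NULL,
        PRIMARY KEY (post_id, revision, tag)   -- one row per tag per revision
    )
""")
conn.executemany(
    "INSERT INTO tag_rev VALUES (?, ?, ?)",
    [(1, 1, "sql"), (1, 1, "database"), (1, 2, "sql")])

# The same tag on the same revision of the same post is rejected.
try:
    conn.execute("INSERT INTO tag_rev VALUES (1, 1, 'sql')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Tags stay queryable per revision without unpacking a blob.
tags_rev1 = [row[0] for row in conn.execute(
    "SELECT tag FROM tag_rev WHERE post_id = 1 AND revision = 1 ORDER BY tag")]

print(duplicate_rejected, tags_rev1)
```

Keeping one tag per row is what makes the last query trivial. With a blob of tag IDs you would have to parse the blob in application code to answer the same question.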
I tend to agree that most tables should have a primary key. I can only think of two times where it doesn't make sense to do it:
If you have a table that relates keys to other keys. For example, to relate a user_id to an answer_id, that table wouldn't need a primary key.
A logging table, whose only real purpose is to create an audit trail.
Basically, if you are writing a table that may ever need to be referenced in a foreign key relationship then a primary key is important, and if you can't be positive it won't be, then just add the PK. :)
See this related question about whether an integer primary key is required. One of the answers uses tagging as an example:
Are there any good reasons to have a database table without an integer primary key
For more discussion of tagging and keys, see this question:
Id for tags in tag systems
From MySQL 5.5 Reference Manual section 13.1.17:
If you do not have a PRIMARY KEY and an application asks for the PRIMARY KEY in your tables, MySQL returns the first UNIQUE index that has no NULL columns as the PRIMARY KEY.
So, technically, the answer is no. However, as others have stated, in most cases it is quite useful.
I firmly believe every table should have a way to uniquely identify a record. For 99% of the tables, this is a primary key. For the rest you may get away with a unique index (I'm thinking one-column lookup type tables here). Any time I have had to work with a table without a way to uniquely identify records, there has been trouble.
I also believe if you are using surrogate keys as your PK, you should, where at all possible, have a separate unique index on whatever combination of fields make up the natural key. I realize there are all too many times when you don't have a true natural key (names are not unique or what makes something unique might be spread across several parent-child tables), but if you do have one, please please please make sure it has a unique index or is created as the PK.
If there is no PK, how will you update or delete a single row? It would be impossible! To be honest, I have a few times used tables without a PK, for instance to store activity logs, but even in this case it is advisable to have one because the timestamps might not be granular enough. Temporary tables are another example. But according to relational theory the PK is mandatory.
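The timestamp problem is easy to reproduce. In this sketch (Python with sqlite3, with a made-up ActivityLog table) two log entries land in the same second, and an attempt to delete just one of them by its column values removes both.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No key at all: rows are identified only by their column values.
conn.execute("CREATE TABLE ActivityLog (LoggedAt TEXT, Message TEXT)")
conn.executemany(
    "INSERT INTO ActivityLog VALUES (?, ?)",
    [
        ("2024-01-01 10:00:00", "login"),
        ("2024-01-01 10:00:00", "login"),   # same second, same message
        ("2024-01-01 10:00:01", "logout"),
    ])

before = conn.execute("SELECT COUNT(*) FROM ActivityLog").fetchone()[0]
# Intended to remove ONE of the duplicate login rows.
conn.execute(
    "DELETE FROM ActivityLog WHERE LoggedAt = ? AND Message = ?",
    ("2024-01-01 10:00:00", "login"))
after = conn.execute("SELECT COUNT(*) FROM ActivityLog").fetchone()[0]

print(before, after)
```

Some engines offer an escape hatch (SQLite has a hidden rowid, for example), but that is engine-specific and not something to design around.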
It is good to have keys and relationships; they help a lot. However, if your app is good enough to handle the relationships, then you could possibly skip the keys (although I recommend that you have them).
Since I use Subsonic, I always create a primary key for all of my tables. Many DB Abstraction libraries require a primary key to work.
Note: that doesn't answer the "Grand Unified Theory" tone of your question, but I'm just saying that in practice, sometimes you MUST make a primary key for every table.
If it's a join table then I wouldn't say that you need a primary key. Suppose, for example, that you have tables PERSONS, SICKPEOPLE, and ILLNESSES. The ILLNESSES table has things like flu, cold, etc., each with a primary key. PERSONS has the usual stuff about people, each also with a primary key. The SICKPEOPLE table only has people in it who are sick, and it has two columns, PERSONID and ILLNESSID, foreign keys back to their respective tables, and no primary key. The PERSONS and ILLNESSES tables contain entities and entities get primary keys. The entries in the SICKPEOPLE table aren't entities and don't get primary keys.
Databases don't have keys, per se, but their constituent tables might. I assume you mean that, but just in case...
Anyway, tables with a large number of rows should absolutely have primary keys; tables with only a few rows don't need them, necessarily, though they don't hurt. It depends upon the usage and the size of the table. Purists will put primary keys in every table. This is not wrong; and neither is omitting PKs in small tables.
Edited to add a link to my blog entry on this question, in which I discuss a case in which database administration staff did not consider it necessary to include a primary key in a particular table. I think this illustrates my point adequately.
Cyberherbalist's Blog Post on Primary Keys
