I haven't touched database since graduated from school, so please forgive me if my question is too entry level.
As i remember how to draw ERD with UML, recently, my boss asked me to create a database for inventory system with frontend. I googled some similar systems and found that in backend their databases don't have any relationships between tables ( i did DB reversed UML ).
So I thought about it, it seems application works fine even without relationships ( no foreign keys ), so what's the point we have reasons we still need relationship between tables?
This is one of the areas where there is often a noticeable disparity between the theory that is taught in CS courses and the reality of what happens in practice.
Often what you'll run into is a mash-up between the two: an ERD model that shows all the proper relationships and keys, and the "reality" of what actually gets implemented in the database.
The implementation side is probably the part that catches people by surprise, as you have seen: no relationships defined, and foreign keys are simply implied by the matching column names across different tables. This is a tradeoff.
One one hand, managing foreign keys in a database has overhead. Every time a row is added or modified, the database will need to examine those foreign keys and make sure that the change will preserve the relational integrity. After all, that's what you are asking for when you define those relationships, right? And in an ideal world where that overhead is negligible, this is probably a good thing, because as DBA's we like it when our physical implementation matches the idealized model we spent all that time creating. We sleep better knowing that every entry in the customer table references a valid location in the company_location table.
On the other hand, there is reality. That overhead is not something we can easily ignore. Not when that nightly batch load is 4 hours late, and some marketing manager is asking you every 10 minutes for an estimate on when his data will be available. So we cut some corners and make some compromises. And hey, we're pretty good programmers, right? Certainly we can code the application in a way that will always maintain the referential integrity of the database without having to spend all that extra time to deal with foreign keys in the database....well, maybe.... the truth is that it is really hard to be sure that RI will always be preserved by an application that is already implementing some potentially complex business logic.
There are, of course, many other reasons for using explicit RI, and plenty of good reasons for ignoring it in the physical implementation. You are right, at the end of the day applications often do work OK without relationships being defined. And at the end of the day, I will probably get home safe even if I don't put on my seatbelt for the drive. But having the relationships implemented in the database is a pretty solid insurance policy when it comes to guaranteeing the integrity our a data. Analysts that use that database to generate business insights like consistent data. And transactional applications might depend on the assumption that the data is relationally consistent.
Î guess my point is that there is no "always right" answer here, and it really is a case-by-base thing. I would just suggest to start from the assumption that you'll physically implement the model, complete with RI, as faithfully as possible. Then, if you find hot spots, carefully and conservatively relax those constraints as needed.
Foreign key's take care of referential integrity.
To explain this a bit more: By adding a foreign key you are saying "what is in this column must be in the column I am pointing to as well". This makes sure your naming stay's consistent.
If you did not do this mistakes could be introduced when adding redundant information. Like calling "James", "Jamed" by mistake.
Related
I trying to model horse racing in a database and have been presented with two alternative ERDs. I'm no, expert on database design and was hoping someone could lay out for me the advantages/disadvantages of Alternative A vs Alternative B. Is it a case of one being preferable to another absolutely or is A preferable under certain circumstances (what are those circumstances) an B better in other circumstances.
From my very inexperienced perspective, it appears that it would be easier to query Alternative B.
versus
debater brings up some good points that you should certainly consider.
Based on your reply to my question, I would suggest the following structure (which you'll see is quite similar to "B"). I pulled the fields for the Horse and Jockey tables pretty much out of thin air - if they don't fit your needs, obviously you'd ignore them and store whatever you need.
Horse
ID primary key (in each table)
Name
Age
Gender
etc.
Jockey
ID
Name
Height
Weight
etc.
Race
ID
Post_Time
Distance
etc.
Starter
ID
Race_ID foreign key to Race
Horse_ID foreign key to Horse
Jockey_ID foreign key to Jockey
Gate
Finish_Position
etc.
With this structure, a race is an event that happens at a specific time and that involves specific horses and riders. Does that make sense?
Industrial databases tend to have a lot of historical (logging, auditing, trail, etc) tables in them. Typically there are more of these kinds of table than anything else.
Real-world applications tend to need this information. Often there is a requirement for auditing (who did what, when, how and why, etc.). Often there is a requirement for an 'undo' function, which needs the historical data to disentangle things. Sometimes there is a requirement for analysis of past performance and so on.
I've noticed that commercial databases often have no deletion of data. Records just change status (e.g. by having a 'Historical' flag set). I've also noticed that commercial and industrial databases tend to have huge numbers of tables, many of which are no longer used or needed (but nobody had the time to weed out).
Perhaps worth mentioning is that the sooner you start collecting historical data, the more you will have when you come to have a need for it.
And I suppose that is why multi-terabyte disks are selling like hotcakes. 😀
I suggest you give thought early to how your old data is to be archived away (or deleted), before you get to a crisis.
Another point to mention is that data protection laws may require you to delete some data in a timely (and thorough) manner.
A friend of mine is creating an application after she designed the database. However, she just put everything in two 1st NF tables, which have a few functional dependencies on non-PK columns. Indeed, she added the PKs after I suggested her. The 3rd+ NF design of her system may require 7 to 15 tables.
I pointed out that there will be a couple of possible update abnormal. However, she said she has stored procedures for insertion and updates so the update abnormal will never happen since she will force the users/applications insert/update the data via the stored procedures.
Is there any other good reason to persuade her to design the system with higher normal form? Or is her solution good enough practically?
She is probably unskilled and unaware of it. It has already been demonstrated at large that there is no way to help such people.
She probably thinks that she can get all that code in those SP's right, without making a single mistake. That is more than just likely a ludicrous overestimation of her own abilities.
I advise you to let her simply make her mistakes. For YOU, it will save the time of trying and the frustration of not succeeding in getting the message through, and for HER, it will give her the opportunity to learn from her own mistakes (only option left since she's apparently not really willing to learn from other peoples' mistakes), and it will save her the frustration of having to undergo your later coming 'round with that glorious "I told you so.".
But if you really insist on giving it a try, then you might point out to her that :
her 1NF tables are actually views (materialized views at that) on the 3/BC NF tables that she is willingly not implementing.
the updates that her end-user probably expects to be doing, are most likely, however, updates to those 7-15 3/BC NF tables. (I say this because you indicated as her sole reason for the lower NF that she wants to avoid such a "huge" number of tables - the cynicism in "huge" is intentional.)
but the updates that the DBMS is expecting, are updates to the 1NF tables (which are views).
Therefore, her problem boils down to "distilling the appropriate updates to 3NF tables from updates that are specified as updates to views on those tables".
Therefore, getting all her SP code to be correct, is a problem of how to do view updating.
And guess what, after more than 40 years of research on the relational model, by thousands of researchers whose intellectual abilities exceed hers probably by orders of magnitude, it is exactly this problem of view updating that still stands unsolved ... (although granted, her problem is not "view updating in general", but "view updating in some given specific situation". Nonetheless, I very much doubt that she has the skill to spot all the intricacies involved.)
I'm looking at the pros and cons of these three primary methods of coming up with primary keys for database rows.
So assuming I am using a database that supports more than one of these methods, is there a simple heuristic to determine what the best option would be for me?
How do considerations such a distributed/multiple masters, performance requirements, ORM use, security and testing have on the choice?
Any unexpected drawbacks that one might run into?
UUIDs
Unless these are generated "in increasing monotonic sequence" they can drastically hurt/fragment indexes. Support for UUID generation varies by system. While usable, I would not use a UUID as my primary clustered index/PK in most cases. If needed I would likely make it a secondary column, perhaps indexed, perhaps not.
Some people argue that UUIDs can be used to safely generate/merge records from an arbitrary number of systems. While a UUID (depending upon method) generally has an astronomically small chance of collision, it is possible to -- at least with some outside input or very bad luck :) -- generate collisions. I am of the belief that only a true PK should be transmitted between systems, which I would argue is not (or should not be) a database-generated UUID in most cases.
autoincrement/sequence keys and sequence tables
This really depends on what the database supports well. Some databases support sequences which are more flexible that a simple "auto-increment". This may or may not be desirable (or may be the only way for this kind of task simply, even). Sequence tables are generally more flexible yet, but if this kind of "flexibility" is needed I would be tempted to go back and visit the design-pattern, especially if it involves the use of triggers. While I dislike "limiting ORMs", that may also make a difference in choosing the "simpler" auto-increment or sequence types/database support.
Regardless of the method used, when using surrogate primary keys, the true primary key should still be identified and encoded into the schema.
In addition, I argue that "security compromises through exposing an auto-sequence PK" are a result of incorrectly exposing an internal database property. While a very simple way to handle CRUD operation, I believe there is a distinction between the internal keys and the exposed keys (e.g. pretty customer number).
Just my two cents.
Edit, additional replies to Tim:
I think the generated vs. true PK question is a very good one and one I need to consider also. I'd like UUIDs in general to the points you make. My hesitation was in size vs. an int/long. Was not aware of potential indexing de-optimizations, which is a much bigger concern for me.
I wouldn't really worry about the size -- if a UUID is best, then it's best. If it's not, then it's not. In the overall scheme the extra 12bytes over an int likely won't make much of a difference. SQL Server 2005+ supports the newsequentialid UUID generation function to avoid the fragmentation associated with normal UUID generation. The page discusses it some. I am sure that other databases have similar solutions.
And by "encoded into the schema", do you mean more than adding a uniqueness constraint?
Yes. The primary key doesn't have to be the only [unique] constraint. Just using a surrogate PK doesn't mean the database model should be compromised :-) Additional indexes can also be used to cover, etc.
And by "distinction between", are you saying that surrogate primary keys never leak out?
The wording in my initial post was a tad hard. It's not "never" so much as "if they do and it matters then that's another problem". Often times people complain of insecurity through guessable numbers -- e.g. if your order is 23 then there is likely an order 22 and 24, etc. If this is your "protection" and/or can leak sensitive information then the system is already flawed. (Separating internal and external ids does not inherently fix this issue and authentication/authorization is still required. However, it is one issue raised against using "sequential ids" -- I find encoding a nonce into distributed URLs handles this for my use-case rather well.)
More to what I really wanted to get across: Just because the surrogate PK id happens to be 8942 doesn't mean that it's order 8942. That is, keeping with the "some fields are internal only to db" design, the order "number" might be entirely unrelated on the surface (but fully supported in the DB model), such as "#2010-42c" or whatever makes sense for the business requirement(s). It is this external number that should be exposed in most cases.
I feel that sometimes the generated key is really the true primary key as other fields are mutable (eg. user may change email and username).
This may be the case within a database and I will not argue this statement. However, once again holding that the surrogate PK's are internal to the database, just make sure to only export/import tuples that can be well-identified. If the username/email may change, then this might very well include a UUID assigned upon account creation -- and could very well be the surrogate PK itself.
Of course, as with everything, remain open and fit the model to the problem, not the problem to the model :-) For a service like twitter, for instance, they use their own number generation schema. See Twitter's new ID generation. Unlike [some] UUID generation, the approach by twitter (assuming that all the servers are correctly setup) guarantees that none of the distributed machines/processes will ever generate a duplicate ID, requires only 64-bits, and maintains rough ordering (most significant bits are time-stamp). (The number of records generated by twitter may be in no way related to local requirements ;-)
Happy coding.
Say I had three tables: Accommodation, Train Stations and Airports. Would I have address columns in each table or an address table that is referenced by the other tables? Is there such a thing as over-normalization?
Database Normalization is all about constructing relations (tables) that maintain certain functional
dependencies among the facts (columns) within the relation (table) and among the various relations (tables)
making up the schema (database). Bit of a mouth-full, but that is what it is all about.
A Simple Guide to Five Normal Forms in Relational Database Theory
is the classic reference for normal forms. This paper defines in simple terms what the essence of each normal form is
and its significance with respect to database table design. This is a very good "touch-stone" reference.
To answer your specific question properly requires additional information. Some critical questions you have to ask
are:
Is an Address a simple fact (e.g. blob of text) or a composite fact (e.g.
composed of multiple attributes: Address line, City Name, Postal Code etc.)
What are the other "facts" relating to "Accommodation",
"Airport" and "Train Station"?
What sets of "facts" uniquely and minimally identify an "Airport", an "Accommodation"
and a "Train Station" (these facts are typically called a key or candidate key)?
What functional dependencies exist among Address facts and the facts
composing each relations key?
All this to say, the answer to your question is not as straight forward as one might hope for!
Is there such a thing as "over normalization"? Maybe. This depends on whether the
functional dependencies you have identified and used to build your tables are
of significance to your application domain.
For example, suppose it was determined that an address
was composed of multiple attributes; one of which is postal code. Technically a postal
code is a composite item too (at least Canadian Postal Codes are). Further normalizing your
database to recognize these facts would probably be an over-normalization. This is because
the components of a postal code are irrelevant to your application and therefore factoring
them into the database design would be an over-normalization.
For addresses, I would almost always create a separate address table. Not only for normalization but also for consistency in fields stored.
As for such a thing as over-normalization, absolutely there is! It's hard to give you guidance on what is and isn't over-normalization as I think it mostly comes from experience. However, follow the books on each level of normalization and then once it starts to get difficult to see where things are you've probably gone too far.
Look at all the sample/example databases you can as well. They will give you a good indication on when you should be splitting out data and when you shouldn't.
Also, be well aware of the type and amount of data you're storing, along with the speed of access, etc. A lot of modern web software is going fully de-normalized for many performance and scalability reason. It's worth looking into those for reason why and when you should and shouldn't de-normalize.
Would I have address columns in each table or an address table that is referenced by the other tables?
Can airports, train stations and accommodation each have a different address format?
A single ADDRESS table minimizes the work necessary dealing with addresses - suite, RR, postal/zip code, state/province...
Is there such a thing as over-normalization?
There are different levels of normalization. I've only encountered what I'd consider poor design rather than normalization.
Personally I'd go for another table.
I think it makes the design cleaner, makes reporting on addresses much simpler and will make any changes you need to make to the address schema easier.
If you need to have it denormalized later on you can always create two views that contain the Train station and airport information along with any address information you need.
This isn't really what I understand by normalisation. You don't seem to be talking about removing redundancy, just how to partition the storage or data model. I'm assuming that the example of addresses for Accommodation, Train Stations and Airports will all be disjoint?
As far as I know it would only be normalisation if you started thinking along the lines. Postcode is functionally dependent upon street address so should be factored out into its own table.
In which case this could be ever desirable or undesirable dependent upon context. Perhaps desirable if you administer the records and can ensure correctness, and less desirable if users can update their own records.
A related question is Is normalizing a person’s name going too far?
If you have a project/piece of functionality that is very performance sensitive, it may be smart to denormalize the database in some cases. However, this can lead to maintenance issues for various reasons. You may instead want to duplicate the data with cache tables but there are drawbacks to this as well. It's really a case by case basis but in normal practice, database normalization is a good thing. 99% of the non-normalized databases I've seen are not by design, but rather by a misunderstanding/mistake by the developer.
Would I have address columns in each table or an address table that is referenced by the other tables?
As others have alluded to, this is not really a question of normalization because you're not attempting to reduce redundancy or organize dependencies. Either way is perfectly acceptable. Moving the addresses to a separate table might make sense if you are going to have centralized validation or business logic specific to addresses.
Is there such a thing as over-normalization?
Yes. As has been mentioned, in large systems (lots of data, lots of transactions, or both) you can normalize to the point where performance becomes an issue. This is why lots of systems use denormalized database for reporting and querying.
In addition to performance though, there is also the issue of how easy the data is to query. In systems where there will be a lot of end-user querying of the data (can be dangerous!), a denormalized structure is easier for most non-technical or non-database people to understand.
Like most things we deal with, it's a trade-off between understanding, performance, and future maintainability and there is rarely a clear-cut answer to where you draw the line in any given system.
With experience, you will learn where the line is best drawn for the systems you write.
With that said, my preference is to err on the side of more vs less normalization.
If you are using Oracle 9i, you could store address objects in your tables. That would remove the (justified) concerns about address formats.
I agree with S.Lott, and would like to add:
A good answer depends on what you know already. The basic "math" of relational database theory, however, defines very well-defined, distinct levels of normalization. You cannot normalize anymore when you've reached the ultimate normal form.
Depending on what you want to model with your three entities, and how you identify them, you can come up with very different conceptual data models, all of which can be represented in a mix of normal forms -- or unnormalized at all (like 1 table for all data with descriptors and NULL holes all over the place...).
Consider you normalize your three entities to the ultimate normal form. I can now introduce a new requirement, or use case, or extension, which gives an upto-now descriptive attribute a somehow ordered, or referencing, or structured nature if you look at its content. Then, the model should represent this behavior, and what used to be an attribute perhaps will better be a separate entity referenced by other entities.
Over-normalization? Only in the sense that can you normalize a given model so it gets inefficient to store, or process, on a given DB platform. Depending on what can be handled efficiently there, you might want to de-normalize certain aspects, trading off redundancy for speed (data warehouse dbs do this all the time), and insight, or vice versa.
All (working) db designs I've seen so far either have a rather normalized conceptual data model, with quite some denormalization done at the logical and/or physical data model level (speaking in Sybase PowerDesigner terms) to make the model "manageable" -- either that, or they were not working, i.e. failed because the maintenance problems became kingsize real quick.
When you say "address", I presume you mean a complete address, like street, city, state/province, maybe country, and zip/postal code. That's 4 or 5 fields, maybe more if you allow for "address line 1" and "address line 2", care-of's, etc. That should definately be in a separate table, with an "addressid" to link to the Station, etc tables. Otherwise, you are creating 3 separate copies of the same set of field definitions. That's bad news because it creates extra effort to keep them consistent. Like, what if initially you are only dealing with U.S. addresses (I'm an American so I'll assume U.S.), but later you find you also need to allow for Canadians. You'll need to expand the size of the postal code field and add a country code. If there's a common table, then you only have to do this once. If there isn't, then you have to do this three times. And it's likely that the "three times" is not just changing the database schema, but changing every place in your programs that processes an address.
One of the benefits of normalization is to minimize the impact of changes.
There are times when you want to denormalize to make queries more efficient. But this should be done very cautiously, only after you have good reason to believe that the fully normalized model creates serious inefficiency problems. In my humble experience, most programmers are far to quick to denormalize, usually with a quick "oh, breaking that out into a separate table is too much trouble".
I think in this situation it is OK to have address columns in each table. You'll hardly have an address which will be used more than two times. Most of the adresses will be used just one per entity.
But what could be in an extra table are names of streets, cities, countries...
And most important every train station, accomodoation and airport will probably have just one address so it's an n:1 relation.
I can only add one more constructive note to the answers already posted here. However you choose to normalize your database, that very process becomes almost trivial when the addresses are standardized (look the same). This is because as you endeavor to prevent duplicates, all the addresses that are actually the same do look the same.
Now, standardizing addresses is not trivial. There are CASS services which do this for you (for US addresses) which have been certified by the USPS. I actually work for SmartyStreets where this is our expertise, so I'd suggest you start your search there. You can either perform batch processing or use the API to standardize the addresses as you receive them.
Without something like this, your database may be normalized, but duplicate address data (whether correct or incomplete and invalid, etc) will still seep in because of the many, many forms they can take. If you have any further questions about this, I'll personally assisty you.
2 tables:
- views
- downloads
Identical structure:
item_id, user_id, time
Should I be worried?
I don't think that there is a problem, per se.
When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence.
Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often.
This alone could be a good reason not to burden it with records from the downloads.
Also, the fact that they are indentical now does not mean they will be in the future: views and downloads are different, after all, so sooner or later one or both could grow an extra field or two.
These tables are the same NOW but may schema change in the future. If they represent 2 different concepts it is good to keep them separate. What if you wanted to have a foreign key from another table to the downloads table but not the views table, if they were that same table you could not do this.
I think the answer has to be "it depends". As someone else pointed out, if the schema of one or both tables is likely to evolve then no. I can think of other cases well (simplifying the security model by allow apps/users access to one or the other).
Having said this, I work with a legacy DB where this is a problem. We have multiple identical tables for customer invoices. Data is actually moved between then at different stages in the processing life-cycle. It makes for a complicated mess when trying to access data. It would have been easily solved by a state flag in the original schema, but we now have 20+ years of code written against the multi-table version.
Short answer: depends on why they are the same schema :).
From a E/R modelling point of view I don't see a problem with that, as long as they represent two semantically different entities.
From an implementation point of view, it really depends on how you plan to query that data:
If you plan to query those tables independently from each other, keeping them separate is a good choice
If you plan to query those tables together (maybe with a UNION of a JOIN operation) you should consider storing them in a single table with a discriminator column to distinguish their type
When considering whether to consolidate them into a single table you should also take into account other factors like:
The amount of data stored in each table
The rate at which data grows in each table
The ratio of read/write operations executed on each table
Chris Date and Dave McGoveran formalised the "Principle of Orthogonal Design". Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result.
Arguably it isn't always totally practical to do that and it isn't necessarily clear cut exactly when the principle is being broken. However, I do think it's a good guiding rule, if only because it avoids the problem of duplicate logic in data access code or constraints, i.e. it's a good DRY principle. Avoid having tables with potentially overlapping meanings unless there is some database constraint that prevents duplication between them.
It depends on the context - what is a View and what is a Download? Does a Download imply a View (how else would it be downloaded)?
It's possible that you have well-defined, separate concepts there - but it is a smell I'd want to investigate further. It seems likely that a View and a Download are related somehow, but your model doesn't show anything.
Are you saying that both tables have an 'item_id' Primary Key? In this case, the fields have the same name, but do not have the same meaning. One is a 'view_id', and the other one is a 'download_id'. You should rename your fields consequently to avoid this kind of misunderstanding.