What are the adverse effects of having too many lookup tables in the database?
I have to incorporate many enumerations, depending on the application.
What would experts advise?
First you have to ask yourself: how many is too many? If there is a logical relation between two tables, there has to be a FK.
If you don't need the related tables anywhere else within the database, you could consider removing them and using a CHECK constraint with an IN clause to enforce data validity. However, this means altering the table for each new value in the enumeration.
My personal advice is to keep the FKs and the tables. It's a clean solution, and the database is much easier to maintain if there is descriptive text available for all those numbers.
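For illustration, here is a minimal sketch of the lookup-table-plus-FK approach (the table and column names are made up):

CREATE TABLE order_status
(
id integer not null,
description varchar(50) not null,
constraint order_status_pk primary key (id)
);

CREATE TABLE orders
(
id integer not null,
status_id integer not null,
constraint orders_pk primary key (id),
constraint orders_status_fk foreign key (status_id) references order_status (id)
);

Adding a new status is then just an INSERT into order_status; no DDL change is required.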
Let me tell you how awful it is to have too few lookup tables. The original designers at one place I worked decided to put all lookups into one table and defined what each lookup was for using a typeid. This caused almost all queries to hit this table to get the descriptive value, creating a performance bottleneck.
Further, without separate lookups, the fields that took the typeid were not constrained to the values appropriate to that field, because a foreign key can only reference the whole table, not a chunk of it. So the field that stored the clientid might accidentally contain the value for a user group. This caused data integrity problems and made reporting much more difficult, as we had to interpret values that didn't make sense in context. There is no prize for using too few tables; in fact, it is often an anti-pattern in database design.
Create 1000 lookup tables if that is what you need.
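To sketch the problem with hypothetical names: the FK can only say "this value exists somewhere in the lookup table", not "this value belongs to the client type":

CREATE TABLE lookup
(
id integer not null,
typeid integer not null, -- e.g. 1 = client, 2 = user group
description varchar(100) not null,
constraint lookup_pk primary key (id)
);

CREATE TABLE account
(
id integer not null,
clientid integer not null,
constraint account_pk primary key (id),
-- nothing stops clientid from pointing at a "user group" row
constraint account_client_fk foreign key (clientid) references lookup (id)
);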
Like Florian, I much prefer having tons of foreign keys to having CHECK IN (...) constraints, for a simple reason: you can simply insert new records into your lookup tables.
Maintaining CHECK IN (...) is a much bigger problem. Imagine this scenario:
CREATE TABLE street
(
id serial not null,
st_type varchar(20) not null,
st_name varchar(100) not null,
constraint street_pk primary key (id),
constraint street_type_check check (st_type in ('STREET','AVENUE','SQUARE'))
);
You have 1000 rows with those types checked, correct? If you need to add another one, you will need to drop the constraint and recreate it.
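For example, adding 'ALLEY' to the list would look something like this (PostgreSQL-style syntax, matching the table above):

ALTER TABLE street DROP CONSTRAINT street_type_check;
ALTER TABLE street ADD CONSTRAINT street_type_check check (st_type in ('STREET','AVENUE','SQUARE','ALLEY'));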
If you take an item off that list, like SQUARE, what will happen to the rows already committed (and checked at the moment of insertion) that have that type? They will still keep a now-invalid type.
Tables and Foreign Keys are easier to maintain and keep track of.
The whole point of lookup data is that there is a finite list of valid identifiers for a specific field. If those fields are used in procedures or WHERE clauses to determine the correct process path or to limit the select list, then there is no such thing as too many lookups.
If it is not a finite list of identifiers for a specific process or WHERE clause, then it should not be a lookup value.
Two types of fields come to mind which might be considered lookup values but don't necessarily need to be:
City and Province/state:
There is a finite list of these, but because there are so many you might not want to make a lookup table for them.
I have the luxury of designing a database from scratch. When designing columns to act as unique keys, should I just use unique integers, or should I attempt to make the values interpretable? For example, if I had a lookup table of ward names in a hospital, should the id column contain unique codes that in some way relate to the name of the ward, or just unique integers?
Resist the temptation to overload the id values with meaning. Use other attributes to store the info you're considering stuffing into the id.
Overloading the id with "meaning" is bad because:
If the data being stuffed into the ID changes, so must your ID. ID's should never change
If the data type of the data changes, you'll have a problem, for example:
If your ID is numeric, and the stuffed info changes from numeric to text, you'll have big problems
If the stuffed data changes from a simple field to a one-to-many child, your model will break
What you believe has "important" meaning now may not be important in the future. Then your "specially encoded" data will become useless and a burden, even a serious restriction
What currently "identifies" a product may change as the business evolves
I have seen this idea attempted many times, never successfully. In every case, the idea was scrapped and surrogate IDs were introduced to replace the magic IDs, with all the risk and development cost associated with that task.
In my career, I have seen most of the problems listed above actually happen.
You should not be using a lookup table. Make your tables InnoDB and use referential integrity to join tables together. Your id columns should always be set as primary keys and should be set to auto increment. Never try to make up your own ids. You should really look at some tutorials on referential integrity and learn how to associate tables with other tables.
I have a question about using null vs. default values for foreign key columns in a database. I found a lot of opposing opinions about null vs. default values when designing databases, but not specifically about foreign keys (what are the main pros and cons?).
Currently I'm designing a new database which will store a lot of data for different web applications and other systems with different data access approaches (ORM, stored procedures), and I want to implement general rules at the lowest level possible (the database), so that I don't have to worry about these rules later in the applications.
To give you an example, let's say that I have a User table with a foreign key column NationalityID for the user's nationality, which references the primary key CountryID of the Country table.
Now I have two/three options:
A: I allow the NationalityID column (and all other similar foreign key columns in the database) to be null and just stick with the common approach of always checking everywhere for null (applying the rules in the application)
or
B: I assign a default value of, let's say, "-1" to every foreign key and put into every referenced table an additional row with "-1" as the key and "No data" for all other columns (for this example, in the Country table I add a row with CountryID "-1" and CountryName "No data"; see the sketch after these options). So every time I want to know a user's nationality I will always get a result, without additional code rules (no need for me to check if it's null or not).
or
C: I can disallow null values for foreign keys. But this is really something I want to avoid. (I need to have the option to store at least the basic data (the user's name) even if the additional data (the user's nationality) is missing.)
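A minimal sketch of option B, reusing the Country example (the Users table name and column sizes are just placeholders):

INSERT INTO Country (CountryID, CountryName) VALUES (-1, 'No data');

CREATE TABLE Users
(
UserID int not null,
UserName varchar(100) not null,
NationalityID int not null default -1,
constraint users_pk primary key (UserID),
constraint users_country_fk foreign key (NationalityID) references Country (CountryID)
);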
So is B a good approach or not? What am I missing here? Do I lose more than I gain with this approach? Which problems could I have (in addition to having to be careful to always have an additional row with ID value "-1" meaning "No data" in the referenced tables)?
What is your good/bad experience with foreign key default values?
Thank you.
If you normalize, this won't be an issue.
Instead of putting nationality in the USER table, make a User_Nationality table that links users to Country_ID in the other table.
If they have an entry in that lookup table, great. If not, you don't need to store a NULL or default value for it.
You need to enforce FK relationships, and allowing NULL goes against that. You also don't want to make up information that may not be accurate just to populate a field, which negates the point of requiring the field in the first place.
Use lookup tables and you can bypass that entirely.
This will also allow you to change your mind and choose one of your options down the road.
If you use views, you can choose to treat missing data as a NULL or a default value without needing to alter the underlying data.
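A sketch of what that might look like (names are illustrative; it assumes a Users table keyed on UserID):

CREATE TABLE User_Nationality
(
UserID int not null,
CountryID int not null,
constraint user_nationality_pk primary key (UserID),
constraint user_nationality_user_fk foreign key (UserID) references Users (UserID),
constraint user_nationality_country_fk foreign key (CountryID) references Country (CountryID)
);

-- the view presents missing nationality as NULL without touching the base tables
CREATE VIEW UserWithNationality AS
SELECT u.UserID, u.UserName, un.CountryID
FROM Users u
LEFT JOIN User_Nationality un ON un.UserID = u.UserID;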
Personally, I feel that even if you have a placeholder entry in your database with a key of -1, you would still be performing a check for each individual field to see whether you want to display 'No Data' or not.
I would stick to NULLs. NULL is meant to mean the absence of data, which is the case here.
B is a terrible approach. It is easier to remember to handle nulls than to have to figure out what magic number you used, and then you still have to handle it. Go with option A. But I like JNK's idea best.
I suggest option D. If not all users have a defined nationality then that information doesn't belong in the user table. Create a table called UserNationality keyed on UserId.
I like your B solution. Maybe it will be possible to map the values onto other entities, so that you have Country, and a NullCountry that extends Country, is mapped to the row with id = -1, and has special code in its methods to make the special cases easy to handle.
One problem is probably that it will be harder to do outer joins on that foreign key.
EDIT: no, there should be no problem with outer joins, because there would be no need to do outer joins.
I have a database storing customer enquiries about products.
The enquiry reference (text), product number (int) and revision number (int) together uniquely identify a single discussion between sales and the customer.
As a result, there are many tables, each for a specific detail about a single enquiry, uniquely identified by the enq, pdt and rev values combined.
The CREATE TABLE statements do not use an AUTO_INCREMENT UNIQUE PRIMARY KEY on any field.
My question is, is this database design acceptable?
Should tables always be normalized?
Thanks for any advice.
There's no need to use AUTOINCREMENT, but every table should have a PRIMARY KEY of some kind. A primary key can be a combination of several fields that together identify the record uniquely.
Based on what you've told us, yes, the design is acceptable, provided you explicitly declare the combination of the enquiry reference (text), product number (int) and revision number (int) as a primary key that together uniquely identifies a single discussion.
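For example (the column types here are guesses based on your description):

CREATE TABLE enquiries
(
enq varchar(50) not null,
pdt int not null,
rev int not null,
-- ... columns for this particular detail ...
constraint enquiries_pk primary key (enq, pdt, rev)
);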
People sometimes denormalize a database for performance reasons. If select queries are far more frequent than inserts and updates, and the select query of interest is slow to return because of the number of tables it has to join, then consider denormalizing.
If you supply a specific query that is running slow for you, you'll get lots of specific advice.
Having a PRIMARY KEY (or a UNIQUE constraint) will, first, ensure that these values are really unique, and, second, will greatly improve the searches for a given enquiry.
A PRIMARY KEY implies creating an index over (enq, pdt, rev), and this query:
SELECT *
FROM enquiries
WHERE enq = 'enquiry'
AND pdt = 'product'
AND rev = 'revision'
will complete in a single index seek.
Without the index, this query will require scanning the whole table, and there is no guarantee that you won't end up with duplicates.
Except under very, very special conditions (like heavily inserted log tables), you should always have a PRIMARY KEY on your tables.
Personally, I ALWAYS have some sort of primary key on all tables, even if it is an auto-increment number used for nothing else.
As to normalization, I think one should strive for normalized tables, but in reality there are many good reasons why a table design may be good yet not normalized. This is where the 'theory' of DB design meets reality. It is good to know what normalization is, strive for it, and have good reasons when you deviate from the rules (as opposed to just being ignorant of them or, worse, ignoring good design rules deliberately).
These are two questions.
(1) It is not required to have an auto increment key always. It is practical though, since you can use it for easy manipulation of your data. Also having no duplicates is not a must.
(2) Normalization is a must when you do homework for school, but if things get tough you can break it in order to make your life easier if you do not endanger your data integrity.
I am splitting from the herd on this one. Do NOT make the enquiry reference (text), product number (int) and revision number (int) the primary key. You indicated the enquiry reference is a text type; did you mean it would be 25 or 50 or 500 characters wide? If the primary key is made from those fields, it will be too wide in my view: it will be appended to every index created for that table, increasing the size of every index row by the size of the three fields, and any table which needs a foreign key back to this table will also need the three fields.
Make the three fields a unique index. Place an auto-increment value as the primary key and make it the clustered index. The tables which will link back to this master table will have a small footprint in memory to link the data from table one to table two.
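In SQL Server-style syntax, that suggestion looks something like this (names and column sizes are illustrative):

CREATE TABLE enquiry_master
(
id int identity(1,1) not null,
enq varchar(50) not null,
pdt int not null,
rev int not null,
constraint enquiry_master_pk primary key clustered (id),
constraint enquiry_master_uq unique (enq, pdt, rev)
);

Child tables then carry only the single int id as their foreign key instead of the three wide columns.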
As far as normalization goes, it does not matter, normalized or not, if your data is only a few thousand rows, or even 50,000 or 500,000. When the data starts getting bigger than the available RAM cache, then it becomes an issue.
Design a view to present the data to the application to fulfill the business rule. Design stored procedures to accept data to store. Design the table structure to meet the response time in the SLA. If you have to normalize, denormalize, partition, index, or get a bigger server to meet the SLA, the app will never know, because you are always supplying the data via the view, which meets the business rule.
There is nothing in normalization theory that deals with whether a table should have a simple or compound primary key. Believe it or not, the concept of "primary key" is not a component of the relational model of data.
Having said that, tables should nearly always be defined with a primary key. The primary key need not be a single column, and it need not be filled in by an autoincrement. In your case, it could be the three columns that taken together uniquely identify an enquiry.
If a table has no declared primary key, it could end up with duplicate rows. A table with duplicate rows represents a bag of tuples, not a set of tuples. Once you are dealing with bags instead of sets, the results predicted by the relational model need not apply. That is why preventing duplicate rows is so important.
I have a routine that will be creating individual tables (Sql Server 2008) to store the results of reports generated by my application (Asp.net 3.5). Each report will need its own table, as the columns for the table would vary based on the report settings. A table will contain somewhere between 10-5,000 rows, rarely more than 10,000.
The following usage rules will apply:
Once stored, the data will never be updated.
Whenever results for the table are accessed, all data will be retrieved.
No other table will need to perform a join with this table.
Knowing this, is there any reason to create a PK index column on the table? Will doing so aid the performance of retrieving the data in any way, and if it would, would this outweigh the extra load of updating the index when inserting data (I know that 10K records is a relatively small amount, but this solution needs to be able to scale).
Update: Here are some more details on the data being processed, which goes into the current design decision of one table per report:
Tables will record a set of numeric values (set at runtime based on the report settings) that correspond to a different set of reference varchar values (also set at runtime based on the report settings).
Whenever data is retrieved, some post-processing on the server will be required before the output can be displayed to the user (thus I will always be retrieving all values).
I would also be suspicious of someone claiming that they had to create a new table for each time the report was run. However, given that different columns (both in number, name and datatype) could conceivably be needed for every time the report was run, I don't see a great alternative.
The only other thing I can think of is to have an ID column (identifying the ReportVersionID, corresponding to another table), a ReferenceValues column (a varchar field containing all reference values, in a specified order, separated by some delimiter) and a NumericValues column (same as ReferenceValues, but for the numbers), and then, when I retrieve the results, put everything into specialized objects in the system, separating the values based on the defined delimiter. Does this seem preferable?
Primary keys are not a MUST for any and all data tables. True, they are usually quite useful and abandoning them is unwise. However, in addition to the primary mission of speed (which I agree would doubtfully be positively affected here), there is also that of uniqueness. To that end, and valuing the consideration you've already obviously given this, I would suggest that the only need for a primary key here would be to govern the expected uniqueness of the table.
Update:
You mentioned in a comment that if you did add a PK, it would be an Identity column that presently does not exist and is not needed. In this case, I would advise against the PK altogether. As #RedFilter pointed out, surrogate keys never add any value.
I would keep it simple: just store the report results converted to JSON or XML in a VARCHAR(MAX) column.
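A sketch of that idea (SQL Server syntax; ReportVersionID is borrowed from the question, the other names are made up):

CREATE TABLE ReportResult
(
ReportResultID int identity(1,1) not null,
ReportVersionID int not null,
ResultData varchar(max) not null, -- the report rows serialized as JSON or XML
constraint reportresult_pk primary key (ReportResultID)
);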
One of the most useful and least emphasized (explicitly) benefits of data integrity (primary keys and foreign key references to start with) is that it forces a 'design by contract' between your data and your application(s); which stops quite a lot of types of bugs from doing any damage to your data. This is such a huge win and a thing that is implicitly taken for granted (it is not 'the database' that protects it, but the integrity rules you specify; forsaking the rules you expose your data to various levels of degradation).
This seems unimportant to you (from the fact that you did not even discuss what a possible primary key would be), and your data seems quite unrelated to other parts of the system (from the fact that you will not do joins to any other tables); but still, all things being equal, I would model the data properly, and then, if primary keys (or other data integrity rules) turn out not to be used and you are chasing every last bit of performance, I would consider dropping them in production (and test for any actual gains).
As for comments that creating tables is a performance hit: that is true, but you did not tell us how temporary these tables are. Once created, will they be heavily used before being scrapped? Or do you plan to create tables for just a dozen read operations?
If you will use these tables heavily, and if you will provide a clean mechanism for managing them (removing them when no longer used, selecting them, etc.), I think that dynamically creating the tables would be perfectly fine (you could have shared more details on the tables themselves; a use case would be nice).
Notes on other solutions:
EAV model
is horrible unless very specific conditions are met (for example: flexibility is paramount and automating DDL is too much of a hassle). Keep away from it (or be very, very good at anticipating what kinds of queries you will have to deal with, and rigorous about validating data on the front end).
XML/BLOB approach
might be the right thing for you if you will consume the data as XML/BLOBs at the presentation layer (you always read all of the rows, you always write the whole 'object', and, finally, your presentation layer likes XML/BLOBs).
EDIT:
Also, depending on the usage patterns, having a primary key can indeed increase the speed of retrieval, and if I can read the fact that the data will not be updated as 'it will be written once and read many times', then there is a good chance that this will indeed outweigh the cost of updating the index on inserts.
Will it be one table for every run of a given report, or one table for all runs of a given report? In other words, if you have Report #1 and you run it 5 times, over different ranges of data, will you produce 5 tables, or will all 5 runs of the report be stored in the same table?
If you are storing all 5 runs of the report in the same table, then you'll need to filter the data so that it is appropriate to the run in question. In this case, having a primary key will let the WHERE clause for that filter run much faster.
If you are creating a new table for every run of the report, then you don't need a primary key. However, you are going to run into other performance problems as the number of tables in your system grows... assuming you don't have something in place to drop old data / tables.
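A sketch of the one-table-for-all-runs variant (hypothetical names), where the primary key makes the per-run filter an index seek:

CREATE TABLE report_results
(
run_id int not null,
row_num int not null,
-- ... the report's value columns ...
constraint report_results_pk primary key (run_id, row_num)
);

SELECT * FROM report_results WHERE run_id = 42; -- touches only that run's slice of the index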
If you are really not using the tables for anything other than as a chunk of read-only data, you could just as well store all the reports in a single table, as XML values.
What column or columns would the PK index be built on? If just a surrogate identity column, you'll have no performance hit when inserting rows, as they'd be inserted "in order". If it is not a surrogate key, then you have the admittedly minor but still useful assurance that you don't have duplicate entries.
Is the primary key used to control the order in which report rows are to be printed? If not, then how do you ensure proper ordering of the information? (Or is this just a data table that gets summed one way and another whenever a report is generated?)
If you use a clustered primary key, you wouldn't use as much storage space as you would with a non-clustered index.
By and large, I find that while not every table requires a primary key, it does not hurt to have one present, and since proper relational database design requires primary keys on all tables, it's good practice to always include them.
I have ten or more (I don't know exactly how many) tables that have a column named foo with the same datatype.
How can I tell SQL that the values across all these tables should be unique?
I mean, if I have the value "1" in table1, I should NOT be able to have the value "1" in table2.
Have a common IDs table, which these ten tables reference. That will work well in that it will ensure unique IDs, but it doesn't mean you couldn't duplicate an ID across the tables if someone really wants to.
What I mean is that a common IDs table ensures that you don't have duplicates on insert (by also inserting the ID into this common table), but the only way to guarantee it never happens is by building the business rules into the system, or by placing check constraints that cross-reference the other tables (which would ensure uniqueness, but degrade performance).
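A sketch of the common IDs table (hypothetical names); as noted, this guarantees every foo exists in the shared table, not that two tables never share a value:

CREATE TABLE all_ids
(
id int not null,
constraint all_ids_pk primary key (id)
);

CREATE TABLE table1
(
foo int not null,
constraint table1_pk primary key (foo),
constraint table1_foo_fk foreign key (foo) references all_ids (id)
);

-- table2 through table10 are declared the same way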
The question is phrased vaguely; if you need to generate a column that's unique among several tables, use row GUIDs or a common ID generator table; if you need to enforce uniqueness (and the field values are already there), use triggers.
Generally, if you generate the values, you don't need to enforce anything. The generation logic, if done right, will take care of that. If you are inserting, say, user input, then you can and should enforce uniqueness during insertion. As a validation rule or something.
You can define the field as a GUID (or a UNIQUEIDENTIFIER in SQL server). Then it will always be unique no matter what.
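A minimal SQL Server sketch of that:

CREATE TABLE table1
(
foo uniqueidentifier not null default newid(),
constraint table1_pk primary key (foo)
);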
How about setting a check constraint on each table, such that ID % 10 = N (where N is the table number, from 0-9), and using IDENTITY(N, 10) each time?
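For example, for table number 3 out of the 10 (SQL Server syntax):

CREATE TABLE table3
(
foo int identity(3,10) not null,
constraint table3_pk primary key (foo),
constraint table3_foo_chk check (foo % 10 = 3)
);

The seed/increment pair generates 3, 13, 23, ..., so the check constraint is never violated by the identity itself, and no two tables can generate the same value.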
I would suggest that possibly your design is flawed. Why are these separate tables? It would be better to put them in one table, with one id field and another field to identify whatever is driving these separate tables (customer id, for instance). Then you can read about partitioning tables if you want them to be split by customer for performance reasons.
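A rough sketch of the combined design (the discriminator column name is just an example):

CREATE TABLE combined
(
id int identity(1,1) not null,
customer_id int not null,
-- ... the columns the ten separate tables have in common ...
constraint combined_pk primary key (id)
);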