At my job, we have a pseudo-standard of creating one table to hold the "standard" information for an entity, and a second table, named like 'TableNameDetails', which holds optional data elements. On average, every row in the main table has about 8-10 detail rows.
My question is: What kind of performance impacts does this have over adding these details as additional nullable columns on the main table?
8-10 detail rows, or 8-10 detail columns?
If it's rows, then you're mixing apples and oranges, as a one-to-many relationship cannot be flattened out into columns.
If it's columns, then you're talking about vertical partitioning. For large and very large tables, moving seldom-referenced columns into Extra or Details tables (i.e. partitioning the columns vertically into 'hot' and 'cold' tables) can have significant, even huge, performance benefits. A narrower table means a higher density of data per page, which in turn means fewer pages needed for frequent queries, less IO, and better cache efficiency: all goodness.
Mileage may vary, depending on the average width of the 'details' columns and how 'seldom' the columns are accessed.
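To make the hot/cold split concrete, here is a minimal T-SQL sketch; the table and column names are invented for illustration, and which columns count as 'cold' is of course up to your own access patterns.

```sql
-- Hypothetical example: narrow "hot" table plus a 1:1 "cold" details table.
CREATE TABLE dbo.Customer
(
    CustomerId int           NOT NULL PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    Email      nvarchar(256) NOT NULL
);

CREATE TABLE dbo.CustomerDetails
(
    CustomerId int           NOT NULL PRIMARY KEY
        REFERENCES dbo.Customer (CustomerId),  -- same key, at most one row per customer
    Biography  nvarchar(max) NULL,             -- wide, seldom-read columns live here
    PhotoXml   xml           NULL
);

-- Frequent queries touch only the narrow table (more rows per page, less IO);
-- the cold columns cost a join only when they are actually needed.
SELECT c.CustomerId, c.Name, d.Biography
FROM   dbo.Customer           AS c
LEFT JOIN dbo.CustomerDetails AS d ON d.CustomerId = c.CustomerId
WHERE  c.CustomerId = 42;
```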
I'm with Remus on all the "depends", but would just add that after choosing this design for a table/entity, you must also have a good process for determining what is "standard" and what is "details" for an entity.
Misplacing something as a detail which should be standard is probably the worst mistake. You can't require a row to exist as easily as you can require a column to exist (big, complex trigger code). Setting a default on a type of row is a lot harder (big, complex constraint code). And indexing is not easy either (a sparse index, maybe?).
Misplacing something as standard which should be a detail is less of a mistake: it just takes up extra row space, and you potentially can't give it a meaningful default.
If your details are very weakly structured, you could consider using an XML column for the "details" and still be able to query them using XPath/XQuery.
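For instance, a minimal sketch of that approach, using SQL Server's xml data type methods; the table, column, and element names here are made up:

```sql
-- Hypothetical Product table keeping weakly structured attributes in one xml column.
CREATE TABLE dbo.Product
(
    ProductId int           NOT NULL PRIMARY KEY,
    Name      nvarchar(100) NOT NULL,
    Details   xml           NULL
);

INSERT INTO dbo.Product (ProductId, Name, Details)
VALUES (1, N'Widget',
        N'<details><color>red</color><weightKg>1.2</weightKg></details>');

-- XQuery methods pull individual values back out, or filter on them.
SELECT p.ProductId,
       p.Details.value('(/details/color)[1]',    'nvarchar(50)') AS Color,
       p.Details.value('(/details/weightKg)[1]', 'decimal(9,2)') AS WeightKg
FROM   dbo.Product AS p
WHERE  p.Details.exist('/details[color = "red"]') = 1;
```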
As a general rule, I would not use this pattern for every entity table, but only entity tables which have certain requirements and usage patterns which fit this solution's benefits well.
Is your details table an entity-attribute-value table? In that case, yes, you are asking for performance problems.
What you are describing is an Entity-Attribute-Value design. They have their place in the world, but they should be avoided like the plague unless absolutely necessary. The analogy I always give is that they are like drugs: in small quantities and in select circumstances they can be beneficial. Too much will kill you. Their performance will be awful and will not scale and you will not get any sort of data integrity on the values since they are all stored as strings.
So, the short answer to your question: if you never need to query for specific values, never need to make a columnar report of a given entity's attributes, don't care about data integrity, and never do anything other than spit the entire wad of data for an entity out as a list, they're fine. If you need to actually use them, however, whatever query you write will not be efficient.
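To show what that looks like in practice, here is a sketch with invented table and attribute names: an EAV table and the kind of query you end up writing just to get a columnar view of it.

```sql
-- Hypothetical EAV layout: every value is a string, so no type safety and no useful constraints.
CREATE TABLE dbo.EntityAttributeValue
(
    EntityId       int            NOT NULL,
    AttributeName  nvarchar(100)  NOT NULL,
    AttributeValue nvarchar(4000) NULL,
    CONSTRAINT PK_EAV PRIMARY KEY (EntityId, AttributeName)
);

-- To get a plain columnar view you need one conditional aggregate (or self-join)
-- per attribute, and you still have to cast the strings back to the types you hope they hold.
SELECT e.EntityId,
       MAX(CASE WHEN e.AttributeName = N'Price'  THEN e.AttributeValue END) AS Price,
       MAX(CASE WHEN e.AttributeName = N'Color'  THEN e.AttributeValue END) AS Color,
       MAX(CASE WHEN e.AttributeName = N'Weight' THEN e.AttributeValue END) AS Weight
FROM   dbo.EntityAttributeValue AS e
GROUP BY e.EntityId;
```

Every additional attribute means another branch in the query, and every value comes back as text that still has to be cast and validated.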
You are mixing two different data models - a domain specific one for the "standard" and a key/value one for the "extended" information.
I dislike key/value tables except when absolutely required. They run counter to the concept of an SQL database and generally represent an attempt to shoehorn object data into a data store that can't conveniently handle it.
If some of the extended information is very often NULL you can split that column off into a separate table. But if you do this to two different columns, put them in separate tables, not the same table.
I have two tables in my database, one for logins and a second for user details (the database is not only these two tables). The logins table has 12 columns (Id, Email, Password, PhoneNumber ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo ...). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure, because this may make the table big.
So this leads to my question: what number of columns makes a table big? Is there a definite or approximate number that makes a table big and should make us stop adding columns and create another table instead? Or is it up to the programmer to decide that number?
The number of columns isn't realistically a problem. Any kind of performance issue you seem to be worried about can be attributed to the size of the DATA in the table, i.e. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each separate row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
Point here is, we can't really give you any advice since just the number of tables and columns and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.
The constraint that makes us stop adding columns to an existing table in SQL is when we exceed the maximum number of columns the database engine can support for a single table. As can be seen here, for SQL Server that is 1024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
There are a number of reasons why decomposing a table (splitting it up by columns) might be advisable. One of the first a beginner should learn about is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all this depends on how many databases you are going to build, and on their size and scope.
I take it that the logins table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index alone rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operation that does need all the data would not have to perform a join of two tables.
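As a rough sketch of that idea (the table, column and index names are hypothetical), a non-clustered index with INCLUDEd columns lets the login query be answered from the index alone:

```sql
-- Hypothetical single Users table; two covering indexes play the role of the
-- "login" and "details" sub-tables described above.
CREATE TABLE dbo.Users
(
    UserId       int           IDENTITY(1,1) PRIMARY KEY,
    Email        nvarchar(256) NOT NULL,
    PasswordHash varbinary(64) NOT NULL,
    PhoneNumber  nvarchar(30)  NULL,
    Job          nvarchar(100) NULL,
    City         nvarchar(100) NULL,
    Gender       char(1)       NULL
);

-- Login lookups can be satisfied entirely from this index, without touching the base row.
CREATE NONCLUSTERED INDEX IX_Users_Login
    ON dbo.Users (Email)
    INCLUDE (PasswordHash, PhoneNumber);

-- This query reads only IX_Users_Login pages.
SELECT PasswordHash
FROM   dbo.Users
WHERE  Email = N'someone@example.com';
```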
I have a scenario wherein users upload their custom preferences for the conversion being done by our utility. So, the problem we are facing is:
should we keep these custom mappings in one table, with an extra column for a UID or maybe a GUID, or
should we create tables on the fly for every user that registers with us, and give each user a separate table for their custom mappings altogether?
We're more inclined to have just an extra column and one table for the custom mappings, but does the redundant data in that one column have any effect on performance? On the other hand, logic dictates that we should normalize our tables and have separate tables. I'm rather confused as to what to do.
Common practice is to use a single table with a user id. If there are commonly reused preferences, you can instead refactor these out into a separate table and reference them in one go, but that is up to you.
Why is creating one table per user a bad idea? Well, I don't think you want one million tables in your database if you end up with one million users. Doing it this way will also guarantee that your queries will be unwieldy at best, often slower than they ought to be, and often dynamically generated and EXECed, which is usually a last-resort option and not necessarily safe.
I inherited a system where the original developers had chosen to go with the one table per client approach. We are now rewriting the system! From a dev perspective, you would end up with a tangled mess of dynamic SQL and looping. The DBAs will not be happy because they will wake up to an entirely different database every morning, which makes performance tuning and maintenance impossible.
You could have an additional BLOB or XML column in the users table, storing the preferences. This is called the property bag approach, but I would not recommend this either, as it will hamper query performance in many cases.
The best approach in this scenario is to solve the problem with normalisation: a separate table for the properties, joined to the users table with a PK/FK relationship. I would not recommend using a GUID. This GUID would likely end up as the PK, meaning you will have a 16-byte clustered index key, which will also be carried through to all non-clustered indexes on the table. You would probably also be building a non-clustered index on it in the Users table. You also run the risk of falling into the pitfall of generating these GUIDs with NEWID() rather than NEWSEQUENTIALID(), which will lead to massive fragmentation.
If you take the normalisation approach and you have concerns about returning the results across the two tables in a tabular format, then you can simply use the PIVOT operator.
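A minimal sketch of that normalised design, with invented table, column and mapping names, plus a PIVOT back to a tabular shape:

```sql
-- Hypothetical tables: a Users table keyed by UserId, and one child row per custom mapping.
CREATE TABLE dbo.Users
(
    UserId int NOT NULL PRIMARY KEY
    -- ... other user columns
);

CREATE TABLE dbo.UserMapping
(
    UserId       int           NOT NULL REFERENCES dbo.Users (UserId),
    MappingName  nvarchar(100) NOT NULL,
    MappingValue nvarchar(400) NOT NULL,
    CONSTRAINT PK_UserMapping PRIMARY KEY (UserId, MappingName)
);

-- PIVOT turns the mapping rows back into a tabular shape for display.
SELECT UserId, [DateFormat], [Currency], [TimeZone]
FROM (
    SELECT UserId, MappingName, MappingValue
    FROM   dbo.UserMapping
) AS src
PIVOT (
    MAX(MappingValue) FOR MappingName IN ([DateFormat], [Currency], [TimeZone])
) AS p;
```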
I was wondering which approach is better for designing databases?
I currently have one big table (97 columns per row) with references to lookup tables where I could use them.
Wouldn't it be better for performance to group some columns into smaller tables and add key columns to them for referencing one whole row?
If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.
97 columns isn't much, really - I've seen way beyond 100.
It all depends on how your data is being used - if your row just has 97 columns, all the time, and you need all 97 columns, then it really hardly ever makes sense to split those up into various tables.
It might make sense if:
you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need those all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra large column
you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for e.g. 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.
It would be better to group related columns into different tables. This will improve the performance of your database as well as your ease of use as the programmer. You should first try to find all the different relationships between your columns, and then attempt to break everything into tables while keeping those relationships in mind (using primary keys, foreign keys, references and so forth). Try to create a diagram like this one http://www.simple-talk.com/iwritefor/articlefiles/354-image008.gif and take it from there.
Unless your data is denormalized, it is likely best to keep all the columns in the same table. SQL Server reads pages into the buffer pool from individual tables, so you will pay the cost of the joins on every access, even when the pages accessed are already in the buffer pool. If you access just a few rows per query via a key, then an index will serve that query fine with all columns in the same table. Even if you will scan a large percentage of the rows (> 1% of a large table) but only a few of the 97 columns, you are still better off keeping the columns in the same table, as you can use a non-clustered index that covers the query.

However, if the data is heavily denormalized, then normalizing it (which by definition breaks it into many tables, based upon the rules of normalization, to eliminate redundancy) will result in much improved performance, and you will be able to write queries that access only the specific data elements you need.
Is it normal to have a table with about 40-50 columns in database?
Depends on your data model. It is somewhat "neater" to have data broken down into multiple tables related to each other, but it can also be that your data is such that it cannot be broken down, or that it makes no sense to do so.
If you want to have fewer columns just "for the sake of it", and there is no significant performance degradation - no need. If you find yourself using fewer columns than there are in the table, break it down...
Yes, if those 40-50 columns are all dependent on the key, the whole key, and nothing but the key of the table.
It is not uncommon for a database to be de-normalised to improve performance: munging tables together results in fewer joins during queries.
So denormalised tables tend to have more columns, and duplicate data can become an issue, but sometimes that's the only way to get the performance that you need.
I seem to get asked that question at every job interview I go to:
When would you denormalise a database?
Depends on what you call normal. If you are a big enterprise corporation, it's not normal, because you have way too few columns.
But if you find it hard to work with that many columns, you probably have a problem and need to do something about it: either abstract the many columns away or split up your data model to something more manageable.
It doesn't sound very normalised, so you might want to look at this. But it really depends on what you're storing, I suppose...
I don't know about "normal", but it should not be causing any problems. If you have many "optional" columns, that are null most of the time, or many fields are very large and not often queried, then maybe the schema could be normalized or tuned a bit more, but the number of columns itself is not an issue.
The number of columns has no relationship to whether the data is normalized or not. It is the content of the columns which will tell you that. Are the columns things like Phone1, Phone2, Phone3? Then certainly the table is not normalized and should be broken apart. But if they are all different items which are all in a one-to-one relationship with the key value, then 40-50 columns can be normalized.
This doesn't mean you always want to store them in one table though. If the combined size of those columns is larger than the actual bytes allowed per row of data in the database, you might be better off creating two or more tables in a one-to-one relationship with each other. Otherwise you will have trouble storing the data if all the fields are at or near their max size. And if some of the fields are not needed most of the time, a separate table may also be in order for them.
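As a sketch of the first case (the names here are hypothetical), the repeating phone columns would move into a child table in a one-to-many relationship:

```sql
-- Hypothetical normalization of Phone1/Phone2/Phone3 into a child table.
CREATE TABLE dbo.Contact
(
    ContactId int           NOT NULL PRIMARY KEY,
    Name      nvarchar(100) NOT NULL
    -- ... the genuinely one-to-one attributes stay here
);

CREATE TABLE dbo.ContactPhone
(
    ContactId   int          NOT NULL REFERENCES dbo.Contact (ContactId),
    PhoneType   nvarchar(20) NOT NULL,   -- e.g. 'home', 'work', 'mobile'
    PhoneNumber nvarchar(30) NOT NULL,
    CONSTRAINT PK_ContactPhone PRIMARY KEY (ContactId, PhoneType)
);
```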
I have a routine that will be creating individual tables (Sql Server 2008) to store the results of reports generated by my application (Asp.net 3.5). Each report will need its own table, as the columns for the table would vary based on the report settings. A table will contain somewhere between 10-5,000 rows, rarely more than 10,000.
The following usage rules will apply:
Once stored, the data will never be updated.
Whenever results for the table are accessed, all data will be retrieved.
No other table will need to perform a join with this table.
Knowing this, is there any reason to create a PK index column on the table? Will doing so aid the performance of retrieving the data in any way, and if it would, would this outweigh the extra load of updating the index when inserting data (I know that 10K records is a relatively small amount, but this solution needs to be able to scale).
Update: Here are some more details on the data being processed, which goes into the current design decision of one table per report:
Tables will record a set of numeric values (set at runtime based on the report settings) that correspond to a different set of reference varchar values (also set at runtime based on the report settings).
Whenever data is retrieved, some post-processing on the server will be required before the output can be displayed to the user (thus I will always be retrieving all values).
I would also be suspicious of someone claiming that they had to create a new table for each time the report was run. However, given that different columns (both in number, name and datatype) could conceivably be needed for every time the report was run, I don't see a great alternative.
The only other thing I can think of is to have an ID column (identifying the ReportVersionID, corresponding to another table), a ReferenceValues column (a varchar field containing all reference values, in a specified order, separated by some delimiter) and a NumericValues column (the same as ReferenceValues, but for the numbers). Then, when I retrieve the results, I would put everything into specialized objects in the system, separating the values based on the defined delimiter. Does this seem preferable?
Primary keys are not a MUST for any and all data tables. True, they are usually quite useful, and abandoning them is unwise. However, in addition to their primary mission of speed (which I agree would doubtfully be positively affected here), they also govern uniqueness. To that end, and valuing the consideration you've already obviously given this, I would suggest that the only need for a primary key here would be to govern the expected uniqueness of the table.
Update:
You mentioned in a comment that if you did add a PK, it would be an Identity column that presently does not exist and is not needed. In that case, I would advise against the PK altogether. As @RedFilter pointed out, surrogate keys never add any value.
I would keep it simple: just store the report results converted to JSON or XML in a VARCHAR(MAX) column.
One of the most useful and least emphasized (explicitly) benefits of data integrity (primary keys and foreign key references to start with) is that it forces a 'design by contract' between your data and your application(s); which stops quite a lot of types of bugs from doing any damage to your data. This is such a huge win and a thing that is implicitly taken for granted (it is not 'the database' that protects it, but the integrity rules you specify; forsaking the rules you expose your data to various levels of degradation).
This seems unimportant to you (you did not even discuss what a possible primary key would be), and your data seems quite unrelated to other parts of the system (you will not join to any other tables). Still, all other things being equal, I would model the data properly, and then, if the primary keys (or other data integrity rules) are not being used and you are chasing every last bit of performance, I would consider dropping them in production (and test for any actual gains).
As for comments that creating tables is a performance hit - that is true, but you did not tell us how temporary these tables are. Once created, will they be heavily used before being scrapped? Or do you plan to create tables for just a dozen read operations?
If you will use these tables heavily, and if you provide a clean mechanism for managing them (removing them when not used, selecting them, etc.), I think that dynamically creating the tables would be perfectly fine (you could have shared more details on the tables themselves; a use case would be nice).
Notes on other solutions:
The EAV model is horrible unless very specific conditions are met (for example: flexibility is paramount and automating DDL is too much of a hassle). Keep away from it (or be very, very good at anticipating what kinds of queries you will have to deal with, and rigorous in validating data on the front end).
The XML/BLOB approach might be the right thing for you if you will consume the data as XML/BLOBs at the presentation layer (always reading all of the rows, always writing the whole 'object', and, finally, only if your presentation layer likes XML/BLOBs).
EDIT:
Also, depending on the usage patterns, having a primary key can indeed increase the speed of retrieval, and if I read the fact that the data will not be updated as 'it will be written once and read many times', then there is a good chance that this will indeed outweigh the cost of updating the index on inserts.
Will it be one table for every run of a given report, or one table for all runs of a given report? In other words, if you have Report #1 and you run it 5 times, over different ranges of data, will you produce 5 tables, or will all 5 runs of the report be stored in the same table?
If you are storing all 5 runs of the report in the same table, then you'll need to filter the data so that it is appropriate to the run in question. In this case, having a primary key will let you do the WHERE clause for that filter much faster.
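A possible sketch of that single-table layout, with invented table and column names: a clustered primary key leading on the run id turns the per-run filter into a cheap range seek.

```sql
-- Hypothetical single table holding every run of a report.
CREATE TABLE dbo.ReportResult
(
    ReportRunId int            NOT NULL,
    RowNo       int            NOT NULL,
    RefValue    nvarchar(100)  NOT NULL,
    NumValue    decimal(18, 4) NULL,
    CONSTRAINT PK_ReportResult PRIMARY KEY CLUSTERED (ReportRunId, RowNo)
);

-- Retrieving one run seeks straight to its contiguous range of rows.
SELECT RowNo, RefValue, NumValue
FROM   dbo.ReportResult
WHERE  ReportRunId = 17
ORDER BY RowNo;
```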
If you are creating a new table for every run of the report, then you don't need a primary key. However, you are going to run into other performance problems as the number of tables in your system grows... assuming you don't have something in place to drop old data / tables.
If you are really not using the tables for anything other than as a chunk of read-only data, you could just as well store all the reports in a single table, as XML values.
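For example, a minimal sketch (hypothetical names) of one table holding each run's entire output as a single xml value:

```sql
-- Hypothetical alternative to per-run tables: one row per run, one xml blob per row.
CREATE TABLE dbo.ReportRun
(
    ReportRunId int       IDENTITY(1,1) PRIMARY KEY,
    CreatedAt   datetime2 NOT NULL DEFAULT SYSUTCDATETIME(),
    ResultXml   xml       NOT NULL   -- entire report output, read back in one go
);

INSERT INTO dbo.ReportRun (ResultXml)
VALUES (N'<report><row ref="A" value="1.5"/><row ref="B" value="2.0"/></report>');

SELECT ResultXml
FROM   dbo.ReportRun
WHERE  ReportRunId = 1;
```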
What column or columns would the PK index be built on? If just a surrogate identity column, you'll have no performance hit when inserting rows, as they'd be inserted "in order". If it is not a surrogate key, then you have the admittedly minor but still useful assurance that you don't have duplicate entries.
Is the primary key used to control the order in which report rows are to be printed? If not, then how do you ensure proper ordering of the information? (Or is this just a data table that gets summed one way and another whenever a report is generated?)
If you use a clustered primary key, you wouldn't use as much storage space as you would with a non-clustered index.
By and large, I find that while not every table requires a primary key, it does not hurt to have one present, and since proper relational database design requires primary keys on all tables, it's good practice to always include them.