We have a table that we denormalized because there is a big risk that joins would be too slow for the amount of data our users have. So we created 10 info columns (INFO0, INFO1... INFO9). Most of the time, only the first 2-3 columns are used; the others are null.
But now we need to add two more types of info with 10 columns each (for a total of 20 new columns). The tricky part is that our design makes it impossible for users to use all 30 denormalized columns: at any given time, they can use a maximum of 10 on each row. Moreover, we may need to add even more denormalized columns later, but no more than 10 will ever be used on any given row.
I know it is not a good design, but we don't really have a choice. So my question is: can this design become inefficient? Can having a lot of columns with null values slow down my queries? If so, can it become a big deal?
Yes, it could. You don't say what database you're using or what data type the extra columns are, but adding more columns increases the 'width' of your table, which means more logical reads are needed to retrieve the same number of records, and more reads means slower queries. So what you gain by denormalisation may eventually be lost by adding too many columns, though the extent of this will depend on your database design.
If it does affect performance, an intermediate solution could be to vertically split the table, placing infrequently referenced columns in a second table, as sketched below.
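A minimal sketch of that kind of vertical split in T-SQL, with made-up table and column names since the question doesn't give the real schema:

    -- The rarely used INFO columns move to a second table that shares the main table's key.
    CREATE TABLE dbo.ItemInfoExtra (
        ItemId INT NOT NULL PRIMARY KEY
            REFERENCES dbo.Item (ItemId),
        INFO3  VARCHAR(100) NULL,
        INFO4  VARCHAR(100) NULL
        -- ... INFO5 through INFO9, and any new column groups, as needed
    );

    -- A row exists here only when an item actually has data beyond INFO0-INFO2, so the
    -- main table stays narrow and the extra columns are joined in only when needed:
    SELECT i.ItemId, i.INFO0, i.INFO1, i.INFO2, x.INFO3, x.INFO4
    FROM dbo.Item i
    LEFT JOIN dbo.ItemInfoExtra x ON x.ItemId = i.ItemId;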
This is a question about best practice and best performance.
I have inherited a database that contains data for turbine engines. I have found 20 data points that are calculated from several fields from the turbine. The way it was done in the past is that a view was created to pull data for some turbines and calculate some of the 20 data points, then other views for the same turbines but different data points, and then other views for different turbines and data points. So the same equations are used over and over.
I want to consolidate all of the equations (20 data points) into one place. My debate is between creating a user-defined function that will do all 20 calculations or creating them as computed columns in the table. With a function, it would calculate all 20 for each turbine even though I might only need 2 or 3 for a view. But as computed columns, it would only calculate the columns the view pulled.
The answer is probably "it depends".
The factors when making this determination include:
Is the column deterministic? (i.e. can you persist it or not)
How often is data inserted into the table?
How often is data retrieved from the table?
The trade-offs for computed columns, and specifically persisted computed columns, are similar to those for an index on your table. Persisted columns increase the time an insert takes, but allow retrieval to happen faster. On the other end, with non-persisted computed columns or a function, inserts are faster but retrieval is slower.
The end solution would likely depend on the utilization of the table (how often writes and reads occur) - which is something that you would need to determine.
Personally, I wouldn't use a function for the columns; rather, I'd persist them, or write a view/computed columns that accomplish the same thing, depending on the nature of the usage of the table.
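For illustration, a rough sketch of both options in T-SQL. The table, column names and formula here are hypothetical stand-ins, not the actual turbine calculations:

    -- Option A: a computed column, PERSISTED if the expression is deterministic.
    -- The value is calculated when the row is written and stored, so reads are cheap.
    ALTER TABLE dbo.TurbineReadings
        ADD PowerOutput AS (Torque * Rpm / 5252.0) PERSISTED;
    GO

    -- Option B: a scalar user-defined function, evaluated on every read.
    CREATE FUNCTION dbo.fn_PowerOutput (@Torque FLOAT, @Rpm FLOAT)
    RETURNS FLOAT
    AS
    BEGIN
        RETURN @Torque * @Rpm / 5252.0;
    END;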
I have to create a database to store information being sent and received to / from a 3rd party web service portal. There are about 150 fields of information to be sent, though I can remove about 50 of those fields by normalising (there are three sets of addresses that can be saved in an address table, for example). However, this still leaves a table that could potentially have 100 columns.
I've come up with two ways of handling this though I'm not sure which to use:
1. Have a table with 100 columns and three references to an address table.
2. Break it down into maybe 15-20 separate dedicated tables.
Option 1 seems the quickest as it involves the fewest joins but the idea of a table with 100 columns doesn't feel right.
Option 2 feels better and would break things down into more manageable chunks, but it won't save any database space and will increase the number of joins. Pretty much all the columns in the database will have a value, and I cannot normalise these columns any further.
My question is, in this situation is it acceptable to have a table with c.100 columns in it or should I try and break it down over several tables for presentation?
Please note: The table structure will not change over the course of its usage; a new database would be created for a new version of the web service portal. I have no control over the web service data structure.
Edit: #Oded's answer below has made me think a bit more about how the data will be accessed; it will really only be accessed in whole and not in part. I wouldn't, for example, need to return columns 5-20 on a regular basis.
Answer: I accepted Oded's answer; the comments posted after it helped me make up my mind, and I decided to go with option 1. As the data is accessed in full, having one table seems the better solution. If, for example, I regularly wanted to access columns 5-20 rather than the full table row, then I'd look at breaking it up into separate tables for performance reasons.
Speaking from a relational purist point of view - first, there is nothing against having 100 columns in a table, if they are related. The point here is that if after normalizing you still have 100 columns, that's OK.
But you should normalize, and in the process you may very well end up with 15-20 separate dedicated tables, which most relational database professionals would agree is a better design (avoid data duplication with the update/delete issues associated, smaller data footprint etc...).
Pragmatically, however, if there is a measurable performance problem, it may be sensible to denormalize your design for performance benefit. The key word here is measurable. Don't optimize before you have an actual problem.
In that respect, I'd say you should go with the set of 15-20 tables as an initial design.
From MSDN: Maximum Capacity Specifications for SQL Server:
Columns per nonwide table: 1,024
Columns per wide table: 30,000
So I think 100 columns is OK in your case. You may also want to note (from the same link):
Columns per primary key: 16
Of course, this only applies if you need the data purely as a log for the service.
If, after reading from the service, you need to maintain the data, then normalising seems better...
If you find it easier to "manage" tables with fewer columns, however you happen to define manageability (e.g. less horizontal scrolling when looking at the table data in SSMS), you can break the table up into several tables with 1-to-1 relationships without violating the rules of normalization.
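If you do split it that way, a view can present the pieces as a single wide row again for convenience. A hedged sketch with invented names:

    CREATE VIEW dbo.vw_PortalMessage
    AS
    SELECT m.MessageId, m.Field01, m.Field02,   -- columns from the 'core' table
           d.Field51, d.Field52                 -- columns from a 1-to-1 detail table
    FROM dbo.PortalMessage m
    JOIN dbo.PortalMessageDetail d ON d.MessageId = m.MessageId;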
I have a database table (called Fields) which has about 35 columns. 11 of them always contain the same constant values across each of the roughly 300,000 rows - they act as metadata.
The downside of this structure is that when I need to update those 11 columns' values, I need to go and update all 300,000 rows.
I could move all the common data into a different table and update it only once, in one place, instead of in 300,000 places.
However, if I do it like this, then when I display the fields I need an INNER JOIN between the two tables, which I know makes the SELECT statement slower.
I must say that updating the columns occurs more rarely than reading (displaying) the data.
How do you suggest I store the data in the database to get the best performance?
I could move all the common data into a different table and update it only once, in one place, instead of in 300,000 places.
I.e. sane database design and standard normalization.
This is not about "many empty fields"; it is brutally about tons of redundant data. Constants should be isolated in a separate table. This may also make things faster - it allows the database to use memory more efficiently because your database is a lot smaller.
I would suggest going with a separate table unless you've left something significant out (of course, it would be better to try and measure, but I suspect you already know the answer).
You can actually get faster selects as well: joining a small table would be cheaper than fetching the same data 300,000 times.
This is a classic example of denormalized design. Sometimes denormalization is done for (SELECT) performance - and when it is, it should be done in a deliberate, measurable way. Have you actually measured whether you gain any performance by it?
If your data fits into the cache, and/or the JOIN is unusually expensive[1], then there may well be some performance benefit from avoiding the JOIN. However, the denormalized data is larger and will push at the limits of your cache sooner, increasing I/O and likely reversing any gains you may have reaped from avoiding the JOIN - you might actually lose performance.
And of course, getting incorrect data is useless, no matter how quickly you can do it. The denormalization makes your database less resilient to data inconsistencies[2], and the performance difference would have to be pretty dramatic to justify this risk.
[1] Which doesn't look to be the case here.
[2] E.g. have you considered what happens in a concurrent environment where one application modifies existing rows and another application inserts a new row, but with the old values (since the first application hasn't committed yet, there is no way for the second application to know that there was a change)?
The best way is to separate the data out into a second table with those 11 columns - call it a MASTER DATA TABLE - which will have a primary key.
This primary key can then be referenced as a foreign key in those 300,000 rows in the first table, as sketched below.
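A minimal sketch of that split (table and column names are invented stand-ins):

    -- The 11 constant/metadata columns move into their own table:
    CREATE TABLE dbo.FieldsMetadata (
        MetadataId INT IDENTITY(1,1) PRIMARY KEY,
        Meta01     VARCHAR(100) NOT NULL,
        Meta02     VARCHAR(100) NOT NULL
        -- ... the remaining metadata columns
    );

    -- Fields then carries a single foreign key instead of 11 repeated values
    -- (added as NULL so existing rows can be backfilled before tightening it):
    ALTER TABLE dbo.Fields
        ADD MetadataId INT NULL
            CONSTRAINT FK_Fields_Metadata REFERENCES dbo.FieldsMetadata (MetadataId);

    -- Updating the shared values now touches one row instead of 300,000:
    UPDATE dbo.FieldsMetadata SET Meta01 = 'new value' WHERE MetadataId = 1;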
I have a table with 158 columns in SQL Server 2005.
Are there any disadvantages to keeping so many columns?
Also, given that I have to keep that many columns, how can I improve performance - for example by using stored procedures or indexes?
Wide tables can be quite performant when you usually want all the fields for a particular row. Have you traced your users' usage patterns? If they're usually pulling just one or two fields from multiple rows then your performance will suffer. The main issue is when your total row size hits the 8k page mark. That means SQL has to hit the disk twice for every row (first page + overflow page), and that's not counting any index hits.
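If you want to check whether rows are already spilling past that mark, one way on SQL Server 2005 and later is to look at the table's allocation units; the table name below is a placeholder:

    SELECT alloc_unit_type_desc, page_count, avg_record_size_in_bytes
    FROM sys.dm_db_index_physical_stats(
             DB_ID(), OBJECT_ID('dbo.WideTable'), NULL, NULL, 'DETAILED');
    -- Any pages reported under ROW_OVERFLOW_DATA mean some rows have crossed the 8k limit.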
The guys at Universal Data Models will have some good ideas for refactoring your table. And Red Gate's SQL Refactor makes splitting a table heaps easier.
There is nothing inherently wrong with wide tables. The main case for normalization is database size, where lots of null columns take up a lot of space.
The more columns you have, the slower your queries will be.
That's just a fact. That isn't to say you aren't justified in having many columns. The above does not give one carte blanche to split a table holding one entity's worth of columns into multiple tables with fewer columns. The administrative overhead of such a solution would most probably outweigh any perceived performance gains.
My number one recommendation to you, based off of my experience with abnormally wide tables (denormalized schemas of bulk imported data) is to keep the columns as thin as possible. I had to work with a lot of crazy data and left most of the columns as VARCHAR(255). I recommend against this. Although convenient for development purposes, performance would spiral out of control, especially when working with Perl. Shrinking the columns to their bare minimum (VARCHAR(18) for instance) helped immensely.
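As a rough sketch of that clean-up (names are hypothetical): measure what's really stored, then shrink the column from the lazy default.

    -- Find the widest value actually present before picking a new size.
    SELECT MAX(LEN(SerialNo)) AS MaxLen FROM dbo.ImportedReadings;

    -- Then narrow the column to a realistic width.
    ALTER TABLE dbo.ImportedReadings
        ALTER COLUMN SerialNo VARCHAR(18) NULL;   -- was VARCHAR(255)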
Stored procedures are just batches of SQL commands; they don't have any direct effect on speed, other than that regular use of certain types of stored procedures will end up reusing cached query plans (which is a performance boost).
You can use indexes to speed up certain queries, but there's no hard and fast rule here. Good index design depends entirely on the type of queries you're running. Indexing will, by definition, make your writes slower; they exist only to make your reads faster.
The problem with having that many columns in a table is that finding rows using the clustered primary key can be expensive. If it were possible to change the schema, breaking this up into many normalized tables will be the best way to improve efficiency. I would strongly recommend this course.
If not, then you may be able to use indices to make some SELECT queries faster. If you have queries that only use a small number of the columns, adding indices on those columns could mean that the clustered index will not need to be scanned. Of course, there is always a price to pay with indices, in terms of storage space and INSERT, UPDATE and DELETE time, so this may not be a good idea for you.
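For example (with hypothetical table and column names), a covering nonclustered index lets such a query be answered without touching the wide clustered index at all:

    CREATE NONCLUSTERED INDEX IX_Readings_Sensor_Date
        ON dbo.Readings (SensorId, ReadingDate)
        INCLUDE (Value1, Value2);

    -- This query can now be satisfied entirely from the narrow index:
    SELECT SensorId, ReadingDate, Value1, Value2
    FROM dbo.Readings
    WHERE SensorId = 42
      AND ReadingDate >= '20240101';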
I have a request to allow a dynamic table to have 1000 columns (randomly selected by my end users). This seems like a bad idea to me. It's a customizable table, so it will have a mixture of varchar(200) and float columns (float best matches the application's C++ double type). This database is mostly an index for a legacy application and serves as a reporting repository. It's not the system of record. The application has thousands of data points, very few of which could be normalized out.
Any ideas as to what the performance implications of this are? Or an ideal table size to partition this down to?
Since I don't know which fields out of 20k worth of choices the end users will pick, normalizing the tables is not feasible. I can separate this data out into several tables that I would have to manage dynamically (fields can be added or dropped; the rows are then deleted and the system of record is re-parsed to fill the table). My preference is to push back and normalize all 20k bits of data, but I don't see that happening.
This smells like a bad design to me.
Things to consider:
Will most of those columns contain NULL values?
Will many be named Property001, Property002, Property003, etc...?
If so, I recommend you rethink your data normalization.
From the SQL Server 2005 documentation:
SQL Server 2005 can have up to two billion tables per database and 1,024 columns per table. (...) The maximum number of bytes per row is 8,060. This restriction is relaxed for tables with varchar, nvarchar, varbinary, or sql_variant columns that cause the total defined table width to exceed 8,060 bytes. The lengths of each one of these columns must still fall within the limit of 8,000 bytes, but their combined widths may exceed the 8,060 byte limit in a table.
What is the functionality of these columns? Why not split them into a master table, properties (lookup tables) and values instead, as in the sketch below?
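One hedged sketch of that kind of split (all names invented): a definition table for the properties and a narrow value table keyed by record and property. The trade-off is that reports have to pivot the values back out.

    CREATE TABLE dbo.DataPointDef (
        DataPointId INT IDENTITY(1,1) PRIMARY KEY,
        Name        VARCHAR(100) NOT NULL UNIQUE,
        ValueType   CHAR(1) NOT NULL          -- 'F' = float, 'S' = string
    );

    CREATE TABLE dbo.Record (
        RecordId   INT IDENTITY(1,1) PRIMARY KEY,
        ImportedAt DATETIME NOT NULL DEFAULT GETDATE()
    );

    CREATE TABLE dbo.RecordValue (
        RecordId    INT NOT NULL REFERENCES dbo.Record (RecordId),
        DataPointId INT NOT NULL REFERENCES dbo.DataPointDef (DataPointId),
        FloatValue  FLOAT NULL,
        StringValue VARCHAR(200) NULL,
        PRIMARY KEY (RecordId, DataPointId)
    );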
Whenever you feel the need to ask what limits the system has, you have a design problem.
If you were asking "How many characters can I fit into a varchar?" then you shouldn't be using varchars at all.
If you seriously want to know if 1000 columns is okay, then you desperately need to reorganize the data. (normalization)
MS SQL Server has a limit of 1024 columns per table, so you're going to be running right on the edge of this. Using varchar(200) columns, you'll be able to go past the 8k byte per row limit, since SQL will store 8k on the data page, and then overflow the data outside of the page.
SQL 2008 added Sparse Columns for scenarios like this - where you'd have a lot of columns with null values in them.
Using Sparse Columns
http://msdn.microsoft.com/en-us/library/cc280604.aspx
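A hedged sketch of what that might look like (column names invented). A sparse column uses no storage when NULL, at the cost of a little extra overhead when a value is present:

    CREATE TABLE dbo.CustomFields (
        RowId         INT IDENTITY(1,1) PRIMARY KEY,
        Property001   FLOAT SPARSE NULL,
        Property002   VARCHAR(200) SPARSE NULL,
        -- ... more sparse columns as the users define them
        AllProperties XML COLUMN_SET FOR ALL_SPARSE_COLUMNS  -- optional: returns the non-NULL values as XML
    );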
As a rule: the wider the table the slower the performance. Many thin tables are preferable to one fat mess of a table.
If your table is that wide, it's almost certainly a design issue. There's no real rule on how many columns is preferable; I've never really come across tables with more than 20 columns in the real world. Just group by relation. It's an RDBMS, after all.
This will have huge performance and data issues. It probably needs to be normalized.
While SQL Server will let you create a table defined as wider than 8,060 bytes per row, it will NOT let you store more data than that in it. You could have data unexpectedly truncated (and worse, this might not happen until several months later, by which time fixing this monstrosity is both urgent and extremely hard).
Querying this will also be a real problem. How would you know which of the 1000 columns to look in for the data? Should every query check all 1000 columns in the WHERE clause?
And the idea that this would be user-customizable is scary indeed. Why would the user need 1000 fields to customize? Most applications I've seen which give the user a chance to customize some fields set a small limit (usually fewer than 10). If there is that much they need to customize, then the application hasn't done a good job of defining what the customer actually needs.
Sometimes as a developer you just have to stand up and say no, this is a bad idea. This is one of those times.
As to what you should do instead (other than normalize), I think we would need more information to point you in the right direction.
And BTW, float is an inexact datatype and should not be used for fields where calculations are taking place unless you like incorrect results.
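A quick illustration of the point, using SET so it also runs on older SQL Server versions:

    DECLARE @f FLOAT, @d DECIMAL(10, 2);
    SET @f = 0.1;
    SET @d = 0.1;

    SELECT @f + @f + @f AS FloatSum,    -- typically 0.30000000000000004 (binary rounding error)
           @d + @d + @d AS DecimalSum;  -- exactly 0.30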
I have to disagree with everyone here... I know it sounds mad, but using tables with hundreds of columns is the best thing I have ever done.
Yes many columns frequently have null values;
Yes I could normalise it to just a few tables and transpose;
Yes it is inefficient
However, it is incredibly fast and easy to analyze column data in endless different ways.
Wasteful and inelegant - but you'll never build anything as useful!
That is too many. Any more than 50 columns wide and you are asking for trouble in performance, code maintenance, and troubleshooting when there are problems.
Seems like an awful lot. I would first make sure that the data is normalized. That might be part of your problem. What type of purpose will this data serve? Is it for reports? Will the data change?
I would think a table that wide would be a nightmare performance and maintenance-wise.
Did you think of viewing your final (1000-column) table as the result of a crosstab query? Your original table would then have just a few columns but many thousands of records - see the sketch below.
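In other words, something like a key/value layout that gets pivoted on the way out. A hedged sketch with invented names:

    -- Storage: one narrow row per (record, field) pair instead of 1000 columns.
    -- Reporting: pivot out only the fields a given report actually needs.
    SELECT RecordId,
           MAX(CASE WHEN FieldName = 'Pressure'    THEN FieldValue END) AS Pressure,
           MAX(CASE WHEN FieldName = 'Temperature' THEN FieldValue END) AS Temperature
    FROM dbo.FieldValues
    GROUP BY RecordId;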
Can you please elaborate on your problem? I think nobody really understands why you need these 1000 columns!