How to deal with extremely wide tables in SQL server? - sql-server

I have encountered the following dilemma several times and would be interested to hear how others have addressed this issue or if there is a canonical way that the situation can be addressed.
In some domains, one is naturally led to consider very wide tables. Take, for instance, time series surveys that evolve over many years. Such surveys can have hundreds, if not thousands, of variables. Typically though there are probably only a few thousand or tens-of-thousands of rows. It is absolutely natural to consider such a result set as a table where each variable corresponds to a column in the table however, in SQL Server at least, one is limited to 1024 (non sparse) columns.
The obvious workarounds are to
Distribute each record over multiple tables
Stuff the data into a single table with columns of say, ResponseId, VariableName, ResponseValue
Number 2. I think is very bad for a number of reasons (difficult to query, suboptimal storage, etc) so really the first choice is the only viable option I see. This choice can be improved perhaps by grouping columns that are likely to be queried together into the same table - but one can't really know this until the database is actually being used.
So, my basic question is: Are there better ways to handle this situation?

You might want to put a view in front of the tables to make them appear as if they are a single table. The upside is that you can rearrange the storage later without queries needing to change. The downside is that only modifications to the base table can be done through the view. If necessary, you could mitigate this with stored procedures for frequently used modifications. Based on your use case of time series surveys, it sounds like inserts and selects are far more frequent than updates or deletes, so this might be a viable way to stay flexible without forcing your clients to update if you need to rearrange things later.

Hmmm it really depends on what you do with it. If you want to keep the table as wide as it is (possibly this is for OLAP or data warehouse), I would just use proper indexes. Also based on the columns that are selected more often , I could also use covering indexes. Based on the rows that are searched more often, I could also use filtered indexes. If there are, let’s say, billions of records in the table, you could partition the table as well. If you just want to store the table over multiple tables, definitely use proper normalization techniques, probably up to 3NF or 3.5NF, to divide the big table into smaller tables. I would use the first method of yours, normalization, to store data for the big table just because it seems like it makes sense better to me that way.

This is an old topic but something we are currently working on resolving. Neither of the above answers really give as many benefits as the solution we feel we have found.
We previously believed that having wide tables wasn't really a problem. Having spent time analysing this we have seen the light and realise that costs on inserts/updates are indeed getting out of control.
As John states above, the solution is really to create a VIEW to provide your application with a consistent schema. One of the challenges in any redesign may be, as in our case, that you have thousands or millions of lines of code referencing an old wide table and you may want to provide backwards compatibility.
Views can also be used for UPDATES and INSERTS as John alludes to, however a problem we found initially was that if you take the example of myWideTable which may have hundreds of columns and you want to split this to myWideTable_a with columns a, b and c and myWideTable_b with columns x, y and z then an insert to a view which only sets column a will only insert a record for myWideTable_a
This causes a problem when you want to later update your record and set myWideTable.z as this will fail.
The solution we're adopting, and performance testing, is to have an 'insteadof' trigger on the View insert to always insert to our split-tables so that we can continue to update or read from the view with impunity.
The question as to whether using this trigger on inserts provides more overhead than a wide table is still open, but it is clear that it will improve subsequent writes to columns in each split table.

Related

Database design: one large table versus several smaller tables

I have to create a database to store information being sent and received to / from a 3rd party web service portal. There are about 150 fields of information to be sent though I can remove about 50 of those fields by normalising (there are three sets addresses that can be saved in an address table, for example). However, this still leaves a table that could potentially have 100 columns.
I've come up with two ways of handling this though I'm not sure which to use:
1. Have a table with 100 columns and three references to an address table.
2. Break it down into maybe 15-20 separate dedicated tables.
Option 1 seems the quickest as it involves the fewest joins but the idea of a table with 100 columns doesn't feel right.
Option 2 feels better and would break things down in to more managable chunks but it won't save any database space and will increase the number of joins. Pretty much all the columns in the database will have a value and I cannot normalise these columns any further.
My question is, in this situation is it acceptable to have a table with c.100 columns in it or should I try and break it down over several tables for presentation?
Please note: The table structure will not change over the course of it's useage, a new database would be created for a new version of the web service portal. I have no control over the web service data structure.
Edit: #Oded's answer below has made me think a bit more about how the data will be accessed; it will really only be accessed in whole and not in part. I wouldn't for example, need to return columns 5-20 on a regular basis.
Answer: I accepted Oded's answer based on the comments after he posted it helped me make my mind up and I decided to go with option 1. As the data is accessed in full then having one table seems the better solution. If, for example, I regularly wanted to access columns 5-20 rather than the full table row then I'd see about breaking it up into separate tables for performance reasons.
Speaking from a relational purist point of view - first, there is nothing against having 100 columns in a table, if they are related. The point here is that if after normalizing you still have 100 columns, that's OK.
But you should normalize, and in the process you may very well end up with 15-20 separate dedicated tables, which most relational database professionals would agree is a better design (avoid data duplication with the update/delete issues associated, smaller data footprint etc...).
Pragmatically, however, if there is a measurable performance problem, it may be sensible to denormalize your design for performance benefit. The key here - measureable. Don't optimize before you have an actual problem.
In that respect, I'd say you should go with the set of 15-20 tables as an initial design.
From MSDN:Maximum Capacity Specifications for SQL Server :
Columns per nonwide table: 1,024
Columns per wide table: 30,000
So I think 100 columns is ok in your case. And also maybe you need to note(from same link):
Columns per primary key: 16
Of course this is only in the case if need data only as Log for a service.
If after reading from service you need to maintain data -> then normalising seems better...
If you find it easier to "manage" tables with fewer columns, however you happen to define manageability (e.g. less horizontal scrolling when looking at the table data in SSMS), you can break the table up into several tables with 1-to-1 relationships without violating the rules of normalization.

SQL Server: Many columns in a table vs Fewer columns in two tables

I have a database table (called Fields) which has about 35 columns. 11 of them always contains the same constant values for about every 300.000 rows - and act as metadata.
The down side of this structure is that, when i need to update those 11 columns values, i need to go and update all 300.000 rows.
I could move all the common data in a different table, and update it only one time, in one place, instead of 300.000 places.
However, if i do it like this, when i display the fields, i need to create INNER JOIN's between the two tables, which i know makes the SELECT statement slower.
I must say that updating the columns occurs more rarely than reading (displaying) the data.
How you suggest that i should store the data in database to obtain the best performances?
I could move all the common data in a different table, and update it only one time, in one
place, instead of 300.000 places.
I.e. sane database design and standad normalization.
This is not about "many empty fields", it is brutally about tons of redundant data. Constants you should have isolated. Separate table. This may also make things faster - it allows the database to use memory more efficient because your database is a lot smaller.
I would suggest to go with a separate table unless you've concealed something significant (of course it would be better to try and measure, but I suspect you already know it).
You can actually get faster selects as well: joining a small table would be cheaper then fetching the same data 300000 times.
This is a classic example of denormalized design. Sometimes, denormalization is done for (SELECT) performance, and always in a deliberate, measurable way. Have you actually measured whether you gain any performance by it?
If your data fits into cache, and/or the JOIN is unusually expensive1, then there may well be some performance benefit from avoiding the JOIN. However, the denormalized data is larger and will push at the limits of your cache sooner, increasing the I/O and likely reversing any gains you may have reaped from avoiding the JOIN - you might actually lose performance.
And of course, getting the incorrect data is useless, no matter how quickly you can do it. The denormalization makes your database less resilient to data inconsistencies2, and the performance difference would have to be pretty dramatic to justify this risk.
1 Which doesn't look to be the case here.
2 E.g. have you considered what happens in a concurrent environment where one application might modify existing rows and the other application inserts a new row but with old values (since the first application hasn't committed yet so there is no way for the second application to know that there was a change)?
The best way is to seperate the data and form second table with those 11 columns and call it as some MASTER DATA TABLE, which will be having a primary key.
This primary key can be referred as a foreign key in those 30,000 rows in the first table

Access database performance

For a few different reasons one of my projects is hosted on a shared hosting server
and developed in asp.Net/C# with access databases (Not a choice so don't laugh at this limitation, it's not from me).
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1- Is the order of the records in the database only visual or is there an actual difference internally. More specifically, the reason I ask is that the way it is currently designed all records (for all databases in this project) are ordered by a row identifying key (which is an auto number field) ascending but since over 80% of my queries will be querying fields that should be towards the end of the table would it increase the query performance if I set the table to showing the most recent record at the top instead of at the end?
2- Are there any other performance tuning that can be done to help with access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes and have a primary key (automatically indexes) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing, I'm already doing all that can be done for access performance. I'll give the question "accepted answer" to the one that was the fastest to answer.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes there is an actual order to the records in the database. Setting the defaults on the table preference isn't going to change that.
I would ensure there are indexes on all your where clause columns. This is a rule of thumb. It would rarely be optimal, but you would have to do workload testing against different database setups to prove the most optimal solution.
I work daily with legacy access system that can be reasonably fast with concurrent users, but only for smallish number of users.
You can use indexes on the fields you search for (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it's the same as SQL server you won't need to include the primary key in the index (assuming it's clustering on this) as it's included by default.
If you're running queries on a small sub-set of fields you could get your index to be a 'covering' index by including all the fields required, there's a space trade-off here, so I really only recommend it for 5 fields or less, depending on your requirements.
Are you actually experiencing a performance problem now or is this just a general optimization question? Also from your post it sounds like you are talking about a db with 1 table, is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if only 80% of your table rows are not accessed (as implied in your question), create an archive table for older records. Or, if the bulk of your performance hits are from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
In regard to table order, Jet/ACE writes the actual table date in PK order. If you want a different order, change the PK.
But this oughtn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field Length, create your fields as large as you'll need them but don't go over - for instance, if you have a number field and the value will never be over 1000 (for the sake of argument) then don't type it as a Long Integer, something smaller like Integer would be more appropriate, or use a single instead of a double for decimal numbers, etc. By the same token, if you have a text field that won't have more than 50 chars, don't set it up for 255, etc, etc. Sounds obvious, but it's done, often times with the idea that "I might need that space in the future" and your app suffers in the mean time.
Not to beat the indexing thing to death...but, tables that you're joining together in your queries should have relationships established, this will create indexes on the foreign keys which greatly increases the performance of table joins (NOTE: Double check any foreign keys to make sure they did indeed get indexed, I've seen cases where they haven't been - so apparently a relationship doesn't explicitly mean that the proper indexes have been created)
Apparently compacting your DB regularly can help performance as well, this reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under tools Analyze > Performance, it might be worth running it on your tables & queries at least to see what it comes up with. The table analyzer (available from the same menu) can help you split out tables with alot of redundant data, obviously, use with caution - but it's could be helpful.
This link has a bunch of stuff on access performance optimization on pretty much all aspects of the database, tables, queries, forms, etc - it'd be worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here it is useful to consider how access works, in an un-indexed table there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed by the virtue of the fact that Access / the JET engine is an ISAM database it's the other way around. (http://en.wikipedia.org/wiki/ISAM) That's rather moot however as I would never suggest putting frequently accessed values at the top of a table, it is best as others have said to rely on useful indexes.

Limits on number of Rows in a SQL Server Table

Are there any hard limits on the number of rows in a table in a sql server table? I am under the impression the only limit is based around physical storage.
At what point does performance significantly degrade, if at all, on tables with and without an index. Are there any common practicies for very large tables?
To give a little domain knowledge, we are considering usage of an Audit table which will log changes to fields for all tables in a database and are wondering what types of walls we might run up against.
You are correct that the number of rows is limited by your available storage.
It is hard to give any numbers as it very much depends on your server hardware, configuration, and how efficient your queries are.
For example, a simple select statement will run faster and show less degradation than a Full Text or Proximity search as the number of rows grows.
BrianV is correct. It's hard to give a rule because it varies drastically based on how you will use the table, how it's indexed, the actual columns in the table, etc.
As to common practices... for very large tables you might consider partitioning. This could be especially useful if you find that for your log you typically only care about changes in the last 1 month (or 1 day, 1 week, 1 year, whatever). You could then archive off older portions of the data so that it's available if absolutely needed, but won't be in the way since you will almost never actually need it.
Another thing to consider is to have a separate change log table for each of your actual tables if you aren't already planning to do that. Using a single log table makes it VERY difficult to work with. You usually have to log the information in a free-form text field which is difficult to query and process against. Also, it's difficult to look at data if you have a row for each column that has been changed because you have to do a lot of joins to look at changes that occur at the same time side by side.
In addition to all the above, which are great reccomendations I thought I would give a bit more context on the index/performance point.
As mentioned above, it is not possible to give a performance number as depending on the quality and number of your indexes the performance will differ. It is also dependent on what operations you want to optimize. Do you need to optimize inserts? or are you more concerned about query response?
If you are truly concerned about insert speed, partitioning, as well a VERY careful index consideration is going to be key.
The separate table reccomendation of Tom H is also a good idea.
With audit tables another approach is to archive the data once a month (or week depending on how much data you put in it) or so. That way if you need to recreate some recent changes with the fresh data, it can be done against smaller tables and thus more quickly (recovering from audit tables is almost always an urgent task I've found!). But you still have the data avialable in case you ever need to go back farther in time.

What's the better database design: more tables or more columns?

A former coworker insisted that a database with more tables with fewer columns each is better than one with fewer tables with more columns each. For example rather than a customer table with name, address, city, state, zip, etc. columns, you would have a name table, an address table, a city table, etc.
He argued this design was more efficient and flexible. Perhaps it is more flexible, but I am not qualified to comment on its efficiency. Even if it is more efficient, I think those gains may be outweighed by the added complexity.
So, are there any significant benefits to more tables with fewer columns over fewer tables with more columns?
I have a few fairly simple rules of thumb I follow when designing databases, which I think can be used to help make decisions like this....
Favor normalization. Denormalization is a form of optimization, with all the requisite tradeoffs, and as such it should be approached with a YAGNI attitude.
Make sure that client code referencing the database is decoupled enough from the schema that reworking it doesn't necessitate a major redesign of the client(s).
Don't be afraid to denormalize when it provides a clear benefit to performance or query complexity.
Use views or downstream tables to implement denormalization rather than denormalizing the core of the schema, when data volume and usage scenarios allow for it.
The usual result of these rules is that the initial design will favor tables over columns, with a focus on eliminating redundancy. As the project progresses and denormalization points are identified, the overall structure will evolve toward a balance that compromises with limited redundancy and column proliferation in exchange for other valuable benefits.
It doesn't sound so much like a question about tables/columns, but about normalization. In some situations have a high degree of normalization ("more tables" in this case) is good, and clean, but it typically takes a high number of JOINs to get relevant results. And with a large enough dataset, this can bog down performance.
Jeff wrote a little about it regarding the design of StackOverflow. See also the post Jeff links to by Dare Obasanjo.
I would argue in favor of more tables, but only up to a certain point. Using your example, if you separated your user's information into two tables, say USERS and ADDRESS, this gives you the flexibility to have multiple addresses per user. One obvious application of this is a user who has separate billing and shipping addresses.
The argument in favor of having a separate CITY table would be that you only have to store each city's name once, then refer to it when you need it. That does reduce duplication, but in this example I think it's overkill. It may be more space efficient, but you'll pay the price in joins when you select data from your database.
A fully normalized design (i.e, "More Tables") is more flexible, easier to maintain, and avoids duplication of data, which means your data integrity is going to be a lot easier to enforce.
Those are powerful reasons to normalize. I would choose to normalize first, and then only denormalize specific tables after you saw that performance was becoming an issue.
My experience is that in the real world, you won't reach the point where denormalization is necessary, even with very large data sets.
Each table should only include columns that pertain to the entity that's uniquely identified by the primary key. If all the columns in the database are all attributes of the same entity, then you'd only need one table with all the columns.
If any of the columns may be null, though, you would need to put each nullable column into its own table with a foreign key to the main table in order to normalize it. This is a common scenario, so for a cleaner design, you're likley to be adding more tables than columns to existing tables. Also, by adding these optional attributes to their own table, they would no longer need to allow nulls and you avoid a slew of NULL-related issues.
It depends on your database flavor. MS SQL Server, for example, tends to prefer narrower tables. That's also the more 'normalized' approach. Other engines might prefer it the other way around. Mainframes tend to fall in that category.
The multi-table database is a lot more flexible if any of these one to one relationships may become one to many or many to many in the future. For example, if you need to store multiple addresses for some customers, it's a lot easier if you have a customer table and an address table. I can't really see a situation where you might need to duplicate some parts of an address but not others, so separate address, city, state, and zip tables may be a bit over the top.
Like everything else: it depends.
There is no hard and fast rule regarding column count vs table count.
If your customers need to have multiple addresses, then a separate table for that makes sense. If you have a really good reason to normalize the City column into its own table, then that can go, too, but I haven't seen that before because it's a free form field (usually).
A table heavy, normalized design is efficient in terms of space and looks "textbook-good" but can get extremely complex. It looks nice until you have to do 12 joins to get a customer's name and address. These designs are not automatically fantastic in terms of performance that matters most: queries.
Avoid complexity if possible. For example, if a customer can have only two addresses (not arbitrarily many), then it might make sense to just keep them all in a single table (CustomerID, Name, ShipToAddress, BillingAddress, ShipToCity, BillingCity, etc.).
Here's Jeff's post on the topic.
There are advantages to having tables with fewer columns, but you also need to look at your scenario above and answer these questions:
Will the customer be allowed to have more than 1 address? If not, then a separate table for address is not necessary. If so, then a separate table becomes helpful because you can easily add more addresses as needed down the road, where it becomes more difficult to add more columns to the table.
i would consider normalizing as the first step, so cities, counties, states, countries would be better as separate columns... the power of SQL language, together with today DBMS-es allows you to group your data later if you need to view it in some other, non-normalized view.
When the system is being developed, you might consider 'unnormalizing' some part if you see that as an improvement.
I think balance is in order in this case. If it makes sense to put a column in a table, then put it in the table, if it doesn't, then don't. Your coworkers approach would definately help to normalize the database, but that might not be very useful if you have to join 50 tables together to get the information you need.
I guess what my answer would be is, use your best judgement.
There are many sides to this, but from an application efficiency perspective mote tables can be more efficient at times. If you have a few tables with a bunch of columns every time the db as to do an operation it has a chance of making a lock, more data is made unavailable for the duration of the lock. If locks get escalated to page and tables (well hopefully not tables :) ) you can see how this can slow down the system.
Hmm.
I think its a wash and depends on your particular design model. Definitely factor out entities that have more than a few fields out into their own table, or entities whose makeup will likely change as your application's requirements changes (for instance - I'd factor out address anyways, since it has so many fields, but I'd especially do it if you thought there was any chance you'd need to handle foreign country addresses, which can be of a different form. The same with phone numbers).
That said, when you're got it working, keep an eye out on performance. If you've spun an entity out that requires you to do large, expensive joins, maybe it becomes a better design decision to spin that table back into the original.
When you design your database, you should be as close as possible from the meaning of data and NOT your application need !
A good database design should stand over 20 years without a change.
A customer could have multiple adresses, that's the reality. If you decided that's your application is limited to one adresse for the first release, it's concern the design of your application not the data !
It's better to have multiple table instead of multiple column and use view if you want to simplify your query.
Most of time you will have performance issue with a database it's about network performance (chain query with one row result, fetch column you don't need, etc) not about the complexity of your query.
There are huge benefits to queries using as few columns as possible. But the table itself can have a large number. Jeff says something on this as well.
Basically, make sure that you don't ask for more than you need when doing a query - performance of queries is directly related to the number of columns you ask for.
I think you have to look at the kind of data you're storing before you make that decision. Having an address table is great but only if the likelihood of multiple people sharing the same address is high. If every person had different addresses, keeping that data in a different table just introduces unnecessary joins.
I don't see the benefit of having a city table unless cities in of themselves are entities you care about in your application. Or if you want to limit the number of cities available to your users.
Bottom line is decisions like this have to take the application itself into considering before you start shooting for efficiency. IMO.
First, normalize your tables. This ensures you avoid redundant data, giving you less rows of data to scan, which improves your queries. Then, if you run into a point where the normalized tables you are joining are causing the query to take to long to process (expensive join clause), denormalize where more appropriate.
Good to see so many inspiring and well based answers.
My answer would be (unfortunately): it depends.
Two cases:
* If you create a datamodel that is to be used for many years and thus possibly has to adept many future changes: go for more tables and less rows and pretty strict normalization.
* In other cases you can choose between more tables-less rows or less tables-more rows. Especially for people relatively new to the subject this last approach can be more intuitive and easy to comprehend.
The same is valid for the choosing between the object oriented approach and other options.

Resources