What number of columns makes a table really big? - sql-server

I have two tables in my database, one for logins and a second for user details (the database contains more than just these two tables). The Logins table has 12 columns (Id, Email, Password, PhoneNumber ...) and the user details table has 23 columns (Job, City, Gender, ContactInfo ..). The two tables have a one-to-one relationship.
I am thinking of creating one table that contains the columns of both tables, but I am not sure because this may make the table very big.
So this leads to my question: what number of columns makes a table big? Is there a certain or approximate number of columns that makes a table big and makes us stop adding columns to it and create another one? Or is it up to the programmer to decide such a number?

The number of columns isn't realistically a problem. Any kind of performance issue you seem to be worried about can be attributed to the size of the DATA in the table, i.e. if the table has billions of rows, or if one of the columns contains 200 MB of XML data in each separate row, etc.
Normally, the only issue arising from a multitude of columns is how it pertains to indexing, as it can get troublesome trying to create 100 different indexes covering each variation of each query.
The point here is, we can't really give you any advice, since just the number of tables, columns and relations isn't enough information to go on. It could be perfectly fine, or not. The nature of the data, and how you account for that data with proper normalization, indexing and statistics, is what really matters.

The constraint that makes us stop adding columns to an existing table in SQL is exceeding the maximum number of columns that the database engine can support for a single table. As can be seen here, for SQL Server that is 1,024 columns for a non-wide table, or 30,000 columns for a wide table.
35 columns is not a particularly large number of columns for a table.
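For reference, a quick way to see how many columns an existing table already has is to count them in sys.columns (the table name below is just a placeholder):

```sql
-- Count the columns of a given table; compare against SQL Server's
-- 1,024-column limit for a non-wide table.
SELECT COUNT(*) AS column_count
FROM sys.columns
WHERE object_id = OBJECT_ID(N'dbo.Logins');  -- hypothetical table name
```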

There are a number of reasons why decomposing a table (splitting it up by columns) might be advisable. One of the first that a beginner should learn is data normalization. Data normalization is not directly concerned with performance, although a normalized database will sometimes outperform a poorly built one, especially under load.
The first three steps in normalization result in 1st, 2nd, and 3rd normal forms. These forms have to do with the relationship that non-key values have to the key. A simple summary is that a table in 3rd normal form is one where all the non-key values are determined by the key, the whole key, and nothing but the key.
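As a hypothetical illustration (the table and column names here are invented), a City column that also determines a Region column violates that rule, and the fix is to move the dependency into its own table:

```sql
-- Not in 3rd normal form: Region is determined by City, not by the key.
CREATE TABLE UserDetails_Unnormalized (
    UserId INT PRIMARY KEY,
    City   NVARCHAR(100),
    Region NVARCHAR(100)
);

-- Decomposed: the City -> Region dependency lives in its own table.
CREATE TABLE Cities (
    City   NVARCHAR(100) PRIMARY KEY,
    Region NVARCHAR(100)
);

CREATE TABLE UserDetails (
    UserId INT PRIMARY KEY,
    City   NVARCHAR(100) REFERENCES Cities (City)
);
```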
There is a whole body of literature out there that will teach you how to normalize, what the benefits of normalization are, and what the drawbacks sometimes are. Once you become proficient in normalization, you may wish to learn when to depart from the normalization rules, and follow a design pattern like Star Schema, which results in a well structured, but not normalized design.
Some people treat normalization like a religion, but that's overselling the idea. It's definitely a good thing to learn, but it's only a set of guidelines that can often (but not always) lead you in the direction of a satisfactory design.
A normalized database tends to outperform a non normalized one at update time, but a denormalized database can be built that is extraordinarily speedy for certain kinds of retrieval.
And, of course, all of this depends on how many databases you are going to build, and on their size and scope.

I take it that the logins table contains data that is only used when the user logs into your system. For all other purposes, the details table is used.
Separating these sets of data into separate tables is not a bad idea and could work perfectly well for your application. However, another option is having the data in one table and separating them using covering indexes.
One aspect of an index that no one seems to consider is that an index can be thought of as a sub-table within a table. When a SQL statement accesses only the fields within an index, the I/O required to perform the operation can be limited to the index rather than the entire row. So creating a "login" index and a "details" index would achieve the same benefits as separate tables, with the added benefit that any operations that do need all the data would not have to perform a join of two tables.
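A minimal sketch of that idea, assuming a single combined dbo.Users table and invented column names, using an index with included columns so the login query is covered entirely by the index:

```sql
-- "Login" covering index: the lookup column as the key, the rest included.
CREATE NONCLUSTERED INDEX IX_Users_Login
    ON dbo.Users (Email)
    INCLUDE (PasswordHash, PhoneNumber);

-- A query like this can be answered from the index alone,
-- without touching the wide base row:
-- SELECT PasswordHash, PhoneNumber FROM dbo.Users WHERE Email = @Email;
```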

Related

Database Design: Is it a good idea to combine two tables with the same columns and a lot of data?

I have two tables with the same columns: one is used to save the bank's amounts and the other to save the cash desk's. They both might hold a lot of data, so I'm concerned about data retrieval speed. I don't know whether it's better to combine them by adding an extra column to determine the type of each record, or to create a separate table for each one.
The main question you should be asking is: how am I querying the tables?
If there is no real logical connection between the 2 tables (you don't want to get the rows in the same query), use 2 different tables. The other way around, you will need to hold another column to tell you what type of row you are working on, and that will slow you down and make your queries more complex.
In addition, FKs might be a problem if the same column is an FK to 2 different places.
In addition (2nd), locks might be an issue: if you work on one type you might block the other.
Conclusion: 2 tables, and not just for speed.
In theory you have one unique entity, so you need to consider one table for your accounts and another one for your types of accounts. For better performance you could separate these types of account onto two different filegroups and partitions, and create an index on the type FK of the accounts table. In this scenario you have, logically, one entity that is ruled by relational theory, while physically your data is separated, so data retrieval would be fast and beneficial.
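A rough sketch of that single-entity approach, with invented table and column names (the filegroup and partitioning details are omitted here):

```sql
-- One logical entity: all amounts in Accounts, with a type lookup table.
CREATE TABLE AccountTypes (
    AccountTypeId INT PRIMARY KEY,
    Name          NVARCHAR(50) NOT NULL   -- e.g. 'Bank', 'CashDesk'
);

CREATE TABLE Accounts (
    AccountId     INT IDENTITY PRIMARY KEY,
    AccountTypeId INT NOT NULL REFERENCES AccountTypes (AccountTypeId),
    Amount        DECIMAL(18, 2) NOT NULL,
    RecordedAt    DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Index on the type FK so per-type queries stay fast.
CREATE NONCLUSTERED INDEX IX_Accounts_AccountTypeId
    ON Accounts (AccountTypeId);
```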

Main table with hundreds of columns vs. a few smaller tables

I was wondering which approach is better for designing databases?
I currently have one big table (97 columns per row) with references to lookup tables where I could use them.
Wouldn't it be better for performance to group some columns into smaller tables and add key columns to them for referencing one whole row?
If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.
97 columns isn't much, really - I've seen way beyond 100.
It all depends on how your data is being used: if your row always has 97 columns, and you always need all 97 of them, then it really hardly ever makes sense to split those up into various tables.
It might make sense if:
you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need those all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better, as long as you don't need those extra-large columns (see the sketch after this list)
you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for, say, 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.
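A minimal sketch of that first option, with invented table and column names: the large, rarely needed column moves into a 1:1 side table keyed by the same primary key.

```sql
-- Narrow "hot" table with the columns used all the time.
CREATE TABLE Orders (
    OrderId   INT PRIMARY KEY,
    OrderDate DATETIME2 NOT NULL,
    Status    TINYINT NOT NULL
    -- ...the other frequently used columns...
);

-- "Cold" 1:1 side table holding the large, seldom-read payload.
CREATE TABLE OrderDocuments (
    OrderId  INT PRIMARY KEY REFERENCES Orders (OrderId),
    Document XML NULL
);
```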
It would be better to group relevant columns into different tables. This will improve the performance of your database as well as your ease of use as the programmer. You should first try to find all the different relationships between your columns, and then attempt to break everything into tables while keeping those relationships in mind (using primary keys, foreign keys, references and so forth). Try to create a diagram like this one: http://www.simple-talk.com/iwritefor/articlefiles/354-image008.gif and take it from there.
Unless your data is denormalized, it is likely best to keep all the columns in the same table. SQL Server reads pages into the buffer pool from individual tables, so you will pay the cost of the joins on every access even if the pages accessed are already in the buffer pool. If you access just a few rows per query by key, an index will serve that query fine with all columns in the same table. Even if you will scan a large percentage of the rows (> 1% of a large table) but only a few of the 97 columns, you are still better off keeping the columns in the same table, as you can use a nonclustered index that covers the query. However, if the data is heavily denormalized, then normalizing it - which by definition breaks it into many tables based upon the rules of normalization to eliminate redundancy - will result in much improved performance, and you will be able to write queries that access only the specific data elements you need.

What is more important, normalization or ease of coding?

I have an Excel spreadsheet I am going to be turning into a DB to mine data and build an interactive app. There are about 20 columns and 80,000 records. Practically all records have about half of their column data as NULL, but which columns have data is random for each record.
The options would be to:
Create a more normalized DB with a table for each column and use 20 joins to view all the data. I would think the benefit would be a DB with essentially no NULL values, so the size would be smaller. One of the major cons would be more code to update each table from the application side.
Create one flat table that has all the columns. I figure this will make updates easier on the application side, but it will result in a table with a lot of empty data space.
I don't get why you think updating a normalized db is harder than a flat table. It's very much the other way around.
Think about inserting a relation between a customer and a product (basically an order). With the flat design, you'd have to:
select the row that describes the rest of the data, but has NULLs or something in the product columns
update the product columns
insert a HUGE row into the DB
What about the first time? What do you do with the initial nulls? Do you modify your selects to ignore them? What if you want the nulls?
What if you delete the last product? Do you change it into an update and set nulls for just a few columns?
Joins aside, working with a normalized table is trivial by design. You pay for that triviality with performance; that's the actual trade-off.
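For contrast, here is a sketch of the same operation against a normalized design (Orders, CustomerId and ProductId are assumed names): recording the relation is a single small insert.

```sql
DECLARE @CustomerId INT = 42, @ProductId INT = 7;  -- example key values

-- Recording "this customer ordered this product" touches one narrow table.
INSERT INTO Orders (CustomerId, ProductId, OrderedAt)
VALUES (@CustomerId, @ProductId, SYSUTCDATETIME());
```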
If you are going to be using a relational database, you should normalize your tables, if nothing else in order to ease data maintenance and ensure you don't have duplicate data.
You can investigate the use of a document database for storage instead of a relational database, though it is not the only option.
Generally, normalized databases will end up being easier to write code against, as SQL code is designed with normalized tables in mind.
Normalizing doesn't have to be done on all columns, so there's a middle ground between the two options you present. A good rule of thumb is that if you have columns that have values being repeated heavily across records, those can be good candidates for normalizing into one or more separate tables. Putting each column in its own table and joining across them is almost certainly overdoing it.
Don't normalize too much. It's hard to maintain a canonical model as your application grows. Storage is cheap. Don't get fooled into coding headaches because of concerns that were valid 20 years ago. No need to go NoSQL unless you need it.

Is it normal to have a table with about 40-50 columns in database?

Is it normal to have a table with about 40-50 columns in database?
It depends on your data model. It is somewhat "neater" to have data broken down into multiple tables and have them related to each other, but it may also be that your data is such that it cannot, or should not, be broken down.
If you want to have fewer columns just "for the sake of it", and there is no significant performance degradation - there's no need. If you find yourself using fewer columns than there are in the table, break it down...
Yes, if those 40-50 columns are all dependent on the key, the whole key, and nothing but the key of the table.
It is not uncommon for a database to be de-normalised to improve performance: munging tables together results in fewer joins during queries.
So denormalised tables tend to have more columns, and duplicate data can become an issue, but sometimes that's the only way to get the performance that you need.
I seem to get asked that question at every job interview I go to:
When would you denormalise a database?
Depends on what you call normal. If you are a big enterprise corporation, it's not normal, because you have way too few columns.
But if you find it hard to work with that many columns, you probably have a problem and need to do something about it: either abstract the many columns away or split up your data model to something more manageable.
It doesn't sound very normalised, so you might want to look at this. But it really depends on what you're storing, I suppose...
I don't know about "normal", but it should not be causing any problems. If you have many "optional" columns, that are null most of the time, or many fields are very large and not often queried, then maybe the schema could be normalized or tuned a bit more, but the number of columns itself is not an issue.
The number of columns has no relationship to whether the data is normalized or not. It is the content of the columns that will tell you that. Are the columns things like Phone1, Phone2, Phone3? Then certainly the table is not normalized and should be broken apart. But if they are all different items which are all in a one-to-one relationship with the key value, then 40-50 columns can be normalized.
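For example, a sketch (with invented names) of how a repeating Phone1/Phone2/Phone3 group would be broken out:

```sql
-- Parent table (hypothetical).
CREATE TABLE Contacts (
    ContactId INT PRIMARY KEY,
    Name      NVARCHAR(100) NOT NULL
);

-- One row per phone number instead of repeating PhoneN columns.
CREATE TABLE ContactPhones (
    ContactId   INT NOT NULL REFERENCES Contacts (ContactId),
    PhoneNumber VARCHAR(20) NOT NULL,
    PRIMARY KEY (ContactId, PhoneNumber)
);
```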
This doesn't mean you always want to store them in one table, though. If the combined size of those columns is larger than the bytes allowed per row of data in the database (8,060 bytes per row for in-row data in SQL Server), you might be better off creating two or more tables in a one-to-one relationship with each other. Otherwise you will have trouble storing the data if all the fields are at or near their maximum size. And if some of the fields are not needed most of the time, a separate table may also be in order for them.

How efficient is a details table?

At my job, we have a pseudo-standard of creating one table to hold the "standard" information for an entity, and a second table, named like 'TableNameDetails', which holds optional data elements. On average, every row in the main table will have about 8-10 detail rows.
My question is: What kind of performance impacts does this have over adding these details as additional nullable columns on the main table?
8-10 detail rows or 8-10 detail columns?
If it's rows, then you're mixing apples and oranges, as a one-to-many relationship cannot be flattened out into columns.
If it's columns, then you're talking about vertical partitioning. For large and very large tables, moving seldom-referenced columns into Extra or Details tables (i.e. partitioning the columns vertically into 'hot' and 'cold' tables) can have significant and even huge performance benefits. A narrower table means a higher density of data per page, which in turn means fewer pages needed for frequent queries, less IO, better cache efficiency - all goodness.
Mileage may vary, depending on the average width of the 'details' columns and how 'seldom' the columns are accessed.
I'm with Remus on all the "depends", but would just add that after choosing this design for a table/entity, you must also have a good process for determining what is "standard" and what is "details" for an entity.
Misplacing something as a detail which should be standard is probably the worst thing, because you can't require a row to exist as easily as requiring a column to exist (big complex trigger code). Setting a default on a type of row is a lot harder (big complex constraint code). And indexing isn't easy either (a sparse index, maybe?).
Misplacing something as a standard which should be a detail is less of a mistake, just taking up extra row space and potentially not being able to have a meaningful default.
If your details are very weakly structured, you could consider using an XML column for the "details" and still be able to query them using XPath/XQuery.
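A sketch of that approach, assuming a hypothetical dbo.Entities table with an XML column named Details:

```sql
-- Query a weakly structured detail value with the xml type's built-in methods.
SELECT EntityId,
       Details.value('(/details/color)[1]', 'NVARCHAR(50)') AS Color
FROM dbo.Entities
WHERE Details.exist('/details[color = "red"]') = 1;
```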
As a general rule, I would not use this pattern for every entity table, but only entity tables which have certain requirements and usage patterns which fit this solution's benefits well.
Is your details table an entity-attribute-value table? In that case, yes, you are asking for performance problems.
What you are describing is an Entity-Attribute-Value design. They have their place in the world, but they should be avoided like the plague unless absolutely necessary. The analogy I always give is that they are like drugs: in small quantities and in select circumstances they can be beneficial. Too much will kill you. Their performance will be awful and will not scale and you will not get any sort of data integrity on the values since they are all stored as strings.
So, the short answer to your question: if you never need to query for specific values nor ever need to make a columnar report of a given entity's attributes nor care about data integrity nor ever do anything other than spit the entire wad of data for an entity out as a list, they're fine. If you need to actually use them however, whatever query you write will not be efficient.
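For clarity, a sketch of the table shape being warned about (names invented): one row per attribute, with every value stored as a string.

```sql
-- Entity-Attribute-Value: flexible, but values lose their types and
-- per-attribute constraints, and attribute-based queries get expensive.
CREATE TABLE EntityDetails (
    EntityId       INT           NOT NULL,
    AttributeName  NVARCHAR(100) NOT NULL,
    AttributeValue NVARCHAR(MAX) NULL,
    PRIMARY KEY (EntityId, AttributeName)
);
```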
You are mixing two different data models - a domain specific one for the "standard" and a key/value one for the "extended" information.
I dislike key/value tables except when absolutely required. They run counter to the concept of an SQL database and generally represent an attempt to shoehorn object data into a data store that can't conveniently handle it.
If some of the extended information is very often NULL you can split that column off into a separate table. But if you do this to two different columns, put them in separate tables, not the same table.
