Adding a column efficiently in SQL Server - sql-server

I want to add an integer column to a table with a large number of rows and many indexes (it's a data warehouse fact table).
To keep the row width as narrow as possible all the columns in this table are defined as not null. So I want the new column to be not null with a default of zero.
From experience, adding this column will take some time, presumably because the database will need to rewrite every row with the new column filled in with its default value. And this presumably will involve updating the clustered index and all the non-clustered indexes.
So should I drop all the indexes before adding the column and then recreate them all?
Or is there an easier way to do this?
Also, I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve rewriting the records with an extra Is Null bit flipped for each row?
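For reference, the operation being described would look something like this (the table and column names here are invented for illustration):

ALTER TABLE dbo.FactSales
ADD NewMeasure int NOT NULL
CONSTRAINT DF_FactSales_NewMeasure DEFAULT (0);
-- existing rows are filled in with the default value of 0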

It will require updating the clustered index, yes - this IS the table data, after all.
But I don't see why any of the non-clustered indices would have to be updated - your new column won't be a member of any of the non-clustered indices.
Also, I don't see how dropping and recreating the indices would benefit you in this scenario. If you were bulk-loading several million existing rows from another table or database - yes, then it might be faster (due to the INSERTs being much faster) - but adding a column doesn't really suffer from any indices or constraints being around, I don't think.
Marc

SQL Server is a row oriented database. This is in contrast to a column oriented database. This means that in SQL Server, all of the data for a given row is stored together on the disk. Let's have an example:
Say you have a Customer table with 3 columns, FirstName, MiddleInitial, and LastName. Then, say you have 3 records in this table for Jabba T. Hutt, Dennis T. Menace, and George W. Bush.
In a row oriented database (like SQL Server), the records will be stored on disk as such:
Jabba, T, Hutt; Dennis, T, Menace; George, W, Bush;
In contrast, a column oriented database would store the records on disk like this:
Jabba, Dennis, George; T, T, W; Hutt, Menace, Bush;
Where columns are grouped together instead of rows.
Now, when you go to add a column to a table in a row oriented database (SQL Server, for example), the new data for each row has to be inserted alongside the existing data, shifting the rows and requiring a lot of read/write operations. So, if you were to insert a new column for the customer prefix that defaults to 'Mr', this is what you'd get:
Mr, Jabba, T, Hutt; Mr, Dennis, T, Menace; Mr, George, W, Bush;
As you can see, all of the original data has been shifted to the right. On the other hand, when you insert a new column that defaults to NULL, no new data has to be put into the existing rows. Thus, there is less shifting, requiring fewer disk read/write operations.
Of course, this is an oversimplification of what's actually going on on disk. There are other things to take into account when dealing with indexes, pages, etc. But it should help you get the picture.
For clarification, I'm not at all suggesting you move to a column oriented database; I just put that info in there to help explain what row oriented means.

"Also I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writng the records with an extra Is Null bit flipped for each row."
Adding a nullable column merely changes the definition of the table. The individual records are not affected.
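A minimal illustration of the difference, using invented names (the first statement only changes the table definition; the second has to leave every existing row holding the value 0):

-- nullable, no default: existing records are not touched
ALTER TABLE dbo.FactSales ADD NewMeasure int NULL;

-- not null with a default of zero: existing records must end up holding 0
ALTER TABLE dbo.FactSales ADD NewMeasure int NOT NULL DEFAULT (0);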

Related

Reducing disk space of a SQL database

I have a database with 2 TB of data, and I want to reduce it to 500 GB by dropping some rows and removing some useless columns, but I have other optimization ideas as well, and I need answers to some questions first.
My database has one .mdf file and 9 other .ndf files, and each file has an initial size of 100 GB.
Should I reduce the initial size of each .ndf file to 50 GB? Can this operation affect my data?
Does dropping an index help to reduce space?
PS: My database contains only one single table, which has one clustered index and two other non-clustered indexes.
I want to remove the two non-clustered indexes
and remove the insertdate column.
If you have any other optimization ideas, that would be very helpful.
Before dropping any indexes, run these two views.
sys.dm_db_index_usage_stats
sys.dm_db_index_operational_stats
They will let you know if any of them are being used to support queries. The last thing you want is to remove an index and start seeing full table scans on a 2TB table.
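A rough usage check against those views might look like this (the table name is a placeholder; run it in the database in question):

SELECT i.name AS index_name,
       s.user_seeks, s.user_scans, s.user_lookups, s.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.BigTable');
-- indexes with no seeks/scans/lookups but many updates are candidates to drop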
If you can't split up the table into a relational model then try these for starters.
Check your data types.
-Can you replace NVARCHAR with VARCHAR or NCHAR with CHAR? They take up half the space (a quick check is sketched after this list).
-Does your table experience a lot of Updates or a lot of Inserts (above view will tell you this)? If there are very few updates then consider changing CHAR fields to VARCHAR fields. Heavy updates can cause page splits and result in poor Page fullness.
-Check that columns only storing a Date with no time are not declared as Datetime
-Check value ranges in numeric fields i.e. try and use Smallint instead of Int.
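Sketches of the kind of checks meant above, with placeholder table and column names:

-- would converting NVARCHAR to VARCHAR lose anything? zero rows means it looks safe
SELECT COUNT(*) AS rows_needing_unicode
FROM dbo.BigTable
WHERE SomeTextCol <> CONVERT(nvarchar(4000), CONVERT(varchar(4000), SomeTextCol));

-- do an INT column's values fit in SMALLINT (-32,768 to 32,767)?
SELECT MIN(SomeIntCol) AS min_val, MAX(SomeIntCol) AS max_val
FROM dbo.BigTable;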
Look at the activity on the table, update & insert behaviour. If the activity means very few Pages are re-arranged then consider increasing your Fill Factor.
Look at the plan cache to get an idea of how the table is being queried; if the bulk of queries focus on a specific portion of the table, then implement a Filtered Index.
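For example, if the plan cache shows that most queries only touch active rows, a filtered index along these lines (names and predicate are hypothetical) stays much smaller than a full index:

CREATE NONCLUSTERED INDEX IX_BigTable_Active
ON dbo.BigTable (SomeKeyCol)
WHERE IsActive = 1;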
Is your Clustered Index Unique? If not then SQL creates a "hidden extra Integer column" that creates uniqueness under the bonnet.

Oracle - Make one table with many columns or split into many tables

What is the best way to model a database? I have many known channels with values. Is it better to create one table with many columns, one for each channel, or to create two tables, one for values and one for channels? Like this:
Table RAW_VALUES: SERIE_ID, CHANNEL_1, ..., CHANNEL_1000
or
Table RAW_VALUES: SERIE_ID, CHANNEL_ID, VALUE
Table CHANNELS: CHANNEL_ID, NAME, UNIT, ....
My question is about performance when searching for data, and about saving database space.
Thanks.
Usually, one would want to know what type of queries you will run against the tables, as well as the data distribution etc., to choose between the two designs. However, I think that there are more fundamental issues here to guide you.
The second alternative is certainly more flexible. Adding one more channel ("Channel_1001") can be done simply by inserting rows in the two tables (a simple DML operation), whereas if you use the first option, you need to add a column to the table (a DDL operation), and that will not be usable by any programs using this table unless you modify them.
That type of flexibility alone is probably a good reason to go with the second option.
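A quick illustration of that point, using the table names from the question (the inserted values are made up):

-- second design: a new channel is just data
INSERT INTO channels (channel_id, name, unit) VALUES (1001, 'Channel 1001', 'V');

-- first design: a new channel means changing the table itself
ALTER TABLE raw_values ADD (channel_1001 NUMBER);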
Searching will also be better served with the second option. You may create one index on the raw_values table and support indexed searches on the Channel/Value columns. (I would avoid the name "value" for a column by the way.)
Now if you consider what column(s) to index under the first option, you will probably be stumped: you have 1001 columns there. If you want to support indexed searches on the values, would you index them all? Even if you were dealing with just 10 channels, you would still need to index those 10 columns under your first option; not a good idea in general to load a table with more than a few indexes.
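Under the second option, a single composite index is enough to support searches by channel and value; something like this, keeping the question's column names (renaming "value" as suggested above would be wise):

CREATE INDEX raw_values_chan_val_ix ON raw_values (channel_id, value);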
As an aside, if I am not mistaken, the limit is 1000 columns per table these days, but a table with more than 255 columns will store a row in multiple row pieces, each storing 255 columns and that would create a lot of avoidable I/O for each select you issue against this table.

Database: Primary key columns at the beginning of the table

Does it have any impact to have all the primary key columns at the beginning of the table?
I know partial index reads most likely involve table scans that bring the whole row into the buffer pool for predicate matching. I am curious to know what performance gain, if any, having the primary keys at the top of the table would provide.
In Oracle, the order of the columns of a table has little impact in general on performance.
The reason is that all columns of a row are generally contained on a single block and that the difference in time between finding the first column and the last column of a row in a block is infinitesimal compared to finding/reading the block.
Furthermore, when you reach the database block to read a row, the primary key may not be the most important column.
Here are a few exceptions where column order might have an impact:
when you have > 255 columns in your table, the rows will be split in two blocks (or more). Accessing the first 255 columns may be cheaper than accessing the remaining columns.
the last columns of a row take 0 byte of space if they are NULL. As such, columns that contain many NULL values are best left at the end of a row if possible to reduce space usage and therefore IO. In general the impact will be minimal since other NULL columns take 1 byte each so the space saved is small.
when compression is enabled, the efficiency of the compression may depend upon the column order. A good rule of thumb would be that columns with few distinct values should be grouped to enhance the chance that they will be merged by the compression algorithm.
You should think about the order of columns when you use an Index Organized Table (IOT) with the overflow clause. With this clause, all columns after a determined dividing column will be stored out of line, and accessing them will incur additional cost. Primary keys are always stored physically at the beginning of the rows in an IOT (a minimal example is sketched below).
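A minimal sketch of such an IOT, with invented names: columns up to and including VAL stay in the index blocks, while the wide trailing column is pushed to the overflow segment.

CREATE TABLE readings (
  sensor_id NUMBER NOT NULL,
  read_time DATE NOT NULL,
  val NUMBER,
  long_note VARCHAR2(4000),
  CONSTRAINT readings_pk PRIMARY KEY (sensor_id, read_time)
)
ORGANIZATION INDEX
INCLUDING val
OVERFLOW;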
At least in SQL Server there is no performance benefit based on the order of the columns in the table, primary key or not. The only benefit to having your primary key columns at the top of the list is organizational. Kind of like having a table with these columns Id, FirstName, LastName, Address1, Address2, City, State, Zip. It's a lot easier to follow in that order than Address2, State, Firstname, Id, Address1, Lastname, Zip, City. I don't know much about Oracle or DB2 but I believe it's the same.
In DB2 (and I think the answers about the other database management systems should be checked as well), the columns that are modified least often should be at the beginning of each row, because when performing an update DB2 writes everything from the first modified column to the end of the row into the transaction logs.
It only impacts update operations; inserts, deletes and selects are not affected. The impact is that I/O is slightly reduced, because less information has to be written when only the last columns change. This can be important when performing updates on a few small columns in tables with big rows and lots of records. If the first column is modified, DB2 will write the whole row.
Ordering columns to minimize update logging: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.dbobj.doc/doc/c0024496.html
(for ORACLE)
Is it fair to say, then, that any and all primary key columns, even if there is just one, should be the first or among the first few columns in a row? Further, is tagging them on the END of the row bad practice, particularly after a series of possibly/likely null attribute fields?
Thus, a row like:
pkcol(s), att1,att2,att3, varchar2(2000)
is better organized for all the reasons stated above than
att1, att2, att3, varchar2(2000), pkcol(s)
Why am I asking? Well, don't judge, but we are simplifying the PK for some tables and the developers have happily tagged the new GUID PK (don't judge #2) onto the end of the row. I am bothered by this but need some feedback to justify my fears. Also, does this matter at all for SQL Server?

Main table with hundreds of columns vs. a few smaller tables

I was wondering which approach is better for designing databases?
I currently have one big table (97 columns per row) with references to lookup tables where I could use them.
Wouldn't it be better for performance to group some columns into smaller tables and add key columns to them for referencing one whole row?
If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.
97 columns isn't much, really - I've seen way beyond 100.
It all depends on how your data is being used - if your row just has 97 columns, all the time, and needs all 97 columns - then it really hardly ever makes sense to split those up into various tables.
It might make sense if:
you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need those all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra large columns (see the sketch after this list)
you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for e.g. 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.
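A sketch of what that vertical split could look like, with invented table and column names: the wide, rarely needed columns move to a one-to-one side table.

CREATE TABLE dbo.Product (
    ProductId INT IDENTITY PRIMARY KEY,
    Name NVARCHAR(200) NOT NULL,
    Price DECIMAL(10,2) NOT NULL
    -- ... the other frequently used columns
);

CREATE TABLE dbo.ProductDetails (
    ProductId INT PRIMARY KEY REFERENCES dbo.Product (ProductId),
    LongDescription NVARCHAR(MAX) NULL,
    SpecSheet XML NULL
);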
It would be better to group relevant columns into different tables. This will improve the performance of your database as well as your ease of use as the programmer. You should try to first find all the different relationships between your columns, and following that you should attempt to break everything into tables while keeping in mind these relationships (using primary keys, foreign keys, references and so forth). Try to create a diagram like this http://www.simple-talk.com/iwritefor/articlefiles/354-image008.gif and take it from there.
Unless your data is denormalized it is likely best to keep all the columns in the same table. SQL Server reads pages into the buffer pool from individual tables. Thus you will have the cost of the joins on every access even if the pages accessed are already in the buffer pool. If you access just a few rows of the data per query with a key then an index will serve that query fine with all columns in the same table. Even if you will scan a large percentage of the rows (> 1% of a large table) but only a few of the 97 columns you are still better off keeping the columns in the same table as you can use a non clustered index that covers the query. However, if the data is heavily denormalized then normalizing it, which by definition breaks it into many tables based upon the rules of normalization to eliminate redundancy, will result in much improved performance and you will be able to write queries to access only the specific data elements you need.
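For the "scan many rows but read only a few columns" case mentioned above, the covering index would look something like this (all names invented):

CREATE NONCLUSTERED INDEX IX_BigTable_Covering
ON dbo.BigTable (CustomerId)
INCLUDE (OrderDate, TotalAmount);
-- queries touching only these columns are answered from the index alone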

Database table design for redundancy

(with SQL Server 2008) I have a big table (~50M records) that is fully normalized. There are 4 primary key columns, and one of them has only three possible entries - A, B, and C. The issue is, often there is much redundancy for this column. That is to say, there can be many records with value A, and then many repeated records that are identical in all respects, except with value B (and/or C). This redundancy does not always happen, but it's frequent enough that it greatly increases the record count and I wish to be rid of it.
My idea is that instead of A, B, C being choices for a column, I've thought about creating 3 bit columns titled A, B, C. Then, in the case of the aforementioned redundancies for these values, I don't have to create repeated records, but instead just have one record and then flag the A, B, and/or C columns as necessary.
This seems unorthodox, so I thought I'd see what the experts think. One thing is that there would be three different uniqueness constraints for this table, each including all the other primary key columns plus one of the three flag columns.
[EDIT] To clarify on the meaning of "many repeated records", one of the other PK's is a date column. So for example, there could be 1000 records of different dates with entry A, and then another 1000 records of the same dates (and other columns identical) but with entry B. So that is how even with only three choices there can still be lots of redundancy.
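A minimal sketch of the proposed shape, with hypothetical column names (the current design would instead carry a fourth key column holding A, B, or C):

CREATE TABLE dbo.FactProposed (
    Key1 INT NOT NULL,
    Key2 INT NOT NULL,
    EntryDate DATE NOT NULL,
    A BIT NOT NULL DEFAULT 0,
    B BIT NOT NULL DEFAULT 0,
    C BIT NOT NULL DEFAULT 0,
    CONSTRAINT PK_FactProposed PRIMARY KEY (Key1, Key2, EntryDate)
);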
You can't have "many repeated records that are identical in all respects" except for the 4th column in the PK that takes one of A, B, or C. This means to me that you have at most 3 rows (over the other 3 PK columns), differentiated by either A, B, or C.
This means you should have one unique constraint because of this.
I'd do nothing, based on this and also because:
a row with A is a different row from one with C
it's only 50 million rows
it's simple (no extra tables or fancy bit columns)
there are no stated performance issues (until you add extra tables or fancy bit columns)
you have a clear, normalised schema
Edit:
Your redundancy isn't in the ABC column. The row multiplication is caused by the datetime.
Can you change the datetime to smalldatetime and suppress near-duplicates that way? E.g. resolve to the nearest minute rather than 3.33 milliseconds? Or, for SQL Server 2008, use datetime2 and pick your resolution.
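If that route were taken, the type change itself is a one-liner (names are placeholders; if the column is part of the primary key, the key has to be dropped and re-created around the change, and existing near-duplicates merged):

ALTER TABLE dbo.BigFact
ALTER COLUMN EventTime datetime2(0) NOT NULL;  -- whole-second resolution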
How about creating a separate table that stores these "flags", foreign key'd back to your original table?
Table1 (original table)
----------------------
PriKey1 (PK for Table1)
Col1
Col2
Table2 (new table)
------------------
PriKey2 (PK for Table2)
PriKey1 (FK to Table1)
A
B
C
I personally wouldn't do it that way, I would create another table that would store either the A, B, or C and the RecordID.
The only issue I can think of off the top of my head is that you will need to change your existing code and include all 3 fields if you want to get any use out of indexing on those bit columns.
Bit fields by their nature are not very selective. To get good selectivity you will need to create a covering index on all 3 fields, and then include all 3 in your WHERE clauses so you get optimum seeks.
Most databases will allocate a minimum of the most efficient processing unit per field in any case, so calling them bit fields would only be a metadata difference. But unpacking bits into words is just overhead anyway; you might as well just use ints. And I'm pretty sure SQL Server doesn't index bit fields - a cardinality of 2 doesn't help much.
50M records? A small number by most accounts.
Have you tried to quantify the overhead you're trying to reduce? If nothing else you're going to add work for the increased complexity.
I'd have to think a long time before increasing complexity.
Is this a really stable design otherwise, and you have some extra time?
