Does it have any impact to have all the primary key columns at the beginning of the table?
I know partial index reads most likely involve table scans that bring the whole row into the buffer pool for predicate matching. I am curious what performance gain, if any, having the primary key columns at the top of the table would provide.
In Oracle, the order of the columns of a table has little impact in general on performance.
The reason is that all columns of a row are generally contained on a single block and that the difference in time between finding the first column and the last column of a row in a block is infinitesimal compared to finding/reading the block.
Furthermore, when you reach the database block to read a row, the primary key may not be the most important column.
Here are a few exceptions where column order might have an impact:
when you have > 255 columns in your table, the rows will be split across two blocks (or more). Accessing the first 255 columns may be cheaper than accessing the remaining columns.
the last columns of a row take 0 bytes of space if they are NULL. As such, columns that contain many NULL values are best left at the end of a row if possible, to reduce space usage and therefore IO. In general the impact will be minimal, since NULL columns that are not at the end take only 1 byte each, so the space saved is small.
when compression is enabled, the efficiency of the compression may depend upon the column order. A good rule of thumb would be that columns with few distinct values should be grouped to enhance the chance that they will be merged by the compression algorithm.
You should think about the order of columns when you use an Index Organized Table (IOT) with the overflow clause. With this clause, all columns after a chosen dividing column will be stored out of line, and accessing them will incur additional cost. Primary keys are always stored physically at the beginning of the rows in an IOT.
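For example, here is a minimal sketch of an IOT with an overflow segment (the table and column names are made up); everything after the column named in INCLUDING is stored out of line:
CREATE TABLE orders_iot (
  order_id  NUMBER         NOT NULL,
  status    VARCHAR2(20),
  big_notes VARCHAR2(2000),
  CONSTRAINT pk_orders_iot PRIMARY KEY (order_id)
)
ORGANIZATION INDEX
INCLUDING status
OVERFLOW;
-- order_id and status stay in the index segment; big_notes goes to the overflow segment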
At least in SQL Server there is no performance benefit based on the order of the columns in the table, primary key or not. The only benefit to having your primary key columns at the top of the list is organizational. Kind of like having a table with these columns: Id, FirstName, LastName, Address1, Address2, City, State, Zip. It's a lot easier to follow in that order than Address2, State, FirstName, Id, Address1, LastName, Zip, City. I don't know much about Oracle or DB2, but I believe it's the same.
In DB2 (and I think the answers about the other database management systems should be checked as well), the columns that are modified least often should be at the beginning of each row, because when performing an update, DB2 writes everything from the first modified column to the end of the row into the transaction logs.
This only impacts update operations; inserts, deletes and selects are not affected. The benefit is that I/O is slightly reduced, because less information has to be written when only the last columns change. This can matter when performing updates over a few small columns on tables with big rows and lots of records. If the first column is modified, DB2 will log the whole row.
Ordering columns to minimize update logging: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.dbobj.doc/doc/c0024496.html
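A hedged sketch of the idea (table and column names are invented): the columns that change often go last, so an update touching only them logs less of the row.
CREATE TABLE orders (
  order_id     INTEGER   NOT NULL PRIMARY KEY,   -- never updated
  customer_id  INTEGER   NOT NULL,               -- rarely updated
  order_date   DATE      NOT NULL,               -- rarely updated
  status       CHAR(10),                         -- updated frequently
  last_touched TIMESTAMP                         -- updated frequently
);
-- only the tail of the row (status onwards) needs to be logged
UPDATE orders
SET status = 'SHIPPED', last_touched = CURRENT TIMESTAMP
WHERE order_id = 42;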
(for ORACLE)
Is it fair to say, then, that any and all primary key columns, even if there is just one, should be the first or among the first few columns in a row? And further, that tagging them onto the END of the row is bad practice, particularly after a series of possibly/likely null attribute fields?
Thus, a row like:
pkcol(s), att1,att2,att3, varchar2(2000)
is better organized for all the reasons stated above than
att1, att2, att3, varchar2(2000), pkcol(s)
Why am I asking? Well, don't judge, but we are simplifying the PK for some tables and the developers have happily tagged the new GUID PK (don't judge #2) onto the end of the row. I am bothered by this but need some feedback to justify my fears. Also, does this matter at all for SQL Server?
Related
I'm working on synchronizing clients with data for eventual consistency. The server will publish a list of database ids and rowversion/timestamp values. Clients will then request the data for which they hold an incorrect version number. The primary reason for inconsistent data is networking issues between broker nodes, split brain, etc.
When I read data from my tables, I request data based on a predicate that is not the primary key.
I iterate available regions to read data per region. This is my select:
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE RegionId = 1
Since this leads to an index scan per query, I'm wondering if I should create a non-clustered index on my RegionId column and include the selected columns in that index:
CREATE NONCLUSTERED INDEX [ID_TableX_RegionId_Sync]
ON [dbo].[TableX] ([RegionId])
INCLUDE ([DatabaseId],[VersionTimestamp],[OperationId])
VersionTimestamp is a rowversion/timestamp column and will of course change whenever a row is updated, so I'm wondering if it is a poor design choice to include this column in an index, since the index will need to be updated on every insert/update/delete?
Since this will result in n index scans rather than n index seeks, it might be better to read all the data once, then group by RegionId and fill in empty lists of rows for any RegionId that doesn't have data.
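A sketch of that single-query variant (splitting into per-region lists would then happen on the client side):
SELECT RegionId, DatabaseId, VersionTimestamp, OperationId
FROM TableX
ORDER BY RegionId;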
The real life scenario is a bit more complicated, as there are table relationships that will also have to be queried. I have not yet looked at including one-to-many relationships in my version queries.
This is primarily about better understanding the impact of covering indexes and figuring out how to use them better. Since I am going to read all the data from the table in any case, it is probably cheaper to load it all at once. However, reading it with the query above makes my code a lot cleaner, at least for this simple no-relationship example.
Edit:
Alternative 2
Another option that came to mind is creating a covering index on RegionId and including my primary key (DatabaseId):
SELECT DatabaseId
FROM TableX WHERE RegionId=1
And then a second query where I select the needed columns using WHERE DatabaseId IN (list, of, databaseId).
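For illustration, that second round trip could look something like this (the IN list is just a placeholder for the keys returned by the first query):
SELECT DatabaseId, VersionTimestamp, OperationId
FROM TableX
WHERE DatabaseId IN (1, 2, 3);  -- placeholder DatabaseId values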
For the current scenario, there are at most thousands of rows in the table, not millions. The network traffic for the two (x n) queries would most likely outweigh the benefit of using the indexes and be premature optimization.
I was wondering which approach is better when designing a database.
I currently have one big table (97 columns per row), with references to lookup tables where I could use them.
Wouldn't it be better for performance to group some columns into smaller tables and add key columns to them for referencing one whole row?
If you split up your table into several parts, you'll need additional joins to get all your columns for a single row - that will cost you time.
97 columns isn't much, really - I've seen way beyond 100.
It all depends on how your data is being used - if your row just has 97 columns, all the time, and needs all 97 columns - then it really hardly ever makes sense to split those up into various tables.
It might make sense if:
you can move some "large" columns (like XML, VARCHAR(MAX) etc.) into a separate table, if you don't need those all the time -> in that case, your "basic" row becomes smaller and your basic table will perform better - as long as you don't need those extra large columns (see the sketch after this list)
you can move away some columns to a separate table that aren't always present, e.g. columns that might be "optional" and only present for e.g. 20% of the rows - in that case, you might save yourself some processing for the remaining 80% of the cases where those columns aren't needed.
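A minimal sketch of that kind of split (table and column names are invented): the large, rarely needed column moves to a 1:1 side table keyed by the same primary key.
CREATE TABLE Product (
  ProductId INT            NOT NULL PRIMARY KEY,
  Name      NVARCHAR(200)  NOT NULL,
  Price     DECIMAL(10, 2) NOT NULL
);
CREATE TABLE ProductDetails (
  ProductId INT NOT NULL PRIMARY KEY REFERENCES Product (ProductId),
  SpecSheet XML NULL  -- large column that is only needed occasionally
);
-- everyday queries never touch the wide column
SELECT ProductId, Name, Price FROM Product;
-- join only when the large column is actually needed
SELECT p.ProductId, p.Name, d.SpecSheet
FROM Product AS p
JOIN ProductDetails AS d ON d.ProductId = p.ProductId;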
It would be better to group relevant columns into different tables. This will improve the performance of your database as well as your ease of use as the programmer. You should first try to find all the different relationships between your columns, and then attempt to break everything into tables while keeping those relationships in mind (using primary keys, foreign keys, references and so forth). Try to create a diagram like this http://www.simple-talk.com/iwritefor/articlefiles/354-image008.gif and take it from there.
Unless your data is denormalized, it is likely best to keep all the columns in the same table. SQL Server reads pages into the buffer pool from individual tables, so if you split the table you will pay the cost of the joins on every access, even when the pages accessed are already in the buffer pool. If you access just a few rows per query with a key, then an index will serve that query fine with all columns in the same table. Even if you will scan a large percentage of the rows (> 1% of a large table) but only a few of the 97 columns, you are still better off keeping the columns in the same table, because you can use a non-clustered index that covers the query. However, if the data is heavily denormalized, then normalizing it - which by definition breaks it into many tables based upon the rules of normalization to eliminate redundancy - will result in much improved performance, and you will be able to write queries that access only the specific data elements you need.
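As a hedged illustration (all names here are invented), a covering non-clustered index for a query that only touches a few of the 97 columns might look like:
CREATE NONCLUSTERED INDEX IX_BigTable_ReportDate
ON dbo.BigTable (ReportDate)
INCLUDE (CustomerId, Amount);
-- this can be answered from the index alone, without reading the wide rows
SELECT CustomerId, Amount
FROM dbo.BigTable
WHERE ReportDate >= '2020-01-01';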
I've been researching best practices for creating clustered indexes, and I'm just trying to fully understand these two suggestions that are listed in pretty much every blog or article on the matter:
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem to be slightly contradictory, or I'm guessing maybe it just depends on how you're accessing the table... or my interpretation of what "large result sets" means is wrong...
Unless you're doing range queries over the clustered column, it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered index to the PK, you're rarely going to satisfy the large-result-set suggestion, though of course it does satisfy the large number of distinct values...
To give the question a little more context: this question stems from a vertical auditing table we have that has a column for TABLE... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non-distinct... Each result set of table names is rather large, which seems to fulfill that second condition, but it's definitely not largely unique... which means all that other stuff happens with having to add the 4-byte uniqueifier, which makes the table a lot larger, etc...
This situation has come up a few times for me when I've come upon DBs that have, say, all the contacts or accounts normalized into a single table, separated only by a TYPE parameter, which is on every query...
In the case of the audit table, the queries are typically not that exciting either; they are just sorted by date modified, sometimes filtered by column, by the user that made the change, etc...
My other thought with this auditing scenario was to just make the auditing table a heap, so that inserting is fast and there's no contention between the tables being audited, and then to generate indexed views over the data...
Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniqueifier that is required to make each row with a non-unique key value addressable; it is essentially wasted space, given that the order of rows within the non-unique groupings is not predictable, so narrowing down to a single row is still a range operation.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being kept only on the first field), this doesn't hold true for all situations. Given that all of your queries are by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also be ordering on the Date, so I would put that field second. Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but upon an index rebuild the data access will be faster as the data is already both grouped (TableName) and ordered (Date) as the queries expect. If you put Date first, then the data is still ordered properly, but the rows needed to satisfy the query are likely spread out across the datafile(s), which would require more I/O to get. And more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to include the Date field in all queries, as any queries using only TableName (and possibly other filters but NOT the Date field) would have to scan the clustered index or force you to create a nonclustered index with TableName first.
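As a rough sketch (the columns beyond TableName and the date are assumptions on my part, and the identity column is added purely as a tiebreaker for uniqueness), such an audit table could look like this:
CREATE TABLE dbo.AuditLog (
  TableName    sysname        NOT NULL,
  DateModified datetime       NOT NULL,
  AuditId      bigint IDENTITY(1, 1) NOT NULL,  -- tiebreaker so the clustered key is unique
  UserModified sysname        NOT NULL,
  ColumnName   nvarchar(128)  NULL,
  OldValue     nvarchar(max)  NULL,
  NewValue     nvarchar(max)  NULL,
  CONSTRAINT PK_AuditLog PRIMARY KEY CLUSTERED (TableName, DateModified, AuditId)
);
Queries filtering on TableName (and optionally a date range) can then seek directly into the physically grouped portion of the clustered index.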
I would be wary of the Heap plus Indexed View model. Yes, it might be optimized for the inserts, but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again, you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.
I want to add an integer column to a table with a large number of rows and many indexes (it's a data warehouse fact table).
To keep the row width as narrow as possible all the columns in this table are defined as not null. So I want the new column to be not null with a default of zero.
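For reference, the change I have in mind is along these lines (the table and column names here are just placeholders):
ALTER TABLE dbo.FactSales
ADD NewMeasure int NOT NULL
CONSTRAINT DF_FactSales_NewMeasure DEFAULT (0);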
From experience, adding this column will take some time, presumably because the database will need to rewrite all the rows with the new column filled in with the default value. And this will presumably involve updating the clustered index and all the non-clustered indexes.
So should I drop all the indexes before adding the column and then recreate them all?
Or is there an easier way to do this?
Also, I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writing the records with an extra IS NULL bit flipped for each row?
It will require updating the clustered index, yes - this IS the table data, after all.
But I don't see why any of the non-clustered indices would have to be updated - your new column won't be a member of any of the non-clustered indices.
Also, I don't see how dropping and recreating the indices would benefit you in this scenario. If you were bulk-loading several million existing rows from another table or database - yes, then it might be faster (due to the INSERTs being much faster) - but adding a column doesn't really suffer from any indices or constraints being around, I don't think.
Marc
SQL Server is a row oriented database. This is in contrast to a column oriented database. This means that in SQL Server, all of the data for a given row is stored together on the disk. Let's have an example:
Say you have a Customer table with 3 columns, FirstName, MiddleInitial, and LastName. Then, say you have 3 records in this table for Jabba T. Hutt, Dennis T. Menace, and George W. Bush.
In a row oriented database (like SQL Server), the records will be stored on disk as such:
Jabba, T, Hutt; Dennis, T, Menace; George, W, Bush;
In contrast, a column oriented database would store the records on disk like this:
Jabba, Dennis, George; T, T, W; Hutt, Menace, Bush;
Where columns are grouped together instead of rows.
Now, when you go to add a column to a table in a row oriented database (SQL Server, for example), the new value has to be inserted into every existing row, shifting the rows and requiring a lot of read/write operations. So, if you were to add a new column for the customer prefix that defaults to 'Mr', this is what you'd get:
Mr, Jabba, T, Hutt; Mr, Dennis, T, Menace; Mr, George, W, Bush;
As you can see, all of the original data has been shifted to the right. On the other hand, when you insert a new column that defaults to NULL, no new data has to be put into the existing rows. Thus, there is less shifting, requiring fewer disk read/write operations.
Of course, this is an oversimplification of what's actually going on on disk. There are other things to take into account when dealing with indexes, pages, etc. But it should help you get the picture.
For clarification, I'm not at all suggesting you move to a column oriented database; I just put that info in there to help explain what row oriented means.
"Also I don't really understand why adding a column that is nullable is so much quicker. Why does this not involve re-writng the records with an extra Is Null bit flipped for each row."
Adding a nullable column merely changes the definition of the table. The individual records are not affected.
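To make the contrast concrete (the table and column names are hypothetical):
-- nullable column: only the table definition changes; existing rows are untouched
ALTER TABLE Customer ADD Nickname varchar(50) NULL;
-- NOT NULL with a default: per the explanation above, every existing row has to be filled in with 'Mr'
ALTER TABLE Customer ADD Prefix varchar(4) NOT NULL
CONSTRAINT DF_Customer_Prefix DEFAULT ('Mr');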
Are there any best practices for column ordering when designing a database? Will the order affect performance, space, or the ORM layer?
I am aware of "SQL Server - Does column order matter?". I am looking for more general advice.
I don't believe that the column order will necessarily affect performance or space. To improve performance, you can create indexes on the table, and the order of the columns defined in an index will affect performance.
I've seen tables have their fields ordered alphabetically, as well as "logically" (in a way that makes sense for the data that is being represented). All in all, I can see benefits in both, but I would tend to go for the "logically" method.
I try to stick with the most important columns first. Typically I always keep my ID column as the first in any table. Then whatever information is important and is updated frequently usually follows, then the rest which may or may not be updated frequently.
I don't think it will affect performance, but from a developer's stance, it's easier to read the first few columns, which will be updated frequently, than to try and scan the whole table for that one field at the end.
In Oracle there can be significant storage space savings if your table has a number of NULLable columns and you place the NULLable columns at the end of the list. NULL values on the end of a row take up no space.
e.g. imagine this table: (id NOT NULL, name VARCHAR2(100), surname VARCHAR2(100), blah VARCHAR2(100), date_created DATE NOT NULL)
the row (100, NULL, NULL, NULL, '10-JAN-2000') will require storage for the value 100, some space for the three NULLs, followed by the date.
Alternatively, the same table but with different ordering: (id NOT NULL, date_created DATE NOT NULL, name VARCHAR2(100), surname VARCHAR2(100), blah VARCHAR2(100))
the row (100, '10-JAN-2000', NULL, NULL, NULL) will only require storage for the values 100 and the date - the trailing NULLs are omitted entirely.
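Written out as DDL (the table names and the id type are my additions), the two layouts are:
-- NULLable columns in the middle: each NULL before date_created still needs a length byte
CREATE TABLE person_a (
  id           NUMBER NOT NULL,
  name         VARCHAR2(100),
  surname      VARCHAR2(100),
  blah         VARCHAR2(100),
  date_created DATE   NOT NULL
);
-- NULLable columns at the end: trailing NULLs take no space at all
CREATE TABLE person_b (
  id           NUMBER NOT NULL,
  date_created DATE   NOT NULL,
  name         VARCHAR2(100),
  surname      VARCHAR2(100),
  blah         VARCHAR2(100)
);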
Normally this makes little difference but for very large tables with many NULLable columns, significant savings may be made - less space used can translate to more rows per block, meaning less IO and CPU required to query the table.
I think the answer is no.
RDBMS servers optimise these kinds of things internally for queries so I suspect it's unimportant.
Column order only matters in a composite index.
If your index is on (LastName, FirstName) and you always search by last name, then you are good to go even if you don't include the first name.
If your index looks like this (FirstName, LastName) and your where clause is
where lastname like 'smith%'
then you have to scan the whole index.
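A short sketch of the two definitions being contrasted (the Person table and the column casing are just for illustration):
-- leading column matches the predicate, so the index can be seeked
CREATE INDEX ix_person_last_first ON Person (LastName, FirstName);
-- the predicate is on the second column only, so this index has to be scanned
CREATE INDEX ix_person_first_last ON Person (FirstName, LastName);
SELECT LastName, FirstName
FROM Person
WHERE LastName LIKE 'smith%';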
More general advice isn't really available since you're asking for implementation details rather than the SQL standard.
Different DBMS will implement these things differently.
However, a clever DBMS would implement the internals such that the column ordering is not of consequence.
Therefore, I would order my columns to be intuitive for human readers.
In designing a database, I would probably put the most important columns first in a logical order (idfield, firstname, middlename, lastname for instance). It does make it easier to see them when you are looking for the columns you need the most out of a long column list.
I would however not rearrange the columns later on to support a more logical grouping.