Is there a best practice to database column ordering? - database

Are there any best practices for column ordering when designing a database? Will the order affect performance, space, or the ORM layer?
I am aware of SQL Server - Does column order matter?. I am looking for more general advice.

I don't believe that the column order will necessarily affect performance or space. To improve performance, you can create indexes on the table, and the order of the columns defined in the index will affect performance.
I've seen tables have their fields ordered alphabetically, as well as "logically" (in a way that makes sense for the data being represented). All in all, I can see benefits in both, but I would tend to go for the "logical" method.

I try to stick with the most important columns first. Typically I always keep my ID column as the first in any table. Then whatever information is important and updated frequently usually follows, and then the rest, which may or may not be updated frequently.
I don't think it will affect performance, but from a developer stance, it's easier to read the first few columns, which will be updated frequently, than to scan the whole table for that one field at the end.

In Oracle there can be significant storage space savings if your table has a number of NULLable columns and you place the NULLable columns at the end of the list. NULL values on the end of a row take up no space.
e.g. imagine this table: (id NOT NULL, name VARCHAR2(100), surname VARCHAR2(100), blah VARCHAR2(100), date_created DATE NOT NULL)
the row (100, NULL, NULL, NULL, '10-JAN-2000') will require storage for the value 100, some space for the three NULLs, followed by the date.
Alternatively, the same table but with different ordering: (id NOT NULL, date_created DATE NOT NULL, name VARCHAR2(100), surname VARCHAR2(100), blah VARCHAR2(100))
the row (100, '10-JAN-2000', NULL, NULL, NULL) will only require storage for the values 100 and the date - the trailing NULLs are omitted entirely.
Normally this makes little difference but for very large tables with many NULLable columns, significant savings may be made - less space used can translate to more rows per block, meaning less IO and CPU required to query the table.
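A minimal sketch of the two orderings described above, assuming Oracle and using made-up table names (person_v1 / person_v2):
-- Nullable columns before a NOT NULL column: a row like
-- (100, NULL, NULL, NULL, DATE '2000-01-10') still pays for the three NULLs.
CREATE TABLE person_v1 (
    id           NUMBER        NOT NULL,
    name         VARCHAR2(100),
    surname      VARCHAR2(100),
    blah         VARCHAR2(100),
    date_created DATE          NOT NULL
);
-- Same columns with the nullable ones moved to the end: for the same row,
-- the trailing NULLs take no storage at all.
CREATE TABLE person_v2 (
    id           NUMBER        NOT NULL,
    date_created DATE          NOT NULL,
    name         VARCHAR2(100),
    surname      VARCHAR2(100),
    blah         VARCHAR2(100)
);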

I think the answer is no.
RDBMS servers optimise these kinds of things internally for queries so I suspect it's unimportant.

Column order only matters in a composite index.
If your index is on (Lastname, Firstname) and you always search by last name, then you are good to go even if you don't include the first name.
If your index looks like this (Firstname, Lastname) and your where clause is
where lastname like 'smith%'
then you have to scan the whole index.
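A quick sketch of that point (table and index names are made up for illustration):
-- Leading on Lastname: a search on last name alone can seek on the index.
CREATE INDEX IX_People_Lastname_Firstname ON People (Lastname, Firstname);
SELECT * FROM People WHERE Lastname LIKE 'smith%';   -- index seek
-- Leading on Firstname: the same predicate cannot seek and must scan the whole index.
CREATE INDEX IX_People_Firstname_Lastname ON People (Firstname, Lastname);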

More general advice isn't really available since you're asking for implementation details rather than the SQL standard.
Different DBMS will implement these things differently.
However, a clever DBMS would implement the internals such that the column ordering is not of consequence.
Therefore, I would order my columns to be intuitive for human readers.

In designing a database, I would probably put the most important columns first in a logical order (idfield, firstname, middlename, lastname for instance). It does make it easier to see them when you are looking for the columns you need the most out of a long column list.
I would however not rearrange the columns later on to support a more logical grouping.

Related

SQL Query on single table-valued parameter slow on large input

I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES User(Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUser) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
@offset INT,
@limit INT,
@input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows and gives me good results when using a few userids as input, but very bad results (30+ seconds) when providing hundreds or a couple of thousand userids as input. The main problem seems to be that it spends 63% of the effort on sorting.
What index am I missing? This seems to be a pretty straightforward query on a single table.
What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.
Your best bet is to:
Add an INT IDENTITY column to the User table, named UserID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering in the "Code" values (meaning: you cannot guarantee they will always use all upper-case or all lower-case), then it is best to add an AFTER INSERT, UPDATE Trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This also means that you need to make sure that all incoming queries use the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
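A rough sketch of those steps, assuming the User table's current primary key is the NVARCHAR Id column. In practice the existing primary key, foreign keys, and dependent code would need to be dropped and recreated around these statements, so treat this as an outline rather than a ready-to-run migration (constraint and index names are invented):
-- 1. Surrogate key on User; the old NVARCHAR Id becomes the alternate key UserCode.
ALTER TABLE [User] ADD UserID INT IDENTITY(1,1) NOT NULL;
EXEC sp_rename '[User].Id', 'UserCode', 'COLUMN';
-- (drop the old PK and any FKs that reference it here, then:)
ALTER TABLE [User] ADD CONSTRAINT PK_User PRIMARY KEY CLUSTERED (UserID);
CREATE UNIQUE NONCLUSTERED INDEX UQ_User_UserCode ON [User] (UserCode);
-- 2. Narrow INT FK column on Related, populated from the new surrogate key.
ALTER TABLE Related ADD UserID INT NULL;
UPDATE r SET r.UserID = u.UserID
FROM Related AS r
JOIN [User] AS u ON u.UserCode = r.RelatedUser;
ALTER TABLE Related ALTER COLUMN UserID INT NOT NULL;
ALTER TABLE Related ADD CONSTRAINT FK_Related_User FOREIGN KEY (UserID) REFERENCES [User] (UserID);
-- ...then drop Related.RelatedUser once nothing references it.
-- 3. Matching table type with INT instead of NVARCHAR(100).
-- DROP TYPE only works once no procedure still references the old type.
DROP TYPE UserIdInput;
CREATE TYPE UserIdInput AS TABLE (UID INT PRIMARY KEY CLUSTERED);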
One other thing to consider: the TVP is a table variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP performance in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables causes the query optimizer to see the true row count. If that doesn't help, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs), as those do maintain statistics and optimize better when you have more than a small number of rows in them.
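For example, the two workarounds might look like this inside the procedure (a sketch only, reusing the query from the question):
-- Option 1: let the optimizer see the real row count of the TVP.
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY
OPTION (RECOMPILE);
-- Option 2: dump the TVP into a local temp table, which does have statistics.
SELECT UID INTO #TempUserIDs FROM @input;
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM #TempUserIDs WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;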
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (it is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
So I finally found a solution.
While @srutzky had good suggestions for normalizing the tables by changing the NVARCHAR UserId to an integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
@Paparazzi suggested I add an index for (RelatedStory, CreationTime), and that did not do what I needed either. The reason was that I also needed to index RelatedUser, as that's the way the query goes, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly one-second, or at most a few-second, query times.
I think what gave me the revelation was @srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my GROUP BY and ORDER BY columns in the index.
So while I can't mark either of the above posters' posts as the Answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on sorting.
ORDER BY CreationTime DESC
I would suggest an index on CreationTime
Or try an index on RelatedStory, CreationTime
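Roughly (index names are arbitrary):
CREATE INDEX i_creationtime ON Related (CreationTime DESC);
-- or
CREATE INDEX i_story_creationtime ON Related (RelatedStory, CreationTime DESC);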

What is the best practice for a database table with a two-column primary key where one of the columns is optional?

At work, we're reviewing a table that consists of three columns:
Row Id (Not Null, Integer)
Context Id (Not Null, Integer)
Value (Not Null, Variable Character)
The total number of rows is small (less than 100). The Context Id is currently set to 0 if there is no Context Id. Context Id is optional.
The primary key is (Row Id, Context Id).
In this situation, there were two choices proposed:
Keep it as it is
Divide the table into two tables. One table when Context Id is 0 (Row Id, Value) and one table when Context Id has a value (Row Id, Context Id, Value).
If there are a large number of rows, then I agree with the decision to split up the table. If there are a small number of rows, dividing up the table seems like overkill.
I would be very interested in what folks recommend in this situation. Is it better to always divide up the single table into two tables?
Thanks,
-Larry
First, if this is the entity-attribute-value antipattern, avoid that. The rest of this answer assumes that it isn't.
When deciding whether you are going to model something generically (single table) or specifically (two tables), you need to consider whether your code will be able to process things generically or will need special cases.
Are Special Cases Required?
Will your code give special treatment to a ContextId of 0? For example, might you run a select WHERE ContextId=0 and then go on and look at a different context in a generic manner? Is there special logic that you would only apply when ContextId=0? Would you feel the need for a special constant representing this ContextId in your code? Would this constant appear in some if statements?
If the answers to these questions are generally yes, then create the separate table. You will not gain from putting this stuff in a single table if you treat it differently anyway.
It's my guess that this is the case based on your question.
Or is it all Generic?
If the ContextId of zero is treated just like all the other context ids throughout your code, then by all means put it in the same table as the others.
Tradeoffs
If things are not so clear-cut, you have a tradeoff decision to make. You would need to make this based on how much of your usage of this information is generic and how much is specific. I would be biased towards creating two tables.
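For illustration, the two designs being weighed might look like this (table names and column sizes are invented):
-- Option 1: keep a single table, with ContextId = 0 meaning "no context".
CREATE TABLE ContextValue (
    RowId     INT          NOT NULL,
    ContextId INT          NOT NULL,   -- 0 = no context
    Value     VARCHAR(255) NOT NULL,
    PRIMARY KEY (RowId, ContextId)
);
-- Option 2: split the context-less rows into their own table.
CREATE TABLE DefaultValue (
    RowId INT          NOT NULL PRIMARY KEY,
    Value VARCHAR(255) NOT NULL
);
CREATE TABLE ContextSpecificValue (
    RowId     INT          NOT NULL,
    ContextId INT          NOT NULL,
    Value     VARCHAR(255) NOT NULL,
    PRIMARY KEY (RowId, ContextId)
);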
There is not enough information to decide, but:
From a database design perspective, aiming for changes with minimal side effects, I would prefer adding a new column to the table, called ID, setting it as the primary key of the table, and adding a new unique key on the (Row Id, Context Id) columns, as sketched below.
The logic behind (Row Id + Context Id) is your application's logic, and application logic is bound to change.
It's recommended to keep the primary keys of your tables separate from business identifiers and business logic.
Any field that is shown to the end user is at risk of a change request, which makes maintenance and development harder.
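A sketch of that shape, reusing the hypothetical ContextValue table name from the earlier answer (the ID column would be populated by an identity or sequence):
CREATE TABLE ContextValue (
    ID        INT          NOT NULL PRIMARY KEY,   -- surrogate key
    RowId     INT          NOT NULL,
    ContextId INT          NOT NULL,
    Value     VARCHAR(255) NOT NULL,
    CONSTRAINT UQ_ContextValue_RowId_ContextId UNIQUE (RowId, ContextId)
);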
Hope this helps.

Database : Primary key columns in beginning of table

Does it have any impact to place all the primary key columns at the beginning of the table?
I know partial index reads most likely involve table scans that bring the whole row into the buffer pool for predicate matching. I am curious to know what performance gain, if any, having the primary keys at the top of the table would provide.
In Oracle, the order of the columns of a table has little impact in general on performance.
The reason is that all columns of a row are generally contained on a single block and that the difference in time between finding the first column and the last column of a row in a block is infinitesimal compared to finding/reading the block.
Furthermore, when you reach the database block to read a row, the primary key may not be the most important column.
Here are a few exceptions where column order might have an impact:
when you have > 255 columns in your table, the rows will be split in two blocks (or more). Accessing the first 255 columns may be cheaper than accessing the remaining columns.
the last columns of a row take 0 byte of space if they are NULL. As such, columns that contain many NULL values are best left at the end of a row if possible to reduce space usage and therefore IO. In general the impact will be minimal since other NULL columns take 1 byte each so the space saved is small.
when compression is enabled, the efficiency of the compression may depend upon the column order. A good rule of thumb would be that columns with few distinct values should be grouped to enhance the chance that they will be merged by the compression algorithm.
You should think about the order of columns when you use an Index Organized Table (IOT) with the overflow clause. With this clause, all columns after a designated dividing column will be stored out of line, and accessing them will incur additional cost (sketched below). Primary keys are always stored physically at the beginning of the rows in an IOT.
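For instance, an IOT with an overflow segment might be declared like this (Oracle syntax, illustrative table and column names):
CREATE TABLE orders_iot (
    order_id NUMBER         NOT NULL PRIMARY KEY,
    status   VARCHAR2(20)   NOT NULL,
    notes    VARCHAR2(4000)
)
ORGANIZATION INDEX
INCLUDING status   -- columns up to and including status stay in the index segment
OVERFLOW;          -- later columns (notes) go to the overflow segment, costing extra IO to read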
At least in SQL Server there is no performance benefit based on the order of the columns in the table, primary key or not. The only benefit to having your primary key columns at the top of the list is organizational. Kind of like having a table with these columns Id, FirstName, LastName, Address1, Address2, City, State, Zip. It's a lot easier to follow in that order than Address2, State, Firstname, Id, Address1, Lastname, Zip, City. I don't know much about Oracle or DB2 but I believe it's the same.
In DB2 (and I think the answers for other database management systems should be checked too), the columns that are modified least often should be at the beginning of each row, because when performing an update, DB2 writes everything from the first modified column to the end of the row into the transaction logs.
This only impacts update operations; inserts, deletes, and selects are not affected. The benefit is that IO is slightly reduced, because less information has to be written when only the last columns change. This can be important when performing updates on a few small columns in tables with big rows and lots of records. If the first column is modified, DB2 will write the whole row.
Ordering columns to minimize update logging: http://publib.boulder.ibm.com/infocenter/db2luw/v9r7/topic/com.ibm.db2.luw.admin.dbobj.doc/doc/c0024496.html
(for ORACLE)
Is it fair to say, then, that any and all primary key columns, even if there is just one, should be the first or among the first few columns in a row? And further, that tagging them onto the END of the row is bad practice, particularly after a series of possibly/likely null attribute fields?
Thus, a row like:
pkcol(s), att1, att2, att3, varchar2(2000)
is better organized for all the reasons stated above than
att1, att2, att3, varchar2(2000), pkcol(s)
Why am I asking? Well, don't judge, but we are simplifying the PK for some tables and the developers have happily tagged the new GUID pk (don't judge #2) onto the end of the row. I am bothered by this but need some feedback to justify my fears. Also, does this matter at all for SQL Server?

Which is faster to compare, a uniqueidentifier or a string, in T-SQL?

I have a table which holds the GUID for a user and their actual name as a string. I would like to grab some information based on a user. But which field should I use? Should my code say:
select *
from userinboxcount
where countDate >= startDate and countDate <= endDate and userid = '<guid here>'
or
select *
from userinboxcount
where countDate >= startDate and countDate <= endDate and username = 'FirstName LastName'
The biggest difference is if one field has an index that the database can use, and the other doesn't. If the database has to read all the data in the table to scan for the value, the disk access takes so many resources that the difference in data type is not relevant.
If both fields have indexes, then the index that is smaller would be somewhat faster, because it loads faster, and it's more likely that it remains in the cache.
Ideally you would have an index for all the fields in the condition, which has the fields that you want to return as included fields. That way the query can produce the result from only the index, and doesn't have to read from the actual table at all. You should of course not use select *, but specify the fields that you actually need to return.
Other than that, it would be somewhat faster to compare GUID values because it's a simple numeric comparison and doesn't have to consider lexical rules.
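As a sketch, a covering index for the GUID version of the query could look like this (the index name and INCLUDE list are assumptions about which columns the query actually returns):
-- Covers WHERE userid = ... AND countDate BETWEEN ... entirely from the index;
-- add the columns your SELECT needs as INCLUDE columns so the base table is never touched.
CREATE NONCLUSTERED INDEX IX_userinboxcount_userid_countDate
    ON userinboxcount (userid, countDate)
    INCLUDE (username);   -- hypothetical: replace with the columns you return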
See the query plan and you can see it for yourself.
But the unique identifier usually has an index and the string (username) might not have one. If so, and if there are many records, the unique identifier would probably be faster!
To see the query plan, check THIS article.
GUID will be good enough.
1. GUID will produce unique values in the table.
2. Create Non Clustered Index on this column.
Reference - Non-clustered indexes are particularly handy when we want to return a single row from a table.
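For example (index name assumed, using the table from the question):
CREATE NONCLUSTERED INDEX IX_userinboxcount_userid
    ON userinboxcount (userid);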
Are you completely married to the GUID? You should use a GUID when you need a primary key that will be unique across multiple systems. I would suggest skipping the GUID and using a composite key. For example, you could use an identity plus a GETDATE() as a composite key. This would give you an easy way to query your data (try to remember a GUID over an integer). This will also perform much, much better than GUID. Probably twice as fast.
If userid is a primary key, you should use that. If you use first and last name, you could have two John Smith entries, for example, and that could create an issue for you. Using the PK should be safer
On the performance side, it's a good idea to become familiar with explain plan (execution path?) of the query. I'd expect using the userid would be faster, but checking the plan should tell you for certain.

Too many columns in single table - is it good normal form?

A normalized table should have a smaller number of columns and should use reference fields as much as possible. Is that the right approach?
Is there any relationship between number of columns and a good normalization process?
Is there any relationship between number of columns and a good normalization process?
In short, no. A 3NF normalized table will have as many columns as it needs, provided that data within the table is dependent on the key, the whole key, and nothing but the key (so help me Codd).
There are situations where (some) denormalization may actually improve performance and the only real measure of when this should be done is to test it.
You should follow the normalization principles rather than be concerned with the sheer number of columns in a table. The business requirements will drive the entities, their attributes and their relationship and no absolute number is the "correct" one.
Here is an approach you can use if you feel your table has too many fields. For example:
CREATE TABLE Person
(
Person_ID int not null primary key,
Forename nvarchar(50) not null,
Surname nvarchar(50) not null,
Username varchar(20) null,
PasswordHash varchar(50) null
)
This table represents people, but clearly not all people need to be users, hence the Username and PasswordHash fields are nullable. However, it's possible that there will be one or two orders of magnitude more people than there are users.
In such case we could create a User table to hold the Username and PasswordHash fields with a one-to-one relationship to the Person table.
You can generalise this approach by looking for sets of nullable fields that are either null together or have values together, and that are significantly likely to be null. This indicates that there is another table you could extract.
Edit
Thanks to Stephanie (see comments) this technique is apparently called "Vertical Partitioning"
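For illustration, the split described above might look like this (the second table is named UserAccount here because User is a reserved word in several DBMSs; names are illustrative):
CREATE TABLE Person
(
Person_ID int not null primary key,
Forename nvarchar(50) not null,
Surname nvarchar(50) not null
)
-- One-to-one: a row exists here only for the people who are also users.
CREATE TABLE UserAccount
(
Person_ID int not null primary key references Person (Person_ID),
Username varchar(20) not null,
PasswordHash varchar(50) not null
)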
While I agree with @ocdecio, I would also observe that a database that is normalized will generally have fewer columns per table and more tables than one that is not, given the same data storage requirements. Similar to code smells, a database smell would be relatively few tables given a reasonably large application. This would be a hint that perhaps your data is not in normal form. Applying normalization rules, where appropriate, would alleviate this "smell".
Each column must have a direct and exclusive relationship to the primary key. If you have an attribute-heavy item, there is only so much you can do to simplify the model. Any attempt to split it into multiple tables will be counter-productive and pointless.
