SQL Server - worth indexing large string keys?

I have a table with a large string key (varchar(1024)) that I was considering indexing in SQL Server (I want to be able to search it quickly, but inserts matter too). Under SQL Server 2008 I don't get a warning for this, but SQL Server 2005 warns that the key exceeds 900 bytes and that inserts/updates where the column exceeds that size will fail (or words to that effect).
What are my alternatives if I do want to index this large column? And would it even be worth it?

An index with all the keys near 900 bytes would be very large and very deep (very few keys per page result in very tall B-Trees).
It depends on how you plan to query the values. An index is useful in several cases:
When a value is probed. This is the most typical use: an exact value is searched for in the table. Typical examples are WHERE column = 'ABC' or a join condition like ON a.column = b.someothercolumn.
When a range is scanned. This is also fairly typical: a range of values is searched for in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF' there are less obvious ones, like a partial match: WHERE column LIKE 'ABC%'.
When there is an ordering requirement. This use is less well known, but an index can help a query with an explicit ORDER BY column requirement avoid a stop-and-go sort, and can also satisfy certain hidden sort requirements, like ROW_NUMBER() OVER (ORDER BY column).
So, what do you need the index for? What kind of queries would use it?
For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index vs. the benefits.
For probes you can, potentially, use a hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index on that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column = 'ABC'. Careful consideration has to be given to weighing the advantage of a narrow index (a 32-bit checksum) against the disadvantages of the collision double-check and the loss of range-scan and ordering capabilities.
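A minimal sketch of that approach, assuming a hypothetical table big_strings whose large column is value varchar(1024):
alter table big_strings
    add value_checksum as CHECKSUM(value) persisted;
create index ix_big_strings_value_checksum
    on big_strings (value_checksum);
-- probes must repeat the original predicate to filter out checksum collisions
select *
    from big_strings
    where value_checksum = CHECKSUM('ABC')
    and value = 'ABC';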
After the comment:
I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along these lines:
create table values_dictionary (
    id int not null identity(1,1),
    value varchar(8000) not null,
    value_hash as checksum(value) persisted,
    constraint pk_values_dictionary_id
        primary key nonclustered (id));
create unique clustered index cdx_values_dictionary_checksum
    on values_dictionary (value_hash, id);
go
create procedure usp_get_or_create_value_id (
    @value varchar(8000),
    @id int output)
as
begin
    declare @hash int;
    set @hash = CHECKSUM(@value);
    set @id = NULL;
    select @id = id
        from values_dictionary
        where value_hash = @hash
        and value = @value;
    if @id is null
    begin
        insert into values_dictionary (value)
        values (@value);
        set @id = scope_identity();
    end
end
In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniqueifier column. This structure makes the lookup for @value as efficient as possible, without a hugely inefficient index on value, and it bypasses the 900-byte limitation. The primary key on id is nonclustered, which means that looking up the value from an id incurs the overhead of one extra probe into the clustered index.
Not sure if this answers your problem; you obviously know more about your actual scenario than I do. Also, the code does not handle error conditions and can actually insert duplicate @value entries, which may or may not be acceptable.

General Index Design Guidelines
When you design an index, consider the following column guidelines:
Keep the length of the index key short for clustered indexes. Additionally, clustered indexes benefit from being created on unique or nonnull columns. For more information, see Clustered Index Design Guidelines.
Columns that are of the ntext, text, image, varchar(max), nvarchar(max), and varbinary(max) data types cannot be specified as index key columns. However, varchar(max), nvarchar(max), varbinary(max), and xml data types can participate in a nonclustered index as nonkey index columns. For more information, see Index with Included Columns.
Examine data distribution in the column. Frequently, a long-running query is caused by indexing a column with few unique values, or by performing a join on such a column. This is a fundamental problem with the data and query, and generally cannot be resolved without identifying this situation. For example, a physical telephone directory sorted alphabetically on last name will not expedite locating a person if all the people in the city are named Smith or Jones.
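For instance, a max-typed column can still ride along in a nonclustered index as an included (nonkey) column. A minimal sketch, with made-up table and column names:
CREATE TABLE dbo.Documents
(
    DocumentID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Title      varchar(200) NOT NULL,
    Body       varchar(max) NOT NULL
);
CREATE NONCLUSTERED INDEX IX_Documents_Title
    ON dbo.Documents (Title)
    INCLUDE (Body);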

Related

SQL Query on single table-valued parameter slow on large input

I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES [User](Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUser) INCLUDE (RelatedStory, CreationTime);
CREATE INDEX i_relatedstory
ON Related (RelatedStory) INCLUDE (RelatedUser, CreationTime);
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
    @offset INT,
    @limit INT,
    @input UserIdInput READONLY
AS
BEGIN
    SELECT RelatedStory
    FROM Related
    WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
    GROUP BY RelatedStory, CreationTime
    ORDER BY CreationTime DESC
    OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows and gives me good results when using a few user ids as input, but very bad results (30+ seconds) when providing hundreds or a couple of thousand user ids as input. The main problem seems to be that it spends 63% of the effort on sorting.
What index am I missing? This seems to be a pretty straightforward query on a single table.
What types of values do you have for RelatedUser / UID? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation applies the full range of linguistic rules, which can be well worth it when working with strings but is unnecessarily expensive when working with codes, especially under the typical default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR, which will take up half the space and sort / compare faster. Also, still use a binary collation.
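As a hedged illustration of those two options (the table and column names here are made up):
CREATE TABLE dbo.CollationDemo
(
    CodeDefault nvarchar(100) COLLATE Latin1_General_CI_AS,    -- linguistic, case-insensitive comparisons
    CodeNBin    nvarchar(100) COLLATE Latin1_General_100_BIN2, -- Unicode, byte-by-byte comparisons
    CodeVBin    varchar(100)  COLLATE Latin1_General_100_BIN2  -- non-Unicode, byte-by-byte, half the storage
);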
Your best bet is to:
Add an INT IDENTITY column to the User table, named UserID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering the "Code" values (meaning you cannot guarantee they will always use all upper-case or all lower-case), then it is best to add an AFTER INSERT, UPDATE trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This also means you need to make sure that all incoming queries use the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
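A rough DDL sketch of those steps, assuming the table and column names described above rather than the poster's actual schema (existing primary and foreign keys would have to be dropped and recreated around these changes):
ALTER TABLE [User] ADD UserID int IDENTITY(1,1) NOT NULL;
ALTER TABLE [User] ADD CONSTRAINT PK_User PRIMARY KEY CLUSTERED (UserID);
-- keep the old string id as an alternate key named UserCode
EXEC sp_rename 'User.Id', 'UserCode', 'COLUMN';
CREATE UNIQUE NONCLUSTERED INDEX UX_User_UserCode ON [User] (UserCode);
-- narrow integer FK on the child table
ALTER TABLE Related ADD UserID int NULL;
-- ...backfill Related.UserID by joining RelatedUser to User.UserCode, then:
ALTER TABLE Related ALTER COLUMN UserID int NOT NULL;
ALTER TABLE Related ADD CONSTRAINT FK_Related_User
    FOREIGN KEY (UserID) REFERENCES [User] (UserID);
ALTER TABLE Related DROP COLUMN RelatedUser;
-- recreate the TVP type with the integer key
DROP TYPE UserIdInput;
CREATE TYPE UserIdInput AS TABLE (UID int PRIMARY KEY CLUSTERED);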
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
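As a sketch of those two tricks applied to the query in the procedure above (the query itself is unchanged):
-- 1) let the optimizer see the real TVP row count at execution time
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY
OPTION (RECOMPILE);
-- 2) or copy the TVP into a local temporary table, which does maintain statistics
SELECT UID INTO #TempUserIDs FROM @input;
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM #TempUserIDs WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;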
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (it is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory, since ALL data read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
So I finally found a solution.
While @srutzky had good suggestions for normalizing the tables by changing the NVARCHAR UserId to an integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
@Paparazzi suggested I add an index on (RelatedStory, CreationTime), and that did not do what I needed either. The reason was that I also needed to index RelatedUser, as that is how the query filters, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly around one second, or at worst a couple of seconds.
I think what gave me the revelation was @srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons, only for covering.
which made me realize I needed all my GROUP BY and ORDER BY columns in the index.
So while I can't mark either of the above posters' posts as the answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on sorting.
ORDER BY CreationTime DESC
I would suggest an index on CreationTime.
Or try an index on RelatedStory, CreationTime

Will adding a clustered index to an existing table improve performance?

I have a SQL 2005 database I've inherited, with a table that has grown to about 17 million records over the course of about 15 years, and is now horribly slow.
The table layout looks about like this:
id_column = nvarchar(20), indexed, not unique
column2 = nvarchar(20), indexed, not unique
column3 = nvarchar(10), indexed, not unique
column4 = nvarchar(10), indexed, not unique
column5 = numeric(8,0), indexed, not unique
column6 = numeric(8,0), indexed, not unique
column7 = nvarchar(20), indexed, not unique
column8 = nvarchar(10), indexed, not unique
(and about 5 more columns that look pretty much the same, not indexed)
The 'id' field is a value entered in a front-end application by the end-user.
There are no defined primary keys, and no columns that can be combined to make a unique row (unless all columns are combined). The table actually is a 'details' table to another table, but there are no constraints ensuring referential integrity.
Every column is heavily used in 'where' clauses in queries, which is why I assume there's an index on every one, or perhaps a desperate attempt to speed things up by another DBA.
Having said all that, my question is: could adding a clustered index do me any good at this point?
If I did add a clustered index, I assume it would have to be a new column, i.e. an identity column? Basically, is it worth the trouble?
Appreciate any advice.
I would say only add the clustered index if there is a reason for needing it. So ask these questions:
Does the order of the data make sense?
Is there sequential value to the way the data is inserted?
Do I need to use a feature that requires it have a clustered index, such as full text index?
If the answer to all of these questions is "No", then a clustered index might not be of any additional help over a good non-clustered index strategy. Instead you might want to consider how and when you update statistics, when you rebuild the indexes, and whether or not filtered indexes make sense in your situation. Looking at the table you have as an example, it's tough to say, but maybe it makes sense to normalize the table further and use numeric keys instead of nvarchar.
http://www.mssqltips.com/sqlservertip/3041/when-sql-server-nonclustered-indexes-are-faster-than-clustered-indexes/
The article is a great example of when non-clustered indexes might make more sense.
I would recommend adding a clustered index, even if it is on a new identity column, for three reasons:
Assuming that your existing queries have to go through the entire table every time, a clustered index scan is still faster than a table scan.
The table is a child of some other table. With some extra work, you can use the new child_id to join against the parent table. This enables a clustered index seek, which is a lot faster than a scan in some cases.
Depending on how they are set up, the existing indices may not do much good. I've come across some terrible indices that cover one column each, or indices that don't include the appropriate columns, causing costly Key Lookup operations. Check your index usage stats to see if they are being used at all; an example query follows below.
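As a hedged illustration of checking index usage via the sys.dm_db_index_usage_stats DMV (the object name is a placeholder for the details table in question):
SELECT i.name AS index_name,
       us.user_seeks, us.user_scans, us.user_lookups, us.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id = i.object_id
      AND us.index_id = i.index_id
      AND us.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID('dbo.YourDetailsTable');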

Why does SQL Server use a non-clustered index over the clustered PK in a "select *" operation?

I've got a very simple table which stores Titles for people ("Mr", "Mrs", etc). Here's a brief version of what I'm doing (using a temporary table in this example, but the results are the same):
create table #titles (
t_id tinyint not null identity(1, 1),
title varchar(20) not null,
constraint pk_titles primary key clustered (t_id),
constraint ux_titles unique nonclustered (title)
)
go
insert #titles values ('Mr')
insert #titles values ('Mrs')
insert #titles values ('Miss')
select * from #titles
drop table #titles
Notice that the primary key of the table is clustered (explicitly, for the sake of the example) and there's a non-clustered uniqueness constraint on the title column.
Here's the results from the select operation:
t_id title
---- --------------------
3 Miss
1 Mr
2 Mrs
Looking at the execution plan, SQL uses the non-clustered index over the clustered primary key. I'm guessing this explains why the results come back in this order, but what I don't know is why it does this.
Any ideas? And more importantly, any way of stopping this behavior? I want the rows to be returned in the order they were inserted.
Thanks!
If you want order, you need to specify an explicit ORDER BY - anything else does not produce an order (its "order" is arbitrary and could change). There is no implied ordering in SQL Server - not by anything. If you need order - say so with ORDER BY.
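For the example table above, ordering by the identity column returns the rows in insertion order:
select * from #titles order by t_id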
SQL Server probably uses the non-clustered index (if it can - if that index has all the columns your query is asking for) since it is smaller - usually just the index column(s) and the clustering key (again: one or multiple columns). The clustered index, on the other hand, contains the whole data (at the leaf level), so it might require a lot more data to be read in order to get your answer (not in this over-simplified example, of course - but in the real world).
The only way to (absolutely and correctly) guarantee row order is to use ORDER BY -- anything else is an implementation detail and apt to explode, as demonstrated.
As to why the engine chose the unique index: it just didn't matter.
There was no criteria favoring one index over another
The unique index covered the data (title and PK) returned; this is somewhat speculative on my part, but SQL Server is doing what it thinks best.
Try it on a table with an additional column which is not covered -- no bets, but it may make the query planner change its mind.
Happy coding.
SQL Server probably chose the non-clustered index because all the data you requested (the id and title) could be retrieved from that index.
For such a trivial table it doesn't really matter which access path is chosen, as the worst path is still only two IOs.
As someone commented above, if you want your data in a particular order you must specifically request it using the ORDER BY clause; otherwise it's pretty much random what you get back.
Nonclustered indexes are usually smaller than clustered ones so it is usually faster to scan a nonclustered index rather than a clustered one. That probably explains SQL Server's preference for a nonclustered index, even though in your case the indexes are the same size.
The only way to guarantee the order of rows returned is to specify ORDER BY. If you don't specify ORDER BY then you are implicitly telling the optimizer that it can choose what order to return the rows in.

Is adding an INT to a table whose PRIMARY KEY is a UNIQUEIDENTIFIER, for the purpose of a JOIN table, worthwhile?

I've got two tables in my SQL Server 2008 database, Users and Items
tblUser
--------------------------
UserID uniqueidentifier
Name nvarchar(50)
etc..
tblItem
--------------------------
ItemID uniqueidentifier
ItemName nvarchar(50)
etc..
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
I want to join these together in a many-to-many join table that will get huge (potentially more than a billion rows, as the application logic requires stats over shared user --> item joins).
The join table needs to be indexed on the UserID_A and UserID_B columns since the lookups are based on a user against their peers.
My question is this:
Is it worth adding an auto-increment INT to the user table, not as the primary key, and then using that in the join table? So the User table would look like:
tblUser
---------------------------------
UserID uniqueidentifier
Name nvarchar(50)
UserIDJoinKey int identity(1,1)
etc..
Doing that, will it be faster to do something like:
declare @ID int
select * from tblJoin where UserIDJoinKey_A = @ID or UserIDJoinKey_B = @ID
when the join table looks like this:
tlmUserUserItem
-----------------------------------
ItemID uniqueidentifier
UserIDJoinKey_A int
UserIDJoinKey_B int
rather than this:
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
Thanks in advance.
If you're having a performance problem on join operations to the table with the uniqueidentifier, first check the index fragmentation. Hot tables with a uniqueidentifier clustered index tend to get fragmented quickly. There's good info on how to do that at http://msdn.microsoft.com/en-us/library/ms189858.aspx
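For illustration (not taken from the linked article), fragmentation can be checked with the sys.dm_db_index_physical_stats DMV; the table name below is the join table from the question:
SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.tlmUserUserItem'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;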
If you are able to move the clustered index to the new int column and rewrite your queries to use the new int column instead of the old uniqueidentifier, your biggest benefit is going to be that you'll reduce the rate of fragmentation. This helps avoid having your queries slow down after a bunch of writes to the table.
In most cases, you will not notice a huge difference in the time to process join operations on a uniqueidentifier column versus an int in MSSQL 2008 -- assuming all other things (including fragmentation) are equal.
I may be misunderstanding something along the line, but you're looking to add an identity AND a uniqueidentifier to each record? When I see a GUID being used, I assume there is either offline functionality that will be merged when the user goes online, or there is some other specific reason the GUID was chosen. That reason would likely prevent you from correctly implementing an identity column on each item.
If there is no specific reason why you need a GUID over an identity, I'd say scrap the GUID altogether. It's bloating your tables and indexes and slowing down your joins. If I'm misunderstanding, please let me know and I apologize!
To find out which solution is best, first some indexing theory. SQL Server stores its clustered index data in a B+ tree of data pages, which allows for about 8 KB of data per page.
Since a uniqueidentifier is 16 bytes per key and an int is 4 bytes per key, there will be roughly 4 times as many keys per index page with an int.
To get a faster join with the int column you will most likely have to make it the clustered index. Be aware that having an additional index on such a large table might create an unwanted performance hit on insert statements, as there is more information to write to disk.
It all boils down to benchmarking both solutions and choosing the one which performs best for you. If the table is more read-heavy, the int column will offer better overall performance.
@MikeM,
Personally I would always choose a uniqueidentifier over an int as the primary key of a table, every time. I would however use NEWSEQUENTIALID() and not NEWID(), to ensure there is less index fragmentation.
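A minimal sketch of that choice (the table name is made up); NEWSEQUENTIALID() can only be used in a DEFAULT constraint on a uniqueidentifier column:
CREATE TABLE dbo.Example
(
    ExampleID uniqueidentifier NOT NULL
        CONSTRAINT DF_Example_ID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Example PRIMARY KEY CLUSTERED,
    Name nvarchar(50) NOT NULL
);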
The reason I make this choice is simple:
Integers are too easy to get mixed up, and on a table which has several foreign keys, the chances of "accidentally" putting a value in the wrong field is too high. You will never see the problem because ALL identity columns start at a seed of 1 and so most tables tend to have matching integer values in each table. By using uniqueidentifier I absolutely guarantee for all instances of a column that has a foreign key that the value I place in it is correct, because the table it references is the only table capable of having that unique identifier.
What's more... in code, your arguments would all be int, which again opens you up to the possibility of accidentally putting the wrong value in the wrong parameter and you would never know any different. By using unique identifiers instead, once again you are guaranteeing the correct reference.
Trying to track down bugs due to cross posted integers is insidious and the worst part is that you never know the problem has occurred until it is too late and data has become far too corrupted for you to ever unjumble. All it takes is one cross matched integer field and you could potentially create millions of inconsistent rows, none of which you would be aware of until you just "happen" to try and insert a value that doesn't exist in the referenced table... and by then it could be too late.

Why can't I put a constraint on nvarchar(max)?

Why can't I create a constraint on an nvarchar(max) column? SQL Server will not allow me to put a unique constraint on it. But, it allows me to create a unique constraint on an nvarchar(100) column.
Both of these columns are NOT NULL. Is there any reason I can't add a constraint to the nvarchar(max) column?
nvarchar(max) is really a different data type from nvarchar(integer-length). Its characteristics are more like those of the deprecated text data type.
If an nvarchar(max) value becomes too large then, like text, it will be stored outside the row (a row is limited to roughly 8,000 bytes) and only a pointer to it is stored in the row itself. You cannot efficiently index such a large field, and the fact that the data can be stored somewhere else further complicates searching and scanning the index.
A unique constraint requires an index to enforce it, and as a result the SQL Server designers decided to disallow creating a unique constraint on nvarchar(max).
Because MAX is really big (2^31 - 1 bytes) and could lead to a server meltdown if the server had to check for uniqueness on multi-megabyte-sized entries.
From the documentation on Create Index, I would assume this holds true for unique constraints as well.
The maximum allowable size of the combined index values is 900 bytes.
EDIT: If you really needed uniqueness, you could potentially approximate it by computing a hash of the data and storing that in a unique index. Even a large hash would be small enough to fit in an indexable column. You'd have to figure out how to handle collisions -- perhaps manually check on collisions and pad the data (changing the hash) if an errant collision is found.
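A possible sketch of that idea (the names are made up; note that SHA2_256 requires SQL Server 2012+, and before SQL Server 2016 HASHBYTES input is limited to 8,000 bytes):
CREATE TABLE dbo.BigText
(
    BigTextID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Body      nvarchar(max) NOT NULL,
    BodyHash  AS CAST(HASHBYTES('SHA2_256', Body) AS binary(32)) PERSISTED
);
-- the unique index is on the narrow hash, not on the max-typed column itself
CREATE UNIQUE INDEX UX_BigText_BodyHash ON dbo.BigText (BodyHash);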
A unique constraint is actually an index, and nvarchar(max) cannot be used as a key in an index.
