SQL Query on single table-valued parameter slow on large input - sql-server

I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES [User](Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUser) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
@offset INT,
@limit INT,
@input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows. The query gives good results when using a few user IDs as input, but very bad ones (30+ seconds) when providing hundreds or a couple of thousand user IDs. The main problem seems to be that it spends 63% of the effort on sorting.
What index am I missing? This seems to be a pretty straightforward query on a single table.

What types of values do you have for RelatedUser / UID? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules. That can be well worth it when working with real strings, but it is unnecessarily expensive when working with codes, especially with the typical default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR, which will take up half the space and sort / compare faster. Also, still use a binary collation.
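For illustration only, a minimal sketch of what the binary-collation option might look like on the key column (the table and column names follow the question's schema; everything else is an assumption):

CREATE TABLE [User]
(
    -- Binary collation: comparisons and sorts become simple byte comparisons.
    Id NVARCHAR(100) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY
    -- or, if Unicode is not needed:
    -- Id VARCHAR(100) COLLATE Latin1_General_100_BIN2 NOT NULL PRIMARY KEY
);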
Your best bet is to:
Add an INT IDENTITY column to the User table, named UserID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering the "Code" values (meaning: you cannot guarantee they will always use all upper-case or all lower-case), then it is best to add an AFTER INSERT, UPDATE trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This also means you need to make sure that all incoming queries use the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
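A hedged sketch of those steps might look something like the following. The names UserCode and UserID come from the list above; the constraint names and the exact order of operations are assumptions, and the existing primary key and foreign keys would have to be dropped and recreated around these statements:

EXEC sp_rename '[User].Id', 'UserCode', 'COLUMN';                  -- keep the old code as an alternate key
ALTER TABLE [User] ADD UserID INT IDENTITY(1,1) NOT NULL;          -- new surrogate key
-- after dropping the old PK on [User] and the FKs that reference it:
ALTER TABLE [User] ADD CONSTRAINT PK_User PRIMARY KEY CLUSTERED (UserID);
CREATE UNIQUE NONCLUSTERED INDEX UX_User_UserCode ON [User] (UserCode);

ALTER TABLE Related ADD UserID INT NULL;                           -- new FK column
UPDATE r SET r.UserID = u.UserID
FROM Related r
JOIN [User] u ON u.UserCode = r.RelatedUser;                       -- backfill from the old values
ALTER TABLE Related ALTER COLUMN UserID INT NOT NULL;
ALTER TABLE Related ADD CONSTRAINT FK_Related_User
    FOREIGN KEY (UserID) REFERENCES [User] (UserID);
-- and, once everything is verified (and the PK on Related is redefined):
-- ALTER TABLE Related DROP COLUMN RelatedUser;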
One other thing to consider: the TVP is a table variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help TVP performance in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables causes the query optimizer to see the true row count. If that doesn't help, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs), as those do maintain statistics and optimize better when you have more than a small number of rows in them.
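As a rough sketch, those two tricks applied to the query in GetStories (so @input, @offset, and @limit are the procedure's parameters) might look like this:

-- 1) Recompile so the optimizer sees the actual TVP row count:
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY
OPTION (RECOMPILE);

-- 2) Or dump the TVP into a local temp table, which does keep statistics:
SELECT UID INTO #TempUserIDs FROM @input;

SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM #TempUserIDs t WHERE t.UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;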
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and allows a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row: the INT IDENTITY column used as the actual UserID (it is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory, since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.

So I finally found a solution.
While @srutzky had good suggestions about normalizing the tables by changing the NVARCHAR UserId to an integer to minimize comparison cost, that was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
@Paparazzi suggested I add an index on (RelatedStory, CreationTime), and that did not do what I needed either. The reason was that I also needed to index RelatedUser, since that is how the query filters, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly one second, or at worst a couple of seconds.
I think what gave me the revelation was @srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all of my GROUP BY and ORDER BY columns in the index.
So while I can't mark either of the above posters' posts as the answer, I'd like to sincerely thank them for their time.

The main problem seems to be that it uses 63% of the effort on sorting.
ORDER BY CreationTime DESC
I would suggest an index on CreationTime.
Or try an index on (RelatedStory, CreationTime).
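A sketch of those two suggestions (the index names are placeholders):

CREATE INDEX i_creationtime ON Related (CreationTime);
CREATE INDEX i_story_creationtime ON Related (RelatedStory, CreationTime);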

Related

Joining on a non-PK field, does length of varchar datatype determine query speed? SQL Server 2008

I was given a ragtag assortment of data to analyze and am running into a predicament. I've got a ~2 million row table with a non-unique identifier of datatype varchar(50). This identifier is unique to a personID. Until I figure out exactly how I need to normalize this junk I've got another question that might help me right now: If I change the datatype to a varchar(25) for instance, will that help queries run faster when they're joined on a non-PK field? All of the characters in the string are integers, but trying to convert them to an int would cause overflow. Or could I possibly somehow index the column for the time being to get some of the queries to run faster?
EDIT: The personID will be a foreign key to another table with demographic information about a person.
Technically, the length of a varchar specifies its maximum length.
The actual length is variable (hence the name), so a lower maximum won't change how comparisons are evaluated, because they are made on the actual string.
For more information, check this MSDN article and this Stack Overflow post.
Varchar(50) to varchar(25) would certainly reduce the size of records in that table, thereby reducing the number of database pages that contain the table and improving the performance of queries (maybe to a marginal extent), but such an ALTER TABLE statement might take a long time.
Alternatively, if you define an index on the join columns, and if your retrieval list is small, you can also include those columns in the index definition (a covering index); that too would bring down query execution times significantly.
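For example, a covering index along these lines would let the join be resolved from the index alone (the table and column names here are placeholders for the asker's schema):

CREATE NONCLUSTERED INDEX IX_PersonData_PersonID
    ON dbo.PersonData (PersonID)      -- the varchar(50) join column
    INCLUDE (ColA, ColB);             -- the small retrieval list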

Which is faster comparing an uniqueidentifier or a string in tsql?

I have a table which holds the GUID for a user and their actual name as a string. I would like to grab some information based on a user. But which field should I use? Should my code say:
select *
from userinboxcount
where countDate >= startDate and countDate <= endDate and userid = '<guid here>'
or
select *
from userinboxcount
where countDate >= startDate and countDate <= endDate and username = 'FirstName LastName'
The biggest difference is if one field has an index that the database can use, and the other doesn't. If the database has to read all the data in the table to scan for the value, the disk access takes so much resources that the difference in data type is not relevant.
If both fields have indexes, then the index that is smaller would be somewhat faster, because it loads faster, and it's more likely that it remains in the cache.
Ideally you would have an index for all the fields in the condition, which has the fields that you want to return as included fields. That way the query can produce the result from only the index, and doesn't have to read from the actual table at all. You should of course not use select *, but specify the fields that you actually need to return.
Other than that, it would be somewhat faster to compare GUID values because it's a simple numeric comparison and doesn't have to consider lexical rules.
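A hedged sketch of such a covering index for the queries above (the INCLUDE list stands in for whichever columns you actually return instead of using select *):

CREATE NONCLUSTERED INDEX IX_userinboxcount_userid_countDate
    ON dbo.userinboxcount (userid, countDate)
    INCLUDE (username);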
Check the query plan and you can see for yourself.
But the unique identifier usually has an index and the string (username) might not. If so, and if there are many records, the unique identifier would probably be faster!
To view the query plan, check this article.
A GUID will be good enough.
1. The GUID will produce unique values in the table.
2. Create a non-clustered index on this column.
Reference: non-clustered indexes are particularly handy when we want to return a single row from a table.
Are you completely married to the GUID? You should use a GUID when you need a primary key that will be unique across multiple systems. I would suggest skipping the GUID and using a composite key. For example, you could use an identity plus a GETDATE() as a composite key. This would give you an easy way to query your data (try to remember a GUID over an integer). This will also perform much, much better than a GUID. Probably twice as fast.
If userid is a primary key, you should use that. If you use first and last name, you could have two John Smith entries, for example, and that could create an issue for you. Using the PK should be safer
On the performance side, it's a good idea to become familiar with the query's execution plan. I'd expect using the userid to be faster, but checking the plan should tell you for certain.

SQL server - worth indexing large string keys?

I have a table that has a large string key (varchar(1024)) that I was thinking of indexing on SQL Server (I want to be able to search over it quickly, but inserts are also important). In SQL Server 2008 I don't get a warning for this, but under SQL Server 2005 it tells me that it exceeds 900 bytes and that inserts/updates with the column over this size will be dropped (or something along those lines).
What are my alternatives if I want to index this large column? I don't know if it would be worth it anyway.
An index with all the keys near 900 bytes would be very large and very deep (very few keys per page result in very tall B-Trees).
It depends on how you plan to query the values. An index is useful in several cases:
when a value is probed. This is the most typical use: an exact value is searched in the table. Typical examples are WHERE column='ABC' or a join condition ON a.column = B.someothercolumn.
when a range is scanned. This is also fairly typical, when a range of values is searched in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF' there are other less obvious examples, like a partial match: WHERE column LIKE 'ABC%'.
an ordering requirement. This use is less well known, but indexes can help a query that has an explicit ORDER BY column requirement avoid a stop-and-go sort, and can also help certain hidden sort requirements, like a ROW_NUMBER() OVER (ORDER BY column).
So, what do you need the index for? What kind of queries would use it?
For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index vs. the benefits.
For probes you can, potentially, use hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index on that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column='ABC'. Careful consideration would have to be given to weighing the advantage of a narrow index (32 bit checksum) vs. the disadvantages of collision double-check and lack of range scan and order capabilities.
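A minimal sketch of that hash-column approach, using hypothetical table and column names (BigTable, BigColumn):

ALTER TABLE dbo.BigTable
    ADD BigColumn_checksum AS CHECKSUM(BigColumn) PERSISTED;   -- persisted computed checksum

CREATE NONCLUSTERED INDEX IX_BigTable_BigColumn_checksum
    ON dbo.BigTable (BigColumn_checksum);                      -- narrow 32-bit index key

-- Probe on the checksum (so the narrow index is used), then on the real value
-- to filter out hash collisions:
SELECT *
FROM dbo.BigTable
WHERE BigColumn_checksum = CHECKSUM('ABC')
  AND BigColumn = 'ABC';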
after the comment
I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along these lines:
create table values_dictionary (
    id int not null identity(1,1),
    value varchar(8000) not null,
    value_hash as checksum(value) persisted,   -- persisted computed hash of the value
    constraint pk_values_dictionary_id
        primary key nonclustered (id));

create unique clustered index cdx_values_dictionary_checksum
    on values_dictionary (value_hash, id);
go
create procedure usp_get_or_create_value_id (
    @value varchar(8000),
    @id int output)
as
begin
    declare @hash int = CHECKSUM(@value);
    set @id = NULL;

    -- look up the value by its hash, double-checking the actual value
    select @id = id
    from values_dictionary
    where value_hash = @hash
        and value = @value;

    -- not found: insert it and return the new id
    if @id is null
    begin
        insert into values_dictionary (value)
        values (@value);
        set @id = scope_identity();
    end
end
In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniqueifier column. This structure makes the lookup for @value as efficient as possible, without a hugely inefficient index on value, and bypasses the 900-byte limitation. The primary key on id is non-clustered, which means that looking up the value from an id incurs the overhead of one extra probe into the clustered index.
Not sure if this answers your problem; you obviously know more about your actual scenarios than I do. Also, the code does not handle error conditions and can actually insert duplicate @value entries, which may or may not be correct.
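A possible usage sketch of the procedure above, once created:

DECLARE @new_id INT;
EXEC usp_get_or_create_value_id
     @value = 'some very long value that would not fit in a 900-byte index key',
     @id    = @new_id OUTPUT;
SELECT @new_id AS value_id;   -- the int id for the value, inserted if it was new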
General Index Design Guidelines
When you design an index consider the following column guidelines:
Keep the length of the index key short for clustered indexes. Additionally, clustered indexes benefit from being created on unique or nonnull columns. For more information, see Clustered Index Design Guidelines.
Columns that are of the ntext, text, image, varchar(max), nvarchar(max), and varbinary(max) data types cannot be specified as index key columns. However, varchar(max), nvarchar(max), varbinary(max), and xml data types can participate in a nonclustered index as nonkey index columns. For more information, see Index with Included Columns.
Examine data distribution in the column. Frequently, a long-running query is caused by indexing a column with few unique values, or by performing a join on such a column. This is a fundamental problem with the data and query, and generally cannot be resolved without identifying this situation. For example, a physical telephone directory sorted alphabetically on last name will not expedite locating a person if all the people in the city are named Smith or Jones.

Using "varchar" as the primary key? bad idea? or ok?

Is it really that bad to use "varchar" as the primary key?
(will be storing user documents, and yes it can exceed 2+ billion documents)
It totally depends on the data. There are plenty of perfectly legitimate cases where you might use a VARCHAR primary key, but if there's even the most remote chance that someone might want to update the column in question at some point in the future, don't use it as a key.
If you are going to be joining to other tables, a varchar, particularly a wide varchar, can be slower than an int.
Additionally, if you have many child records and the varchar is something subject to change, cascade updates can cause blocking and delays for all users. A varchar like a car VIN that will rarely if ever change is fine. A varchar like a name that will change can be a nightmare waiting to happen. PKs should be stable if at all possible.
Next, many candidate varchar PKs are not really unique, and sometimes they appear to be unique (like phone numbers) but can be reused (you give up the number, the phone company reassigns it), and then child records could be attached to the wrong place. So be sure you really have a unique, unchanging value before using one.
If you do decide to use a surrogate key, then make a unique index for the varchar field. This gets you the benefits of the faster joins and fewer records to update if something changes, but maintains the uniqueness that you want.
Now if you have no child tables and probably never will, most of this is moot and adding an integer PK is just a waste of time and space.
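For illustration, a surrogate-key layout along those lines might look like this (the table and column names are made up):

CREATE TABLE dbo.Document
(
    DocumentID   INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_Document PRIMARY KEY,   -- narrow surrogate key used by FKs
    DocumentCode VARCHAR(100)      NOT NULL   -- the natural varchar value
);

-- Keep the natural key unique so lookups by the varchar still work:
CREATE UNIQUE NONCLUSTERED INDEX UX_Document_DocumentCode
    ON dbo.Document (DocumentCode);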
I realize I'm a bit late to the party here, but thought it would be helpful to elaborate a bit on previous answers.
It is not always bad to use a VARCHAR() as a primary key, but it almost always is. So far, I have not encountered a time when I couldn't come up with a better fixed size primary key field.
VARCHAR requires more processing than an integer (INT) or a short fixed length char (CHAR) field does.
In addition to storing extra bytes which indicate the "actual" length of the data stored in this field for each record, the database engine must do extra work to calculate the position (in memory) of the starting and ending bytes of the field before each read.
Foreign keys must also use the same data type as the primary key of the referenced parent table, so processing further compounds when joining tables for output.
With a small amount of data, this additional processing is not likely to be noticeable, but as a database grows you will begin to see degradation.
You said you are using a GUID as your key, so you know ahead of time that the column has a fixed length. This is a good time to use a fixed length CHAR(36) field, which incurs far less processing overhead.
I think int or bigint is often better.
int can be compared with fewer CPU instructions (join queries...)
an int sequence is ordered by default -> balanced index tree -> no reorganisation if you use the PK as the clustered index
the index potentially needs less space
Use an ID (this will come in handy if you want to show only 50, etc.). Then set a UNIQUE constraint on your varchar with the file names (I assume that is what you are storing).
This will do the trick and will increase speed.

Is adding an INT to a table whose PRIMARY KEY is a UNIQUEIDENTIFIER, for the purpose of a JOIN table, worthwhile?

I've got two tables in my SQL Server 2008 database, Users and Items
tblUser
--------------------------
UserID uniqueidentifier
Name nvarchar(50)
etc..
tblItem
--------------------------
ItemID uniqueidentifier
ItemName nvarchar(50)
etc..
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
I want to join these together in a many-to-many join table that will get huge (potentially more than a billion rows, as the application logic requires stats over shared user --> item joins).
The join table needs to be indexed on the UserID_A and UserID_B columns since the lookups are based on a user against their peers.
My question is this:
Is it worth adding an auto-increment INT to the user table to use as a non-primary key, and then using that in the join table? So the User table looks like:
tblUser
---------------------------------
UserID uniqueidentifier
Name nvarchar(50)
UserIDJoinKey int identity(1,1)
etc..
Doing that, will it be faster to do something like:
declare @ID int
select * from tblJoin where UserIDJoinKey_A = @ID or UserIDJoinKey_B = @ID
when the join table looks like this:
tlmUserUserItem
-----------------------------------
ItemID uniqueidentifier
UserIDJoinKey_A int
UserIDJoinKey_B int
rather than this:
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
Thanks in advance.
If you're having a performance problem on join operations to the table with the uniqueidentifier, first check the index fragmentation. Hot tables with a uniqueidentifier clustered index tend to get fragmented quickly. There's good info on how to do that at http://msdn.microsoft.com/en-us/library/ms189858.aspx
If you are able to move the clustered index to the new int column and rewrite your queries to use the new int column instead of the old uniqueidentifier, your biggest benefit is going to be that you'll reduce the rate of fragmentation. This helps avoid having your queries slow down after a bunch of writes to the table.
In most cases, you will not notice a huge difference in the time to process join operations on a uniqueidentifier column versus an int in MSSQL 2008 -- assuming all other things (including fragmentation) are equal.
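A rough sketch of such a fragmentation check using sys.dm_db_index_physical_stats (tlmUserUserItem is the join table from the question; the 'LIMITED' scan mode is the cheapest):

SELECT i.name AS index_name,
       ps.index_type_desc,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.tlmUserUserItem'),
                                    NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id;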
I may be misunderstanding something along the line, but you're looking to add an identity AND a uniqueidentifier to each record? When I see a GUID being used, I assume there is either offline functionality that will be merged when the user goes online, or there is some extraneous reason the GUID was chosen. That reason may well prevent you from correctly implementing an identity column on each item.
If there is no specific reason why you needed to use a guid over an identity, I'd say scrap the GUID all together. It's bloating your tables, indexes, and slowing down your joins. If I'm misunderstanding please let me know and I apologize!
To find out what the best solution is, first some indexing theory. SQL Server stores its clustered index data in a B+ tree of data pages, which allows for about 8K of data per page.
A uniqueidentifier is 16 bytes per key and an int is 4 bytes per key, which means there will be 4 times more keys per index page with an int.
To have a faster join with the int column you will most likely have to make it the clustered index. Be aware that having an additional index on such a large table might create an unwanted performance hit on insert statements, as there is more information to write to disk.
It all boils down to benchmarking both solutions and choosing the one which performs best for you. If the table is more read-heavy, the int column will offer better overall performance.
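As a sketch only, the int-keyed version of the join table with the clustered index on the narrow keys might look like this (the column order in the indexes is an assumption based on the lookups in the question):

CREATE TABLE dbo.tlmUserUserItem
(
    ItemID          UNIQUEIDENTIFIER NOT NULL,
    UserIDJoinKey_A INT              NOT NULL,
    UserIDJoinKey_B INT              NOT NULL
);

-- Clustered on the narrow int keys; a second index covers lookups by the other user.
CREATE CLUSTERED INDEX CX_UserUserItem_A
    ON dbo.tlmUserUserItem (UserIDJoinKey_A, UserIDJoinKey_B);

CREATE NONCLUSTERED INDEX IX_UserUserItem_B
    ON dbo.tlmUserUserItem (UserIDJoinKey_B)
    INCLUDE (ItemID);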
@MikeM,
Personally, I would choose a uniqueidentifier over an int as the primary key of a table every time. I would however use NEWSEQUENTIALID() and not NEWID(), to ensure there is less index fragmentation.
The reason I make this choice is simple:
Integers are too easy to get mixed up, and on a table which has several foreign keys, the chances of "accidentally" putting a value in the wrong field is too high. You will never see the problem because ALL identity columns start at a seed of 1 and so most tables tend to have matching integer values in each table. By using uniqueidentifier I absolutely guarantee for all instances of a column that has a foreign key that the value I place in it is correct, because the table it references is the only table capable of having that unique identifier.
What's more... in code, your arguments would all be int, which again opens you up to the possibility of accidentally putting the wrong value in the wrong parameter and you would never know any different. By using unique identifiers instead, once again you are guaranteeing the correct reference.
Trying to track down bugs due to cross posted integers is insidious and the worst part is that you never know the problem has occurred until it is too late and data has become far too corrupted for you to ever unjumble. All it takes is one cross matched integer field and you could potentially create millions of inconsistent rows, none of which you would be aware of until you just "happen" to try and insert a value that doesn't exist in the referenced table... and by then it could be too late.
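For reference, a sketch of the NEWSEQUENTIALID() approach (it can only be used as a column DEFAULT, not called directly in a query):

CREATE TABLE dbo.tblUser
(
    UserID UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_tblUser_UserID DEFAULT NEWSEQUENTIALID(),   -- sequential GUIDs reduce page splits
    Name   NVARCHAR(50)     NOT NULL,
    CONSTRAINT PK_tblUser PRIMARY KEY CLUSTERED (UserID)
);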
