Allocate unique codes to records

We are being given a range of numbers which need to be used on some of our orders.
We must ensure that each number is allocated only once.
The range of numbers may be more than a million items (not sure at this stage how many digits but assuming less than 10).
I was thinking of prepopulating a table with these numbers so that every time one is used we can update that row with perhaps the orderid.
Table would be:
Id int, Code int, OrderId uniqueidentifier, Date datetime
When getting the latest number for allocating to an order we would find the min(Code) where OrderId is null
Id would be the primary key, but the clustered index would be set on the Code column, since that is what the most important query will use.
Does this sound like a good way to go about doing this? Is there a better way?
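For illustration, here is a minimal sketch of the allocation step for the proposed design, assuming the pre-populated table is called dbo.Codes with the columns listed above; the UPDLOCK/READPAST hints and the OUTPUT clause are one common way to claim the lowest free code atomically when several orders are allocated concurrently:
DECLARE @OrderId uniqueidentifier = NEWID();  -- the order being given a code
DECLARE @Allocated TABLE (Code int);

WITH NextCode AS
(
    SELECT TOP (1) Code, OrderId, Date
    FROM dbo.Codes WITH (UPDLOCK, READPAST)
    WHERE OrderId IS NULL
    ORDER BY Code
)
UPDATE NextCode
SET OrderId = @OrderId,
    Date    = GETDATE()
OUTPUT inserted.Code INTO @Allocated;

SELECT Code FROM @Allocated;  -- the code just claimed for this order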

Using LIKE in WHERE clause for GUIDs results in full table scan

I have a table that looks something like this:
CREATE TABLE Records
(
ID UNIQUEIDENTIFIER PRIMARY KEY NONCLUSTERED,
owner UNIQUEIDENTIFIER,
value FLOAT,
timestamp DATETIME
)
There is a multi-column clustered index on some other columns not relevant to this question.
The table currently has about 500,000,000 rows, and I need to operate on the table but it's too large to deal with currently (I am hampered by slow hardware). So I decided to work on it in chunks.
But if I say
SELECT ID
FROM records
WHERE ID LIKE '0000%'
The execution plan shows that the ENTIRE TABLE is scanned. I thought that with an index, only those rows that matched the original condition would be scanned until SQL reached the '0001' records. With the % in front, I could clearly see why it would scan the whole table. But with the % at the end, it shouldn't have to scan the whole table.
I am guessing this works differently with GUIDs than with CHAR or VARCHAR columns.
So my question is this: how can I search for a subsection of GUIDs without having to scan the whole table?
From your comments, I see the actual need is to break the rows of random GUID values into chunks (ordered) based on range. In this case, you can specify a range instead of LIKE along with a filter on the desired start/end values in the last group:
SELECT ID
FROM dbo.records
WHERE
ID BETWEEN '00000000-0000-0000-0000-000000000000'
AND '00000000-0000-0000-0000-000FFFFFFFFF';
This article explains how uniqueidentifiers (GUIDs) are stored and ordered in SQL Server, comparing and sorting the last group first rather than left-to-right as you might expect. By filtering on the last group, you'll get a sargable expression and touch only those rows in the specified range (assuming an index on ID is used).
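As a quick way to see that ordering in action (just a table variable, nothing else assumed): the value whose last group is all zeros sorts first even though its first group is larger.
DECLARE @g TABLE (id uniqueidentifier);
INSERT INTO @g (id) VALUES
    ('00000001-0000-0000-0000-000000000000'),
    ('00000000-0000-0000-0000-000000000001');

SELECT id FROM @g ORDER BY id;
-- Returns 00000001-0000-0000-0000-000000000000 first, because its last
-- group (000000000000) sorts before 000000000001.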

SQL Query on single table-valued parameter slow on large input

I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES [User](Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUser) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
    @offset INT,
    @limit INT,
    @input UserIdInput READONLY
AS
BEGIN
    SELECT RelatedStory
    FROM Related
    WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
    GROUP BY RelatedStory, CreationTime
    ORDER BY CreationTime DESC
    OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows. The query gives good results when using a few user ids as input, but very bad (30+ second) results when providing hundreds or a couple of thousand user ids as input. The main problem seems to be that it uses 63% of the effort on sorting.
What index am I missing? This seems to be a pretty straightforward query on a single table.
What types of values do you have for RelatedUser / UID? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typical default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.
Your best bet is to:
Add an INT IDENTITY column to the User table, named UserID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering in the "Code" values (meaning: you cannot guarantee they will always use all upper-case or all lower-case), then it is best to add an AFTER INSERT, UPDATE trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This will also mean that you need to make sure that all incoming queries use the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
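A rough sketch of the restructuring described in the steps above (constraint names and the UserCode column name are assumptions, and real DDL would need to backfill data and drop the old column/type first):
ALTER TABLE [User] ADD UserID INT IDENTITY(1, 1) NOT NULL;  -- new surrogate key
-- ...make UserID the clustered PK and rename the existing Id column to UserCode...

CREATE UNIQUE NONCLUSTERED INDEX UIX_User_UserCode
    ON [User] (UserCode);                                   -- the "alternate key"

ALTER TABLE Related ADD UserID INT NULL;                    -- backfill from [User], then ALTER to NOT NULL
ALTER TABLE Related ADD CONSTRAINT FK_Related_User
    FOREIGN KEY (UserID) REFERENCES [User] (UserID);
-- ...then drop the RelatedUser column...

-- Recreated table type with INT instead of NVARCHAR(100):
CREATE TYPE UserIdInput AS TABLE
(
    UID INT PRIMARY KEY CLUSTERED
);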
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
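Sketched against the procedure above, the two workarounds look like this (the first simply appends OPTION (RECOMPILE) to the existing query; the second dumps the TVP into a temp table first):
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY
OPTION (RECOMPILE);

-- Or: copy the TVP into a local temp table, which does maintain statistics,
-- and reference #TempUserIDs instead of @input in the query above.
SELECT UID INTO #TempUserIDs FROM @input;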
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (it is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory, since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
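A small illustration of that semantic difference (collation names are just examples): under a case-insensitive linguistic collation the strings compare equal, under a binary collation they do not, and the binary comparison is also far cheaper.
SELECT
    CASE WHEN N'abc' COLLATE Latin1_General_100_CI_AS = N'ABC'
         THEN 1 ELSE 0 END AS linguistic_ci_match,  -- 1
    CASE WHEN N'abc' COLLATE Latin1_General_100_BIN2 = N'ABC'
         THEN 1 ELSE 0 END AS binary_match;         -- 0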
So I finally found a solution.
While @srutzky had good suggestions of normalizing the tables by changing the NVARCHAR UserId to an integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
@Paparazzi suggested I add an index on (RelatedStory, CreationTime), and that did not do what I needed either. The reason was that I also needed to index RelatedUser, as that is how the query filters, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly one second, or at worst a couple of seconds.
I think what gave me the revelation was @srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my GROUP BY and ORDER BY columns in the index.
So while I can't mark either of the above posters' posts as the answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on
sorting.
ORDER BY CreationTime DESC
I would suggest an index on CreationTime
Or try an index on RelatedStory, CreationTime
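In DDL form, those two suggestions would look something like this (index names are placeholders; the INCLUDE on the first one is just one possible shape):
CREATE INDEX i_creationtime ON Related (CreationTime DESC) INCLUDE (RelatedStory);
-- or
CREATE INDEX i_story_creationtime ON Related (RelatedStory, CreationTime DESC);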

Concatenate bigint values in SQL Server

I have a table with 3 columns:
PersonId uniqueidentifier -- key
DeviceId uniqueidentifier -- key
Counter bigint
The counter comes in ascending order but sometimes has gaps. An example of the counter values is (1,2,3,1000,10000,10001,10002,...). The counter value is saved one at a time. If I insert one row per counter value, the table gets big very fast. I must keep the last 1000 counter values and can delete earlier values.
Is it possible to concatenate the counter values into 1 or a few rows in a varbinary(8000) column, and remove early values at the beginning of the binary as part of the insert operation? I would like help in writing this query. I prefer not to use varchar because each character would take up 2 bytes. There may be a better way than I can envision. Any help is appreciated!
Why are you trying to do that, versus using a table with PersonID, DeviceID, Counter and allowing only a certain number of Counters per PersonID and DeviceID pair?
If your goal is to save space, keep in mind that a varbinary(8000) column caps you at 8000 bytes, i.e. at most 1000 bigint values per row, regardless of how many counters you actually have.
How likely is it that most of these PersonID and DeviceID pairs will have 1000 counters?
In the end, you are just making it more complicated for yourself and harder for future employees to maintain, but are you really saving space?
You are also going to have to add some transaction process which will take away more of your server's resources.
But to answer your question strictly: yes, it is possible. I guess a trigger or a process at the end of a sproc could handle what you are trying to do.
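Strictly as a sketch of what was asked, assuming a table Counters (PersonId, DeviceId, CounterBlob varbinary(8000)) with names chosen here for illustration: each counter is a fixed 8 bytes, so 8000 bytes holds exactly the last 1000 values, and once the blob is full the oldest 8 bytes are dropped before appending.
CREATE PROCEDURE AppendCounter
    @PersonId uniqueidentifier,
    @DeviceId uniqueidentifier,
    @NewCounter bigint
AS
BEGIN
    UPDATE Counters
    SET CounterBlob =
            CASE WHEN DATALENGTH(ISNULL(CounterBlob, 0x)) >= 8000
                 THEN SUBSTRING(CounterBlob, 9, 7992)  -- drop the oldest 8-byte value
                 ELSE ISNULL(CounterBlob, 0x)
            END + CAST(@NewCounter AS varbinary(8))
    WHERE PersonId = @PersonId
      AND DeviceId = @DeviceId;
END;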
Maybe you can choose an in-between solution: Make each row have 10 or so columns of type bigint. That will cut the number of rows by ten and reduce per-row overhead by 10x also. Maybe that is enough for you.

SQL server - worth indexing large string keys?

I have a table that has a large string key (varchar(1024)) that I was thinking of indexing in SQL Server (I want to be able to search over it quickly, but inserts are also important). In SQL Server 2008 I don't get a warning for this, but SQL Server 2005 tells me that the key exceeds 900 bytes and that inserts/updates where the column is over this size will fail (or something along those lines).
What are my alternatives if I want to index this large column? I don't know whether it would be worth it anyway.
An index with all the keys near 900 bytes would be very large and very deep (very few keys per page result in very tall B-Trees).
It depends on how you plan to query the values. An index is useful in several cases:
when a value is probed. This is the most typical use: an exact value is searched for in the table. Typical examples are WHERE column='ABC' or a join condition ON a.column = B.someothercolumn.
when a range is scanned. This is also fairly typical when a range of values is searched in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF' there are other less obvious examples, like a partial match: WHERE column LIKE 'ABC%'.
an ordering requirement. This use is less known, but indexes can help a query that has an explicit ORDER BY column requirement to avoid a stop-and-go sort, and can also help certain hidden sort requirements, like a ROW_NUMBER() OVER (ORDER BY column).
So, what do you need the index for? What kind of queries would use it?
For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index vs. the benefits.
For probes you can, potentially, use a hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index on that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column='ABC'. Careful consideration would have to be given to weighing the advantage of a narrow index (32-bit checksum) vs. the disadvantages of collision double-checks and the lack of range scan and ordering capabilities.
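As a sketch of that hash-probe idea on a hypothetical table dbo.Docs with a wide key column big_key varchar(1024):
ALTER TABLE dbo.Docs
    ADD big_key_checksum AS CHECKSUM(big_key) PERSISTED;

CREATE INDEX ix_docs_big_key_checksum ON dbo.Docs (big_key_checksum);

-- Probe via the narrow 32-bit hash, then re-check the full value to weed out
-- checksum collisions:
SELECT *
FROM dbo.Docs
WHERE big_key_checksum = CHECKSUM('ABC')
  AND big_key = 'ABC';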
After the comment:
I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along these lines:
create table values_dictionary (
    id int not null identity(1,1),
    value varchar(8000) not null,
    value_hash as checksum(value) persisted,
    constraint pk_values_dictionary_id
        primary key nonclustered (id));

create unique clustered index cdx_values_dictionary_checksum
    on values_dictionary (value_hash, id);
go
create procedure usp_get_or_create_value_id (
    @value varchar(8000),
    @id int output)
as
begin
    declare @hash int = CHECKSUM(@value);
    set @id = NULL;

    select @id = id
    from values_dictionary
    where value_hash = @hash
        and value = @value;

    if @id is null
    begin
        insert into values_dictionary (value)
        values (@value);
        set @id = scope_identity();
    end
end
In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniqueifier column. This structure makes the lookup for @value as efficient as possible, without a hugely inefficient index on value, and bypasses the 900-byte limitation. The primary key on id is non-clustered, which means that looking up the value from an id incurs the overhead of one extra probe in the clustered index.
Not sure if this answers your problem; you obviously know more about your actual scenarios than I do. Also, the code does not handle error conditions and can actually insert duplicate @value entries, which may or may not be correct.
General Index Design Guidelines
When you design an index, consider the following column guidelines:

Keep the length of the index key short for clustered indexes. Additionally, clustered indexes benefit from being created on unique or nonnull columns. For more information, see Clustered Index Design Guidelines.

Columns that are of the ntext, text, image, varchar(max), nvarchar(max), and varbinary(max) data types cannot be specified as index key columns. However, varchar(max), nvarchar(max), varbinary(max), and xml data types can participate in a nonclustered index as nonkey index columns. For more information, see Index with Included Columns.

Examine data distribution in the column. Frequently, a long-running query is caused by indexing a column with few unique values, or by performing a join on such a column. This is a fundamental problem with the data and query, and generally cannot be resolved without identifying this situation. For example, a physical telephone directory sorted alphabetically on last name will not expedite locating a person if all people in the city are named Smith or Jones.

Is adding an INT to a table whose PRIMARY KEY is a UNIQUEIDENTIFIER for the purpose of a JOIN table worthwhile?

I've got two tables in my SQL Server 2008 database, Users and Items
tblUser
--------------------------
UserID uniqueidentifier
Name nvarchar(50)
etc..
tblItem
--------------------------
ItemID uniqueidentifier
ItemName nvarchar(50)
etc..
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
I want to join these together in a many-to-many join table that will get huge (potentially more than a billion rows, as the application logic requires stats over shared user --> item joins).
The join table needs to be indexed on the UserID_A and UserID_B columns since the lookups are based on a user against their peers.
My question is this:
Is it worth adding an auto-increment INT to the user table as a non-primary key, and then using that in the join table? So the User table would look like:
tblUser
---------------------------------
UserID uniqueidentifier
Name nvarchar(50)
UserIDJoinKey int identity(1,1)
etc..
Doing that, will it be faster to do something like:
declare @ID int
select * from tblJoin where UserIDJoinKey_A = @ID or UserIDJoinKey_B = @ID
when the join table looks like this:
tlmUserUserItem
-----------------------------------
ItemID uniqueidentifier
UserIDJoinKey_A int
UserIDJoinKey_B int
rather than this:
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
Thanks in advance.
If you're having a performance problem on join operations to the table with the uniqueidentifier, first check the index fragmentation. Hot tables with a uniqueidentifier clustered index tend to get fragmented quickly. There's good info on how to do that at http://msdn.microsoft.com/en-us/library/ms189858.aspx
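For example, one way to check fragmentation on the join table's indexes (table name taken from the schema above, adjust as needed):
SELECT  i.name,
        ips.index_type_desc,
        ips.avg_fragmentation_in_percent,
        ips.page_count
FROM    sys.dm_db_index_physical_stats(
            DB_ID(), OBJECT_ID('dbo.tlmUserUserItem'), NULL, NULL, 'LIMITED') AS ips
JOIN    sys.indexes AS i
        ON  i.object_id = ips.object_id
        AND i.index_id  = ips.index_id;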
If you are able to move the clustered index to the new int column and rewrite your queries to use the new int column instead of the old uniqueidentifier, your biggest benefit is going to be that you'll reduce the rate of fragmentation. This helps avoid having your queries slow down after a bunch of writes to the table.
In most cases, you will not notice a huge difference in the time to process join operations on a uniqueidentifier column versus an int in MSSQL 2008 -- assuming all other things (including fragmentation) are equal.
I may be misunderstanding something along the line, but you're looking to add an identity AND a uniqueidentifier to each record? When I see you using a GUID, I assume there is either offline functionality that will be merged when the user goes online, or there is some extraneous reason the GUID was chosen. That reason may prevent you from correctly implementing an identity column on each item.
If there is no specific reason why you needed to use a GUID over an identity, I'd say scrap the GUID altogether. It's bloating your tables and indexes, and slowing down your joins. If I'm misunderstanding, please let me know and I apologize!
To find out what the best solution is, there is first some indexing theory. SQL Server stores its clustered index data in a B+ tree of data pages, which allows for about 8 KB of data per page.
Given that a uniqueidentifier is 16 bytes per key and an int is 4 bytes per key, there will be four times more keys per index page with an int.
To have a faster join with the int column you will most likely have to make it the clustered index. Be aware that having an additional index on such a large table might create an unwanted performance hit on insert statements, as there is more information to write to disk.
It all boils down to benchmark both solutions and choosing the one which performs best for you. If the table is more read heavy, the int column will offer overall better performance.
@MikeM,
Personally I would always choose a uniqueidentifier over an int as the primary key of a table, every time. I would however use NEWSEQUENTIALID() and not NEWID() to ensure there is less index fragmentation.
The reason I make this choice is simple:
Integers are too easy to get mixed up, and on a table which has several foreign keys, the chances of "accidentally" putting a value in the wrong field is too high. You will never see the problem because ALL identity columns start at a seed of 1 and so most tables tend to have matching integer values in each table. By using uniqueidentifier I absolutely guarantee for all instances of a column that has a foreign key that the value I place in it is correct, because the table it references is the only table capable of having that unique identifier.
What's more... in code, your arguments would all be int, which again opens you up to the possibility of accidentally putting the wrong value in the wrong parameter and you would never know any different. By using unique identifiers instead, once again you are guaranteeing the correct reference.
Trying to track down bugs due to cross posted integers is insidious and the worst part is that you never know the problem has occurred until it is too late and data has become far too corrupted for you to ever unjumble. All it takes is one cross matched integer field and you could potentially create millions of inconsistent rows, none of which you would be aware of until you just "happen" to try and insert a value that doesn't exist in the referenced table... and by then it could be too late.
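As a side note on the NEWSEQUENTIALID() suggestion above: it can only be used as a column default, so a sketch based on the tblUser shape earlier in this question would look like this (constraint names are illustrative):
CREATE TABLE tblUser
(
    UserID uniqueidentifier NOT NULL
        CONSTRAINT DF_tblUser_UserID DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_tblUser PRIMARY KEY CLUSTERED,
    Name nvarchar(50) NULL
);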
