Huge Integer keys in SQL Server

I need to create a table with a key that is a 256-bit hash number. Fast searching and retrieval are crucial, so I am wondering what data type to use for the key.
One option would be a varchar(32), but I guess searching will be very slow, and the amount of stored data will be much higher than with a numerical solution.
A second option would be two decimal(16) integers combined into a compound key, but I am sceptical whether that would search faster than option #1.
I googled the topic but didn't find solutions; perhaps there is some third option? Any hints appreciated.

It is good for the PRIMARY KEY of a table to be a surrogate key and a number if possible: SMALLINT, INT, or BIGINT with IDENTITY applied. Using a definition like this:
[RowID] INT IDENTITY(1,1)
will help you avoid some common issues - most importantly, new records are appended at the end of the last index page, so there is no page splitting/fragmentation on insert.
An additional column can be added for your hash value, and you can create an index on it to make searching by hash faster.
For example, I have an IP addresses table holding all addresses used in the application (basically, the addresses users log in from).
The table looks like this:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[IPAddresses]
(
[IPAddressID] BIGINT IDENTITY(10000,1) NOT NULL
,[IPAddress] VARBINARY(84) NOT NULL
,[IPAddressHash] VARBINARY(64) NULL
,CONSTRAINT [PK_IPAddresses] PRIMARY KEY
(
[IPAddressID] ASC
)
,INDEX [IX_IPAddresses_IPAddressHash]
(
[IPAddressHash] ASC
)
)
GO
As the IPAddress is personal data, it should be encrypted. And as I want my data to be normalized and I do not want duplicated records, I need to check each time a user logs in whether the address exists - if not, create it. I am doing this using the following routine:
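A minimal sketch of such a routine, with hypothetical names and assuming a SHA2_256 hash of the stored value, could look like this:
CREATE PROCEDURE [dbo].[usp_GetOrCreateIPAddressID]
    @IPAddress   VARBINARY(84),
    @IPAddressID BIGINT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- Hypothetical hashing scheme: hash the incoming value once
    DECLARE @Hash VARBINARY(64) = HASHBYTES('SHA2_256', @IPAddress);

    -- Search by the narrow, indexed hash instead of decrypting every row
    SELECT @IPAddressID = [IPAddressID]
    FROM [dbo].[IPAddresses]
    WHERE [IPAddressHash] = @Hash;

    -- Not found: insert it (append-only, so no fragmentation)
    IF @IPAddressID IS NULL
    BEGIN
        INSERT INTO [dbo].[IPAddresses] ([IPAddress], [IPAddressHash])
        VALUES (@IPAddress, @Hash);

        SET @IPAddressID = SCOPE_IDENTITY();
    END
END
GO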
Here, I am passing the address and calculating a hash by which I am searching. The original version, instead of using the hash, decrypted all the values and searched by the text, but for millions of IPs this was very slow, and this routine executes constantly. On the other hand, I am performing only inserts in this table and records are only appended - so there is no fragmentation at all.
So, my advice is:
use a numeric column with IDENTITY as the primary key
add the rest of the columns to your table
add a hash column and build the hash from the column(s) used for searches
create an index on this hash column
then, when you need to search for a record, use the hash to locate the PK ID and then use the ID to extract the record (see the example below)
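For example, with the hypothetical IP address table above, the lookup could look like this (names and the hashing scheme are illustrative):
-- search by hash to locate the PK ID, then fetch the record by that ID
DECLARE @SomeValue VARBINARY(84) = CONVERT(VARBINARY(84), 'example');
DECLARE @Hash VARBINARY(64) = HASHBYTES('SHA2_256', @SomeValue);
DECLARE @ID BIGINT;

SELECT @ID = IPAddressID
FROM dbo.IPAddresses
WHERE IPAddressHash = @Hash;   -- seek on the narrow hash index

SELECT *
FROM dbo.IPAddresses
WHERE IPAddressID = @ID;       -- fetch the full record by its PK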

Related

In SQL Server is it possible to create a table that has a unique Id?

My boss has assigned a SQL task to me and as I am new to SQL, I am struggling to know where to start.
Task: Create a Customer table to hold the data written in the #Customer temporary table in the PopulateCustomers stored procedure. This table will also need to have a unique id to ensure multiple instances of the populate functionality can be run concurrently.
I know how to create a table in SQL, and I am guessing I can look in the PopulateCustomers stored procedure to see what data will be written to the temp Customer table in order to create columns for the Customer table.
But what I am really struggling with is the concept of a unique Id for a database table. I immediately thought of a primary key for each row in the table, but my boss responded no to that; I didn't want to push for more so as not to come across as a newbie.
I have tried to google this myself and all I keep coming up with is pages that tell me about identifiers vs primary keys. But nothing ever tells me about a table having its own unique ID unless it's in reference to the rows within the table each having an identifier or primary key. This is leading me to think that I am not searching for the right keyword for this functionality.
The closest thing I found was here. http://sqlservercodebook.blogspot.com/2008/03/check-if-temporary-table-exists.html
This query looks to me like it's creating a temp table with an id.
CREATE TABLE #temp(id INT)
I have not pasted any of my work queries because I really want to research myself and figure this out. I just want to make sure I am looking in the right direction with what term I need to search for to find out how to create a table that has a unique ID. Or maybe I have misinterpreted the task and there is no such thing.
What I got from your story is that you need a table with a unique id, automatically generated, and that you should use this id as the primary key.
This table can be created like:
create table example
(
id int identity(1,1) primary key clustered,
other_data varchar(200)
)
The key terms here are:
identity - makes the id column auto-incremented
primary key - SQL Server ensures this column is unique
clustered - all the data in this table is physically organized by this column (which makes it faster to search by)
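A quick, hypothetical usage example - each insert generates the next id automatically:
insert into example (other_data) values ('first row');
insert into example (other_data) values ('second row');

select scope_identity() as last_id;   -- id generated by the most recent insert in this scope
select id, other_data from example;   -- rows come back with ids 1, 2, ...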

SQL Query on single table-valued parameter slow on large input

I have a table with this simple definition:
CREATE TABLE Related
(
RelatedUser NVARCHAR(100) NOT NULL FOREIGN KEY REFERENCES User(Id),
RelatedStory BIGINT NOT NULL FOREIGN KEY REFERENCES Story(Id),
CreationTime DateTime NOT NULL,
PRIMARY KEY(RelatedUser, RelatedStory)
);
with these indexes:
CREATE INDEX i_relateduserid
ON Related (RelatedUser) INCLUDE (RelatedStory, CreationTime)
CREATE INDEX i_relatedstory
ON Related(RelatedStory) INCLUDE (RelatedUser, CreationTime)
And I need to query the table for all stories related to a list of UserIds, ordered by Creation Time, and then fetch only X and skip Y.
I have this stored procedure:
CREATE PROCEDURE GetStories
@offset INT,
@limit INT,
@input UserIdInput READONLY
AS
BEGIN
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY;
END;
Using this User-Defined Table Type:
CREATE TYPE UserIdInput AS TABLE
(
UID nvarchar(100) PRIMARY KEY CLUSTERED
)
The table has 13 million rows and gives me good results when using a few userids as input, but very bad ones (30+ seconds) when providing hundreds or a couple of thousand userids as input. The main problem seems to be that it spends 63% of the effort on sorting.
What index am I missing? This seems to be a pretty straightforward query on a single table.
What types of values do you have for RelatedUser / UID ? Why, exactly, are you using NVARCHAR(100) for it? NVARCHAR is usually a horrible choice for a PK / FK field. Even if the value is a simple, alphanumeric code (e.g. ABTY1245) there are better ways of handling this. One of the main problems with NVARCHAR (and even with VARCHAR for this particular issue) is that, unless you are using a binary collation (e.g. Latin1_General_100_BIN2), every sort and comparison operation will apply the full range of linguistic rules, which can be well worth it when working with strings, but unnecessarily expensive when working with codes, especially when using the typically default case-insensitive collations.
Some "better" (but not ideal) solutions would be:
If you really do need Unicode characters, at least specify a binary collation, such as Latin1_General_100_BIN2.
If you do not need Unicode characters, then switch to using VARCHAR which will take up half the space and sort / compare faster. Also, still use a binary Collation.
Your best bet is to (a rough sketch follows this list):
Add an INT IDENTITY column to the User table, named UserID
Make UserID the Clustered PK
Add an INT (no IDENTITY) column to the Related table, named UserID
Add an FK from Related back to User on UserID
Remove the RelatedUser column from the Related table.
Add a non-clustered, Unique Index to the User table on the UserCode column (this makes it an "alternate key")
Drop and recreate the UserIdInput User-Defined Table Type to have an INT datatype instead of NVARCHAR(100)
If at all possible, alter the ID column of the User table to have a binary collation (i.e. Latin1_General_100_BIN2)
If possible, rename the current Id column in the User table to be UserCode or something like that.
If users are entering in the "Code" values (meaning: cannot guarantee they will always use all upper-case or all lower-case), then best to add an AFTER INSERT, UPDATE Trigger on the User table to ensure that the values are always all upper-case (or all lower-case). This will also mean that you need to make sure that all incoming queries using the same all upper-case or all lower-case values when searching on the "Code". But that little bit of extra work will pay off.
The entire system will thank you, and show you its appreciation by being more efficient :-).
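A rough sketch of the resulting shape, using simplified and hypothetical definitions (other columns and the FK to Story omitted), might be:
CREATE TABLE dbo.[User]
(
    UserID   INT IDENTITY(1, 1) NOT NULL,
    UserCode NVARCHAR(100) COLLATE Latin1_General_100_BIN2 NOT NULL,  -- the old "Id" value
    CONSTRAINT PK_User PRIMARY KEY CLUSTERED (UserID),
    CONSTRAINT UQ_User_UserCode UNIQUE NONCLUSTERED (UserCode)        -- the "alternate key"
);

CREATE TABLE dbo.Related
(
    UserID       INT      NOT NULL
        CONSTRAINT FK_Related_User REFERENCES dbo.[User] (UserID),
    RelatedStory BIGINT   NOT NULL,
    CreationTime DATETIME NOT NULL,
    CONSTRAINT PK_Related PRIMARY KEY (UserID, RelatedStory)
);

CREATE TYPE dbo.UserIdInput AS TABLE
(
    UID INT NOT NULL PRIMARY KEY CLUSTERED
);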
One other thing to consider: the TVP is a table-variable, and by default those only ever appear to the query optimizer to have a single row. So it makes some sense that adding a few thousand entries into the TVP would slow it down. One trick to help speed up TVP in this scenario is to add OPTION (RECOMPILE) to the query. Recompiling queries with table variables will cause the query optimizer to see the true row count. If that doesn't help any, the other trick is to dump the TVP table variable into a local temporary table (i.e. #TempUserIDs) as those do maintain statistics and optimize better when you have more than a small number of rows in them.
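For example, the procedure's query above could carry the hint like this:
SELECT RelatedStory
FROM Related
WHERE EXISTS (SELECT 1 FROM @input WHERE UID = RelatedUser)
GROUP BY RelatedStory, CreationTime
ORDER BY CreationTime DESC
OFFSET @offset ROWS FETCH NEXT @limit ROWS ONLY
OPTION (RECOMPILE);   -- lets the optimizer see the real row count in the TVP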
From O.P.'s comment on this answer:
[UID] is an ID used across our system (XXX-Y-ZZZZZZZZZZ...), XXX being letters, Y being a number and Z being numbers
Yes, I figured it was an ID or code of some sort, so that doesn't change my advice. NVARCHAR, especially if using a non-binary, case-insensitive collation, is probably one of the worst choices of datatype for this value. This ID should be in a column named UserCode in the User table with a non-clustered index defined on it. This makes it an "alternate" key and a quick and easy lookup from the app layer, one time, to get the "internal" integer value for that row, the INT IDENTITY column as the actual UserID (it is usually best to name ID columns as {table_name}ID for consistency / easier maintenance over time). The UserID INT value is what goes into all related tables to be the FK. An INT column will JOIN much faster than an NVARCHAR. Even using a binary collation, this NVARCHAR column, while being faster than its current implementation, will still be at least 32 bytes (based on the given example of XXX-Y-ZZZZZZZZZZ) whereas the INT will be just 4 bytes. And yes, those extra 28 bytes do make a difference, especially when you have 13 million rows. Remember, this isn't just disk space that these values take up, it is also memory since ALL data that is read for queries goes through the Buffer Pool (i.e. physical memory!).
In this scenario, however, we're not following the foreign keys anywhere, but directly querying on them. If they're indexed, should it matter?
Yes, it still does matter since you are essentially doing the same operation as a JOIN: you are taking each value in the main table and comparing it to the values in the table variable / TVP. This is still a non-binary, case-insensitive (I assume) comparison that is very slow compared to a binary comparison. Each letter needs to be evaluated against not just upper and lower case, but against all other Unicode Code Points that could equate to each letter (and there are more than you think that will match A - Z!). The index will make it faster than not having an index, but nowhere near as fast as comparing one simple value that has no other representation.
So I finally found a solution.
While @srutzky had good suggestions for normalizing the tables by changing the NVARCHAR UserId to an integer to minimize comparison cost, this was not what solved my problem. I will definitely do this at some point for the added theoretical performance, but I saw very little change in performance after implementing it right off the bat.
@Paparazzi suggested I add an index on (RelatedStory, CreationTime), and that did not do what I needed either. The reason was that I also needed to index RelatedUser, as that's the way the query goes, and it groups and orders by both CreationTime and RelatedStory, so all three are needed. So:
CREATE INDEX i_idandtime ON Related (RelatedUser, CreationTime DESC, RelatedStory)
solved my problem, bringing my unacceptable query times of 15+ seconds down to mostly one second, or at worst a couple of seconds.
I think what gave me the revelation was @srutzky noting:
Remember, "Include" columns are not used for sorting or comparisons,
only for covering.
which made me realize I needed all my GROUP BY and ORDER BY columns in the index.
So while I can't mark either of the above posters' posts as the Answer, I'd like to sincerely thank them for their time.
The main problem seems to be that it uses 63% of the effort on
sorting.
ORDER BY CreationTime DESC
I would suggest an index on CreationTime
Or try an index on RelatedStory, CreationTime

generating a reliable system wide unique identifier

We store documents in our database (sql server), the documents are spread across various tables so there is no one table that contains all of them.
I now have a requirement to give all documents a system wide unique id, one that is semi-readable, not like a guid.
I've seen this done before by creating a single table with a single row/column with just a number that gets incremented when a new document is created.
Is this the best way to go about it, and how do I ensure that no one reads the current number while someone else is updating it, and vice versa?
In this case the number can be something like 001 and auto-increment as required, I'm mainly worried about stopping collisions rather than getting a fancy identifier.
If you want the single row/column approach, I've used:
declare @MyRef int
update CoreTable set @MyRef = LastRef = LastRef + 1
The update will be safe - each person who executes it will receive a distinct result in @MyRef. This is safer than doing a separate read, increment, and update.
Table defn:
create table CoreTable (
X char(1) not null,
LastRef int not null,
constraint PK_CoreTable PRIMARY KEY (X),
constraint CK_CoreTable_X CHECK (X = 'X')
)
insert into CoreTable (X,LastRef) values ('X',0)
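Wrapped in a stored procedure (the name here is hypothetical), every caller gets the next distinct number:
create procedure dbo.GetNextDocumentRef
    @MyRef int output
as
begin
    set nocount on;
    -- one atomic statement: increment and capture the new value together
    update CoreTable set @MyRef = LastRef = LastRef + 1;
end
go

-- usage
declare @ref int;
exec dbo.GetNextDocumentRef @MyRef = @ref output;
select @ref as NewDocumentRef;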
You can use Redis for this. Have a look at this article: http://rediscookbook.org/create_unique_ids.html
Redis is a very fast in-memory NoSQL database, but one with persistence capabilities. You can quickly utilize a Redis instance and use it to create incremental numbers which will be unique.
You can then leverage Redis for many other purposes in your app.
Another suggestion for your inquiry that does not involve installing Redis is to use a single DB row/column as you suggested, while encapsulating it in a transaction. That way you won't run into conflicts.
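A sketch of that approach, assuming the CoreTable definition shown above and using locking hints so that two concurrent callers cannot read the same value:
declare @MyRef int;

begin tran;
    -- UPDLOCK/HOLDLOCK keep the row locked until commit, so concurrent callers serialize here
    select @MyRef = LastRef + 1
    from CoreTable with (updlock, holdlock)
    where X = 'X';

    update CoreTable set LastRef = @MyRef where X = 'X';
commit;

select @MyRef as NewDocumentRef;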
One 'classic' approach would indeed be to have a separate table (e.g. Documents) with (at least) an ID column (int identity). Then, add a foreign key column and constraint to all your existing document tables. This ensures the uniqueness of the document ID across all tables.
Something like this:
CREATE TABLE Documents (Id int identity not null primary key)
ALTER TABLE DocumentTypeOne
ADD CONSTRAINT DocumentTypeOne_Documents_FK FOREIGN KEY (DocumentId) REFERENCES Documents(Id)
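A hypothetical usage pattern would then be to reserve the system-wide id first and use it when inserting into the specific document table (Title is a placeholder column):
-- reserve the next system-wide id
INSERT INTO Documents DEFAULT VALUES;
DECLARE @DocumentId int = SCOPE_IDENTITY();

-- use it in the specific document table
INSERT INTO DocumentTypeOne (DocumentId, Title)
VALUES (@DocumentId, N'Example document');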

SQL server - worth indexing large string keys?

I have a table with a large string key (varchar(1024)) that I was thinking of indexing in SQL Server (I want to be able to search over it quickly, but inserts are also important). In SQL 2008 I don't get a warning for this, but under SQL Server 2005 it tells me that the key exceeds 900 bytes and that inserts/updates where the column is over this size will fail (or something along those lines).
What are my alternatives if I want to index this large column? I don't know whether it would be worth it anyway.
An index with all the keys near 900 bytes would be very large and very deep (very few keys per page result in very tall B-Trees).
It depends on how you plan to query the values. An index is useful in several cases:
when a value is probed. This is the most typical use, is when an exact value is searched in the table. Typical examples are WHERE column='ABC' or a join condition ON a.column = B.someothercolumn.
when a range is scanned. This is also fairly typical when a range of values is searched in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF' there are other less obvious examples, like a partial match: WHERE column LIKE 'ABC%'.
an ordering requirement. This use is less known, but indexes can help a query that has an explicit ORDER BY column requirement to avoid a stop-and-go sort, and also can help certain hidden sort requirement, like a ROW_NUMBER() OVER (ORDER BY column).
So, what do you need the index for? What kind of queries would use it?
For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index vs. the benefits.
For probes you can, potentially, use a hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column='ABC'. Careful consideration would have to be given to weighing the advantage of a narrow index (a 32-bit checksum) against the disadvantages: the collision double-check, and the lack of range-scan and ordering capabilities.
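A sketch of that approach, with hypothetical table and column names:
-- persisted computed checksum of the wide column, plus a narrow index on it
ALTER TABLE dbo.MyTable
    ADD column_checksum AS CHECKSUM(wide_column) PERSISTED;

CREATE INDEX IX_MyTable_column_checksum
    ON dbo.MyTable (column_checksum);

-- probe the 32-bit checksum first, then double-check the full value to weed out collisions
SELECT *
FROM dbo.MyTable
WHERE column_checksum = CHECKSUM('ABC')
  AND wide_column = 'ABC';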
after the comment
I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along these lines:
create table values_dictionary (
    id int not null identity(1,1),
    value varchar(8000) not null,
    value_hash as checksum(value) persisted,
    constraint pk_values_dictionary_id
        primary key nonclustered (id));

create unique clustered index cdx_values_dictionary_checksum
    on values_dictionary (value_hash, id);
go

create procedure usp_get_or_create_value_id (
    @value varchar(8000),
    @id int output)
as
begin
    declare @hash int = CHECKSUM(@value);
    set @id = NULL;
    select @id = id
        from values_dictionary
        where value_hash = @hash
        and value = @value;
    if @id is null
    begin
        insert into values_dictionary (value)
        values (@value);
        set @id = scope_identity();
    end
end
In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniqueifier column. This structure makes the lookup for @value as efficient as possible, without a hugely inefficient index on value, and bypasses the 900-byte limitation. The primary key on id is non-clustered, which means that looking up the value from an id incurs the overhead of one extra probe in the clustered index.
Not sure if this answers your problem; you obviously know more about your actual scenarios than I do. Also, the code does not handle error conditions and can actually insert duplicate @value entries, which may or may not be correct.
General Index Design Guidelines
When you design an index, consider the following column guidelines:
Keep the length of the index key short for clustered indexes. Additionally, clustered indexes benefit from being created on unique or nonnull columns. For more information, see Clustered Index Design Guidelines.
Columns that are of the ntext, text, image, varchar(max), nvarchar(max), and varbinary(max) data types cannot be specified as index key columns. However, varchar(max), nvarchar(max), varbinary(max), and xml data types can participate in a nonclustered index as nonkey index columns. For more information, see Index with Included Columns.
Examine data distribution in the column. Frequently, a long-running query is caused by indexing a column with few unique values, or by performing a join on such a column. This is a fundamental problem with the data and query, and generally cannot be resolved without identifying this situation. For example, a physical telephone directory sorted alphabetically on last name will not expedite locating a person if all people in the city are named Smith or Jones.

Is adding an INT to a table whose PRIMARY KEY is a UNIQUEIDENTIFIER, for the purpose of a JOIN table, worthwhile?

I've got two tables in my SQL Server 2008 database, Users and Items
tblUser
--------------------------
UserID uniqueidentifier
Name nvarchar(50)
etc..
tblItem
--------------------------
ItemID uniqueidentifier
ItemName nvarchar(50)
etc..
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
I want to join these together in a many to many join table that will get huge (potentially more than a billion rows as the application logic requires stats over shared user --> item joins)
The join table needs to be indexed on the UserID_A and UserID_B columns since the lookups are based on a user against their peers.
My question is this:
Is it worth adding an auto-increment INT to the user table to use as a non-primary key, and then using that in the join table? So the User table looks like:
tblUser
---------------------------------
UserID uniqueidentifier
Name nvarchar(50)
UserIDJoinKey int identity(1,1)
etc..
Doing that, will it be faster to do something like:
declare @ID int
select * from tblJoin where UserIDJoinKey_A = @ID or UserIDJoinKey_B = @ID
when the join table looks like this:
tlmUserUserItem
-----------------------------------
ItemID uniqueidentifier
UserIDJoinKey_A int
UserIDJoinKey_B int
rather than this:
tlmUserUserItem
----------------------------
ItemID uniqueidentifier
UserID_A uniqueidentifier
UserID_B uniqueidentifier
Thanks in advance.
If you're having a performance problem on join operations to the table with the uniqueidentifier, first check the index fragmentation. Hot tables with a uniqueidentifier clustered index tend to get fragmented quickly. There's good info on how to do that at http://msdn.microsoft.com/en-us/library/ms189858.aspx
If you are able to move the clustered index to the new int column and rewrite your queries to use the new int column instead of the old uniqueidentifier, your biggest benefit is going to be that you'll reduce the rate of fragmentation. This helps avoid having your queries slow down after a bunch of writes to the table.
In most cases, you will not notice a huge difference in the time to process join operations on a uniqueidentifier column versus an int in MSSQL 2008 -- assuming all other things (including fragmentation) are equal.
I may be misunderstanding something along the line, but are you looking to add an identity AND a uniqueidentifier to each record? When I see a GUID being used, I assume there is either offline functionality that will be merged when the user goes online, or there is some extraneous reason the GUID was chosen. That reason may prevent you from correctly implementing an identity column on each item.
If there is no specific reason why you need a guid over an identity, I'd say scrap the GUID altogether. It's bloating your tables and indexes and slowing down your joins. If I'm misunderstanding, please let me know and I apologize!
To find out which is the best solution, first some indexing theory: SQL Server stores its clustered index data in a B+ tree of data pages, which allows for about 8 KB of data per page.
Since a uniqueidentifier is 16 bytes per key and an int is 4 bytes per key, there will be roughly 4 times more keys per index page with an int (about 2,000 versus 500 keys per 8 KB page, ignoring per-row overhead).
To have a faster join on the int column you will most likely have to make it the clustered index. Be aware that having an additional index on such a large table might create an unwanted performance hit on insert statements, as there is more information to write to disk.
It all boils down to benchmark both solutions and choosing the one which performs best for you. If the table is more read heavy, the int column will offer overall better performance.
@MikeM,
Personally, I would always choose a uniqueidentifier over an int as the primary key of a table, every time. I would, however, use NEWSEQUENTIALID() and not NEWID(), to ensure there is less index fragmentation.
The reason I make this choice is simple:
Integers are too easy to get mixed up, and on a table which has several foreign keys, the chances of "accidentally" putting a value in the wrong field is too high. You will never see the problem because ALL identity columns start at a seed of 1 and so most tables tend to have matching integer values in each table. By using uniqueidentifier I absolutely guarantee for all instances of a column that has a foreign key that the value I place in it is correct, because the table it references is the only table capable of having that unique identifier.
What's more... in code, your arguments would all be int, which again opens you up to the possibility of accidentally putting the wrong value in the wrong parameter and you would never know any different. By using unique identifiers instead, once again you are guaranteeing the correct reference.
Trying to track down bugs due to cross posted integers is insidious and the worst part is that you never know the problem has occurred until it is too late and data has become far too corrupted for you to ever unjumble. All it takes is one cross matched integer field and you could potentially create millions of inconsistent rows, none of which you would be aware of until you just "happen" to try and insert a value that doesn't exist in the referenced table... and by then it could be too late.
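For illustration, a hypothetical table using NEWSEQUENTIALID() as the default for its uniqueidentifier key:
CREATE TABLE dbo.ExampleUsers
(
    UserID uniqueidentifier NOT NULL
        CONSTRAINT DF_ExampleUsers_UserID DEFAULT NEWSEQUENTIALID(),
    Name   nvarchar(50) NOT NULL,
    CONSTRAINT PK_ExampleUsers PRIMARY KEY CLUSTERED (UserID)
);

-- NEWSEQUENTIALID() generates ever-increasing GUIDs on a given machine, so new rows
-- append to the end of the clustered index instead of splitting pages the way NEWID() values do
INSERT INTO dbo.ExampleUsers (Name) VALUES (N'First user');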
