SQL Server - more efficient hashing? - sql-server

I am caching data locally from a remote SQL server, and need to know when the remote data has changed so I can update my cache. The remote data has no modified date column, and I don't have permissions to add columns or create triggers, so there are a lot of options that are closed to me. My current process is to calculate the checksum of the remote table:
select checksum_agg(binary_checksum(*)) from remoteTable;
and then if that value changes, I know I need to update my cache. This basically works, with two flaws:
There are lots of changes that won't cause the checksum to change
All I know is "something changed" and I need to copy the entire remote table into my local cache.
In order to be more efficient and more correct, I want to upgrade my process to:
Use MD5 hash
Return multiple hash values, one for each "chunk" of the output table (in my case, every 10,000 rows) so that I can update only the "chunk" that changed
I've created this stored procedure that relies on the (apparently undocumented) fn_repl_hash_binary function:
use dataBase;
go
if exists (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'GetHash')
    drop procedure GetHash;
go
create procedure GetHash
as
begin
    declare @j int, @x nvarchar(max), @b varbinary(max), @h nvarchar(40)
    set @j = 0
    IF OBJECT_ID('tempdb.dbo.##TempHashTable', 'U') IS NOT NULL
        DROP TABLE ##TempHashTable;
    create table ##TempHashTable(id int, xlen int, hash nvarchar(40))
    while (@j < 7)
    begin
        -- one XML blob (and one hash) per 10,000-row chunk
        set @x = (select * from myTable where (id/10000) = @j for xml auto)
        set @b = convert(varbinary(max), @x)
        -- char(40): the hash is 40 hex characters; the default char length (30) would truncate it
        set @h = convert(char(40), sys.fn_repl_hash_binary(@b), 2)
        insert into ##TempHashTable(id, xlen, hash) values(@j, len(@x), @h)
        set @j = @j + 1
    end
end;
This works as expected; the only drawback is that it's rather slow. Checksumming the entire table takes 0.1 seconds (including the network round trip), while "exec GetHash" takes about 0.7 seconds. That's not bad, but I'm wondering if I can do better.
'id' is the primary key of myTable, so the select should be efficient. What I'm seeing is that the XML being generated is around 2 megabytes for each "chunk", with most of that being repetitive XML tag names and angle brackets.
Is there a way to output the values from myTable in a less readable (but more efficient) format than XML?
Are there other obvious inefficiencies in my stored procedure that would be causing it to be slow?
Yes, there are issues with this stored procedure (like, it only works on one table and I'm assuming the table will always contain fewer than 70,000 rows). Once I'm happy with the performance I'll put in all the niceties.
I just added some timing code, here's what I'm seeing:
Each "select" is taking about 60 msec
Calculating each hash is taking about 25 msec
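If a per-chunk checksum is acceptable as a first-pass change detector, one set-based alternative (a sketch, not a drop-in replacement for the procedure above) is to do the chunking in a single GROUP BY query instead of a WHILE loop, and only re-hash or re-copy the chunks whose value changed. myTable and the divide-by-10,000 chunking come from the question; the rest is illustrative, and CHECKSUM_AGG/BINARY_CHECKSUM carry the same weakness as the original query:
-- one checksum per 10,000-row chunk, in a single round trip
select id / 10000                       as chunk_id,
       count(*)                         as row_count,
       checksum_agg(binary_checksum(*)) as chunk_checksum
from myTable
group by id / 10000
order by chunk_id;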

Related

Is there a simple(r) way to REPLACE a character across all columns in one table in SQL Server?

There are ~10 different subquestions that could be answered here, but the main question is in the title. TLDR version: I have a table like the example below and I want to replace all double quote marks across the whole table. Is there a simple way to do this?
My solution using cursor seems fairly straightforward. I know there's some CURSOR hatred in the SQL Server community (bad runtime?). At what point (num rows and/or num columns) would CURSOR stink at this?
Create Reproducible Example Table
DROP TABLE IF EXISTS #example;
CREATE TABLE #example (
NumCol INT
,CharCol NVARCHAR(20)
,DateCol NVARCHAR(100)
);
INSERT INTO #example VALUES
(1, '"commas, terrible"', '"2021-01-01 20:15:57,2021:04-08 19:40:50"'),
(2, '"loadsrc,.txt"', '2020-01-01 00:00:05'),
(3, '".txt,from.csv"','1/8/2021 10:14')
Right now, my identified solutions are:
Manually update for each column UPDATE X SET CharCol = REPLACE(CharCol, '"',''). Horribly annoying to do at any more than 2 columns IMO.
Use a CURSOR to update (similar to the annoyingly complicated-looking solution at SQL Server - SQL Replace on all columns in all tables across an entire DB)
REPLACE character using CURSOR
This gets a little convoluted with all the cursor-related script, but seems to work well otherwise.
-- declare variable to store colnames, cursor to filter through list, string for dynamic sql code
DECLARE @colname VARCHAR(10)
    ,@sql VARCHAR(MAX)
    ,@namecursor CURSOR;
-- run cursor and set colnames and update table
SET @namecursor = CURSOR FOR SELECT ColName FROM #colnames
OPEN @namecursor;
FETCH NEXT FROM @namecursor INTO @colname;
WHILE (@@FETCH_STATUS <> -1) -- alt: WHILE @@FETCH_STATUS = 0
BEGIN;
    SET @sql = 'UPDATE #example SET '+@colname+' = REPLACE('+@colname+', ''"'','''')'
    EXEC(@sql); -- parentheses VERY important: EXEC(sql-as-string) NOT EXEC storedprocedure
    FETCH NEXT FROM @namecursor INTO @colname;
END;
CLOSE @namecursor;
DEALLOCATE @namecursor;
GO
-- see results
SELECT * FROM #example
Subquestion: While I've seen it in our database elsewhere, for this particular example I'm opening a .csv file in Excel and exporting it as tab delimited. Is there a way to change the settings to export without the double quotes? If I remember correctly, BULK INSERT doesn't have a way to handle that or a way to handle importing a csv file with extra commas.
And yes, I'm going to pretend that I'm fine that there's a list of datetimes in the date column (necessitating varchar data type).
Why not just dynamically build the SQL?
Presumably it's a one-time task you'd be doing so just run the below for your table, paste into SSMS and run. But if not you could build an automated process to execute it - better of course to properly sanitize when inserting the data though!
select
'update <table> set ' +
String_Agg(QuoteName(COLUMN_NAME) + '=Replace(' + QuoteName(column_name) + ',''"'','''')',',')
from INFORMATION_SCHEMA.COLUMNS
where table_name='<table>' and TABLE_SCHEMA='<schema>' and data_type in ('varchar','nvarchar')
example DB<>Fiddle
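For instance, against a permanent table shaped like the #example above (INFORMATION_SCHEMA.COLUMNS in the user database won't list a #temp table, which lives in tempdb), the query would emit something roughly like:
update <table> set [CharCol]=Replace([CharCol],'"',''),[DateCol]=Replace([DateCol],'"','')
Only the (n)varchar columns are included, so NumCol is left alone.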
You might try this approach, not fast, but easy to type (or generate).
SELECT NumCol = y.value('(NumCol/text())[1]','int')
,CharCol = y.value('(CharCol/text())[1]','nvarchar(100)')
,DateCol = y.value('(DateCol/text())[1]','nvarchar(100)')
FROM #example e
CROSS APPLY(SELECT e.* FOR XML PATH('')) A(x)
CROSS APPLY(SELECT CAST(REPLACE(A.x,'"','') AS XML)) B(y);
The idea in short:
The first APPLY will transform all columns to a root-less XML.
Without using ,TYPE this will be of type nvarchar(max) implicitly
The second APPLY will first replace any " in the whole text (which is one row actually) and cast this to XML.
The SELECT uses .value to fetch the values type-safe from the XML.
Update: Just add INTO dbo.SomeNotExistingTableName right before FROM to create a new table with this data. This looks better than updating the existing table (might be a #-table too). I'd see this as a staging environment...
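A sketch of that staging variant (dbo.SomeNotExistingTableName is just the placeholder name used above):
SELECT NumCol = y.value('(NumCol/text())[1]','int')
      ,CharCol = y.value('(CharCol/text())[1]','nvarchar(100)')
      ,DateCol = y.value('(DateCol/text())[1]','nvarchar(100)')
INTO dbo.SomeNotExistingTableName   -- the INTO goes right before FROM
FROM #example e
CROSS APPLY(SELECT e.* FOR XML PATH('')) A(x)
CROSS APPLY(SELECT CAST(REPLACE(A.x,'"','') AS XML)) B(y);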
Good luck, messy data is always a pain in the neck :-)

SQL Server Memory Optimized Table - poor performance compared to temporary table

I'm trying to benchmark memory optimized tables in Microsoft SQL Server 2016 with classic temporary tables.
SQL Server version:
Microsoft SQL Server 2016 (SP2) (KB4052908) - 13.0.5026.0 (X64) Mar 18 2018 09:11:49
Copyright (c) Microsoft Corporation
Developer Edition (64-bit) on Windows 10 Enterprise 10.0 <X64> (Build 17134: ) (Hypervisor)
I'm following steps described here: https://learn.microsoft.com/en-us/sql/relational-databases/in-memory-oltp/faster-temp-table-and-table-variable-by-using-memory-optimization?view=sql-server-ver15.
CrudTest_TempTable 1000, 100, 100
go 1000
versus
CrudTest_memopt_hash 1000, 100, 100
go 1000
What does this test do?
1000 inserts
100 random updates
100 random deletes
And this is repeated 1000 times.
First stored procedure that uses classic temporary tables takes about 6 seconds to run.
Second stored procedure takes at least 15 seconds and usually errors out:
Beginning execution loop
Msg 3998, Level 16, State 1, Line 3
Uncommittable transaction is detected at the end of the batch. The transaction is rolled back.
Msg 701, Level 17, State 103, Procedure CrudTest_memopt_hash, Line 16 [Batch Start Line 2]
There is insufficient system memory in resource pool 'default' to run this query.
I have done the following optimizations (before, it was even worse):
the hash index includes both Col1 and SpidFilter
doing everything in a single transaction makes it work faster (however, it would be nice to run without it)
I'm generating random ids - without that, records from every iteration ended up in the same buckets
I haven't created a natively compiled SP yet since my results are awful.
I have plenty of free RAM on my box and SQL Server can consume it - in other scenarios it allocates a lot of memory, but in this test case it simply errors out.
For me these results mean that memory optimized tables cannot replace temporary tables. Do you have similar results or am I doing something wrong?
The code that uses temporary tables is:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
DROP PROCEDURE IF EXISTS CrudTest_TempTable;
GO
CREATE PROCEDURE CrudTest_TempTable
    @InsertsCount INT, @UpdatesCount INT, @DeletesCount INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRAN;
    CREATE TABLE #tempTable
    (
        Col1 INT NOT NULL PRIMARY KEY CLUSTERED,
        Col2 NVARCHAR(4000),
        Col3 NVARCHAR(4000),
        Col4 DATETIME2,
        Col5 INT NOT NULL
    );
    DECLARE @cnt INT = 0;
    DECLARE @currDate DATETIME2 = GETDATE();
    WHILE @cnt < @InsertsCount
    BEGIN
        INSERT INTO #tempTable (Col1, Col2, Col3, Col4, Col5)
        VALUES (@cnt,
            'sdkfjsdjfksjvnvsanlknc kcsmksmk ms mvskldamvks mv kv al kvmsdklmsdkl mal mklasdmf kamfksam kfmasdk mfksamdfksafeowa fpmsad lak',
            'msfkjweojfijm skmcksamepi eisjfi ojsona npsejfeji a piejfijsidjfai spfdjsidjfkjskdja kfjsdp fiejfisjd pfjsdiafjisdjfipjsdi s dfipjaiesjfijeasifjdskjksjdja sidjf pajfiaj pfsdj pidfe',
            @currDate, 100);
        SET @cnt = @cnt + 1;
    END
    SET @cnt = 0;
    WHILE @cnt < @UpdatesCount
    BEGIN
        UPDATE #tempTable SET Col5 = 101 WHERE Col1 = cast((rand() * @InsertsCount) as int);
        SET @cnt = @cnt + 1;
    END
    SET @cnt = 0;
    WHILE @cnt < @DeletesCount
    BEGIN
        DELETE FROM #tempTable WHERE Col1 = cast((rand() * @InsertsCount) as int);
        SET @cnt = @cnt + 1;
    END
    COMMIT;
END
GO
The objects used in the in-memory test are :
DROP PROCEDURE IF EXISTS CrudTest_memopt_hash;
GO
DROP SECURITY POLICY IF EXISTS tempTable_memopt_hash_SpidFilter_Policy;
GO
DROP TABLE IF EXISTS tempTable_memopt_hash;
GO
DROP FUNCTION IF EXISTS fn_SpidFilter;
GO
CREATE FUNCTION fn_SpidFilter(@SpidFilter smallint)
RETURNS TABLE
WITH SCHEMABINDING, NATIVE_COMPILATION
AS
RETURN
    SELECT 1 AS fn_SpidFilter
    WHERE @SpidFilter = @@spid;
GO
CREATE TABLE tempTable_memopt_hash
(
    Col1 INT NOT NULL,
    Col2 NVARCHAR(4000),
    Col3 NVARCHAR(4000),
    Col4 DATETIME2,
    Col5 INT NOT NULL,
    SpidFilter SMALLINT NOT NULL DEFAULT (@@spid),
    INDEX ix_SpidFiler NONCLUSTERED (SpidFilter),
    INDEX ix_hash HASH (Col1, SpidFilter) WITH (BUCKET_COUNT=100000),
    CONSTRAINT CHK_SpidFilter CHECK ( SpidFilter = @@spid )
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_ONLY);
GO
CREATE SECURITY POLICY tempTable_memopt_hash_SpidFilter_Policy
ADD FILTER PREDICATE dbo.fn_SpidFilter(SpidFilter)
ON dbo.tempTable_memopt_hash
WITH (STATE = ON);
GO
And the stored procedure that uses them is:
CREATE PROCEDURE CrudTest_memopt_hash
    @InsertsCount INT, @UpdatesCount INT, @DeletesCount INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRAN;
    DECLARE @cnt INT = 0;
    DECLARE @currDate DATETIME2 = GETDATE();
    DECLARE @IdxStart INT = CAST((rand() * 1000) AS INT);
    WHILE @cnt < @InsertsCount
    BEGIN
        INSERT INTO tempTable_memopt_hash(Col1, Col2, Col3, Col4, Col5)
        VALUES (@IdxStart + @cnt,
            'sdkfjsdjfksjvnvsanlknc kcsmksmk ms mvskldamvks mv kv al kvmsdklmsdkl mal mklasdmf kamfksam kfmasdk mfksamdfksafeowa fpmsad lak',
            'msfkjweojfijm skmcksamepi eisjfi ojsona npsejfeji a piejfijsidjfai spfdjsidjfkjskdja kfjsdp fiejfisjd pfjsdiafjisdjfipjsdi s dfipjaiesjfijeasifjdskjksjdja sidjf pajfiaj pfsdj pidfe',
            @currDate, 100);
        SET @cnt = @cnt + 1;
    END
    SET @cnt = 0;
    WHILE @cnt < @UpdatesCount
    BEGIN
        UPDATE tempTable_memopt_hash
        SET Col5 = 101
        WHERE Col1 = @IdxStart + cast((rand() * @InsertsCount) as int);
        SET @cnt = @cnt + 1;
    END
    SET @cnt = 0;
    WHILE @cnt < @DeletesCount
    BEGIN
        DELETE FROM tempTable_memopt_hash
        WHERE Col1 = @IdxStart + cast((rand() * @InsertsCount) as int);
        SET @cnt = @cnt + 1;
    END
    DELETE FROM tempTable_memopt_hash;
    COMMIT;
END
GO
Index stats:
table index total_bucket_count empty_bucket_count empty_bucket_percent avg_chain_length max_chain_length
[dbo].[tempTable_memopt_hash] PK__tempTabl__3ED0478731BB5AF0 131072 130076 99 1 3
UPDATE
I'm including my final test cases and the SQL code for creating the procedures, tables, etc. I performed the tests on an empty database.
SQL Code: https://pastebin.com/9K6SgAqZ
Test cases: https://pastebin.com/ckSTnVqA
My last run looks like this (the temp table is the fastest when it comes to tables, but I am able to achieve the fastest times using a memory-optimized table variable):
Start CrudTest_TempTable 2019-11-18 10:45:02.983
Beginning execution loop
Batch execution completed 1000 times.
Finish CrudTest_TempTable 2019-11-18 10:45:09.537
Start CrudTest_SpidFilter_memopt_hash 2019-11-18 10:45:09.537
Beginning execution loop
Batch execution completed 1000 times.
Finish CrudTest_SpidFilter_memopt_hash 2019-11-18 10:45:27.747
Start CrudTest_memopt_hash 2019-11-18 10:45:27.747
Beginning execution loop
Batch execution completed 1000 times.
Finish CrudTest_memopt_hash 2019-11-18 10:45:46.100
Start CrudTest_tableVar 2019-11-18 10:45:46.100
Beginning execution loop
Batch execution completed 1000 times.
Finish CrudTest_tableVar 2019-11-18 10:45:47.497
IMHO, the test in the OP cannot show the advantages of memory-optimized tables,
because the greatest advantage of these tables is that they are lock- and latch-free: your updates/inserts/deletes take no locks at all, which permits concurrent changes to these tables.
But the test does not include any concurrent changes; the code shown makes all the changes in one session.
Another observation: the hash index defined on the table is wrong, since you search on only one column while the hash index is defined on two columns. A hash index on two columns means the hash function is applied to both arguments, but you search on only one column, so the hash index simply cannot be used.
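As an illustration of that last point (a sketch only): a hash index seek needs equality predicates on every column of the hash key, so with ix_hash declared on (Col1, SpidFilter) the lookup would have to supply both values, for example:
UPDATE tempTable_memopt_hash
SET Col5 = 101
WHERE Col1 = @IdxStart + cast((rand() * @InsertsCount) as int)
  AND SpidFilter = @@spid; -- the second hash-key column; without an equality predicate on it the hash index cannot be probed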
Do you think by using mem opt tables I can get performance improvements over temp tables or is it just for limiting IO on tempdb?
Memory-optimized tables are not meant to be a substitute for temporary tables. As already mentioned, you'll see the benefit in a highly concurrent OLTP environment; a temporary table, as you say, is visible only to your session, so there is no concurrency at all.
Eliminate latches and locks. All In-Memory OLTP internal data structures are latch- and lock-free. In-Memory OLTP uses a new multi-version concurrency control (MVCC) to provide transaction consistency. From a user standpoint, it behaves in a way similar to the regular SNAPSHOT transaction isolation level; however, it does not use locking under the hood. This schema allows multiple sessions to work with the same data without locking and blocking each other and improves the scalability of the system allowing fully utilize modern multi-CPU/multi-core hardware.
Cited book: Pro SQL Server Internals by Dmitri Korotkevitch
What do you think about the title "Faster temp table and table variable by using memory optimization"?
I opened this article and saw these examples (in the order they appear in the article):
A. Basics of memory-optimized table variables
B. Scenario: Replace global tempdb ##table
C. Scenario: Replace session tempdb #table
A. I use table variables only in cases where they contain very few rows. Why should I even care about those few rows?
B. Replace global tempdb ##table. I just don't use them at all.
C. Replace session tempdb #table. As already mentioned, a session tempdb #table is not visible to any other session, so what is the gain? That the data don't go to disk? Maybe you should think about a faster SSD for your tempdb if you really have problems with tempdb. Starting with 2014, tempdb objects don't necessarily go to disk even in the case of bulk inserts; in any case, I even have RCSI enabled on my databases and have no problems with tempdb.
You will likely not see a performance improvement, except in very special applications. SQL Server dabbled with things like 'pin table' in the past, but the optimizer, choosing which pages are in memory based on real activity, is probably as good as it gets for almost all cases. This has been performance-tuned over decades. I think 'in memory' is more a marketing touchpoint than of any practical use. Prove me wrong, please.

Why is my UPDATE stored procedure executed multiple times?

I use stored procedures to manage a warehouse. PDA scanners scan the added stock and send it in bulk (when plugged back) to a SQL database (SQL Server 2016).
The SQL database is fairly remote (other country), so there's sometimes a delay in some queries, but this particular one is problematic: even if the stock table is fine, I had some problems with updating the occupancy of the warehouse spots. The PDA tracks the added items in every spot as a SMALLINT, then sends this value back to the stored procedure below.
PDA "send_spots" query:
SELECT spot, added_items FROM spots WHERE change=1
Stored procedure:
CREATE PROCEDURE [dbo].[update_spots]
    @spot VARCHAR(10),
    @added_items SMALLINT
AS
BEGIN
    BEGIN TRAN
    UPDATE storage_spots
    SET remaining_capacity = remaining_capacity - @added_items
    WHERE storage_spot = @spot
    IF @@ROWCOUNT <> 1
    BEGIN
        ROLLBACK TRAN
        RETURN -1
    END
    ELSE
    BEGIN
        COMMIT TRAN
        RETURN 0
    END
END
GO
If the remaining_capacity value is 0, the PDAs can't add more items to it on the next round. But with this process, I had negative values because the query allegedly ran twice (so it subtracted @added_items twice).
Is there a way for that to be possible? How could I fix it? From what I understood the transaction should be cancelled (ROLLBACK) if the affected rows are != 1, but maybe that's something else.
EDIT: current solution with the help of @Zero:
CREATE PROCEDURE [dbo].[update_spots]
    @spot VARCHAR(10),
    @added_racks SMALLINT
AS
BEGIN
    -- Recover current capacity of the spot
    DECLARE @orig_capacity SMALLINT
    SELECT TOP 1
        @orig_capacity = remaining_capacity
    FROM storage_spots
    WHERE storage_spot = @spot
    -- Test if double is present in logs by comparing dates (last 10 seconds)
    DECLARE @is_double BIT = 0
    SELECT @is_double = CASE WHEN EXISTS(SELECT *
                                         FROM spot_logs
                                         WHERE log_timestamp >= dateadd(second, -10, getdate()) AND storage_spot = @spot AND delta = @added_racks)
                             THEN 1 ELSE 0 END
    BEGIN
        BEGIN TRAN
        UPDATE storage_spots
        SET remaining_capacity = @orig_capacity - @added_racks
        WHERE storage_spot = @spot
        IF @@ROWCOUNT <> 1 OR @is_double <> 0
            -- If double, rollback UPDATE
            ROLLBACK TRAN
        ELSE
            -- If no double, commit UPDATE
            COMMIT TRAN
        -- write log
        INSERT INTO spot_logs
            (storage_spot, former_capacity, new_capacity, delta, log_timestamp, double_detected)
        VALUES
            (@spot, @orig_capacity, @orig_capacity - @added_racks, @added_racks, getdate(), @is_double)
    END
END
GO
I was thinking about possible causes (and a way to trace them) and then it hit me - you have no value validation!
Here's a simple example to illustrate the problem:
Spot | capacity
---------------
x1 | 1
Update spots set capacity = capacity - 2 where spot = 'X1'
Your scanner most likely gave you higher quantity than you had capacity to take in.
I am not sure how your business logic goes, but you need to perform something along the lines of
Update spots set capacity = capacity - @added_items where spot = 'X1' and capacity >= @added_items
if @@rowcount <> 1
    rollback;
EDIT: a few methods to trace your problem without implementing validation:
Create a logging table (with a timestamp, user id (the user connected to the DB), session id, old value, new value, and delta value (added items)).
Option 1:
Log all updates that change the value from positive to negative (at least until you figure out the problem).
The drawback of this option is that it will not register double calls that do not result in a negative capacity.
Option 2 (logging identical updates):
Create a script that creates a global temporary table and, once every minute or so, deletes records from that table whose timestamps are older than... let's say 10 minutes (play around with the numbers).
This temporary table should hold the data passed to your update statement, so 'spot', 'added_items' + 'timestamp' (for tracking).
Now the crucial part: when you call your update statement, check whether a similar record exists in the temporary table (same spot and added_items, and where the current timestamp is between timestamp and [timestamp + 1 second] - again, play around with the numbers). If a record like that exists, log that update; if not, add it to the temporary table.
This will register identical updates that go within 1 second of each other (or whatever time-frame you choose).
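A minimal sketch of Option 2, using the original procedure's parameters; the table and column names are hypothetical, and the 1-second window is just the example value above:
IF OBJECT_ID('tempdb..##recent_updates') IS NULL
    CREATE TABLE ##recent_updates
    (
        storage_spot VARCHAR(10),
        added_items  SMALLINT,
        called_at    DATETIME DEFAULT GETDATE()
    );
-- inside update_spots, before the UPDATE:
IF EXISTS (SELECT 1 FROM ##recent_updates
           WHERE storage_spot = @spot
             AND added_items = @added_items
             AND called_at >= DATEADD(second, -1, GETDATE()))
BEGIN
    -- suspected duplicate call: log it (e.g. into spot_logs) and skip the UPDATE
    RETURN -2
END
INSERT INTO ##recent_updates (storage_spot, added_items) VALUES (@spot, @added_items);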
I found here an alternative SQL query that does the update the way I need, but with a temporary value using DECLARE. Would it work better in my case, or is my initial query correct?
Initial query:
UPDATE storage_spots
SET remaining_capacity = remaining_capacity - #added_items
WHERE storage_spot=#spot
Alternative query:
DECLARE @orig_capacity SMALLINT
SELECT TOP 1 @orig_capacity = remaining_capacity
FROM storage_spots
WHERE spot = @spot
UPDATE Products
SET remaining_capacity = @orig_capacity - @added_items
WHERE spot = @spot
Also, should I get rid of the ROLLBACK/COMMIT instructions?

Sql server query with lots of substrings performance

I have a stored procedure in which I perform a bulk insert into a temp table and then use SUBSTRING on its value field to get the different values required for the main table.
The main table has 66 columns, and each run of the stored procedure adds approximately 5,500 rows.
Code for bulk insert part:
CREATE TABLE [dbo].[#TestData] (
logdate DATETIME,
id CHAR(15),
value VARCHAR(max)
)
BEGIN TRANSACTION
DECLARE @sql VARCHAR(max)
SET @sql = 'BULK INSERT [dbo].[#TestData] FROM ''' + @pfile + ''' WITH (
    firstrow = 2,
    fieldterminator = ''\t'',
    rowterminator = ''\n''
)'
EXEC(@sql)
IF (@@ERROR <> 0)
BEGIN
    ROLLBACK TRANSACTION
    RETURN 1
END
COMMIT TRANSACTION
Code for substring part :
CASE
WHEN (PATINDEX('%status="%', value) > 0)
THEN (nullif(SUBSTRING(value, (PATINDEX('%status="%', value) + 8), (CHARINDEX('"', value, (PATINDEX('%status="%', value) + 8)) - (PATINDEX('%status="%', value) + 8))), ''))
ELSE NULL
END,
This substring code is used in insert into and is similar for all the 66 columns.
It takes around 20-25 seconds for the stored procedure to execute. I have tried indexing the temp table, dropping foreign keys, dropping all indexes, and dropping the primary key, but it still takes the same time.
So my question is can the performance be improved?
Edit: The application interface is Visual FoxPro 6.0.
Since SQL Server is slow with string manipulation, I'm doing all the string manipulation in FoxPro now. I'm new to FoxPro - any suggestions on how to send NULL from FoxPro to SQL Server?
I've never worked with NULL in FoxPro 6.0.
Since you are not really leveraging the features of PATINDEX() here you may want to examine the use of CHARINDEX() instead, which, despite its name, operates on strings and not only on characters. CHARINDEX() may prove to be faster than PATINDEX() since it is a somewhat simpler function.
Indexes won't help you with those string operations because you're not searching for prefixes of strings.
You should definitely look into options to avoid the excessive use of PATINDEX() or CHARINDEX() inside the statement; there are up to 4(!) invocations thereof in your CASE for each of your 66 columns in every record processed.
For this you may want to split the string operations into multiple statements to pre-compute the values for the start and end index of the substring of interest, like
UPDATE temptable
SET value_start_index = CHARINDEX('status="', value) + 8
UPDATE temptable
SET value_end_index = CHARINDEX('"', value, value_start_index)
WHERE value_start_index > 8 -- CHARINDEX() returns 0 when the tag is absent, which leaves value_start_index at exactly 8
UPDATE temptable
SET value_str = SUBSTRING(value, value_start_index, value_end_index - value_start_index)
WHERE value_end_index IS NOT NULL
SQL Server is rather slow in dealing with strings. For this number of executions, it would be best to use a SQL CLR user-defined function. There is not much more you can do beyond that.

SQL Server - Implementing sequences

I have a system which requires I have IDs on my data before it goes to the database. I was using GUIDs, but found them to be too big to justify the convenience.
I'm now experimenting with implementing a sequence generator which basically reserves a range of unique ID values for a given context. The code is as follows:
ALTER PROCEDURE [dbo].[Sequence.ReserveSequence]
    @Name varchar(100),
    @Count int,
    @FirstValue bigint OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    -- Ensure the parameters are valid
    IF (@Name IS NULL OR @Count IS NULL OR @Count < 0)
        RETURN -1;
    -- Reserve the sequence
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION
    -- Get the sequence ID, and the last reserved value of the sequence
    DECLARE @SequenceID int;
    DECLARE @LastValue bigint;
    SELECT TOP 1 @SequenceID = [ID], @LastValue = [LastValue]
    FROM [dbo].[Sequences]
    WHERE [Name] = @Name;
    -- Ensure the sequence exists
    IF (@SequenceID IS NULL)
    BEGIN
        -- Create the new sequence
        INSERT INTO [dbo].[Sequences] ([Name], [LastValue])
        VALUES (@Name, @Count);
        -- The first reserved value of a sequence is 1
        SET @FirstValue = 1;
    END
    ELSE
    BEGIN
        -- Update the sequence
        UPDATE [dbo].[Sequences]
        SET [LastValue] = @LastValue + @Count
        WHERE [ID] = @SequenceID;
        -- The sequence start value will be the last previously reserved value + 1
        SET @FirstValue = @LastValue + 1;
    END
    COMMIT TRANSACTION
END
The 'Sequences' table is just an ID, Name (unique), and the last allocated value of the sequence. Using this procedure I can request N values in a named sequence and use these as my identifiers.
This works great so far - it's extremely quick since I don't have to constantly ask for individual values; I can just use up a range of values and then request more.
The problem is that at extremely high frequency, calling the procedure concurrently can sometimes result in a deadlock. I have only seen this occur during stress testing, but I'm worried it'll crop up in production. Are there any notable flaws in this procedure, and can anyone recommend a way to improve it? It would be nice to do this without transactions, for example, but I do need it to be 'thread safe'.
MS themselves offer a solution and even they say it locks/deadlocks.
If you want to add some lock hints then you'd reduce concurrency for your high loads
Options:
You could develop against the "Denali" CTP which is the next release
Use IDENTITY and the OUTPUT clause like everyone else
Adopt/modify the solutions above
On DBA.SE there is "Emulate a TSQL sequence via a stored procedure": see dportas' answer which I think extends the MS solution.
I'd recommend sticking with the GUIDs if, as you say, this is mostly about composing data ready for a bulk insert (it's simpler than what I present below).
As an alternative, could you work with a restricted count? Say, 100 ID values at a time? In that case, you could have a table with an IDENTITY column, insert into that table, return the generated ID (say, 39), and then your code could assign all values between 3900 and 3999 (e.g. multiply up by your assumed granularity) without consulting the database server again.
Of course, this could be extended to allocating multiple IDs in a single call - provided that you're okay with some IDs potentially going unused. E.g. you need 638 IDs, so you ask the database to assign you 7 new ID values (which implies that you've been allocated 700 values), use the 638 you want, and the remaining 62 never get assigned.
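A minimal sketch of that idea, with hypothetical object names and a block size of 100:
CREATE TABLE dbo.IdBlocks
(
    BlockId INT IDENTITY(1,1) PRIMARY KEY,
    ReservedAt DATETIME2 NOT NULL DEFAULT SYSDATETIME()
);
GO
CREATE PROCEDURE dbo.ReserveIdBlock
    @FirstValue BIGINT OUTPUT
AS
BEGIN
    SET NOCOUNT ON;
    -- each inserted row reserves one block: BlockId 39 maps to IDs 3900..3999
    INSERT INTO dbo.IdBlocks DEFAULT VALUES;
    SET @FirstValue = CAST(SCOPE_IDENTITY() AS BIGINT) * 100;
END
GO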
Can you get some kind of deadlock trace? For example, enable trace flag 1222 as shown here. Duplicate the deadlock. Then look in the SQL Server log for the deadlock trace.
Also, you might inspect what locks are taken out in your code by inserting a call to exec sp_lock or select * from sys.dm_tran_locks immediately before the COMMIT TRANSACTION.
Most likely you are observing a conversion deadlock. To avoid them, you want to make sure that your table is clustered and has a PK, but this advice is specific to 2005 and 2008 R2, and they can change the implementation, rendering this advice useless. Google up "Some heap tables may be more prone to deadlocks than identical tables with clustered indexes".
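For instance (a sketch, assuming the [Sequences] table is currently a heap and its ID column is suitable as the key):
ALTER TABLE dbo.Sequences
ADD CONSTRAINT PK_Sequences PRIMARY KEY CLUSTERED (ID);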
Anyway, if you observe an error during stress testing, it is likely that sooner or later it will occur in production as well.
You may want to use sp_getapplock to serialize your requests. Google up "Application Locks (or Mutexes) in SQL Server 2005". Also I described a few useful ideas here: "Developing Modifications that Survive Concurrency".
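A sketch of the sp_getapplock approach (the resource name is arbitrary):
BEGIN TRANSACTION;
EXEC sp_getapplock @Resource = 'Sequence.ReserveSequence',
                   @LockMode = 'Exclusive',
                   @LockOwner = 'Transaction',
                   @LockTimeout = 5000;
-- ... read and update [dbo].[Sequences] here, exactly as in the procedure above ...
COMMIT TRANSACTION; -- the application lock is released with the transaction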
I thought I'd share my solution. It doesn't deadlock, nor does it produce duplicate values. An important difference between this and my original procedure is that it doesn't create the sequence if it doesn't already exist:
ALTER PROCEDURE [dbo].[ReserveSequence]
(
    @Name nvarchar(100),
    @Count int,
    @FirstValue bigint OUTPUT
)
AS
BEGIN
    SET NOCOUNT ON;
    IF (@Count <= 0)
    BEGIN
        SET @FirstValue = NULL;
        RETURN -1;
    END
    DECLARE @Result TABLE ([LastValue] bigint)
    -- Update the sequence last value, and get the previous one
    -- (DELETED.LastValue is the value before the update, so the reserved range starts right after it)
    UPDATE [Sequences]
    SET [LastValue] = [LastValue] + @Count
    OUTPUT DELETED.LastValue INTO @Result
    WHERE [Name] = @Name;
    -- Select the first value
    SELECT TOP 1 @FirstValue = [LastValue] + 1 FROM @Result;
END
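A usage sketch (the sequence name 'Orders' is hypothetical, and per the note above its row must already exist in [Sequences]):
DECLARE @first bigint;
EXEC [dbo].[ReserveSequence] @Name = N'Orders', @Count = 500, @FirstValue = @first OUTPUT;
SELECT @first AS FirstReservedValue, @first + 499 AS LastReservedValue;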
