SQL Server query with lots of substrings performance - sql-server

I have a stored procedure in which I perform a bulk insert into a temp table and then run SUBSTRING operations on its field to build the rows required for the main table.
The main table has 66 columns, and roughly 5,500 rows are added on each run of the stored procedure.
Code for the bulk insert part:
CREATE TABLE [dbo].[#TestData] (
    logdate DATETIME,
    id CHAR(15),
    value VARCHAR(MAX)
)

BEGIN TRANSACTION

DECLARE @sql VARCHAR(MAX)
SET @sql = 'BULK INSERT [dbo].[#TestData] FROM ''' + @pfile + ''' WITH (
    firstrow = 2,
    fieldterminator = ''\t'',
    rowterminator = ''\n''
)'
EXEC(@sql)

IF (@@ERROR <> 0)
BEGIN
    ROLLBACK TRANSACTION
    RETURN 1
END

COMMIT TRANSACTION
Code for the substring part:
CASE
WHEN (PATINDEX('%status="%', value) > 0)
THEN (nullif(SUBSTRING(value, (PATINDEX('%status="%', value) + 8), (CHARINDEX('"', value, (PATINDEX('%status="%', value) + 8)) - (PATINDEX('%status="%', value) + 8))), ''))
ELSE NULL
END,
This substring code is used in the INSERT INTO statement and is similar for all 66 columns.
It takes around 20-25 seconds for the stored procedure to execute. I have tried indexing the temp table, dropping foreign keys, dropping all indexes, and dropping the primary key, but it still takes the same time.
So my question is: can the performance be improved?
Edit: The front-end application is Visual FoxPro 6.0.
Since SQL Server is slow with string manipulation, I am now doing all the string manipulation in FoxPro. I'm new to FoxPro; any suggestions on how to send NULL from FoxPro to SQL Server?
I've never worked with NULL in FoxPro 6.0.

Since you are not really leveraging the features of PATINDEX() here, you may want to examine the use of CHARINDEX() instead, which, despite its name, operates on strings and not only on single characters. CHARINDEX() may prove to be faster than PATINDEX() since it is a somewhat simpler function.
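For illustration, a hedged sketch of the same status extraction rewritten with only CHARINDEX() (this assumes the marker is always the exact literal status=", which is what the original %status="% pattern implies):
CASE
    WHEN CHARINDEX('status="', value) > 0
    THEN NULLIF(SUBSTRING(value,
                          CHARINDEX('status="', value) + 8,
                          CHARINDEX('"', value, CHARINDEX('status="', value) + 8)
                              - (CHARINDEX('status="', value) + 8)),
                '')
    ELSE NULL
END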
Indexes won't help you with those string operations because you're not searching for prefixes of strings.
You should definitely look into options to avoid the excessive use of PATINDEX() or CHARINDEX() inside the statement; there are up to 4(!) invocations thereof in your CASE for each of your 66 columns in every record processed.
For this you may want to split the string operations into multiple statements to pre-compute the values for the start and end index of the substring of interest, like
UPDATE temptable
SET value_start_index = CHARINDEX('status="', value) + 8
UPDATE temptable
SET value_end_index = CHARINDEX('"', value, value_start_index)
WHERE value_start_index >= 8
UPDATE temptable
SET value_str = SUBSTRING(value, value_start_index, value_end_index - value_start_index)
WHERE value_end_index IS NOT NULL
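An alternative (a hedged sketch, not tested against your data) is to compute each index once per row with CROSS APPLY, so every CHARINDEX() expression appears only once per column instead of up to four times:
SELECT CASE WHEN s.start_ix > 8   -- 'status="' was found (CHARINDEX returned > 0)
            THEN NULLIF(SUBSTRING(t.value, s.start_ix, e.end_ix - s.start_ix), '')
       END AS status_value        -- NULL when the marker is missing, as in the original CASE
FROM [dbo].[#TestData] AS t
CROSS APPLY (SELECT CHARINDEX('status="', t.value) + 8) AS s(start_ix)
CROSS APPLY (SELECT CHARINDEX('"', t.value, s.start_ix)) AS e(end_ix)
The same pattern extends to the other columns by adding further APPLY blocks.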

SQL Server is rather slow in dealing with strings. For this number of executions, it would be best to use a SQL CLR user-defined function. There is not much more you can do beyond that.

Related

DATETIME search predicate on DATETIME column much slower than string literal predicate

I'm doing a search on a large table of about 10 million rows. I want to specify a start and end date and return all records in the table created between those dates.
It's a straightforward query:
declare @StartDateTime datetime = '2016-06-21',
        @EndDateTime datetime = '2016-06-22';

select *
FROM Archive.dbo.Order O WITH (NOLOCK)
where O.Created >= @StartDateTime
  AND O.Created < @EndDateTime;
Created is a DATETIME column which has a non-clustered index.
This query took about 15 seconds to complete.
However, if I modify the query slightly, as follows, it takes only 1 second to return the same result:
declare @StartDateTime datetime = '2016-06-21',
        @EndDateTime datetime = '2016-06-22';

select *
FROM Archive.dbo.Order O WITH (NOLOCK)
where O.Created >= '2016-06-21'
  AND O.Created < @EndDateTime;
The only change is replacing the @StartDateTime search predicate with a string literal. Looking at the execution plan, when I used @StartDateTime it did an index scan but when I used a string literal it did an index seek and was 15 times faster.
Does anyone know why using the string literal is so much faster?
I would have thought doing a comparison between a DATETIME column and a DATETIME variable would be quicker than comparing the column to a string representation of a date. I've tried dropping and recreating the index on the Created column and it made no difference. I notice I get similar results on the production system as I do on the test system so the weird behaviour doesn't seem specific to a particular database or SQL Server instance.
All variables have an instance (a scope) within which they are recognized.
In OOP languages, we usually distinguish static/constant variables from temporary variables by using keywords, or when a variable is passed into a function, where inside that instance the variable is treated as a constant if the function transforms it, such as the following in C++:
string MyFunction(string& name)
// technically, `&` passes the actual location of the variable
// instead of using a logical representation. The concept is the same.
In SQL Server, the Standard chose to implement it a bit differently. There are no constant data types, so instead we use literals which are either
object names (which have similar precedence in the call as system keywords)
names with an object delimiter (including ', [])
or strings with the delimiter CHAR(39) (').
This is why you noticed the two queries behave differently: those variables are not constants to the Optimizer, which means SQL Server has already chosen its execution path beforehand.
If you have SSMS installed, include the Actual Execution Plan (CTRL + M) and notice what the Estimated Rows are for the SELECT statement. This is the highlight of the execution plan. The greater the difference between the Estimated and Actual rows, the more likely your query can benefit from optimization. In your example, SQL Server had to guess how many rows there were, ended up overshooting, and lost efficiency.
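As an aside (a hedged sketch, and not the approach taken below), one common way to let the optimizer see the actual variable values at plan compile time is OPTION (RECOMPILE), at the cost of compiling on every execution:
DECLARE @StartDateTime datetime = '2016-06-21',
        @EndDateTime datetime = '2016-06-22';

SELECT *
FROM Archive.dbo.[Order] O
WHERE O.Created >= @StartDateTime
  AND O.Created < @EndDateTime
OPTION (RECOMPILE);  -- the plan is built with the current variable values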
The solution is one and the same, but you can still encapsulate everything if you want to. We use the AdventureWorks2012 database for this example:
1) Declare the Variable in the Procedure
CREATE PROC dbo.TEST1 (@NameStyle INT, @FirstName VARCHAR(50))
AS
BEGIN
    SELECT *
    FROM Person.Person
    WHERE FirstName = @FirstName
      AND NameStyle = @NameStyle; -- NameStyle is 0
END
2) Pass the variable into Dynamic SQL
CREATE PROC dbo.TEST2 (@NameStyle INT)
AS
BEGIN
    DECLARE @Name NVARCHAR(50) = N'Ken';
    DECLARE @String NVARCHAR(MAX);
    SET @String =
        N'SELECT *
          FROM Person.Person
          WHERE FirstName = @Other
            AND NameStyle = @NameStyle';
    EXEC sp_executesql @String
        , N'@Other VARCHAR(50), @NameStyle INT'
        , @Other = @Name
        , @NameStyle = @NameStyle;
END
Both plans will produce the same results. I could have used EXEC by itself, but sp_executesql can cache the entire select statement (plus, it's safer against SQL injection).
Notice how in both cases the level of the instance allowed SQL Server to transform the variable into a constant value (meaning it entered the object with a set value), and then the Optimizer was capable of choosing the most efficient execution plan available.
-- Remove Procs
DROP PROC dbo.TEST1
DROP PROC dbo.TEST2
A great article was highlighted in the comment section of the OP, but you can see it here: Optimizing Variables and Parameters - SQLMAG

SQL Server - more efficient hashing?

I am caching data locally from a remote SQL server, and need to know when the remote data has changed so I can update my cache. The remote data has no modified date column, and I don't have permissions to add columns or create triggers, so there are a lot of options that are closed to me. My current process is to calculate the checksum of the remote table:
select checksum_agg(binary_checksum(*)) from remoteTable;
and then if that value changes, I know I need to update my cache. This basically works, with two flaws:
There are lots of changes that won't cause the checksum to change
All I know is "something changed" and I need to copy the entire remote table into my local cache.
In order to be more efficient and more correct, I want to upgrade my process to:
Use MD5 hash
Return multiple hash values, one for each "chunk" of the output table (in my case, every 10,000 rows) so that I can update only the "chunk" that changed
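(As a point of reference, a per-chunk variant of the existing checksum approach would be a sketch like the following; it is weaker than MD5, but it illustrates the same id/10000 chunking that the procedure below implements:)
SELECT id / 10000 AS chunk_id,
       CHECKSUM_AGG(BINARY_CHECKSUM(*)) AS chunk_checksum
FROM myTable
GROUP BY id / 10000;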
I've created this stored procedure that relies on the (apparently undocumented) fn_repl_hash_binary function:
use dataBase;

if exists (SELECT * FROM sys.objects WHERE type = 'P' AND name = 'GetHash')
    drop procedure GetHash;

create procedure GetHash
as
begin
    declare @j int, @x nvarchar(max), @b varbinary(max), @h nvarchar(40)
    set @j = 0

    IF OBJECT_ID('tempdb.dbo.##TempHashTable', 'U') IS NOT NULL
        DROP TABLE ##TempHashTable;

    create table ##TempHashTable(id int, xlen int, hash nvarchar(40))

    while (@j < 7)
    begin
        set @x = (select * from myTable where (id/10000) = @j for xml auto)
        set @b = convert(varbinary(max), @x)
        set @h = convert(char, sys.fn_repl_hash_binary(@b), 2)
        insert into ##TempHashTable(id, xlen, hash) values(@j, len(@x), @h)
        set @j = @j + 1
    end
end;
This works as expected; the only drawback is that it's rather slow. Checksumming the entire table takes 0.1 seconds (including all the network roundtrips), while "exec GetHash" takes about 0.7 seconds. That's not bad, but I'm wondering if I can do better.
'id' is the primary key of myTable, so the select should be efficient. What I'm seeing is that the XML being generated is around 2 megabytes for each "chunk", with most of that being repetitive XML tag names and angle brackets.
Is there a way to output the values from myTable in a less readable (but more efficient) format than XML?
Are there other obvious inefficiencies in my stored procedure that would be causing it to be slow?
Yes, there are issues with this stored procedure (like, it only works on one table and I'm assuming the table will always contain fewer than 70,000 rows). Once I'm happy with the performance I'll put in all the niceties.
I just added some timing code, here's what I'm seeing:
Each "select" is taking about 60 msec
Calculating each hash is taking about 25 msec

Which column is being truncated? [duplicate]

The year is 2010.
SQL Server licenses are not cheap.
And yet, this error still does not indicate the row or the column or the value that produced the problem. Hell, it can't even tell you whether it was "string" or "binary" data.
Am I missing something?
A quick-and-dirty way of fixing these is to select the rows into a new physical table like so:
SELECT * INTO dbo.MyNewTable FROM <the rest of the offending query goes here>
...and then compare the schema of this table to the schema of the table into which the INSERT was previously going - and look for the larger column(s).
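A hedged sketch of that comparison, using INFORMATION_SCHEMA (MyTargetTable is a hypothetical name for the destination of the INSERT):
SELECT s.COLUMN_NAME,
       s.CHARACTER_MAXIMUM_LENGTH AS source_len,   -- note: -1 means MAX
       t.CHARACTER_MAXIMUM_LENGTH AS target_len
FROM INFORMATION_SCHEMA.COLUMNS s
JOIN INFORMATION_SCHEMA.COLUMNS t
  ON t.COLUMN_NAME = s.COLUMN_NAME
WHERE s.TABLE_NAME = 'MyNewTable'        -- table created by SELECT ... INTO above
  AND t.TABLE_NAME = 'MyTargetTable'     -- hypothetical destination of the INSERT
  AND s.CHARACTER_MAXIMUM_LENGTH > t.CHARACTER_MAXIMUM_LENGTH;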
I realize that this is an old one. Here's a small piece of code that I use that helps.
What this does is return a table of the max column lengths in the table you're trying to select from. You can then compare the field lengths to the max returned for each column and figure out which ones are causing the issue. Then it's just a simple query to clean up the data or exclude it.
DECLARE @col NVARCHAR(50)
DECLARE @sql NVARCHAR(MAX);

CREATE TABLE ##temp (colname nvarchar(50), maxVal int)

DECLARE oloop CURSOR FOR
    SELECT COLUMN_NAME
    FROM INFORMATION_SCHEMA.COLUMNS
    WHERE TABLE_NAME = 'SOURCETABLENAME' AND TABLE_SCHEMA = 'dbo'

OPEN oloop
FETCH NEXT FROM oloop INTO @col;
WHILE (@@FETCH_STATUS = 0)
BEGIN
    SET @sql = '
        DECLARE @val INT;
        SELECT @val = MAX(LEN(' + @col + ')) FROM dbo.SOURCETABLENAME;
        INSERT INTO ##temp
            ( colname, maxVal )
        VALUES ( N''' + @col + ''', -- colname - nvarchar(50)
                 @val -- maxVal - int
               )';
    EXEC(@sql);
    FETCH NEXT FROM oloop INTO @col;
END
CLOSE oloop;
DEALLOCATE oloop

SELECT * FROM ##temp
DROP TABLE ##temp;
Another way here is to use binary search.
Comment out half of the columns in your code and try again. If the error persists, comment out another half of that half and try again. You will narrow your search down to just two columns in the end.
You could check the length of each inserted value with an if condition, and if the value needs more width than the current column width, truncate the value and throw a custom error.
That should work if you just need to identify which field is causing the problem. I don't know if there's any better way to do this though.
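A minimal sketch of that idea, with a hypothetical value @val headed for a 50-character column:
DECLARE @val NVARCHAR(4000) = N'incoming value';  -- hypothetical value to insert
DECLARE @maxLen INT = 50;                         -- width of the destination column
DECLARE @actualLen INT = LEN(@val);

IF @actualLen > @maxLen
    RAISERROR('Value is %d chars but the destination column only allows %d.', 16, 1, @actualLen, @maxLen);
    -- or truncate instead of raising: SET @val = LEFT(@val, @maxLen);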
Recommend you vote for the enhancement request on Microsoft's site. It's been active for 6 years now so who knows if Microsoft will ever do anything about it, but at least you can be a squeaky wheel: Microsoft Connect
For string truncation, I came up with the following solution to find the max lengths of all of the columns:
1) Select all of the data to a temporary table (supply column names where needed), e.g.
SELECT col1
,col2
,col3_4 = col3 + '-' + col4
INTO #temp;
2) Run the following SQL Statement in the same connection (adjust the temporary table name if needed):
DECLARE @table VARCHAR(MAX) = '#temp'; -- change this to your temp table name
DECLARE @select VARCHAR(MAX) = '';
DECLARE @prefix VARCHAR(256) = 'MAX(LEN(';
DECLARE @suffix VARCHAR(256) = ')) AS max_';
DECLARE @nl CHAR(2) = CHAR(13) + CHAR(10);

SELECT @select = @select + @prefix + name + @suffix + name + @nl + ','
FROM tempdb.sys.columns
WHERE object_id = object_id('tempdb..' + @table);

SELECT @select = 'SELECT ' + @select + '0' + @nl + 'FROM ' + @table

EXEC(@select);
It will return a result set with the column names prefixed with 'max_' and show the max length of each column.
Once you identify the faulty column you can run other select statements to find extra long rows and adjust your code/data as needed.
I can't think of a good way really.
I once spent a lot of time debugging a very informative "Division by zero" message.
Usually you comment out various pieces of output code to find the one causing problems.
Then you take the piece you found and make it return a value that indicates there's a problem instead of the actual value (in your case, you would replace the string output with the LEN() of the output). Then manually compare that to the length of the column you're inserting it into.
From the line number in the error message, you should be able to identify the INSERT query that is causing the error. Modify it into a SELECT query and include AND LEN(your_expression_or_column_here) > CONSTANT_COL_INT_LEN for the various string columns in your query. Look at the output and it will give you the bad rows.
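For example (a hypothetical probe; SourceTable, SomeColumn, and the width 50 are placeholders):
SELECT *
FROM dbo.SourceTable
WHERE LEN(SomeColumn) > 50;  -- 50 = declared width of the destination column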
Technically, there isn't a row to point to because SQL didn't write the data to the table. I typically just capture the trace, run it in Query Analyzer (unless the problem is already obvious from the trace, which it may be in this case), and quickly debug from there with the age-old "modify my UPDATE to a SELECT" method. Doesn't it really just break down to one of two things:
a) Your column definition is wrong, and the width needs to be changed
b) Your column definition is right, and the app needs to be more defensive
?
The best thing that worked for me was to put the rows into a temporary table first, using select .... into #temptable.
Then I took the max length of each column in that temp table, e.g. select max(len(jobid)) as Jobid, ....
and then compared that to the source table field definition.

Filter results containing drive paths by parent paths in t-sql

I have a table that contains mappings of UserIds to some paths on disk (e.g. \\UNCserver\path or C:\user\has\a\folder). I control the data and there are no trailing \ symbols in the DB.
Periodically, I need to select the user IDs that are assigned a parent path of the path in question. E.g. if I have an event in \\superserver\cluster\2, I want to get all user IDs that have any or all of the following paths:
\\superserver\cluster\2
\\superserver\cluster
\\superserver
I have a stored procedure that does just that, but it is extremely inefficient due to the string operations I use; for just 10,000 UserPaths records, I can load the CPU to 50% by invoking it just a few hundred times in a row.
How can I optimise this procedure?
CREATE PROCEDURE [dbo].[SelectUserIdsWithPath]
    @Path nvarchar(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    IF (@Path IS NOT NULL)
    BEGIN
        DECLARE @TempPath NVARCHAR(MAX)
        SET @TempPath = SUBSTRING(@Path, 0, LEN(@Path) + 1 - CHARINDEX('\', REVERSE(@Path)))

        IF (LEN(@Path) - LEN(REPLACE(@Path, '\', '')) = 1) -- we need to process path C:\
        BEGIN
            SET @TempPath = @TempPath + '\';
        END

        INSERT INTO Results(UserId)
        SELECT DISTINCT UserId FROM UserPaths
        WHERE
            UserId NOT IN (SELECT UserId FROM Results)
            AND (Path = @Path
                 OR CHARINDEX(Path, @TempPath, 0) <> 0)
    END
END
UPDATE: I have now changed the logic in my app so that figuring out the parent path is done in the app, which may have improved things a bit, but performance is still pathetic. Here is the updated proc listing:
CREATE PROCEDURE [dbo].[SelectUserIdsWithPath]
    @Path NVARCHAR(MAX),
    @ParentPath NVARCHAR(MAX)
AS
BEGIN
    SET NOCOUNT ON;

    IF (@Path IS NOT NULL AND @ParentPath IS NOT NULL)
    BEGIN
        INSERT INTO Results(UserId)
        SELECT DISTINCT UserId FROM UserPaths
        WHERE
            UserId NOT IN (SELECT UserId FROM Results)
            AND (Path = @Path
                 OR CHARINDEX(Path, @ParentPath, 0) <> 0)
    END
END
So the culprit is obviously the CHARINDEX() call. Unfortunately, I am still waiting on infrastructure to confirm whether we can turn Full-Text indexing on, but are there any alternatives?
Maybe use a CTE to extract the parent folders. Something like this:
create procedure SelectUserIdsWithPath
    @path varchar(250)
as begin
    With c
    As
    (Select cast(path as varchar(500)) path
     from (Select @path path) t
     Union all select
     Cast(substring(path,0,len(path)-charindex('\',reverse(path),0)+1) as varchar(500))
     From c
     where charindex('\',reverse(substring(path,0,len(path)-charindex('\',reverse(path),0)+1) ),0)>1)
    Select distinct userid from userpaths up where exists(select * from c where c.path=up.path)
end
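For example, a hypothetical call using the sample path from the question:
EXEC SelectUserIdsWithPath @path = '\\superserver\cluster\2';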
I doubt you need Full Text Search; 10k rows is a fairly small amount. There are probably a few things going on here that are affecting performance to varying degrees.
Any examples below are based on the original proc (as that should have been fine), but could easily be adapted to the updated proc by simply changing @TempPath to @ParentPath.
Not a performance issue, but SQL Server string indexes start at 1, not 0.
So the SUBSTRING and CHARINDEX calls should be using 1 instead of 0.
Why are you using NVARCHAR(MAX)? If you know that none of your paths are
over 4000 characters, you would be better off using NVARCHAR(4000)
for the input parameter datatype as well as local variable datatype.
The two fields in the CHARINDEX appear to be transposed as the
signature is:
CHARINDEX ( expressionToFind ,expressionToSearch [ ,
start_location ] ).
So it should be: CHARINDEX(@TempPath, Path).
It does not appear that you need CHARINDEX anyway. You should be just fine with:
[Path] = @TempPath
OR [Path] LIKE @TempPath + N'\%'
Note that @TempPath is now used in both conditions.
If using the original proc, be sure to remove the IF (LEN(@Path)...) BEGIN...END block; otherwise, don't bother adding the trailing \ to @ParentPath in the app code (for the C:\ case). In either case, a LIKE is probably better than CHARINDEX, as a LIKE with a trailing '%' and no leading '%' is basically a String.StartsWith, while CHARINDEX is a String.Contains.
The SELECT for the INSERT could maybe be improved by separating out the DISTINCT and the NOT IN into a second query, using a temp table to hold the results from the first query:
INSERT INTO #TempResults(UserId)
SELECT UserId
FROM UserPaths
WHERE [Path] = @TempPath
   OR [Path] LIKE @TempPath + N'\%'

INSERT INTO Results(UserId)
SELECT DISTINCT UserId
FROM #TempResults
WHERE UserId NOT IN (SELECT UserId FROM Results)
You should test the NOT IN condition in both queries to see where it works better.
Given that this proc is called "hundreds of times a minute" (via Service Broker) and that the Results table "gets cleared out around every minute": if at all possible, move the expensive operation (i.e. guaranteeing uniqueness of UserId via the DISTINCT and NOT IN subquery) away from the process that runs hundreds of times per minute to the operation that runs about once per minute. So, a) remove the Unique Constraint on the Results table, b) update the process that consumes the Results table to include the DISTINCT, and c) use the following, simplified INSERT...SELECT:
INSERT INTO Results(UserId)
SELECT UserId
FROM UserPaths
WHERE [Path] = @TempPath
   OR [Path] LIKE @TempPath + N'\%'
If Service Broker is configured to run this proc via multiple threads, then you are also experiencing contention between the INSERT operations and the SELECT for the NOT IN subquery. This contention will be avoided by the removal of the NOT IN subquery.
[I will update the WHERE condition after getting clarification on how to determine valid matches]

SQL Server - Implementing sequences

I have a system which requires I have IDs on my data before it goes to the database. I was using GUIDs, but found them to be too big to justify the convenience.
I'm now experimenting with implementing a sequence generator which basically reserves a range of unique ID values for a given context. The code is as follows;
ALTER PROCEDURE [dbo].[Sequence.ReserveSequence]
    @Name varchar(100),
    @Count int,
    @FirstValue bigint OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    -- Ensure the parameters are valid
    IF (@Name IS NULL OR @Count IS NULL OR @Count < 0)
        RETURN -1;

    -- Reserve the sequence
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION

    -- Get the sequence ID, and the last reserved value of the sequence
    DECLARE @SequenceID int;
    DECLARE @LastValue bigint;

    SELECT TOP 1 @SequenceID = [ID], @LastValue = [LastValue]
    FROM [dbo].[Sequences]
    WHERE [Name] = @Name;

    -- Ensure the sequence exists
    IF (@SequenceID IS NULL)
    BEGIN
        -- Create the new sequence
        INSERT INTO [dbo].[Sequences] ([Name], [LastValue])
        VALUES (@Name, @Count);

        -- The first reserved value of a sequence is 1
        SET @FirstValue = 1;
    END
    ELSE
    BEGIN
        -- Update the sequence
        UPDATE [dbo].[Sequences]
        SET [LastValue] = @LastValue + @Count
        WHERE [ID] = @SequenceID;

        -- The sequence start value will be the last previously reserved value + 1
        SET @FirstValue = @LastValue + 1;
    END

    COMMIT TRANSACTION
END
The 'Sequences' table is just an ID, Name (unique), and the last allocated value of the sequence. Using this procedure I can request N values in a named sequence and use these as my identifiers.
This works great so far - it's extremely quick since I don't have to constantly ask for individual values, I can just use up a range of values and then request more.
The problem is that at extremely high frequency, calling the procedure concurrently can sometimes result in a deadlock. I have only seen this occur when stress testing, but I'm worried it'll crop up in production. Are there any notable flaws in this procedure, and can anyone recommend any way to improve it? It would be nice to do this without transactions, for example, but I do need it to be 'thread safe'.
MS themselves offer a solution and even they say it locks/deadlocks.
If you were to add lock hints, you'd reduce concurrency for your high loads.
Options:
You could develop against the "Denali" CTP which is the next release
Use IDENTITY and the OUTPUT clause like everyone else
Adopt/modify the solutions above
On DBA.SE there is "Emulate a TSQL sequence via a stored procedure": see dportas' answer which I think extends the MS solution.
I'd recommend sticking with the GUIDs if, as you say, this is mostly about composing data ready for a bulk insert (it's simpler than what I present below).
As an alternative, could you work with a restricted count? Say, 100 ID values at a time? In that case, you could have a table with an IDENTITY column, insert into that table, return the generated ID (say, 39), and then your code could assign all values between 3900 and 3999 (e.g. multiply up by your assumed granularity) without consulting the database server again.
Of course, this could be extended to allocate multiple IDs in a single call, provided that you're okay with some IDs potentially going unused. E.g. you need 638 IDs, so you ask the database to assign you 7 new ID values (which implies that you've allocated 700 values), use the 638 you want, and the remaining 62 never get assigned.
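A hedged sketch of that scheme with a granularity of 100 (the table and column names are hypothetical):
CREATE TABLE dbo.IdBlocks
(
    BlockId    bigint IDENTITY(1,1) PRIMARY KEY,
    ReservedAt datetime NOT NULL DEFAULT GETUTCDATE()
);

DECLARE @Blocks TABLE (BlockId bigint);

-- Reserve one block of 100 IDs
INSERT INTO dbo.IdBlocks
OUTPUT INSERTED.BlockId INTO @Blocks
DEFAULT VALUES;

-- If BlockId comes back as 39, the caller owns IDs 3900..3999
SELECT BlockId * 100      AS FirstId,
       BlockId * 100 + 99 AS LastId
FROM @Blocks;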
Can you get some kind of deadlock trace? For example, enable trace flag 1222 as shown here. Duplicate the deadlock. Then look in the SQL Server log for the deadlock trace.
Also, you might inspect what locks are taken out in your code by inserting a call to exec sp_lock or select * from sys.dm_tran_locks immediately before the COMMIT TRANSACTION.
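For example (a sketch; trace flag 1222 is global and requires sysadmin rights):
DBCC TRACEON (1222, -1);  -- write deadlock graphs to the SQL Server error log
-- ...reproduce the deadlock under load, then read the log:
EXEC sp_readerrorlog;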
Most likely you are observing a conversion deadlock. To avoid them, you want to make sure that your table is clustered and has a PK, but this advice is specific to 2005 and 2008 R2, and they can change the implementation, rendering this advice useless. Google up "Some heap tables may be more prone to deadlocks than identical tables with clustered indexes".
Anyway, if you observe an error during stress testing, it is likely that sooner or later it will occur in production as well.
You may want to use sp_getapplock to serialize your requests. Google up "Application Locks (or Mutexes) in SQL Server 2005". Also I described a few useful ideas here: "Developing Modifications that Survive Concurrency".
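A hedged sketch of serializing the reservation with sp_getapplock (the resource name is arbitrary; a transaction-owned lock is released at COMMIT or ROLLBACK, and the return value should be checked in real code):
BEGIN TRANSACTION;

EXEC sp_getapplock @Resource = 'ReserveSequence',
                   @LockMode = 'Exclusive',
                   @LockOwner = 'Transaction',
                   @LockTimeout = 5000;

-- ...the existing SELECT/UPDATE/INSERT against [dbo].[Sequences] goes here...

COMMIT TRANSACTION;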
I thought I'd share my solution. It doesn't deadlock, nor does it produce duplicate values. An important difference between this and my original procedure is that it doesn't create the sequence if it doesn't already exist:
ALTER PROCEDURE [dbo].[ReserveSequence]
(
    @Name nvarchar(100),
    @Count int,
    @FirstValue bigint OUTPUT
)
AS
BEGIN
    SET NOCOUNT ON;

    IF (@Count <= 0)
    BEGIN
        SET @FirstValue = NULL;
        RETURN -1;
    END

    DECLARE @Result TABLE ([LastValue] bigint)

    -- Update the sequence last value, and get the previous one
    UPDATE [Sequences]
    SET [LastValue] = [LastValue] + @Count
    OUTPUT INSERTED.LastValue INTO @Result
    WHERE [Name] = @Name;

    -- Select the first value
    SELECT TOP 1 @FirstValue = [LastValue] + 1 FROM @Result;
END
