Sql Server Delete and Merge performance - sql-server

I've table that contains some buy/sell data, with around 8M records in it:
CREATE TABLE [dbo].[Transactions](
[id] [int] IDENTITY(1,1) NOT NULL,
[itemId] [bigint] NOT NULL,
[dt] [datetime] NOT NULL,
[count] [int] NOT NULL,
[price] [float] NOT NULL,
[platform] [char](1) NOT NULL
) ON [PRIMARY]
Every X mins my program gets new transactions for each itemId and I need to update it. My first solution is two step DELETE+INSERT:
delete from Transactions where platform=#platform and itemid=#itemid
insert into Transactions (platform,itemid,dt,count,price) values (#platform,#itemid,#dt,#count,#price)
[...]
insert into Transactions (platform,itemid,dt,count,price) values (#platform,#itemid,#dt,#count,#price)
The problem is, that this DELETE statement takes average 5 seconds. It's much too long.
The second solution I found is to use MERGE. I've created such Stored Procedure, wchich takes Table-valued parameter:
CREATE PROCEDURE [dbo].[sp_updateTransactions]
#Table dbo.tp_Transactions readonly,
#itemId bigint,
#platform char(1)
AS
BEGIN
MERGE Transactions AS TARGET
USING #Table AS SOURCE
ON (
TARGET.[itemId] = SOURCE.[itemId] AND
TARGET.[platform] = SOURCE.[platform] AND
TARGET.[dt] = SOURCE.[dt] AND
TARGET.[count] = SOURCE.[count] AND
TARGET.[price] = SOURCE.[price] )
WHEN NOT MATCHED BY TARGET THEN
INSERT VALUES (SOURCE.[itemId],
SOURCE.[dt],
SOURCE.[count],
SOURCE.[price],
SOURCE.[platform])
WHEN NOT MATCHED BY SOURCE AND TARGET.[itemId] = #itemId AND TARGET.[platform] = #platform THEN
DELETE;
END
This procedure takes around 7 seconds with table with 70k records. So with 8M it would probably take few minutes. The bottleneck is "When not matched" - when I commented this line, this procedure runs on average 0,01 second.
So the question is: how to improve perfomance of the delete statement?
Delete is needed to make sure, that table doesn't contains transaction that as been removed in application. But it real scenario it happens really rarely, ane the true need of deleting records is less than 1 on 10000 transaction updates.
My theoretical workaround is to create additional column like "transactionDeleted bit" and use UPDATE instead of DELETE, ane then make table cleanup by batch job every X minutes or hours and Execute
delete from transactions where transactionDeleted=1
It should be faster, but I would need to update all SELECT statements in other parts of application, to use only transactionDeleted=0 records and so it also may afect application performance.
Do you know any better solution?
UPDATE: Current indexes:
CREATE NONCLUSTERED INDEX [IX1] ON [dbo].[Transactions]
(
[platform] ASC,
[ItemId] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 50) ON [PRIMARY]
CONSTRAINT [IX2] UNIQUE NONCLUSTERED
(
[ItemId] DESC,
[count] ASC,
[dt] DESC,
[platform] ASC,
[price] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

OK, here is another approach also. For a similar problem (large scan WHEN NOT MATCHED BY SOURCE then DELETE) I reduced the MERGE execute time from 806ms to 6ms!
One issue with the problem above is that the "WHEN NOT MATCHED BY SOURCE" clause is scanning the whole TARGET table.
It is not that obvious but Microsoft allows the TARGET table to be filtered (by using a CTE) BEFORE doing the merge. So in my case the TARGET rows were reduced from 250K to less than 10 rows. BIG difference.
Assuming that the above problem works with the TARGET being filtered by #itemid and #platform then the MERGE code would look like this. The changes above to the indexes would help this logic too.
WITH Transactions_CTE (itemId
,dt
,count
,price
,platform
)
AS
-- Define the CTE query that will reduce the size of the TARGET table.
(
SELECT itemId
,dt
,count
,price
,platform
FROM Transactions
WHERE itemId = #itemId
AND platform = #platform
)
MERGE Transactions_CTE AS TARGET
USING #Table AS SOURCE
ON (
TARGET.[itemId] = SOURCE.[itemId]
AND TARGET.[platform] = SOURCE.[platform]
AND TARGET.[dt] = SOURCE.[dt]
AND TARGET.[count] = SOURCE.[count]
AND TARGET.[price] = SOURCE.[price]
)
WHEN NOT MATCHED BY TARGET THEN
INSERT
VALUES (
SOURCE.[itemId]
,SOURCE.[dt]
,SOURCE.[count]
,SOURCE.[price]
,SOURCE.[platform]
)
WHEN NOT MATCHED BY SOURCE THEN
DELETE;

Using a BIT field for IsDeleted (or IsActive as many people do) is valid but it does require modifying all code plus creating a separate SQL Job to periodically come through and remove the "deleted" records. This might be the way to go but there is something less intrusive to try first.
I noticed in your set of 2 indexes that neither is CLUSTERED. Can I assume that the IDENTITY field is? You might consider making the [IX2] UNIQUE index the CLUSTERED one and changing the PK (again, I assume the IDENTITY field is a CLUSTERED PK) to be NONCLUSTERED. I would also reorder the IX2 fields to put [Platform] and [ItemID] first. Since your main operation is looking for [Platform] and [ItemID] as a set, physically ordering them this way might help. And since this index is unique, that is a good candidate for being CLUSTERED. It is certainly worth testing as this will impact all queries against the table.
Also, if changing the indexes as I have suggested helps, it still might be worth trying both ideas and hence doing the IsDeleted field as well to see if that increases performance even more.
EDIT:
I forgot to mention, by making the IX2 index CLUSTERED and moving the [Platform] field to the top, you should get rid of the IX1 index.
EDIT2:
Just to be very clear, I am suggesting something like:
CREATE UNIQUE CLUSTERED INDEX [IX2]
(
[ItemId] DESC,
[platform] ASC,
[count] ASC,
[dt] DESC,
[price] ASC
)
And to be fair, changing which index is CLUSTERED could also negatively impact queries where JOINs are done on the [id] field which is why you need to test thoroughly. In the end you need to tune the system for your most frequent and/or expensive queries and might have to accept that some queries will be slower as a result but that might be worth this operation being much faster.

See this https://stackoverflow.com/questions/3685141/how-to-....
would the update be the same cost as a delete? No. The update would be
a much lighter operation, especially if you had an index on the PK
(errrr, that's a guid, not an int). The point being that an update to
a bit field is much less expensive. A (mass) delete would force a
reshuffle of the data.
In light of this information, your idea to use a bit field is very valid.

Related

What indexes should I create to speed up my queries on this table in SQL Server?

I have a table with the following definition:
CREATE TABLE [dbo].[Transactions]
(
[ID] [varchar](18) NOT NULL,
[TIME_STAMP] [datetime] NOT NULL,
[AMT] [decimal](18, 4) NOT NULL,
[CID] [varchar](90) NOT NULL,
[DEPARTMENT] [varchar](4) NULL,
[SOURCE] [varchar](14) NULL,
PRIMARY KEY NONCLUSTERED
(
[ID] ASC
)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The table has 75 million rows in it. Somehow, it takes up 20 GB of disk space!
The following 2 queries...
SELECT
SUM(AMT)
FROM
Transactions
WHERE
TIME_STAMP >= '2017-11-11 00:00:00' AND
TIME_STAMP < '2017-11-12 00:00:00' AND
DEPARTMENT = 'Shoes' AND
SOURCE = 'Website'
SELECT
COUNT(DISTINCT(CID))
FROM
Transactions
WHERE
TIME_STAMP >= '2017-11-11 00:00:00' AND
TIME_STAMP < '2017-11-12 00:00:00' AND
DEPARTMENT = 'Accessories' AND
SOURCE = 'Mobile'
...each take about 2 minutes to run!
The DEPARTMENT and SOURCE fields are of low cardinality, they contain only a few distinct values.
Please advise on what I need to do, which indexes I need to create with which settings to optimize performance of these queries.
Thank you!
The best way to solve this specific query would be a composite index (one index with multiple columns) in this order:
Department
Source
Timestamp
Try to put the most selective column first, so if source has more possible variations than department, put it first. The date will obviously go last since it will trigger an index scan.
CREATE INDEX IX_Transactions ON Transactions(TIME_STAMP,DEPARTMENT,SOURCE) INCLUDE(AMT,CID)
I would create an index using Timestamp, Department and Source. I would also add AMT and CID as included columns. This means both your queries could be satisfied by reading the index and not having to hit the parent table at all.
CREATE INDEX IX_Transactions ON Transactions(TIME_STAMP,DEPARTMENT,SOURCE) INCLUDE(AMT,CID)
One additional option to consider is to run the Execution Plan and see if it recommends an index. I do this a lot when considering indexes because I have seen performance improvements from Execution Plan recommended indexes, over indexes that I thought were good, but were not intuitive.

Selecting rows first by Id then by datetime - with or without a subquery?

I need to create statistics from several log tables. Most of the time every hour but sometimes more frequently every 5 minutes.
Selecting rows only by datetime isn't fast enough for larger logs so I thought I select only rows that are new since the last query by storing the max Id and reusing it next time:
SELECT TOP(1000) * -- so that it's not too much
FROM [dbo].[Log]
WHERE Id > lastId AND [Timestamp] >= timestampMin
ORDER BY [Id] DESC
My question: is the SQL Server smart enough to:
first filter the rows by Id and then by the Timestamp even if I change the order of the conditions or does the condition order matter or
do I need a subquery to first select the rows by Id and then filter them by the Timestamp.
with subquery:
SELECT *
FROM (
SELECT TOP(1000) * FROM [dbo].[Log]
WHERE Id > lastId
ORDER BY [Id] DESC
) t
WHERE t.[TimeStamp] >= timestampMin
The table schema is:
CREATE TABLE [dbo].[Log](
[Id] [int] IDENTITY(1,1) NOT NULL,
[Timestamp] [datetime2](7) NOT NULL,
-- other columns
CONSTRAINT [PK_dbo_Log] PRIMARY KEY CLUSTERED
(
[Id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 80) ON [PRIMARY]
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
I tried to use the query plan to find out how it works but it turns out that I cannot read it and I don't understand it.
In your case you don't have an index on TimeStamp so SQL Server will always use the Clustered Index (Id) first (the Clustered index seek you see in the query plan) to find the first row matching Id > lastId and then perform a scan on the remaining rows with the predicate [Timestamp] >= timestampMin (actually is the other way around since you are sorting in reverse order with DESC).
If you were to add a index on TimeStamp SQL Server might use it based on:
the cardinality of the predicate [Timestamp] >= timestampMin. Please note that cardinality is always an estimate based on statistics (see https://msdn.microsoft.com/en-us/library/ms190397.aspx) and the cardinality estimator (it changed from SQL 2012 to 2014+, see https://msdn.microsoft.com/en-us/library/dn600374.aspx).
how covering the non-clustered index is (since you are using the wildcard it would hardly matters anyway). If the non-clustered index is non covering SQL Server would have to add a Key Lookup (see https://technet.microsoft.com/en-us/library/bb326635(v=sql.105).aspx) operator in order to retrieve all the fields (or perform a join). This will likely make the index not worthwhile for this query.
Also note that your two queries - the one with subplan and the one without - are functionally different. The first will give you the first 1000 rows the have both Id > lastId AND [Timestamp] >= timestampMin. The second will give you only the rows having [Timestamp] >= timestampMin from the first 1000 rows having Id > lastId. So, for example, you might get 1000 rows from the first query but less than that on the second one.

Is it possible to write a Filtered Index where the column value is NOT LIKE 'X'?

I'm trying to optimize a table lookup because the execution plan shows a pretty hefty parallelized table scan. The table is called Opportunity and the column I'm filtering on is Name. Specifically I want all rows that don't have "Supplement" as part of the Name:
WHERE ([Name] NOT LIKE '%Supplement%');
I was looking around for a way to optimize this and came across filtered indexes which is what I need, but they don't seem to like the LIKE keyword. Is there an alternate way to creating a filtered index like this?
The table has ~53k rows, and when directly querying the server it takes 4 seconds to get the data, but when I query it as a linked server (which is what I need) it takes 2 minutes. In an attempt to improve this time, I moved the query out of my script that was talking to the linked server and created a view on the remote server. Still takes forever.
Here's what I've tried so far, but the SSMS says it's invalid:
CREATE NONCLUSTERED INDEX [FX_NotSupplementOpportunities]
ON [Opportunity]([Name])
WHERE (([Name] NOT LIKE '%Supplement%')
AND ([Name] NOT LIKE '%Suplement%')
AND ([Name] NOT LIKE '%Supplament%')
AND ([Name] NOT LIKE '%Suppliment%'));
Thanks in advance for any suggestions!
You might use a Indexes on Computed Columns
an example would be:
CREATE TABLE [dbo].[MyTab](
[ID] [int] IDENTITY(1,1) NOT NULL,
[Text] [varchar](max) NULL,
[OK] AS (case when NOT [text] like '%abc%' then (1) else (0) end) PERSISTED NOT NULL,
CONSTRAINT [PK_MyTab] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [idx_OK] ON [dbo].[MyTab]
(
[OK] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Unfortunately, there are many limitations on what you can put in a condition of a filtered index.
I can't find anything specific in MSDN but this blog post by Brent Ozar: What You Can (and Can’t) Do With Filtered Indexes mentions several of the limitations:
You can't use BETWEEN, NOT IN, CASE expressions, OR.
They don't mention LIKE but simple testing (as you did) confirms that you can't either. You can't even use (NOT (a >= 7)) which can be rewritten to the allowed (a < 7).
One thought would be to use a case expression on a persisted column and then use the persisted columns in the filtered index - but that's another limitation of the filtered indexes!
So, what you do? The only thing that comes to mind is to create a persisted column and use it in a simple (not filtered) index. Something like:
ALTER TABLE dbo.Opportunity
ADD special_condition AS (CASE WHEN [Name] NOT LIKE '%Supplement%'
THEN 1 ELSE 0 END)
PERSISTED;
Then add an index, using the column:
CREATE NONCLUSTERED INDEX FX_NotSupplementOpportunities
ON dbo.Opportunity
(special_condition, [Name]) ;
and use the (WHERE special_condition = 1) in your queries.
I have chosen #bummi's answer because it was the closest to what I attempted, but it wasn't what I ended up using. A little explanation...
So, after many hours of trying to figure out how to make the query look-up faster I actually got it from a bunch of parallelized chunks down to two index seeks. I was super ecstatic about that, but in the end I had to scrap it. The problem is that the remote database is actually a backup of our Salesforce data. I had to go through some very complex table and column alterations pre and post sync that just didn't work correctly and would get erased each sync (every 10 minutes).
While doing all of that it eventually hit me that I was importing the data and then further formatting it again on my end. I decided instead to update the views on the remote server and format the data there as much as possible and then import it. So, I spent a couple of hours re-writing the SQL on both sides and I managed to get the ~25min script down to ~3min, which makes me very happy and satisfied.
In the end, although the query look-ups on the remote server are unoptimized they're still quite fast, mostly because on average there's no more than ~50k rows in most of the tables I'm touching...

table needs indexes to improve performance

I was having timeout issue when giving long period of DateTime in below query (query runs from c# application). Table had 30 million rows with a non-clustered index on ID(not a primary key).
Found that there was no primary key so I recently updated ID as Primary Key, it’s not giving me timeout now. Can anyone help me for the below query to create index on more than one key for future and also if I remove non clustered index from this table and create on more than one column? Data is increasing rapidly and need improvement on performace
select
ID, ReferenceNo, MinNo, DateTime, DataNo from tbl1
where
DateTime BETWEEN '04/09/2013' AND '20/11/2013'
and ReferenceNo = 4 and MinNo = 3 and DataNo = 14 Order by ID
this is the create script
CREATE TABLE [dbo].[tbl1]( [ID] [int] IDENTITY(1,1) not null, [ReferenceNo] [int] not null, [MinNo] [int] not null, [DateTime] [datetime] not null, [DataNo] [int] not null, CONSTRAINT [tbl1_pk] PRIMARY KEY CLUSTERED ([ID] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS
= ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY]
Its hard to tell which index you should use without knowing more about your database and how its used.
You may want to change the ID column to a clustered index. If ID is an identity column you will get very few page splits while inserting new data. It will however require you to rebuild the table and that may be a problem depending on your usage of the database. You will be looking at some downtime.
If you want a covering index it should look something like this:
CREATE NONCLUSTERED INDEX [MyCoveringIndex] ON tbl1
(
[ReferenceNo] ASC,
[MinNo] ASC,
[DataNo] ASC,
[DateTime ] ASC
)
Its no need to include ID as a column as its already in the clusted index (clusted index columns will be included in all other indexes). This will however use up a whole lot of space (somewhere in the range of 1GB if the columns above are of the types int and datetime). It will also affect your insert, update and delete performance on the table in (most cases) a negative way.
You can create the index in online mode if you are using Enterprice Edition of SQL server. In all other cases there will be a lock on the table while creating the index.
Its also hard to know what other queries that are made against the table. You may want to tweek the order of the columns in the index to better match other queries.
Indexing all fields would be fastest, but would likely waste a ton of space. I would guess that a date index would provide the most benefit with the least storage capacity cost because the data is probably evenly spread out over a large period of time. If the MIN() MAX() dates are close together, then this will not be as effective:
CREATE NONCLUSTERED INDEX [IDX_1] ON [dbo].[tbl1] (
[DateTime] ASC
)
GO
As a side note, you can use SSMSE's "Display Estimated Execution Plan" which will show you what the DB needs to do to get your data. It will suggest missing indexes and also provide CREATE INDEX statements. These suggestions can be quite wasteful, but they will give you an idea of what is taking so long. This option is in the Standard Toolbar, four icons to the right from "Execute".

How to speed up Xpath performance in SQL Server when searching for an element with specific text

I am supposed to remove whole rows and part of XML-documents from a table with an XML column based on a specific value in the XML column. However the table contains millions of rows and gets locked when I perform the operation. Currently it will take almost a week to clean it up, and the system is too critical to be taken offline for so long.
Are there any ways to optimize the xpath expressions in this script:
declare #slutdato datetime = '2012-03-01 00:00:00.000'
declare #startdato datetime = '2000-02-01 00:00:00.000'
declare #lev varchar(20) = 'suppliername'
declare #todelete varchar(10) = '~~~~~~~~~~'
CREATE TABLE #ids (selId int NOT NULL PRIMARY KEY)
INSERT into #ids
select id from dbo.proevesvar
WHERE leverandoer = #lev
and proevedato <= #slutdato
and proevedato >= #startdato
begin transaction /* delete whole rows */
delete from dbo.proevesvar
where id in (select selId from #ids)
and ProeveSvarXml.exist('/LaboratoryReport/LaboratoryResults/Result[Value=sql:variable(''#todelete'')]') = 1
and Proevesvarxml.exist('/LaboratoryReport/LaboratoryResults/Result[Value!=sql:variable(''#todelete'')]') = 0
commit
go
begin transaction /* delete single results */
UPDATE dbo.proevesvar SET ProeveSvarXml.modify('delete /LaboratoryReport/LaboratoryResults/Result[Value=sql:variable(''#todelete'')]')
where id in (select selId from #ids)
commit
go
The table definitions is:
CREATE TABLE [dbo].[ProeveSvar](
[ID] [int] IDENTITY(1,1) NOT NULL,
[CPRnr] [nchar](10) NOT NULL,
[ProeveDato] [datetime] NOT NULL,
[ProeveSvarXml] [xml] NOT NULL,
[Leverandoer] [nvarchar](50) NOT NULL,
[Proevenr] [nvarchar](50) NOT NULL,
[Lokationsnr] [nchar](13) NOT NULL,
[Modtaget] [datetime] NOT NULL,
CONSTRAINT [PK_ProeveSvar] PRIMARY KEY CLUSTERED
(
[ID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY],
CONSTRAINT [IX_ProeveSvar_1] UNIQUE NONCLUSTERED
(
[CPRnr] ASC,
[Lokationsnr] ASC,
[Proevenr] ASC,
[ProeveDato] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
The first insert statement is very fast. I believe I can handle the locking by committing 50 rows at a time, so other requests can be handled in between my transactions.
The total number of rows for this supplier is about 5.5 million and the total rowcount in the table is around 13 million.
I've not really used xpath within SQL server before, but something which stands out is that you're doing lots of reads and writes in the same command (in the second statement). If possible, change your queries to..
CREATE TABLE #ids (selId int NOT NULL PRIMARY KEY)
INSERT into #ids
select id from dbo.proevesvar
WHERE leverandoer = #lev
and proevedato <= #slutdato
and proevedato >= #startdato
and ProeveSvarXml.exist('/LaboratoryReport/LaboratoryResults/Result[Value=sql:variable(''#todelete'')]') = 1
and Proevesvarxml.exist('/LaboratoryReport/LaboratoryResults/Result[Value!=sql:variable(''#todelete'')]') = 0
begin transaction /* delete whole rows */
delete from dbo.proevesvar
where id in (select selId from #ids)
This means that the first query will only create the new temporary table, and not write anything back, which will take slightly longer than your original, but the key thing is that your second query will ONLY be deleting records based on what's in your temporary table.
What you'll probably find is because it's deleting records, it's constantly re-building indices, and causing the reads to also be slower.
I'd also delete/disable any indices/constraints that don't actually help your query run.
Also, you're creating your clustered primary key on the ID, which isn't always the best thing to do. Especially if you're doing lots of date scans.
Can you also view the estimated execution plan for the top query, it would be interesting to see the order in which it checks the conditions. If it's doing the date first, then that's fine, but if it's doing the xpath before it checks the date, you might have to separte it into 3 queries, or add a new clustered index on 'proevedato,id'. This should force the query to only run the xpath for records which actually match the date.
Hope this helps.

Resources