Updating a partitioned table with aligned index locks all partitions - sql-server

Partition function and scheme:
CREATE PARTITION FUNCTION SegmentPartitioningFunction (smallint)
AS RANGE LEFT FOR VALUES (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) ;
GO
CREATE PARTITION SCHEME SegmentPartitioningSchema
AS PARTITION SegmentPartitioningFunction ALL TO ([PRIMARY])
GO
I have a table (note that the table is an M:N link table):
CREATE TABLE PartTable
(
prd_id int,
cat_id smallint,
datacolumn smallint,
primary key (prd_id, cat_id) on SegmentPartitioningSchema(cat_id)
) on SegmentPartitioningSchema(cat_id)
NOTE: I tried all 3 combinations (partitioned table only, partitioned index only, and both table and index partitioned).
In addition, I have separate indexes on PRD_ID and CAT_ID by themselves, as they are foreign keys.
I set LOCK_ESCALATION to AUTO, as confirmed using this select:
SELECT lock_escalation_desc FROM sys.tables
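For reference, a minimal sketch of how the setting can be applied and checked for just this table (the ALTER statement is assumed, not shown above):
ALTER TABLE PartTable SET (LOCK_ESCALATION = AUTO);

-- Verify the setting for this table only
SELECT name, lock_escalation_desc
FROM sys.tables
WHERE name = 'PartTable';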
Now, here is what I am trying to do, which by my understanding should work:
begin tran
update PartTable set DataColumn = DataColumn where cat_id = 1
-- commit tran -- this doesn't happen yet
and at the same time, in a different connection:
begin tran
update PartTable set DataColumn = DataColumn where cat_id = 2
-- commit tran -- this doesn't happen yet
The execution plan is on Pastebin: http://pastebin.com/MgK6whYG
Bonus Q: Why is the TOP operator used?
By my understanding, I should get IX locks at the partition level, and only on one partition at a time, which means both transactions can run concurrently: the second one (where cat_id = 2) shouldn't have to wait for the first, as both the table and the index are partitioned and the index is aligned.
However, this doesn't happen, and when I check the execution plan I see that the scan touches every partition, even though that doesn't make sense to me.
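One way to check what is actually locked is to query sys.dm_tran_locks from a third connection while both transactions are still open (a diagnostic sketch):
-- IX/X locks with resource_type OBJECT are table level; escalation to a
-- single partition would instead appear as HOBT-level lock resources.
SELECT request_session_id,
       resource_type,
       resource_associated_entity_id,
       request_mode,
       request_status
FROM sys.dm_tran_locks
WHERE resource_database_id = DB_ID()
ORDER BY request_session_id, resource_type;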
I tried TechNet, the documentation, and Stack Overflow, but didn't find an answer.
I know now that the PK should be (CAT_ID, PRD_ID) - this makes more sense for the way the table is used.
I am currently working on SQL Server 2012, but I suppose it should work even better than on 2008 R2.
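For reference, a sketch of the table with that reordered key, keeping the same partitioning as above:
CREATE TABLE PartTable
(
    prd_id int NOT NULL,
    cat_id smallint NOT NULL,
    datacolumn smallint,
    PRIMARY KEY (cat_id, prd_id) ON SegmentPartitioningSchema (cat_id)
) ON SegmentPartitioningSchema (cat_id);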

Related

Do SQL Server partition switch on partition A while a long running query is consuming partition B

I'm looking for a way to improve our ability to ingest and consume data at the same time in an analytical environment (data warehouse). You may have faced a similar situation, and I'm interested in the various ways to improve it. Thanks in advance for reading through the situation and for any help you can offer!
The data ingestion process currently relies on a partition switching mechanism on top of Azure SQL Server. Multiple jobs run per day. Data is loaded into a staging table, and once the data is ready, a partition switch operation takes out the previous data on that partition and replaces it with the new data set. The target table is configured with a clustered columnstore index. In addition, the table is configured with the lock escalation mode set to AUTO.
Once the data is ingested, a monitoring system automatically pushes it into some Power BI datasets in import mode by triggering a refresh at partition level in PBI. The datasets read data from SQL Server in a very strict way, partition by partition, and never read data that is currently being ingested into the data warehouse. This system guarantees that the latest data is always up to date in the various PBI datasets with the shortest possible end-to-end delay.
When the ALTER TABLE ... SWITCH PARTITION statement is fired, it may be blocked by other running statements that are consuming data located in the same table, but in another partition. This is due to a Sch-S lock placed by the SELECT query on the table.
Having set the lock escalation mode to AUTO, I was expecting the Sch-S lock to be placed only at partition level, but this does not seem to be the case.
This situation is particularly annoying because the Power BI queries are quite long, as a lot of records need to be moved into the dataset. Since Power BI runs multiple queries in parallel, the locks can be held for a very long time in total, preventing the ingestion of new data from completing.
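A diagnostic sketch that shows who holds Sch-S locks on the target table and who is waiting behind them (names taken from the test script below):
-- Sch-S rows are the readers; the blocked switch waits for Sch-M (LCK_M_SCH_M).
SELECT l.request_session_id,
       l.request_mode,        -- Sch-S for readers, Sch-M requested by the switch
       l.request_status,      -- GRANT vs WAIT
       r.wait_type,
       r.blocking_session_id
FROM sys.dm_tran_locks AS l
LEFT JOIN sys.dm_exec_requests AS r
       ON r.session_id = l.request_session_id
WHERE l.resource_type = 'OBJECT'
  AND l.resource_associated_entity_id = OBJECT_ID('dbo.test');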
A few additional notes about the current setup:
I'm already taking advantage of the WAIT_AT_LOW_PRIORITY feature to ensure the partition switch does not block other queries from running in the meantime.
Partition switching was chosen because we work with snapshots of data, entirely replacing what was previously ingested for the same partition.
Loading data into a staging table lets the columnstore index be built faster than inserting into the final table directly.
Below is a test script showing the situation.
Hope this is all clear. Many thanks for your help!
Initialization
-- Clean any previous stuff.
drop table if exists dbo.test;
drop table if exists dbo.staging_test;
drop table if exists dbo.staging_out_test;
if exists(select 1 from sys.partition_schemes where name = 'ps_test') drop partition scheme ps_test;
if exists(select 1 from sys.partition_functions where name = 'pf_test') drop partition function pf_test;
go
-- Create partition function and scheme.
create partition function pf_test (int) as range right for values(1, 2, 3);
go
create partition scheme ps_test as partition pf_test all to ([primary]);
go
alter partition scheme ps_test next used [primary];
go
-- Data table.
create table dbo.test (
id int not null,
name varchar(100) not null
) on ps_test(id);
go
-- Staging table for data ingestion.
create table dbo.staging_test (
id int not null,
name varchar(100) not null
) on ps_test(id);
go
-- Staging table for taking out previous data.
create table dbo.staging_out_test (
id int not null,
name varchar(100) not null
) on ps_test(id);
go
-- Set clustered columnstore index on all tables.
create clustered columnstore index ix_test on dbo.test on ps_test(id);
create clustered columnstore index ix_staging_test on dbo.staging_test on ps_test(id);
create clustered columnstore index ix_staging_out_test on dbo.staging_out_test on ps_test(id);
go
-- Lock escalation mode is set to auto to allow locks at partition level.
alter table dbo.test set (lock_escalation = auto);
alter table dbo.staging_test set (lock_escalation = auto);
alter table dbo.staging_out_test set (lock_escalation = auto);
go
-- Insert some initial data...
insert into dbo.test (id, name) values(1, 'initial data partition 1'), (2, 'initial data partition 2');
insert into dbo.staging_test (id, name) values(1, 'new data partition 1'), (2, 'new data partition 2');
go
-- Display current data.
select * from dbo.test;
select * from dbo.staging_test;
select * from dbo.staging_out_test;
go
Long running query example (adjust variable @c to generate more or fewer records):
-- Generate a long running query hitting only one specific partition on the test table.
declare @i bigint = 1;
declare @c bigint = 100000;
with x as (
select @i n
union all
select n + 1
from x
where n < @c
)
select
d.name
from
x,
(
select name
from
dbo.test d
where
d.id = 2) d
option (MaxRecursion 0)
Partition swap example (to be run while the "long running query" above is running), showing the lock behavior:
select * from dbo.test;
-- Switch old data out.
alter table dbo.test switch partition $PARTITION.pf_test(1) to dbo.staging_out_test partition $PARTITION.pf_test(1) with(wait_at_low_priority (max_duration = 1 minutes, abort_after_wait = self));
-- Switch new data in.
alter table dbo.staging_test switch partition $PARTITION.pf_test(1) to dbo.test partition $PARTITION.pf_test(1) with(wait_at_low_priority (max_duration = 1 minutes, abort_after_wait = self));
go
select * from dbo.test;

unique index is not enforced if IsActive column is false [duplicate]

I have a situation where I need to enforce a unique constraint on a set of columns, but only for one value of a column.
So for example I have a table like Table(ID, Name, RecordStatus).
RecordStatus can only have a value 1 or 2 (active or deleted), and I want to create a unique constraint on (ID, RecordStatus) only when RecordStatus = 1, since I don't care if there are multiple deleted records with the same ID.
Apart from writing triggers, can I do that?
I am using SQL Server 2005.
Behold, the filtered index. From the documentation (emphasis mine):
A filtered index is an optimized nonclustered index especially suited to cover queries that select from a well-defined subset of data. It uses a filter predicate to index a portion of rows in the table. A well-designed filtered index can improve query performance as well as reduce index maintenance and storage costs compared with full-table indexes.
And here's an example combining a unique index with a filter predicate:
create unique index MyIndex
on MyTable(ID)
where RecordStatus = 1;
This essentially enforces uniqueness of ID when RecordStatus is 1.
Following the creation of that index, a uniqueness violation will raise an error:
Msg 2601, Level 14, State 1, Line 13
Cannot insert duplicate key row in object 'dbo.MyTable' with unique index 'MyIndex'. The duplicate key value is (9999).
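For example, with a hypothetical dbo.MyTable matching the question's shape, the filtered index allows any number of deleted duplicates but only one active row per ID:
CREATE TABLE dbo.MyTable
(
    ID           int         NOT NULL,
    Name         varchar(50) NOT NULL,
    RecordStatus tinyint     NOT NULL   -- 1 = active, 2 = deleted
);

-- The filtered unique index from above.
CREATE UNIQUE INDEX MyIndex
ON dbo.MyTable (ID)
WHERE RecordStatus = 1;

INSERT INTO dbo.MyTable VALUES (9999, 'first active row', 1);  -- OK
INSERT INTO dbo.MyTable VALUES (9999, 'deleted copy',     2);  -- OK, outside the filter
INSERT INTO dbo.MyTable VALUES (9999, 'another deleted',  2);  -- OK, outside the filter
INSERT INTO dbo.MyTable VALUES (9999, 'second active row', 1); -- fails with Msg 2601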
Note: filtered indexes were introduced in SQL Server 2008. For earlier versions of SQL Server, please see this answer.
Another option is to add a check constraint like this. The constraint evaluates to false when RecordStatus = 1 and there is more than one active row with that Id.
http://msdn.microsoft.com/en-us/library/ms188258.aspx
CREATE TABLE CheckConstraint
(
Id TINYINT,
Name VARCHAR(50),
RecordStatus TINYINT
)
GO
CREATE FUNCTION CheckActiveCount(
@Id INT
) RETURNS INT AS BEGIN
DECLARE @ret INT;
SELECT @ret = COUNT(*) FROM CheckConstraint WHERE Id = @Id AND RecordStatus = 1;
RETURN @ret;
END;
GO
ALTER TABLE CheckConstraint
ADD CONSTRAINT CheckActiveCountConstraint CHECK (NOT (dbo.CheckActiveCount(Id) > 1 AND RecordStatus = 1));
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 2);
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 2);
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 2);
INSERT INTO CheckConstraint VALUES (1, 'No Problems', 1);
INSERT INTO CheckConstraint VALUES (2, 'Oh no!', 1);
INSERT INTO CheckConstraint VALUES (2, 'Oh no!', 2);
-- Msg 547, Level 16, State 0, Line 14
-- The INSERT statement conflicted with the CHECK constraint "CheckActiveCountConstraint". The conflict occurred in database "TestSchema", table "dbo.CheckConstraint".
INSERT INTO CheckConstraint VALUES (2, 'Oh no!', 1);
SELECT * FROM CheckConstraint;
-- Id Name RecordStatus
-- ---- ------------ ------------
-- 1 No Problems 2
-- 1 No Problems 2
-- 1 No Problems 2
-- 1 No Problems 1
-- 2 Oh no! 1
-- 2 Oh no! 2
ALTER TABLE CheckConstraint
DROP CONSTRAINT CheckActiveCountConstraint;
DROP FUNCTION CheckActiveCount;
DROP TABLE CheckConstraint;
You could move the deleted records to a table that lacks the constraint, and perhaps use a view with UNION of the two tables to preserve the appearance of a single table.
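A minimal sketch of that idea, with hypothetical table names:
-- Active rows live here; uniqueness is enforced only on this table.
CREATE TABLE dbo.ActiveRecords
(
    ID   int         NOT NULL,
    Name varchar(50) NOT NULL,
    CONSTRAINT UQ_ActiveRecords_ID UNIQUE (ID)
);

-- Deleted rows are moved here; duplicates are allowed.
CREATE TABLE dbo.DeletedRecords
(
    ID   int         NOT NULL,
    Name varchar(50) NOT NULL
);
GO

-- A view that preserves the appearance of the original single table.
CREATE VIEW dbo.Records
AS
SELECT ID, Name, 1 AS RecordStatus FROM dbo.ActiveRecords
UNION ALL
SELECT ID, Name, 2 AS RecordStatus FROM dbo.DeletedRecords;
GO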
You can do this in a really hacky way...
Create a schema-bound view on your table.
CREATE VIEW dbo.Whatever
WITH SCHEMABINDING
AS
SELECT ID, Name, RecordStatus
FROM dbo.MyTable
WHERE RecordStatus = 1;
Now create a unique clustered index on the view with the fields you want.
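For example, a minimal sketch (index name assumed), continuing the view above:
-- The unique clustered index is what actually enforces uniqueness of ID
-- among the rows exposed by the view (i.e. where RecordStatus = 1).
CREATE UNIQUE CLUSTERED INDEX UX_Whatever_ID
ON dbo.Whatever (ID);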
One note about schema-bound views, though: if you want to change the underlying table, you will have to drop and recreate the view first. Plenty of gotchas because of that.
For those still searching for a solution, I came across a nice answer to a similar question, and I think it can still be useful for many. While moving deleted records to another table may be a better solution, those who don't want to move records can use the idea in the linked answer, which is as follows.
Set deleted=0 when the record is available/active.
Set deleted=<row_id or some other unique value> when marking the row
as deleted.
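A minimal sketch of that idea, with hypothetical table and column names:
CREATE TABLE dbo.Record
(
    RowID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,  -- surrogate key, reused as the "deleted" stamp
    ID      int NOT NULL,
    Name    varchar(50) NOT NULL,
    deleted int NOT NULL DEFAULT 0                   -- 0 = active
);

-- At most one row per ID can have deleted = 0; soft-deleted rows each get
-- a distinct value, so any number of them can share the same ID.
CREATE UNIQUE INDEX UQ_Record_ID_deleted ON dbo.Record (ID, deleted);

-- Soft delete: stamp the row with its own unique RowID.
-- UPDATE dbo.Record SET deleted = RowID WHERE RowID = @someRow;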
If you can't use NULL as a RecordStatus, as Bill suggested, you could combine his idea with a function-based index. Create a function that returns NULL if the RecordStatus is not one of the values you want to consider in your constraint (and the RecordStatus otherwise), and create an index over that.
That'll have the advantage that you don't have to explicitly examine other rows in the table in your constraint, which could cause you performance issues.
I should say I don't know SQL server at all, but I have successfully used this approach in Oracle.
Because you are going to allow duplicates, a unique constraint will not work. You can create a check constraint for the RecordStatus column and a stored procedure for INSERT that checks existing active records before inserting duplicate IDs.

Missing rows after updating SQL Server index key column

"T-SQL Querying" book (http://www.amazon.com/Inside-Microsoft-Querying-Developer-Reference/dp/0735626030) has an interesting example, where, querying a table under default transaction isolation level during clustered index key column update, you may miss a row or read a row twice. It looks to be acceptable, since updating table/entity key is not a good idea anyway. However, I've updated this example so that the same happens, when you update non-clustered index key column value.
Following is the table structure:
SET NOCOUNT ON;
USE master;
IF DB_ID('TestIndexColUpdate') IS NULL CREATE DATABASE TestIndexColUpdate;
GO
USE TestIndexColUpdate;
GO
IF OBJECT_ID('dbo.Employees', 'U') IS NOT NULL DROP TABLE dbo.Employees;
CREATE TABLE dbo.Employees
(
empid CHAR(900) NOT NULL, -- this column should be big enough, so that 9 rows fit on 2 index pages
salary MONEY NOT NULL,
filler CHAR(1) NOT NULL DEFAULT('a')
);
CREATE INDEX idx_salary ON dbo.Employees(salary) include (empid); -- include empid into index, so that test query reads from it
ALTER TABLE dbo.Employees ADD CONSTRAINT PK_Employees PRIMARY KEY NONCLUSTERED(empid);
INSERT INTO dbo.Employees(empid, salary) VALUES
('A', 1500.00),('B', 2000.00),('C', 3000.00),('D', 4000.00),
('E', 5000.00),('F', 6000.00),('G', 7000.00),('H', 8000.00),
('I', 9000.00);
This is what needs to be done in the first connection (on each update, the row will jump between 2 index pages):
SET NOCOUNT ON;
USE TestIndexColUpdate;
WHILE 1=1
BEGIN
UPDATE dbo.Employees SET salary = 10800.00 - salary WHERE empid = 'I'; -- on each update, "I" employee jumps between 2 pages
END
This is what needs to be done in the second connection:
SET NOCOUNT ON;
USE TestIndexColUpdate;
DECLARE @c INT
WHILE 1 = 1
BEGIN
SELECT salary, empid FROM dbo.Employees
if @@ROWCOUNT <> 9 BREAK;
END
Normally, this query should return the 9 records we inserted in the first code sample. However, very soon I see 8 records being returned. This query reads all of its data from the "idx_salary" index, which is being updated by the previous sample code.
This seems to be quite a lax attitude towards data consistency on SQL Server's part. I would expect some locking coordination when data is being read from an index while its key column is being updated.
Do I interpret this behavior correctly? Does this mean that even non-clustered index keys should not be updated?
UPDATE:
To solve this problem, you only need to enable "snapshots" on the db (READ_COMMITTED_SNAPSHOT ON). No more deadlocking or missing rows. I've tried to summarize all of this here: http://blog.konstantins.net/2015/01/missing-rows-after-updating-sql-server.html
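For reference, the option mentioned can be enabled like this (database name taken from the script above; exclusive access or ROLLBACK IMMEDIATE is needed for the change to take effect):
ALTER DATABASE TestIndexColUpdate
SET READ_COMMITTED_SNAPSHOT ON WITH ROLLBACK IMMEDIATE;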
UPDATE 2:
This seems to be the very same problem, as in this good old article: http://blog.codinghorror.com/deadlocked/
Do I interpret this behavior correctly?
Yes.
Does this mean, that even non-clustered index keys should not be updated?
No. You should use a proper isolation level or make the application tolerate the inconsistencies that READ COMMITTED allows.
This issue of missing rows is not limited to clustered indexes. It is caused by moving a row in a b-tree. Clustered and nonclustered indexes are implemented as b-trees with only tiny physical differences between them.
So you are seeing the exact same physical phenomenon. It applies every time your query reads a range of rows from a b-tree. The contents of that range can move around.
Use an isolation level that provides you the guarantees that you need. For read-only transactions the snapshot isolation level is usually a very elegant and total solution to concurrency. It seems to apply to your case.
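A minimal sketch of that, using the database and table from the question:
-- Enable snapshot isolation once at the database level.
ALTER DATABASE TestIndexColUpdate SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

-- The reader then opts in per session/transaction and sees a consistent
-- point-in-time view, so no rows are missed or read twice.
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRAN;
SELECT salary, empid FROM dbo.Employees;
COMMIT TRAN;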
This seems to be quite a lax attitude towards data consistency on SQL Server's part. I would expect some locking coordination when data is being read from an index while its key column is being updated.
This is an understandable request. On the other hand, you specifically requested a low level of isolation. You can dial all the way up to SERIALIZABLE if you want; SERIALIZABLE presents you with as-if-serial execution.
Missing rows are just one special case of the many effects that READ COMMITTED allows. It makes no sense to specifically prevent them while allowing all kinds of other inconsistencies.
SET NOCOUNT ON;
USE TestIndexColUpdate;
SET TRANSACTION ISOLATION LEVEL READ COMMITTED
DECLARE @c INT
WHILE 1 = 1
BEGIN
DECLARE @count INT
SELECT @count = COUNT(*) FROM dbo.Employees WITH (INDEX (idx_salary))
WHERE empid > '' AND CONVERT(NVARCHAR(MAX), empid) > '__'
AND salary > 0
if @count <> 9 BREAK;
END

SQL Server fastest way to change data types on large tables

We need to change the data type of about 10 primary keys in our DB from numeric(19,0) to bigint. On the smaller tables a simple ALTER of the data type works just fine, but on the larger tables (60-70 million rows) it takes a considerable amount of time.
What is the fastest way to achieve this, preferably without locking the database?
I've written a script that generates the following (which I believe I got from a different SO post)
--Add a new temporary column to store the changed value.
ALTER TABLE query_log ADD id_bigint bigint NULL;
GO
CREATE NONCLUSTERED INDEX IX_query_log_id_bigint ON query_log (id_bigint)
INCLUDE (id); -- the include only works on SQL 2008 and up
-- This index may help or hurt performance, I'm not sure... :)
GO
declare @count int
declare @iteration int
declare @progress int
set @iteration = 0
set @progress = 0
select @count = COUNT(*) from query_log
RAISERROR ('Processing %d records', 0, 1, @count) WITH NOWAIT
-- Update the table in batches of 10000 at a time
WHILE 1 = 1 BEGIN
UPDATE X -- Updating a derived table only works on SQL 2005 and up
SET X.id_bigint = id
FROM (
SELECT TOP 10000 * FROM query_log WHERE id_bigint IS NULL
) X;
IF @@ROWCOUNT = 0 BREAK;
set @iteration = @iteration + 1
set @progress = @iteration * 10000
RAISERROR ('processed %d of %d records', 0, 1, @progress, @count) WITH NOWAIT
END;
GO
--kill the pkey on the old column
ALTER TABLE query_log
DROP CONSTRAINT PK__query_log__53833672
GO
BEGIN TRAN; -- now do as *little* work as possible in this blocking transaction
UPDATE T -- catch any updates that happened after we touched the row
SET T.id_bigint = T.id
FROM query_log T WITH (TABLOCKX, HOLDLOCK)
WHERE T.id_bigint <> T.id;
-- The lock hints ensure everyone is blocked until we do the switcheroo
EXEC sp_rename 'query_log.id', 'id_numeric';
EXEC sp_rename 'query_log.id_bigint', 'id';
COMMIT TRAN;
GO
DROP INDEX IX_query_log_id_bigint ON query_log;
GO
ALTER TABLE query_log ALTER COLUMN id bigint NOT NULL;
GO
/*
ALTER TABLE query_log DROP COLUMN id_numeric;
GO
*/
ALTER TABLE query_log
ADD CONSTRAINT PK_query_log PRIMARY KEY (id)
GO
This works very well for the smaller tables but is extremely slow going for the very large tables.
Note: this is in preparation for a migration to Postgres, and the EnterpriseDB Migration Toolkit doesn't seem to understand the numeric(19,0) data type.
It is not possible to change a primary key without locking. The fastest way with the least impact is to create a new table with the new column types and primary key, but without foreign keys and indexes. Then batch-insert blocks of data in sequential order relative to the primary key(s). When that is finished, add your indexes, then the foreign keys back. Finally, drop or rename the old table and rename your new table to the name the system expects.
In practice, your approach will have to vary based on how many records are inserted, updated, and/or deleted. If you're only inserting, you can perform the initial load and then top off the table just before the swap.
This approach should provide the fastest migration, minimal logs, and very little fragmentation on your table and indexes.
You have to remember that every time you modify a record, the data is modified, indexes are modified, and foreign keys are checked, all within one implicit or explicit transaction. The table and/or row(s) will be locked while all changes are made. Even if your database uses the simple recovery model, the server still writes all changes to the log files. Updates are actually a delete paired with an insert, so it is not possible to prevent fragmentation with an in-place update.
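A rough sketch of that approach for the query_log table from the question (the column list, batch size, and the final swap are assumptions to adapt):
-- 1. New table with the target data type, no FKs or secondary indexes yet.
CREATE TABLE dbo.query_log_new
(
    id bigint NOT NULL
    -- ... remaining columns copied from dbo.query_log ...
);
GO

-- 2. Copy in primary-key order, in batches, to keep log growth and locking modest.
DECLARE @batch int = 100000, @last bigint = 0, @rows int = 1;
WHILE @rows > 0
BEGIN
    INSERT INTO dbo.query_log_new (id /*, other columns */)
    SELECT TOP (@batch) CAST(id AS bigint) /*, other columns */
    FROM dbo.query_log
    WHERE id > @last
    ORDER BY id;

    SET @rows = @@ROWCOUNT;
    SELECT @last = MAX(id) FROM dbo.query_log_new;
END;
GO

-- 3. Add the PK, indexes and foreign keys, top off rows inserted since the copy,
--    then swap the names inside a short transaction, for example:
-- ALTER TABLE dbo.query_log_new ADD CONSTRAINT PK_query_log_new PRIMARY KEY (id);
-- EXEC sp_rename 'dbo.query_log', 'query_log_old';
-- EXEC sp_rename 'dbo.query_log_new', 'query_log';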

Using a trigger to simulate a second identity column in SQL Server 2005

I have various reasons for needing to implement, in addition to the identity column PK, a second, concurrency safe, auto-incrementing column in a SQL Server 2005 database. Being able to have more than one identity column would be ideal, but I'm looking at using a trigger to simulate this as close as possible to the metal.
I believe I have to use a serializable isolation level transaction in the trigger. Do I go about this like I would use such a transaction in a normal SQL query?
It is a non-negotiable requirement that the business meaning of the second incrementing column remain separated from the behind the scenes meaning of the first, PK, incrementing column.
To put things as simply as I can, if I create JobCards '0001', '0002', and '0003', then delete JobCards '0002' and '0003', the next Jobcard I create must have ID '0002', not '0004'.
Just an idea: if you have 2 "identity" columns, then surely they would be 'in sync' - if not exactly the same value, they would differ by a constant value. If so, why not add the "second identity" column as a COMPUTED column that offsets the primary identity? Or is my logic flawed here?
Edit: As per Martin's comment, note that your calculation might need to be N * id + C, where N is the increment and C the offset/delta - excuse my rusty maths.
For example:
ALTER TABLE MyTable ADD OtherIdentity AS Id * 2 + 1;
Edit
Note that for SQL Server 2012 and later, you can now use an independent SEQUENCE to create two or more independently incrementing columns in the same table.
Note: the OP has edited the original requirement to include reclaiming sequence values (identity columns in SQL Server do not reclaim used IDs once rows are deleted).
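A brief sketch of that (SQL Server 2012 and later; names are hypothetical). Note that, like identity, a sequence does not reclaim deleted values, which matters for the edited requirement below:
CREATE SEQUENCE dbo.JobCardNumber
    START WITH 1
    INCREMENT BY 1;
GO

CREATE TABLE dbo.JobCard
(
    JobCardID   int IDENTITY(1,1) NOT NULL PRIMARY KEY,        -- surrogate key
    JobCardNo   int NOT NULL
        DEFAULT (NEXT VALUE FOR dbo.JobCardNumber),            -- business number
    Description varchar(100) NULL
);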
I would disallow all deletes from this table altogether. Instead of deleting, I would mark rows as available or inactive. Instead of inserting, I would first check whether there are inactive rows, and reuse the one with the smallest ID if any exist. I would insert only if there are no available rows already in the table.
Of course, I would serialize all inserts and deletes with sp_getapplock.
You can use a trigger to disallow all deletes; it is simpler than filling gaps.
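A minimal sketch of that approach, assuming a hypothetical dbo.JobCard table with an IsActive flag:
CREATE PROCEDURE dbo.AddJobCard
    @Name varchar(100)
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRAN;

    -- Serialize all inserts/deletes that touch the numbering.
    EXEC sp_getapplock @Resource = 'JobCardNumbering',
                       @LockMode = 'Exclusive',
                       @LockOwner = 'Transaction';

    DECLARE @id int;

    -- Reuse the smallest inactive row if one exists...
    SELECT @id = MIN(JobCardID) FROM dbo.JobCard WHERE IsActive = 0;

    IF @id IS NOT NULL
        UPDATE dbo.JobCard SET IsActive = 1, Name = @Name WHERE JobCardID = @id;
    ELSE
        -- ...otherwise append a new row at the end.
        INSERT INTO dbo.JobCard (JobCardID, Name, IsActive)
        SELECT ISNULL(MAX(JobCardID), 0) + 1, @Name, 1 FROM dbo.JobCard;

    COMMIT TRAN;  -- releases the applock
END;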
A solution to this issue from "Inside Microsoft SQL Server 2008: T-SQL Querying" is to create another table with a single row that holds the current max value.
CREATE TABLE dbo.Sequence(
val int NOT NULL
);
-- Seed the single row, otherwise the UPDATE in the procedure affects nothing.
INSERT INTO dbo.Sequence(val) VALUES(0);
Then to allocate a range of sufficient size for your insert
CREATE PROC dbo.GetSequence
@val AS int OUTPUT,
@n as int = 1
AS
UPDATE dbo.Sequence
SET @val = val = val + @n;
SET @val = @val - @n + 1;
This will block other concurrent attempts to increment the sequence until the first transaction commits.
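Usage would look something like this (variable name assumed):
DECLARE @NextID int;
EXEC dbo.GetSequence @val = @NextID OUTPUT, @n = 3;  -- reserves 3 consecutive values
-- @NextID now holds the first of the 3 reserved values.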
For a non-blocking solution that doesn't handle multi-row inserts, see my answer here.
This is probably a terrible idea, but it works in at least a limited use scenario
Just use a regular identity and reseed on deletes.
create table reseedtest (
a int identity(1,1) not null,
name varchar(100)
)
insert reseedtest values('erik'),('john'),('selina')
select * from reseedtest
go
CREATE TRIGGER TR_reseedtest_D ON reseedtest FOR DELETE
AS
BEGIN TRAN
DECLARE @a int
SET @a = (SELECT TOP 1 a FROM reseedtest WITH (TABLOCKX, HOLDLOCK))
--anyone know another way to lock a table besides doing something to it?
DBCC CHECKIDENT(reseedtest, reseed, 0)
DBCC CHECKIDENT(reseedtest, reseed)
COMMIT TRAN
GO
delete reseedtest where a >= 2
insert reseedtest values('katarina'),('david')
select * from reseedtest
drop table reseedtest
This won't work if you are deleting from the "middle of the stack" as it were, but it works fine for deletes from the incrementing end.
Reseeding once to 0 then again is just a trick to avoid having to calculate the correct reseed value.
If you never delete from the table, you could create a view with a column that uses ROW_NUMBER() (a sketch follows below).
Also, a SQL Server identity can get out of sync with a user-generated one, depending on the use of rollbacks.
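A minimal sketch of the ROW_NUMBER() view (hypothetical names; the numbering only stays stable if rows are never deleted):
CREATE VIEW dbo.JobCardNumbered
AS
SELECT JobCardID,
       Name,
       ROW_NUMBER() OVER (ORDER BY JobCardID) AS JobCardNo
FROM dbo.JobCard;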
