The row limitation in a compound primary key in SQL Server 2014

I am going to insert 2.3 billion rows (2,300,000,000) from table_a into table_b. The schemas of table_a and table_b are identical; the only difference is that table_a has no primary key, while table_b has a 4-column compound primary key and currently holds 0 rows. After 24 hours I get this error message:
Msg 666, Level 16, State 2, Line 1
The maximum system-generated unique value for a duplicate group was exceeded for index with partition ID 422223771074560. Dropping and re-creating the index may resolve this; otherwise, use another clustering key.
This is the compound PK in table_b and the sample query code. Any help would be appreciated.
column1: varchar(10), not null
column2: nvarchar(50), not null
column3: nvarchar(100), not null
column4: int, not null
Sample code
insert into table_b
select *
from table_a
where date < '2017-01-01' -- some filters here

According to the SQL Server documentation, creating a primary key also creates a unique index on that table:
When you create a PRIMARY KEY constraint, a unique index on the
column, or columns, is automatically created. By default, this index
is clustered; however, you can specify a nonclustered index when you
create the constraint.
When the clustered index is not unique, rows with duplicate key values get what the docs call a "uniqueifier", which is 4 bytes long (about 2.14 billion possible values):
If the clustered index is not created with the UNIQUE property, the
Database Engine automatically adds a 4-byte uniqueifier column to the
table. When it is required, the Database Engine automatically adds a
uniqueifier value to a row to make each key unique. This column and
its values are used internally and cannot be seen or accessed by
users.
From this information and your error message we can tell two things:
There is a clustered index on the table
There is not a primary key on the table
Given the volume of data you're dealing with, I'm betting you have a Clustered Columnstore Index on the table, which in SQL Server 2014 cannot have a primary key.
One possible solution is to partition table_b on a particular column (one with fewer than 15,000 unique values, given the limits specified in the documentation). As a side note, the same partitioning effort could significantly reduce the run time of queries against table_b, depending on which column is used in the partition function.
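To make the limit concrete, here is a small arithmetic sketch in Python. It assumes the uniqueifier behaves as a signed 4-byte integer and, as a worst case, that every row being inserted shares one clustering key value. The row count comes from the question; the variable names are only illustrative.

```python
# Assumption: the uniqueifier is a signed 4-byte integer, so the largest
# value it can hold is 2**31 - 1.
UNIQUEIFIER_BYTES = 4
max_uniqueifier = 2 ** (8 * UNIQUEIFIER_BYTES - 1) - 1

rows_to_insert = 2_300_000_000  # row count from the question

# Worst case (assumption): every row has the same clustering key value,
# so every row after the first needs its own uniqueifier.
uniqueifiers_needed = rows_to_insert - 1
overflow = uniqueifiers_needed - max_uniqueifier

print(f"largest uniqueifier: {max_uniqueifier:,}")
print(f"needed in the worst case: {uniqueifiers_needed:,}")
print(f"exceeds the limit by: {max(overflow, 0):,}")
```

If the real data has many distinct key values, no single duplicate group would get anywhere near this size, so hitting error 666 suggests that the clustering key in use is far less selective than the intended 4-column key.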

You know that:
If the clustered index is not created with the UNIQUE property, the
Database Engine automatically adds a 4-byte uniqueifier column to the
table. When it is required, the Database Engine automatically adds a
uniqueifier value to a row to make each key unique. This column and
its values are used internally and cannot be seen or accessed by
users.
While it's unlikely that you will face an issue related with uniqueifiers, we have seen rare cases where customer reaches the uniqueifier limit of 2,147,483,648, generating error 666.
And from this topic about the issue we have:
As of February 2018, the design goal for the storage engine is to not
reset uniqueifiers during REBUILDs. As such, rebuild of the index
ideally would not reset uniquifiers and issue would continue to occur,
while inserting new data with a key value for which the uniquifiers
were exhausted. But current engine behavior is different for one
specific case, if you use the statement ALTER INDEX ALL ON <TableName>
REBUILD WITH (ONLINE = ON), it will reset the uniqueifiers (across all
version starting SQL Server 2005 to SQL Server 2017).
So, if this is the cause of your issue, you can add an additional integer column and build the index over it.

Related

How to do transaction.insert_or_update on secondary index and not the primary index?

I have a table in Google Cloud Spanner.
CREATE TABLE test_id (
Id STRING(MAX) NOT NULL,
KeyColumn STRING(MAX) NOT NULL,
parent_id INT64 NOT NULL,
Updated TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp=true),
) PRIMARY KEY (Id)
And, I am trying to perform transaction.insert_or_update through a python script.
For each row in a pandas dataframe, I am doing:
transaction.insert_or_update(
'test_id', columns=['Id','KeyColumn', 'parent_id', 'Updated'],
values=[(uuid.uuid4().hex, row["KeyColumn"], row["parent_id"], spanner.COMMIT_TIMESTAMP)],
)
What I want is that if the row["KeyColumn"] is already present in KeyColumn of the table, update its parent_id column, otherwise insert a new row in the Spanner table corresponding to that KeyColumn.
But since my primary key is Id, which is generated randomly by uuid.uuid4().hex, it inserts a new row every time.
If I understand you correctly, the following is the situation:
ID is the primary key of your table.
There is a unique index defined for the table on the column KeyColumn.
You want to insert_or_update a row using KeyColumn as the column that should be used to determine whether the row already exists.
That is unfortunately not possible. insert_or_update will always use the primary key of the table to determine whether the row exists. I can think of three possible solutions to this problem, but they all have their drawbacks:
You could change the table definition and make KeyColumn the primary key and set a unique index on the Id column. The problem with this is of course that any other code that depends on Id being the primary key also needs to change. It is also a rather cumbersome change, because Cloud Spanner does not allow you to change the primary key of a table, so you would have to create a copy of the test_id table and then drop the old table.
You could fetch the row from Cloud Spanner before updating it by reading it using the KeyColumn value that you have. The big problem with this is obviously performance. You will need to do a read for each row that you want to update.
You could use a DML statement (UPDATE test_id SET parent_id=@parent WHERE KeyColumn=@key) to execute the update and check whether it actually updated a row by checking the returned update count. If it did not update anything, you could then execute the insert. This will obviously also be slower than an insert_or_update mutation.
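The third option (update first, insert only when nothing was updated) can be sketched as follows. This is a minimal illustration of the control flow only: it uses Python's built-in sqlite3 module as a stand-in for Cloud Spanner so that it runs anywhere, the helper name upsert_by_key_column is hypothetical, and the Updated commit-timestamp column is left out. With the Spanner client, the same pattern would presumably run inside a read-write transaction, using the update count returned by the DML call.

```python
import sqlite3
import uuid


def upsert_by_key_column(conn, key_value, parent_id):
    """Update the row matching key_value, or insert one if none exists.

    Returns "updated" or "inserted".
    """
    cursor = conn.execute(
        "UPDATE test_id SET parent_id = ? WHERE KeyColumn = ?",
        (parent_id, key_value),
    )
    if cursor.rowcount > 0:
        return "updated"
    # Nothing matched KeyColumn, so create the row with a fresh random Id.
    conn.execute(
        "INSERT INTO test_id (Id, KeyColumn, parent_id) VALUES (?, ?, ?)",
        (uuid.uuid4().hex, key_value, parent_id),
    )
    return "inserted"


# SQLite stand-in for the Spanner table (the Updated column is omitted).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_id ("
    "Id TEXT PRIMARY KEY, KeyColumn TEXT NOT NULL UNIQUE, "
    "parent_id INTEGER NOT NULL)"
)

with conn:  # one transaction, so both calls commit together
    first = upsert_by_key_column(conn, "alpha", 1)
    second = upsert_by_key_column(conn, "alpha", 2)

rows = conn.execute("SELECT KeyColumn, parent_id FROM test_id").fetchall()
print(first, second, rows)
```

Running the update and the conditional insert in the same transaction matters: otherwise two concurrent writers could both see "nothing updated" and both insert a row for the same KeyColumn.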
There is a way to query Cloud Spanner using a specific index.
You should use something like this at the end of your query: FROM test_id@{FORCE_INDEX=KeyColumnIndex}.
Although this is how you execute queries on secondary indexes, and so answers the question in the title, I do not know how well it applies to your use case.

Find distinct values in SQL table without scan

A SQL Server table with >200 million records is divided into ~100 partitions (not true SQL Server partitions, because it is not running on a compatible edition of SQL Server) by adding a column PartitionID. PartitionID is the first half of the table's clustered index definition; the other half is a unique auto-incrementing integer ID. PartitionID is also a foreign key into the Partition table. No record from Example is ever accessed without knowing its PartitionID; records are usually accessed in ranges associated with a single PartitionID (or a small number of PartitionIDs).
CREATE TABLE Example (
ID BIGINT IDENTITY(1, 1) NOT NULL,
PartitionID DECIMAL(18, 0) NOT NULL,
-- Other columns omitted for brevity
CONSTRAINT PK_Example PRIMARY KEY NONCLUSTERED (ID),
CONSTRAINT FK_Example_Partition FOREIGN KEY (PartitionID) REFERENCES Partition (ID)
)
CREATE UNIQUE CLUSTERED INDEX IX_Example ON Example(PartitionID, ID)
Partition rows are kept indefinitely, but Example rows are frequently purged by issuing a DELETE statement against a range with the same PartitionID. Over time, this leads to Partition rows that are not referenced by any Example rows. This is not the problem; the problem is identifying the Partition rows that are still referenced.
Without resorting to user-level management techniques like adding and manually maintaining a ReferenceCount field in the Partition table, or adding and manually maintaining a list of in-use PartitionIDs, is there a system-level technique we could use to discover the set of PartitionIDs that are still in use - without scanning all the rows in table Example?
SELECT DISTINCT PartitionID FROM Example
The above query takes tens of seconds to return 100 values because it scans hundreds of millions of rows in the clustered index. Adding another very narrow index on PartitionID alone might reduce the I/O and halve the time, but essentially SQL Server is still scanning that index too.
CREATE NONCLUSTERED INDEX IX_Example_PartitionID ON Example(PartitionID)
I should probably also avoid joining Partition with Example (performing a number of clustered index seeks instead of an index scan) because the number of seeks will increase (and decrease performance) over time.
SELECT p.ID AS PartitionID FROM Partition p WHERE EXISTS (
SELECT 1 FROM Example e WHERE e.PartitionID = p.ID
)

SQL Server 2014 Index Optimization: Any benefit with including primary key in indexes?

After running a query, the SQL Server 2014 Actual Query Plan shows a missing index like below:
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1) INCLUDE
(PK_Column,SomeOtherColumn)
The missing index suggestion includes the primary key column in the index. The table's clustered index is on PK_Column.
I am confused, and it seems that I don't have the concept of a clustered index primary key right.
My assumption was: when a table has a clustered PK, all of the non-clustered indexes point to the PK value. Am I correct? If so, why does the missing index suggestion ask me to include the PK column in the index?
Summary:
The advised index is not valid, but it doesn't make any difference. See the tests section below for details.
After researching for some time, I found an answer whose statement below explains the missing index feature convincingly:
they only look at a single query, or a single operation within a single query. They don't take into account what already exists or your other query patterns.
You still need a thinking human being to analyze the overall indexing strategy and make sure that your index structure is efficient and cohesive.
So, coming to your question: the advised index may be valid, but it should not be taken for granted. It is useful to SQL Server for the particular query executed, to reduce its cost.
This is the index that was advised:
CREATE NONCLUSTERED INDEX IX_1 ON Table1 (Column1)
INCLUDE (PK_Column, SomeOtherColumn)
Assume you have a query like the one below:
select pk_column, someothercolumn
from Table1
where column1 = 'somevalue'
SQL Server prefers to read a narrow index when one is available, so in this case the advised index will be helpful.
Further, you didn't share the schema of the table. If you have an index like the one below
create index nci_test on Table1(column1)
then a query of the form below will again produce the same index suggestion as stated in the question:
select pk_column, someothercolumn
from Table1
where column1 = 'somevalue'
Update:
I have an orders table (orderstest) with the schema below:
[orderid] [int] NOT NULL Primary key,
[custid] [char](11) NOT NULL,
[empid] [int] NOT NULL,
[shipperid] [varchar](5) NOT NULL,
[orderdate] [date] NOT NULL,
[filler] [char](160) NOT NULL
Now I created one more index with the structure below:
create index onlyempid on orderstest(empid)
Now when I run a query of the form below
select empid,orderid,orderdate --6.3 units
from orderstest
where empid=5
the index advisor suggests the missing index below:
CREATE NONCLUSTERED INDEX empidalongwithorderiddate
ON [dbo].[orderstest] ([empid])
INCLUDE ([orderid],[orderdate]) -- you can drop orderid too; it doesn't make any difference
As you can see, orderid is also included in the suggestion above.
Now let's create it and observe both structures:
Root level, for index onlyempid and for index empidalongwithorderiddate: (screenshots not preserved)
Leaf level, for index onlyempid and for index empidalongwithorderiddate: (screenshots not preserved)
As you can see, creating the index as suggested makes no difference, even though the suggestion is invalid.
I assume the index advisor made the suggestion based on the query that was run; it is specific to that query and has no idea of the other indexes involved.
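The idea that a secondary index already carries the table's row key, so that listing the key explicitly adds nothing, can be illustrated with a small runnable sketch. Note the loud assumption: this uses SQLite (through Python's sqlite3 module) as an analogy, not SQL Server. In SQLite an INTEGER PRIMARY KEY is the rowid, which every index stores, and there is no INCLUDE clause, so someothercolumn is added as a second key column instead. The table and index names mirror the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# pk_column is the rowid alias, SQLite's rough analogue of a clustering key.
conn.execute(
    "CREATE TABLE table1 ("
    "pk_column INTEGER PRIMARY KEY, column1 TEXT, someothercolumn TEXT)"
)
conn.executemany(
    "INSERT INTO table1 (column1, someothercolumn) VALUES (?, ?)",
    [(f"value{i % 10}", f"other{i}") for i in range(1000)],
)
# Note: pk_column is NOT listed in the index definition.
conn.execute("CREATE INDEX ix_1 ON table1 (column1, someothercolumn)")

plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT pk_column, someothercolumn FROM table1 WHERE column1 = ?",
    ("value3",),
).fetchall()
# The last field of each plan row is the human-readable description.
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)
```

The plan reports a covering index even though pk_column was never named in it, which matches the observation above that including the key column explicitly makes no difference.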
I don't know your schema, nor your queries. Just guessing.
Please correct me if this theory is incorrect.
You are right that non-clustered indexes point to the PK value. Imagine you have a large database (for example, gigabytes of files) stored on an ordinary platter hard drive. Let's suppose the disk is fragmented and the PK index is stored physically far from your Table1 index.
Imagine that your query needs to evaluate Column1 and PK_column as well. The query execution reads a Column1 value, then a PK value, then a Column1 value, then a PK value...
The drive has to seek from one physical location to another, and this can take time.
Having all you need in one index is more efficient, because it means reading one file sequentially.

designing new table for daily uploads - use unique constraint

I am using SQL Server 2012 & am creating a table that will have 8 columns, types below
datetime
varchar(12)
varchar(6)
varchar(100)
float
float
int
datetime
Once a day (normally) there will be an upload of approx 10,000 rows of data. Going forward, it's possible it could be 100,000.
The rows will be unique if I group on the first three columns listed above. I have read I can use the unique constraint on multiple columns which will guarantee the rows are unique.
I think I'm correct in saying that the unique constraint by default sets up non-clustered index. Would a clustered index be better & assuming when the table starts to contain millions of rows this won't cause any issues?
My last question. By applying the unique constraint on my table I am right to say querying the data will be quicker than if the unique constraint wasn't applied (because of the non-clustering or clustering) & uploading the data will be slower (which is fine) with the constraint on the table?
A unique index can be non-clustered.
A primary key is unique and can be clustered.
A clustered index is not unique by default.
A unique clustered index is unique :)
You can get more information from this guide.
So, we should separate uniqueness from index keys.
If you need to keep data unique by some columns, create a unique constraint (unique index). It will protect your data.
You can also create the primary key (PK) on those columns, and they will be unique as well. But there is a difference: all other indexes will use the PK for referencing, so the PK must be as short as possible. So my advice is to create an identity column (int or bigint) and put the PK on it, and then create a unique index on your unique columns.
Querying data may become faster if you query on your unique columns. If you query on other columns, you need to create other, specific indexes.
So: unique keys for data consistency, indexes for queries.
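The layout recommended above (a short surrogate key as the primary key, plus a separate unique constraint on the three business columns) can be sketched like this. It is a hedged illustration using SQLite through Python's sqlite3 module rather than SQL Server, so AUTOINCREMENT stands in for an IDENTITY column; the table name, column names, and sample values are all hypothetical, and only some of the eight columns are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE daily_upload (
        id INTEGER PRIMARY KEY AUTOINCREMENT,  -- short surrogate key
        col_datetime TEXT NOT NULL,
        col_code12 TEXT NOT NULL,
        col_code6 TEXT NOT NULL,
        amount REAL,
        UNIQUE (col_datetime, col_code12, col_code6)  -- business uniqueness
    )
    """
)

insert_sql = (
    "INSERT INTO daily_upload (col_datetime, col_code12, col_code6, amount) "
    "VALUES (?, ?, ?, ?)"
)
conn.execute(insert_sql, ("2024-01-01 00:00:00", "ABC", "X1", 1.5))

try:
    # Same three-column key again: the unique constraint rejects it.
    conn.execute(insert_sql, ("2024-01-01 00:00:00", "ABC", "X1", 9.9))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM daily_upload").fetchone()[0]
print(duplicate_rejected, row_count)
```

The unique constraint guards the data, while the narrow integer key is what other indexes and foreign keys would reference.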
I think I'm correct in saying that the unique constraint by default
sets up non-clustered index
TRUE
Would a clustered index be better & assuming when the table starts to
contain millions of rows this won't cause any issues?
(1) If you need to make (datetime, varchar(12), varchar(6)) unique, and
(2) if you or your application will access rows using datetime, or datetime, varchar(12), or datetime, varchar(12), varchar(6) in the WHERE condition
all the time,
then put the primary key on (datetime, varchar(12), varchar(6)).
By default this enforces uniqueness and creates the clustered index on all three columns.
but as you commented above:
the queries will vary to be honest. I imagine most queries will make
use of the first datetime column
and you will deal with a large volume of data and might join this table with other tables,
then it is better to have a surrogate key (an ever-increasing unique identifier) in the table and, to satisfy your SELECTs,
nonclustered indexes.
Surrogate Key vs Business Key
NON-CLUSTERED INDEX

How to update guid ID references when converting to identity IDs

I am trying to convert tables from using guid primary keys / clustered indexes to using int identities. This is for SQL Server 2005. There are two tables MainTable and RelatedTable, and the current table structure is as follows:
MainTable [40 million rows]
IDGuid - uniqueidentifier - PK
-- [data columns]
RelatedTable [400 million rows]
RelatedTableID - uniqueidentifier - PK
MainTableIDGuid - uniqueidentifier [foreign key to MainTable]
SequenceNumber - int - incrementing number per main table entry since there can be multiple entries related to a given row in the main table. These go from 1,2,3... etc for each MainTableIDGuid value.
-- [data columns]
The clustered index for MainTable is currently the primary key (IDGuid). The clustered index for RelatedTable is currently (MainTableIDGuid, SequenceNumber).
I want my conversion to do several things:
Change MainTable to use an integer ID instead of GUID
Add a MainTableIDInt column to related table that links to Main Table's integer ID
Change the primary key and clustered index of RelatedTable to (MainTableIDInt, SequenceNumber)
Get rid of the guid columns.
I've written a script to do the following:
Add an IDInt int IDENTITY column to MainTable. This does a table rebuild and generates the new identity ID values.
Add a MainTableIDInt int column to RelatedTable.
The next step is to populate the RelatedTable.MainTableIDInt column for each row with its corresponding MainTable.IDInt value [based on the matching guid IDs]. This is the step I'm hung up on. I understand this is not going to be speedy, but I'd like to have it perform as well as possible.
I can write a SQL statement that does this update:
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = (SELECT MainTable.IDInt FROM MainTable WHERE MainTable.IDGuid = RelatedTable.MainTableIDGuid)
or
UPDATE RelatedTable
SET RelatedTable.MainTableIDInt = MainTable.IDInt
FROM RelatedTable
LEFT OUTER JOIN MainTable ON RelatedTable.MainTableIDGuid = MainTable.IDGuid
The 'Display Estimated Execution Plan' displays roughly the same for both of these queries. The execution plan it spits out does the following:
Clustered index scans over MainTable and RelatedTable and does a Merge Join on them [estimated number of rows = 400 million]
Sorts [estimated number of rows = 400 million]
Clustered index update over RelatedTable [estimated number of rows = 400 million]
I'm concerned about the performance of this [sorting 400 million rows sounds unpleasant]. Are my concerns about the performance of this execution plan justified? Is there a better way to update the new ID for my related table that will scale given the size of the tables?
First, this will be a headache. Second, I wouldn't change any of the indexes or constraints until I had the data in place. I.e., I would add the identity column but not make it the primary key nor clustered index. Then I'd add the soon-to-be new foreign keys to the various tables. Your queries should look like:
Update ChildTable
Set NewIntForeignKeyId = P.NewIntPrimaryKey
From ChildTable As C
Join ParentTable As P
On P.PrimaryKey = C.ForeignKey
First, notice that I'm using an inner join. There is no reason to use an outer join for this type of query given that you will eventually enforce referential integrity between the new columns. Second, if you populate the columns first and then rebuild the constraints, it will be faster as you'll be able to leverage the existing indexes. Remember that when you change the clustered index, it rebuilds all of the nonclustered indexes. If the tables are large, that will be a serious hit.
Once you have the data in place, I'd then drop all primary constraints, unique constraints, foreign key constraints and unique indexes. Drop the clustered index/constraint last. I'd then add the clustered indexes to all of the tables and after that was done, recreate the unique constraints, foreign key constraints and indexes. If you do not drop the existing indexes before you recreate the clustered index, it will rebuild the existing indexes twice: once when you drop the clustered index and again when you recreate it.
Btw, I highly doubt there is a way to avoid table scans for this sort of thing since you are going to be updating every row.
