Storing and processing 128-bit values in SQL Server

I'm going to create a table with a specific column which stores 128-bit unsigned values.
I have the following constraints:
I should query this column for previous duplicate values in the shortest possible time (Note: this is not a uniqueness constraint).
I should insert values with low timing overhead.
Number of records may be around tens of millions (e.g. 90M).
I want to display the column value as hex.
I want to query the column value using hex input strings.
According to this answer, binary(16) is suggested for storing 128-bit values (without considering my constraints).
So my question is: which data type is suitable for my column (char(32), binary(16), ...)?
I'm currently using char(32) to store the 128-bit value in hex representation, and I want to know whether I can improve my database performance. I guess that since the numbers are stored and processed as text, query performance suffers, and there might be data types with better performance.

binary(16) is the most appropriate column data type for 128-bit unsigned values since SQL Server does not have a 128-bit unsigned type. This will reduce storage and memory requirements by half compared to char(32).
I should query this column for previous duplicate values in the shortest possible time (Note: this is not a uniqueness constraint).
Create an index on the column to avoid a full table scan.
I should insert values with low timing overhead.
There is a minor insert performance cost with the above index, but it will be more than offset by the query-time savings.
Number of records may be around tens of millions (e.g. 90M).
A 100M row table will only be a few GB (depending on row size) and will likely be memory resident on an adequately sized SQL instance if the table is used often.
I want to display the column value as hex.
As with all display formatting, this is a task best done in the presentation layer but can be done in T-SQL as well.
I want to query the column value using hex input strings.
Ideally, one should match the column data types when querying but a hex string can be explicitly converted to binary if needed.
Example T-SQL:
CREATE TABLE dbo.YourTable (
YourTableID int NOT NULL CONSTRAINT PK_YourTable PRIMARY KEY CLUSTERED
, BinaryValue binary(16) NOT NULL INDEX idx_BinaryValue NONCLUSTERED
, OtherData varchar(50) NOT NULL
);
INSERT INTO dbo.YourTable VALUES(1, 0x000102030405060708090A0B0C0D0E0F, 'example 1');
INSERT INTO dbo.YourTable VALUES(2, 0x000102030405060708090A0B0C0D0E00, 'example 2');
INSERT INTO dbo.YourTable VALUES(3, 0x000102030405060708090A0B0C0D0E0F, 'example 3 duplicate value');
--example query values
DECLARE @BinaryValue binary(16) = 0x000102030405060708090A0B0C0D0E0F;
DECLARE @CharValue char(32) = '000102030405060708090A0B0C0D0E0F';
--matching type query
SELECT YourTableID, BinaryValue, OtherData, CONVERT(char(32), BinaryValue, 2) AS DisplayValue
FROM dbo.YourTable
WHERE BinaryValue = @BinaryValue;
--query with hex string
SELECT YourTableID, BinaryValue, OtherData, CONVERT(char(32), BinaryValue, 2) AS DisplayValue
FROM dbo.YourTable
WHERE BinaryValue = CONVERT(binary(16), @CharValue, 2);
GO
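For the duplicate check specifically, a minimal sketch against the table and index above (the query shapes are suggestions, not the only option):
--list values that already occur more than once (served by idx_BinaryValue)
SELECT BinaryValue, COUNT(*) AS Occurrences
FROM dbo.YourTable
GROUP BY BinaryValue
HAVING COUNT(*) > 1;
--check a single candidate value before (or after) inserting it
DECLARE @Candidate binary(16) = 0x000102030405060708090A0B0C0D0E0F;
IF EXISTS (SELECT 1 FROM dbo.YourTable WHERE BinaryValue = @Candidate)
    PRINT 'Duplicate value already present';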

Related

How to get data in chunks from very large table which has primary key column data type as varchar

I want to retrieve data from a table with 28 million rows. I want to retrieve 1 million rows at a time.
I have checked the answers from the following links:
Get data from large table in chunks
How can I get a specific chunk of results?
The suggested solution uses a query with an int ID column. In my case the primary key column has varchar(15) as its data type.
I want to use something like this, which is faster:
Select top 2000 *
from t
where ID >= @start_index
order by ID
But since the ID column is varchar, I cannot use an integer index.
How can I efficiently get data in chunks from a table whose primary key has a varchar data type?
Because your primary key has to be unique, the same approach will work; you can use >= with character columns.
I'd also suggest using ORDER BY with OFFSET/FETCH, like this:
SELECT *
FROM t
ORDER BY ID
OFFSET 1000 ROWS FETCH NEXT 500 ROWS ONLY;
This way is very flexible.
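For comparison, a keyset-style sketch against the varchar key (assuming the table t with primary key column ID; @last_id is whatever value the previous chunk ended on):
--keyset paging: valid because ID is unique, and > / >= work on character columns
DECLARE @last_id varchar(15) = '';   --last ID returned by the previous chunk
SELECT TOP (1000000) *
FROM t
WHERE ID > @last_id
ORDER BY ID;
--after processing, set @last_id to the final ID of this chunk and repeat
--until the query returns no rows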

Alter Column: option to specify conversion function?

I have a column of type float that contains phone numbers - I'm aware that this is bad, so I want to convert the column from float to nvarchar(max), converting the data appropriately so as not to lose data.
The conversion can apparently be handled correctly using the STR function (suggested here), but I'm not sure how to go about changing the column type and performing the conversion without creating a temporary column. I don't want to use a temporary column because we will be doing this automatically a bunch of times in the future and don't want the performance impact from page splits (suggested here).
In Postgres you can add a USING option to your ALTER COLUMN statement that specifies how to convert the existing data. I can't find anything like this for T-SQL. Is there a way I can do this in place?
Postgres example:
...ALTER COLUMN <column> TYPE <type> USING <func>(<column>);
Rather than use a temporary column in your table, use a (temporary) column in a temporary table. In short:
Create temp table with PK of your table + column you want to change (in the correct data type, of course)
Select data into the temp table using your conversion method
Change data type in actual table
Update actual table from temp table values
If the table is large, I'd suggest doing this in batches. Of course, if the table isn't large, worrying about page splits is premature optimization, since doing a complete rebuild of the table and its indexes after the conversion would be cheap. Another question is: why nvarchar(max)? The data is phone numbers. Last time I checked, phone numbers were fairly short (certainly less than the 2 GB that nvarchar(max) can hold) and non-Unicode. Do some domain modeling to figure out the appropriate data size and you'll thank me later. Lastly, why would you do this "automatically a bunch of times in future"? Why not have the correct data type and insert the right values?
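A rough sketch of those steps, using hypothetical names (dbo.Contacts, ContactID, Phone) and a sized varchar rather than nvarchar(max), per the note above:
--1. temp table: PK plus the column in its target data type
CREATE TABLE #Converted (
    ContactID int NOT NULL PRIMARY KEY,
    PhoneText varchar(20) NOT NULL
);
--2. capture the converted values using STR before touching the real column
INSERT INTO #Converted (ContactID, PhoneText)
SELECT ContactID, LTRIM(STR(Phone, 20, 0))
FROM dbo.Contacts
WHERE Phone IS NOT NULL;
--3. change the data type on the actual table
ALTER TABLE dbo.Contacts ALTER COLUMN Phone varchar(20) NULL;
--4. overwrite the implicitly converted values with the captured ones (batch this on a large table)
UPDATE c
SET c.Phone = t.PhoneText
FROM dbo.Contacts AS c
JOIN #Converted AS t ON t.ContactID = c.ContactID;
DROP TABLE #Converted;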
In SQL Server:
CREATE TABLE dbo.Employee
(
EmployeeID INT IDENTITY (1,1) NOT NULL
,FirstName VARCHAR(50) NULL
,MiddleName VARCHAR(50) NULL
,LastName VARCHAR(50) NULL
,DateHired datetime NOT NULL
)
-- Change the datatype to support 100 characters and make NOT NULL
ALTER TABLE dbo.Employee
ALTER COLUMN FirstName VARCHAR(100) NOT NULL
-- Change datatype and allow NULLs for DateHired
ALTER TABLE dbo.Employee
ALTER COLUMN DateHired SMALLDATETIME NULL
-- Set SPARSE column for MiddleName (SQL Server 2008 and later)
ALTER TABLE dbo.Employee
ALTER COLUMN MiddleName VARCHAR(100) SPARSE NULL
http://sqlserverplanet.com/ddl/alter-table-alter-column

High Sort Cost on Merge Operation

I am using the MERGE feature to insert data into a table using a bulk-import table as the source (as described here).
This is my query:
DECLARE @InsertMapping TABLE (BulkId int, TargetId int);
MERGE dbo.Target T
USING dbo.Source S
ON 0=1 WHEN NOT MATCHED THEN
INSERT (Data) VALUES (Data)
OUTPUT S.Id BulkId, inserted.Id INTO @InsertMapping;
When evaluating the performance by displaying the actual execution plan, I saw that there is a high-cost sort on the primary key index. I don't get it: the primary key should already be sorted ascending, so there shouldn't be a need for additional sorting.
Because of this sort cost, the query takes several seconds to complete. Is there a way to speed up the insert? Maybe some index hints or additional indexes? Such an insert shouldn't take that long, even if there are several thousand entries.
I can reproduce this issue with the following:
CREATE TABLE dbo.TargetTable(Id int IDENTITY PRIMARY KEY, Value INT)
CREATE TABLE dbo.BulkTable(Id int IDENTITY PRIMARY KEY, Value INT)
INSERT INTO dbo.BulkTable (Value)
SELECT TOP (1000000) 1
FROM sys.all_objects o1, sys.all_objects o2
DECLARE @TargetTableMapping TABLE (BulkId INT, TargetId INT);
MERGE dbo.TargetTable T
USING dbo.BulkTable S
ON 0 = 1
WHEN NOT MATCHED THEN
INSERT (Value)
VALUES (Value)
OUTPUT S.Id AS BulkId,
inserted.Id AS TargetId
INTO @TargetTableMapping;
This gives a plan with a sort before the clustered index merge operator.
The sort is on Expr1011, Action1010 which are both computed columns output from previous operators.
Expr1011 is the result of calling the internal and undocumented function getconditionalidentity to produce an id column for the identity column in TargetTable.
Action1010 is a flag indicating insert, update, delete. It is always 4 in this case as the only action this MERGE statement can perform is INSERT.
The reason the sort is in the plan is because the clustered index merge operator has the DMLRequestSort property set.
The DMLRequestSort property is set based on the number of rows expected to be inserted. Paul White explains in the comments here
[DMLRequestSort] was added to support the ability to minimally-log
INSERT statements in 2008. One of the preconditions for minimal
logging is that the rows are presented to the Insert operator in
clustered key order.
Inserting into tables in clustered index key order can be more efficient anyway as it reduces random IO and fragmentation.
If the function getconditionalidentity returns generated identity values in ascending order (as would seem reasonable), then the input to the sort will already be in the desired order. The sort in the plan would in that case be logically redundant (there was previously a similar issue with unnecessary sorts involving NEWSEQUENTIALID).
It is possible to get rid of the sort by making the expression a bit more opaque.
DECLARE @TargetTableMapping TABLE (BulkId INT, TargetId INT);
DECLARE @N BIGINT = 0x7FFFFFFFFFFFFFFF;
MERGE dbo.TargetTable T
USING (SELECT TOP(@N) * FROM dbo.BulkTable) S
ON 1=0
WHEN NOT MATCHED THEN
INSERT (Value)
VALUES (Value)
OUTPUT S.Id AS BulkId,
inserted.Id AS TargetId
INTO @TargetTableMapping;
This reduces the estimated row count and the plan no longer has a sort. You will need to test whether or not this actually improves performance though. Possibly it might make things worse.

Can I have a primary key and a separate clustered index together?

Let's assume I already have a primary key, which ensures uniqueness. My primary key is also the ordering index for the records. However, I am curious about the primary key's role (if any) in the physical ordering of records on disk. And the actual question is: can I have a separate clustered index for these records?
This is an attempt at testing the size and performance characteristics of a covering secondary index on a clustered table, as per the discussion with @Catcall.
All tests were done on MS SQL Server 2008 R2 Express (inside a fairly underpowered VM).
Size
First, I created a clustered table with a secondary index and filled it with some test data:
CREATE TABLE THE_TABLE (
FIELD1 int,
FIELD2 int NOT NULL,
CONSTRAINT THE_TABLE_PK PRIMARY KEY (FIELD1)
);
CREATE INDEX THE_TABLE_IE1 ON THE_TABLE (FIELD2) INCLUDE (FIELD1);
DECLARE @COUNT int = 1;
WHILE @COUNT <= 1000000 BEGIN
INSERT INTO THE_TABLE (FIELD1, FIELD2) VALUES (@COUNT, @COUNT);
SET @COUNT = @COUNT + 1;
END;
EXEC sp_spaceused 'THE_TABLE';
The last line gave me the following result...
name       rows      reserved   data       index_size  unused
THE_TABLE  1000000   27856 KB   16808 KB   11008 KB    40 KB
So, the index's B-Tree (11008 KB) is actually smaller than the table's B-Tree (16808 KB).
Speed
I generated a random number within the range of the data in the table, and then used it as criteria for selecting a whole row from the table. This was repeated 10000 times and the total time measured:
DECLARE @I int = 1;
DECLARE @F1 int;
DECLARE @F2 int;
DECLARE @END_TIME DATETIME2;
DECLARE @START_TIME DATETIME2 = SYSDATETIME();
WHILE @I <= 10000 BEGIN
SELECT @F1 = FIELD1, @F2 = FIELD2
FROM THE_TABLE
WHERE FIELD1 = (SELECT CEILING(RAND() * 1000000));
SET @I = @I + 1;
END;
SET @END_TIME = SYSDATETIME();
SELECT DATEDIFF(millisecond, @START_TIME, @END_TIME);
The last line produces an average time (of 10 measurements) of 181.3 ms.
When I change the query condition to WHERE FIELD2 = ..., so that the secondary index is used, the average time is 195.2 ms.
Execution plans: (screenshots not shown)
So the performance (of selecting on the PK versus on the covering secondary index) seems to be similar. For much larger amounts of data, I suspect the secondary index could possibly be slightly faster (since it seems more compact and therefore cache-friendly), but I didn't hit that yet in my testing.
String Measurements
Using varchar(50) as type for FIELD1 and FIELD2 and inserting strings that vary in length between 22 and 28 characters gave similar results.
The sizes were:
name       rows      reserved    data        index_size  unused
THE_TABLE  1000000   208144 KB   112424 KB   95632 KB    88 KB
And the average timings were: 254.7 ms for searching on FIELD1 and 296.9 ms for FIELD2.
Conclusion
If a clustered table has a covering secondary index, that index will have space and time characteristics similar to the table itself (possibly slightly slower, but not by much). In effect, you'll have two B-Trees that sort their data differently but are otherwise very similar, achieving your goal of having a "second cluster".
It depends on your dbms. Not all of them implement clustered indexes. Those that do are liable to implement them in different ways. As far as I know, every platform that implements clustered indexes also provides ways to choose which columns are in the clustered index, although often the primary key is the default.
In SQL Server, you can create a nonclustered primary key and a separate clustered index like this.
create table test (
test_id integer primary key nonclustered,
another_column char(5) not null unique clustered
);
I think that the closest thing to this in Oracle is an index organized table. I could be wrong. It's not quite the same as creating a table with a clustered index in SQL Server.
You can't have multiple clustered indexes on a single table in SQL Server. A table's rows can only be stored in one order at a time. Actually, I suppose you could store rows in multiple, distinct orders, but you'd have to essentially duplicate all or part of the table for each order. (Although I didn't know it at the time I wrote this answer, DB2 UDB supports multiple clustered indexes, and it's quite an old feature. Its design and implementation is quite different from SQL Server.)
A primary key's job is to guarantee uniqueness. Although that job is often done by creating a unique index on the primary key column(s), strictly speaking uniqueness and indexing are two different things with two different aims. Uniqueness aims for data integrity; indexing aims for speed.
A primary key declaration isn't intended to give you any information about the order of rows on disk. In practice, it usually gives you some information about the order of index entries on disk. (Because primary keys are usually implemented using a unique index.)
If you SELECT rows from a table that has a clustered index, you still can't be assured that the rows will be returned to the user in the same order that they're stored on disk. Loosely speaking, the clustered index helps the query optimizer find rows faster, but it doesn't control the order in which those rows are returned to the user. The only way to guarantee the order in which rows are returned to the user is with an explicit ORDER BY clause. (This seems to be a fairly frequent point of confusion. A lot of people seem surprised when a bare SELECT on a clustered index doesn't return rows in the order they expect.)
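To illustrate that last point with the test table above:
--may happen to return rows in clustered (another_column) order, but nothing guarantees it
SELECT test_id, another_column FROM test;
--only an explicit ORDER BY guarantees the order the caller sees
SELECT test_id, another_column FROM test ORDER BY another_column;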

Changing Datatype from int to bigint for tables containing billions of rows

I have a couple of tables with millions, and in some tables billions, of rows, with one column as int that I am now changing to bigint. I tried changing the data type using SSMS and it failed after a couple of hours because the transaction log was full.
Another approach I took was to create a new column and start updating values from the old column to the new column in batches by setting the ROWCOUNT property to 100000. It works, but it is very slow and it consumes all server memory. With this approach it may take a couple of days to complete, which won't be acceptable in production.
What is the fastest/best way to change the data type? The source column is not an identity column, and duplicates and nulls are allowed. The table has indexes on other columns; will disabling the indexes speed up the process? Will adding BEGIN TRAN and COMMIT help?
I ran a test for the ALTER COLUMN that shows the actual time required to make the change. The results show that the ALTER COLUMN is not instantaneous, and the time required grows linearly.
RecordCt     Elapsed Mcs
-----------  -----------
      10000       184019
     100000      1814181
    1000000     18410841
My recommendation would be to batch it as you suggested. Create a new column, and pre-populate the column over time using a combination of ROWCOUNT and WAITFOR.
Code your script so that the WAITFOR value is read from a table. That way you can modify the WAITFOR value on-the-fly as your production server starts to bog down. You can shorten the WAITFOR during off-peak hours. (You can even use DMVs to make your WAITFOR value automatic, but this is certainly more complex.)
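A rough sketch of that loop, with hypothetical names (dbo.BigTable holding old column OldCol and new column NewCol, plus a one-row control table dbo.BatchControl) and UPDATE TOP used in place of the deprecated SET ROWCOUNT:
--dbo.BatchControl(DelayTime varchar(8)) holds a delay such as '00:00:05'
DECLARE @Delay varchar(8);
WHILE EXISTS (SELECT 1 FROM dbo.BigTable
              WHERE NewCol IS NULL AND OldCol IS NOT NULL)
BEGIN
    UPDATE TOP (100000) dbo.BigTable
    SET NewCol = OldCol
    WHERE NewCol IS NULL AND OldCol IS NOT NULL;
    --re-read the delay each pass so it can be tuned while the job runs
    SELECT @Delay = DelayTime FROM dbo.BatchControl;
    WAITFOR DELAY @Delay;
END;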
This is a complex update that will require planning and a lot of babysitting.
Rob
Here is the ALTER COLUMN test code.
USE tempdb;
SET NOCOUNT ON;
GO
IF EXISTS (SELECT * FROM sys.tables WHERE [object_id] = OBJECT_ID('dbo.TestTable'))
DROP TABLE dbo.TestTable;
GO
CREATE TABLE dbo.TestTable (
ColID int IDENTITY,
ColTest int NULL,
ColGuid uniqueidentifier DEFAULT NEWSEQUENTIALID()
);
GO
INSERT INTO dbo.TestTable DEFAULT VALUES;
GO 10000
UPDATE dbo.TestTable SET ColTest = ColID;
GO
DECLARE @t1 time(7) = SYSDATETIME();
DECLARE @t2 time(7);
ALTER TABLE dbo.TestTable ALTER COLUMN ColTest bigint NULL;
SET @t2 = SYSDATETIME();
SELECT
MAX(ColID) AS RecordCt,
DATEDIFF(mcs, @t1, @t2) AS [Elapsed Mcs]
FROM dbo.TestTable;
A simple alter table <table> alter column <column> bigint null should take basically no time. There won't be any conversion issues or null checks - I don't see why this wouldn't be relatively instant.
If you do it through the GUI, it'll probably try to create a temp table, drop the existing table, and create a new one - definitely don't do that.
In SQL Server 2016+, this alter table <table> alter column <column> bigint null statement will be a simple metadata change (instant) if the table is fully compressed.
More info here from Paul White:
https://sqlperformance.com/2020/04/database-design/new-metadata-column-changes-sql-server-2016
Compression must be enabled:
On all indexes and partitions, including the base heap or clustered index.
Either ROW or PAGE compression.
Indexes and partitions may use a mixture of these compression levels. The important thing is there are no uncompressed indexes or partitions.
Changing from NULL to NOT NULL is not allowed.
The following integer type changes are supported:
smallint to integer or bigint.
integer to bigint.
smallmoney to money (uses integer representation internally).
The following string and binary type changes are supported:
char(n) to char(m) or varchar(m)
nchar(n) to nchar(m) or nvarchar(m)
binary(n) to binary(m) or varbinary(m)
All of the above only for n < m and m != max
Collation changes are not allowed
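A short sketch of that sequence on a hypothetical dbo.BigTable with an int column SomeId (SQL Server 2016 or later):
--compress the base table (heap or clustered index) and every other index
ALTER TABLE dbo.BigTable REBUILD WITH (DATA_COMPRESSION = ROW);
ALTER INDEX ALL ON dbo.BigTable REBUILD WITH (DATA_COMPRESSION = ROW);
--with everything compressed, widening int to bigint is metadata-only;
--keep the column's existing nullability (here assumed NULL)
ALTER TABLE dbo.BigTable ALTER COLUMN SomeId bigint NULL;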
