How can I make this query run faster - sql-server

I have a table that can be simplified to the below:
Create Table Data (
DATAID bigint identity(1,1) NOT NULL,
VALUE1 varchar(200) NOT NULL,
VALUE2 varchar(200) NOT NULL,
CONSTRAINT PK_DATA PRIMARY KEY CLUSTERED (DATAID ASC)
)
Among others, this index exists:
CREATE NONCLUSTERED INDEX VALUEIDX ON dbo.DATA
(VALUE1 ASC) INCLUDE (VALUE2)
The table has about 9 million rows with mostly sparse data in VALUE1 and VALUE2.
The query Select Count(*) from DATA takes about 30 seconds. And the following query takes 1 minute and 30 seconds:
Select Count(*) from DATA Where VALUE1<>VALUE2
Is there any way I can make this faster? I basically need to find (and update) all rows where VALUE1 is different from VALUE2. I considered adding a bit field called ISDIFF and updating it via a trigger whenever the value fields are updated. But then I need to create an index on the bit field and select WHERE ISDIFF=1.
Any help will be appreciated.
PS: Using MS SQL Server 2008
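For reference, a rough sketch of the ISDIFF idea without a trigger (column and index names are illustrative; persisted computed columns and indexes on them are available in SQL Server 2008):
-- Sketch only: a persisted computed column stays in sync automatically,
-- and an index on it lets the count/update touch only the differing rows.
ALTER TABLE dbo.DATA ADD ISDIFF AS
    (CASE WHEN VALUE1 <> VALUE2 THEN 1 ELSE 0 END) PERSISTED;

CREATE NONCLUSTERED INDEX IX_DATA_ISDIFF ON dbo.DATA (ISDIFF);

SELECT COUNT(*) FROM dbo.DATA WHERE ISDIFF = 1;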

Related

Index and primary key in large table that doesn't have an Id column

I'm looking for guidance on the best practice for adding indexes / primary key for the following table in SQL Server.
My goal is to maximize performance mostly on selecting data, but also in inserts.
IndicatorValue
(
[IndicatorId] [uniqueidentifier] NOT NULL, -- this is a foreign key
[UnixTime] [bigint] NOT null,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL
)
The table will have over 10 million rows. Data is batch inserted between 5-10 thousand rows at a time.
I frequently query the data and retrieve the same 5-10 thousand rows at a time with SQL similar to
SELECT [UnixTime]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
or
SELECT [UnixTime], [Value]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
Based on my limited knowledge of SQL indexes, I think:
I should have a clustered index on IndicatorId and Interval. Because of the ORDER BY, should it also include UnixTime?
As I don't have an identity column (I didn't create one because I wouldn't use it), I could have a non-clustered primary key on IndicatorId, UnixTime and Interval, because I read that it's always good to have a PK on every table.
Also, the data is very rarely deleted, and there are not many updates, but when they happen it's only on 1 row.
Any insight on best practices would be much appreciated.
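For reference, a rough sketch of the arrangement described above (index and constraint names are illustrative): a clustered index matching the query's WHERE and ORDER BY, plus a nonclustered primary key to enforce uniqueness.
-- Sketch only: clustered index covers the equality predicates and the sort;
-- the PK is declared nonclustered on the combination that must be unique.
CREATE CLUSTERED INDEX IX_IndicatorValue
    ON [IndicatorValue] ([IndicatorId], [Interval], [UnixTime]);

ALTER TABLE [IndicatorValue]
    ADD CONSTRAINT PK_IndicatorValue
    PRIMARY KEY NONCLUSTERED ([IndicatorId], [UnixTime], [Interval]);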

Partition SQL Server tables based on column not in the primary key?

Let's say I have a table like this:
create table test_partitions
(
pk_id int not null,
col1 nvarchar(20),
col2 nvarchar(100),
constraint pk_test_partitions primary key (pk_id, col1)
);
I want to partition this table to improve query performance so that I don't have to look through the whole table every time I need something. So I added a computed column:
create table test_partitions
(
pk_id int not null,
partition_id as pk_id % 10 persisted not null,
col1 nvarchar(20),
col2 nvarchar(100),
constraint pk_test_partitions primary key (pk_id, col1)
);
So every time I do select * from test_partitions where pk_id = 123 I want SQL Server to look in only 1/10th of the entire table. I don't want to add the partition_id column to the primary key because it will never be part of the where clause. How do I partition my table on partition_id?
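For reference, a rough sketch of what partitioning on the computed column would look like (function, scheme, and filegroup choices are illustrative). Note that SQL Server requires the partitioning column to appear in every unique index on the table, so partition_id ends up in the primary key anyway:
-- Sketch only: boundary values 0..8 give 10 partitions for pk_id % 10.
CREATE PARTITION FUNCTION pf_test_partitions (int)
    AS RANGE LEFT FOR VALUES (0, 1, 2, 3, 4, 5, 6, 7, 8);

CREATE PARTITION SCHEME ps_test_partitions
    AS PARTITION pf_test_partitions ALL TO ([PRIMARY]);

create table test_partitions
(
    pk_id int not null,
    partition_id as pk_id % 10 persisted not null,
    col1 nvarchar(20),
    col2 nvarchar(100),
    constraint pk_test_partitions primary key (pk_id, col1, partition_id)
) on ps_test_partitions (partition_id);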
I just ran a test and found a solution:
SELECT TOP ((SELECT COUNT(*) FROM [TABLE_NAME]) / 10) *
FROM [TABLE_NAME]
It returns 1/10th of your records.
To improve your query performance you can also apply pagination, using OFFSET ... FETCH NEXT to read the rows in pages:
SELECT column-names FROM table-name
ORDER BY column-names
OFFSET n ROWS
FETCH NEXT m ROWS ONLY

How to improve the performance of a query on SQL Server with a table with 150 million records?

How can I improve my select query on a table with more than 150 million records in SQL Server? I need to run a simple select and retrieve the result in as little time as possible. Should I create some index? Partition the table? What do you guys recommend for that?
Here is my current scenario:
Table:
CREATE TABLE [dbo].[table_name]
(
[id] [BIGINT] IDENTITY NOT NULL,
[key] [VARCHAR](20) NOT NULL,
[text_value] [TEXT] NOT NULL,
CONSTRAINT [PK_table_name]
PRIMARY KEY CLUSTERED ([id] ASC)
)
GO
Select:
SELECT TOP 1
text_value
FROM
table_name (NOLOCK)
WHERE
key = #Key
Additional info:
That table won't have updates or deletes
The column text_value has a Json that will be retrieved on the Select and an application will handle this info
No other queries will run on that table, just the query above to retrieve always the text_value based on key column
Every 2 or 3 months about 15 million rows are added to the table
For that query:
SELECT top 1 text_value FROM table_name (NOLOCK) where key = #Key
I would add the following index:
CREATE INDEX idx ON table_name ([key])
INCLUDE (text_value);
The lookup will always be on the key column so that will form the index structure, and you want to include the text_value but not have it in the non-leaf pages. This should always result in an index seek without a key lookup (a covering index).
Also, do not use the TEXT data type as it will be removed in a future version, use VARCHAR(MAX) instead. Ref: https://learn.microsoft.com/en-us/sql/t-sql/data-types/ntext-text-and-image-transact-sql?view=sql-server-2017
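A rough sketch of both changes together (assuming the column can simply be altered in place, which will be a sizable operation on a table this big):
-- Sketch only: move the column off the deprecated TEXT type,
-- then add the covering index on [key].
ALTER TABLE [dbo].[table_name]
    ALTER COLUMN [text_value] VARCHAR(MAX) NOT NULL;

CREATE INDEX idx ON [dbo].[table_name] ([key])
    INCLUDE ([text_value]);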

Dynamic SQL to execute large number of rows from a table

I have a table with a very large number of rows which I wish to execute via dynamic SQL. They are basically existence checks and insert statements and I want to migrate data from one production database to another - we are merging transactional data. I am trying to find the optimal way to execute the rows.
I've found that the COALESCE method of appending all the rows to one another is not efficient for this, particularly when more than ~100 rows are executed at a time.
Assume the structure of the source table is something arbitrary like this:
CREATE TABLE [dbo].[MyTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[DataField1] [int] NOT NULL,
[FK_ID1] [int] NOT NULL,
[LotsMoreFields] [NVARCHAR] (MAX),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED ([ID] ASC)
)
CREATE TABLE [dbo].[FK1]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [int] NOT NULL, -- Unique constrained value
CONSTRAINT [PK_FK1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
The other requirement is I am tracking the source table PK vs the target PK and whether an insert occurred or whether I have already migrated that row to the target. To do this, I'm tracking migrated rows in another table like so:
CREATE TABLE [dbo].[ChangeTracking]
(
[ReferenceID] BIGINT IDENTITY(1,1),
[Src_ID] BIGINT,
[Dest_ID] BIGINT,
[TableName] NVARCHAR(255),
CONSTRAINT [PK_ChangeTracking] PRIMARY KEY CLUSTERED ([ReferenceID] ASC)
)
My existing method is executing some dynamic SQL generated by a stored procedure. The stored proc does PK lookups as the source system has different PK values for table [dbo].[FK1].
E.g.
IF NOT EXISTS (<ignore this existence check for now>)
BEGIN
INSERT INTO [Dest].[dbo].[MyTable] ([DataField1],[FK_ID1],[LotsMoreFields]) VALUES (333,(SELECT [ID] FROM [Dest].[dbo].[FK1] WHERE [Name]=N'ValueFoundInSource'),N'LotsMoreValues');
INSERT INTO [Dest].[dbo].[ChangeTracking] ([Src_ID],[Dest_ID],[TableName]) VALUES (666,SCOPE_IDENTITY(),N'MyTable'); --666 is the PK in [Src].[dbo].[MyTable] for this inserted row
END
So when you have a million of these, it isn't quick.
Is there a recommended performant way of doing this?
As mentioned, the MERGE statement works well when you're looking at a complex JOIN condition (if any of these fields are different, update the record to match). You can also look into creating a HASHBYTES hash of the entire record to quickly find differences between source and target tables, though that can also be time-consuming on very large data sets.
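For illustration, a rough sketch of the HASHBYTES idea against the example tables above, joining through the ChangeTracking map (the separator, algorithm, and the choice to exclude the FK column are assumptions; CONCAT and SHA2_256 need SQL Server 2012+, and HASHBYTES input is limited to 8000 bytes before SQL Server 2016):
-- Sketch only: hash each side's data columns and compare to spot
-- already-migrated rows whose data has since diverged.
-- [FK_ID1] is left out because source and target use different surrogate key values.
SELECT ct.[Src_ID], ct.[Dest_ID]
FROM [Dest].[dbo].[ChangeTracking] ct
JOIN [Src].[dbo].[MyTable]  s ON s.[ID] = ct.[Src_ID]
JOIN [Dest].[dbo].[MyTable] d ON d.[ID] = ct.[Dest_ID]
WHERE ct.[TableName] = N'MyTable'
  AND HASHBYTES('SHA2_256', CONCAT(s.[DataField1], N'|', s.[LotsMoreFields]))
   <> HASHBYTES('SHA2_256', CONCAT(d.[DataField1], N'|', d.[LotsMoreFields]));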
It sounds like you're making these updates like a front-end developer, by checking each row for a match and then doing the insert. It will be far more efficient to do the inserts with a single query. Below is an example that looks for names that are in the tblNewClient table, but not in the tblClient table:
INSERT INTO tblClient
( [Name] ,
TypeID ,
ParentID
)
SELECT nc.[Name] ,
nc.TypeID ,
nc.ParentID
FROM tblNewClient nc
LEFT JOIN tblClient cl
ON nc.[Name] = cl.[Name]
WHERE cl.ID IS NULL;
This will be way more efficient than doing it RBAR (row by agonizing row).
Taking the two answers from #RusselFox and putting them together, I reached this tentative solution (which looks a LOT more efficient):
MERGE INTO [Dest].[dbo].[MyTable] [MT_D]
USING (SELECT [MT_S].[ID] AS [SrcID],[MT_S].[DataField1],[FK1_D].[ID] AS [FK_ID1],[MT_S].[LotsMoreFields]
       FROM [Src].[dbo].[MyTable] [MT_S]
       JOIN [Src].[dbo].[FK1] [FK1_S] ON [MT_S].[FK_ID1] = [FK1_S].[ID]
       JOIN [Dest].[dbo].[FK1] [FK1_D] ON [FK1_S].[Name] = [FK1_D].[Name]
      ) [SRC] ON 1 = 0
WHEN NOT MATCHED THEN
    INSERT ([DataField1],[FK_ID1],[LotsMoreFields])
    VALUES ([SRC].[DataField1],[SRC].[FK_ID1],[SRC].[LotsMoreFields])
OUTPUT [SRC].[SrcID],INSERTED.[ID],0,N'MyTable' INTO [Dest].[dbo].[ChangeTracking]([Src_ID],[Dest_ID],[AlreadyExists],[TableName]);

JOIN Performance: Composite key versus BigInt Primary Key

We have a table that is going to be say 100 million to a billion rows (Table name: Archive)
This table will be referenced from another table, Users.
We have 2 options for the primary key on the Archive table:
option 1: dataID (bigint)
option 2: userID + datetime (4 byte version).
Schema:
Users
- userID (int)
Archive
- userID
- datetime
OR
Archive
- dataID (big int)
Which one would be faster?
We are shying away from using Option #1 because bigint is 8 bytes, and with 100 million rows that will add up to a lot of storage.
Update
Ok, sorry, I forgot to mention: userID and datetime have to be there regardless, so that was the reason for not adding another column, dataID, to the table.
Some thoughts, but there is probably not a clear cut solution:
If you have a billion rows, why not use int which goes from -2.1 billion to +2.1 billion?
Userid, int, 4 bytes + smalldatetime, 4 bytes = 8 bytes, same as bigint
If you are thinking of userid + smalldatetime then surely this is useful anyway.
If so, adding a surrogate "archiveID" column will increase space anyway
Do you require filtering/sorting by userid + smalldatetime?
Make sure your model is correct, worry about JOINs later...
Concern: Using UserID/[small]datetime carries with it a high risk of not being unique.
Here is some real schema. Is this what you're talking about?
-- Users (regardless of Archive choice)
CREATE TABLE dbo.Users (
userID int NOT NULL IDENTITY,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID)
)
-- Archive option 1
CREATE TABLE dbo.Archive (
dataID bigint NOT NULL IDENTITY,
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (dataID)
)
-- Archive option 2
CREATE TABLE dbo.Archive (
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID, [datetime] DESC)
)
CREATE NONCLUSTERED INDEX <name> ON dbo.Archive (
userID,
[datetime] DESC
)
If this were my decision, I would definitely go with option 1. Disk is cheap.
If you go with Option 2, it's likely that you will have to add some other column to your PK to make it unique, then your design starts degrading.
What about option 3: making dataID a 4-byte int?
Also, if I understand it right, the archive table will be referenced from the users table, so it wouldn't even make much sense to have the userID in the archive table.
I recommend that you set up a simulation to validate this in your environment, but my guess would be that the single bigint would be faster in general; however, when you query the table, what are you going to be querying on?
If I was building an archive, I might lean toward having an autoincrement identity field, and then using a partitioning scheme to partition based on DateTime and perhaps userID, but that would depend on the circumstances.
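To illustrate that last idea, a rough sketch (boundary dates and object names are made up; the partitioning column has to be part of the clustered key, so it is added to the PK here):
-- Sketch only: identity surrogate key, partitioned by the smalldatetime column.
CREATE PARTITION FUNCTION pf_archive (smalldatetime)
    AS RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01', '2012-01-01');

CREATE PARTITION SCHEME ps_archive
    AS PARTITION pf_archive ALL TO ([PRIMARY]);

CREATE TABLE dbo.Archive (
    dataID     bigint        NOT NULL IDENTITY,
    userID     int           NOT NULL,
    [datetime] smalldatetime NOT NULL,
    CONSTRAINT PK_Archive PRIMARY KEY CLUSTERED (dataID, [datetime])
) ON ps_archive ([datetime]);

CREATE NONCLUSTERED INDEX IX_Archive_User
    ON dbo.Archive (userID, [datetime] DESC)
    ON ps_archive ([datetime]);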
