I have the following table
CREATE TABLE [dbo].[ActiveHistory]
(
[ID] [INT] IDENTITY(1,1) NOT NULL,
[Date] [VARCHAR](250) NOT NULL,
[ActiveID] [INT] NOT NULL,
[UserID] [INT] NOT NULL,
CONSTRAINT [PK_ActiveHistory]
PRIMARY KEY CLUSTERED ([ID] ASC)
)
About 600,000 rows are inserted into the table per day, which means roughly 300,000 distinct actives for one date with about 500 distinct users. I would like to keep about 5 years of history in one table, which means more than a billion rows; over 5 years there will be about 4,000 distinct UserIDs and 1,000,000 distinct ActiveIDs in total. It is very important for me that working with this table is fast.
Most of the queries in the past joined on Date and UserID, but lately I have had to include ActiveID quite often; sometimes only two of the three columns are used (any pair).
I never use ID in a join.
I currently have a nonclustered index with UserID and Date as index key columns and ID and ActiveID as included columns. My question is: how should I best arrange the indexes on this table given these new requirements? Simply adding an index for every option could take a huge amount of space, and the application that uses the same server already suffers when CPU usage goes to 99%, so I am not sure how new indexes will affect that.
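For reference, the existing index looks roughly like this (the index name is just a placeholder):
-- Existing index: UserID + Date as key columns, ID and ActiveID included
CREATE NONCLUSTERED INDEX [IX_ActiveHistory_UserID_Date]
ON [dbo].[ActiveHistory] ([UserID] ASC, [Date] ASC)
INCLUDE ([ID], [ActiveID]);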
I'm looking for guidance on the best practice for adding indexes / primary key for the following table in SQL Server.
My goal is to maximize performance mostly on selecting data, but also in inserts.
CREATE TABLE [dbo].[IndicatorValue]
(
[IndicatorId] [uniqueidentifier] NOT NULL, -- this is a foreign key
[UnixTime] [bigint] NOT NULL,
[Value] [decimal](15,4) NOT NULL,
[Interval] [int] NOT NULL
)
The table will have over 10 million rows. Data is batch inserted between 5-10 thousand rows at a time.
I frequently query the data and retrieve the same 5-10 thousand rows at a time with SQL similar to
SELECT [UnixTime]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
or
SELECT [UnixTime], [Value]
FROM [IndicatorValue]
WHERE [IndicatorId] = 'xxx GUID xxx'
AND [Interval] = 2
ORDER BY [UnixTime]
Based on my limited knowledge of SQL indexes, I think:
I should have a clustered index on IndicatorId and Interval. Because of the ORDER BY, should it also include UnixTime?
As I don't have an identity column (I didn't create one because I wouldn't use it), I could have a non-clustered primary key on IndicatorId, UnixTime and Interval, because I read that it's always good to have a PK on every table.
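In DDL, what I have in mind is roughly this (the index and constraint names are placeholders, and this is just my guess at a sensible arrangement):
-- Clustered index matching the WHERE and ORDER BY of the queries above
CREATE CLUSTERED INDEX [CIX_IndicatorValue]
ON [dbo].[IndicatorValue] ([IndicatorId], [Interval], [UnixTime]);
-- Non-clustered primary key, since the clustered index above is not unique
ALTER TABLE [dbo].[IndicatorValue]
ADD CONSTRAINT [PK_IndicatorValue]
PRIMARY KEY NONCLUSTERED ([IndicatorId], [UnixTime], [Interval]);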
Also, the data is very rarely deleted, and there are not many updates, but when they happen it's only on 1 row.
Any insight on best practices would be much appreciated.
I have a table containing a very large number of rows of dynamic SQL which I wish to execute. They are basically existence checks and insert statements; I want to migrate data from one production database to another as we are merging transactional data. I am trying to find the optimal way to execute the statements.
I've found that the coalesce method of appending all the rows to one another is not efficient, particularly when the number of rows executed at a time is greater than ~100.
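For context, by the coalesce method I mean something along these lines (the table and column holding the generated statements are hypothetical):
DECLARE @sql NVARCHAR(MAX);
-- Concatenate every generated statement into one batch, then execute it
SELECT @sql = COALESCE(@sql + NCHAR(10), N'') + [SqlStatement]
FROM [dbo].[GeneratedStatements]; -- hypothetical table of generated statements
EXEC sp_executesql @sql;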
Assume the structure of the source table is something arbitrary like this:
CREATE TABLE [dbo].[MyTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[DataField1] [int] NOT NULL,
[FK_ID1] [int] NOT NULL,
[LotsMoreFields] [NVARCHAR] (MAX),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED ([ID] ASC)
)
CREATE TABLE [dbo].[FK1]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](255) NOT NULL, -- Unique constrained value
CONSTRAINT [PK_FK1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
The other requirement is that I am tracking the source table PK vs. the target PK, and whether an insert occurred or whether I have already migrated that row to the target. To do this, I'm tracking migrated rows in another table like so:
CREATE TABLE [dbo].[ChangeTracking]
(
[ReferenceID] BIGINT IDENTITY(1,1),
[Src_ID] BIGINT,
[Dest_ID] BIGINT,
[TableName] NVARCHAR(255),
CONSTRAINT [PK_ChangeTracking] PRIMARY KEY CLUSTERED ([ReferenceID] ASC)
)
My existing method executes dynamic SQL generated by a stored procedure. The stored proc does PK lookups because the source system has different PK values for table [dbo].[FK1].
E.g.
IF NOT EXISTS (<ignore this existence check for now>)
BEGIN
INSERT INTO [Dest].[dbo].[MyTable] ([DataField1],[FK_ID1],[LotsMoreFields]) VALUES (333,(SELECT [ID] FROM [Dest].[dbo].[FK1] WHERE [Name]=N'ValueFoundInSource'),N'LotsMoreValues');
INSERT INTO [Dest].[dbo].[ChangeTracking] ([Src_ID],[Dest_ID],[TableName]) VALUES (666,SCOPE_IDENTITY(),N'MyTable'); --666 is the PK in [Src].[dbo].[MyTable] for this inserted row
END
So when you have a million of these, it isn't quick.
Is there a recommended performant way of doing this?
As mentioned, the MERGE statement works well when you're looking at a complex JOIN condition (if any of these fields are different, update the record to match). You can also look into creating a HASHBYTES hash of the entire record to quickly find differences between source and target tables, though that can also be time-consuming on very large data sets.
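As a rough sketch (the delimiter and hash algorithm are only illustrative, and HASHBYTES input length limits apply on older versions), the hash comparison could look like this for rows that have already been migrated:
-- Rows already migrated whose non-key contents now differ between source and target
SELECT ct.[Src_ID], ct.[Dest_ID]
FROM [Src].[dbo].[MyTable] s
JOIN [Dest].[dbo].[ChangeTracking] ct
    ON ct.[Src_ID] = s.[ID] AND ct.[TableName] = N'MyTable'
JOIN [Dest].[dbo].[MyTable] d
    ON d.[ID] = ct.[Dest_ID]
WHERE HASHBYTES('SHA2_256', CONCAT(s.[DataField1], N'|', s.[LotsMoreFields]))
   <> HASHBYTES('SHA2_256', CONCAT(d.[DataField1], N'|', d.[LotsMoreFields]));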
It sounds like you're making these updates like a front-end developer, by checking each row for a match and then doing the insert. It will be far more efficient to do the inserts with a single query. Below is an example that looks for names that are in the tblNewClient table, but not in the tblClient table:
INSERT INTO tblClient
( [Name] ,
TypeID ,
ParentID
)
SELECT nc.[Name] ,
nc.TypeID ,
nc.ParentID
FROM tblNewClient nc
LEFT JOIN tblClient cl
ON nc.[Name] = cl.[Name]
WHERE cl.ID IS NULL;
This will be way more efficient than doing it RBAR (row by agonizing row).
Taking the two answers from #RusselFox and putting them together, I reached this tentative solution (which looks a LOT more efficient):
MERGE INTO [Dest].[dbo].[MyTable] [MT_D]
USING (SELECT [MT_S].[ID] AS [SrcID], [MT_S].[DataField1], [FK1_D].[ID] AS [FK_ID1], [MT_S].[LotsMoreFields]
       FROM [Src].[dbo].[MyTable] [MT_S]
       JOIN [Src].[dbo].[FK1] [FK1_S] ON [MT_S].[FK_ID1] = [FK1_S].[ID]
       JOIN [Dest].[dbo].[FK1] [FK1_D] ON [FK1_S].[Name] = [FK1_D].[Name]
      ) [SRC] ON 1 = 0
WHEN NOT MATCHED THEN
    INSERT ([DataField1], [FK_ID1], [LotsMoreFields])
    VALUES ([SRC].[DataField1], [SRC].[FK_ID1], [SRC].[LotsMoreFields])
OUTPUT [SRC].[SrcID], INSERTED.[ID], N'MyTable' INTO [Dest].[dbo].[ChangeTracking] ([Src_ID], [Dest_ID], [TableName]);
I have the following table:
CREATE TABLE [dbo].[HousePrices](
[Id] [int] IDENTITY(1,1) NOT NULL,
[PropertyType] [int] NULL,
[Town] [nvarchar](500) NULL,
[County] [nvarchar](500) NULL,
[Outcode] [nvarchar](10) NULL,
[Price] [int] NULL,
PRIMARY KEY CLUSTERED
(
[Id] ASC
)
)
The table currently holds around 20 million records, and I need to run queries to calculate the average price in a certain area. For example:
select avg(price)
from houseprices
where town = 'London'
and propertytype = 1
The WHERE clause could have any combination of Town, County or Outcode, and will probably always have PropertyType (which is one of four values). I've tried creating a non-clustered index on one of the fields, but that still took around 2 minutes to run.
Surely this should be able to run in under a second?
It depends.
If your WHERE clause only returns a small subset of the records, then create an index for each combination of search values, e.g. one multi-field index on PropertyType, Town, County, Outcode, another on PropertyType, County, Outcode, etc. You can skip indexes which are prefixes of existing indexes (i.e. if you have an index on A, B, C, D, you don't need one on A, B, C; however, you do need A, C, D if B can be omitted).
You can reduce the number of required indexes by reducing the number of combinations: for example, you could make County mandatory when searching for Town -- which would make sense, since getting the average over Vienna (Austria) and Vienna (Virginia) would be quite useless.
If your WHERE clause returns a large set of records, your query will take a lot of time anyway, since all the selected records need to be fetched from the HDD or cache to calculate the average. In this case, you can increase performance by including the Price column in your indexes as an included column. This means that your query will only have to fetch the index rather than the actual rows.
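For example, a couple of the indexes described above might look like this (the names are arbitrary):
-- Covering index for PropertyType + Town searches
CREATE NONCLUSTERED INDEX [IX_HousePrices_PropertyType_Town]
ON [dbo].[HousePrices] ([PropertyType], [Town])
INCLUDE ([Price]);
-- Covering index for PropertyType + County + Outcode searches
CREATE NONCLUSTERED INDEX [IX_HousePrices_PropertyType_County_Outcode]
ON [dbo].[HousePrices] ([PropertyType], [County], [Outcode])
INCLUDE ([Price]);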
In SQL Server 2012 (version 11.0.5058) I have a query like this:
SELECT TOP 30
row_number() OVER (ORDER BY SequentialNumber ASC) AS [row_number],
o.Oid, StopAzioni
FROM
tmpTestPerf O
INNER JOIN
Stati s on O.Stato = s.Oid
WHERE
StopAzioni = 0
When I use ORDER BY SequentialNumber ASC it takes 400 ms.
When I use ORDER BY SequentialNumber DESC in the ROW_NUMBER function it takes only 2 ms.
(This is in a test environment; in production it is 7,000 ms, i.e. 7 seconds, vs 15 ms!)
Analyzing the execution plan, I found that it is the same for both queries. The interesting difference is that the slower one works through all the rows filtered by the StopAzioni = 0 condition, 117k rows.
The faster one only reads 53 rows.
There is a primary key on the tmpTestPerf table and an ascending index on the SequentialNumber column.
How can this be explained?
Regards.
Daniele
This is the script of the tmpTestPerf and Stati tables with their indexes:
CREATE TABLE [dbo].[tmpTestPerf]
(
[Oid] [uniqueidentifier] NOT NULL,
[SequentialNumber] [bigint] NOT NULL,
[Anagrafica] [uniqueidentifier] NULL,
[Stato] [uniqueidentifier] NULL,
CONSTRAINT [PK_tmpTestPerf]
PRIMARY KEY CLUSTERED ([Oid] ASC)
)
CREATE NONCLUSTERED INDEX [IX_2]
ON [dbo].[tmpTestPerf]([SequentialNumber] ASC)
CREATE TABLE [dbo].[Stati]
(
[Oid] [uniqueidentifier] ROWGUIDCOL NOT NULL,
[Descrizione] [nvarchar](100) NULL,
[StopAzioni] [bit] NOT NULL,
CONSTRAINT [PK_Stati]
PRIMARY KEY CLUSTERED ([Oid] ASC)
) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [iStopAzioni_Stati]
ON [dbo].[Stati]([StopAzioni] ASC)
GO
The query plans are not exactly the same.
Select the Index Scan operator.
Press F4 to view the properties and have a look at Scan Direction.
When you order ascending the Scan Direction is FORWARD and when you order descending it is BACKWARD.
The difference in the number of rows is there because it takes only 53 rows to find 30 matching rows when scanning backwards, while it takes 117k rows to find 30 matching rows when scanning forwards in the index.
Note that without an ORDER BY clause on the main query there is no guarantee which 30 rows you will get. In this case it just happens to be the first thirty or the last thirty, depending on the ORDER BY used in ROW_NUMBER().
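If the intent really is "the first 30 by SequentialNumber", a sketch of the same query with an explicit ORDER BY on the outer query would be:
SELECT TOP 30
    ROW_NUMBER() OVER (ORDER BY SequentialNumber ASC) AS [row_number],
    o.Oid, StopAzioni
FROM tmpTestPerf o
INNER JOIN Stati s ON o.Stato = s.Oid
WHERE StopAzioni = 0
ORDER BY SequentialNumber ASC;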
We have a table that is going to be say 100 million to a billion rows (Table name: Archive)
This table will be referenced from another table, Users.
We have 2 options for the primary key on the Archive table:
option 1: dataID (bigint)
option 2: userID + datetime (4 byte version).
Schema:
Users
- userID (int)
Archive
- userID
- datetime
OR
Archive
- dataID (big int)
Which one would be faster?
We are shying away from using option 1 because bigint is 8 bytes, and with 100 million rows that will add up to a lot of storage.
Update
OK, sorry, I forgot to mention: userID and datetime have to be stored regardless, so that was the reason for not adding another column, dataID, to the table.
Some thoughts, but there is probably not a clear cut solution:
If you have a billion rows, why not use int which goes from -2.1 billion to +2.1 billion?
Userid, int, 4 bytes + smalldatetime, 4 bytes = 8 bytes, same as bigint
If you are thinking of userid + smalldatetime then surely this is useful anyway.
If so, adding a surrogate "archiveID" column will increase space anyway
Do you require filtering/sorting by userid + smalldatetime?
Make sure your model is correct, worry about JOINs later...
Concern: Using UserID/[small]datetime carries with it a high risk of not being unique.
Here is some real schema. Is this what you're talking about?
-- Users (regardless of Archive choice)
CREATE TABLE dbo.Users (
userID int NOT NULL IDENTITY,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID)
)
-- Archive option 1
CREATE TABLE dbo.Archive (
dataID bigint NOT NULL IDENTITY,
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (dataID)
)
-- Archive option 2
CREATE TABLE dbo.Archive (
userID int NOT NULL,
[datetime] smalldatetime NOT NULL,
<other columns>
CONSTRAINT <name> PRIMARY KEY CLUSTERED (userID, [datetime] DESC)
)
CREATE NONCLUSTERED INDEX <name> ON dbo.Archive (
userID,
[datetime] DESC
)
If this were my decision, I would definitely go with option 1. Disk is cheap.
If you go with option 2, it's likely that you will have to add some other column to your PK to make it unique, and then your design starts to degrade.
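For example, it might end up looking like this (the seqNo tie-breaker column and the constraint name are hypothetical):
-- Option 2 with a hypothetical tie-breaker column added to keep the key unique
CREATE TABLE dbo.Archive (
    userID int NOT NULL,
    [datetime] smalldatetime NOT NULL,
    seqNo int NOT NULL,  -- hypothetical tie-breaker for rows sharing userID + [datetime]
    CONSTRAINT PK_Archive PRIMARY KEY CLUSTERED (userID, [datetime] DESC, seqNo)
)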
What about option 3: making dataID a 4-byte int?
Also, if I understand it right, the archive table will be referenced from the users table, so it wouldn't even make much sense to have the userID in the archive table.
I recommend that you set up a simulation to validate this in your environment, but my guess would be that the single bigint would be faster in general; however, when you query the table, what are you going to be querying on?
If I were building an archive, I might lean toward having an auto-increment identity field, and then using a partitioning scheme to partition based on DateTime and perhaps userid, but that would depend on the circumstances.
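A rough sketch of that idea (the function/scheme names and boundary dates are only placeholders):
-- Yearly partitions on the datetime column
CREATE PARTITION FUNCTION pfArchiveByYear (smalldatetime)
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01', '2012-01-01');
CREATE PARTITION SCHEME psArchive
AS PARTITION pfArchiveByYear ALL TO ([PRIMARY]);
-- The Archive table's clustered index would then be created ON psArchive([datetime]),
-- so each yearly partition can be maintained or switched out independently.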