Azure SQL Database - Indexing 10+ million rows

I have a database hosted on Azure SQL Database, and below is the schema for a single table:
CREATE TABLE [dbo].[Article](
[ArticleHash] [bigint] NOT NULL,
[FeedHash] [bigint] NOT NULL,
[PublishedOn] [datetime] NOT NULL,
[ExpiresOn] [datetime] NOT NULL,
[DateCreated] [datetime] NOT NULL,
[Url] [nvarchar](max) NULL,
[Title] [nvarchar](max) NULL,
[Summary] [nvarchar](max) NULL,
CONSTRAINT [PK_dbo.Article] PRIMARY KEY CLUSTERED
(
[ArticleHash] ASC,
[FeedHash] ASC
) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
)
I have a few queries that are really slow, since this table contains over 10 million records:
SELECT *
FROM (SELECT ROW_NUMBER() OVER (ORDER BY PublishedOn DESC) page_rn, *
FROM Article
WHERE (FeedHash = -8498408432858355421 AND ExpiresOn > '2016-01-18 14:18:04.970')
) paged
WHERE page_rn>0 AND page_rn<=21
And one more:
SELECT ArticleHash
FROM Article
WHERE (FeedHash = -8498408432858355421
AND ArticleHash IN (-1776401574438488264,996871668263687248,-5186412434178204433,6410875610077852481,-5428137965544411137,-5326808411357670185,2738089298373692963,9180394103094543689,8120572317154347382,-369910952783360989,1071631911959711259,1187953785740614613,6665010324256449533,3720795027036815325,-5458296665864077096,-5832860214011872788,-2941009192514997875,334202794706549486,-5579819992060984166,-696086851747657853,-7466754676679718482,-1461835507954240474,9021713212273098604,-6337379666850984216,5502287921912059432)
AND ExpiresOn >= '2016-01-18 14:28:25.883')
What is the best way to index this table so that queries execute below 300 ms? Is it even possible on such a big table? The Azure SQL Database edition is S3.
Also, a lot of DELETE/INSERT actions are performed on this table, so any indexes should not hurt the performance of those operations...

The first query would benefit from native pagination with OFFSET and FETCH:
SELECT *
FROM Article
WHERE FeedHash = -8498408432858355421 AND ExpiresOn > '2016-01-18 14:18:04.970'
ORDER BY PublishedOn DESC
OFFSET 0 ROWS FETCH NEXT 20 ROWS ONLY;
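OFFSET/FETCH still has to sort every row that matches the filter, so it pairs best with an index that serves both the filter and the sort order. A sketch (hypothetical index name):
CREATE INDEX IX_Article_FeedHash_PublishedOn
ON Article (FeedHash, PublishedOn DESC);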
The second query might benefit from replacing the IN list with an INNER JOIN against a table variable:
DECLARE @ArticleHashList AS TABLE (ArticleHashWanted bigint PRIMARY KEY);
INSERT INTO @ArticleHashList (ArticleHashWanted) VALUES
(-1776401574438488264),
( 996871668263687248),
(-5186412434178204433),
( 6410875610077852481),
(-5428137965544411137),
(-5326808411357670185),
( 2738089298373692963),
( 9180394103094543689),
( 8120572317154347382),
( -369910952783360989),
( 1071631911959711259),
( 1187953785740614613),
( 6665010324256449533),
( 3720795027036815325),
(-5458296665864077096),
(-5832860214011872788),
(-2941009192514997875),
( 334202794706549486),
(-5579819992060984166),
( -696086851747657853),
(-7466754676679718482),
(-1461835507954240474),
( 9021713212273098604),
(-6337379666850984216),
( 5502287921912059432);
SELECT ArticleHash
FROM Article
INNER JOIN @ArticleHashList ON ArticleHash = ArticleHashWanted
WHERE FeedHash = -8498408432858355421 AND ExpiresOn >= '2016-01-18 14:28:25.883';
Creating indexes on the date columns should also help a lot:
CREATE INDEX idx_Article_PublishedOn ON Article (PublishedOn);
CREATE INDEX idx_Article_ExpiresOn ON Article (ExpiresOn);

For the first query I recommend this index:
CREATE INDEX ix_Article_FeedHash_ExpiresOn_withInclude
ON Article (FeedHash, ExpiresOn)
INCLUDE (DateCreated, PublishedOn, Url, Title, Summary);
The second query should already use a clustered index seek; you should look at the actual execution plan to see what happens. Also, I think you have a bad clustered index: the values do not grow monotonically but appear to be random, so the index is probably very fragmented. You can check it with this query:
select * from sys.dm_db_index_physical_stats(db_id(), object_id('Article'), null, null, 'DETAILED');
If avg_fragmentation_in_percent is between 5 and 30, you can fix it with:
alter index [clustered index name] on Article reorganize;
If avg_fragmentation_in_percent is higher than 30, you can fix it with:
alter index [clustered index name] on Article rebuild;
(If nothing changes after a reorganize, you can still try a rebuild.)
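For reference, a narrower look at the same DMV, keeping just the columns the thresholds above refer to (a sketch):
select index_id, index_level, avg_fragmentation_in_percent, page_count
from sys.dm_db_index_physical_stats(db_id(), object_id('Article'), null, null, 'DETAILED')
order by index_id, index_level;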

Related

Warnings when using graph queries in Visual Studio Database Project

We have some SQL "normal" and graph tables, a script that syncs the information between them.
Everything is working ok from SSMS but when building the database project in Visual Studio using msbuild we get warnings (see code and warnings details below).
If we set TreatTSqlWarningsAsErrors to True these warnings become errors.
We don't want to ignore warnings, but it's unclear why we even get them.
Are these warnings correct?
Why are they showing in Visual Studio only and not in SSMS?
How can we resolve them without ignoring them?
The details:
We have the following two "normal" SQL tables:
CREATE TABLE [dbo].[Currency]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](250) NOT NULL,
[UId] [uniqueidentifier] NOT NULL,
CONSTRAINT [PK_Currency]
PRIMARY KEY CLUSTERED ([Id] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Currency]
ADD CONSTRAINT [DF_Currency_UId] DEFAULT (NEWID()) FOR [UId]
GO
and
CREATE TABLE [dbo].[Portfolio]
(
[Id] [INT] IDENTITY(1,1) NOT NULL,
[Name] [NVARCHAR](250) NULL,
[CurrencyId] [INT] NOT NULL,
[UId] [UNIQUEIDENTIFIER] NOT NULL,
CONSTRAINT [PK_Portfolio]
PRIMARY KEY CLUSTERED ([Id] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 90, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[Portfolio]
ADD CONSTRAINT [DF_Portfolio_UId] DEFAULT (NEWID()) FOR [UId]
GO
ALTER TABLE [dbo].[Portfolio] WITH CHECK
ADD CONSTRAINT [FK_Portfolio_Currency]
FOREIGN KEY([CurrencyId]) REFERENCES [dbo].[Currency] ([Id])
GO
ALTER TABLE [dbo].[Portfolio] CHECK CONSTRAINT [FK_Portfolio_Currency]
GO
and we created the following SQL graph schema together with two SQL node tables and one SQL edge:
CREATE SCHEMA [graph];
GO
CREATE TABLE [graph].[Currency]
(
[Id] UNIQUEIDENTIFIER NOT NULL,
PRIMARY KEY CLUSTERED ([Id] ASC),
INDEX [GRAPH_UNIQUE_INDEX_4FFC60C0FCBE4843A7F4B9AB0729FF78] UNIQUE NONCLUSTERED ($node_id)
) AS NODE;
CREATE TABLE [graph].[Portfolio]
(
[Id] UNIQUEIDENTIFIER NOT NULL,
PRIMARY KEY CLUSTERED ([Id] ASC),
INDEX [GRAPH_UNIQUE_INDEX_F39F92BE8DD34DD791D6CA955DA0DA9A] UNIQUE NONCLUSTERED ($node_id)
) AS NODE;
CREATE TABLE [graph].[isOf]
(
[IsActive] BIT CONSTRAINT [DF_isOf_IsActive] DEFAULT ((1)) NOT NULL,
INDEX [GRAPH_UNIQUE_INDEX_AD8C5B40D277413580EDD943AC192869] UNIQUE NONCLUSTERED ($edge_id)
) AS EDGE;
CREATE UNIQUE NONCLUSTERED INDEX [UQ_FromTo]
ON [graph].[isOf] ($from_id, $to_id) ON [PRIMARY];
GO
A sync stored procedure, which will be executed daily, populates the nodes and the edge from the "normal" tables. Parts of it are attached below:
PRINT ( 'Sync graph.[Portfolio]' );
INSERT INTO graph.Portfolio ( Id )
SELECT P.[UId]
FROM dbo.Portfolio P
WHERE NOT EXISTS ( SELECT 1
FROM graph.Portfolio G
WHERE G.Id = P.[UId] );
PRINT ( 'Sync graph.[Currency]' );
INSERT INTO graph.Currency ( Id )
SELECT C.[UId]
FROM dbo.Currency C
WHERE NOT EXISTS ( SELECT 1
FROM graph.Currency G
WHERE G.Id = C.[UId] );
PRINT('Sync Portfolio isOf Currency edge');
;WITH UidCTE
AS ( SELECT P.[UId] AS PortfolioUid ,
C.[UId] AS CurrencyUid
FROM dbo.Portfolio P
JOIN dbo.Currency C ON C.Id = P.CurrencyId )
MERGE graph.isOf AS TGT
USING graph.[Portfolio] AS SourceFrom
JOIN UidCTE CTE ON SourceFrom.Id = CTE.PortfolioUid
JOIN graph.[Currency] AS SourceTo ON CTE.CurrencyUid = SourceTo.Id
ON MATCH(SourceFrom-(TGT)->SourceTo)
WHEN NOT MATCHED BY TARGET THEN
INSERT ( $from_id , $to_id )
VALUES ( SourceFrom.$node_id, SourceTo.$node_id)
WHEN MATCHED AND TGT.[IsActive] = 0 THEN
UPDATE SET TGT.[IsActive] = 1;
and the following script, which updates the edge's IsActive column in case the relationship doesn't exist anymore in the "normal" tables (we don't want to delete it from the graph nodes/edges):
;WITH PortfolioIsOfCurrency
AS ( SELECT P.[UId] AS PortfolioUid ,
C.[UId] AS CurrencyUid
FROM dbo.Portfolio P
JOIN dbo.Currency C ON C.Id = P.CurrencyId ) ,
NodeCTE
AS ( SELECT P.$node_id AS FromNode ,
C.$node_id AS ToNode
FROM PortfolioIsOfCurrency CTE
JOIN graph.Portfolio P ON P.Id = CTE.PortfolioUid
JOIN graph.Currency C ON C.Id = CTE.CurrencyUid )
UPDATE ISOF
SET isOf.IsActive = 0
FROM graph.isOf ISOF
WHERE NOT EXISTS ( SELECT 1
FROM NodeCTE N
WHERE N.FromNode = ISOF.$from_id
AND N.ToNode = ISOF.$to_id );
We created a database project in VS 2019 with the following settings:
<DSP>Microsoft.Data.Tools.Schema.Sql.Sql150DatabaseSchemaProvider</DSP>
<TargetFrameworkVersion>v4.5</TargetFrameworkVersion>
<TreatWarningsAsErrors>False</TreatWarningsAsErrors>
<TreatTSqlWarningsAsErrors>False</TreatTSqlWarningsAsErrors>
From SSMS everything works perfect.
The problem is with the last statement (the UPDATE of isOf.IsActive): I get the following warnings:
Warning SQL71502: Procedure: [graph].[SyncEngagementContentGraphs] has an unresolved reference to object [graph].[isOf].[N]. graph\Stored Procedures\SyncEngagementContentGraphs.sql 57
Warning SQL71502: Procedure: [graph].[SyncEngagementContentGraphs] has an unresolved reference to object [graph].[isOf].[N]. graph\Stored Procedures\SyncEngagementContentGraphs.sql 58
Warning SQL71509: The model already has an element that has the same name NodeCTE.$node_id. graph\Stored Procedures\SyncEngagementContentGraphs.sql 47
Warning SQL71509: The model already has an element that has the same name NodeCTE.$node_id. graph\Stored Procedures\SyncEngagementContentGraphs.sql 48
And if we set TreatTSqlWarningsAsErrors to True, these warnings become errors; we cannot leave it set to False in the long run.
The solution is to correctly define the NodeCTE common table expression with explicit column names.
This seems to be mandatory in the case of a CTE with graph tables.
The correct query without warnings would be:
;WITH PortfolioIsOfCurrency
AS ( SELECT P.[UId] AS PortfolioUid ,
C.[UId] AS CurrencyUid
FROM dbo.Portfolio P
JOIN dbo.Currency C ON C.Id = P.CurrencyId ) ,
NodeCTE (FromNode, ToNode)
AS ( SELECT P.$node_id AS FromNode ,
C.$node_id AS ToNode
FROM PortfolioIsOfCurrency CTE
JOIN graph.Portfolio P ON P.Id = CTE.PortfolioUid
JOIN graph.Currency C ON C.Id = CTE.CurrencyUid )
UPDATE ISOF
SET isOf.IsActive = 0
FROM graph.isOf ISOF
WHERE NOT EXISTS ( SELECT 1
FROM NodeCTE N
WHERE N.FromNode = ISOF.$from_id
AND N.ToNode = ISOF.$to_id );
This change is needed just for the VS SqlProject warnings. Both versions work with no problems in SSMS/ADS.

Strange query plan on max(date) query on a View

I have a view which comprises 4 yearly tables:
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE VIEW [dbo].[BGT_BETWAYDETAILS]
WITH SCHEMABINDING
AS
SELECT [bwd_BetTicketNr],
       [bwd_LineID],
       [bwd_ResultID],
       [bwd_DateModified],
       [bwd_DateModifiedTrunc],
       [bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2020]
UNION ALL
SELECT [bwd_BetTicketNr],
       [bwd_LineID],
       [bwd_ResultID],
       [bwd_DateModified],
       [bwd_DateModifiedTrunc],
       [bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2019]
UNION ALL
SELECT [bwd_BetTicketNr],
       [bwd_LineID],
       [bwd_ResultID],
       [bwd_DateModified],
       [bwd_DateModifiedTrunc],
       [bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2018]
UNION ALL
SELECT [bwd_BetTicketNr],
       [bwd_LineID],
       [bwd_ResultID],
       [bwd_DateModified],
       [bwd_DateModifiedTrunc],
       [bwd_LineMaxPayout]
FROM [dbo].[BGT_BETWAYDETAILS_2017];
GO
Each table has the following structure:
CREATE TABLE [dbo].[BGT_BETWAYDETAILS_2020]
(
[bwd_BetTicketNr] [bigint] NOT NULL,
[bwd_LineID] [int] NOT NULL,
[bwd_ResultID] [bigint] NOT NULL,
[bwd_DateModified] [datetime] NULL,
[bwd_DateModifiedTrunc] [date] NULL,
[bwd_LineMaxPayout] [decimal](18, 4) NULL,
CONSTRAINT [CSTR__BGT_BETWAYDETAILS_2020_CKEY]
PRIMARY KEY CLUSTERED ([bwd_BetTicketNr] ASC, [bwd_LineID] ASC, [bwd_ResultID] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
I have added a non-clustered index on the truncated date column:
CREATE NONCLUSTERED INDEX [NCI__DATEMODIFIED]
ON [dbo].[BGT_BETWAYDETAILS_2020] ([bwd_DateModifiedTrunc] ASC)
WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF,
SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF,
ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
I am running the following 3 queries:
SELECT COALESCE(MAX([bwd_DateModifiedTrunc]), '2019-01-01') AS next_date
FROM [dbo].[BGT_BETWAYDETAILS_2020]
SELECT COALESCE(MAX([bwd_DateModifiedTrunc]), '2019-01-01') AS next_date
FROM [dbo].[BGT_BETWAYDETAILS]
SELECT COALESCE(CAST(MAX([bwd_DateModified]) AS date), '2019-01-01') AS next_date
FROM [dbo].[BGT_BETWAYDETAILS]
The first one, when run on each yearly table, runs instantly.
The second one seems to take forever, and its query plan seems very strange: it shows two index scans on each yearly table.
The plan for each yearly table on its own is what I expected to see.
Finally, the plan for the query on the non-indexed date column is also what I expected: a clustered index scan on each table. That query runs in ~3 minutes, which is expected.
What is the issue here? Is there some anti-pattern I am missing? Why is the scan on the non-clustered index done twice, according to the live plan? I expected the view to respond as fast as the individual tables.
For the record, I am running this on SQL Server 2017.
This just looks like an optimiser limitation. I have submitted a suggestion that this should be improved.
A simpler example is
CREATE TABLE T1(X INT NULL UNIQUE CLUSTERED);
CREATE TABLE T2(X INT NULL UNIQUE CLUSTERED);
INSERT INTO T1
OUTPUT INSERTED.X INTO T2
SELECT TOP 100000 NULLIF(ROW_NUMBER() OVER (ORDER BY 1/0),1)
FROM sys.all_objects o1,
sys.all_objects o2;
And then
WITH CTE AS
(
SELECT X FROM T1
UNION ALL
SELECT X FROM T2
)
SELECT MAX(X)
FROM CTE
OPTION (QUERYRULEOFF ScalarGbAggToTop)
This disables the query optimizer rule ScalarGbAggToTop, and the query plan then does a MAX on each individual table and computes a MAX of the MAXes, i.e. the same as:
SELECT MAX(MaxX)
FROM
(
SELECT MAX(X) AS MaxX FROM T1
UNION ALL
SELECT MAX(X) AS MaxX FROM T2
) T
With the ScalarGbAggToTop rule enabled the plan now looks like this
It is effectively doing the following...
SELECT MAX(MaxX)
FROM (SELECT MAX(X) AS MaxX
FROM (SELECT TOP 1 X
FROM T1
WHERE X IS NULL
UNION ALL
SELECT TOP 1 X
FROM T1
WHERE X IS NOT NULL
ORDER BY X DESC) T1
UNION ALL
SELECT MAX(X) AS MaxX
FROM (SELECT TOP 1 X
FROM T2
WHERE X IS NULL
UNION ALL
SELECT TOP 1 X
FROM T2
WHERE X IS NOT NULL
ORDER BY X DESC) T2) T0
... but in a very inefficient way. Running the SQL above would give a plan with seeks and each branch only reading a single row.
The plan produced by ScalarGbAggToTop makes only minimal changes to the stream aggregate plan. It looks like it takes the scan from that plan, applies a backwards ordering to it, and then uses the backwards ordering for both the NOT NULL and NULL branches, without performing any additional exploration to see whether there is a more efficient access path.
This means that, in the pathological case where all of the rows are either NULL or NOT NULL, one of the scans will end up reading all of the rows in the table (5 billion in your case, if applicable to all 4 tables). Even with a mix of NULL and NOT NULL values, the fact that the IS NULL branch does a backwards scan is sub-optimal, because NULL is ordered first in SQL Server and so would be at the beginning of the index.
The addition of the IS NULL branch in the first place seems largely unnecessary, as the query would return the same results without it. I imagine it is only needed so that the engine knows whether or not to display the message
Warning: Null value is eliminated by an aggregate or other SET operation.
but I doubt you care about that. In which case adding an explicit WHERE X IS NOT NULL resolves the issue:
WITH CTE AS
(
SELECT X FROM T1
UNION ALL
SELECT X FROM T2
)
SELECT MAX(X)
FROM CTE
WHERE X IS NOT NULL
;
It now has a seek into the NOT NULL part of the index and reads backwards (stopping after the first row is read from each table).
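Applied back to the view in the question, the same rewrite would presumably be (a sketch, untested):
SELECT COALESCE(MAX([bwd_DateModifiedTrunc]), '2019-01-01') AS next_date
FROM [dbo].[BGT_BETWAYDETAILS]
WHERE [bwd_DateModifiedTrunc] IS NOT NULL;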

SQL Server parameterized query does not use a filtered non-clustered index

I defined a non-clustered index with an INCLUDE and a filter on the Students table. The SQL Server version is 2017.
Students table definition:
CREATE TABLE [dbo].[Students]
(
[Id] [INT] IDENTITY(1,1) NOT NULL,
[Name] [NVARCHAR](50) NOT NULL,
[CreatedOn] [DATETIME2](7) NOT NULL,
[Active] [BIT] NOT NULL,
[Deleted] [BIT] NOT NULL,
CONSTRAINT [PK_Students]
PRIMARY KEY CLUSTERED ([Id] ASC)
) ON [PRIMARY]
Non-clustered index with include and filter:
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20200508-225254]
ON [dbo].[Students] ([CreatedOn] ASC)
INCLUDE([Name])
WHERE ([Active] = (1) AND [Deleted] = (0))
ON [PRIMARY]
GO
This query uses NonClusteredIndex-20200508-225254
SELECT Name, CreatedOn FROM dbo.Students
WHERE Active = 1
AND Deleted = 0
ORDER BY CreatedOn
Actual execution plan
But when I use the parameterized query as follows, it doesn't use NonClusteredIndex-20200508-225254. Why does this happen? Where am I wrong?
DECLARE #Active BIT = 1
DECLARE #Deleted BIT = 0
SELECT Name, CreatedOn
FROM dbo.Students
WHERE Active = #Active
AND Deleted = #Deleted
ORDER BY CreatedOn
Actual execution plan
This is entirely expected.
When you compile a plan with parameters or variables, SQL Server needs to produce a plan that will work for any possible values they may have.
You can add OPTION (RECOMPILE) to the statement so that the runtime values are taken into account (basically the parameters are replaced by literals holding the runtime values), but that means a recompile on every execution.
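For example, a sketch of the question's query with that option applied:
DECLARE @Active BIT = 1
DECLARE @Deleted BIT = 0
SELECT Name, CreatedOn
FROM dbo.Students
WHERE Active = @Active
AND Deleted = @Deleted
ORDER BY CreatedOn
OPTION (RECOMPILE) -- variable values are embedded as literals at compile time, so the filtered index can qualify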
You are probably best having two separate queries, one for the case handled by the filtered index and one for the other cases.
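A minimal sketch of that two-query approach, branching on the parameter values:
IF @Active = 1 AND @Deleted = 0
BEGIN
    -- literals satisfy the index filter, so the filtered index is usable
    SELECT Name, CreatedOn
    FROM dbo.Students
    WHERE Active = 1 AND Deleted = 0
    ORDER BY CreatedOn;
END
ELSE
BEGIN
    SELECT Name, CreatedOn
    FROM dbo.Students
    WHERE Active = @Active AND Deleted = @Deleted
    ORDER BY CreatedOn;
END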
You might have been hoping that SQL Server would do something like the below and dynamically switch between the clustered index scan + sort vs filtered index scan and no sort (the filters have startup predicates so only at most one branch is executed)
But getting this plan required a change to the filtered index to move Name into the key columns as below...
CREATE NONCLUSTERED INDEX [NonClusteredIndex-20200508-225254]
ON [dbo].[Students] ([CreatedOn] ASC, [Name] asc)
WHERE ([Active] = (1) AND [Deleted] = (0))
... and rewriting the query to
DECLARE #Active BIT = 1
DECLARE #Deleted BIT = 0
SELECT NAME,
CreatedOn
FROM dbo.Students WITH (INDEX =[NonClusteredIndex-20200508-225254])
WHERE Active = 1
AND Deleted = 0
AND 1 = 1 /*Prevent auto parameterisation*/
AND ( #Active = 1 AND #Deleted = 0 )
UNION ALL
SELECT NAME,
CreatedOn
FROM dbo.Students
WHERE Active = #Active
AND Deleted = #Deleted
AND NOT ( #Active = 1
AND #Deleted = 0 )
ORDER BY CreatedOn
OPTION (MERGE UNION)

t-sql 2012 update foreign key value in primary table

In a special request run, I need to update the Locker and Lock tables in a SQL Server 2012 database. I have the following two table definitions:
CREATE TABLE [dbo].[Locker](
[lockerID] [int] IDENTITY(1,1) NOT NULL,
[schoolID] [int] NOT NULL,
[number] [varchar](10) NOT NULL,
[lockID] [int] NULL,
CONSTRAINT [PK_Locker] PRIMARY KEY NONCLUSTERED
(
[lockerID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY =
OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 97)
ON [PRIMARY]
) ON [PRIMARY]
CREATE TABLE [dbo].[Lock](
[lockID] [int] IDENTITY(1,1) NOT NULL,
[schoolID] [int] NOT NULL,
[comboSeq] [tinyint] NOT NULL,
CONSTRAINT [PK_Lock] PRIMARY KEY NONCLUSTERED
(
[lockID] ASC
)
) ON [PRIMARY]
The Locker table is the main table and the Lock table is the secondary table. I need to add 500 new locker numbers, which the user has given to me, to the Locker table, where each row is uniquely identified by lockerID. I also need to add 500 corresponding rows to the Lock table, where each row is uniquely identified by lockID.
Since lockID is a key value in the Lock table and also appears in the Locker table, I would like to know how to insert the 500 new rows into the Lock table. I would then like to take the lockID values of those 500 new rows and place them, one each, into the 500 rows created in the Locker table.
I have sql that looks like the following so far:
declare @SchoolID int = 999
insert into test.dbo.Locker ( [schoolID], [number])
select distinct LKR.schoolID, A.lockerNumber
FROM [InputTable] A
JOIN test.dbo.School SCH
ON A.schoolnumber = SCH.type
and A.schoolnumber = @SchoolNumber
JOIN test.dbo.Locker LKR
ON SCH.schoolID = LKR.schoolID
AND A.lockerNumber not in (select number
from dbo.Locker
where schoolID = @SchoolID)
order by LKR.schoolID, A.lockerNumber
I am not certain how to complete the rest of the task: placing lockID values uniquely into the Lock and Locker tables. Can you either modify the SQL I listed above or come up with some new SQL that shows me how to accomplish my goal?
You should use the OUTPUT clause. First add the rows to the Lock table, then grab the lockID values and prepare the insert into the Locker table. This should meet your expectations:
DECLARE @tmp TABLE (lockid INT)
INSERT dbo.Lock
( schoolID, comboSeq )
OUTPUT Inserted.lockID INTO @tmp ( lockid )
(SELECT
999,
1
FROM master..spt_values sv WHERE sv.type = 'P' AND sv.number BETWEEN 1 AND 500);
INSERT INTO dbo.Locker( schoolID, number, lockID )
SELECT x.schoolID, x.lockerNumber, y.lockid
FROM
(
SELECT TOP 100 PERCENT DISTINCT LKR.schoolID, A.lockerNumber, ROW_NUMBER() OVER (ORDER BY A.LockerNumber) rn
FROM [InputTable] A
JOIN test.dbo.School SCH ON A.schoolnumber = SCH.type
and A.schoolnumber = @SchoolNumber
JOIN test.dbo.Locker LKR ON SCH.schoolID = LKR.schoolID
AND A.lockerNumber not in (select number from dbo.Locker
WHERE schoolID = @SchoolID)
ORDER by LKR.schoolID, A.lockerNumber ) x
JOIN (SELECT t.lockid, ROW_NUMBER() OVER (ORDER BY t.lockid) AS rn FROM @tmp t
) y
ON x.rn = y.rn
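A quick sanity check after both inserts (a sketch, assuming @SchoolID is still in scope): every new locker row should have picked up a distinct lockID:
SELECT COUNT(*) AS locker_rows, COUNT(DISTINCT lockID) AS distinct_lock_ids
FROM dbo.Locker
WHERE schoolID = @SchoolID AND lockID IS NOT NULL;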

How can I optimize this SQL outer join query?

SCENARIO
I need to select records from test_userData based on a 1-to-1 match from test_userCheck on the columns customer or account_info. The code below creates a mock-up of the tables and populates them with random data for the purpose of my question. Based on this code, it's looking for any records where test_userData.customer = 'Guerrero, Unity' or test_userData.account_info = 'XXXXXXXXXXXXXXXX0821', and should return three rows (confirmation_id = 6836985, 5502798, and 3046441).
PROBLEM
As it stands, the query returns what I need... however, my real userData table has almost 2 million records, and the userCheck table has about 10,000. The query takes about 7 seconds as it is and I think that's way too long. I'm also worried because the userData table will start to grow quickly (by tens of thousands of unique records a day), and I envision my current method becoming unmanageable.
QUESTION
Any ideas on how I can optimize this to scale with millions of records? The data resides on a shared SQL 2008 server with limited permissions.
--setup temporary testing tables
IF EXISTS
(
SELECT * FROM dbo.sysobjects
WHERE id = object_id(N'[dbo].[test_userData]')
AND OBJECTPROPERTY(id, N'IsUserTable') = 1
)
DROP TABLE [dbo].[test_userData]
GO
IF EXISTS
(
SELECT * FROM dbo.sysobjects
WHERE id = object_id(N'[dbo].[test_userCheck]')
AND OBJECTPROPERTY(id, N'IsUserTable') = 1
)
DROP TABLE [dbo].[test_userCheck]
GO
CREATE TABLE [dbo].[test_userData](
[id] [int] IDENTITY(1,1) NOT NULL,
[merchant_id] [int] NOT NULL,
[sales_date] [datetime] NOT NULL,
[confirmation_id] [int] NOT NULL,
[customer] [nvarchar](max) NOT NULL,
[total] [smallmoney] NOT NULL,
[account_info] [nvarchar](max) NOT NULL,
[email_address] [nvarchar](max) NOT NULL,
CONSTRAINT [PK_test_userData] PRIMARY KEY CLUSTERED
(
[id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE TABLE [dbo].[test_userCheck](
[confirmation_id] [int] NOT NULL,
[customer] [nvarchar](max) NOT NULL,
[total] [smallmoney] NOT NULL,
[account_info] [nvarchar](max) NOT NULL,
CONSTRAINT [PK_test_userCheck] PRIMARY KEY CLUSTERED
(
[confirmation_id] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
--insert some random user transactions
INSERT INTO [dbo].[test_userData] (merchant_id,sales_date,confirmation_id,customer,total,account_info,email_address) VALUES
('99','03/25/2010','3361424','Soto, Ahmed','936','XXXXXXXXXXXXXXXX8744','Donec.egestas#NullainterdumCurabitur.ca'),
('17','09/12/2010','6710165','Holcomb, Eden','1022','XXXXXXXXXXXXXXXX6367','Curabitur#dolortempus.org'),
('32','05/04/2010','4489509','Foster, Nasim','1463','XXXXXXXXXXXXXXXX7115','augue.eu.tellus#ullamcorperviverraMaecenas.ca'),
('95','01/02/2011','5384061','Browning, Owen','523','XXXXXXXXXXXXXXXX0576','sed.dictum.eleifend#accumsaninterdum.edu'),
('91','08/21/2010','6075234','Dawson, McKenzie','141','XXXXXXXXXXXXXXXX3580','dolor.sit.amet#etmagnis.org'),
('63','01/29/2010','1055619','Mathews, Keefe','1110','XXXXXXXXXXXXXXXX2682','ligula#Sednuncest.edu'),
('27','10/20/2010','1819662','Clarke, Briar','1474','XXXXXXXXXXXXXXXX7481','Donec.non.justo#malesuada.org'),
('82','03/05/2010','3184936','Holman, Dana','560','XXXXXXXXXXXXXXXX7080','Aenean.eget.magna#accumsan.edu'),
('24','06/11/2010','1007427','Kirk, Desiree','206','XXXXXXXXXXXXXXXX3681','parturient#at.com'),
('49','06/17/2010','6137066','Foley, Sopoline','1831','XXXXXXXXXXXXXXXX1718','ac.urna.Ut#pellentesqueafacilisis.org'),
('22','05/08/2010','3545367','Howell, Uriel','638','XXXXXXXXXXXXXXXX1945','ad.litora#arcuvelquam.ca'),
('5','10/25/2010','6836985','Little, Caryn','743','XXXXXXXXXXXXXXXX0821','Suspendisse.aliquet#auctor.org'),
('91','06/16/2010','6852582','Buckner, Chiquita','99','XXXXXXXXXXXXXXXX1533','tellus.sem#semvitaealiquam.edu'),
('63','06/12/2010','7930230','Nolan, Wyoming','1192','XXXXXXXXXXXXXXXX1291','Sed#diam.org'),
('32','02/01/2010','8407102','Cummings, Deacon','1315','XXXXXXXXXXXXXXXX4375','a.odio.semper#massaSuspendisseeleifend.ca'),
('75','06/29/2010','5502798','Guerrero, Unity','858','XXXXXXXXXXXXXXXX8000','eget#lectus.edu'),
('50','09/13/2010','8312525','Russo, Yvette','1680','XXXXXXXXXXXXXXXX2046','In.mi#eu.com'),
('11','04/13/2010','6204132','Small, Calista','426','XXXXXXXXXXXXXXXX0269','lacus#Cumsociisnatoque.org'),
('16','01/01/2011','7522507','Mosley, Thor','1459','XXXXXXXXXXXXXXXX8451','netus.et#Pellentesqueutipsum.com'),
('5','01/27/2010','1472120','Case, Kiona','1419','XXXXXXXXXXXXXXXX7097','Duis#duilectusrutrum.edu'),
('70','02/17/2010','1095935','Snyder, Tanner','1655','XXXXXXXXXXXXXXXX8556','metus.sit.amet#inconsequatenim.edu'),
('63','11/10/2010','3046441','Guerrero, Unity','629','XXXXXXXXXXXXXXXX0807','nonummy.ac.feugiat#Phasellusdapibus.org'),
('22','08/19/2010','5435100','Turner, Patrick','1133','XXXXXXXXXXXXXXXX6734','pede#Duis.edu'),
('96','10/05/2010','6381992','May, Dominic','1858','XXXXXXXXXXXXXXXX7227','hymenaeos#etcommodo.edu'),
('96','02/26/2010','8630748','Chandler, Olympia','1016','XXXXXXXXXXXXXXXX4001','sed.dui.Fusce#pellentesqueSed.com');
--insert a random fraud transaction to check against (based on customer and account_info only)
INSERT INTO [dbo].[test_userCheck] (confirmation_id, customer, total, account_info) VALUES
('2055015', 'Guerrero, Unity', '20.02', 'XXXXXXXXXXXXXXXX0821');
--get result, which is correct
SELECT a.confirmation_id, a.customer, a.total, a.account_info, a.email_address
FROM dbo.test_userData AS a RIGHT OUTER JOIN
dbo.test_userCheck AS b ON a.customer = b.customer OR a.account_info = b.account_info;
DROP TABLE [dbo].[test_userData];
DROP TABLE [dbo].[test_userCheck];
Create the appropriate index or indices. Just based on your question, I'd suggest two indices: one on test_userData.customer, and a second on test_userData.account_info.
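A sketch of those two indexes. Note this assumes the columns can be narrowed first: nvarchar(max) columns cannot be index key columns, and on SQL Server 2008 an index key is limited to 900 bytes, i.e. nvarchar(450):
-- assumption: 450 characters is enough for these columns
ALTER TABLE dbo.test_userData ALTER COLUMN customer nvarchar(450) NOT NULL;
ALTER TABLE dbo.test_userData ALTER COLUMN account_info nvarchar(450) NOT NULL;
CREATE INDEX IX_test_userData_customer ON dbo.test_userData (customer);
CREATE INDEX IX_test_userData_account_info ON dbo.test_userData (account_info);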
Creating indexes would probably help, but have you considered another design that complies with normal forms? It would be better to access the data through an index on an integer column instead of a string...
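A rough sketch of that direction, with hypothetical lookup tables, so the match happens on integer keys instead of strings:
-- hypothetical normalized design
CREATE TABLE dbo.Customer (
    customer_id int IDENTITY(1,1) PRIMARY KEY,
    customer nvarchar(450) NOT NULL UNIQUE
);
CREATE TABLE dbo.Account (
    account_id int IDENTITY(1,1) PRIMARY KEY,
    account_info nvarchar(450) NOT NULL UNIQUE
);
-- test_userData and test_userCheck would then store customer_id and account_id,
-- and the fraud check becomes two integer-key joins instead of string comparisons.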
