I was wondering if someone could shed some light on why SQL Server (2016 RTM in my case, but I suspect this is not version-specific) is performing this seemingly unnecessary INNER JOIN.
Consider the following two tables joined by a foreign key:
CREATE TABLE [dbo].[batches](
[Id] [smallint] IDENTITY(1,1) PRIMARY KEY,
[Date] [date] NOT NULL,
[Run] [tinyint] NOT NULL,
[Clean] [bit] NOT NULL)
CREATE TABLE [dbo].[batch_values](
[Batch_Id] [smallint] NOT NULL,
[Key] [int] NOT NULL,
[Value] [int] NOT NULL,
CONSTRAINT [PK_batch_values] PRIMARY KEY CLUSTERED
( [Batch_Id] ASC, [Key] ASC))
GO
ALTER TABLE [dbo].[batch_values] WITH CHECK
ADD CONSTRAINT [FK_batch_values_batches] FOREIGN KEY([Batch_Id])
REFERENCES [dbo].[batches] ([Id])
GO
ALTER TABLE [dbo].[batch_values] CHECK CONSTRAINT [FK_batch_values_batches]
GO
Populate the tables with some data:
SET NOCOUNT ON;
DECLARE
@BatchCount int,
@BatchId smallint,
@KeyCount int;
SET @BatchCount = 1;
WHILE @BatchCount <= 100
BEGIN
INSERT INTO dbo.[batches]
VALUES (DATEADD(dd, @BatchCount / 10, '2016-01-01'), @BatchCount % 10, @BatchCount % 2);
SET @BatchId = SCOPE_IDENTITY();
SET @KeyCount = 1;
WHILE @KeyCount <= 1000
BEGIN
INSERT INTO dbo.batch_values
VALUES (@BatchId, @KeyCount, RAND() * 1000000 - 500000);
SET @KeyCount = @KeyCount + 1;
END;
SET @BatchCount = @BatchCount + 1;
END;
Now, if I run the following query, the execution plan shows that SQL Server performs the INNER JOIN to the [batches] table, even though no columns are selected from it, and no records could be dropped from [batch_values] as a result of the join, due to the foreign key constraint.
screenshot of query and execution plan
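(The screenshot is not reproduced here; the query was of roughly this shape - a reconstruction from the description above, selecting only [batch_values] columns and filtering on the join column:)
-- Hypothetical reconstruction of the query in the screenshot:
-- nothing is selected from [batches], yet the plan still joins to it.
SELECT bv.[Key], bv.[Value]
FROM dbo.batch_values AS bv
INNER JOIN dbo.batches AS b
    ON b.[Id] = bv.[Batch_Id]
WHERE bv.[Batch_Id] = 100;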
It seems to me that the Query Optimizer should discard the INNER JOIN as unnecessary and simply do a primary key seek on [batch_values], but it doesn't.
This matters because if I develop views that join multiple tables to present a "bigger picture" of the underlying data for ease of use, I will take a performance hit every time I query those views.
There are many limitations on the optimizer's use of JOIN ELIMINATION.
For example, it may not apply if the foreign key spans multiple columns, or the constraint is not trusted, or it is marked as 'not for replication', etc.
SQL Server may also skip JOIN ELIMINATION when the WHERE clause contains a predicate on the foreign key column.
Remove the WHERE clause, or remove the "Batch_Id = 100" predicate from it, and you should see the optimizer apply JOIN ELIMINATION.
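For example (a sketch against the tables above), this variant selects nothing from [batches] and has no predicate on the foreign-key column, so the join should be eliminated from the plan:
-- No column from [batches] is referenced and there is no predicate on [Batch_Id],
-- so the optimizer can drop the join and read [batch_values] alone.
SELECT bv.[Key], bv.[Value]
FROM dbo.batch_values AS bv
INNER JOIN dbo.batches AS b
    ON b.[Id] = bv.[Batch_Id];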
The documentation on this topic is limited, so I can't provide a link as proof, but many people have reported this over the past 5-7 years for different versions, and the consensus is that the behaviour is by design. My recommendation is to open a support incident with Microsoft and ask them directly, if this is critical for your system.
Related
I have a table containing a very large number of rows of dynamic SQL which I wish to execute. They are basically existence checks and insert statements, and I want to migrate data from one production database to another - we are merging transactional data. I am trying to find the optimal way to execute these rows.
I've found that the COALESCE method of appending all the rows to one another is not efficient for this, particularly when the number of rows executed at a time is greater than ~100.
Assume the structure of the source table is something arbitrary like this:
CREATE TABLE [dbo].[MyTable]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[DataField1] [int] NOT NULL,
[FK_ID1] [int] NOT NULL,
[LotsMoreFields] [NVARCHAR] (MAX),
CONSTRAINT [PK_MyTable] PRIMARY KEY CLUSTERED ([ID] ASC)
)
CREATE TABLE [dbo].[FK1]
(
[ID] [int] IDENTITY(1,1) NOT NULL,
[Name] [nvarchar](255) NOT NULL, -- Unique constrained value
CONSTRAINT [PK_FK1] PRIMARY KEY CLUSTERED ([ID] ASC)
)
The other requirement is I am tracking the source table PK vs the target PK and whether an insert occurred or whether I have already migrated that row to the target. To do this, I'm tracking migrated rows in another table like so:
CREATE TABLE [dbo].[ChangeTracking]
(
[ReferenceID] BIGINT IDENTITY(1,1),
[Src_ID] BIGINT,
[Dest_ID] BIGINT,
[TableName] NVARCHAR(255),
CONSTRAINT [PK_ChangeTracking] PRIMARY KEY CLUSTERED ([ReferenceID] ASC)
)
My existing method is executing some dynamic sql generated by a stored procedure. The stored proc does PK lookups as the source system has different PK values for table [dbo].[FK1].
E.g.
IF NOT EXISTS (<ignore this existence check for now>)
BEGIN
INSERT INTO [Dest].[dbo].[MyTable] ([DataField1],[FK_ID1],[LotsMoreFields]) VALUES (333,(SELECT [ID] FROM [Dest].[dbo].[FK1] WHERE [Name]=N'ValueFoundInSource'),N'LotsMoreValues');
INSERT INTO [Dest].[dbo].[ChangeTracking] ([Src_ID],[Dest_ID],[TableName]) VALUES (666,SCOPE_IDENTITY(),N'MyTable'); --666 is the PK in [Src].[dbo].[MyTable] for this inserted row
END
So when you have a million of these, it isn't quick.
Is there a recommended performant way of doing this?
As mentioned, the MERGE statement works well when you're looking at a complex JOIN condition (if any of these fields are different, update the record to match). You can also look into creating a HASHBYTES hash of the entire record to quickly find differences between source and target tables, though that can also be time-consuming on very large data sets.
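If you go the HASHBYTES route, here is a sketch of the idea against the tables above (it joins source and target through the [ChangeTracking] mapping; the separator and the SHA2_256 choice are just assumptions, and note that before SQL Server 2016 HASHBYTES input is limited to 8,000 bytes):
-- Find already-migrated rows whose payload now differs between source and target.
-- FK_ID1 is deliberately excluded because its value legitimately differs per system.
SELECT ct.[Src_ID], ct.[Dest_ID]
FROM [Dest].[dbo].[ChangeTracking] AS ct
JOIN [Src].[dbo].[MyTable]  AS s ON s.[ID] = ct.[Src_ID]
JOIN [Dest].[dbo].[MyTable] AS d ON d.[ID] = ct.[Dest_ID]
WHERE ct.[TableName] = N'MyTable'
  AND HASHBYTES('SHA2_256', CONCAT(s.[DataField1], N'|', s.[LotsMoreFields]))
   <> HASHBYTES('SHA2_256', CONCAT(d.[DataField1], N'|', d.[LotsMoreFields]));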
It sounds like you're making these updates like a front-end developer, by checking each row for a match and then doing the insert. It will be far more efficient to do the inserts with a single query. Below is an example that looks for names that are in the tblNewClient table, but not in the tblClient table:
INSERT INTO tblClient
( [Name] ,
TypeID ,
ParentID
)
SELECT nc.[Name] ,
nc.TypeID ,
nc.ParentID
FROM tblNewClient nc
LEFT JOIN tblClient cl
ON nc.[Name] = cl.[Name]
WHERE cl.ID IS NULL;
This will be far more efficient than doing it RBAR (row by agonizing row).
Taking the two answers from @RusselFox and putting them together, I reached this tentative solution (which looks a LOT more efficient):
MERGE INTO [Dest].[dbo].[MyTable] [MT_D]
USING (SELECT [MT_S].[ID] as [SrcID],[MT_S].[DataField1],[FK_1_D].[ID] as [FK_ID1],[MT_S].[LotsMoreFields]
FROM [Src].[dbo].[MyTable] [MT_S]
JOIN [Src].[dbo].[FK_1] ON [MT_S].[FK_ID1] = [FK_1].[ID]
JOIN [Dest].[dbo].[FK_1] [FK_1_D] ON [FK_1].[Name] = [FK_1_D].[Name]
) [SRC] ON 1 = 0
WHEN NOT MATCHED THEN
INSERT([DataField1],[FK_ID1],[LotsMoreFields])
VALUES ([SRC].[DataField1],[SRC].[FK_ID1],[SRC].[LotsMoreFields])
OUTPUT [SRC].[SrcID],INSERTED.[ID],0,N'MyTable' INTO [Dest].[dbo].[ChangeTracking]([Src_ID],[Dest_ID],[AlreadyExists],[TableName]);
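(Note: the OUTPUT INTO column list assumes [ChangeTracking] has also gained an [AlreadyExists] column, which is not part of the DDL shown above.)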
I have a table with this structure:
CREATE TABLE [dbo].[cl](
[ID] [int] IDENTITY(1,1) NOT NULL,
[NIF] [numeric](9, 0) NOT NULL,
[Name] [varchar](80) NOT NULL,
[Address] [varchar](100) NULL,
[City] [varchar](40) NULL,
[State] [varchar](30) NULL,
[Country] [varchar](25) NULL,
Primary Key([ID],[NIF])
);
Imagine that this table has 3 records. Record 1, 2, 3...
Whenever I delete record number 2, the IDENTITY field generates a gap. The table then has record 1 and record 3. It's not correct!
Even if I use:
DBCC CHECKIDENT('cl', RESEED, 0)
It does not solve my problem, because it will set the ID of the next inserted record to 1. And that's not correct either, because the table would then contain a duplicate ID.
Does anyone have a clue about this?
No database is going to reseed or recalculate an auto-incremented field/identity to fill in the gaps between existing ids, as in your example. This is impractical on many levels; some examples:
Integrity - since a re-used id could mean records in other systems are referring to an old value when the new value is saved
Performance - trying to find the lowest gap for each value inserted
In MySQL, this is not really happening either (at least in InnoDB or MyISAM - are you using something different?). In InnoDB, the behavior is identical to SQL Server: the counter is managed outside of the table, so deleted values or rolled-back transactions leave gaps between the last value and the next insert. In MyISAM, the value is calculated at the time of insertion instead of being managed through an external counter. This calculation is what gives the impression of recalculation - it's just never calculated until actually needed (MAX(Id) + 1). Even this won't insert inside gaps (like the id = 2 in your example).
Many people will argue that if you need to re-use these gaps, there is something in your data model that could be improved. You shouldn't ever need to worry about these gaps.
If you insist on using those gaps, your fastest method would be to log deletes in a separate table, then use an INSTEAD OF INSERT trigger to perform the inserts with your intended keys: first look for a record in the deletions table to re-use (deleting it so it can't be handed out twice), and fall back to MAX(Id) + 1 for any additional rows to insert, as sketched below.
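A rough sketch of that approach against the [cl] table from the question (the log table and trigger names are made up, and [ID] is assumed to be a plain int rather than an IDENTITY column here, since the trigger assigns the key itself):
-- sketch only: log table and trigger names are hypothetical
create table dbo.cl_deleted_ids (ID int primary key)
go
-- log every deleted ID so it can be handed out again later
create trigger tr_cl_log_deleted_ids on dbo.cl
after delete
as
begin
    set nocount on
    insert into dbo.cl_deleted_ids (ID)
    select ID from deleted
end
go
-- hand out the lowest logged ID if one exists, otherwise MAX(ID) + 1
-- (single-row inserts only, to keep the sketch short; concurrency handling omitted)
create trigger tr_cl_reuse_ids on dbo.cl
instead of insert
as
begin
    set nocount on
    declare @gap table (ID int)
    declare @id int;
    -- grab and remove the lowest gap ID, if any is available
    with lowest as (
        select top (1) ID
        from dbo.cl_deleted_ids
        order by ID
    )
    delete from lowest
    output deleted.ID into @gap
    select @id = ID from @gap
    if @id is null
        select @id = isnull(max(ID), 0) + 1 from dbo.cl
    insert into dbo.cl (ID, NIF, Name, Address, City, State, Country)
    select @id, NIF, Name, Address, City, State, Country
    from inserted
end
go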
I guess what you want is something like this:
create table dbo.cl
(
SurrogateKey int identity(1, 1)
primary key
not null,
ID int not null,
NIF numeric(9, 0) not null,
Name varchar(80) not null,
Address varchar(100) null,
City varchar(40) null,
State varchar(30) null,
Country varchar(25) null,
unique (ID, NIF)
)
go
I added a surrogate key so you'll have the best of both worlds. Now you just need a trigger on the table to "adjust" the ID whenever some prior ID gets deleted:
create trigger tr_on_cl_for_auto_increment on dbo.cl
after delete, update
as
begin
update c
set ID = d.New_ID
from dbo.cl as c
inner join (
select c2.SurrogateKey,
row_number() over (order by c2.SurrogateKey asc) as New_ID
from dbo.cl as c2
) as d
on c.SurrogateKey = d.SurrogateKey
end
go
Of course this solution also implies that you'll have to ensure (whenever you insert a new record) that you check for yourself which ID to insert next.
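For example, a minimal sketch of such an insert (the NIF and Name values are just placeholders):
-- caller assigns the next ID itself; SurrogateKey is still filled in by IDENTITY
insert into dbo.cl (ID, NIF, Name)
select isnull(max(ID), 0) + 1, 123456789, 'New client'
from dbo.cl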
Goal
I aim to create an SSIS (ETL) template that enables audit functionality (an audit dimension). I've discovered a few ways to implement an audit dimension, which are described below with some reference links:
SEQUENCE
Primary Key
Best way to get identity of inserted row?
Environment:
There are millions of rows in the fact tables and packages run a few times a day.
Incremental ETL gets thousands of rows.
SQL Server 2012 BI edition is used for the BI solution.
Simplified Schema of DimAudit table:
CREATE TABLE [dw].[DimAudit] (
[AuditKey] [int] IDENTITY(1,1) NOT NULL,
[ParentAuditKey] [int] NOT NULL,
[TableName] [varchar](50) NOT NULL DEFAULT ('Unknown'),
[PackageName] [varchar](50) NOT NULL DEFAULT ('Unknown'),
[ExecStartDate] [datetime] NOT NULL DEFAULT (getdate()),
[ExecStopDate] [datetime] NULL,
[SuccessfulProcessingInd] [char](1) NOT NULL DEFAULT ('N'),
CONSTRAINT [PK_dbo.DimAudit] PRIMARY KEY CLUSTERED
(
[AuditKey] ASC
)
) ON [PRIMARY]
ALTER TABLE [dw].[DimAudit] WITH CHECK ADD CONSTRAINT [FK_DimAudit_ParentAuditKey] FOREIGN KEY([ParentAuditKey])
REFERENCES [dw].[DimAudit] ([AuditKey])
GO
ALTER TABLE [dw].[DimAudit] CHECK CONSTRAINT [FK_DimAudit_ParentAuditKey]
GO
Primary Key Option:
The primary key is generated in the audit table and the AuditKey is then queried back.
Task: Master SQL Audit Generates Key (SQL Task)
INSERT INTO [dw].[DimAudit]
(ParentAuditKey
,[TableName]
,[PackageName]
,[ExecStartDate]
,[ExecStopDate]
,[SuccessfulProcessingInd])
VALUES
(1
,'Master Extract Package'
,?
,?
,?
,'N')
SELECT AuditKey
FROM [dw].[DimAudit]
WHERE TableName = 'Master Extract Package' and ExecStartDate = ?
/*
Last parameter: System::StartTime
Result Set populates User::ParentAuditKey
*/
Task: Master SQL Audit End (SQL Task)
UPDATE [dw].[DimAudit]
SET ParentAuditKey = AuditKey
,ExecStopDate = SYSDATETIME()
,SuccessfulProcessingInd = 'Y'
WHERE AuditKey = ?
/*
Parameter: User::ParentAuditKey
*/
SEQUENCE Option:
The sequence option does not query the primary key (AuditKey) back, but instead uses the logic below to create the next available AuditKey.
CREATE SEQUENCE dbo.AuditID AS INT
START WITH 1
INCREMENT BY 1;
GO
DECLARE @AuditID INTEGER;
SET @AuditID = NEXT VALUE FOR dbo.AuditID;
Best way to get identity of inserted row?
It feels risky using identity options as ETL packages could be executed in parallel.
Question
What is the recommended practice for audit dimension table and managing keys?
Both the sequence and primary key options do the job; however, I have concerns about the primary-key option because two packages could (in theory) start in the same millisecond, and the lookup would then return more than one key. So, SEQUENCE sounds like the best option.
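For reference, this is roughly how the master audit task would look with the sequence instead of the IDENTITY (a sketch; it assumes [AuditKey] is an ordinary int column in that design, so the key is known before the INSERT and is safe under parallel executions):
DECLARE @AuditKey int = NEXT VALUE FOR dbo.AuditID;

INSERT INTO [dw].[DimAudit]
    ([AuditKey], [ParentAuditKey], [TableName], [PackageName],
     [ExecStartDate], [ExecStopDate], [SuccessfulProcessingInd])
VALUES
    (@AuditKey, @AuditKey, 'Master Extract Package', ?, ?, ?, 'N');

SELECT @AuditKey AS AuditKey;
/*
Result Set populates User::ParentAuditKey
*/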
Is there anything better I could do to create an audit dimension for a data mart?
You could use the OUTPUT syntax:
INSERT INTO [dw].[DimAudit]
(ParentAuditKey
,[TableName]
,[PackageName]
,[ExecStartDate]
,[ExecStopDate]
,[SuccessfulProcessingInd])
OUTPUT inserted.AuditKey
VALUES
(1
,'Master Extract Package'
,?
,?
,?
,'N')
or SCOPE_IDENTITY(), which is what I'm personally using; it returns the last identity value generated in the current session and scope, so parallel package executions don't interfere with one another:
INSERT INTO Meta.AuditDim (
Date,
UserName,
Source,
SourceType,
AuditType,
ExecutionId,
ExecutionHost,
ParentAuditKey,
FileID
)
VALUES (
GETDATE(),
CURRENT_USER,
@Source,
@SourceType,
@AuditType,
@ExecutionId,
@ExecutionHost,
@ParentAuditKey,
@FileID
);
SELECT AuditKey FROM Meta.AuditDim WHERE AuditKey = SCOPE_IDENTITY();
While trying to dissect a SQL Server stored proc that's been running slow, we found that simply using a temp table instead of a real table had a drastic impact on performance. The table we're swapping out (ds_location) only has 173 rows:
This query will complete in 1 second:
IF OBJECT_ID('tempdb..#Location') IS NOT NULL DROP TABLE #Location
SELECT * INTO #Location FROM ds_location
SELECT COUNT(*)
FROM wip_cubs_hc m
INNER JOIN ds_scenario sc ON sc.Scenario = m.Scenario
INNER JOIN ds_period pe ON pe.Period = m.ReportingPeriod
INNER JOIN #Location l ON l.Location = m.Sh_Location
Compare that to the original, which takes 7 seconds:
SELECT COUNT(*)
FROM wip_cubs_hc m
INNER JOIN ds_scenario sc ON sc.Scenario = m.Scenario
INNER JOIN ds_period pe ON pe.Period = m.ReportingPeriod
INNER JOIN ds_location l ON l.Location = m.Sh_Location
Here's the definition of wip_cubs_hc. It contains 1.7 million rows:
CREATE TABLE wip_cubs_hc(
Scenario varchar(16) NOT NULL,
ReportingPeriod varchar(50) NOT NULL,
Sh_Location varchar(50) NOT NULL,
Department varchar(50) NOT NULL,
ProductName varchar(75) NOT NULL,
Account varchar(50) NOT NULL,
Balance varchar(50) NOT NULL,
Source varchar(50) NOT NULL,
Data numeric(18, 6) NOT NULL,
CONSTRAINT PK_wip_cubs_hc PRIMARY KEY CLUSTERED
(
Scenario ASC,
ReportingPeriod ASC,
Sh_Location ASC,
Department ASC,
ProductName ASC,
Account ASC,
Balance ASC,
Source ASC
)
)
CREATE NONCLUSTERED INDEX IX_wip_cubs_hc_Balance
ON [dbo].[wip_cubs_hc] ([Scenario],[Sh_Location],[Department],[Balance])
INCLUDE ([ReportingPeriod],[ProductName],[Account],[Source])
I'd love to know HOW to determine what's causing the slowdown, too.
I can answer the "How to determine the slowdown" question...
Take a look at the execution plan of both queries. You do this by going to the "Query" menu > "Display Estimated Execution Plan". The default keyboard shortcut is Ctrl+L. You can see the plan for multiple queries at once as well. Look at the type of operation being done. What you want to see are things like Index Seek instead of Index Scan, etc.
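It can also help to run each variant with I/O and timing statistics turned on and compare the logical reads per table and the CPU time in the Messages tab (a quick sketch using the slow query from the question):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the slow variant, then the temp-table variant, and compare the output
SELECT COUNT(*)
FROM wip_cubs_hc m
INNER JOIN ds_scenario sc ON sc.Scenario = m.Scenario
INNER JOIN ds_period pe ON pe.Period = m.ReportingPeriod
INNER JOIN ds_location l ON l.Location = m.Sh_Location;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;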
This article explains some of the other things to look for.
Without knowing the schema/indexes of all the tables involved, this is where I would suggest starting.
Best of Luck!
In SQL Server 2005, I'm using table-valued functions as a convenient way to perform arbitrary aggregation on subsets of data from a large table (passing a date range or similar parameters).
I'm using these inside larger queries as joined computations, and I'm wondering whether the query plan optimizer works well with them in every condition, or whether I'm better off unnesting such computations in my larger queries:
1) Does the query plan optimizer unnest table-valued functions when it makes sense?
2) If it doesn't, what do you recommend to avoid the code duplication that would occur by manually unnesting them?
3) If it does, how do you identify that from the execution plan?
code sample:
create table dbo.customers (
[key] uniqueidentifier
, constraint pk_dbo_customers
primary key ([key])
)
go
/* assume large amount of data */
create table dbo.point_of_sales (
[key] uniqueidentifier
, customer_key uniqueidentifier
, constraint pk_dbo_point_of_sales
primary key ([key])
)
go
create table dbo.product_ranges (
[key] uniqueidentifier
, constraint pk_dbo_product_ranges
primary key ([key])
)
go
create table dbo.products (
[key] uniqueidentifier
, product_range_key uniqueidentifier
, release_date datetime
, constraint pk_dbo_products
primary key ([key])
, constraint fk_dbo_products_product_range_key
foreign key (product_range_key)
references dbo.product_ranges ([key])
)
go
/* assume large amount of data */
create table dbo.sales_history (
[key] uniqueidentifier
, product_key uniqueidentifier
, point_of_sale_key uniqueidentifier
, accounting_date datetime
, amount money
, quantity int
, constraint pk_dbo_sales_history
primary key ([key])
, constraint fk_dbo_sales_history_product_key
foreign key (product_key)
references dbo.products ([key])
, constraint fk_dbo_sales_history_point_of_sale_key
foreign key (point_of_sale_key)
references dbo.point_of_sales ([key])
)
go
create function dbo.f_sales_history_..snip.._date_range
(
@accountingdatelowerbound datetime,
@accountingdateupperbound datetime
)
returns table as
return (
select
pos.customer_key
, sh.product_key
, sum(sh.amount) amount
, sum(sh.quantity) quantity
from
dbo.point_of_sales pos
inner join dbo.sales_history sh
on sh.point_of_sale_key = pos.[key]
where
sh.accounting_date between
@accountingdatelowerbound and
@accountingdateupperbound
group by
pos.customer_key
, sh.product_key
)
go
-- TODO: insert some data
-- this is a table containing a selection of product ranges
declare @selectedproductranges table([key] uniqueidentifier)
-- this is a table containing a selection of customers
declare @selectedcustomers table([key] uniqueidentifier)
declare @low datetime
, @up datetime
-- TODO: set top query parameters
select
saleshistory.customer_key
, saleshistory.product_key
, saleshistory.amount
, saleshistory.quantity
from
dbo.products p
inner join @selectedproductranges productrangeselection
on p.product_range_key = productrangeselection.[key]
inner join @selectedcustomers customerselection on 1 = 1
inner join
dbo.f_sales_history_..snip.._date_range(@low, @up) saleshistory
on saleshistory.product_key = p.[key]
and saleshistory.customer_key = customerselection.[key]
I hope the sample makes sense.
Much thanks for your help!
In this case, it's an "inline table-valued function".
The optimiser simply expands (unnests) it into the outer query when that is useful, just as it does with a view.
If the function is treated as a "black box" by the outer query, the quickest way to tell is to compare the IO shown in SSMS against the IO captured by Profiler.
Profiler captures the "black box" IO that SSMS does not.
Blog post by Adam Machanic (his book is in my drawer at work)
1) Yes, with the syntax you're using, it does. If you happened to use a UDF that returns a table built with conditional logic (i.e. a multi-statement function), it would not, though.
3) The optimizer won't point out which part of the plan came from your function, because it may see fit to combine chunks of the plan with your function's body, or to optimize bits of it away.
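For contrast, here is a sketch of a multi-statement version of the same function (hypothetical name); this form is not expanded, so the outer query treats it as an opaque rowset with a fixed cardinality guess:
-- multi-statement form: BEGIN ... END with an explicit table variable;
-- unlike the inline form above, this is NOT inlined into the calling query
create function dbo.f_sales_history_multistatement
(
    @accountingdatelowerbound datetime,
    @accountingdateupperbound datetime
)
returns @result table
(
    customer_key uniqueidentifier
    , product_key uniqueidentifier
    , amount money
    , quantity int
)
as
begin
    insert into @result (customer_key, product_key, amount, quantity)
    select
        pos.customer_key
        , sh.product_key
        , sum(sh.amount)
        , sum(sh.quantity)
    from
        dbo.point_of_sales pos
        inner join dbo.sales_history sh
            on sh.point_of_sale_key = pos.[key]
    where
        sh.accounting_date between
            @accountingdatelowerbound and
            @accountingdateupperbound
    group by
        pos.customer_key
        , sh.product_key

    return
end
go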