How to optimize T-SQL UI queries - sql-server

I have a UI form that shows the user different aggregate information (fact, plan, etc.; 6 different T-SQL queries run in parallel). Executing the pure SQL queries takes up to 3 seconds.
I use stored procedures with parameters, but that is not the issue: calling the SPs takes exactly the same time.
Below I show one table and one query as an example; the other 5 queries and tables have the same structure. I use MS SQL Server 2012, and upgrading to 2014 is possible if it helps optimization.
I am now trying to find all possible ways to improve this, and they should be SQL-only ways.
Aggregate table structure:
create table dbo.plan_Total(
    VersionId int not null,
    WarehouseId int not null,
    ChannelUnitId int not null,
    ProductId int not null,
    [Month] date not null,
    Volume float not null,
    constraint PK_Total primary key clustered
    (VersionId asc, WarehouseId asc, ChannelUnitId asc, ProductId asc, [Month] asc)) on [PRIMARY]
SP query structure:
ALTER PROCEDURE dbo.plan_GetTotals
    @versionId INT,
    @geoIds ID_LIST READONLY, -- lists from UI filters
    @productIds ID_LIST READONLY,
    @channelUnitIds ID_LIST READONLY
AS
begin
    SELECT Id INTO #geos FROM @geoIds
    SELECT Id INTO #products FROM @productIds
    SELECT Id INTO #channels FROM @channelUnitIds

    CREATE CLUSTERED INDEX IDX_Geos ON #geos(Id)
    CREATE CLUSTERED INDEX IDX_Products ON #products(Id)
    CREATE CLUSTERED INDEX IDX_ChannelUnits ON #channels(Id)

    SELECT [Month], SUM(Volume) AS Volume
    FROM plan_Total t
    JOIN #geos g ON t.WarehouseId = g.Id
    JOIN #products p ON t.ProductId = p.Id
    JOIN #channels cu ON t.ChannelUnitId = cu.Id
    WHERE VersionId = @versionId
    GROUP BY [Month]
    ORDER BY [Month] -- no performance impact
END
Approximate execution time is 600-800 ms, and the other queries take almost the same time.
How can I dramatically decrease execution time? Is it possible?
What I've done already:
- Tried columnstore indexes (a clustered columnstore is not an option because of the foreign keys);
- Disabling and rebuilding a non-clustered columnstore index around updates is not a solution either, because some tables need to be updated online (users can change the data);
- Rebuilt all current indexes;
- Can't combine all the tables into one.
Here is the actual plan link:
Actual execution plan (for this plan I joined the real tables instead of the temp tables).
BR, thanks for any help!

Have you considered not joining channel, product, etc. at all?
At least for channels: if you do not have 10,000 of them, you can just load them "on demand" or "on application start" and cache them. This is a client-side dictionary lookup.
Also, for Month, SUM(Volume): consider precalculating this, i.e. making a materialized view. Calculating it on demand is not what reporting should do and goes against data warehousing best practice.
None of your proposed solutions will change that; they do not address the real problem: too much processing in the query.
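A minimal sketch of what that precalculation could look like as an indexed ("materialized") view, using the table from the question. The VersionId/[Month] grain and all object names are assumptions, and this only covers the case with no warehouse/product/channel filters:
-- Sketch only: pre-aggregate Volume per version and month as an indexed view.
-- Indexed views require SCHEMABINDING, two-part names and COUNT_BIG(*) when
-- grouping; Volume is float, so it may appear in the view but not as an index key.
CREATE VIEW dbo.plan_TotalByMonth
WITH SCHEMABINDING
AS
SELECT VersionId,
       [Month],
       SUM(Volume)  AS Volume,
       COUNT_BIG(*) AS RowCnt
FROM dbo.plan_Total
GROUP BY VersionId, [Month];
GO
-- Materializes the view; SQL Server maintains it automatically on every write.
CREATE UNIQUE CLUSTERED INDEX IX_plan_TotalByMonth
    ON dbo.plan_TotalByMonth (VersionId, [Month]);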

See if this way works better:
- Create the TABLE type with a PRIMARY KEY.
- Specify OPTION (RECOMPILE): forces the compiler to account for the actual cardinality of the table-valued parameters.
- Specify OPTIMIZE FOR UNKNOWN: prevents parameter sniffing for @versionId.
CREATE TYPE dbo.ID_LIST AS TABLE (
    Id INT PRIMARY KEY
);
GO
CREATE PROCEDURE dbo.plan_GetTotals
    @versionId INT,
    @geoIds ID_LIST READONLY,
    @productIds ID_LIST READONLY,
    @channelUnitIds ID_LIST READONLY
AS
SELECT
    [Month],
    SUM(Volume) AS Volume
FROM
    plan_Total AS t
    INNER JOIN @geoIds AS g ON g.Id = t.WarehouseId
    INNER JOIN @productIds AS p ON p.Id = t.ProductId
    INNER JOIN @channelUnitIds AS c ON c.Id = t.ChannelUnitId
WHERE
    t.VersionId = @versionId
GROUP BY
    [Month]
ORDER BY
    [Month]
OPTION (RECOMPILE, OPTIMIZE FOR UNKNOWN);
GO
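For completeness, a hedged example of calling the rewritten procedure; the filter values (42, 1, 2, 10, 100) are placeholders for whatever the UI sends:
-- Fill the table-valued parameters from the UI filters and pass them to the SP.
DECLARE @geoIds dbo.ID_LIST,
        @productIds dbo.ID_LIST,
        @channelUnitIds dbo.ID_LIST;

INSERT INTO @geoIds (Id)         VALUES (1), (2);
INSERT INTO @productIds (Id)     VALUES (10);
INSERT INTO @channelUnitIds (Id) VALUES (100);

EXEC dbo.plan_GetTotals
     @versionId      = 42,
     @geoIds         = @geoIds,
     @productIds     = @productIds,
     @channelUnitIds = @channelUnitIds;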

OK, here I will just show what I found and how I increased the speed of my query.
List of changes:
- The best change is adding a clustered columnstore index. For that you need to drop the foreign keys, but you can enforce them with triggers, for example. This speeds the query up 3-4 times (a sketch follows after the final query below).
- As you can see, I use temp tables in the query joins. I changed one join (it doesn't matter which) to an IN predicate like this: "and t.productid in (select id from #productids)". That doubled the pure query speed.
These two changes had the most impact on the query. Here is the final query:
select [month], sum(volume) as volume
from #geos g
left join dbo.plan_Total t on t.warehouseid = g.id
join #channels cu on t.channelunitid = cu.id
where t.versionid = @versionid
and t.productid in (select id from #productids)
group by [month]
order by [month]
With these changes I decreased the query execution time from 0.8 to 0.2 seconds.
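A sketch of the columnstore change described above. Only PK_Total is a real name from the question; the foreign key and index names are assumptions, and a clustered columnstore index requires SQL Server 2014 or later:
-- Drop the existing clustered PK (and any foreign keys on the table) first:
-- a table can have only one clustered structure, and 2014 does not allow
-- foreign keys together with a clustered columnstore index.
ALTER TABLE dbo.plan_Total DROP CONSTRAINT PK_Total;
-- ALTER TABLE dbo.plan_Total DROP CONSTRAINT FK_...;  -- repeat per foreign key

CREATE CLUSTERED COLUMNSTORE INDEX CCI_plan_Total ON dbo.plan_Total;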

Related

SQL Server: Performance of INNER JOIN on small table vs subquery in IN clause

Let's say I have the following two tables:
CREATE TABLE [dbo].[ActionTable]
(
[ActionID] [int] IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL
,[ActionName] [varchar](80) NOT NULL
,[Description] [varchar](120) NOT NULL
,CONSTRAINT [PK_ActionTable] PRIMARY KEY CLUSTERED ([ActionID] ASC)
,CONSTRAINT [IX_ActionName] UNIQUE NONCLUSTERED ([ActionName] ASC)
)
GO
CREATE TABLE [dbo].[BigTimeSeriesTable]
(
[ID] [bigint] IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL
,[TimeStamp] [datetime] NOT NULL
,[ActionID] [int] NOT NULL
,[Details] [varchar](max) NULL
,CONSTRAINT [PK_BigTimeSeriesTable] PRIMARY KEY NONCLUSTERED ([ID] ASC)
)
GO
ALTER TABLE [dbo].[BigTimeSeriesTable]
WITH CHECK ADD CONSTRAINT [FK_BigTimeSeriesTable_ActionTable] FOREIGN KEY ([ActionID]) REFERENCES [dbo].[ActionTable]([ActionID])
GO
CREATE CLUSTERED INDEX [IX_BigTimeSeriesTable] ON [dbo].[BigTimeSeriesTable] ([TimeStamp] ASC)
GO
CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID] ON [dbo].[BigTimeSeriesTable] ([ActionID] ASC)
GO
ActionTable has 1000 rows and BigTimeSeriesTable has millions of rows.
Now consider the following two queries:
Query A
SELECT *
FROM BigTimeSeriesTable
WHERE TimeStamp > DATEADD(DAY, -3, GETDATE())
AND ActionID IN (
SELECT ActionID
FROM ActionTable
WHERE ActionName LIKE '%action%'
)
Execution plan for query A
Query B
SELECT bts.*
FROM BigTimeSeriesTable bts
INNER JOIN ActionTable act ON act.ActionID = bts.ActionID
WHERE bts.TimeStamp > DATEADD(DAY, -3, GETDATE())
AND act.ActionName LIKE '%action%'
Execution plan for query B
Question: Why does query A have better performance than query B (sometimes 10 times better)? Shouldn't the query optimizer recognize that the two queries are exactly the same? Is there any way to provide hints that would improve the performance of the INNER JOIN?
Update: I changed the join to INNER MERGE JOIN and the performance greatly improved. See execution plan here. Interestingly when I try the merge join in the actual query I'm trying to run (which I cannot show here, confidential) it totally messes up the query optimizer and the query is super slow, not just relatively slow.
The execution plans you have supplied both have exactly the same basic strategy.
Join
There is a seek on ActionTable to find rows where ActionName starts with "generate" with a residual predicate on the ActionName LIKE '%action%'. The 7 matching rows are then used to build a hash table.
On the probe side there is a seek on TimeStamp > Scalar Operator(dateadd(day,(-3),getdate())) and matching rows are tested against the hash table to see if the rows should join.
There are two main differences which explain why the IN version executes quicker
IN
The IN version is executing in parallel. There are 4 concurrent threads working on the query execution - not just one.
Related to the parallelism this plan has a bitmap filter. It is able to use this bitmap to eliminate rows early. In the inner join plan 25,959,124 rows are passed to the probe side of the hash join, in the semi join plan the seek still reads 25.9 million rows but only 313 rows are passed out to be evaluated by the join. The remainder are eliminated early by applying the bitmap inside the seek.
It is not readily apparent why the INNER JOIN version does not execute in parallel. You could try adding the hint OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) to see if you now get a plan which executes in parallel and contains the bitmap filter.
If you are able to change indexes then, given that the query only returns 309 rows for 7 distinct actions, you may well find that replacing IX_BigTimeSeriesTable_ActionID with a covering index with leading columns [ActionID], [TimeStamp] and then getting a nested loops plan with 7 seeks performs much better than your current queries.
CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID_TimeStamp]
ON [dbo].[BigTimeSeriesTable] ([ActionID], [TimeStamp])
INCLUDE ([Details], [ID])
Hopefully with that index in place your existing queries will just use it and you will see 7 seeks, each returning an average of 44 rows, to read and return only the exact 309 total required. If not you can try the below
SELECT CA.*
FROM ActionTable A
CROSS APPLY
(
SELECT *
FROM BigTimeSeriesTable B
WHERE B.ActionID = A.ActionID AND B.TimeStamp > DATEADD(DAY, -3, GETDATE())
) CA
WHERE A.ActionName LIKE '%action%'
I had some success using an index hint: WITH (INDEX(IX_BigTimeSeriesTable_ActionID))
However, as the query changes, even slightly, this can totally hamstring the optimizer's ability to find the best plan.
Therefore if you want to "materialize" a subquery in order to force it to execute earlier, your best bet as of February 2020 is to use a temp table.
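A minimal sketch of that temp-table approach, using the tables from this question (the temp table and index names are assumptions):
-- Materialize the small filtered set first...
SELECT ActionID
INTO #MatchingActions
FROM ActionTable
WHERE ActionName LIKE '%action%';

CREATE UNIQUE CLUSTERED INDEX IX_MatchingActions ON #MatchingActions (ActionID);

-- ...then join the big table to the already-materialized rows.
SELECT bts.*
FROM BigTimeSeriesTable bts
INNER JOIN #MatchingActions ma ON ma.ActionID = bts.ActionID
WHERE bts.TimeStamp > DATEADD(DAY, -3, GETDATE());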
For an inner join there's no difference between filtering and joining (see: Difference between filtering queries in JOIN and WHERE?).
But here your queries create different cases:
Query A: you are just filtering against 1,000 records.
Query B: you first join against millions of rows and then filter against 1,000 records.
So query A takes less time than query B.

Too many parameter values slowing down query

I have a query that runs fairly fast under normal circumstances. But it is running very slow (at least 20 minutes in SSMS) due to how many values are in the filter.
Here's the generic version of it, and you can see that one part is filtering by over 8,000 values, making it run slow.
SELECT DISTINCT
column
FROM
table_a a
JOIN
table_b b ON (a.KEY = b.KEY)
WHERE
a.date BETWEEN @Start AND @End
AND b.ID IN (... over 8,000 values)
AND b.place IN ( ... 20 values)
ORDER BY
a.column ASC
It's to the point where it's too slow to use in the production application.
Does anyone know how to fix this, or optimize the query?
To make a query fast, you need indexes.
You need an index on each of the following columns: a.KEY, b.KEY, a.date, b.ID, b.place.
As gotqn wrote, if you put your 8,000 items into a temp table and inner join it, that will make the query even faster, but without an index on the other side of the join it will still be slow.
What you need is to put the filtering values into a temporary table, then apply the filter with an INNER JOIN against that table instead of WHERE IN. For example:
IF OBJECT_ID('tempdb..#FilterDataSource') IS NOT NULL
BEGIN;
DROP TABLE #FilterDataSource;
END;
CREATE TABLE #FilterDataSource
(
[ID] INT PRIMARY KEY
);
INSERT INTO #FilterDataSource ([ID])
SELECT ... ;  -- you need to split the 8,000 values here (see the sketch below)

SELECT DISTINCT column
FROM table_a a
INNER JOIN table_b b
    ON (a.KEY = b.KEY)
INNER JOIN #FilterDataSource FS
    ON b.id = FS.ID
WHERE a.date BETWEEN @Start AND @End
    AND b.place IN ( ... 20 values)
ORDER BY a.column ASC;
A few important notes:
- we use a temporary table (not a table variable) so that a parallel execution plan can be used;
- if you have a fast splitter (for example a CLR function), you can join the function itself directly; a sketch using STRING_SPLIT follows below;
- it is not good to use IN with this many values: SQL Server is not always able to build a good execution plan for it, which can lead to timeouts/internal errors - you can find more information here.
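A minimal sketch of the splitting step, assuming the 8,000 values arrive as one comma-separated string. STRING_SPLIT requires SQL Server 2016+; on older versions a CLR or XML splitter plays the same role:
-- Populate the filter table from a delimited string.
DECLARE @idList VARCHAR(MAX) = '101,102,103';   -- placeholder for the 8,000 values

INSERT INTO #FilterDataSource ([ID])
SELECT DISTINCT CONVERT(INT, value)
FROM STRING_SPLIT(@idList, ',');                -- SQL Server 2016+ only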

Slow Running Query - Will Indexes Help? Not sure what to do with Execution Plan

I have this slow running query below that returns 3,023 rows in SQL Server 2014 in a full minute and a half. Is there anything I can do to speed it up?
I have indexes on all the fields it's joining on. ArticleAuthor has 99 million rows, and #ArticleAuthorTemp gets filled very quickly beforehand with all the IDs I need from ArticleAuthor (3,023 rows) at 0% cost in the execution plan. I fill the temp table only to limit what the query you see here has to do.
The execution plan for the query below says it spends most of its time on 2 key lookups and an index seek, each at about 30%. I'm not sure how to create the needed indexes from that, or whether it would even help; I'm kind of new to indexing and hate to just throw indexes at everything. Even without the 2 LEFT JOINs or the outer query it's very slow, so I think the real issue is with the ArticleAuthor table. You'll see the indexes I have on this table below too... :)
I can provide any info you need on the execution plan if that helps.
SELECT tot.*,pu.LastName+', '+ ISNULL(pu.FirstName,'') CreatedByPerson,COALESCE(pf.updateddate,pf.CreatedDate) CreatedDatePerson
from (
SELECT CONVERT(VARCHAR(12), AA.Id) ArticleId
, 0 Citations
, AA.FullName
, AA.LastName
, AA.FirstInitial
, AA.FirstName GivenName
, AA.Affiliations
FROM ArticleAuthor AA WITH (NOLOCK)
INNER JOIN #ArticleAuthorTemp AAT ON AAT.Id = AA.Id
)tot LEFT JOIN AcademicAnalytics..pub_articlefaculty pf WITH (NOLOCK) ON tot.ArticleId = pf.SourceId
LEFT JOIN AAPortal..portal_user pu on pu.id = COALESCE(pf.updatedby,pf.CreatedBy)
Indexes:
CREATE CLUSTERED INDEX [IX_Name] ON [dbo].[ArticleAuthor]
(
[LastName] ASC,
[FirstName] ASC,
[FirstInitial] ASC
)
CREATE NONCLUSTERED INDEX [IX_ID] ON [dbo].[ArticleAuthor]
(
[Id] ASC
)
CREATE NONCLUSTERED INDEX [IX_ArticleID] ON [dbo].[ArticleAuthor]
(
[ArticleId] ASC
)
Google the CREATE INDEX statement and learn about the INCLUDE clause. Use INCLUDE to eliminate the key lookups by including all the columns your query needs to return.
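For example, a covering index for the ArticleAuthor access in the query above might look like the sketch below; the INCLUDE list is an assumption taken from the SELECT, so check it against the output list of the key lookups in your actual plan:
CREATE NONCLUSTERED INDEX IX_ArticleAuthor_Id_Covering
    ON dbo.ArticleAuthor ([Id])
    INCLUDE ([FullName], [LastName], [FirstInitial], [FirstName], [Affiliations]);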

Which order of joins is faster?

I'm looking at a MS SQL Server database which was developed by a company that is an expert at database design (or so I'm told) and I noticed a curious pattern of JOINs/indexes. It's upside down from what I would have done, so I wonder if it has some performance benefits (the DB is fairly big).
The table structure (simplified pseudocode) is:
Table JOBS (about 1K rows):
job_id [int, primary key]
server_id [int, foreign key]
job_name [string]
Table JOB_HISTORY (about 17M rows):
history_id [int, primary key]
job_id [int, foreign key]
server_id [int, foreign key]
job_start [datetime]
job_duration [int]
Note the denormalization where the server_id is in both tables.
What they did is:
select
t1.job_name, t2.job_start, t2.job_duration
from
JOBS t1
inner join
JOB_HISTORY t2 on (t1.job_id = t2.job_id and t1.server_id = t2.server_id)
where
t1.server_id = @param_server_id
and t2.job_start >= @param_from
and t2.job_start <= @param_to
And they have indexes:
JOBS => (server_id)
JOB_HISTORY => (job_id, server_id, job_start)
In other words, when they select the rows, they first filter the jobs from JOBS table and then look up the relevant JOB_HISTORY entries. This is what the DB is forced to do, because of the indexes.
What I would have done it is the bottom-up version:
select
t1.job_name, t2.job_start, t2.job_duration
from
JOB_HISTORY t2
inner join
JOBS t1 on (t1.job_id = t2.job_id)
where
t2.server_id = @param_server_id
and t2.job_start >= @param_from
and t2.job_start <= @param_to
And a single index:
JOB_HISTORY => (server_id, job_start)
So, basically, I directly select the relevant rows from the large JOB_HISTORY and then just look for the attached data from the JOBS table.
Is there a reason to prefer one over the other?
Well, I was a bit bored, so I thought I'd re-create this for you. First, the setup (I'm using a numbers table to generate about 1K and 17M rows; of course this is all random data and doesn't represent your system :). I'm also assuming there's a clustered index on each table, even though you imply you wouldn't have one.
USE TempDB;
GO
DROP TABLE IF EXISTS #Jobs;
DROP TABLE IF EXISTS #Job_History;
CREATE TABLE #Jobs
(
job_id INT IDENTITY PRIMARY KEY
,server_id INT
,job_name VARCHAR(50)
);
CREATE TABLE #Job_History
(
history_id INT IDENTITY PRIMARY KEY
,job_id INT
,server_id INT
,job_start DATETIME DEFAULT SYSDATETIME()
,job_duration INT DEFAULT ABS(CHECKSUM(NEWID())) % 5000
);
GO
INSERT INTO #Jobs
SELECT server_id = N.n
,job_name = CONVERT(VARCHAR(50), NEWID())
FROM DBA.Dim.Numbers N
WHERE n < 1000;
INSERT INTO #JOB_HISTORY
( job_id
,server_id
)
SELECT job_id = j1.job_id
,server_id = j1.server_id
FROM #Jobs j1
CROSS JOIN DBA.Dim.Numbers n
WHERE n < 17000;
Now, case 1 (their way)
DROP INDEX IF EXISTS Idx_Job_hist ON #Job_History;
CREATE NONCLUSTERED INDEX Idx_Job_Hist ON #Job_History (job_id, server_id, job_start);
DBCC FREEPROCCACHE
DBCC DROPCLEANBUFFERS
DECLARE @param_server_id INT = 1234
DECLARE @param_from INT = 500
DECLARE @param_to INT = 1000
select
t1.job_name, t2.job_start, t2.job_duration
from
#JOBS t1
inner join
#JOB_HISTORY t2 on (t1.job_id = t2.job_id and t1.server_id = t2.server_id)
where
t1.server_id = @param_server_id
and t2.job_start >= @param_from
and t2.job_start <= @param_to;
And Case 2 (your way)
DROP INDEX IF EXISTS Idx_Job_hist ON #Job_History;
CREATE NONCLUSTERED INDEX Idx_Job_Hist ON #Job_History (server_id, job_start);
select
t1.job_name, t2.job_start, t2.job_duration
from
#JOB_HISTORY t2
inner join
#JOBS t1 on (t1.job_id = t2.job_id)
where
t2.server_id = @param_server_id
and t2.job_start >= @param_from
and t2.job_start <= @param_to;
And the (totally non-conclusive, because my system isn't your system...) results:
Their plan:
Your Plan:
The costs from your plan were much higher overall.
But then this is a rather artificial exercise to just prove the point - run the plans, the answer is - it depends.
(Thanks for the excuse to play with this, it was fun :)
The short answer here is that it doesn't really matter in what order you JOIN the tables. SQL is one of those languages where you tell the server what you want, not how to get it (**) (a so-called declarative language).
The reason we are seeing different query plans for the two versions of your query is that they are not exactly the same. In the first one there is a requirement that server_id be identical in both tables, while in the second version this is no longer mentioned: t1.server_id can be anything there. If you re-add this requirement you'll notice that the query plans become identical and that the server does exactly the same thing 'under the hood' for either query.
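For illustration, this is your second query with only the server_id equality restored; everything else is taken from the question:
select
    t1.job_name, t2.job_start, t2.job_duration
from
    JOB_HISTORY t2
    inner join JOBS t1
        on t1.job_id = t2.job_id
       and t1.server_id = t2.server_id   -- the re-added requirement
where
    t2.server_id = @param_server_id
    and t2.job_start >= @param_from
    and t2.job_start <= @param_to;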
FYI: building on Les H's answer, I took the liberty of checking what kind of index MSSQL would suggest here and, not surprisingly, it came up with
CREATE NONCLUSTERED INDEX idx_test
ON [dbo].[Job_History] ([server_id],[job_start])
INCLUDE ([job_id],[job_duration])
FYI:
without the index, each query took about 1500ms to run
creating the index took about 20 seconds
with the index, each query takes about 200ms to run
(**: Yes, I'm aware that you can 'direct' what happens under the hood by means of HINTS, but experience shows that those should only be a last resort when the QO no longer is able to make sense of things. In most cases, when the statistics are up-to-date and the data layout is not extremely exotic, the Query Optimizer is ridiculously smart about finding the best way to get you the data you asked for.)

Optimal performing query for latest record for each N

Here is the scenario I find myself in.
I have a reasonably big table that I need to query the latest records from. Here is the create for the essential columns for the query:
CREATE TABLE [dbo].[ChannelValue](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[UpdateRecord] [bit] NOT NULL,
[VehicleID] [int] NOT NULL,
[UnitID] [int] NOT NULL,
[RecordInsert] [datetime] NOT NULL,
[TimeStamp] [datetime] NOT NULL
) ON [PRIMARY]
GO
The ID column is a Primary Key and there is a non-Clustered index on VehicleID and TimeStamp
CREATE NONCLUSTERED INDEX [IX_ChannelValue_TimeStamp_VehicleID] ON [dbo].[ChannelValue]
(
[TimeStamp] ASC,
[VehicleID] ASC
)ON [PRIMARY]
GO
The table I'm working with to optimise my query has a little over 23 million rows, and it is only a tenth of the size the query will need to operate against in the live environment.
I need to return the latest row for each VehicleID.
I've been looking through the responses to this question here on StackOverflow and I've done a fair bit of Googling and there seem to be 3 or 4 common ways of doing this on SQL Server 2005 and upwards.
So far the fastest method I've found is the following query:
SELECT cv.*
FROM ChannelValue cv
WHERE cv.TimeStamp = (
SELECT
MAX(TimeStamp)
FROM ChannelValue
WHERE ChannelValue.VehicleID = cv.VehicleID
)
With the current amount of data in the table it takes about 6 seconds to execute, which is within reasonable limits, but with the amount of data the table will contain in the live environment the query becomes too slow.
Looking at the execution plan, my concern is what SQL Server is doing to return the rows.
I cannot post the execution plan image because my reputation isn't high enough, but the index scan is reading every single row in the table, which is what slows the query down so much.
I've tried rewriting the query with several different methods including using the SQL 2005 Partition method like this:
WITH cte
AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC) AS seq
FROM ChannelValue
)
SELECT
VehicleID,
TimeStamp,
Col1
FROM cte
WHERE seq = 1
But the performance of that query is even worse by quite a large magnitude.
I've tried re-structuring the query like this but the result speed and query execution plan is nearly identical:
SELECT cv.*
FROM (
SELECT VehicleID
,MAX(TimeStamp) AS [TimeStamp]
FROM ChannelValue
GROUP BY VehicleID
) AS [q]
INNER JOIN ChannelValue cv
ON cv.VehicleID = q.VehicleID
AND cv.TimeStamp = q.TimeStamp
I have some flexibility available to me around the table structure (although to a limited degree) so I can add indexes, indexed views and so forth or even additional tables to the database.
I would greatly appreciate any help at all here.
Edit Added the link to the execution plan image.
Depends on your data (how many rows are there per group?) and your indexes.
See Optimizing TOP N Per Group Queries for some performance comparisons of 3 approaches.
In your case, with millions of rows for only a small number of vehicles, I would add an index on (VehicleID, TimeStamp) and do:
SELECT CA.*
FROM Vehicles V
CROSS APPLY (SELECT TOP 1 *
FROM ChannelValue CV
WHERE CV.VehicleID = V.VehicleID
ORDER BY TimeStamp DESC) CA
If your records are inserted sequentially, replacing TimeStamp in your query with ID may make a difference.
As a side note, how many records is this returning? Your delay could be network overhead if you are getting hundreds of thousands of rows back.
Try this:
SELECT SequencedChannelValue.* -- Specify only the columns you need; exclude the SeqValue column
FROM
(
SELECT
ChannelValue.*, -- Specify only the columns you need
SeqValue = ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC)
FROM ChannelValue
) AS SequencedChannelValue
WHERE SequencedChannelValue.SeqValue = 1
A table or index scan is expected, because you're not filtering data in any way. You're asking for the latest TimeStamp for all VehicleIDs - the query engine HAS to look at every row to find the latest TimeStamp.
You can help it out by narrowing the number of columns being returned (don't use SELECT *), and by providing an index that consists of VehicleID + TimeStamp.
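A hedged sketch of such an index; the INCLUDE list is an assumption based on the trimmed-down CREATE TABLE shown in the question (the real table has more columns), and the index name is made up:
-- Latest row per vehicle: seekable on VehicleID, ordered by TimeStamp descending,
-- widened with INCLUDE columns so the query does not need key lookups.
CREATE NONCLUSTERED INDEX IX_ChannelValue_VehicleID_TimeStamp
    ON dbo.ChannelValue (VehicleID, TimeStamp DESC)
    INCLUDE (UpdateRecord, UnitID, RecordInsert);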
