Creating an appropriate index for server-side paging query - sql-server

I have the following query:
SELECT TOP (100000)
[Filter1].[ID] AS [ID],
[Filter1].[FIELD1] AS [FIELD1],
[Filter1].[FIELD2] AS [FIELD2],
[Filter1].[FIELD3] AS [FIELD3],
[Filter1].[FIELD4] AS [FIELD4],
...
[Filter1].[FIELD30] AS [FIELD30]
FROM ( SELECT [Extent1].[ID] AS [ID],
[Extent1].[FIELD1] AS [FIELD1],
[Extent1].[FIELD2] AS [FIELD2],
[Extent1].[FIELD3] AS [FIELD3],
...
[Extent1].[FIELD30] AS [FIELD30],
row_number() OVER (ORDER BY [Extent1].[ID] ASC) AS [row_number]
FROM [dbo].[TABLE] AS [Extent1]
WHERE (N'VALUE1' <> [Extent1].[FIELD2])
AND (N'VALUE2' <> [Extent1].[FIELD3])
AND ([Extent1].[FIELD4] IN (VALUE1, VALUE2, VALUE3, .... VALUE9))
) AS [Filter1]
WHERE [Filter1].[row_number] > 0
ORDER BY [Filter1].[ID] ASC
Because of the number of rows that need to be selected (a few million), I am doing it in batches, hence the row_number filtering. Currently the query analyzer says that a clustered index scan is conducted on FIELD1. Still, I would like better performance, which is why I've tried indexing the fields in the WHERE and ORDER BY clauses.
What I've tried so far:
Non-clustered indexes on
FIELD2 ASC,
FIELD3 ASC,
FIELD4 ASC,
ID ASC
And every possible permutation. The query execution time doubles and triples.
Why is this happening and what sort of index can I create to speed this up?
By the way, I'm running SQL Server 2005, so can't use filtered indexes. Compatibility level is 7.0.

I did some testing with dummy data on my local machine. These recommendations assume that the table currently has no indexes; if it does, drop them before testing these suggestions.
1) [FIELD1] is not used in any WHERE filter. Make that primary key NONCLUSTERED.
2) Create a clustered index on the columns (FIELD2 ASC, FIELD3 ASC, FIELD4 ASC).
3) The ROW_NUMBER function is already ordered by the [ID] column, so use [row_number] in the outer ORDER BY clause instead of [ID].
4) Change the order of the WHERE filters in the query: put the [FIELD4] filter first, followed by the [FIELD2] and [FIELD3] filters.
5) If the data type of columns [FIELD2] and [FIELD3] is INT/NUMERIC/DATE, you can replace the "<>" operator with the combination "(> OR <)".
6) If the data type of columns [FIELD2] and [FIELD3] is a string type, leave those WHERE filters with the "<>" operator.
Check these suggestions on SQL Fiddle. The fiddle does not contain any data, but it illustrates the index and query suggestions above.
With these suggestions you should get an Index Seek, which should give a good performance improvement. In my testing with a 2M-row table of dummy data, the query returned 50k rows in 2 seconds in SSMS.
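As a rough illustration of suggestions 2 to 4, the index and the reshaped query might look like the sketch below. The index name, the reduced column list, and the integer values in the IN list are assumptions made for the example, not taken from the original table; whether the optimizer actually chooses a seek depends on the data.

```sql
-- Suggestion 2: cluster the table on the filter columns.
-- The index name is hypothetical; an existing clustered index has to be
-- dropped or made nonclustered first (suggestion 1).
CREATE CLUSTERED INDEX [IX_TABLE_FIELD2_FIELD3_FIELD4]
    ON [dbo].[TABLE] ([FIELD2] ASC, [FIELD3] ASC, [FIELD4] ASC);
GO

-- Suggestions 3 and 4: FIELD4 filter first, outer ORDER BY on row_number.
SELECT TOP (100000)
    [Filter1].[ID],
    [Filter1].[FIELD2],
    [Filter1].[FIELD3],
    [Filter1].[FIELD4]
FROM (
    SELECT [Extent1].[ID],
           [Extent1].[FIELD2],
           [Extent1].[FIELD3],
           [Extent1].[FIELD4],
           ROW_NUMBER() OVER (ORDER BY [Extent1].[ID] ASC) AS [row_number]
    FROM [dbo].[TABLE] AS [Extent1]
    WHERE [Extent1].[FIELD4] IN (1, 2, 3)      -- hypothetical integer values
      AND [Extent1].[FIELD2] <> N'VALUE1'
      AND [Extent1].[FIELD3] <> N'VALUE2'
) AS [Filter1]
WHERE [Filter1].[row_number] > 0
ORDER BY [Filter1].[row_number] ASC;
```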

The solution here was to ignore indexes on the WHERE and ORDER BY clauses and instead change the ORDER BY clause to "ORDER BY [Filter1].[FIELD2] ASC", where a clustered index already existed. This way 100k rows were returned in 3 seconds. The ordering within the file changed, but server-side paging was not affected.

Related

SQL Server: Performance of INNER JOIN on small table vs subquery in IN clause

Let's say I have the following two tables:
CREATE TABLE [dbo].[ActionTable]
(
[ActionID] [int] IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL
,[ActionName] [varchar](80) NOT NULL
,[Description] [varchar](120) NOT NULL
,CONSTRAINT [PK_ActionTable] PRIMARY KEY CLUSTERED ([ActionID] ASC)
,CONSTRAINT [IX_ActionName] UNIQUE NONCLUSTERED ([ActionName] ASC)
)
GO
CREATE TABLE [dbo].[BigTimeSeriesTable]
(
[ID] [bigint] IDENTITY(1, 1) NOT FOR REPLICATION NOT NULL
,[TimeStamp] [datetime] NOT NULL
,[ActionID] [int] NOT NULL
,[Details] [varchar](max) NULL
,CONSTRAINT [PK_BigTimeSeriesTable] PRIMARY KEY NONCLUSTERED ([ID] ASC)
)
GO
ALTER TABLE [dbo].[BigTimeSeriesTable]
WITH CHECK ADD CONSTRAINT [FK_BigTimeSeriesTable_ActionTable] FOREIGN KEY ([ActionID]) REFERENCES [dbo].[ActionTable]([ActionID])
GO
CREATE CLUSTERED INDEX [IX_BigTimeSeriesTable] ON [dbo].[BigTimeSeriesTable] ([TimeStamp] ASC)
GO
CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID] ON [dbo].[BigTimeSeriesTable] ([ActionID] ASC)
GO
ActionTable has 1000 rows and BigTimeSeriesTable has millions of rows.
Now consider the following two queries:
Query A
SELECT *
FROM BigTimeSeriesTable
WHERE TimeStamp > DATEADD(DAY, -3, GETDATE())
AND ActionID IN (
SELECT ActionID
FROM ActionTable
WHERE ActionName LIKE '%action%'
)
Execution plan for query A
Query B
SELECT bts.*
FROM BigTimeSeriesTable bts
INNER JOIN ActionTable act ON act.ActionID = bts.ActionID
WHERE bts.TimeStamp > DATEADD(DAY, -3, GETDATE())
AND act.ActionName LIKE '%action%'
Execution plan for query B
Question: Why does query A have better performance than query B (sometimes 10 times better)? Shouldn't the query optimizer recognize that the two queries are exactly the same? Is there any way to provide hints that would improve the performance of the INNER JOIN?
Update: I changed the join to INNER MERGE JOIN and the performance greatly improved. See execution plan here. Interestingly when I try the merge join in the actual query I'm trying to run (which I cannot show here, confidential) it totally messes up the query optimizer and the query is super slow, not just relatively slow.
The execution plans you have supplied both have exactly the same basic strategy.
Join
There is a seek on ActionTable to find rows where ActionName starts with "generate" with a residual predicate on the ActionName LIKE '%action%'. The 7 matching rows are then used to build a hash table.
On the probe side there is a seek on TimeStamp > Scalar Operator(dateadd(day,(-3),getdate())) and matching rows are tested against the hash table to see if the rows should join.
There are two main differences which explain why the IN version executes quicker
IN
The IN version is executing in parallel. There are 4 concurrent threads working on the query execution - not just one.
Related to the parallelism this plan has a bitmap filter. It is able to use this bitmap to eliminate rows early. In the inner join plan 25,959,124 rows are passed to the probe side of the hash join, in the semi join plan the seek still reads 25.9 million rows but only 313 rows are passed out to be evaluated by the join. The remainder are eliminated early by applying the bitmap inside the seek.
It is not readily apparent why the INNER JOIN version does not execute in parallel. You could try adding the hint OPTION(USE HINT('ENABLE_PARALLEL_PLAN_PREFERENCE')) to see if you now get a plan which executes in parallel and contains the bitmap filter.
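For illustration, this is Query B from the question with that hint appended. As far as I know the hint is undocumented and only recognized by newer SQL Server builds, so treat this as an experiment rather than a supported setting.

```sql
-- Query B with the parallel-plan preference hint added at the end.
SELECT bts.*
FROM dbo.BigTimeSeriesTable AS bts
INNER JOIN dbo.ActionTable AS act
    ON act.ActionID = bts.ActionID
WHERE bts.TimeStamp > DATEADD(DAY, -3, GETDATE())
  AND act.ActionName LIKE '%action%'
OPTION (USE HINT ('ENABLE_PARALLEL_PLAN_PREFERENCE'));
```

Compare the actual execution plan before and after to check whether parallelism operators and the bitmap filter appear.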
If you are able to change indexes then, given that the query only returns 309 rows for 7 distinct actions, you may well find that replacing IX_BigTimeSeriesTable_ActionID with a covering index with leading columns [ActionID], [TimeStamp] and then getting a nested loops plan with 7 seeks performs much better than your current queries.
CREATE NONCLUSTERED INDEX [IX_BigTimeSeriesTable_ActionID_TimeStamp]
ON [dbo].[BigTimeSeriesTable] ([ActionID], [TimeStamp])
INCLUDE ([Details], [ID])
Hopefully with that index in place your existing queries will just use it and you will see 7 seeks, each returning an average of 44 rows, to read and return only the exact 309 total required. If not you can try the below
SELECT CA.*
FROM ActionTable A
CROSS APPLY
(
SELECT *
FROM BigTimeSeriesTable B
WHERE B.ActionID = A.ActionID AND B.TimeStamp > DATEADD(DAY, -3, GETDATE())
) CA
WHERE A.ActionName LIKE '%action%'
I had some success using an index hint: WITH (INDEX(IX_BigTimeSeriesTable_ActionID))
However, as the query changes, even slightly, this can totally hamstring the optimizer's ability to find the best plan.
Therefore if you want to "materialize" a subquery in order to force it to execute earlier, your best bet as of February 2020 is to use a temp table.
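A minimal sketch of the temp table approach, using the tables from the question. The temp table name #MatchingActions is made up for the example.

```sql
-- Materialize the small set of matching actions first.
CREATE TABLE #MatchingActions (ActionID int NOT NULL PRIMARY KEY);

INSERT INTO #MatchingActions (ActionID)
SELECT ActionID
FROM dbo.ActionTable
WHERE ActionName LIKE '%action%';

-- Then join the big table against the already-filtered IDs.
SELECT bts.*
FROM dbo.BigTimeSeriesTable AS bts
INNER JOIN #MatchingActions AS ma
    ON ma.ActionID = bts.ActionID
WHERE bts.TimeStamp > DATEADD(DAY, -3, GETDATE());

DROP TABLE #MatchingActions;
```

Because the temp table is populated before the second statement is compiled, the optimizer sees its real row count when it plans the join.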
For an inner join there is no difference between filtering in the JOIN and filtering in the WHERE clause:
Difference between filtering queries in JOIN and WHERE?
But here your two queries create different cases.
Query A: you are just filtering against the 1000 records.
Query B: you first join with millions of rows and then filter with the 1000 records.
So query A takes less time than query B.

Table with no primary key due to lots of duplicates, trying to speed up my query to get rid of duplicates etc

I have a table of products from various suppliers all added together, so there are lots of duplicate SKUs (ManuPartNo); there are other elements too, such as Price and Qty. I want to choose the highest Price for each SKU (ManuPartNo) and also add together all of the Qty values for any matching SKU. I have a working query but was wondering if anyone could suggest how to speed it up: the table I am querying has 1.3 million rows and my query takes about 40 seconds to run. It's not too slow, but I am trying to learn more about optimization and this question is very hard to google, so any tips or pointers in the right direction would be much appreciated.
Here is the structure of the table I'm querying and my query itself.
CREATE TABLE [dbo].[AllProductsFromAllDistis](
[ProdName] [varchar](max) NULL,
[ManuPartNo] [varchar](150) NULL,
[Manufacturer] [varchar](150) NULL,
[Price] [decimal](10, 2) NOT NULL,
[Qty] [int] NOT NULL,
[Weight] [decimal](10, 2) NULL,
[UpcCode] [varchar](50) NULL,
[Supplier] [varchar](50) NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
and my query to return only the values with the highest price and to add all qty's of any duplicate sku.
SELECT ProdName, ManuPartNo, Price, TotalQty, Weight, UPCCode, Supplier,
Manufacturer
FROM
(SELECT dbo.AllProductsFromAllDistis. *,
ROW_NUMBER() OVER(PARTITION BY ManuPartNo ORDER BY Price ASC) AS RN,
SUM(Qty) OVER(PARTITION BY ManuPartNo) AS TotalQty
FROM AllProductsFromAllDistis) AS t
WHERE RN = 1
ORDER BY ManuPartNo
As I say this works just fine but I am looking for suggestions on speeding it up.
Although you can’t create a primary key due to the duplicates, you can still create a clustered index, which does not require uniqueness but will still improve the performance of queries that group or join on the indexed columns. E.g.:
CREATE CLUSTERED INDEX [IX_AllProductsFromAllDistis] ON [dbo].[AllProductsFromAllDistis] ([ManuPartNo])
I recommend this:
SELECT t.ProdName, t.ManuPartNo, t.Price, t.TotalQty, t.Weight, t.UPCCode, t.Supplier,
t.Manufacturer
FROM
(SELECT ProdName, ManuPartNo, Price, Weight, UPCCode, Supplier, Manufacturer,
-- ASC makes RN = 1 the lowest-priced row per SKU; use DESC to keep the highest price
ROW_NUMBER() OVER(PARTITION BY ManuPartNo ORDER BY Price ASC) AS RN,
SUM(Qty) OVER(PARTITION BY ManuPartNo) AS TotalQty
FROM AllProductsFromAllDistis WITH (NOLOCK) ) AS t
WHERE t.RN = 1
ORDER BY t.ManuPartNo
Always specify the columns explicitly (do not use SELECT * FROM TableProduction).
If you are using subqueries, prefix all your columns with the alias "t".
Remember to set (NOLOCK), otherwise the query could get blocked (at the cost of possible dirty reads).
Use "Execution Plan" in Menu -> Query.
I hope this helps you a little.

Slow Running Query - Will Indexes Help? Not sure what to do with Execution Plan

I have this slow running query below that returns 3,023 rows in SQL Server 2014 in a full minute and a half. Is there anything I can do to speed it up?
I have indexes on all the fields it's joining on. ArticleAuthor has 99 million rows and #ArticleAuthorTemp gets filled very quickly beforehand with all the IDs I need from ArticleAuthor (3,023 rows) with 0% cost of execution plan. I filled the temp table only for that purpose to limit what it's doing in the query you see here.
The execution plan for the query below is saying it's spending the most time on 2 key lookups and an index seek, each of these things at about 30%. I'm not sure how to create the needed indexes from these or if that would even help? Kind of new to index stuff. I hate to just throw indexes on everything. Even without the 2 LEFT JOINS or outer query, it's very slow so I'm thinking the real issue is with ArticleAuthor table. You'll see the indexes I have on this table below too... :)
I can provide any info you need on the execution plan if that helps.
SELECT tot.*,pu.LastName+', '+ ISNULL(pu.FirstName,'') CreatedByPerson,COALESCE(pf.updateddate,pf.CreatedDate) CreatedDatePerson
from (
SELECT CONVERT(VARCHAR(12), AA.Id) ArticleId
, 0 Citations
, AA.FullName
, AA.LastName
, AA.FirstInitial
, AA.FirstName GivenName
, AA.Affiliations
FROM ArticleAuthor AA WITH (NOLOCK)
INNER JOIN #ArticleAuthorTemp AAT ON AAT.Id = AA.Id
)tot LEFT JOIN AcademicAnalytics..pub_articlefaculty pf WITH (NOLOCK) ON tot.ArticleId = pf.SourceId
LEFT JOIN AAPortal..portal_user pu on pu.id = COALESCE(pf.updatedby,pf.CreatedBy)
Indexes:
CREATE CLUSTERED INDEX [IX_Name] ON [dbo].[ArticleAuthor]
(
[LastName] ASC,
[FirstName] ASC,
[FirstInitial] ASC
)
CREATE NONCLUSTERED INDEX [IX_ID] ON [dbo].[ArticleAuthor]
(
[Id] ASC
)
CREATE NONCLUSTERED INDEX [IX_ArticleID] ON [dbo].[ArticleAuthor]
(
[ArticleId] ASC
)
Look up the CREATE INDEX statement and learn about the INCLUDE clause. Use INCLUDE to eliminate Key Lookups by including all the columns that your query needs to return.
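A sketch of what such a covering index could look like for the inner query in the question. The index name is hypothetical, and it assumes every listed column has a data type that is allowed in INCLUDE (text, ntext and image are not).

```sql
-- The key matches the join on AA.Id; the INCLUDE list carries the other
-- columns the inner SELECT reads from ArticleAuthor, so no key lookup
-- into the clustered index is needed for them.
CREATE NONCLUSTERED INDEX [IX_ArticleAuthor_Id_Covering]
    ON [dbo].[ArticleAuthor] ([Id] ASC)
    INCLUDE ([FullName], [LastName], [FirstInitial], [FirstName], [Affiliations]);
```

LastName, FirstName and FirstInitial are the clustered index keys, so they are already carried in every nonclustered index; listing them here is harmless and only makes the intent explicit.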

Workarounds for massive performance penalty for DISTINCT on SQL Server?

When I send the following query to our db, it returns 4636 rows in < 2 seconds:
select
company3_.Un_ID as col_0_0_
from
MNT_Equipments equip
inner join
DynamicProperties dprops
on equip.propertiesId=dprops.id
inner join
DynamicPropertiesValue dvalues
on dprops.id=dvalues.dynamicPropertiesId
inner join
Companies company3_
on dvalues.companyId=company3_.Un_ID
where
equip.discriminator='9000'
and equip.active=1
and dvalues.propertyName='Eigentuemer'
But when I add a DISTINCT to the select clause, it takes almost 4.5 minutes to return the remaining 40 entries. This seems somewhat out of proportion. What can I do to improve this, work around it, or at least find out what exactly is happening here?
Execution plans
No Distinct
With Distinct
Your help is very much appreciated!
The clustered index scans indicate that there are no good indexes on the queried tables.
If you create the following indexes the execution times should improve.
CREATE NONCLUSTERED INDEX [IX_MNT_Equipments_Active] ON [MNT_Equipments]
(
[propertiesId] ASC,
[discriminator] ASC,
[active] ASC
)
GO
CREATE NONCLUSTERED INDEX [IX_DynamicPropertiesValue_Name] ON [DynamicPropertiesValue]
(
[propertyName] ASC
)
GO

Optimal performing query for latest record for each N

Here is the scenario I find myself in.
I have a reasonably big table that I need to query the latest records from. Here is the create for the essential columns for the query:
CREATE TABLE [dbo].[ChannelValue](
[ID] [bigint] IDENTITY(1,1) NOT NULL,
[UpdateRecord] [bit] NOT NULL,
[VehicleID] [int] NOT NULL,
[UnitID] [int] NOT NULL,
[RecordInsert] [datetime] NOT NULL,
[TimeStamp] [datetime] NOT NULL
) ON [PRIMARY]
GO
The ID column is a Primary Key and there is a non-Clustered index on VehicleID and TimeStamp
CREATE NONCLUSTERED INDEX [IX_ChannelValue_TimeStamp_VehicleID] ON [dbo].[ChannelValue]
(
[TimeStamp] ASC,
[VehicleID] ASC
)ON [PRIMARY]
GO
The table I'm using to optimise my query has a little over 23 million rows and is only a tenth of the size the query needs to operate against.
I need to return the latest row for each VehicleID.
I've been looking through the responses to this question here on StackOverflow and I've done a fair bit of Googling and there seem to be 3 or 4 common ways of doing this on SQL Server 2005 and upwards.
So far the fastest method I've found is the following query:
SELECT cv.*
FROM ChannelValue cv
WHERE cv.TimeStamp = (
SELECT
MAX(TimeStamp)
FROM ChannelValue
WHERE ChannelValue.VehicleID = cv.VehicleID
)
With the current amount of data in the table it takes about 6s to execute, which is within reasonable limits, but with the amount of data the table will contain in the live environment the query begins to perform too slowly.
Looking at the execution plan my concern is around what SQL Server is doing to return the rows.
I cannot post the execution plan image because my Reputation isn't high enough but the index scan is parsing every single row within the table which is slowing the query down so much.
I've tried rewriting the query with several different methods including using the SQL 2005 Partition method like this:
WITH cte
AS (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC) AS seq
FROM ChannelValue
)
SELECT
VehicleID,
TimeStamp,
Col1
FROM cte
WHERE seq = 1
But the performance of that query is even worse by quite a large magnitude.
I've tried re-structuring the query like this but the result speed and query execution plan is nearly identical:
SELECT cv.*
FROM (
SELECT VehicleID
,MAX(TimeStamp) AS [TimeStamp]
FROM ChannelValue
GROUP BY VehicleID
) AS [q]
INNER JOIN ChannelValue cv
ON cv.VehicleID = q.VehicleID
AND cv.TimeStamp = q.TimeStamp
I have some flexibility available to me around the table structure (although to a limited degree) so I can add indexes, indexed views and so forth or even additional tables to the database.
I would greatly appreciate any help at all here.
Edit Added the link to the execution plan image.
Depends on your data (how many rows are there per group?) and your indexes.
See Optimizing TOP N Per Group Queries for some performance comparisons of 3 approaches.
In your case with millions of rows for only a small number of Vehicles I would add an index on VehicleID, Timestamp and do
SELECT CA.*
FROM Vehicles V
CROSS APPLY (SELECT TOP 1 *
FROM ChannelValue CV
WHERE CV.VehicleID = V.VehicleID
ORDER BY TimeStamp DESC) CA
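The index mentioned above could be created along these lines. The index name is made up, and sorting TimeStamp descending is a choice made for this sketch; an ascending key can also be read backwards.

```sql
-- VehicleID leads so that each CROSS APPLY iteration can seek straight to
-- one vehicle; TimeStamp follows so TOP 1 ... ORDER BY TimeStamp DESC
-- only has to read the first row of that range.
CREATE NONCLUSTERED INDEX [IX_ChannelValue_VehicleID_TimeStamp]
    ON [dbo].[ChannelValue] ([VehicleID] ASC, [TimeStamp] DESC);
```

Note that the existing index in the question has TimeStamp as the leading column, which does not support a per-vehicle seek.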
If your records are inserted sequentially, replacing TimeStamp in your query with ID may make a difference.
As a side note, how many records is this returning? Your delay could be network overhead if you are getting hundreds of thousands of rows back.
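One way to read the ID suggestion above, as a sketch. It relies on the assumption that a higher ID always means a later TimeStamp for the same vehicle, which only holds if rows are inserted in time order.

```sql
-- Find the highest ID per vehicle, then fetch those rows by primary key.
SELECT cv.*
FROM dbo.ChannelValue AS cv
INNER JOIN (
    SELECT VehicleID, MAX(ID) AS LatestID
    FROM dbo.ChannelValue
    GROUP BY VehicleID
) AS q
    ON cv.ID = q.LatestID;
```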
Try this:
SELECT SequencedChannelValue.* -- Specify only the columns you need, excluding SeqValue
FROM
(
SELECT
ChannelValue.*, -- Specify only the columns you need
SeqValue = ROW_NUMBER() OVER(PARTITION BY VehicleID ORDER BY TimeStamp DESC)
FROM ChannelValue
) AS SequencedChannelValue
WHERE SequencedChannelValue.SeqValue = 1
A table or index scan is expected, because you're not filtering data in any way. You're asking for the latest TimeStamp for all VehicleIDs - the query engine HAS to look at every row to find the latest TimeStamp.
You can help it out by narrowing the number of columns being returned (don't use SELECT *), and by providing an index that consists of VehicleID + TimeStamp.
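For example, a narrowed version of the ROW_NUMBER query might look like this. The choice of UnitID as the only extra column is an assumption for illustration; list whichever columns the caller really needs.

```sql
-- Explicit column list instead of SELECT *, so less data is read and returned.
SELECT s.VehicleID, s.[TimeStamp], s.UnitID
FROM (
    SELECT VehicleID,
           [TimeStamp],
           UnitID,
           ROW_NUMBER() OVER (PARTITION BY VehicleID
                              ORDER BY [TimeStamp] DESC) AS SeqValue
    FROM dbo.ChannelValue
) AS s
WHERE s.SeqValue = 1;
```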

Resources