Performance Strategy for growing data - sql-server

I know that performance tuning has to be done specifically for each environment, but I have made every effort to state my question clearly, in case I am missing something among the possible improvements.
I have a table [TestExecutions] in SQL Server 2005. It has around 0.2 million records as of today and is expected to grow to 5 million in a couple of months.
CREATE TABLE [dbo].[TestExecutions]
(
[TestExecutionID] [int] IDENTITY(1,1) NOT NULL,
[OrderID] [int] NOT NULL,
[LineItemID] [int] NOT NULL,
[Manifest] [char](7) NOT NULL,
[RowCompanyCD] [char](4) NOT NULL,
[RowReferenceID] [int] NOT NULL,
[RowReferenceValue] [char](3) NOT NULL,
[ExecutedTime] [datetime] NOT NULL
)
CREATE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
I have the following two queries for the same purpose (Query 2 and Query 3). With 100 records in #OrdersForRC, Query 2 performs better (39% vs 47% of the batch cost), whereas with 10000 records in #OrdersForRC, Query 3 performs better (53% vs 33%), according to the execution plans.
In the initial few months of use, the #OrdersForRC table will have close to 100 records. It will gradually increase to 2500 records over a couple of months.
Of the following two approaches, which one is better for such an incrementally growing scenario? Or is there a strategy to make one approach work better than the other even as the data grows?
Note: In Plan 2, the first query uses a Hash Match.
References
query optimizer operator choice - nested loops vs hash match (or merge)
Execution Plan Basics — Hash Match Confusion
Test Query
CREATE TABLE #OrdersForRC
(
OrderID INT
)
INSERT INTO #OrdersForRC
--SELECT DISTINCT TOP 100 OrderID FROM [TestExecutions]
SELECT DISTINCT TOP 5000 OrderID FROM LWManifestReceiptExecutions
--QUERY 2:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] H WITH (NOLOCK)
INNER JOIN #OrdersForRC R
ON R.OrderID = H.OrderID
--QUERY 3:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] H WITH (NOLOCK)
WHERE OrderID IN (SELECT OrderID FROM #OrdersForRC)
DROP TABLE #OrdersForRC
Plan 1
Plan 2

As commented above, you have not specified the table definition of LWManifestReceiptExecutions or how many rows it holds.
You are selecting TOP N rows without an ORDER BY. Do you want TOP N random IDs, or IDs in a specific order, or does order not matter to you?
If order does matter, you can create an index on the column you need in the ORDER BY.
If OrderID is unique in the [dbo].[TestExecutions] table, you should mark it as such: drop the index and recreate it as UNIQUE.
DROP INDEX [IX_TestExecutions_OrderID] ON [dbo].[TestExecutions]
CREATE UNIQUE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
You said the data keeps growing and will reach millions of rows in a couple of months.
No need to worry: SQL Server can easily handle these queries with a properly built schema and indexes.
When this data model starts hurting, you could look at other options, but not now; I have seen people handle billions of rows in SQL Server.
I can see you are comparing the queries on the basis of query cost and concluding that the query with the higher percentage is the more expensive one.
That is not always the case. Query cost is the aggregate subtree cost of all iterators in the query plan, and the total estimated cost of an iterator is a simple sum of its I/O and CPU components.
The cost values represent expected execution times (in seconds) on a particular reference hardware configuration, but on modern hardware these costs can be irrelevant.
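If you want to compare the two queries on actual work done rather than estimated cost, a minimal sketch is to capture execution statistics and compare the logical reads and CPU/elapsed times reported in the Messages tab:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run Query 2, note logical reads and CPU/elapsed time
-- run Query 3, note logical reads and CPU/elapsed time

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;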
Now coming to your queries: you have written two queries to get the same result, but they are not identical.
In Plan 1, Query 1 (expressed as a JOIN):
The QO chooses a nested loops join, which is a good choice for this particular scenario: for every OrderID key in table #OrdersForRC, it seeks the value in table dbo.[TestExecutions] until all rows are matched.
In Plan 1, Query 2 (expressed as IN):
The QO does the same thing as for Query 1, but there is an extra Distinct Sort (Sort plus Stream Aggregate). The reasoning behind it is that you expressed this query with IN, and the table #OrdersForRC can contain duplicate rows; the sort is necessary just to eliminate them.
In Plan 2, Query 1 (expressed as a JOIN):
Now that the table #OrdersForRC holds thousands of rows, the QO chooses a hash join over a loops join, because a loops join over that many rows costs more than a hash join. The rows are also unordered and can contain NULLs, so a hash join is the perfect strategy here.
In Plan 2, Query 2 (expressed as IN):
The QO has chosen a Distinct Sort for the same reason as in Plan 1, Query 2, followed by a merge join, because the rows are now sorted on the ID column for both tables.
If you just mark the temp table column as NOT NULL and UNIQUE, it is more likely you will get the same execution plan for both the IN and the JOIN versions.
CREATE TABLE #OrdersForRC
(OrderID INT NOT NULL UNIQUE)
Execution plan

Related

Large Table Optimization In SQL Server 2014

I have a reporting table that is populated from various fact tables in my Data Warehouse. The issue is that for one customer in that reporting table, it takes 46 seconds to pull his data. The one customer has 4232424 records. In total, the table has 5336393 records in it, and has 4 columns. I'll post the table structure and the query I'm running. I need to get the result time on this down to as low as possible. I've tried In Memory Tables, various Indexes, and Indexed Views.
TABLE STRUCTURE
CREATE TABLE cache.Tree
(
CustomerID INT NOT NULL PRIMARY KEY NONCLUSTERED,
RelationA_ID INT NOT NULL,
RelationB_ID INT NOT NULL,
NestedLevel INT NOT NULL,
lft INT NOT NULL,
rgt INT NOT NULL,
INDEX IX_LEGS CLUSTERED (lft, rgt),
INDEX IX_LFT NONCLUSTERED (lft)
)
The Report Query
SELECT
tp.CustomerID AS DLine,
t.CustomerID,
t.RelationA_ID,
Level = t.NestedLevel - tp.NestedLevel,
IndentedSort = t.lft
FROM cache.Tree tp
INNER JOIN cache.Tree t
ON t.lft between tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664
Any help or guidance would be greatly appreciated.
UPDATE 1: Query Execution Plan
UPDATE 2: Solved
I was able to get permission to filter out inactive people in the tree. This cut the query execution time almost in half, provided I keep the indexes I put on the table.
Try FORCESCAN. For a query that pulls 80% of a narrow table I would expect SQL Server to scan, but it might not because of bad statistics or one of the various cardinality estimation bugs (which are fixed but require trace flags to enable).
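A minimal sketch of the hint applied to the report query above (assuming it targets the cache.Tree table shown in the question):
SELECT
    tp.CustomerID AS DLine,
    t.CustomerID,
    t.RelationA_ID,
    Level = t.NestedLevel - tp.NestedLevel,
    IndentedSort = t.lft
FROM cache.Tree tp
INNER JOIN cache.Tree t WITH (FORCESCAN)
    ON t.lft BETWEEN tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664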
I would also ditch the Celko sets (the nested-set lft/rgt model): a single parent_id column will make your table even narrower, which should speed up these throughput-bound cases, spare you the left/right maintenance, and be very fast with recursive queries.
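For illustration, a sketch of that adjacency-list alternative; the cache.TreeAdjacency table and its ParentID column are hypothetical:
CREATE TABLE cache.TreeAdjacency
(
    CustomerID INT NOT NULL PRIMARY KEY,
    ParentID INT NULL  -- NULL for the root of the tree
);

-- walk one customer's subtree with a recursive CTE
WITH Subtree AS
(
    SELECT CustomerID, ParentID, 0 AS Level
    FROM cache.TreeAdjacency
    WHERE CustomerID = 7664
    UNION ALL
    SELECT c.CustomerID, c.ParentID, p.Level + 1
    FROM cache.TreeAdjacency c
    INNER JOIN Subtree p ON c.ParentID = p.CustomerID
)
SELECT CustomerID, Level
FROM Subtree;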

Why is this query running so slow?

This query runs very fast (<100 msec):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
If I add just a time filter, it takes too long (22 seconds!):
SELECT TOP (10)
[Extent2].[CompanyId] AS [CompanyId]
,[Extent1].[Id] AS [Id]
,[Extent1].[Status] AS [Status]
FROM [dbo].[SplittedSms] AS [Extent1]
INNER JOIN [dbo].[Sms] AS [Extent2]
ON [Extent1].[SmsId] = [Extent2].[Id]
WHERE [Extent2].Time > '2015-04-10'
AND [Extent2].[CompanyId] = 4563
AND ([Extent1].[NotifiedToClient] IS NULL)
I tried adding an index on the [Time] column of the Sms table, but the optimizer does not seem to use it. I also tried forcing it with WITH (INDEX (Ix_Sms_Time)), but to my surprise, that takes even more time (29 seconds!).
Here is the actual execution plan:
The execution plan is the same for both queries. The tables mentioned here have 5M to 8M rows (indexes are < 1% fragmented and statistics are up to date). I am using MS SQL Server 2008 R2 on a 16-core, 32 GB memory Windows 2008 R2 machine.
Does it help when you force the time filter to kick in only after the client filter has run?
Like in this example:
;WITH ClientData AS (
SELECT
[E2].[CompanyId]
,[E2].[Time]
,[E1].[Id]
,[E1].[Status]
FROM [dbo].[SplittedSms] AS [E1]
INNER JOIN [dbo].[Sms] AS [E2]
ON [E1].[SmsId] = [E2].[Id]
WHERE [E2].[CompanyId] = 4563
AND ([E1].[NotifiedToClient] IS NULL)
)
SELECT TOP 10
[CompanyId]
,[Id]
,[Status]
FROM ClientData
WHERE [Time] > '2015-04-10'
Create an index on Sms with the following Index Key Columns (in this order):
CompanyID
Time
You may or may not need to add Id as an Included Column.
What datatype is your Time column?
If it's datetime, try converting your '2015-04-10' literal to the equivalent data type, so that the index can be used.
DECLARE @test datetime
SET @test = '2015-04-10'
Then modify your condition:
[Extent2].Time > @test
SQL Server implicitly casts to the matching data type when there is a data-type mismatch, and any function or cast operation on the column can prevent index use.
I'm on the same track as @JonTirjan: the index with just Time results in a lot of key lookups, so you should try at least the following:
create index xxx on Sms (Time, CompanyId) include (Id)
or
create index xxx on Sms (CompanyId, Time) include (Id)
If Id is your clustered index key, it's not needed in the INCLUDE clause. If a significant part of your data belongs to CompanyId 4563, it might be worthwhile to have that as an included column too.
The percentages you see in an actual plan are just estimates based on row count assumptions, so they are sometimes totally wrong. Looking at the actual number of rows/executions plus STATISTICS IO output should give you an idea of what's actually happening.
Two things come to mind:
By adding an extra restriction, it becomes 'harder' for the database to find the first 10 rows that match your restrictions. Finding the first 10 rows out of, say, 10,000 matching items (from a total of 1 million) is easier than finding the first 10 rows out of maybe 100 matching items (from a total of 1 million).
The index is probably not being used because it was created on a datetime column, which is not very efficient if you also store the time of day in it. You might want to create a clustered index on the [Time] column (but then you would have to remove the clustered index which is now on the [CompanyId] column), or you could create a computed column that stores the date part of the [Time] column, create an index on that computed column, and filter on it.
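A sketch of the computed-column variant (column and index names are illustrative):
ALTER TABLE dbo.Sms ADD TimeDate AS CAST([Time] AS date);
CREATE INDEX IX_Sms_TimeDate ON dbo.Sms (TimeDate);

-- then filter on the computed column instead:
-- WHERE [Extent2].TimeDate > '2015-04-10'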
I found out that there was no index on the foreign key column (SmsId) of the SplittedSms table. I created one, and the second query now seems almost as fast as the first.
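For reference, a minimal sketch of such an index (the name is illustrative):
CREATE NONCLUSTERED INDEX IX_SplittedSms_SmsId
ON dbo.SplittedSms (SmsId);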
The execution plan now:
Thanks everyone for the effort.

PostgreSQL - Index not used

I have created a table, and I've written a trigger procedure to update it whenever I update another table. That is, when I update table2, a few records from table2 are copied by the trigger into table1.
I could have created a view instead, but the main reason I did not is that I cannot create an index on a view.
Table2 contains about 500k rows. About 220k of those rows are copied into table1, and an extra column is computed that gives each row either 0 or 1 based on some criteria.
And I've created an index on table1.
If I execute a COUNT(*) query against table2, which has about 500k rows and already has a single index on the date column, the query executes in 200 ms.
But if I execute the same query against table1, it takes double the time compared to table2.
And if I remove the index on table1, it adds another 500-600 ms to the execution time; in other words, creating the index on table1 only reduced it by 500-600 ms.
Explain Analyze of the query with 2 columns.
"HashAggregate (cost=80421.85..80422.94 rows=73 width=4) (actual time=6248.826..6248.829 rows=3 loops=1)"
" -> Seq Scan on table1 (cost=0.00..70306.88 rows=2022994 width=4) (actual time=0.048..4203.224 rows=2022994 loops=1)"
" Filter: ((date >= '2014-02-01'::date) AND (date <= '2014-04-30'::date))"
"Total runtime: 6248.895 ms"
Table Definition:
CREATE TABLE table1
(
label1 text NOT NULL,
label2 text NOT NULL,
label3 text NOT NULL,
date date NOT NULL,
"mobile no" bigint NOT NULL,
"start time" time without time zone NOT NULL,
"end time" time without time zone NOT NULL,
label4 text NOT NULL,
label5 text NOT NULL,
value1 integer NOT NULL,
count numeric NOT NULL
)
Index Definition :
CREATE INDEX ix_date
ON table1
USING btree
(date);
The COUNT(*) I've given is just an example. What I actually do is sum up the count column, grouping by label1, label2, label3, and the month extracted from date.
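A sketch of that actual query against the definition below (the count column is quoted to avoid clashing with the aggregate function name):
SELECT label1, label2, label3,
       extract(month from date) AS month,
       sum("count") AS total
FROM table1
WHERE date BETWEEN '2014-02-01' AND '2014-04-30'
GROUP BY label1, label2, label3, extract(month from date);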
Firstly,
I could've created view instead of doing that. But the main purpose of it is I won't be able to create index on views.
Views are "expanded" when processing queries so, e.g. a SELECT x FROM my_view JOIN y... will practically directly substitute the view definition inside your query, and the resulting expanded query will be able to use any indexes directly, if applicable.
Secondly,
If I execute a count(*) query in table2 in which i already have only one index for date col. The query executes in 200ms which has about 500k rows.
Unfortunately, COUNT(*) queries in PostgreSQL don't usually use indexes, even in recent (9.2+) versions with index-only scans. See https://wiki.postgresql.org/wiki/Index-only_scans#Is_.22count.28.2A.29.22_much_faster_now.3F for a description of why that is. A non-unique index (or even the primary key index) will generally not be used for COUNT(*).
Thirdly, updating records in an MVCC database such as PostgreSQL creates updated copies of those records instead of updating them in place. This almost always results in significant internal data fragmentation, which is especially visible if you use drives with slow seek times, like mechanical drives. If you want COUNT(*) times that scale linearly with table size, either make sure the data is not fragmented (VACUUM FULL ANALYZE + REINDEX will mostly do the trick), or just use an SSD.
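A sketch of that maintenance (note that VACUUM FULL rewrites the table and takes an exclusive lock on it while it runs):
VACUUM FULL ANALYZE table1;
REINDEX TABLE table1;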

Need some assistance understanding a SQL Server 2012 query plan

I have the following query:
Select TOP 5000
CdCl.SubId
From dbo.PanelCdCl CdCl WITH (NOLOCK)
Inner Join dbo.PanelHistory PH ON PH.SubId = CdCl.SubId
Where CdCl.PanelCdClStatusId IS NULL And PH.LastProcessNumber >= 1605
Order By CdCl.SubId
The query plan looks as follows:
Both the PanelCdCl and PanelHistory tables have a clustered index / primary key on SubId, and it's the only column in the index. There is exactly one row for each SubId in each table. Both tables have ~35M total rows in them.
I'm curious why the query plan is showing a clustered index scan on PanelHistory when the join is being done on the clustered index column.
It's not scanning PanelHistory's clustered index (SubId) to find a particular SubId; it's scanning it to find all rows where LastProcessNumber >= 1605. This is the first logical step.
Then it likewise scans PanelCdCl to find all rows where PanelCdClStatusId is NULL. Then, since both tables have the same clustered index key (SubId), they are both already sorted on the join column, so it can do a merge join without an additional sort. (A merge join is almost always the most efficient join if it doesn't have to re-sort its input rows.)
Then it doesn't have to do a Sort for the ORDER BY, because it's already in SubId order.
And finally, it does the TOP, which has to be after everything else (by the rules of SQL clause logical execution ordering).
So the only place it tests SubId values is in the merge join; it never pushes them down to the scans. This would probably remain true if it did a hash join instead. Only for a nested loops join would it have to push the SubId test down as a seek on one of the tables, and even then only on the lower branch, not the upper one.
The merge join operator needs two sorted inputs. The clustered key is SubId in both tables, which means the scan of PanelHistory returns rows in the correct order. The clustered key is included in all nonclustered indexes, so all the rows in the nonclustered index IX_PanelCdCl_PanelCdClStatusId where PanelCdClStatusId is NULL are ordered by SubId as well, and can therefore also be consumed directly by the merge join.
What you see here is actually two scans: one of the clustered key in PanelHistory with a residual predicate on LastProcessNumber >= 1605, and one index range scan of IX_PanelCdCl_PanelCdClStatusId for as long as PanelCdClStatusId is NULL.
They will, however, not scan the entire table/index. The query plan is executed from left to right, with each operator asking for one row at a time until there are no more rows to be had. That means the Top operator will stop asking the merge join for new rows once it has the required 5000 rows.
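For reference, a sketch of the nonclustered index the plan appears to use (the name is taken from the plan; the exact definition is an assumption):
CREATE NONCLUSTERED INDEX IX_PanelCdCl_PanelCdClStatusId
ON dbo.PanelCdCl (PanelCdClStatusId);
-- SubId, the clustering key, is carried in every nonclustered index,
-- so the PanelCdClStatusId IS NULL range comes back ordered by SubId.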

Suitable indexes for sorting in ranking functions

I have a table which keeps parent-child relations between items. These can change over time, and it is necessary to keep a complete history so that I can query how the relations looked at any given time.
The table is something like this (I removed some columns and the primary key etc. to reduce noise):
CREATE TABLE [tblRelation](
[dtCreated] [datetime] NOT NULL,
[uidNode] [uniqueidentifier] NOT NULL,
[uidParentNode] [uniqueidentifier] NOT NULL
)
My query to get the relations at a specific time is like this (assume @dt is a datetime holding the desired date):
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY r.uidNode ORDER BY r.dtCreated DESC) ix, r.*
FROM [tblRelation] r
WHERE (r.dtCreated < @dt)
) r
WHERE r.ix = 1
This query works well. However, the performance is not yet as good as I would like. When looking at the execution plan, it basically boils down to a clustered index scan (36% of cost) and a sort (63% of cost).
What indexes should I use to make this query faster? Or is there a better way altogether to perform this query on this table?
The ideal index for this query would have key columns uidNode, dtCreated and include all remaining columns in the table to make the index covering, since you are returning r.*. If the query will generally return only a relatively small number of rows (as seems likely due to the WHERE r.ix = 1 filter), it might not be worthwhile to make the index covering, though, as the cost of the key lookups might not outweigh the negative effect of the large index on insert/update/delete statements.
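Against the trimmed-down definition above, a sketch of that index (with the full table, the INCLUDE list would name all remaining columns; the DESC on dtCreated matches the ORDER BY in the window function):
CREATE NONCLUSTERED INDEX IX_tblRelation_uidNode_dtCreated
ON [tblRelation] (uidNode, dtCreated DESC)
INCLUDE (uidParentNode)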
The window/ranking functions in SQL Server 2005 are sometimes not that well optimized (based on answers here); apparently this is better in SQL Server 2008.
Another alternative is something like the following. I'd have a nonclustered index on (uidNode, dtCreated) INCLUDE-ing any other columns required by the SELECT, subject to what Martin Smith said about lookups.
WITH MaxPerUid AS
(
    SELECT
        MAX(r.dtCreated) AS MAXdtCreated, r.uidNode
    FROM
        [tblRelation] r
    WHERE
        r.dtCreated < @dt
    GROUP BY
        r.uidNode
)
SELECT
    ...
FROM
    MaxPerUid M
JOIN
    [tblRelation] R ON M.uidNode = R.uidNode AND M.MAXdtCreated = R.dtCreated
