Large Table Optimization In SQL Server 2014 - sql-server

I have a reporting table that is populated from various fact tables in my Data Warehouse. The issue is that pulling the data for one customer from that reporting table takes 46 seconds. That one customer has 4,232,424 records; in total, the table has 5,336,393 records and 4 columns. I'll post the table structure and the query I'm running below. I need to get the response time on this down as low as possible. I've tried in-memory tables, various indexes, and indexed views.
TABLE STRUCTURE
CREATE TABLE cache.Tree
(
    CustomerID INT NOT NULL PRIMARY KEY NONCLUSTERED,
    RelationA_ID INT NOT NULL,
    RelationB_ID INT NOT NULL,
    NestedLevel INT NOT NULL,
    lft INT NOT NULL,
    rgt INT NOT NULL,
    INDEX IX_LEGS CLUSTERED (lft, rgt),
    INDEX IX_LFT NONCLUSTERED (lft)
)
The Report Query
SELECT
    tp.CustomerID AS DLine,
    t.CustomerID,
    t.RelationA_ID,
    Level = t.NestedLevel - tp.NestedLevel,
    IndentedSort = t.lft
FROM cache.Tree tp
INNER JOIN cache.Tree t
    ON t.lft BETWEEN tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664
Any help or guidance would be greatly appreciated.
UPDATE 1: Query Execution Plan
UPDATE 2: Solved
I was able to get permission to filter out inactive people in the tree. This has cut the query execution time almost in half, as long as I keep the indexes I put on the table.

Try FORCESCAN - for a query that pulls 80% of a narrow table I would expect SQL Server to scan, but it might not because of bad statistics or one of the various cardinality estimation bugs (which have been fixed but require trace flags to enable).
I would also ditch the Celko nested sets - a single parent_id column will make your table even narrower, which should speed up these throughput-bound cases, spare you the lft/rgt maintenance, and still be very fast with recursive queries. A sketch of both ideas follows.
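A minimal sketch of both suggestions, assuming the table posted in the question. The adjacency-list table cache.TreeAdjacency and its ParentID column are made-up names, not part of the original schema:

-- 1) FORCESCAN on the inner side of the join, which returns ~80% of the table:
SELECT
    tp.CustomerID AS DLine,
    t.CustomerID,
    t.RelationA_ID,
    Level = t.NestedLevel - tp.NestedLevel,
    IndentedSort = t.lft
FROM cache.Tree tp
INNER JOIN cache.Tree t WITH (FORCESCAN)
    ON t.lft BETWEEN tp.lft AND tp.rgt
WHERE tp.CustomerID = 7664

-- 2) Adjacency-list alternative: ParentID replaces lft/rgt, and the subtree
--    is walked with a recursive CTE.
CREATE TABLE cache.TreeAdjacency
(
    CustomerID INT NOT NULL PRIMARY KEY,
    ParentID INT NULL,              -- NULL for the root
    RelationA_ID INT NOT NULL,
    RelationB_ID INT NOT NULL
)

;WITH Subtree AS
(
    SELECT CustomerID, ParentID, RelationA_ID, 0 AS Level
    FROM cache.TreeAdjacency
    WHERE CustomerID = 7664

    UNION ALL

    SELECT c.CustomerID, c.ParentID, c.RelationA_ID, s.Level + 1
    FROM cache.TreeAdjacency c
    INNER JOIN Subtree s ON c.ParentID = s.CustomerID
)
SELECT CustomerID, RelationA_ID, Level
FROM Subtree
OPTION (MAXRECURSION 0)   -- in case the tree is deeper than the default limit of 100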

Related

SQL query runs into a timeout on a sparse dataset

For sync purposes, I am trying to get a subset of the existing objects in a table.
The table has two fields, [Group] and Member, which are both stringified Guids.
All rows together may be too large to fit into a datatable; I already encountered an OutOfMemory exception. But I have to check that everything I need right now is in the datatable. So I take the Guids I want to check (they come in chunks of 1000) and query only for the related objects.
So, instead of filling my datatable once with all
SELECT * FROM Group_Membership
I am running the following SQL query against my SQL database to get related objects for one thousand Guids at a time:
SELECT *
FROM Group_Membership
WHERE
[Group] IN (@Guid0, @Guid1, @Guid2, @Guid3, @Guid4, @Guid5, ..., @Guid999)
The table in question now contains a total of 142 entries, and the query already times out (CommandTimeout = 30 seconds). On other tables, which are not as sparsely populated, similar queries don't time out.
Could someone shed some light on the logic of SQL Server and whether/how I could hint it into the right direction?
I already tried to add a nonclustered index on the column Group, but it didn't help.
I'm not sure that WHERE ... IN will be able to make full use of an index on [Group], if it can use it at all. However, if you had a second table containing the GUID values, and that column had an index, then a join might perform very fast.
Create a temporary table for the GUIDs and populate it:
CREATE TABLE #Guids (
    Guid varchar(255)
)
INSERT INTO #Guids (Guid)
VALUES
    (@Guid0), (@Guid1), (@Guid2), (@Guid3), (@Guid4), ...
CREATE INDEX Idx_Guid ON #Guids (Guid);
Now try rephrasing your current query using a join instead of a WHERE IN (...):
SELECT *
FROM Group_Membership t1
INNER JOIN #Guids t2
ON t1.[Group] = t2.Guid;
As a disclaimer, if this doesn't improve the performance, it could be because your table has low cardinality. In such a case, an index might not be very effective.

How do indexes work behind the scenes

I'm a beginner. I know indexes are necessary for performance boosts, but I want to know how they actually work behind the scenes. I used to think that we should simply create indexes on the columns that appear in the WHERE clause (which I've realized is wrong).
For example: SELECT * FROM MARKS WHERE marks_obtained > 50
Consider that there's a clustered index on the primary key of this table, and I created a non-clustered index on the marks_obtained column since it's in my WHERE clause.
My perception: the leaf nodes will contain pointers to the clustered index, and since the clustered index points to the actual rows, it will select entire rows (due to the asterisk in my query).
Scenario
I came across the following query (against the AdventureWorks DB, on which a non-clustered index had been created). It worked fine and took less than a second to return 3,200,000 rows, until a new column was added to the table:
Query
SELECT x.*
INTO #X
FROM dbo.bigProduct AS p
CROSS APPLY
(
    SELECT TOP 1000 *
    FROM dbo.bigTransactionHistory AS bth
    WHERE
        bth.ProductId = p.ProductId
    ORDER BY
        TransactionDate DESC
) AS x
WHERE
    p.ProductId BETWEEN 1000 AND 7500
GO
NEWLY ADDED COLUMN
ALTER TABLE dbo.bigTransactionHistory
ADD CustomerId INT NULL
After the above column was added, the query took 17 seconds - that is, 17 times slower. The non-clustered index was now missing the CustomerId column. As soon as CustomerId was included in the index, the problem was gone.
Question: CustomerId seemed to be the culprit until it was added to the index. But how???
The execution plan would answer this, but I'll make a guess: the non-clustered index was no longer enough to satisfy the query after the additional column had been added. That can cause the index not to be used any more; it can also cause one clustered index seek per row.
Learn to read execution plans. Turn on the "actual execution plan" feature routinely for each query that you test.
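A minimal sketch of the fix the question describes, i.e. recreating the non-clustered index so it also carries the new column. The index name and key columns below are assumptions, not taken from the actual database:

CREATE NONCLUSTERED INDEX IX_bigTransactionHistory_ProductId
ON dbo.bigTransactionHistory (ProductId, TransactionDate DESC)
INCLUDE (CustomerId /* plus whatever other columns the index already carried */)
WITH (DROP_EXISTING = ON)   -- assumes an index of this name already exists and rebuilds it in place

With CustomerId included, the index can once again satisfy the query on its own, so the per-row clustered index seeks disappear.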

Performance Strategy for growing data

I know that performance tuning is something that needs to be done specifically for each environment. But I have put in my best effort to make the question clear, to see whether I am missing any possible improvements.
I have a table [TestExecutions] in SQL Server 2005. It has around 0.2 million records as of today and is expected to grow to 5 million in a couple of months.
CREATE TABLE [dbo].[TestExecutions]
(
[TestExecutionID] [int] IDENTITY(1,1) NOT NULL,
[OrderID] [int] NOT NULL,
[LineItemID] [int] NOT NULL,
[Manifest] [char](7) NOT NULL,
[RowCompanyCD] [char](4) NOT NULL,
[RowReferenceID] [int] NOT NULL,
[RowReferenceValue] [char](3) NOT NULL,
[ExecutedTime] [datetime] NOT NULL
)
CREATE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
I have the following two queries for the same purpose (Query 2 and Query 3). With 100 records in #OrdersForRC, Query 2 performs better (39% vs 47%), whereas with 10000 records in #OrdersForRC, Query 3 performs better (53% vs 33%), as per the execution plans.
In the initial few months of use, the #OrdersForRC table will have close to 100 records. It will gradually increase to 2500 records over a couple of months.
Of the following two approaches, which one is better for such an incrementally growing scenario? Or is there any strategy that makes one approach work better than the other even as the data grows?
Note: in Plan 2, the first query uses a Hash Match.
References
query optimizer operator choice - nested loops vs hash match (or merge)
Execution Plan Basics — Hash Match Confusion
Test Query
CREATE TABLE #OrdersForRC
(
OrderID INT
)
INSERT INTO #OrdersForRC
--SELECT DISTINCT TOP 100 OrderID FROM [TestExecutions]
SELECT DISTINCT TOP 5000 OrderID FROM LWManifestReceiptExecutions
--QUERY 2:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
INNER JOIN #OrdersForRC R
ON R.OrderID = H.OrderID
--QUERY 3:
SELECT H.OrderID,H.LineItemID,H.Manifest,H.RowCompanyCD,H.RowReferenceID
FROM dbo.[TestExecutions] (NOLOCK) H
WHERE OrderID IN (SELECT OrderID FROM #OrdersForRC)
DROP TABLE #OrdersForRC
Plan 1
Plan 2
As commented above, you have not specified the table definition of LWManifestReceiptExecutions or how many rows it contains, and you are selecting TOP N rows without an ORDER BY. Do you want TOP N random IDs, or IDs in a specific order, or does the order not matter to you?
If the order does matter, you can create an index on the column you need in the ORDER BY.
If OrderID is unique in the [dbo].[TestExecutions] table, you should mark the index as unique: drop it and recreate it as UNIQUE.
Drop Index [IX_TestExecutions_OrderID] ON [dbo].[TestExecutions]
CREATE UNIQUE INDEX [IX_TestExecutions_OrderID]
ON [dbo].[TestExecutions] ([OrderID])
INCLUDE ([LineItemID], [Manifest], [RowCompanyCD], [RowReferenceID])
You mentioned that the data keeps growing and will reach millions of rows in a couple of months. There is no need to worry: SQL Server can easily handle these queries with a properly built schema and indexes. When this data model starts hurting you can look at other options, but not now; I have seen people handle billions of rows in SQL Server.
I can see that you are comparing the queries on the basis of query cost and concluding that the query with the higher percentage is the more expensive one. That is not always the case. Query cost is based on the aggregate subtree cost of all iterators in the query plan, and the total estimated cost of an iterator is a simple sum of its I/O and CPU components. The cost values represent expected execution times (in seconds) on a particular hardware configuration, but on modern hardware these costs might be irrelevant.
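As an aside (a general technique, not something from the original post): to compare the two queries by actual work done rather than by estimated cost percentages, you can capture I/O and timing statistics for each run:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run Query 2 here, then Query 3, and compare logical reads, CPU time and elapsed time

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;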
Now, coming to your queries: you have written two queries to get the result, but they are not identical.
Plan 1, Query 1 (expressed with a JOIN): the query optimizer chooses a nested loops join, which is a good choice for this particular scenario. For every OrderID key in the table #OrdersForRC it seeks the value in dbo.[TestExecutions], until all rows are matched.
Plan 1, Query 2 (expressed with IN): the optimizer does the same thing as for Query 1, but there is an extra distinct sort (Sort plus Stream Aggregate). The reasoning is that you have expressed this query with IN and the table #OrdersForRC can contain duplicate rows, so eliminating them is necessary.
Plan 2, Query 1 (expressed with a JOIN): now that #OrdersForRC contains on the order of 1000 rows, the optimizer chooses a hash join over a loops join, because a loops join for that many rows would cost more than a hash join; the rows are also unordered and can contain NULLs, so a hash join is the perfect strategy here.
Plan 2, Query 2 (expressed with IN): the optimizer has chosen a distinct sort, for the same reason as in Plan 1, Query 2, and then a merge join, because the rows are now sorted on the ID column for both tables.
If you just mark the temp table column as NOT NULL and UNIQUE, it is more likely that you will get the same execution plan for both the IN and the JOIN versions:
CREATE TABLE #OrdersForRC
(OrderID INT not null Unique)
Execution plan

Tuning Select statement to obtain faster results

I have benefited from this website for a long time now. This is my first question on the site. It is regarding performance tuning a reporting query. Here it goes.
1.
SELECT Count(b1.primkey)
from tableA b1 --WITH (NOLOCK)
join tableA b2 --WITH (NOLOCK)
on b1.email = b2.email
and DateDiff(day, b2.BookedDate , b1.BookedDate) > 1
tableA has around 7 million rows. Email is a varchar(100) field, BookedDate is a datetime field, and primkey is a primary key column that is an int.
My purpose in writing this query is to find the count of entries that have the same email IDs but came in more than a day apart. This query takes about 45 minutes to run, and I really want to reduce the time it takes to execute.
Since this is for reporting, I tried in vain to use the WITH (NOLOCK) option to improve the read time. I have a columnstore index on tableA and I know that it is being used by the optimizer - I can see it in the execution plan. I am using SQL Server 2012.
Can someone tell me in such a case, what would be better? Using a nonclustered index on email or a nonclustered columnstore index on tableA?
Please help me.
Your query is relatively complex. You are essentially joining two tables that have 7 million records each on a column that is not unique.
How about the following query instead:
select Email
from TableA
group by Email
having MAX(BookedDate) > MIN(BookedDate) + 1
Also make sure you have an index with Email and BookedDate.
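A minimal sketch of such an index, using the column names from the question (the index name itself is made up):

CREATE NONCLUSTERED INDEX IX_tableA_Email_BookedDate
ON tableA (Email, BookedDate);
-- Email as the leading key lets the GROUP BY be computed with a stream aggregate
-- over the index order, and BookedDate is available for the MIN/MAX without lookups.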
Hope this helps.
You have 3 options here:
1. Create a clustered index on the email field, at least for the larger table. But I suppose there are other queries running on these tables, and the clustered index is needed on other fields.
2. Move the emails to another table and store email IDs in TableA and TableB; a join on an int field would be much faster than on varchar fields.
3. Create an index on the email field with BookedDate as an included column (no need to include primkey; you can count on another field, or use COUNT(*)). Code: CREATE INDEX idx_email ON TableA (Email) INCLUDE (BookedDate)
I think the third option is the one you should go with. There's not much work to be done, and there will be a great performance gain. The only problem is that an index on a varchar field will take a lot of space and impact insert/update operations; but you said this is a reporting DB, so I think you can allow that.
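For comparison, a rough sketch of what option 2 could look like; the dbo.Emails table, its columns, and the EmailID column in tableA are hypothetical names, not part of the original schema:

CREATE TABLE dbo.Emails
(
    EmailID INT IDENTITY(1,1) PRIMARY KEY,
    Email varchar(100) NOT NULL UNIQUE
);

-- tableA would then store an int EmailID instead of the varchar Email, and the
-- reporting aggregate would join back only to display the address:
SELECT e.Email
FROM tableA a
INNER JOIN dbo.Emails e ON e.EmailID = a.EmailID
GROUP BY e.Email
HAVING MAX(a.BookedDate) > MIN(a.BookedDate) + 1;

The trade-off is that this is more work than option 3, since existing data and queries have to be migrated to the new key.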

Suitable indexes for sorting in ranking functions

I have a table which keeps parent-child-relations between items. Those can be changed over time, and it is necessary to keep a complete history so that I can query how the relations were at any time.
The table is something like this (I removed some columns and the primary key etc. to reduce noise):
CREATE TABLE [tblRelation](
[dtCreated] [datetime] NOT NULL,
[uidNode] [uniqueidentifier] NOT NULL,
[uidParentNode] [uniqueidentifier] NOT NULL
)
My query to get the relations at a specific time is like this (assume @dt is a datetime variable with the desired date):
SELECT *
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY r.uidNode ORDER BY r.dtCreated DESC) ix, r.*
FROM [tblRelation] r
WHERE (r.dtCreated < @dt)
) r
WHERE r.ix = 1
This query works well. However, the performance is not yet as good as I would like. When looking at the execution plan, it basically boils down to a clustered index scan (36% of cost) and a sort (63% of cost).
What indexes should I use to make this query faster? Or is there a better way altogether to perform this query on this table?
The ideal index for this query would have key columns uidNode, dtCreated and include all remaining columns in the table, making the index covering, since you are returning r.*. If the query will generally return only a relatively small number of rows (as seems likely given the WHERE r.ix = 1 filter), it might not be worthwhile making the index covering, though, as the cost of the key lookups might not outweigh the negative effects of the large index on CUD statements.
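A sketch of that index, limited to the columns shown in the trimmed-down table definition (the index name is made up):

CREATE NONCLUSTERED INDEX IX_tblRelation_uidNode_dtCreated
ON tblRelation (uidNode, dtCreated DESC)
INCLUDE (uidParentNode);
-- dtCreated DESC matches the ORDER BY r.dtCreated DESC inside ROW_NUMBER(),
-- so each uidNode partition can be read already sorted; the INCLUDE column
-- makes the index covering for the columns shown in the question.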
The window/ranking functions on SQL Server 2005 are sometimes not that optimal (based on answers here); apparently they are better in SQL Server 2008.
Another alternative is something like this. I'd have a non-clustered index on (uidNode, dtCreated) INCLUDE-ing any other columns required by the SELECT, subject to what Martin Smith said about lookups.
WITH MaxPerUid AS
(
    SELECT
        r.uidNode,
        MAX(r.dtCreated) AS MAXdtCreated
    FROM
        [tblRelation] r
    WHERE
        r.dtCreated < @dt
    GROUP BY
        r.uidNode
)
SELECT
    ...
FROM
    MaxPerUid M
JOIN
    [tblRelation] R ON M.uidNode = R.uidNode AND M.MAXdtCreated = R.dtCreated

Resources