In T-SQL, Why Does an Aggregate On a Subquery Run Faster - sql-server

The following two queries have the exact same execution plan. They go against the same heap table, which has no indexes. I am only returning the top 5000 rows because this table is pretty huge.
Also, this table is read thousands of times a day and is refreshed nightly, so I am pretty sure the entire table is in memory.
Query 1 (Standard Aggregate, ran in 1:16)
SELECT TOP 5000 DB,
       SourceID,
       Date,
       Units = SUM(Units),
       WeightedUnits = SUM(WeightedUnits)
FROM dbo.Stats_Master_Charge_Details
GROUP BY DB,
         SourceID,
         Date
Query 2 (Subquery, ran in 1:11)
SELECT TOP 5000 x.DB,
       x.SourceID,
       x.Date,
       Units = SUM(x.Units),
       WeightedUnits = SUM(x.WeightedUnits)
FROM (SELECT DB,
             SourceID,
             Date,
             Units,
             WeightedUnits
      FROM dbo.Stats_Master_Charge_Details) x
GROUP BY x.DB,
         x.SourceID,
         x.Date
Here is an image of the execution plan as well.
What am I missing here? Why would the subquery version be faster? Given the identical results and execution plan, I would have expected the exact same performance.

Related

SSRS: Why am I getting this aggregation?

I've recently discovered that SSRS is doing a bizarre aggregation and I really don't understand why. In this report I'm building, as with other SQL queries I've written, I tend to take preliminary results from an initial query, throw them into a temp table, and then run another query that joins on that temp table to get the 'final' results I need to display. Here's an example:
--1. This query fetches all available rows based on the day (must be last day of month)
SELECT DISTINCT Salesperson ,c.Cust_Alias ,cost ,eomonth(CreateDate) createdate ,FaxNumber
INTO #equip
FROM PDICompany_2049_01.dbo.Customers c
JOIN PDICompany_2049_01.dbo.Customer_Locations cl ON c.Cust_Key = cl.CustLoc_Cust_Key
JOIN ricocustom..Equipment_OLD e ON e.FaxNumber = c.Cust_ID + '/' + cl.CustLoc_ID
JOIN PDICompany_2049_01.dbo.Charges ch ON ch.Chg_CustLoc_Key = cl.CustLoc_Key
WHERE Salesperson = @Salesperson
AND ch.Chg_Balance = 0
--2. This query fetches first result set, but filters further for matching date variable
SELECT DISTINCT (cost) EquipCost ,Salesperson ,DATEPART(YEAR, CreateDate) YEAR
,DATEPART(MONTH, CreateDate) MONTH ,Cust_Alias ,FaxNumber
INTO #equipcost
FROM #equip
WHERE Salesperson = @Salesperson
AND DATEPART(MONTH, CreateDate) = DATEPART(MONTH, @Start)
AND DATEPART(YEAR, CreateDate) = DATEPART(YEAR, @Start)
ORDER BY Cust_Alias
--3. Finally, getting sum of the EquipCost, with other KPI's, to put into my final result set
SELECT sum(EquipCost) EquipCost ,Salesperson ,YEAR ,MONTH ,Cust_Alias
INTO #temp_equipcost
FROM #equipcost
GROUP BY Salesperson ,year ,month ,Cust_Alias
Now, in hindsight I am aware that I could easily have reduced this to two queries instead of three (and I have since folded it into a single query). But that's where I'm looking for the answer. In my GUI report, one row was showing 180 for EquipCost while my query was returning 60. It wasn't until I rewrote the query as a single statement (instead of the three above) that the report started displaying 60, even though the query itself returns 60 either way.
I actually had this happen in another query as well, where I had 2 temp table result sets, but when I condensed it into one, my GUI report worked as expected.
Any ideas why using multiple temp tables would affect my results in the GUI report in SQL Report Builder (not using VB here!) while the same SQL within SSMS works as expected? And to be clear, condensing the query as described was the only change I made; the report in Report Builder is extremely basic, so there is nothing unusual going on with grouping, expressions, etc.
My best guess is that you accidentally ended up in a situation where you did not properly clear the temp tables (or you populated them more than once). As an alternative to temp tables, you could use table variables. Equally, you could use a single query against the production tables, using CTEs if you want it to "feel" like three separate queries.
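For illustration, here is a minimal sketch of that single-query approach using chained CTEs, reusing the table, column, and parameter names from the queries above (treat those names and the exact column list as assumptions about your schema):
-- Hedged sketch: the three temp-table steps above collapsed into chained CTEs.
-- All object and parameter names are taken from the question; adjust as needed.
WITH equip AS (
    SELECT DISTINCT Salesperson, c.Cust_Alias, cost,
           EOMONTH(CreateDate) AS createdate, FaxNumber
    FROM PDICompany_2049_01.dbo.Customers c
    JOIN PDICompany_2049_01.dbo.Customer_Locations cl ON c.Cust_Key = cl.CustLoc_Cust_Key
    JOIN ricocustom..Equipment_OLD e ON e.FaxNumber = c.Cust_ID + '/' + cl.CustLoc_ID
    JOIN PDICompany_2049_01.dbo.Charges ch ON ch.Chg_CustLoc_Key = cl.CustLoc_Key
    WHERE Salesperson = @Salesperson
      AND ch.Chg_Balance = 0
),
equipcost AS (
    SELECT DISTINCT cost AS EquipCost, Salesperson,
           DATEPART(YEAR, createdate) AS [Year],
           DATEPART(MONTH, createdate) AS [Month],
           Cust_Alias, FaxNumber
    FROM equip
    WHERE DATEPART(MONTH, createdate) = DATEPART(MONTH, @Start)
      AND DATEPART(YEAR, createdate) = DATEPART(YEAR, @Start)
)
SELECT SUM(EquipCost) AS EquipCost, Salesperson, [Year], [Month], Cust_Alias
FROM equipcost
GROUP BY Salesperson, [Year], [Month], Cust_Alias;
Because everything is produced by a single statement, no temp table can carry rows over from a previous run, which is the failure mode guessed at above.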

Same query, same DB, different execution plans & dramatically different times to execute

I'm confronted with a problem I cannot get my mind around. We're running SQL Server 2012. I have run into a pair of essentially identical queries which yield different execution plans and dramatically different times to execute (1 sec vs 40+ sec)... and they even return the exact same records. The only difference between them is the category the records are queried by.
This query runs in 1 second:
SELECT P.idProduct, P.sku, P.description, P.price, P.listhidden, P.listprice, P.serviceSpec, P.bToBPrice, P.smallImageUrl,P.noprices,P.stock, P.noStock,P.pcprod_HideBTOPrice,P.pcProd_BackOrder,P.FormQuantity,P.pcProd_BTODefaultPrice,cast(P.sDesc as varchar(8000)) sDesc, 0, 0, P.pcprod_OrdInHome, P.sales, P.pcprod_EnteredOn, P.hotdeal, P.pcProd_SkipDetailsPage
FROM products P INNER JOIN categories_products CP ON P.idProduct = CP.idProduct
WHERE CP.idCategory=494 AND active=-1 AND configOnly=0 and removed=0 AND formQuantity=0
AND ((SELECT TOP 1 SP.stock FROM products SP WHERE SP.pcprod_ParentPrd = P.idProduct AND SP.description LIKE N'%(9-12 Months)' AND SP.removed=0) > 0)
ORDER BY P.description Asc
The second runs 40 seconds or more, but the ONLY difference is the idCategory queried:
SELECT P.idProduct, P.sku, P.description, P.price, P.listhidden, P.listprice, P.serviceSpec, P.bToBPrice, P.smallImageUrl,P.noprices,P.stock, P.noStock,P.pcprod_HideBTOPrice,P.pcProd_BackOrder,P.FormQuantity,P.pcProd_BTODefaultPrice,cast(P.sDesc as varchar(8000)) sDesc, 0, 0, P.pcprod_OrdInHome, P.sales, P.pcprod_EnteredOn, P.hotdeal, P.pcProd_SkipDetailsPage
FROM products P INNER JOIN categories_products CP ON P.idProduct = CP.idProduct
WHERE CP.idCategory=628 AND active=-1 AND configOnly=0 and removed=0 AND formQuantity=0
AND ((SELECT TOP 1 SP.stock FROM products SP WHERE SP.pcprod_ParentPrd = P.idProduct AND SP.description LIKE N'%(9-12 Months)' AND SP.removed=0) > 0)
ORDER BY P.description Asc
They even return the exact same records in the exact same order.
Execution plan for the 1st query:
Execution plan for the 2nd query:
[EDIT] The plans here are the actual, not the estimated, execution plans.
The categories_products table is a simple lookup table with only the two fields idCategory and idProduct. Even the records returned are exactly the same (it just happens to be that for SP.description LIKE N'%(9-12 Months)', the same products are assigned to these two categories). The only other difference between the two is that CP.idCategory 628 was just created this morning (but I don't see what difference that could make).
[EDIT: but that's exactly what did make the difference]
How can this be? How can simply changing the CP.idCategory queried here yield a different execution plan, and even more importantly: how is it that one takes some 40 times as long to execute?
Ultimately, I'm at a loss to figure out how to improve the dreadful performance of the 2nd query given that there's no essential difference between the two that I can understand.
The problem is the description column. The descriptions under idCategory 628 are longer than those under idCategory 494, and because you are filtering with SP.description LIKE N'%(9-12 Months)', a leading-wildcard pattern that has to examine the whole value, longer descriptions make the query slower.
You are also sorting on that column with ORDER BY P.description Asc.

Inefficient Query Plans in SQL Server 2008 R2

Good Day,
We experience ongoing issues with our databases which our internal DBAs are unable to explain.
Using the below query example:
Select Distinct
Date,
AccountNumber,
Region,
Discount,
ActiveBalance
Into
#sometemptable
From
anothertable With (Index(ondate)) --use this or the query takes much longer
Where
Date >='7/1/2013'
And ActiveBalance > 0
And Discount <> '0' and discount is not null
This query will often run for an hour plus before I end up needing to kill it.
However, if I run the query as follows:
Select Distinct
Date,
AccountNumber,
Region,
Discount,
ActiveBalance
Into
#sometemptable
From
anothertable With (Index(ondate)) --use this or the query takes much longer
Where
Date Between '7/1/2013' and '12/1/2013' --all of the dates are the first of the month
And ActiveBalance > 0
And Discount <> '0' and discount is not null
Followed by
Insert into #sometemptable
Select Distinct
Date,
AccountNumber,
Region,
Discount,
ActiveBalance
From
anothertable With (Index(ondate)) --use this or the query takes much longer
Where
Date Between '1/1/2014' and '6/1/2014' --all of the dates are the first of the month
And ActiveBalance > 0
And Discount <> '0' and discount is not null
I can run the query in less than 10 minutes. The particular tables I'm hitting are updated monthly. Statistics updates are run on these tables both monthly and weekly. Our DBAs, as mentioned before, do not understand why the top query takes so much longer than the combination of the smaller queries.
Any ideas? Any suggestions would be greatly appreciated!
Thanks,
Ron
This is just a guess, but when you use Date >= '7/1/2013', SQL Server estimates approximately how many rows it will return, and if that estimate exceeds some internal threshold it will do a scan instead of a seek, reasoning that it needs to return so much of the data that a scan will be faster.
When you use the BETWEEN clause, SQL Server will do a seek because it knows it will not need to return the majority of the rows in that table.
I assume it is doing a table scan when you do the >= search. Once you post the execution plans, we will see for sure.
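As a rough way to test that guess before the plans are posted, you can compare the I/O each predicate generates; a minimal sketch, reusing the (assumed) table and index names from the question:
-- Rough check of the scan-vs-seek guess: compare the logical reads reported
-- for the open-ended range and the bounded range. Names come from the question.
SET STATISTICS IO ON;
SELECT COUNT(*)
FROM anothertable WITH (INDEX(ondate))
WHERE Date >= '7/1/2013';                      -- open-ended range
SELECT COUNT(*)
FROM anothertable WITH (INDEX(ondate))
WHERE Date BETWEEN '7/1/2013' AND '12/1/2013'; -- bounded range
SET STATISTICS IO OFF;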

Is there a quicker way of doing this type of query (finding inactive accounts)?

I have a very large table of wagering transactions. Let's say for the sake of the question I want to find the accounts of people who have wagered in the last year but not wagered in the last month, so I do something like this...
--query one
select accountnumber into #wageredrecently from activity
where _date >='2011-08-10' and transaction_type = 'Bet'
group by accountnumber
--query two
select accountnumber,firstname,lastname,email,sum(handle)
from activity a, customers c
where a.accountnumber = c.accountno
and transaction_type = 'Bet'
and _date >='2010-09-10'
and accountnumber not in (select * from #wageredrecently)
group by accountnumber,firstname,lastname,email
The problem is that this takes ages to return the data. Is there a quicker way to achieve the same thing in SQL?
Edit, just to be specific about the time: it takes just over 3 minutes, which is far too long for a query that is destined for a PHP intranet page.
Edit (11/09/2011): I've found out that the problem is the customers table. It's actually a view. It previously had good performance, but now all of a sudden its performance is terrible; a simple query on it takes almost as long as the query pair above. I have therefore switched to an alternative table of customer data (one that actually is a table, not a view) and now the query pair takes about 15 seconds.
You should try to join customers after you have found and aggregated the rows from activity (I assume that handle is a column in activity).
select c.accountno,
       c.firstname,
       c.lastname,
       c.email,
       a.sumhandle
from customers as c
inner join (
    select accountnumber,
           sum(handle) as sumhandle
    from activity
    where _date >= '2010-09-10' and
          transaction_type = 'bet' and
          accountnumber not in (
              select accountnumber
              from activity
              where _date >= '2011-08-10' and
                    transaction_type = 'bet'
          )
    group by accountnumber
) as a
    on c.accountno = a.accountnumber
I also included your first query as a subquery instead. I'm not sure what that will do for performance; it could be better or it could be worse, so you have to test on your data.
I don't know your exact business need, but rarely will someone need access to inactive accounts over several months at a moment's notice. Depending on when you purge data, this may get worse.
You could create an indexed view that contains the last transaction date for each account:
max(_date) as RecentTransaction
If this table gets too large, it could be partitioned by year or month of the activity.
Have you considered adding an index on _date to the activity table? It's probably taking so long because it has to do a full table scan on that column when you're comparing the dates. Also, is transaction_type indexed as well? Otherwise, the other index wouldn't do you any good.
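For illustration, a hedged sketch of such an index, reusing the column names from the queries above (the index name and the INCLUDE list are assumptions):
-- Hypothetical index supporting the _date / transaction_type predicates above;
-- the name and the INCLUDE list are assumptions, so tailor them to the real workload.
CREATE NONCLUSTERED INDEX IX_activity_date_type
    ON activity (_date, transaction_type)
    INCLUDE (accountnumber, handle);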
Answering my own question, as the problem wasn't the structure of the query but one of the tables being used. It was a view, and its performance was terrible. I changed to an actual table containing the customer data and reduced the execution time to about 15 seconds.

Performant way to get the maximum value of a running total in TSQL

We have a table of transactions which is structured like the following:
TranxID int (PK and Identity field)
ItemID int
TranxDate datetime
TranxAmt money
TranxAmt can be positive or negative, so the running total of this field (for any ItemID) will go up and down as time goes by. Getting the current total is obviously simple, but what I'm after is a performant way of getting the highest value of the running total and the TranxDate when this occurred. Note that TranxDate is not unique, and due to some backdating the ID field is not necessarily in the same sequence as TranxDate for a given Item.
Currently we're doing something like this (@tblTranx is a table variable containing just the transactions for a given Item):
SELECT TOP 1 @HighestTotal = z.TotalToDate, @DateHighest = z.TranxDate
FROM (SELECT a.TranxDate, a.TranxID, SUM(b.TranxAmt) AS TotalToDate
      FROM @tblTranx AS a
      INNER JOIN @tblTranx AS b ON a.TranxDate >= b.TranxDate
      GROUP BY a.TranxDate, a.TranxID) AS z
ORDER BY z.TotalToDate DESC
(The TranxID grouping removes the issue caused by duplicate date values)
This, for one Item, gives us the HighestTotal and the TranxDate when this occurred. Rather than run this on the fly for tens of thousands of entries, we only calculate this value when the app updates the relevant entry and record the value in another table for use in reporting.
The question is, can this be done in a better way so that we can work out these values on the fly (for multiple items at once) without falling into the RBAR trap (some ItemIDs have hundreds of entries). If so, could this then be adapted to get the highest values of subsets of transactions (based on a TransactionTypeID not included above). I'm currently doing this with SQL Server 2000, but SQL Server 2008 will be taking over soon here so any SQL Server tricks can be used.
SQL Server sucks at calculating running totals.
Here's a solution for your very query (which groups by dates):
-- q: total transaction amount per date
WITH q AS
(
    SELECT TranxDate, SUM(TranxAmt) AS TranxSum
    FROM t_transaction
    GROUP BY TranxDate
),
-- m: running total, walking forward one day at a time from the earliest date
m (TranxDate, TranxSum) AS
(
    SELECT MIN(TranxDate), SUM(TranxAmt)
    FROM (
        SELECT TOP 1 WITH TIES *
        FROM t_transaction
        ORDER BY TranxDate
    ) q
    UNION ALL
    SELECT DATEADD(day, 1, m.TranxDate),
           m.TranxSum + q.TranxSum
    FROM m
    CROSS APPLY
    (
        SELECT TranxSum
        FROM q
        WHERE q.TranxDate = DATEADD(day, 1, m.TranxDate)
    ) q
    WHERE m.TranxDate <= GETDATE()
)
-- the highest running total and the date on which it occurred
SELECT TOP 1 *
FROM m
ORDER BY TranxSum DESC
OPTION (MAXRECURSION 0)
You need to have an index on TranxDate for this to work fast.
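For example, a minimal sketch of such an index, assuming the t_transaction table used above (the index name is made up):
-- Hypothetical supporting index: lets the q CTE above aggregate TranxAmt by
-- TranxDate without scanning the whole table. The name is an assumption.
CREATE NONCLUSTERED INDEX IX_t_transaction_TranxDate
    ON t_transaction (TranxDate)
    INCLUDE (TranxAmt);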
