I have an issue in SQL Server that I can't figure out how to solve. I have a large product table (25m records) with a single full-text search column.
Running the following query takes about 1s:
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
SELECT TOP 15
[ProductID], [EAN], [BrandID], [ShopID],
[CategoryID], [DeliveryID], [ProductPrice],
[ShippingCosts]
-- ,count(ProductID) over()
FROM
product WITH (nolock)
WHERE
CONTAINS(Search, 'Samsung AND Galaxy')
To get the total number of matching records I tried various approaches with subqueries and so on, but adding count(ProductID) over() seemed like the cleanest solution.
Adding that total count to the query makes it very slow: it now takes about 1m30s. Switching from contains to containstable, or using freetext, makes no difference.
I included the execution plan. There are some strange values in it (an 868% Table Spool?).
Repopulating the full-text index and rebuilding statistics made no difference, though.
Does anyone have an idea how to speed up the count?
Execution plan
It's taking a long time with the count because it has to evaluate every matching row to produce the total. The plan looks like it's finding one of your top 15, seeking the table for the count, then repeating. Without the count, it just takes the top 15 rows from the table that match the criteria.
This may be faster but still slower than your select without an aggregate function.
CHECKPOINT;
GO
DBCC DROPCLEANBUFFERS;
GO
;WITH cte AS (
select
[ProductID]
,[EAN]
,[BrandID]
,[ShopID]
,[CategoryID]
,[DeliveryID]
,[ProductPrice]
,[ShippingCosts]
from product with (nolock)
where
contains(Search,'Samsung AND Galaxy'))
select top 15
[ProductID]
,[EAN]
,[BrandID]
,[ShopID]
,[CategoryID]
,[DeliveryID]
,ProductPrice
,ShippingCosts
,count(ProductID) over()
FROM cte
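Another variation you could try (just a sketch using the same columns and search term as your query; I haven't verified whether it avoids the spool) is to count the matches once and cross join that single-row total onto the TOP 15:
;WITH matches AS (
    -- every row matching the full-text predicate
    SELECT [ProductID], [EAN], [BrandID], [ShopID],
           [CategoryID], [DeliveryID], [ProductPrice], [ShippingCosts]
    FROM product WITH (NOLOCK)
    WHERE CONTAINS(Search, 'Samsung AND Galaxy')
),
total AS (
    -- single-row total of all matches
    SELECT COUNT(*) AS TotalRows FROM matches
)
SELECT TOP 15 m.*, t.TotalRows
FROM matches AS m
CROSS JOIN total AS t;
Bear in mind that SQL Server does not materialize CTEs, so the matches CTE may be evaluated more than once; compare the actual plans to see which variant wins.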
Related
In SQL Server 2000, I want to add a computed column which basically is MAX(column1).
Of course I get an error because subqueries are not allowed.
What I'm basically trying to do is get the max(dateandtime) across a number of tables in my database.
However, when I run my code it takes too long, because it's a very old, badly designed database with no keys or indexes.
So I believe that by adding a new computed column holding the max(dateandtime), my query will run much, much faster, because I will be able to query
(SELECT TOP 1 newcomputedcolumn FROM Mytable)
and I will not have to do
(SELECT TOP 1 dateandtime FROM Mytable
ORDER BY dateandtime DESC)
or
(SELECT MAX(dateandtime) FROM Mytable)
which takes too long.
Any ideas? Thanks a lot.
This part of the query is what makes it so slow; unfortunately I can't see a way around it, only ways to optimize it.
update #db set contents = i.contents
from (select distinct
(select max(ac.contents) from ##dwv d
left join ##calendar c on 1=1
left join #db ac on d.id = ac.id
and c.ReportingPeriod = ac.DateValue and ac.Data_Type = 'ActivePeriod'
where d.ID = dd.id and month_number >= (cc.month_number-3)
and month_number <= cc.month_number) contents
,dd.id
,cc.ReportingPeriod
from #db dd
left join ##calendar cc on cc.ReportingPeriod = dd.DateValue
where dd.Data_Type = 'ActivePeriod'
)i
where i.id = #db.id and i.ReportingPeriod = #db.DateValue
I was trying a merge first, but wasn't getting anywhere fast, so the above puppy came to be.
The idea is to mark every customer as active in any given period (year and month, in the format 'YYYYMM') according to a specific algorithm, so for every customer that matches the report criteria I need a row telling me whether they were active (that is, bought something recently).
#db is a temp table where I'm gathering all the data that will later be aggregated to produce the report - a large table of several million rows, depending on the timeframe:
Create table #db
(
C_P varchar(6)
,Data_Type varchar(20)
,id int
,contents int
,DateValue varchar(10)
)
##dwv is a temp table into which I'm dumping the result of a select on a large view (which is itself very slow); it holds about 2.4 million rows.
##calendar is an ad-hoc table which stores every period the report encompasses, in the same 'YYYYMM' format:
select CONVERT(char(6), cast(#startdate as date), 112) "CP"
,CONVERT(char(6), cast(PKDate as date), 112) "RP"
,(ROW_NUMBER() over (order by (CONVERT(char(6), cast(PKDate as date), 112)) asc))-1
as month_number
into ##calendar
from [calendar].[dbo].[days]
where PKDate between #startdate and #enddate2
group by CONVERT(char(6), cast(PKDate as date), 112)
The query plan tells me that the c.ReportingPeriod = ac.DateValue bit is the culprit - it takes 88% of the subquery cost, which in turn accounts for 87% of the cost of the whole query.
What am I not seeing here and how can I improve that?
Hash Joins usually mean that the columns used in the JOIN are not indexed.
Make sure you have covering indexes for these columns:
d.id = ac.id
and c.ReportingPeriod = ac.DateValue and ac.Data_Type
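For example, something along these lines could help (just a sketch - the index names are made up and the key choices follow the join and filter columns above):
-- support the join on id plus the Data_Type filter and the DateValue comparison
CREATE INDEX IX_db_id_type_date
    ON #db (id, Data_Type, DateValue) INCLUDE (contents);

-- support the join on ReportingPeriod
CREATE INDEX IX_calendar_period
    ON ##calendar (ReportingPeriod) INCLUDE (month_number);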
Just in case if someone stumbles here I'll explain what I did to trim down execution time from 32 minutes to 15 seconds.
One, as suggested in the comments and in Tab Alleman's answer, I looked at the indexes on the tables where a HASH JOIN showed up in the execution plan. I also took a closer look at the ON clauses of the joins, refining them here and there, which left fewer rows in the intermediate results. To be more specific: the inline query that fetches the 'contents' value for the update now runs against the source table '##dwv' joined to a preprocessed '##calendar' table, as opposed to a cross join between the two tables and then another join onto the result. This reduced the end dataset to a bare few hundred thousand rows instead of the 17 billion reported in the query plan.
The effect is that the report is now lightning quick compared to previous drafts, so much so that it can be run in a loop and still finish in more than reasonable time.
The bottom line is that you have to pay attention to what SQL Server complains about, but also at least look at the number of rows being crunched and try to lower it whenever possible. Indexing is good, but it's not "the miracle cure" for everything that ails your query.
Thanks to all who took the time to write here - when several people say similar things, it's always good to sit down and think it over.
I have a query in SQL Server 2008 R2 in the following form:
SELECT TOP (2147483647) *
FROM (
SELECT *
FROM sub_query_a
) hierarchy
LEFT JOIN (
SELECT *
FROM sub_query_b
) expenditure
ON hierarchy.x = expenditure.x AND hierarchy.y = expenditure.y
ORDER BY hierarchy.c, hierarchy.d, hierarchy.e
The hierarchy subquery contains UNIONS and INNER JOINS. The expenditure subquery is based on several levels of sub-subqueries, and contains UNIONS, INNER and LEFT JOINS, and ultimately, a PIVOT aggregate.
The hierarchy subquery by itself runs in 2 seconds and returns 467 rows. The expenditure subquery by itself runs in 7 seconds and returns 458 rows. Together, without the ORDER BY clause, the query runs in 11 seconds. However, with the ORDER BY clause, the query runs in 11 minutes.
The Actual Execution Plan reveals what's different. Without the ORDER BY clause, both the hierarchy and expenditure subqueries are run once each, with the results joined by a Merge Join (Right Outer Join). When the ORDER BY clause is included, the hierarchy query is still run once, but the expenditure portion is run once per row from the hierarchy query, and the results are joined by Nested Loops (Left Outer Join). It's as if the ORDER BY clause is causing the expenditure subquery to become a correlated subquery (which it is not).
To verify that SQL Server was actually capable of doing the query and producing a sorted result set in 11 seconds, as a test, I created a temp table and inserted the results of the query without the ORDER BY clause into it. Then I did a SELECT * FROM #temp_table ORDER BY c, d, e. The entire script took the expected 11 seconds, and returned the desired results.
I want to make the query work efficiently with the ORDER BY clause as one query--I don't want to have to create a stored procedure just to enable the #temp_table hacky solution.
Any ideas on the cause of this issue, or a fix?
To avoid nested loop joins, you can give an option to the compiler:
SELECT TOP (2147483647) *
FROM (
SELECT *
FROM sub_query_a
) hierarchy
LEFT JOIN (
SELECT *
FROM sub_query_b
) expenditure
ON hierarchy.x = expenditure.x AND hierarchy.y = expenditure.y
ORDER BY hierarchy.c, hierarchy.d, hierarchy.e
option (merge join, hash join)
I generally much prefer to have the optimizer figure out the right query plan. On rare occasions, however, I run into a problem similar to yours and need to make a suggestion to push it in the right direction.
Thanks to #MartinSmith's comment, I started looking at what could cause the major discrepancies between the estimated and actual rows delivered by the expenditure subquery in the non-ORDER BY version, even though I eventually wanted to ORDER it. I thought that if I could optimize that version a bit, it might benefit the ORDER BY version as well.
As I mentioned in the OP, the expenditure subquery contains a PIVOT aggregation across yet another subquery (let's call it unaggregated_expenditure). I added a layer between the PIVOT and the unaggregated_expenditure subquery which aggregated the required column before PIVOTing the same column across the required few pivot columns. This added a bit of conceptual complexity, yet was able to reduce the estimated number of rows coming from the PIVOT from 106,245,000 to 10,307. This change, when applied to the ORDER BY version of the whole query, resulted in a different Actual Execution Plan that was able to process and deliver the query within the desired 11 seconds.
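Schematically, the change looked something like this (a sketch only - the real column and pivot names in unaggregated_expenditure are different):
SELECT x, y, [cat1], [cat2], [cat3]
FROM (
    -- new layer: aggregate first, so the PIVOT sees far fewer rows
    SELECT x, y, category, SUM(amount) AS amount
    FROM unaggregated_expenditure
    GROUP BY x, y, category
) AS pre_aggregated
PIVOT (
    SUM(amount) FOR category IN ([cat1], [cat2], [cat3])
) AS p;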
I have been fighting with this all weekend and am out of ideas. In order to have pages in my search results on my website, I need to return a subset of rows from a SQL Server 2005 Express database (i.e. start at row 20 and give me the next 20 records). In MySQL you would use the "LIMIT" keyword to choose which row to start at and how many rows to return.
In SQL Server I found ROW_NUMBER()/OVER, but when I try to use it it says "Over not supported". I am thinking this is because I am using SQL Server 2005 Express (free version). Can anyone verify if this is true or if there is some other reason an OVER clause would not be supported?
Then I found the old school version similar to:
SELECT TOP X * FROM TABLE WHERE ID NOT IN (SELECT TOP Y ID FROM TABLE ORDER BY ID) ORDER BY ID
where X = number per page and Y = which record to start on.
However, my queries are a lot more complex with many outer joins and sometimes ordering by something other than what is in the main table. For example, if someone chooses to order by how many videos a user has posted, the query might need to look like this:
SELECT TOP 50 iUserID, iVideoCount
FROM MyTable
LEFT OUTER JOIN (
    SELECT count(iVideoID) AS iVideoCount, iUserID
    FROM VideoTable
    GROUP BY iUserID
) as TempVidTable ON MyTable.iUserID = TempVidTable.iUserID
WHERE iUserID NOT IN (
    SELECT TOP 100 iUserID, iVideoCount
    FROM MyTable
    LEFT OUTER JOIN (
        SELECT count(iVideoID) AS iVideoCount, iUserID
        FROM VideoTable
        GROUP BY iUserID
    ) as TempVidTable ON MyTable.iUserID = TempVidTable.iUserID
    ORDER BY iVideoCount
)
ORDER BY iVideoCount
The issue is in the subquery SELECT line: TOP 100 iUserID, iVideoCount
To use the "NOT IN" clause it seems I can only have one column in the subquery ("SELECT TOP 100 iUserID FROM ..."). But when I don't include iVideoCount in that subquery SELECT statement, the ORDER BY iVideoCount in the subquery doesn't order correctly, so the subquery ends up ordered differently from the parent query, making this whole thing useless. There are about 5 more tables linked in with outer joins that can play a part in the ordering.
I am at a loss! The two methods above are the only two ways I can find to get SQL Server to return a subset of rows. I am about ready to return the whole result and loop through each record in PHP, only displaying the ones I want. That is such an inefficient way to do things that it is really my last resort.
Any ideas on how I can make SQL Server mimic MySQL's LIMIT clause in the above scenario?
Unfortunately, although SQL Server 2005's Row_Number() can be used for paging, and SQL Server 2012 enhances paging support further with Order By plus Offset and Fetch Next, if you cannot use either of those solutions you can instead:
create a temp table with an identity column,
then insert the data into the temp table with an ORDER BY clause,
and use the temp table identity column value just like the ROW_NUMBER() value.
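A rough sketch of that approach, reusing the tables from your question (the temp table and column order are just placeholders, and 50 rows per page is assumed):
-- 1) temp table whose identity column will act as the row number
CREATE TABLE #paged (
    RowNum      int IDENTITY(1,1) PRIMARY KEY,
    iUserID     int,
    iVideoCount int
);

-- 2) insert in the desired order; identity values are assigned following the ORDER BY
INSERT INTO #paged (iUserID, iVideoCount)
SELECT MyTable.iUserID, ISNULL(TempVidTable.iVideoCount, 0) AS iVideoCount
FROM MyTable
LEFT OUTER JOIN (
    SELECT count(iVideoID) AS iVideoCount, iUserID
    FROM VideoTable
    GROUP BY iUserID
) AS TempVidTable ON MyTable.iUserID = TempVidTable.iUserID
ORDER BY ISNULL(TempVidTable.iVideoCount, 0);

-- 3) page by filtering on the identity value (rows 101-150 = page 3 at 50 per page)
SELECT iUserID, iVideoCount
FROM #paged
WHERE RowNum BETWEEN 101 AND 150;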
I hope it helps,
We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering + 1
The problem is, it goes really slowly (several minutes on a table with about 10k entries) - I tried creating separate indexes on Machine_ID and Date_Time, as well as a single combined index, but nothing helps.
Is there any way to rewrite this query so it runs faster?
The ROW_NUMBER() partitioning and ordering you use require an index on (Machine_ID, Date_Time) to be satisfied in a single pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on (Machine_ID, Date_Time), and for correct results I'm assuming that this combination is unique.
You haven't mentioned what is hidden away in *, and that can sometimes mean a lot, since a (Machine_ID, Date_Time) index will not generally be covering - and if you have a lot of columns there, or they carry a lot of data, ...
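If the extra columns are few, one option is to make the index covering (a sketch; OtherCol1 and OtherCol2 are placeholders for whatever the SELECT actually reads):
-- INCLUDE the non-key columns the query reads, so the index alone can answer it
CREATE INDEX idxMachineIDDateTime_covering
    ON DataTable (Machine_ID, Date_Time)
    INCLUDE (OtherCol1, OtherCol2);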
If the number of rows in dbo.DataTable is large then it is likely that you are experiencing the issue due to the CTE joining onto itself. There is a blog post explaining the issue in some detail here.
Occasionally in such cases I have resorted to creating a temporary table, inserting the result of the CTE query into it and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required - in the case of a single join the performance difference will be less noticeable).
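Applied to the query above, that would look roughly like this (a sketch, not tested against your data):
-- materialize the numbered rows once instead of evaluating the CTE twice
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
INTO #Numbered
FROM dbo.DataTable;

-- an index on the join keys helps the self join
CREATE CLUSTERED INDEX IX_Numbered ON #Numbered (Machine_ID, Ordering);

SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM #Numbered AS [Current]
INNER JOIN #Numbered AS Previous
    ON [Current].Machine_ID = Previous.Machine_ID
   AND Previous.Ordering = [Current].Ordering + 1;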
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
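For reference, the general shape of an indexed view is below (a generic sketch only - note that ranking functions such as ROW_NUMBER are not allowed in indexed views, so the current CTE would have to be restructured before this could apply):
CREATE VIEW dbo.MachineReadingCounts
WITH SCHEMABINDING
AS
SELECT Machine_ID, COUNT_BIG(*) AS ReadingCount
FROM dbo.DataTable
GROUP BY Machine_ID;
GO
-- the unique clustered index is what makes the view "indexed"/materialized
CREATE UNIQUE CLUSTERED INDEX IX_MachineReadingCounts
    ON dbo.MachineReadingCounts (Machine_ID);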
What if you use a trigger to store the last timestamp and subtract it each time to get the difference?
If you require this data often, rather than calculating it each time you pull the data, why not add a column and calculate/populate it whenever a row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)
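A sketch of the trigger-plus-extra-column idea (assuming the Machine_ID / Date_Time columns from the question, with Date_Time as a datetime, and that each machine has at most one row per Date_Time):
ALTER TABLE dbo.DataTable ADD PreviousDateTime datetime NULL;
GO

CREATE TRIGGER trg_DataTable_PreviousDateTime
ON dbo.DataTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- for each newly inserted row, look up the latest earlier reading
    -- from the same machine and store it alongside the new row
    UPDATE d
    SET PreviousDateTime = (
            SELECT MAX(older.Date_Time)
            FROM dbo.DataTable AS older
            WHERE older.Machine_ID = d.Machine_ID
              AND older.Date_Time < d.Date_Time
        )
    FROM dbo.DataTable AS d
    INNER JOIN inserted AS i
        ON i.Machine_ID = d.Machine_ID
       AND i.Date_Time = d.Date_Time;
END
GO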