ROW_NUMBER OVER (ORDER BY date_column) - sql-server

I am wondering. I have a complex query which runs in a SQL Server 2005 Express edition in around 3 seconds.
The main table has around 300k rows.
When I add
ROW_NUMBER() OVER (ORDER BY date_column)
it takes 123 seconds while date_column is a datetime column.
If I do
ROW_NUMBER() OVER (ORDER BY string_title)
it runs in 3 seconds again.
I added an index on the datetime column. No change. Still 123 seconds.
Then I tried:
ROW_NUMBER() OVER (ORDER BY CAST(date_column AS int))
and the query runs in 3 seconds again.
Since the cast itself costs time, why does SQL Server behave like this?
UPDATE:
It seems that ROW_NUMBER ignores my WHERE clauses entirely and numbers every available row. Can anyone confirm that?
Here is a more readable version (still tons of logic :)) copied from SQL Server Management Studio:
SELECT ROW_NUMBER() OVER (ORDER BY xinfobase.lid) AS row_num, *
FROM xinfobase
LEFT OUTER JOIN [xinfobasetree] ON [xinfobasetree].[lid] = [xinfobase].[xlngfolder]
LEFT OUTER JOIN [xapptqadr] ON [xapptqadr].[lid] = [xinfobase].[xlngcontact]
LEFT OUTER JOIN [xinfobasepvaluesdyn] ON [xinfobasepvaluesdyn].[lparentid] = [xinfobase].[lid]
WHERE (xinfobase.xlngisdeleted=2
AND xinfobase.xlinvalid=2)
AND (xinfobase.xlngcurrent=1)
AND ( (xinfobase.lownerid = 1
OR (SELECT COUNT(lid)
FROM xinfobaseacl
WHERE xinfobaseacl.lparentid = xinfobase.lid
AND xlactor IN(1,-3,-4,-230,-243,-254,-255,-256,-257,-268,-589,-5,-6,-7,-8,-675,-676,-677,-9,-10,-864,-661,-671,-913))>0
OR xinfobasetree.xlresponsible = 1)
AND (xinfobase.lid IN (SELECT lparentid
FROM xinfobasealt a, xinfobasetree t
WHERE a.xlfolder IN(1369)
AND a.xlfolder = t.lid
AND dbo.sf_MatchRights(1, t.xtxtrights,'|')=1 )) )
AND ((SELECT COUNT(*) FROM dbo.fn_Split(cf_17,',')
WHERE [value] = 39)>0)
This query needs 2-3 seconds on 300k records.
When I change the ORDER BY to xinfobase.xstrtitle, it still runs in around 2-3 seconds.
If I switch to xinfobase.dtedit (a datetime column with an additional index I just added), it takes the 123 seconds I mentioned above.
I also tried to "cheat" by turning my statement into a subselect, to force SQL Server to retrieve the records first and apply ROW_NUMBER() in an outer statement. Same performance result.

UPDATE
Still frustrated about needing a workaround, I investigated further.
I removed all my existing indexes and ran several SQL statements against the tables.
It turns out that building new indexes with a different column order and different included columns fixed my issue, and the query is now fast with the dtedit (datetime) column as well.
So, lessons learned:
Take more care of your indexes and execution plans, and recheck them with every update (new version) of the software you produce...
But I am still wondering why CAST(datetime_column AS int) made it fast before...
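As an illustration of the kind of index change described in the update, here is a minimal sketch. The index name and the choice of key and included columns are assumptions made up for this example (the post does not say which indexes were actually built); only the table and column names come from the query above.

```sql
-- Hypothetical index: the equality-filtered columns first, then the
-- ORDER BY column, so matching rows can be read already sorted by dtedit.
-- The column choice is an assumption, not the poster's actual index.
CREATE NONCLUSTERED INDEX ix_xinfobase_filter_dtedit
    ON xinfobase (xlngisdeleted, xlinvalid, xlngcurrent, dtedit)
    INCLUDE (lownerid, xlngfolder, xlngcontact);

-- Report I/O and timings so runs before and after the change can be compared.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT ROW_NUMBER() OVER (ORDER BY xinfobase.dtedit) AS row_num,
       xinfobase.lid
FROM xinfobase
WHERE xinfobase.xlngisdeleted = 2
  AND xinfobase.xlinvalid = 2
  AND xinfobase.xlngcurrent = 1;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Whether the optimizer actually picks such an index depends on the rest of the query, so the execution plan is still worth checking after each change.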

Related

Add computed column using subquery

In SQL Server 2000, I want to add a computed column which basically is MAX(column1).
Of course I get an error because subqueries are not allowed.
What I am basically trying to do is get the max(dateandtime) of a number of tables in my database.
However, when I run my code it takes too long because it's a very old and badly designed database with no keys or indexes.
So I believe that by adding a new computed column holding the max(datetime), my query will be much faster, because I will query
(SELECT TOP 1 newcomputedcolumn FROM Mytable)
and I will not have to do
(SELECT TOP 1 dateandtime FROM Mytable
ORDER BY dateandtime DESC)
or
(SELECT MAX(dateandtime) FROM Mytable)
which takes too long.
Any ideas? Thanks a lot.
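One option that needs no computed column is a plain index on the date column, which can let SQL Server answer MAX() from one end of the index instead of scanning the whole table. A minimal sketch, assuming the table can be indexed; the index name is made up for this example:

```sql
-- Hypothetical index name; Mytable and dateandtime come from the question.
CREATE INDEX ix_Mytable_dateandtime ON Mytable (dateandtime);

-- With the index in place, this can be resolved by reading the last
-- index entry rather than every row.
SELECT MAX(dateandtime) AS latest
FROM Mytable;
```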

Slow performing T-SQL query with two joins to the same table

I am struggling with figuring out what is happening with the T-SQL query shown below.
You will see two inner joins to the same table, although with different join criteria. The first join by itself runs in approximately 21 seconds and if I run the second join by itself it completes in approximately 27 seconds.
If I leave both joins in place, the query runs and runs and runs, until I finally stop the query. The appropriate indices appear to be in place and I know this query runs in a different environment with less horsepower, the only difference being the other server is running SQL Server 2012 and I am running SQL Server 2016, although the database is in 2012 compatibility mode:
This join runs in ~21 seconds.
SELECT
COUNT(*)
FROM
dbo.SPONSORSHIP as s
INNER JOIN
dbo.SPONSORSHIPTRANSACTION AS st
ON st.SPONSORSHIPCOMMITMENTID = s.SPONSORSHIPCOMMITMENTID
AND st.TRANSACTIONSEQUENCE = (SELECT MIN(TRANSACTIONSEQUENCE)
FROM dbo.SPONSORSHIPTRANSACTION AS ms
WHERE ms.SPONSORSHIPCOMMITMENTID = s.SPONSORSHIPCOMMITMENTID
AND ms.TARGETSPONSORSHIPID = s.ID)
This join runs in ~27 seconds.
SELECT
COUNT(*)
FROM
dbo.SPONSORSHIP AS s
INNER JOIN
dbo.SPONSORSHIPTRANSACTION AS lt ON lt.SPONSORSHIPCOMMITMENTID = s.SPONSORSHIPCOMMITMENTID
AND lt.TRANSACTIONSEQUENCE = (SELECT MAX(TRANSACTIONSEQUENCE)
FROM dbo.SPONSORSHIPTRANSACTION AS ms
WHERE ms.SPONSORSHIPCOMMITMENTID = s.SPONSORSHIPCOMMITMENTID
AND s.ID IN (ms.CONTEXTSPONSORSHIPID,
ms.TARGETSPONSORSHIPID,
ms.DECLINEDSPONSORSHIPID)
AND ms.ACTIONCODE <> 9)
These are both correlated subqueries. You should typically avoid this pattern, as it can cause what is known as "RBAR" ("Row By Agonizing Row") processing. Before you focus on troubleshooting this particular query, I'd suggest revisiting the query itself to see if you can solve this with a more set-based approach. You'll find that in most cases there are other ways to accomplish this that cut the cost dramatically.
As one example:
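select
    total_count
    ,row_sequence
from
(
    SELECT
        -- COUNT(*) OVER () keeps the count valid next to a non-aggregated column
        total_count = COUNT(*) OVER ()
        -- add PARTITION BY s.ID here to number the transactions per sponsorship
        ,row_sequence = row_number() over(order by st.TRANSACTIONSEQUENCE asc)
    FROM
        dbo.SPONSORSHIP as s
        INNER JOIN dbo.SPONSORSHIPTRANSACTION AS st
            ON st.SPONSORSHIPCOMMITMENTID = s.SPONSORSHIPCOMMITMENTID
) as x
where
    x.row_sequence = 1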
This was a quick example that is not tested. For future reference, if you want the best answer, it's a great idea to provide a temp table or test data set so that someone can give you a full working example.
The example I gave shows what is called a window function. Look into them further: they help whenever you see the word "sequence" or need the first/last row in a group, and more.
Hope this gives you some ideas! Welcome to Stack Overflow! 👋
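To make the window-function idea more concrete, here is an untested sketch of how the first join (the MIN(TRANSACTIONSEQUENCE) one) might be rewritten. It assumes that TRANSACTIONSEQUENCE is unique within a SPONSORSHIPCOMMITMENTID; the CTE name and the rn column are made up for this example.

```sql
-- Number each sponsorship's matching transactions, lowest sequence first.
WITH ranked AS (
    SELECT s.ID AS SponsorshipID,
           st.TRANSACTIONSEQUENCE,
           ROW_NUMBER() OVER (PARTITION BY s.ID
                              ORDER BY st.TRANSACTIONSEQUENCE ASC) AS rn
    FROM dbo.SPONSORSHIP AS s
    INNER JOIN dbo.SPONSORSHIPTRANSACTION AS st
        ON st.SPONSORSHIPCOMMITMENTID = s.SPONSORSHIPCOMMITMENTID
       AND st.TARGETSPONSORSHIPID = s.ID
)
-- rn = 1 keeps only the first transaction per sponsorship,
-- replacing the correlated MIN() subquery.
SELECT COUNT(*)
FROM ranked
WHERE rn = 1;
```

Ordering by TRANSACTIONSEQUENCE DESC instead would give the last transaction per sponsorship, which is the shape the second join needs (with its own filter conditions moved into the join).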

SQL Server : pagination into Excel

I've got a large data set that I need to get into excel to get some pivot tables and analysis going.
I normally am able to do this as the data never reaches the 1 million line mark. I just do a SQL Server data import and specify my SQL statement.
Here is my current SQL
WITH n AS (
Select A1.AccountID, A1.ParentAccountID, A1.Name
FROM Account AS A1
WHERE A1.ParentAccountID = 92
UNION ALL
SELECT A2.AccountID, A2.ParentAccountID, A2.Name
FROM Account AS A2
JOIN n
ON A2.ParentAccountID=n.AccountID
)
select n.*, D.DeviceID, A.*, P.*
FROM n
LEFT OUTER JOIN
Device AS D
ON D.AccountID = n.AccountID
LEFT OUTER JOIN
Audit as A
ON A.AccountID = n.AccountID
RIGHT OUTER JOIN
DeviceAudit As P
ON P.AuditID = A.AuditID
WHERE A.AuditDate > CAST('2013-03-11' AS DATETIME)
ORDER BY n.AccountID ASC, P.DeviceID ASC, A.AuditDate DESC
Right now this returns 100% of what I need: 18 million records for the past 30 days. I was hoping there would be a simple way to fetch the next 100,000 or 500,000 records.
I can use TOP 100000 to get my first chunk, but I do not seem to have an offset available to me.
At present this runs and completes in 20 minutes. This is one of many account hierarchies that I have to do this for. Hopefully the pagination will not be too expensive CPU-wise.
I did try exporting to a CSV in the hope of importing it, but that just gives me a 12GB CSV file that I do not have time to break apart.
Yes, you can paginate with a subquery on the row number since SQL Server 2005. Add a row number to the select clause of your original query (and drop its outer ORDER BY, which is not allowed inside a subquery without TOP):
, ROW_NUMBER() OVER (ORDER BY {whatever id}) AS row_num
Then you can make your old query a subquery and filter on the row number:
SELECT TOP {results per page} *
FROM ({your previous sql statement}) AS numbered
WHERE row_num > {page# * results per page}
ORDER BY row_num
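A filled-in version of that template might look like the sketch below. It pages over the Audit table alone to stay short; the variable names, the CTE name, and the use of AuditID as a tie-breaker (assumed unique) are assumptions made for this example.

```sql
-- Page parameters (hypothetical names). Declared and set separately so the
-- script also runs on SQL Server 2005.
DECLARE @page INT, @page_size INT;
SET @page = 0;            -- zero-based page number
SET @page_size = 100000;  -- rows per chunk

WITH numbered AS (
    SELECT A.AuditID, A.AccountID, A.AuditDate,
           -- AuditID breaks ties so the numbering is stable between runs
           ROW_NUMBER() OVER (ORDER BY A.AccountID ASC,
                                       A.AuditDate DESC,
                                       A.AuditID ASC) AS row_num
    FROM Audit AS A
    WHERE A.AuditDate > CAST('2013-03-11' AS DATETIME)
)
SELECT AuditID, AccountID, AuditDate
FROM numbered
WHERE row_num >  @page * @page_size
  AND row_num <= (@page + 1) * @page_size
ORDER BY row_num;
```

Each page re-runs the numbering over the full filtered set, so for very large exports it may be cheaper to number the rows once into a table and read the chunks from there.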

SQL Server Pagination w/o row_number() or nested subqueries?

I have been fighting with this all weekend and am out of ideas. In order to have pages in my search results on my website, I need to return a subset of rows from a SQL Server 2005 Express database (i.e. start at row 20 and give me the next 20 records). In MySQL you would use the "LIMIT" keyword to choose which row to start at and how many rows to return.
In SQL Server I found ROW_NUMBER()/OVER, but when I try to use it it says "Over not supported". I am thinking this is because I am using SQL Server 2005 Express (free version). Can anyone verify if this is true or if there is some other reason an OVER clause would not be supported?
Then I found the old school version similar to:
SELECT TOP X * FROM TABLE WHERE ID NOT IN (SELECT TOP Y ID FROM TABLE ORDER BY ID) ORDER BY ID
where X = number per page and Y = which record to start on.
However, my queries are a lot more complex with many outer joins and sometimes ordering by something other than what is in the main table. For example, if someone chooses to order by how many videos a user has posted, the query might need to look like this:
SELECT TOP 50 iUserID, iVideoCount
FROM MyTable
LEFT OUTER JOIN (SELECT count(iVideoID) AS iVideoCount, iUserID FROM VideoTable GROUP BY iUserID) as TempVidTable
ON MyTable.iUserID = TempVidTable.iUserID
WHERE iUserID NOT IN (
SELECT TOP 100 iUserID, iVideoCount
FROM MyTable
LEFT OUTER JOIN (SELECT count(iVideoID) AS iVideoCount, iUserID FROM VideoTable GROUP BY iUserID) as TempVidTable
ON MyTable.iUserID = TempVidTable.iUserID
ORDER BY iVideoCount)
ORDER BY iVideoCount
The issue is in the subquery SELECT line: TOP 100 iUserID, iVideoCount
To use the "NOT IN" clause it seems I can only have 1 column in the subquery ("SELECT TOP 100 iUserID FROM ..."). But when I don't include iVideoCount in that subquery SELECT statement then the ORDER BY iVideoCount in the subquery doesn't order correctly so my subquery is ordered differently than my parent query, making this whole thing useless. There are about 5 more tables linked in with outer joins that can play a part in the ordering.
I am at a loss! The two methods above are the only two ways I can find to get SQL Server to return a subset of rows. I am about ready to return the whole result and loop through each record in PHP, displaying only the ones I want. That is such an inefficient way to do things that it is really my last resort.
Any ideas on how I can make SQL Server mimic MySQL's LIMIT clause in the above scenario?
ROW_NUMBER() can be used for paging since SQL Server 2005, and SQL Server 2012 adds ORDER BY ... OFFSET ... FETCH NEXT. If you cannot use either of these, you need to:
Create a temp table with an identity column.
Insert the data into the temp table with an ORDER BY clause.
Use the temp table's identity column value just like the ROW_NUMBER() value.
I hope it helps.
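Applied to the video-count example from the question, the temp-table steps could be sketched roughly as follows. The temp table name, the row_id column, and the INT column types are assumptions made for this example.

```sql
-- Temp table whose identity column plays the role of ROW_NUMBER().
CREATE TABLE #paged (
    row_id      INT IDENTITY(1, 1) PRIMARY KEY,
    iUserID     INT NOT NULL,
    iVideoCount INT NOT NULL
);

-- Identity values are assigned in the order given by ORDER BY.
INSERT INTO #paged (iUserID, iVideoCount)
SELECT m.iUserID, COALESCE(v.iVideoCount, 0)
FROM MyTable AS m
LEFT OUTER JOIN (SELECT iUserID, COUNT(iVideoID) AS iVideoCount
                 FROM VideoTable
                 GROUP BY iUserID) AS v
    ON v.iUserID = m.iUserID
ORDER BY COALESCE(v.iVideoCount, 0), m.iUserID;

-- Rows 101-150, i.e. skip 100 rows and take the next 50.
SELECT iUserID, iVideoCount
FROM #paged
WHERE row_id BETWEEN 101 AND 150
ORDER BY row_id;

DROP TABLE #paged;
```

Because the ordering is applied once at insert time, the sort column no longer has to appear in a NOT IN subquery, which sidesteps the one-column restriction described in the question.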

Optimizing ROW_NUMBER() in SQL Server

We have a number of machines which record data into a database at sporadic intervals. For each record, I'd like to obtain the time period between this recording and the previous recording.
I can do this using ROW_NUMBER as follows:
WITH TempTable AS (
SELECT *, ROW_NUMBER() OVER (PARTITION BY Machine_ID ORDER BY Date_Time) AS Ordering
FROM dbo.DataTable
)
SELECT [Current].*, Previous.Date_Time AS PreviousDateTime
FROM TempTable AS [Current]
INNER JOIN TempTable AS Previous
ON [Current].Machine_ID = Previous.Machine_ID
AND Previous.Ordering = [Current].Ordering - 1
The problem is that it runs really slowly (several minutes on a table with about 10k entries). I tried creating separate indexes on Machine_ID and Date_Time, and a single composite index, but nothing helps.
Is there any way to rewrite this query to go faster?
The given ROW_NUMBER() partition and order need an index on (Machine_ID, Date_Time) to be satisfied in one pass:
CREATE INDEX idxMachineIDDateTime ON DataTable (Machine_ID, Date_Time);
Separate indexes on Machine_ID and Date_Time will help little, if any.
How does it compare to this version?:
SELECT x.*
,(SELECT MAX(Date_Time)
FROM dbo.DataTable
WHERE Machine_ID = x.Machine_ID
AND Date_Time < x.Date_Time
) AS PreviousDateTime
FROM dbo.DataTable AS x
Or this version?:
SELECT x.*
,triang_join.PreviousDateTime
FROM dbo.DataTable AS x
INNER JOIN (
SELECT l.Machine_ID, l.Date_Time, MAX(r.Date_Time) AS PreviousDateTime
FROM dbo.DataTable AS l
LEFT JOIN dbo.DataTable AS r
ON l.Machine_ID = r.Machine_ID
AND l.Date_Time > r.Date_Time
GROUP BY l.Machine_ID, l.Date_Time
) AS triang_join
ON triang_join.Machine_ID = x.Machine_ID
AND triang_join.Date_Time = x.Date_Time
Both would perform best with an index on (Machine_ID, Date_Time), and for correct results I'm assuming that this pair is unique.
You haven't mentioned what is hidden away in *, and that can sometimes matter a lot: a (Machine_ID, Date_Time) index will generally not be covering, and if you have a lot of columns there or they hold a lot of data, ...
If the number of rows in dbo.DataTable is large, it is likely that you are experiencing the issue because the CTE is joined to itself. There is a blog post explaining the issue in some detail here
Occasionally in such cases I have resorted to creating a temporary table, inserting the result of the CTE query into it, and then doing the joins against that temporary table (although this has usually been for cases where a large number of joins against the temp table are required; with a single join the performance difference will be less noticeable).
I have had some strange performance problems using CTEs in SQL Server 2005. In many cases, replacing the CTE with a real temp table solved the problem.
I would try this before going any further with using a CTE.
I never found any explanation for the performance problems I've seen, and really didn't have any time to dig into the root causes. However I always suspected that the engine couldn't optimize the CTE in the same way that it can optimize a temp table (which can be indexed if more optimization is needed).
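For reference, a temp-table version of the question's query might look like the sketch below. The temp table and index names are made up for this example, and the join uses Ordering - 1 so that each row is paired with the one before it.

```sql
-- Materialize the numbered rows once instead of evaluating the CTE twice.
SELECT *,
       ROW_NUMBER() OVER (PARTITION BY Machine_ID
                          ORDER BY Date_Time) AS Ordering
INTO #Ordered
FROM dbo.DataTable;

-- Hypothetical index name; supports the self-join below.
CREATE CLUSTERED INDEX ix_Ordered ON #Ordered (Machine_ID, Ordering);

SELECT cur.*, prev.Date_Time AS PreviousDateTime
FROM #Ordered AS cur
INNER JOIN #Ordered AS prev
    ON prev.Machine_ID = cur.Machine_ID
   AND prev.Ordering = cur.Ordering - 1;

DROP TABLE #Ordered;
```

As in the original query, the inner join drops the first record of each machine, since it has no predecessor; a LEFT JOIN would keep it with a NULL PreviousDateTime.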
Update
After your comment that this is a view, I would first test the query with a temp table to see if that performs better.
If it does, and using a stored proc is not an option, you might consider making the current CTE into an indexed/materialized view. You will want to read up on the subject before going down this road, as whether this is a good idea depends on a lot of factors, not the least of which is how often the data is updated.
What if you use a trigger to store the last timestamp and subtract each time to get the difference?
If you require this data often, rather than calculating it each time you pull the data, why not add a column and calculate/populate it whenever a row is added?
(Remus' compound index will make the query fast; running it only once should make it faster still.)
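A rough sketch of the "populate it when the row is added" idea, under several assumptions: dbo.DataTable is a base table, (Machine_ID, Date_Time) is unique, and rows for a machine arrive in chronological order. The column and trigger names are made up for this example, and GO is the batch separator used by Management Studio and sqlcmd.

```sql
-- Hypothetical column to hold the precomputed value.
ALTER TABLE dbo.DataTable ADD PreviousDateTime DATETIME NULL;
GO

-- Hypothetical trigger: fills PreviousDateTime for newly inserted rows.
CREATE TRIGGER trg_DataTable_SetPrevious
ON dbo.DataTable
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- For each inserted row, look up the latest earlier reading
    -- of the same machine.
    UPDATE d
    SET PreviousDateTime = (SELECT MAX(p.Date_Time)
                            FROM dbo.DataTable AS p
                            WHERE p.Machine_ID = d.Machine_ID
                              AND p.Date_Time < d.Date_Time)
    FROM dbo.DataTable AS d
    INNER JOIN inserted AS i
        ON i.Machine_ID = d.Machine_ID
       AND i.Date_Time = d.Date_Time;
END;
GO
```

The lookup inside the trigger benefits from the same (Machine_ID, Date_Time) index, and existing rows would still need a one-time backfill.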
