Use an initial query to merge queries across multiple databases? - sql-server

Using the Data Explorer (SEDE), I would like to find which users have more than 200000 reputation on Stack Overflow, and then find details for any accounts they have on other Stack Exchange sites.
Here is the query which provides the list with this threshold:
Select id, reputation, accountid
From users
Where reputation > 200000
AccountId is the key for all Stack Exchange sites.
I have found this query for aggregating across SEDE databases, but how is it possible to do that based on the dynamic results of the previous/baseline query?
Here is the kind of output I'm aiming for:
id_so, reputation_so, accountid, other_stackexchange_site_name, reputation_othersite, number_of_answers_other_site, number_of_questions_other_site
1, 250000, 23, serverfault, 500, 5, 1
1, 250000, 23, superuser, 120, 1, 0
2, 300000, 21, serverfault, 300, 3, 2
2, 300000, 21, webmasters, 230, 1, 1
3, 350000, 20, NA, NA, NA, NA
# The case with id 3 has a Stack Overflow profile with reputation, but no profile on any other Stack Exchange site.

To run non-trivial queries across databases, based on an initial query:
1. Figure out the common key in all databases. In this case it's AccountId (which is a user's Stack-Exchange-wide Id).
2. Create your initial query to feed that key into a temp table. In this case:
CREATE TABLE #UsersOfInterest (AccountId INT)
INSERT INTO #UsersOfInterest
SELECT u.AccountId
FROM Users u
Where u.Reputation > 200000
3. Create another temp table to hold the final results (see below).
4. Determine the query, to run on each site, that gets the info you want. EG:
SELECT u.AccountId, u.DisplayName, u.Reputation, u.Id
, numQst = (SELECT COUNT(q.Id) FROM Posts q WHERE q.OwnerUserId = u.Id AND q.PostTypeId = 1)
, numAns = (SELECT COUNT(q.Id) FROM Posts q WHERE q.OwnerUserId = u.Id AND q.PostTypeId = 2)
FROM Users u
WHERE u.AccountId = ##seAccntId##
5. Use a system query to get the appropriate databases. For the Data Explorer (SEDE), a query of this type:
SELECT name
FROM sys.databases
WHERE CASE WHEN state_desc = 'ONLINE'
THEN OBJECT_ID (QUOTENAME (name) + '.[dbo].[PostNotices]', 'U')
END
IS NOT NULL
6. Create a cursor on the above query and use it to step through the databases.
7. For each database:
   a. Build a query string that takes the query of step 4 and puts it into the temp table of step 3.
   b. Run the query string using sp_executesql.
8. When the cursor is done, perform the final query on the temp table from step 3.
Refer to this other answer for a working template for querying all of the Stack Exchange sites.
Putting it all together results in the following query, which you can run live on SEDE:
-- MinMasterSiteRep: Users must have this much rep on whichever site this query is run against
-- MinRep: Users must have this much rep on all other sites
CREATE TABLE #UsersOfInterest (
AccountId INT NOT NULL
, Reputation INT
, UserId INT
, PRIMARY KEY (AccountId)
)
INSERT INTO #UsersOfInterest
SELECT u.AccountId, u.Reputation, u.Id
FROM Users u
Where u.Reputation > ##MinMasterSiteRep:INT?200000##
CREATE TABLE #AllSiteResults (
[Master Rep] INT
, [Mstr UsrId] NVARCHAR(777)
, AccountId NVARCHAR(777)
, [Site name] NVARCHAR(777)
, [Username on site] NVARCHAR(777)
, [Rep] INT
, [# Ans] INT
, [# Qst] INT
)
DECLARE @seDbName AS NVARCHAR(777)
DECLARE @seSiteURL AS NVARCHAR(777)
DECLARE @sitePrettyName AS NVARCHAR(777)
DECLARE @seSiteQuery AS NVARCHAR(max)
DECLARE seSites_crsr CURSOR FOR
WITH dbsAndDomainNames AS (
SELECT dbL.dbName
, STRING_AGG (dbL.domainPieces, '.') AS siteDomain
FROM (
SELECT TOP 50000 -- Never be that many sites and TOP is needed for order by, below
name AS dbName
, value AS domainPieces
, row_number () OVER (ORDER BY (SELECT 0)) AS [rowN]
FROM sys.databases
CROSS APPLY STRING_SPLIT (name, '.')
WHERE CASE WHEN state_desc = 'ONLINE'
THEN OBJECT_ID (QUOTENAME (name) + '.[dbo].[PostNotices]', 'U') -- Pick a table unique to SE data
END
IS NOT NULL
ORDER BY dbName, [rowN] DESC
) AS dbL
GROUP BY dbL.dbName
)
SELECT REPLACE (REPLACE (dadn.dbName, 'StackExchange.', ''), '.', ' ' ) AS [Site Name]
, dadn.dbName
, CASE -- See https://meta.stackexchange.com/q/215071
WHEN dadn.dbName = 'StackExchange.Mathoverflow.Meta'
THEN 'https://meta.mathoverflow.net/'
-- Some AVP/Audio/Video/Sound kerfuffle?
WHEN dadn.dbName = 'StackExchange.Audio'
THEN 'https://video.stackexchange.com/'
-- Ditto
WHEN dadn.dbName = 'StackExchange.Audio.Meta'
THEN 'https://video.meta.stackexchange.com/'
-- Normal site
ELSE 'https://' + LOWER (siteDomain) + '.com/'
END AS siteURL
FROM dbsAndDomainNames dadn
WHERE (dadn.dbName = 'StackExchange.Meta' OR dadn.dbName NOT LIKE '%Meta%')
-- Step through cursor
OPEN seSites_crsr
FETCH NEXT FROM seSites_crsr INTO @sitePrettyName, @seDbName, @seSiteURL
WHILE @@FETCH_STATUS = 0
BEGIN
SET @seSiteQuery = '
USE [' + @seDbName + ']
INSERT INTO #AllSiteResults
SELECT
uoi.Reputation AS [Master Rep]
, ''site://u/'' + CAST(uoi.UserId AS NVARCHAR(88)) + ''|'' + CAST(uoi.UserId AS NVARCHAR(88)) AS [Mstr UsrId]
, [AccountId] = ''https://stackexchange.com/users/'' + CAST(u.AccountId AS NVARCHAR(88)) + ''?tab=accounts|'' + CAST(u.AccountId AS NVARCHAR(88))
, ''' + @sitePrettyName + ''' AS [Site name]
, ''' + @seSiteURL + ''' + ''u/'' + CAST(u.Id AS NVARCHAR(88)) + ''|'' + u.DisplayName AS [Username on site]
, u.Reputation AS [Rep]
, (SELECT COUNT(q.Id) FROM Posts q WHERE q.OwnerUserId = u.Id AND q.PostTypeId = 2) AS [# Ans]
, (SELECT COUNT(q.Id) FROM Posts q WHERE q.OwnerUserId = u.Id AND q.PostTypeId = 1) AS [# Qst]
FROM #UsersOfInterest uoi
INNER JOIN Users u ON uoi.AccountId = u.AccountId
WHERE u.Reputation > ##MinRep:INT?200##
'
EXEC sp_executesql @seSiteQuery
FETCH NEXT FROM seSites_crsr INTO @sitePrettyName, @seDbName, @seSiteURL
END
CLOSE seSites_crsr
DEALLOCATE seSites_crsr
SELECT *
FROM #AllSiteResults
ORDER BY [Master Rep] DESC, AccountId, [Rep] DESC
It gives results like:
-- where the blue values are hyperlinked.
Note that a user must have 200 rep on a site for it to be "significant". That's also the rep needed for the site to be included in the Stack Exchange flair.
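The question also asked for a placeholder row (the "NA" case) when an account has no profile on another site. The combined query above only inserts rows where a matching user exists, so one way to get placeholders is a LEFT JOIN from the list of users to the collected results. The following is a minimal sketch under stated assumptions: #Users and #OtherSites and their columns are hypothetical stand-ins for the two temp tables, filled with the sample rows from the question rather than real SEDE data.

```sql
-- Hypothetical stand-ins for the "users of interest" and "per-site results" temp tables.
CREATE TABLE #Users (AccountId INT PRIMARY KEY, UserId INT, Reputation INT);
CREATE TABLE #OtherSites (AccountId INT, SiteName NVARCHAR(100), Rep INT, NumAns INT, NumQst INT);

-- Sample rows taken from the question's desired output.
INSERT INTO #Users VALUES (23, 1, 250000), (21, 2, 300000), (20, 3, 350000);
INSERT INTO #OtherSites VALUES
    (23, N'serverfault', 500, 5, 1),
    (23, N'superuser',   120, 1, 0),
    (21, N'serverfault', 300, 3, 2),
    (21, N'webmasters',  230, 1, 1);

-- LEFT JOIN keeps AccountId 20 even though it has no rows in #OtherSites;
-- its site name becomes 'NA' and the numeric columns stay NULL.
SELECT u.UserId     AS id_so
     , u.Reputation AS reputation_so
     , u.AccountId
     , ISNULL(o.SiteName, N'NA') AS other_site
     , o.Rep, o.NumAns, o.NumQst
FROM #Users u
LEFT JOIN #OtherSites o ON o.AccountId = u.AccountId
ORDER BY u.UserId, o.Rep DESC;

DROP TABLE #OtherSites;
DROP TABLE #Users;
```

The numeric columns are left as NULL rather than the string 'NA' so that they keep their integer type.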

Related

SQL Server job gets stuck occasionally

I found that my SQL Server job gets stuck occasionally, about once every two months. Since I am not from a DBA background, I need some help to rectify the issue.
So far, I have tried to pinpoint the issue by checking the Activity Monitor. I found that the issue is caused by one of my stored procedures, which creates a temp table to collect data and then inserts that data into one of my transaction tables. This table has 400 million records.
Whenever this issue occurs, I stop the job and:
- I rerun the job, and the stored procedure completes
- I execute the stored procedure manually, and the stored procedure completes
I installed sp_BlitzCache and executed it. I can see that it suggests DBCC FREEPROCCACHE (0x0...) for the stored procedure.
CREATE TABLE #dtResult
(
RunningNumber INTEGER
, AlphaID BIGINT
, BetaID BIGINT
, Content varchar(100)
, X varchar(10)
, Y varchar(10)
)
INSERT INTO #dtResult ( RunningNumber, ...)
SELECT RowId AS RunningNumber,
...
FROM
...
/*** Based on activity monitor, the highest CPU caused by this statement ***/
INSERT INTO tblTransaction ( ... )
SELECT DISTINCT
RES.AlphaID
, b.UnitId
, RES.BetaID
, CASE WHEN RES.BinData IS NULL THEN [dbo].[fnGetCode](B.Data, RES.X, RES.Y) ELSE RES.Content END
, CONVERT(DATETIME, SUBSTRING(RES.Timestamp, 1, 4) + '-' + SUBSTRING(RES.Timestamp, 5, 2) + '-' + SUBSTRING(RES.Timestamp, 7, 2) + ' ' + SUBSTRING(RES.Timestamp, 9, 2) + ':' + SUBSTRING(RES.Timestamp, 11, 2) + ':' + SUBSTRING(RES.Timestamp, 13, 2) + '.' + SUBSTRING(RES.Timestamp, 15, 3), 121)
FROM
#dtResult RES
INNER JOIN
tblA a with(nolock) ON RES.AlphaID = a.AlphaID
INNER JOIN
tblB b with(nolock) ON a.UnitId = b.UnitId AND CAST(RES.X AS INTEGER) = b.X AND CAST(RES.Y AS INTEGER) = b.Y
INNER JOIN
tblC c with(nolock) ON RES.BetaID = c.BetaID
LEFT OUTER JOIN
tblTransaction t with(nolock) ON RES.AlphaID = t.AlphaID AND RES.BetaID = t.BetaID AND t.UnitId = b.UnitId
WHERE
t.BetaID IS NULL
/* FUNCTION */
CREATE FUNCTION [dbo].[fnGetCode]
(
@Data VARCHAR(MAX),
@SearchX INT,
@SearchY INT
)
RETURNS CHAR(4)
WITH ENCRYPTION
AS
BEGIN
DECLARE @SearchResult CHAR(4)
DECLARE @Pos INT
SET @Pos = (@SearchY * @SearchX) + 1
SET @SearchResult = CONVERT(char(1),SUBSTRING(@Data,@Pos,1), 1)
RETURN @SearchResult
END

T-SQL Merge Two Comma-Separated Columns

I'm trying to merge one table into another (we'll call them Stage and Prod) that controls users and their permissions. My end result should be a single Prod table that has combined each userid's permissions from Stage into Prod. The issue I'm having though is that the tables were designed by an outside vendor and contain multiple pieces of information in one comma-delimited column.
Stage might look like below:
Userid | Permissions
----------------------------------------------------------------
1 | schedule,upload,test,download,admin
2 | test,upload
3 | download
Prod:
Userid | Permissions
----------------------------------------------------------------
1 | test,admin,schedule,download,upload
2 | admin
3 | download,upload
When they're merged, the userids should have their permissions from Stage, combined with those in Prod. However, tackling this when the permissions are a comma-delimited string has me at wit's end.
In the final result below, userid 1's permissions remain unchanged because they are the same in Stage as they are in Prod, merely in a different order.
Userid 2 had his Stage permissions added to his Prod since he did not have those permissions yet.
Userid 3 had his Prod permissions unchanged since his Stage permissions are already included.
Result:
Userid | Permissions
----------------------------------------------------------------
1 | test,admin,schedule,download,upload
2 | admin,test,upload
3 | download,upload
Is there any way to do this? Hopefully this makes some sense, but if there's any more info that might help I'm happy to try to provide it. Thank you for any help at all.
Interestingly enough, this was a topic of discussion on an MSSQLTips blog by Aaron Bertrand. Borrowing his code, you can create the Numbers table and the string splitting/reassembling functions required to make the following work. If you are planning on doing this often and are stuck with the schema you've shown, this is the way to go.
/*Create Test Data
create table StagePermissions (UserID int, [Permissions] nvarchar(max));
create table ProdPermissions (UserID int, [Permissions] nvarchar(max));
insert StagePermissions values
(1,'schedule,upload,test,download,admin'),
(2,'test,upload'),
(3,'download')
insert ProdPermissions values
(1,'test,admin,schedule,download,upload'),
(2,'admin'),
(3,'download,upload')
*/
select sp.UserID, dbo.ReassembleString(sp.Permissions+','+pp.Permissions,',',N'OriginalOrder') MergedPermissions
from StagePermissions sp
join ProdPermissions pp on pp.UserID=sp.UserID
Taking Steve's test data, but adding:
create table BothPermissions (UserID int, [Permissions] nvarchar(max));
This code will work with a fixed number of possible permissions.
DECLARE @XPermissions TABLE (
UserID int
,XSchedule BIT
,XUpload BIT
,XTest BIT
,XDownload BIT
,XAdmin BIT
)
INSERT INTO @XPermissions
SELECT
ISNULL(sp.UserID,pp.UserID),
CHARINDEX('schedule',sp.[Permissions]) + CHARINDEX('schedule',pp.[Permissions]),
CHARINDEX('upload',sp.[Permissions]) + CHARINDEX('upload',pp.[Permissions]),
CHARINDEX('test',sp.[Permissions]) + CHARINDEX('test',pp.[Permissions]),
CHARINDEX('download',sp.[Permissions]) + CHARINDEX('download',pp.[Permissions]),
CHARINDEX('admin',sp.[Permissions]) + CHARINDEX('admin',pp.[Permissions])
FROM StagePermissions sp
FULL JOIN ProdPermissions pp
ON sp.UserID = pp.UserID
INSERT INTO BothPermissions
SELECT
UserID,
CASE XSchedule WHEN 0 THEN '' ELSE 'schedule ' END +
CASE XUpload WHEN 0 THEN '' ELSE 'upload ' END +
CASE XTest WHEN 0 THEN '' ELSE 'test ' END +
CASE XDownload WHEN 0 THEN '' ELSE 'download ' END +
CASE XAdmin WHEN 0 THEN '' ELSE 'admin' END
FROM @XPermissions
UPDATE BothPermissions
SET [Permissions] = REPLACE(RTRIM([Permissions]),' ',', ')
Now, I was further curious about Steve's answer. I think it is the most robust solution here. However, I wondered how it would perform with a large dataset. I still don't know the answer because I haven't set up the tools necessary to use it. But here's a query that includes some random number generation to populate 10,000 records of each:
SELECT GETDATE()
DECLARE @StagePerms TABLE (
UserID INT IDENTITY
,Perms NVARCHAR(MAX)
)
DECLARE @ProdPerms TABLE (
UserID INT IDENTITY
,Perms NVARCHAR(MAX)
)
DECLARE @Counter INT = 0
DECLARE @XString NVARCHAR(MAX)
WHILE @Counter < 10000
BEGIN
SET @Counter += 1
SET @XString = REPLACE(RTRIM(
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'test ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'admin ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'schedule ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'download ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'upload ' END)
,' ',', ')
INSERT INTO @StagePerms SELECT @XString
SET @XString = REPLACE(RTRIM(
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'test ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'admin ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'schedule ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'download ' END +
CASE ROUND(RAND()-.2,0) WHEN 0 THEN '' ELSE 'upload ' END)
,' ',', ')
INSERT INTO @ProdPerms SELECT @XString
END
SELECT GETDATE()
DECLARE @BothPerms TABLE (
UserID INT
,Perms NVARCHAR(MAX)
)
DECLARE @XPerms TABLE (
UserID int
,XSchedule BIT
,XUpload BIT
,XTest BIT
,XDownload BIT
,XAdmin BIT
)
INSERT INTO @XPerms
SELECT
ISNULL(sp.UserID,pp.UserID),
CHARINDEX('schedule',sp.Perms) + CHARINDEX('schedule',pp.Perms),
CHARINDEX('upload',sp.Perms) + CHARINDEX('upload',pp.Perms),
CHARINDEX('test',sp.Perms) + CHARINDEX('test',pp.Perms),
CHARINDEX('download',sp.Perms) + CHARINDEX('download',pp.Perms),
CHARINDEX('admin',sp.Perms) + CHARINDEX('admin',pp.Perms)
FROM @StagePerms sp
FULL JOIN @ProdPerms pp
ON sp.UserID = pp.UserID
INSERT INTO @BothPerms
SELECT
UserID,
CASE XTest WHEN 0 THEN '' ELSE 'test ' END +
CASE XAdmin WHEN 0 THEN '' ELSE 'admin ' END +
CASE XSchedule WHEN 0 THEN '' ELSE 'schedule ' END +
CASE XDownload WHEN 0 THEN '' ELSE 'download ' END +
CASE XUpload WHEN 0 THEN '' ELSE 'upload ' END
FROM @XPerms
UPDATE @BothPerms
SET Perms = REPLACE(RTRIM(Perms),' ',', ')
SELECT * FROM @BothPerms
SELECT GETDATE()
The random number generation took less than a second; the rest took about 31 seconds. Steve, I'd be interested to see a comparison. Doesn't matter, obviously, if the data doesn't allow for my solution. And I'm sure there's a sweet spot somewhere.
Please make use of the query below. It's working fine in SQL Server 2012.
DECLARE @Stage TABLE (Userid int, Permission Varchar (8000))
DECLARE @Prod TABLE (Userid int, Permission Varchar (8000))
DECLARE @temp TABLE (Userid int, Permission Varchar (8000))
INSERT @Stage
(Userid,Permission)
VALUES
(1,'schedule,upload,test,download,admin'),
(2,'test,upload'),
(3,'download')
INSERT @Prod
(Userid,Permission)
VALUES
(1,'test,admin,schedule,download,upload'),
(2,'admin'),
(3,'download,upload')
-- Execution Part
INSERT INTO @temp
(Userid,Permission)
(
SELECT A.Userid AS Userid,Split.a.value('.', 'VARCHAR(100)') AS Permission FROM
(SELECT Userid,CAST ('<M>' + REPLACE(Permission, ',', '</M><M>') + '</M>' AS XML) AS Permission FROM @Stage A) AS A
CROSS APPLY Permission.nodes ('/M') AS Split(a)
UNION
SELECT A.Userid AS Userid,Split.a.value('.', 'VARCHAR(100)') AS Permission FROM
(SELECT Userid,CAST ('<M>' + REPLACE(Permission, ',', '</M><M>') + '</M>' AS XML) AS Permission FROM @Prod A) AS A
CROSS APPLY Permission.nodes ('/M') AS Split(a)
)
SELECT Userid, Permission =
STUFF((SELECT ', ' + Permission
FROM @temp b
WHERE b.Userid = a.Userid
FOR XML PATH('')), 1, 2, '')
FROM @temp a
GROUP BY Userid
OUTPUT
Userid Permission
1 admin, download, schedule, test, upload
2 admin, test, upload
3 download, upload
You can also use the built-in string splitting support introduced in SQL Server 2016 (if you are already using that engine version, of course).
STRING_SPLIT returns single column table...
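As a rough illustration of the STRING_SPLIT approach, here is a minimal sketch using the sample data from the question. It assumes SQL Server 2017 or later, because it also uses STRING_AGG to reassemble the list; the table variable and column names are chosen for the example. STRING_SPLIT does not guarantee the order of the pieces, so this sketch sorts the merged permissions alphabetically instead of preserving the Prod order.

```sql
DECLARE @Stage TABLE (Userid INT, Permission VARCHAR(8000));
DECLARE @Prod  TABLE (Userid INT, Permission VARCHAR(8000));

INSERT @Stage VALUES (1, 'schedule,upload,test,download,admin'), (2, 'test,upload'), (3, 'download');
INSERT @Prod  VALUES (1, 'test,admin,schedule,download,upload'), (2, 'admin'), (3, 'download,upload');

WITH AllPerms AS (
    -- One row per (user, permission) from each table
    SELECT s.Userid, LTRIM(RTRIM(x.value)) AS Permission
    FROM @Stage s
    CROSS APPLY STRING_SPLIT(s.Permission, ',') x
    UNION   -- UNION (not UNION ALL) drops permissions present in both tables
    SELECT p.Userid, LTRIM(RTRIM(x.value))
    FROM @Prod p
    CROSS APPLY STRING_SPLIT(p.Permission, ',') x
)
SELECT Userid
     , STRING_AGG(Permission, ',') WITHIN GROUP (ORDER BY Permission) AS MergedPermissions
FROM AllPerms
GROUP BY Userid
ORDER BY Userid;
```

For the sample data this should give the same permission sets as the XML-based answer above, one row per user.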

Stored procedure with inner join using coalesce

I have a simple table, tblAllUsers, which stores simple values such as Name, Date Of Birth, etc. for a UserId.
Another table, tblInterest, stores the interest(s) of a UserId. A user may have any number of interests, and each is stored in a separate row:
Create table tblInterest
(
Id int primary key identity,
UserId varchar(10),
InterestId int,
Interest varchar(20)
)
So when I want to display all the interests of a particular user together, I use the query below:
DECLARE @listStr VARCHAR(MAX)
SELECT @listStr = COALESCE(@listStr + ', ' ,'') + Interest FROM tblInterest where UserId=@UserId
SELECT @listStr
Now I want to display a user's info from both of these tables, with the interests displayed in one string.
I have tried the following:
Create proc spPlayersGridview
@listStr VARCHAR(MAX)
as
begin
Select tblAllUsers.Category, tblAllUsers.DOB, tblAllUsers.FirstName, tblAllUsers.LastName, tblAllUsers.City, tblAllUsers.State,
@listStr = COALESCE(@listStr + ', ' ,'') + tblInterest.Interest
from tblAllUsers
INNER JOIN tblInterest
ON tblAllUsers.UserId=tblInterest.UserId
where Category='Player'
end
This throws the exception "A SELECT statement that assigns a value to a variable must not be combined with data-retrieval operations."
I had a similar problem a while back, and a bit of SQL STUFF magic helps - Maybe it will work for you as well.
CREATE PROC spPlayersGridview
AS
BEGIN
SELECT
tblAllUsers.Category
, tblAllUsers.DOB
, tblAllUsers.FirstName
, tblAllUsers.LastName
, tblAllUsers.City
, tblAllUsers.State
, listStr = STUFF((
SELECT ',' + tblInterest.Interest
FROM tblInterest
WHERE tblAllUsers.UserId=tblInterest.UserId
ORDER BY tblInterest.Interest
FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 1, '')
FROM tblAllUsers
WHERE Category='Player'
END
Hope it helps - For more reading look at: https://msdn.microsoft.com/en-us/library/ms188043.aspx
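To see the STUFF / FOR XML PATH pattern in isolation, here is a small self-contained sketch. The table variable and its sample rows are made up for illustration; only the column names follow tblInterest from the question. The correlated subquery builds ', Chess, Tennis' for each user, and STUFF removes the leading two characters (the comma and the space).

```sql
-- Hypothetical sample data standing in for tblInterest
DECLARE @tblInterest TABLE (UserId VARCHAR(10), Interest VARCHAR(20));
INSERT @tblInterest VALUES ('u1', 'Tennis'), ('u1', 'Chess'), ('u2', 'Golf');

SELECT u.UserId
     , Interests = STUFF((
           SELECT ', ' + i.Interest
           FROM @tblInterest i
           WHERE i.UserId = u.UserId          -- correlate to the outer row
           ORDER BY i.Interest
           FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'), 1, 2, '')
FROM (SELECT DISTINCT UserId FROM @tblInterest) u
ORDER BY u.UserId;
-- u1 | Chess, Tennis
-- u2 | Golf
```

Because the concatenation happens inside a subquery that returns a single value per outer row, no variable assignment is involved, which avoids the "must not be combined with data-retrieval operations" error.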

Query so much slower than stored procedure?

I have created a query that runs in approximately 2 seconds with top 100. If I create a stored procedure from this exact query, it takes 12-13 seconds to run.
Why would that be?
Elements table count = 2309015 (with userid specified = 326969)
Matches table count = 1290 (with userid specified = 498)
sites table count = 71 (with userid specified = 9)
code
with search (elementid, siteid, title, description, site, link, addeddate)
as
(
select top(@top)
elementid,
elements.siteid, title, elements.description,
site =
case sites.description
when '' then sites.name
when null then sites.name
else sites.name + ' (' + sites.description + ')'
end,
elements.link,
elements.addeddate
from elements
left join sites on elements.siteid = sites.siteid
where title like @search and sites.userid = @userid
order by addeddate desc
)
select search.*, isnull(matches.elementid,0) as ismatch
from search
left join matches on matches.elementid = search.elementid
When you create a stored procedure, it is compiled and its plan is stored. When the procedure has parameters that filter the result, the optimizer doesn't know which values you will pass at execution time, so it treats the filter as selecting about 33% of the rows and builds the plan on that basis. When you execute the query directly, the values are provided and the optimizer creates an execution plan based on them. I'm sure the plans are different.
Without code, I can only guess. When writing a sample query, you first have a constant where clause and second a cache. The stored procedure has no chance of either caching or optimizing the query plan based on a constant in the where clause.
I can suggest two ways to try
First one, write your sp like this:
create procedure sp_search
(
@top int,
@search nvarchar(max),
@userid int
)
as
begin
declare @p_top int, @p_search nvarchar(max), @p_userid int
select @p_top = @top, @p_search = @search, @p_userid = @userid;
with search (elementid, siteid, title, description, site, link, addeddate)
as
(
select top(@p_top)
elementid,
elements.siteid, title, elements.description,
site =
case sites.description
when '' then sites.name
when null then sites.name
else sites.name + ' (' + sites.description + ')'
end,
elements.link,
elements.addeddate
from elements
left join sites on elements.siteid = sites.siteid
where title like @p_search and sites.userid = @p_userid
order by addeddate desc
)
select search.*, isnull(matches.elementid,0) as ismatch
from search
left join matches on matches.elementid = search.elementid
end
Second one, use an inline table-valued function:
create function sf_search
(
@top int,
@search nvarchar(max),
@userid int
)
returns table
as
return
(
with search (elementid, siteid, title, description, site, link, addeddate)
as
(
select top(@top)
elementid,
elements.siteid, title, elements.description,
site =
case sites.description
when '' then sites.name
when null then sites.name
else sites.name + ' (' + sites.description + ')'
end,
elements.link,
elements.addeddate
from elements
left join sites on elements.siteid = sites.siteid
where title like @search and sites.userid = @userid
order by addeddate desc
)
select search.*, isnull(matches.elementid,0) as ismatch
from search
left join matches on matches.elementid = search.elementid
)
There is a similar question here
The problem was the stored proc declaration SET ANSI_NULLS OFF
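If the ANSI_NULLS setting is the suspect, the setting each module was created with can be read from the catalog. A minimal sketch, assuming you have permission to read sys.sql_modules in the database in question; it lists every procedure, function, view or trigger that was created with ANSI_NULLS OFF:

```sql
SELECT OBJECT_SCHEMA_NAME(m.object_id) AS schema_name
     , OBJECT_NAME(m.object_id)        AS module_name
     , m.uses_ansi_nulls               -- 0 = created with SET ANSI_NULLS OFF
     , m.uses_quoted_identifier
FROM sys.sql_modules m
WHERE m.uses_ansi_nulls = 0
ORDER BY schema_name, module_name;
```

The setting is captured when the module is created, so changing it means re-creating or altering the procedure in a session that has SET ANSI_NULLS ON.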

Paging, sorting and filtering in a stored procedure (SQL Server)

I was looking at different ways of writing a stored procedure to return a "page" of data. This was for use with the ASP ObjectDataSource, but it could be considered a more general problem.
The requirement is to return a subset of the data based on the usual paging parameters; startPageIndex and maximumRows, but also a sortBy parameter to allow the data to be sorted. Also there are some parameters passed in to filter the data on various conditions.
One common way to do this seems to be something like this:
[Method 1]
;WITH stuff AS (
SELECT
CASE
WHEN @SortBy = 'Name' THEN ROW_NUMBER() OVER (ORDER BY Name)
WHEN @SortBy = 'Name DESC' THEN ROW_NUMBER() OVER (ORDER BY Name DESC)
WHEN @SortBy = ...
ELSE ROW_NUMBER() OVER (ORDER BY whatever)
END AS Row,
.,
.,
.,
FROM Table1
INNER JOIN Table2 ...
LEFT JOIN Table3 ...
WHERE ... (lots of things to check)
)
SELECT *
FROM stuff
WHERE (Row > @startRowIndex)
AND (Row <= @startRowIndex + @maximumRows OR @maximumRows <= 0)
ORDER BY Row
One problem with this is that it doesn't give the total count and generally we need another stored procedure for that. This second stored procedure has to replicate the parameter list and the complex WHERE clause. Not nice.
One solution is to append an extra column to the final select list, (SELECT COUNT(*) FROM stuff) AS TotalRows. This gives us the total but repeats it for every row in the result set, which is not ideal.
[Method 2]
An interesting alternative is given here (https://web.archive.org/web/20211020111700/https://www.4guysfromrolla.com/articles/032206-1.aspx) using dynamic SQL. He reckons that the performance is better because the CASE statement in the first solution drags things down. Fair enough, and this solution makes it easy to get the totalRows and slap it into an output parameter. But I hate coding dynamic SQL. All that 'bit of SQL ' + STR(@parm1) +' bit more SQL' gubbins.
[Method 3]
The only way I can find to get what I want, without repeating code which would have to be synchronized, and keeping things reasonably readable is to go back to the "old way" of using a table variable:
DECLARE @stuff TABLE (Row INT, ...)
INSERT INTO @stuff
SELECT
CASE
WHEN @SortBy = 'Name' THEN ROW_NUMBER() OVER (ORDER BY Name)
WHEN @SortBy = 'Name DESC' THEN ROW_NUMBER() OVER (ORDER BY Name DESC)
WHEN @SortBy = ...
ELSE ROW_NUMBER() OVER (ORDER BY whatever)
END AS Row,
.,
.,
.,
FROM Table1
INNER JOIN Table2 ...
LEFT JOIN Table3 ...
WHERE ... (lots of things to check)
SELECT *
FROM @stuff
WHERE (Row > @startRowIndex)
AND (Row <= @startRowIndex + @maximumRows OR @maximumRows <= 0)
ORDER BY Row
(Or a similar method using an IDENTITY column on the table variable).
Here I can just add a SELECT COUNT on the table variable to get the totalRows and put it into an output parameter.
I did some tests, and with a fairly simple version of the query (no sortBy and no filter), method 1 seems to come out on top (almost twice as quick as the other 2). Then I decided to test properly: I needed the complexity, and I needed the SQL to be in stored procedures. With this I get method 1 taking nearly twice as long as the other 2 methods, which seems strange.
Is there any good reason why I shouldn't spurn CTEs and stick with method 3?
UPDATE - 15 March 2012
I tried adapting Method 1 to dump the page from the CTE into a temporary table so that I could extract the TotalRows and then select just the relevant columns for the resultset. This seemed to add significantly to the time (more than I expected). I should add that I'm running this on a laptop with SQL Server Express 2008 (all that I have available) but still the comparison should be valid.
I looked again at the dynamic SQL method. It turns out I wasn't really doing it properly (just concatenating strings together). I set it up as in the documentation for sp_executesql (with a parameter description string and parameter list) and it's much more readable. Also this method runs fastest in my environment. Why that should be still baffles me, but I guess the answer is hinted at in Hogan's comment.
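The parameterized sp_executesql style mentioned in the update can be sketched as follows. This is an illustration under assumptions, not the asker's actual procedure: it pages over sys.objects, the variable names are made up, and the sort expression is picked from a fixed list so that no user-supplied text is concatenated into the statement. The paging values and the total row count travel as real parameters, with the total coming back through an OUTPUT parameter.

```sql
DECLARE @sql NVARCHAR(MAX), @totalRows INT;
DECLARE @startRowIndex INT = 10, @maximumRows INT = 5;
DECLARE @sortBy NVARCHAR(20) = N'Name DESC';

-- Map the requested sort to a known-safe ORDER BY expression (whitelist).
DECLARE @orderBy NVARCHAR(50) =
    CASE @sortBy WHEN N'Name DESC' THEN N'name DESC'
                 WHEN N'Created'   THEN N'create_date'
                 ELSE N'name' END;

SET @sql = N'
SELECT @total = COUNT(*) FROM sys.objects;

WITH numbered AS (
    SELECT name, create_date,
           ROW_NUMBER() OVER (ORDER BY ' + @orderBy + N') AS Row
    FROM sys.objects
)
SELECT name, create_date, Row
FROM numbered
WHERE Row > @start AND Row <= @start + @max
ORDER BY Row;';

-- Second argument describes the parameters; the rest supplies their values.
EXEC sp_executesql @sql,
     N'@start INT, @max INT, @total INT OUTPUT',
     @start = @startRowIndex, @max = @maximumRows, @total = @totalRows OUTPUT;

SELECT @totalRows AS TotalRows;
```

Only the ORDER BY fragment is built by concatenation; everything else stays a parameter, which keeps the statement readable and lets the plan be reused across different page requests.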
I would most likely split the @SortBy argument into two, @SortColumn and @SortDirection, and use them like this:
…
ROW_NUMBER() OVER (
ORDER BY CASE @SortColumn
WHEN 'Name' THEN Name
WHEN 'OtherName' THEN OtherName
…
END *
CASE @SortDirection
WHEN 'DESC' THEN -1
ELSE 1
END
) AS Row
…
…
And this is how the TotalRows column could be defined (in the main select):
…
COUNT(*) OVER () AS TotalRows
…
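Multiplying the sort key by -1 only works when the sort column is numeric. For string or mixed-type columns, a common variant is to give each column and direction its own CASE expression, so only one of them is non-NULL for a given request. The sketch below is an assumption-laden illustration that pages over sys.objects; it also shows COUNT(*) OVER () returning the total alongside each row.

```sql
DECLARE @SortColumn VARCHAR(20) = 'Name', @SortDirection VARCHAR(4) = 'DESC';
DECLARE @startRowIndex INT = 0, @maximumRows INT = 10;

WITH numbered AS (
    SELECT o.name, o.object_id,
           ROW_NUMBER() OVER (
               ORDER BY
                 -- Exactly one of these keys is non-NULL for a given request
                 CASE WHEN @SortColumn = 'Name' AND @SortDirection <> 'DESC' THEN o.name END ASC,
                 CASE WHEN @SortColumn = 'Name' AND @SortDirection =  'DESC' THEN o.name END DESC,
                 CASE WHEN @SortColumn = 'Id'   AND @SortDirection <> 'DESC' THEN o.object_id END ASC,
                 CASE WHEN @SortColumn = 'Id'   AND @SortDirection =  'DESC' THEN o.object_id END DESC
           ) AS Row,
           COUNT(*) OVER () AS TotalRows     -- same value repeated on every row
    FROM sys.objects o
)
SELECT name, object_id, Row, TotalRows
FROM numbered
WHERE Row > @startRowIndex AND Row <= @startRowIndex + @maximumRows
ORDER BY Row;
```

Keeping each column in its own CASE also avoids the implicit type conversions that occur when one CASE returns columns of different data types.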
I would definitely want to do a combination of a temp table and NTILE for this sort of approach.
The temp table will allow you to do your complicated series of conditions just once. Because you're only storing the pieces you care about, it also means that when you start doing selects against it further in the procedure, it should have a smaller overall memory usage than if you ran the condition multiple times.
I like NTILE() for this better than ROW_NUMBER() because it's doing the work you're trying to accomplish for you, rather than having additional where conditions to worry about.
The example below is one based off a similar query I'm using as part of a research query; I have an ID I can use that I know will be unique in the results. Using an ID that was an identity column would also be appropriate here, though.
--DECLARES here would be stored procedure parameters
declare @pagesize int, @sortby varchar(25), @page int = 1;
--Create temp with all relevant columns; ID here could be an identity PK to help with paging query below
create table #temp (id int not null primary key clustered, status varchar(50), lastname varchar(100), startdate datetime);
--Insert into #temp based off of your complex conditions, but with no attempt at paging
insert into #temp
(id, status, lastname, startdate)
select id, status, lastname, startdate
from Table1 ...etc.
where ...complicated conditions
SET @pagesize = 50;
SET @page = 5;--OR CAST(@startRowIndex/@pagesize as int)+1
SET @sortby = 'name';
--Only use the id and count to use NTILE
;with paging(id, pagenum, totalrows) as
(
select id,
NTILE((SELECT COUNT(*) cnt FROM #temp)/@pagesize) OVER(ORDER BY CASE WHEN @sortby = 'NAME' THEN lastname ELSE convert(varchar(10), startdate, 112) END),
cnt
FROM #temp
cross apply (SELECT COUNT(*) cnt FROM #temp) total
)
--Use the id to join back to main select
SELECT *
FROM paging
JOIN #temp ON paging.id = #temp.id
WHERE paging.pagenum = @page
--Don't need the drop in the procedure, included here for rerunnability
drop table #temp;
I generally prefer temp tables over table variables in this scenario, largely so that there are definite statistics on the result set you have. (Search for temp table vs table variable and you'll find plenty of examples as to why)
Dynamic SQL would be most useful for handling the sorting method. Using my example, you could do the main query in dynamic SQL and only pull the sort method you want to pull into the OVER().
The example above also does the total in each row of the return set, which as you mentioned was not ideal. You could, instead, have a #totalrows output variable in your procedure and pull it as well as the result set. That would save you the CROSS APPLY that I'm doing above in the paging CTE.
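One detail worth checking with the NTILE approach is how the number of tiles is computed and how rows are spread across them. The sketch below is an illustration with made-up numbers (101 rows, 50 per page): it rounds the page count up with integer arithmetic and compares the page sizes NTILE produces with those from ROW_NUMBER. NTILE spreads rows as evenly as possible, so its pages are not necessarily @pagesize rows long.

```sql
DECLARE @pagesize INT = 50, @rows INT = 101;
-- Round up: 101 rows at 50 per page need 3 pages (plain 101 / 50 would give 2).
DECLARE @pages INT = (@rows + @pagesize - 1) / @pagesize;

WITH n AS (   -- generate @rows sample ids
    SELECT TOP (@rows) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS id
    FROM sys.all_objects
),
paged AS (
    SELECT id,
           NTILE(@pages) OVER (ORDER BY id)                         AS ntile_page,
           (ROW_NUMBER() OVER (ORDER BY id) - 1) / @pagesize + 1    AS rownum_page
    FROM n
)
SELECT 'NTILE' AS method, ntile_page AS page, COUNT(*) AS rows_in_page
FROM paged GROUP BY ntile_page
UNION ALL
SELECT 'ROW_NUMBER', rownum_page, COUNT(*)
FROM paged GROUP BY rownum_page
ORDER BY method, page;
-- NTILE      pages hold 34, 34 and 33 rows
-- ROW_NUMBER pages hold 50, 50 and 1 rows
```

If fixed-size pages matter to the caller, the ROW_NUMBER arithmetic is the safer of the two; NTILE is fine when roughly equal pages are acceptable.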
I would create one procedure to stage, sort, and paginate (using NTILE()) a staging table; and a second procedure to retrieve by page. This way you don't have to run the entire main query for each page.
This example queries AdventureWorks.HumanResources.Employee:
--------------------------------------------------------------------------
create procedure dbo.EmployeesByMartialStatus
@MaritalStatus nchar(1)
, @sort varchar(20)
as
-- Init staging table
if exists(
select 1 from sys.objects o
inner join sys.schemas s on s.schema_id=o.schema_id
and s.name='Staging'
and o.name='EmployeesByMartialStatus'
where type='U'
)
drop table Staging.EmployeesByMartialStatus;
-- Populate staging table with sort value
with s as (
select *
, sr=ROW_NUMBER()over(order by case @sort
when 'NationalIDNumber' then NationalIDNumber
when 'ManagerID' then ManagerID
-- plus any other sort conditions
else EmployeeID end)
from AdventureWorks.HumanResources.Employee
where MaritalStatus=@MaritalStatus
)
select *
into #temp
from s;
-- And now pages
declare @RowCount int; select @rowCount=COUNT(*) from #temp;
declare @PageCount int=ceiling(@rowCount/20.0); --assuming 20 lines/page; 20.0 avoids integer division truncating before CEILING
select *
, Page=NTILE(@PageCount)over(order by sr)
into Staging.EmployeesByMartialStatus
from #temp;
go
--------------------------------------------------------------------------
-- procedure to retrieve selected pages
create procedure EmployeesByMartialStatus_GetPage
@page int
as
declare @MaxPage int;
select @MaxPage=MAX(Page) from Staging.EmployeesByMartialStatus;
set @page=case when @page not between 1 and @MaxPage then 1 else @page end;
select EmployeeID,NationalIDNumber,ContactID,LoginID,ManagerID
, Title,BirthDate,MaritalStatus,Gender,HireDate,SalariedFlag,VacationHours,SickLeaveHours
, CurrentFlag,rowguid,ModifiedDate
from Staging.EmployeesByMartialStatus
where Page=@page
GO
--------------------------------------------------------------------------
-- Usage
-- Load staging
exec dbo.EmployeesByMartialStatus 'M','NationalIDNumber';
-- Get pages 1 through n
exec dbo.EmployeesByMartialStatus_GetPage 1;
exec dbo.EmployeesByMartialStatus_GetPage 2;
-- ...etc (this would actually be a foreach loop, but that detail is omitted for brevity)
GO
I use this method of using EXEC():
-- SP parameters:
-- @query: Your query as an input parameter
-- @maximumRows: As number of rows per page
-- @startPageIndex: As number of page to filter
-- @sortBy: As a field name or field names with supporting DESC keyword
DECLARE @query nvarchar(max) = 'SELECT * FROM sys.Objects',
@maximumRows int = 8,
@startPageIndex int = 3,
@sortBy as nvarchar(100) = 'name Desc'
SET @query = ';WITH CTE AS (' + @query + ')' +
'SELECT *, (dt.pagingRowNo - 1) / ' + CAST(@maximumRows as nvarchar(10)) + ' + 1 As pagingPageNo' +
', (pagingCountRow - 1) / ' + CAST(@maximumRows as nvarchar(10)) + ' + 1 As pagingCountPage ' +
', (dt.pagingRowNo - 1) % ' + CAST(@maximumRows as nvarchar(10)) + ' + 1 As pagingRowInPage ' +
'FROM ( SELECT *, ROW_NUMBER() OVER (ORDER BY ' + @sortBy + ') As pagingRowNo, COUNT(*) OVER () AS pagingCountRow ' +
'FROM CTE) dt ' +
'WHERE (dt.pagingRowNo - 1) / ' + CAST(@maximumRows as nvarchar(10)) + ' + 1 = ' + CAST(@startPageIndex as nvarchar(10))
EXEC(@query)
The result set contains the original query's columns followed by some extra paging columns, which you can remove if you don't need them:
pagingRowNo : The row number
pagingCountRow : The total number of rows
pagingPageNo : The current page number
pagingCountPage : The total number of pages
pagingRowInPage : The row number within the current page, starting at 1
