Select Range of Dates, Including Ones With No Results [duplicate] - sql-server

Possible Duplicate:
SQL Server: How to select all days in a date range even if no data exists for some days
I wasn't really sure how to word this question, but I'll try to explain. I'm trying to build some basic reporting using queries like the following:
SELECT COUNT(*) AS [count], h_date FROM (SELECT CONVERT(VARCHAR(10), h_time, 102) AS h_date FROM hits) AS h GROUP BY h_date ORDER BY h_date
This returns results like this, which I use to build a graph:
8 2012.05.06
2 2012.05.07
9 2012.05.09
As you can see, it's missing the 8th as there were no hits on that day. Is there a way to get a value of 0 for dates that have no results, or will I have to parse the results after the fact and add them manually?

You can use existing catalog views to derive a sequential range of dates between your start date and your end date. Then you can just left join to your data, and any missing dates will be there with 0s.
DECLARE @min SMALLDATETIME, @max SMALLDATETIME;
SELECT @min = MIN(h_time), @max = MAX(h_time)
FROM dbo.hits
-- WHERE ?
-- or if you just want a fixed range:
-- SELECT @min = '20120101', @max = '20120131';
;WITH n(d) AS
(
SELECT TOP (DATEDIFF(DAY, @min, @max)+1)
DATEADD(DAY, ROW_NUMBER() OVER (ORDER BY [object_id]) - 1, DATEDIFF(DAY, 0, @min))
FROM sys.all_objects ORDER BY [object_id]
)
SELECT n.d, [count] = COUNT(h.h_time)
FROM n
LEFT OUTER JOIN dbo.hits AS h
ON h.h_time >= n.d
AND h.h_time < DATEADD(DAY, 1, n.d)
-- AND --WHERE clause against hits?
GROUP BY n.d;

I've never been a big fan of using system tables to create dummy records to join against, but it's a very common approach.
I took Aaron Bertrand's answer and changed the Common Table Expression (CTE) to use a recursive one instead. It's quicker as it doesn't have to hit a table to do the query. Not that the previous version is slow anyway.
You need to specify "OPTION (MAXRECURSION 0);" otherwise it will limit the number of rows returned to the default (100). The value of 0 will return unlimited rows.
DECLARE @min SMALLDATETIME, @max SMALLDATETIME;
--SELECT @min = MIN(h_time), @max = MAX(h_time)
-- FROM dbo.hits
SELECT @min = '20120101', @max = '20121231';
WITH recursedate(each_date, date_index) AS
(
SELECT @min, 0
UNION ALL
SELECT DATEADD(DAY,date_index+1,@min), date_index+1
FROM recursedate
WHERE DATEADD(DAY,date_index+1,@min) <= @max
)
SELECT recursedate.each_date, [count] = COUNT(h.h_time)
FROM recursedate
LEFT OUTER JOIN dbo.hits AS h
ON --CONVERT(SMALLDATETIME,h.h_time) = recursedate.each_date
h.h_time >= recursedate.each_date
AND h.h_time < DATEADD(DAY, 1, recursedate.each_date)
-- AND --WHERE clause against hits?
GROUP BY recursedate.each_date
OPTION (MAXRECURSION 0); -- The default is 100 so you'll only get 100 dates, 0 is unlimited.

Related

How can I get, and re-use, the next 6 dates from today in SQL Server?

I need to create a CTE I can re-use that will hold seven dates. That is, today and the next six days.
So, output for today (4/22/2022) should be:
2022-04-22
2022-04-23
2022-04-24
2022-04-25
2022-04-26
2022-04-27
2022-04-28
So far, I have this:
WITH seq AS
(
SELECT 0 AS [idx]
UNION ALL
SELECT [idx] + 1
FROM seq
WHERE [idx] < 6
)
SELECT DATEADD(dd, [idx], CONVERT(date, GETDATE()))
FROM seq;
The problem is my SELECT is outside the WITH, so I would need to wrap this whole thing with another WITH to re-use it, for example to JOIN on it as a list of dates, and I'm not having luck getting that nested WITH to work. How else could I accomplish this?
To be clear: I'm not trying to find records in a specific table full of dates that are from the next seven days. There are plenty of easy solutions for that. I need a list of dates for today and the next six days, that I can re-use in other queries as a CTE.
You're close. Here's an example:
with cte as (
select
1 as n
,GETDATE() as dt
union all
select
n+1
,DATEADD(dd,n,GETDATE()) as dt
from cte
where n <= 6
)
select * from cte
You can create a view for reusability and simply query the view rather than using the same CTE over and over again.
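For example, a minimal sketch of that idea, wrapping the recursive CTE from the answer above in a view (the view name here is hypothetical). The recursion depth is only 6, so the default MAXRECURSION limit of 100 is never hit (which matters, because the OPTION hint is not allowed inside a view):
CREATE VIEW dbo.vw_Next7Days
AS
WITH cte AS (
SELECT 1 AS n, CAST(GETDATE() AS date) AS dt
UNION ALL
SELECT n + 1, DATEADD(dd, n, CAST(GETDATE() AS date))
FROM cte
WHERE n <= 6
)
SELECT dt FROM cte;
GO
-- then reuse it anywhere a list of the next seven dates is needed:
-- SELECT * FROM dbo.vw_Next7Days;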
You can do this by adding a second column for the date to the CTE:
WITH seq AS (
SELECT 0 AS [idx], cast(current_timestamp as date) as date
UNION ALL
SELECT [idx] + 1, dateadd(dd, idx+1, cast(current_timestamp as date))
FROM seq
WHERE [idx] < 6
)
SELECT *
FROM seq;
See it here:
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=208ecd76be2071529078f38b1735b0cd
Another option is you can "stack" CTEs, rather than nest, to avoid the second column:
WITH seq0 AS (
SELECT 0 AS [idx]
UNION ALL
SELECT [idx] + 1
FROM seq0
WHERE [idx] < 6
),
seq As (
SELECT dateadd(dd, idx, cast(current_timestamp as date)) as idx
FROM seq0
)
SELECT *
FROM seq;
Note how the final query only needed to reference the 2nd CTE.
See it here:
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=22315438e4710792f368009cc6ff6451
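For instance, here is a minimal sketch of reusing seq as a date list in a join (dbo.Orders, OrderID and OrderDate are hypothetical names, standing in for whatever table you need to join against):
WITH seq0 AS (
SELECT 0 AS [idx]
UNION ALL
SELECT [idx] + 1
FROM seq0
WHERE [idx] < 6
),
seq AS (
SELECT dateadd(dd, idx, cast(current_timestamp as date)) as dt
FROM seq0
)
SELECT seq.dt, COUNT(o.OrderID) AS orders
FROM seq
LEFT JOIN dbo.Orders AS o
ON o.OrderDate >= seq.dt
AND o.OrderDate < DATEADD(dd, 1, seq.dt)
GROUP BY seq.dt;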
I'd never recommend using recursion when you don't need it. It's more complex and slower. I'd just use a hardcoded list of numbers; you could encapsulate it in a TVF if you wanted to reuse it across different stored procedures/functions. If you need to reuse it in one stored proc in multiple places, I'd just throw it in a temp table (see the sketch after the TVF version below).
CTE Version without Recursion
WITH cte_7days AS (
SELECT theDate = CAST(DATEADD(dd,num,GETDATE()) AS DATE)
FROM (VALUES (0),(1),(2),(3),(4),(5),(6)) AS A(num)
)
SELECT *
FROM cte_7days
CROSS APPLY Version to Remove Need for CTE
You could use something like this as your base query, and then just add more joins to the table depending on your query:
SELECT theDate
FROM (VALUES (0),(1),(2),(3),(4),(5),(6)) AS A(num)
CROSS APPLY (SELECT theDate = CAST(DATEADD(DAY,num,GETDATE()) AS DATE)) AS B
TVF Version
CREATE FUNCTION dbo.uf_7days()
RETURNS TABLE AS
RETURN
(
SELECT theDate
FROM (VALUES (0),(1),(2),(3),(4),(5),(6)) AS A(num)
CROSS APPLY (SELECT theDate = CAST(DATEADD(DAY,num,GETDATE()) AS DATE)) AS B
)
GO
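And here is a quick sketch of the temp-table option mentioned earlier, for reuse in multiple places within one stored procedure (#Next7Days is a hypothetical name):
SELECT theDate = CAST(DATEADD(DAY, num, GETDATE()) AS DATE)
INTO #Next7Days
FROM (VALUES (0),(1),(2),(3),(4),(5),(6)) AS A(num);
-- ...then join to #Next7Days as many times as needed in the same procedure.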

SQL CPU Script - count consecutive occurrences of value

So I'm working on a SQL CPU utilization script that gets the last (for example) 10 minutes of CPU usage for a SQL instance, as available from sys.dm_os_ring_buffers - a pretty standard script.
However, what I want to do is grab this info but count the consecutive occurrences in the sample (i.e. 10 mins): if for 10 minutes (10 consecutive records) the value is > 90%, do X.
Here's the code I'm using (edited for correct code):
DECLARE @ts BIGINT;
DECLARE @lastNmin TINYINT;
SET @lastNmin = 10;
SELECT @ts = (SELECT cpu_ticks/(cpu_ticks/ms_ticks) FROM sys.dm_os_sys_info);
SELECT TOP(@lastNmin)
SQLProcessUtilization AS [SQLServer_CPU_Utilization],
SystemIdle AS [System_Idle_Process],
100 - SystemIdle - SQLProcessUtilization AS [Other_Process_CPU_Utilization],
DATEADD(ms, -1 * (@ts - [timestamp]), GETDATE()) AS [Event_Time]
FROM (SELECT record.value('(./Record/@id)[1]','int') AS record_id,
record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]','int') AS [SystemIdle],
record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]','int') AS [SQLProcessUtilization],
[timestamp]
FROM (SELECT [timestamp], CONVERT(xml, record) AS [record]
FROM sys.dm_os_ring_buffers
WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
AND record LIKE N'%<SystemHealth>%') AS x) AS y
ORDER BY record_id DESC;
Thanks
It sounds like you want a gaps and islands approach. Here's what I came up with:
DROP TABLE IF EXISTS #tmp;
DECLARE @ts BIGINT;
DECLARE @lastNmin TINYINT;
SET @lastNmin = 10;
SELECT @ts =
(
SELECT cpu_ticks / (cpu_ticks / ms_ticks) FROM sys.dm_os_sys_info
);
SELECT TOP (@lastNmin)
SQLProcessUtilization AS [SQLServer_CPU_Utilization],
SystemIdle AS [System_Idle_Process],
100 - SystemIdle - SQLProcessUtilization AS [Other_Process_CPU_Utilization],
DATEADD(ms, -1 * (@ts - [timestamp]), GETDATE()) AS [Event_Time]
INTO #tmp
FROM
(
SELECT record.value('(./Record/@id)[1]', 'int') AS record_id,
record.value('(./Record/SchedulerMonitorEvent/SystemHealth/SystemIdle)[1]', 'int') AS [SystemIdle],
record.value('(./Record/SchedulerMonitorEvent/SystemHealth/ProcessUtilization)[1]', 'int') AS [SQLProcessUtilization],
[timestamp]
FROM
(
SELECT [timestamp],
CONVERT(XML, record) AS [record]
FROM sys.dm_os_ring_buffers
WHERE ring_buffer_type = N'RING_BUFFER_SCHEDULER_MONITOR'
AND record LIKE N'%<SystemHealth>%'
) AS x
) AS y
ORDER BY record_id DESC;
WITH cte AS (
SELECT *, CAST(CASE WHEN [System_Idle_Process] >= 95 THEN 1 ELSE 0 END AS BIT) as [HighCPU]
FROM #tmp
),
GapsAndIslands AS (
SELECT *,
ROW_NUMBER() OVER (ORDER BY cte.Event_Time) AS rn1,
ROW_NUMBER() OVER (PARTITION BY cte.HighCPU ORDER BY cte.Event_Time) AS rn2
FROM cte
)
SELECT *, rn1 - rn2 AS GroupID
FROM GapsAndIslands
ORDER BY GapsAndIslands.Event_Time;
By way of explanation, I'm creating three synthetic columns:
a boolean representing the condition you're looking to track (NB - I'm using a different metric than you should because my CPU usage is low!)
a row number column across the entire data set
a row number column for each distinct value of the tracked metric
What makes this solution work is noting that the difference in those two row number columns will be the same for consecutive rows that have the same value for your tracked metric and will change on the boundaries. I've left that as GroupID in the final result set and you can use that to track groups of consecutive rows.
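For illustration (hypothetical rows, not actual output), suppose five samples with HighCPU values 0, 1, 1, 0, 1 in time order. The row numbers would line up like this:
Event_Time  HighCPU  rn1  rn2  GroupID (rn1 - rn2)
t1          0        1    1    0
t2          1        2    1    1
t3          1        3    2    1
t4          0        4    2    2
t5          1        5    3    2
GroupID stays at 1 across the island t2-t3, and t5 starts a new island; since t4 and t5 share a GroupID but differ in HighCPU, you group on rn1 - rn2 together with the HighCPU filter to separate the runs.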
If you instead replace that last select with this:
SELECT MIN(Event_Time), MAX(Event_Time)
FROM GapsAndIslands
WHERE [HighCPU] = 1
GROUP BY rn1 - rn2
ORDER BY MIN(Event_Time);
That will give you the time ranges for when the tracked metric was above threshold.

CTE slow performance on Left join

I need to provide a report that shows all users on a table and their scores. Not all users on said table will have a score, so in my solution I calculate the score first using a few CTEs, then in a final CTE I pull a full roster and assign a default score to users with no actual score.
While the CTEs are not overly complex, they are also not simple. Separately, when I run the calculation part of the CTEs for users with an actual score, it runs in less than a second. When I join to a final CTE that grabs the full roster and assigns default scores where the nulls appear (no actual score), the wheels completely fall off and it never completes.
I've experimented with switching up the indexes and refreshing them, to no avail. I have noticed that the join at agent_effectiveness, when switched to INNER, runs in one second, but I need it to be a LEFT join so it will pull in the whole roster even when no score is present.
EDIT*
Execution Plan Inner Join
Execution Plan Left Join
WITH agent_split_stats AS (
Select
racf,
agent_stats.SkillGroupSkillTargetID,
aht_target.EnterpriseName,
aht_target.target,
Sum(agent_stats.CallsHandled) as n_calls_handled,
CASE WHEN (Sum(agent_stats.TalkInTime) + Sum(agent_stats.IncomingCallsOnHoldTime) + Sum(agent_stats.WorkReadyTime)) = 0 THEN 1 ELSE
(Sum(agent_stats.TalkInTime) + Sum(agent_stats.IncomingCallsOnHoldTime) + Sum(agent_stats.WorkReadyTime)) END
AS total_handle_time
from tblAceyusAgntSklGrp as agent_stats
-- GET TARGETS
INNER JOIN tblCrosswalkWghtPhnEffTarget as aht_target
ON aht_target.SgId = agent_stats.SkillGroupSkillTargetID
AND agent_stats.DateTime BETWEEN aht_target.StartDt and aht_target.EndDt
-- GET RACF
INNER JOIN tblAgentMetricCrosswalk as xwalk
ON xwalk.SkillTargetID = agent_stats.SkillTargetID
--GET TAU DATA LIKE START DATE AND GRADUATED FLAG
INNER JOIN tblTauClassList AS T
ON T.SaRacf = racf
WHERE
--FILTERS BY A ROLLING 15 BUSINESS DAYS UNLESS THE DAYS BETWEEN CURRENT DATE AND TAU START DATE ARE <15
agent_stats.DateTime >=
CASE WHEN dbo.fn_WorkDaysAge(TauStart, GETDATE()) <15 THEN TauStart ELSE
dbo.fn_WorkDate15(TauStart)
END
And Graduated = 'No'
--WPE FILTERS TO ENSURE ACCURATE DATA
AND CallsHandled <> 0
AND Target is not null
Group By
racf, agent_stats.SkillGroupSkillTargetID, aht_target.EnterpriseName, aht_target.target
),
agent_split_stats_with_weight AS (
-- calculate weights
-- one row = one advocate + split
SELECT
agent_split_stats.*,
agent_split_stats.n_calls_handled/SUM(agent_split_stats.n_calls_handled) OVER(PARTITION BY agent_split_stats.racf) AS [weight]
FROM agent_split_stats
),
agent_split_effectiveness AS (
-- calculate the raw Effectiveness score for each eligible advocate/split
-- one row = one agent + split, with their raw Effectiveness score and the components of that
SELECT
agent_split_stats_with_weight.*,
-- these are the components of the Effectiveness score
(((agent_split_stats_with_weight.target * agent_split_stats_with_weight.n_calls_handled) / agent_split_stats_with_weight.total_handle_time)*100)*agent_split_stats_with_weight.weight AS effectiveness_sum
FROM agent_split_stats_with_weight
), -- this is where we show effectiveness per split select * from agent_split_effectiveness
agent_effectiveness AS (
-- sum all of the individual effectiveness raw scores for each agent to get each agent's raw score
SELECT
racf AS SaRacf,
ROUND(SUM(effectiveness_sum),2) AS WpeScore
FROM agent_split_effectiveness
GROUP BY racf
),
--GET FULL CLASS LIST, TAU DATES, GOALS FOR WHOLE CLASS
tau AS (
Select L.SaRacf, TauStart, Goal as WpeGoal
,CASE WHEN agent_effectiveness.WpeScore IS NULL THEN 1 ELSE WpeScore END as WpeScore
FROM tblTauClassList AS L
LEFT JOIN agent_effectiveness
ON agent_effectiveness.SaRacf = L.SaRacf
LEFT JOIN tblCrosswalkTauGoal AS G
ON G.Year = TauYear
AND G.Bucket = 'Wpe'
WHERE TermDate IS NULL
AND Graduated = 'No'
)
SELECT tau.*,
CASE WHEN dbo.fn_WorkDaysAge(TauStart, GETDATE()) > 14 --MUST BE AT LEAST 15 DAYS TO PASS
AND WpeScore >= WpeGoal THEN 'Pass'
ELSE 'Fail' END
from tau
This style of query runs fine for 3 other calculation types (different score types), so I am unsure why it's failing so badly here. Actual results should be a list of individuals, a date, a score, and a goal. When no score exists, a default score is provided. Additionally, there is a pass/fail metric using the score/goal.
As @Habo mentioned, we need the actual execution plan (e.g. run the query with "Include Actual Execution Plan" turned on). I looked over what you posted and there is nothing there that will explain the problem. The difference between the actual plan and the estimated plan is that the actual number of rows retrieved is recorded; this is vital for troubleshooting poorly performing queries.
That said, I do see a HUGE problem with both queries. It's a problem that, once fixed, will improve both queries to less than a second. Your query is leveraging two scalar user-defined functions (UDFs): dbo.fn_WorkDaysAge & dbo.fn_WorkDate15. Scalar UDFs ruin everything. Not only are they slow, they force a serial execution plan, which makes any query they are used in much slower.
I don't have the code for dbo.fn_WorkDaysAge or dbo.fn_WorkDate15, but I have my own "WorkDays" function which is inline (code below). The syntax is a little different but the performance benefits are worth the effort. Here's the syntax difference:
-- Scalar
SELECT d.*, workDays = dbo.countWorkDays_scalar(d.StartDate,d.EndDate)
FROM <sometable> AS d;
-- Inline version
SELECT d.*, f.workDays
FROM <sometable> AS d
CROSS APPLY dbo.countWorkDays(d.StartDate,d.EndDate) AS f;
Here's a performance test I put together to show the difference between an inline version vs the scalar version:
-- SAMPLE DATA
IF OBJECT_ID('tempdb..#dates') IS NOT NULL DROP TABLE #dates;
WITH E1(x) AS (SELECT 1 FROM (VALUES(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS x(x)),
E3(x) AS (SELECT 1 FROM E1 a, E1 b, E1 c),
iTally AS (SELECT N=ROW_NUMBER() OVER (ORDER BY (SELECT 1)) FROM E3 a, E3 b)
SELECT TOP (100000)
StartDate = CAST(DATEADD(DAY,-ABS(CHECKSUM(NEWID())%1000),GETDATE()) AS DATE),
EndDate = CAST(DATEADD(DAY,+ABS(CHECKSUM(NEWID())%1000),GETDATE()) AS DATE)
INTO #dates
FROM iTally;
-- PERFORMANCE TESTS
PRINT CHAR(10)+'Scalar Version (always serial):'+CHAR(10)+REPLICATE('-',60);
GO
DECLARE @st DATETIME = GETDATE(), @workdays INT;
SELECT @workdays = dbo.countWorkDays_scalar(d.StartDate,d.EndDate)
FROM #dates AS d;
PRINT DATEDIFF(MS,@st,GETDATE());
GO 3
PRINT CHAR(10)+'Inline Version:'+CHAR(10)+REPLICATE('-',60);
GO
DECLARE @st DATETIME = GETDATE(), @workdays INT;
SELECT @workdays = f.workDays
FROM #dates AS d
CROSS APPLY dbo.countWorkDays(d.StartDate,d.EndDate) AS f;
PRINT DATEDIFF(MS,@st,GETDATE());
GO 3
Results:
Scalar Version (always serial):
------------------------------------------------------------
Beginning execution loop
380
363
350
Batch execution completed 3 times.
Inline Version:
------------------------------------------------------------
Beginning execution loop
47
47
46
Batch execution completed 3 times.
As you can see, the inline version is about 8 times faster than the scalar version. Replacing those scalar UDFs with an inline version will almost certainly speed this query up regardless of join type.
Other problems I see include:
I see a lot of Index scans, this is a sign you need more filtering and/or better indexes.
dbo.tblCrosswalkWghtPhnEffTarget does not have any indexes, which means it will always be scanned. A sketch of a possible index follows.
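Based purely on the join and select columns visible in the query above, an index along these lines would give the optimizer something to seek on (the exact column choice is an assumption from the posted query; adjust it to your workload):
CREATE NONCLUSTERED INDEX IX_WghtPhnEffTarget_SgId
ON dbo.tblCrosswalkWghtPhnEffTarget (SgId, StartDt, EndDt)
INCLUDE (EnterpriseName, target);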
Functions used for performance test:
-- INLINE VERSION
----------------------------------------------------------------------------------------------
IF OBJECT_ID('dbo.countWorkDays') IS NOT NULL DROP FUNCTION dbo.countWorkDays;
GO
CREATE FUNCTION dbo.countWorkDays (@startDate DATETIME, @endDate DATETIME)
/*****************************************************************************************
[Purpose]:
Calculates the number of business days between two dates (Mon-Fri), excluding weekends.
dates.countWorkDays does not take holidays into consideration; for this you would need a
separate "holiday table" to perform an antijoin against.
The idea is based on the solution in this article:
https://www.sqlservercentral.com/Forums/Topic153606.aspx?PageIndex=16
[Author]:
Alan Burstein
[Compatibility]:
SQL Server 2005+
[Syntax]:
--===== Autonomous
SELECT f.workDays
FROM dates.countWorkDays(@startdate, @enddate) AS f;
--===== Against a table using APPLY
SELECT t.col1, t.col2, f.workDays
FROM dbo.someTable t
CROSS APPLY dates.countWorkDays(t.col1, t.col2) AS f;
[Parameters]:
@startDate = datetime; first date to compare
@endDate = datetime; date to compare @startDate to
[Returns]:
Inline Table Valued Function returns:
workDays = int; number of work days between @startdate and @enddate
[Dependencies]:
N/A
[Developer Notes]:
1. Returns NULL when either input parameter is NULL.
2. This function is what is referred to as an "inline" scalar UDF. Technically it's an
inline table valued function (iTVF) but performs the same task as a scalar valued user
defined function (UDF); the difference is that it requires the APPLY table operator
to accept column values as a parameter. For more about "inline" scalar UDFs see this
article by SQL MVP Jeff Moden: http://www.sqlservercentral.com/articles/T-SQL/91724/
and for more about how to use APPLY see the this article by SQL MVP Paul White:
http://www.sqlservercentral.com/articles/APPLY/69953/.
Note the above syntax example and usage examples below to better understand how to
use the function. Although the function is slightly more complicated to use than a
scalar UDF it will yield notably better performance for many reasons. For example,
unlike a scalar UDFs or multi-line table valued functions, the inline scalar UDF does
not restrict the query optimizer's ability generate a parallel query execution plan.
3. dates.countWorkDays requires that @enddate be equal to or later than @startDate. Otherwise
a NULL is returned.
4. dates.countWorkDays is NOT deterministic. For more deterministic functions see:
https://msdn.microsoft.com/en-us/library/ms178091.aspx
[Examples]:
--===== 1. Basic Use
SELECT f.workDays
FROM dates.countWorkDays('20180608', '20180611') AS f;
---------------------------------------------------------------------------------------
[Revision History]:
Rev 00 - 20180625 - Initial Creation - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT workDays =
-- If @startDate or @endDate is NULL then return NULL
CASE WHEN SIGN(DATEDIFF(dd, @startDate, @endDate)) > -1 THEN
(DATEDIFF(dd, @startDate, @endDate) + 1) --total days including weekends
-(DATEDIFF(wk, @startDate, @endDate) * 2) --Subtract 2 days for each full weekend
-- Subtract 1 when startDate is Sunday and subtract 1 when endDate is Saturday:
-(CASE WHEN DATENAME(dw, @startDate) = 'Sunday' THEN 1 ELSE 0 END)
-(CASE WHEN DATENAME(dw, @endDate) = 'Saturday' THEN 1 ELSE 0 END)
END;
GO
-- SCALAR VERSION
----------------------------------------------------------------------------------------------
IF OBJECT_ID('dbo.countWorkDays_scalar') IS NOT NULL DROP FUNCTION dbo.countWorkDays_scalar;
GO
CREATE FUNCTION dbo.countWorkDays_scalar (@startDate DATETIME, @endDate DATETIME)
RETURNS INT WITH SCHEMABINDING AS
BEGIN
RETURN
(
SELECT workDays =
-- If @startDate or @endDate is NULL then return NULL
CASE WHEN SIGN(DATEDIFF(dd, @startDate, @endDate)) > -1 THEN
(DATEDIFF(dd, @startDate, @endDate) + 1) --total days including weekends
-(DATEDIFF(wk, @startDate, @endDate) * 2) --Subtract 2 days for each full weekend
-- Subtract 1 when startDate is Sunday and subtract 1 when endDate is Saturday:
-(CASE WHEN DATENAME(dw, @startDate) = 'Sunday' THEN 1 ELSE 0 END)
-(CASE WHEN DATENAME(dw, @endDate) = 'Saturday' THEN 1 ELSE 0 END)
END
);
END
GO
UPDATE BASED ON OP'S QUESTION IN THE COMMENTS:
First, here are the inline table valued function versions of each function. Note that I'm using my own tables and don't have time to make the names match your environment, but I did my best to include comments in the code. Also note that if, in your function, workingday = '1' simply pulls weekdays, then you'll find my function above to be a much faster alternative to your dbo.fn_WorkDaysAge function. If workingday = '1' also filters out holidays then it won't work.
CREATE FUNCTION dbo.fn_WorkDaysAge_itvf
(
@first_date DATETIME,
@second_date DATETIME
)
RETURNS TABLE AS RETURN
SELECT WorkDays = COUNT(*)
FROM dbo.dimdate -- DateDimension
WHERE DateValue -- [date]
BETWEEN @first_date AND @second_date
AND IsWeekend = 0 --workingday = '1'
GO
CREATE FUNCTION dbo.fn_WorkDate15_itvf
(
@TauStartDate DATETIME
)
RETURNS TABLE AS RETURN
WITH DATES AS
(
SELECT
ROW_NUMBER() OVER(Order By DateValue Desc) as RowNum, DateValue
FROM dbo.dimdate -- DateDimension
WHERE DateValue BETWEEN @TauStartDate AND --GETDATE() testing below
CASE WHEN GETDATE() < @TauStartDate + 200 THEN GETDATE() ELSE @TauStartDate + 200 END
AND IsWeekend = 0 --workingday = '1'
)
--Get the 15th business day from the current date
SELECT DateValue
FROM DATES
WHERE RowNum = 16;
GO
Now, to replace your scalar UDFs with the inline table valued functions, you would do this (note my comments):
WITH agent_split_stats AS (
Select
racf,
agent_stats.SkillGroupSkillTargetID,
aht_target.EnterpriseName,
aht_target.target,
Sum(agent_stats.CallsHandled) as n_calls_handled,
CASE WHEN (Sum(agent_stats.TalkInTime) + Sum(agent_stats.IncomingCallsOnHoldTime) + Sum(agent_stats.WorkReadyTime)) = 0 THEN 1 ELSE
(Sum(agent_stats.TalkInTime) + Sum(agent_stats.IncomingCallsOnHoldTime) + Sum(agent_stats.WorkReadyTime)) END
AS total_handle_time
from tblAceyusAgntSklGrp as agent_stats
INNER JOIN tblCrosswalkWghtPhnEffTarget as aht_target
ON aht_target.SgId = agent_stats.SkillGroupSkillTargetID
AND agent_stats.DateTime BETWEEN aht_target.StartDt and aht_target.EndDt
INNER JOIN tblAgentMetricCrosswalk as xwalk
ON xwalk.SkillTargetID = agent_stats.SkillTargetID
INNER JOIN tblTauClassList AS T
ON T.SaRacf = racf
-- INLINE FUNCTIONS HERE:
CROSS APPLY dbo.fn_WorkDaysAge_itvf(TauStart, GETDATE()) AS wd
CROSS APPLY dbo.fn_WorkDate15_itvf(TauStart) AS w15
-- NEW WHERE CLAUSE:
WHERE agent_stats.DateTime >=
CASE WHEN wd.WorkDays < 15 THEN TauStart ELSE w15.DateValue END
And Graduated = 'No'
AND CallsHandled <> 0
AND Target is not null
Group By
racf, agent_stats.SkillGroupSkillTargetID, aht_target.EnterpriseName, aht_target.target
),
agent_split_stats_with_weight AS (
SELECT
agent_split_stats.*,
agent_split_stats.n_calls_handled/SUM(agent_split_stats.n_calls_handled) OVER(PARTITION BY agent_split_stats.racf) AS [weight]
FROM agent_split_stats
),
agent_split_effectiveness AS
(
SELECT
agent_split_stats_with_weight.*,
(((agent_split_stats_with_weight.target * agent_split_stats_with_weight.n_calls_handled) /
agent_split_stats_with_weight.total_handle_time)*100)*
agent_split_stats_with_weight.weight AS effectiveness_sum
FROM agent_split_stats_with_weight
),
agent_effectiveness AS
(
SELECT
racf AS SaRacf,
ROUND(SUM(effectiveness_sum),2) AS WpeScore
FROM agent_split_effectiveness
GROUP BY racf
),
tau AS
(
SELECT L.SaRacf, TauStart, Goal as WpeGoal
,CASE WHEN agent_effectiveness.WpeScore IS NULL THEN 1 ELSE WpeScore END as WpeScore
FROM tblTauClassList AS L
LEFT JOIN agent_effectiveness
ON agent_effectiveness.SaRacf = L.SaRacf
LEFT JOIN tblCrosswalkTauGoal AS G
ON G.Year = TauYear
AND G.Bucket = 'Wpe'
WHERE TermDate IS NULL
AND Graduated = 'No'
)
SELECT tau.*,
-- NEW CASE STATEMENT HERE:
CASE WHEN wd.workdays > 14 AND WpeScore >= WpeGoal THEN 'Pass' ELSE 'Fail' END
from tau
-- INLINE FUNCTIONS HERE:
CROSS APPLY dbo.fn_WorkDaysAge_itvf(TauStart, GETDATE()) AS wd
CROSS APPLY dbo.fn_WorkDate15_itvf(TauStart) AS w15;
Note that I can't test this right now, but it should be correct (or close).
UPDATE
I accepted Alan's answer; I ended up doing the following. Posting examples hoping the formatting helps someone - it slowed me down a bit... or maybe I am just slow, heh heh.
1. Changed my Scalar UDF to InlineTVF
SCALAR Function 1-
ALTER FUNCTION [dbo].[fn_WorkDaysAge]
(
-- Add the parameters for the function here
@first_date DATETIME,
@second_date DATETIME
)
RETURNS int
AS
BEGIN
-- Declare the return variable here
DECLARE @WorkDays int
-- Add the T-SQL statements to compute the return value here
SELECT @WorkDays = COUNT(*)
FROM DateDimension
WHERE Date BETWEEN @first_date AND @second_date
AND workingday = '1'
-- Return the result of the function
RETURN @WorkDays
END
iTVF function 1-
ALTER FUNCTION [dbo].[fn_iTVF_WorkDaysAge]
(
-- Add the parameters for the function here
@FirstDate as Date,
@SecondDate as Date
)
RETURNS TABLE AS RETURN
SELECT WorkDays = COUNT(*)
FROM DateDimension
WHERE Date BETWEEN @FirstDate AND @SecondDate
AND workingday = '1'
I then updated my next function the same way. I added the CROSS APPLY (something I've personally not used; I'm still a newbie) as indicated below and replaced the UDFs with the field names in my CASE statement.
Old Code
INNER JOIN tblTauClassList AS T
ON T.SaRacf = racf
WHERE
--FILTERS BY A ROLLING 15 BUSINESS DAYS UNLESS THE DAYS BETWEEN CURRENT DATE AND TAU START DATE ARE <15
agent_stats.DateTime >=
CASE WHEN dbo.fn_WorkDaysAge(TauStart, GETDATE()) <15 THEN TauStart ELSE
dbo.fn_WorkDate15(TauStart)
END
New Code
INNER JOIN tblTauClassList AS T
ON T.SaRacf = racf
--iTVFs
CROSS APPLY dbo.fn_iTVF_WorkDaysAge(TauStart, GETDATE()) as age
CROSS APPLY dbo.fn_iTVF_WorkDate_15(TauStart) as roll
WHERE
--FILTERS BY A ROLLING 15 BUSINESS DAYS UNLESS THE DAYS BETWEEN CURRENT DATE AND TAU START DATE ARE <15
agent_stats.DateTime >=
CASE WHEN age.WorkDays <15 THEN TauStart ELSE
roll.Date
END
New code runs in 3-4 seconds. I will go back and index the appropriate tables per your recommendation and probably gain more efficiency there.
Cannot thank you enough!

SQL Server 2008 : create data where there is none?

I am struggling to create data where there is no data. I have a query that returns similar data to this:
This table shows that for client 123 we had payments in June, July and December only.
The only notable item in the query is a DATEDIFF between the month opened and MonthPayment (this gives the mnth column).
Now, where I'm falling over is that I need to create the above columns for all the months that have passed, regardless, like below.
Ignore the month payment field for the non-paying months; it shouldn't return anything!
What you need to do is connect this table to a list of all values you might want, as in the question pointed to by Zohar Peled. Your case is slightly complicated, since presumably you need to be able to return multiple clients at a time and only want to see data that pertains to that client's start and end range. I've adapted code from a similar answer I posted some time ago, which should show you how this is done.
-- set up some sample data
DECLARE @MyTable TABLE
(
ClientNo INT,
Collected NUMERIC(5,2),
MonthPayment DATETIME,
MonthOpened DATETIME,
CurrentMonth DATETIME
)
INSERT INTO @MyTable
(
ClientNo,
Collected,
MonthPayment,
MonthOpened,
CurrentMonth
) -- note: I'm in the US, so I'm using the US equivalent of the months you asked for
SELECT 123, 147.25, '7/1/2014', '12/1/2013', '4/1/2015'
UNION
SELECT 123, 40, '12/1/2014', '12/1/2013', '4/1/2015'
UNION
SELECT 123, 50, '6/1/2014', '12/1/2013', '4/1/2015'
-- create a recursive CTE that contains a list of all months that you could wish to see
--define start and end limits
Declare @todate datetime, @fromdate datetime
Select @fromdate=(SELECT MIN(MonthOpened) FROM @MyTable), @todate=DATEADD(MONTH, 1, GETDATE())
-- define CTE
;With DateSequence( DateValue ) as
(
Select @fromdate as DateValue
union all
Select dateadd(MONTH, 1, DateValue)
from DateSequence
where DateValue < @todate
)
--select result
SELECT
ClientStartEnd.ClientNo,
ISNULL(MyTable.Collected, 0.00) AS Collected,
DateSequence.DateValue AS MonthPayment,
ClientStartEnd.MonthOpened,
DATEDIFF(MONTH, ClientStartEnd.MonthOpened, DateSequence.DateValue) + 1 AS Mnth,
ClientStartEnd.CurrentMonth
FROM
DateSequence
INNER JOIN
(
SELECT DISTINCT
ClientNo,
MonthOpened,
CurrentMonth
FROM @MyTable
) ClientStartEnd ON
DateSequence.DateValue BETWEEN
ClientStartEnd.MonthOpened AND
ClientStartEnd.CurrentMonth
LEFT JOIN
@MyTable MyTable ON
ClientStartEnd.ClientNo = MyTable.ClientNo AND
DateSequence.DateValue = MyTable.MonthPayment
OPTION (MaxRecursion 0)

Function to Calculate Median in SQL Server

According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the Create Aggregate function, user defined function, or some other method).
What would be the best way (if possible) to do this - allow for the calculation of a median value (assuming a numeric data type) in an aggregate query?
If you're using SQL 2005 or better this is a nice, simple-ish median calculation for a single column in a table:
SELECT
(
(SELECT MAX(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
+
(SELECT MIN(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2 AS Median
2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved the query optimizer, which may affect the performance of various median solutions. Net-net, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median
This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries which may not be practical in all cases. It also requires SQL 2012 or later.
DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);
SELECT AVG(1.0 * val)
FROM (
SELECT val FROM dbo.EvenRows
ORDER BY val
OFFSET (@c - 1) / 2 ROWS
FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
Of course, just because one test on one schema in 2012 yielded great results, your mileage may vary, especially if you're on SQL Server 2014 or later. If perf is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.
I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I verified its performance vs. other solutions.
ORIGINAL 2009 POST IS BELOW:
There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from Medians, ROW_NUMBERs, and performance. This is a particularly optimal solution when it comes to actual I/Os generated during execution – it looks more costly than other solutions, but it is actually much faster.
That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.
As with all database performance scenarios, always try to test a solution out with real data on real hardware – you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally-speedy solution slower.
SELECT
CustomerId,
AVG(TotalDue)
FROM
(
SELECT
CustomerId,
TotalDue,
-- SalesOrderId in the ORDER BY is a disambiguator to break ties
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
FROM Sales.SalesOrderHeader SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
In SQL Server 2012 you should use PERCENTILE_CONT:
SELECT SalesOrderID, OrderQty,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianCont
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
See also : http://blog.sqlauthority.com/2011/11/20/sql-server-introduction-to-percentile_cont-analytic-functions-introduced-in-sql-server-2012/
My original quick answer was:
select max(my_column) as [my_column], quartile
from (select my_column, ntile(4) over (order by my_column) as [quartile]
from my_table) i
--where quartile = 2
group by quartile
This will give you the median and interquartile range in one fell swoop. If you really only want one row that is the median then uncomment the where clause.
When you stick that into an explain plan, 60% of the work is sorting the data which is unavoidable when calculating position dependent statistics like this.
I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below:
;with PartitionedData as
(select my_column, ntile(10) over (order by my_column) as [percentile]
from my_table),
MinimaAndMaxima as
(select min(my_column) as [low], max(my_column) as [high], percentile
from PartitionedData
group by percentile)
select
case
when b.percentile = 10 then cast(b.high as decimal(18,2))
else cast((a.low + b.high) as decimal(18,2)) / 2
end as [value], --b.high, a.low,
b.percentile
from MinimaAndMaxima a
join MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10)
--where b.percentile = 5
This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.
Even better:
SELECT @Median = AVG(1.0 * val)
FROM
(
SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c
FROM dbo.EvenRows AS o
CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2);
From the master Himself, Itzik Ben-Gan!
MS SQL Server 2012 (and later) has the PERCENTILE_DISC function which computes a specific percentile for sorted values. PERCENTILE_DISC (0.5) will compute the median - https://msdn.microsoft.com/en-us/library/hh231327.aspx
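For example, a minimal sketch using the dbo.EvenRows/val names from the earlier examples:
SELECT DISTINCT
PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY val) OVER () AS Median
FROM dbo.EvenRows;
(The DISTINCT is needed because PERCENTILE_DISC is a window function and repeats the same value on every row.)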
Simple, fast, accurate
SELECT x.Amount
FROM (SELECT amount,
Count(1) OVER (partition BY 'A') AS TotalRows,
Row_number() OVER (ORDER BY Amount ASC) AS AmountOrder
FROM facttransaction ft) x
WHERE x.AmountOrder = Round(x.TotalRows / 2.0, 0)
If you want to use the CREATE AGGREGATE function in SQL Server, this is how to do it. Doing it this way has the benefit of being able to write clean queries. Note that this process could be adapted to calculate a percentile value fairly easily.
Create a new Visual Studio project and set the target framework to .NET 3.5 (this is for SQL 2008; it may be different in SQL 2012). Then create a class file and put in the following code, or the C# equivalent:
Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.IO
<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
Implements IBinarySerialize
Private _items As List(Of Decimal)
Public Sub Init()
_items = New List(Of Decimal)()
End Sub
Public Sub Accumulate(value As SqlDecimal)
If Not value.IsNull Then
_items.Add(value.Value)
End If
End Sub
Public Sub Merge(other As Median)
If other._items IsNot Nothing Then
_items.AddRange(other._items)
End If
End Sub
Public Function Terminate() As SqlDecimal
If _items.Count <> 0 Then
Dim result As Decimal
_items = _items.OrderBy(Function(i) i).ToList()
If _items.Count Mod 2 = 0 Then
result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2#
Else
result = _items((_items.Count - 1) / 2)
End If
Return New SqlDecimal(result)
Else
Return New SqlDecimal()
End If
End Function
Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
'deserialize it from a string
Dim list = r.ReadString()
_items = New List(Of Decimal)
For Each value In list.Split(","c)
Dim number As Decimal
If Decimal.TryParse(value, number) Then
_items.Add(number)
End If
Next
End Sub
Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
'serialize the list to a string
Dim list = ""
For Each item In _items
If list <> "" Then
list += ","
End If
list += item.ToString()
Next
w.Write(list)
End Sub
End Class
Then compile it and copy the DLL and PDB file to your SQL Server machine and run the following command in SQL Server:
CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO
CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3)
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO
You can then write a query to calculate the median like this:
SELECT dbo.Median(Field) FROM Table
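Because it is a true aggregate, it also composes naturally with GROUP BY; for instance (table and column names here are hypothetical):
SELECT Category, dbo.Median(Amount) AS MedianAmount
FROM dbo.Sales
GROUP BY Category;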
I just came across this page while looking for a set-based solution to median. After looking at some of the solutions here, I came up with the following. Hope it helps/works.
DECLARE @test TABLE(
i int identity(1,1),
id int,
score float
)
INSERT INTO @test (id,score) VALUES (1,10)
INSERT INTO @test (id,score) VALUES (1,11)
INSERT INTO @test (id,score) VALUES (1,15)
INSERT INTO @test (id,score) VALUES (1,19)
INSERT INTO @test (id,score) VALUES (1,20)
INSERT INTO @test (id,score) VALUES (2,20)
INSERT INTO @test (id,score) VALUES (2,21)
INSERT INTO @test (id,score) VALUES (2,25)
INSERT INTO @test (id,score) VALUES (2,29)
INSERT INTO @test (id,score) VALUES (2,30)
INSERT INTO @test (id,score) VALUES (3,20)
INSERT INTO @test (id,score) VALUES (3,21)
INSERT INTO @test (id,score) VALUES (3,25)
INSERT INTO @test (id,score) VALUES (3,29)
DECLARE @counts TABLE(
id int,
cnt int
)
INSERT INTO @counts (
id,
cnt
)
SELECT
id,
COUNT(*)
FROM
@test
GROUP BY
id
SELECT
drv.id,
drv.start,
AVG(t.score)
FROM
(
SELECT
MIN(t.i)-1 AS start,
t.id
FROM
@test t
GROUP BY
t.id
) drv
INNER JOIN @test t ON drv.id = t.id
INNER JOIN @counts c ON t.id = c.id
WHERE
t.i = ((c.cnt+1)/2)+drv.start
OR (
t.i = (((c.cnt+1)%2) * ((c.cnt+2)/2))+drv.start
AND ((c.cnt+1)%2) * ((c.cnt+2)/2) <> 0
)
GROUP BY
drv.id,
drv.start
The following query returns the median from a list of values in one column. It cannot be used as or along with an aggregate function, but you can still use it as a sub-query with a WHERE clause in the inner select.
SQL Server 2005+:
SELECT TOP 1 value from
(
SELECT TOP 50 PERCENT value
FROM table_name
ORDER BY value
) for_median
ORDER BY value DESC
Although Justin Grant's solution appears solid, I found that when you have a number of duplicate values within a given partition key, the row numbers for the ASC duplicate values end up out of sequence, so they do not properly align.
Here is a fragment from my result:
KEY VALUE ROWA ROWD
13 2 22 182
13 1 6 183
13 1 7 184
13 1 8 185
13 1 9 186
13 1 10 187
13 1 11 188
13 1 12 189
13 0 1 190
13 0 2 191
13 0 3 192
13 0 4 193
13 0 5 194
I used Justin's code as the basis for this solution. Although it is not as efficient, given the use of multiple derived tables, it does resolve the row-ordering problem I encountered. Any improvements would be welcome, as I am not that experienced in T-SQL.
SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
SELECT PKEY,VALUE,ROWA,ROWD,
'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
FROM
(
SELECT
PKEY,
cast(VALUE as decimal(5,2)) as VALUE,
ROWA,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD
FROM
(
SELECT
PKEY,
VALUE,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA
FROM [MTEST]
)T1
)T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY
In a UDF, write:
Select Top 1 medianSortColumn from Table T
Where (Select Count(*) from Table
Where medianSortColumn < T.medianSortColumn) =
(Select Count(*) From Table) / 2
Order By medianSortColumn
Justin's example above is very good, but the need for that primary key should be stated very clearly. I have seen that code in the wild without the key and the results are bad.
The complaint I get about PERCENTILE_CONT is that it won't give you an actual value from the dataset.
To get to a "median" that is an actual value from the dataset use Percentile_Disc.
SELECT SalesOrderID, OrderQty,
PERCENTILE_DISC(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianCont
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
Using a single statement - one way is to use the ROW_NUMBER() and COUNT() window functions and filter the sub-query. Here is how to find the median salary:
SELECT AVG(e_salary)
FROM
(SELECT
ROW_NUMBER() OVER(ORDER BY e_salary) as row_no,
e_salary,
(COUNT(*) OVER()+1)*0.5 AS row_half
FROM Employee) t
WHERE row_no IN (FLOOR(row_half),CEILING(row_half))
I have seen similar solutions over the net using FLOOR and CEILING, but I tried to use a single statement.
Median Finding
This is the simplest method to find the median of an attribute.
Select round(S.salary,4) median from employee S
where (select count(salary) from employee
where salary < S.salary ) = (select count(salary) from employee
where salary > S.salary)
See other solutions for median calculation in SQL here:
"Simple way to calculate median with MySQL" (the solutions are mostly vendor-independent).
Building on Jeff Atwood's answer above, here it is with GROUP BY and a correlated subquery to get the median for each group.
SELECT TestID,
(
(SELECT MAX(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score) AS BottomHalf)
+
(SELECT MIN(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score DESC) AS TopHalf)
) / 2 AS MedianScore,
AVG(Score) AS AvgScore, MIN(Score) AS MinScore, MAX(Score) AS MaxScore
FROM Posts_parent
GROUP BY Posts_parent.TestID
For a continuous variable/measure 'col1' from 'table1'
select col1
from
(select top 50 percent col1,
ROW_NUMBER() OVER(ORDER BY col1 ASC) AS Rowa,
ROW_NUMBER() OVER(ORDER BY col1 DESC) AS Rowd
from table1 ) tmp
where tmp.Rowa = tmp.Rowd
Frequently, we may need to calculate the median not just for the whole table, but for aggregates with respect to some ID. In other words, calculate the median for each ID in our table, where each ID has many records. (Based on the solution edited by @gdoron: good performance and it works in many SQL dialects.)
SELECT our_id, AVG(1.0 * our_val) as Median
FROM
( SELECT our_id, our_val,
COUNT(*) OVER (PARTITION BY our_id) AS cnt,
ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rnk
FROM our_table
) AS x
WHERE rnk IN ((cnt + 1)/2, (cnt + 2)/2) GROUP BY our_id;
Hope it helps.
For large scale datasets, you can try this GIST:
https://gist.github.com/chrisknoll/1b38761ce8c5016ec5b2
It works by aggregating the distinct values you would find in your set (such as ages, or year of birth, etc.), and uses SQL window functions to locate any percentile position you specify in the query.
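Here is a minimal sketch of that idea, assuming a hypothetical dbo.People table with an age column:
SELECT TOP (1) age AS MedianAge
FROM (
SELECT age,
SUM(COUNT(*)) OVER (ORDER BY age) AS running_cnt, -- cumulative count across distinct ages
SUM(COUNT(*)) OVER () AS total_cnt
FROM dbo.People
GROUP BY age
) AS t
WHERE running_cnt >= total_cnt * 0.5 -- first distinct value at or beyond the 50th percentile
ORDER BY age;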
To get the median value of salary from the employee table:
with cte as (select salary, ROW_NUMBER() over (order by salary asc) as num from employees)
select avg(salary) from cte where num in ((select (count(*)+1)/2 from employees), (select (count(*)+2)/2 from employees));
I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P
DECLARE @table AS TABLE
(
Number int not null
);
insert into @table select 2;
insert into @table select 4;
insert into @table select 9;
insert into @table select 15;
insert into @table select 22;
insert into @table select 26;
insert into @table select 37;
insert into @table select 49;
DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;
WITH MyResults(RowNo, Number) AS
(
SELECT RowNo, Number FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
)
SELECT AVG(Number) FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
--Create Temp Table to Store Results in
DECLARE @results AS TABLE
(
[Month] datetime not null
,[Median] int not null
);
--This variable will determine the date
DECLARE @IntDate as int
set @IntDate = -13
WHILE (@IntDate < 0)
BEGIN
--Create Temp Table
DECLARE @table AS TABLE
(
[Rank] int not null
,[Days Open] int not null
);
--Insert records into Temp Table
insert into @table
SELECT
rank() OVER (ORDER BY DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0), DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')),[SVR].[ref_num]) as [Rank]
,DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')) as [Days Open]
FROM
mdbrpt.dbo.View_Request SVR
LEFT OUTER JOIN dbo.dtv_apps_systems vapp
on SVR.category = vapp.persid
LEFT OUTER JOIN dbo.prob_ctg pctg
on SVR.category = pctg.persid
Left Outer Join [mdbrpt].[dbo].[rootcause] as [Root Cause]
on [SVR].[rootcause]=[Root Cause].[id]
Left Outer Join [mdbrpt].[dbo].[cr_stat] as [Status]
on [SVR].[status]=[Status].[code]
LEFT OUTER JOIN [mdbrpt].[dbo].[net_res] as [net]
on [net].[id]=SVR.[affected_rc]
WHERE
SVR.Type IN ('P')
AND
SVR.close_date IS NOT NULL
AND
[Status].[SYM] = 'Closed'
AND
SVR.parent is null
AND
[Root Cause].[sym] in ( 'RC - Application','RC - Hardware', 'RC - Operational', 'RC - Unknown')
AND
(
[vapp].[appl_name] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
OR
pctg.sym in ('Systems.Release Health Dashboard.Problem','DTV QA Test.Enterprise Release.Deferred Defect Log')
AND
[Net].[nr_desc] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
)
AND
DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0) = DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0)
ORDER BY [Days Open]
DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;
WITH MyResults(RowNo, [Days Open]) AS
(
SELECT RowNo, [Days Open] FROM
(SELECT ROW_NUMBER() OVER (ORDER BY [Days Open]) AS RowNo, [Days Open] FROM @table) AS Foo
)
insert into @results
SELECT
DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0) as [Month]
,AVG([Days Open]) as [Median] FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
set @IntDate = @IntDate+1
DELETE FROM @table
END
select *
from @results
order by [Month]
This works with SQL 2000:
DECLARE @testTable TABLE
(
VALUE INT
)
--INSERT INTO @testTable -- Even Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56
--
--INSERT INTO @testTable -- Odd Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 39 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56
DECLARE @RowAsc TABLE
(
ID INT IDENTITY,
Amount INT
)
INSERT INTO @RowAsc
SELECT VALUE
FROM @testTable
ORDER BY VALUE ASC
SELECT AVG(amount)
FROM @RowAsc ra
WHERE ra.id IN
(
SELECT ID
FROM @RowAsc
WHERE ra.id -
(
SELECT MAX(id) / 2.0
FROM @RowAsc
) BETWEEN 0 AND 1
)
For newbies like myself who are learning the very basics, I personally find this example easier to follow, as it is easier to understand exactly what's happening and where median values are coming from...
select
( max(a.[Value1]) + min(a.[Value1]) ) / 2 as [Median Value1]
,( max(a.[Value2]) + min(a.[Value2]) ) / 2 as [Median Value2]
from (select
datediff(dd,startdate,enddate) as [Value1]
,xxxxxxxxxxxxxx as [Value2]
from dbo.table1
)a
I'm in absolute awe of some of the code above, though!!!
This is as simple an answer as I could come up with. It worked well with my data. If you want to exclude certain values, just add a WHERE clause to the inner select.
SELECT TOP 1
ValueField AS MedianValue
FROM
(SELECT TOP(SELECT COUNT(1)/2 FROM tTABLE)
ValueField
FROM
tTABLE
ORDER BY
ValueField) A
ORDER BY
ValueField DESC
The following solution works under these assumptions:
No duplicate values
No NULLs
Code:
IF OBJECT_ID('dbo.R', 'U') IS NOT NULL
DROP TABLE dbo.R
CREATE TABLE R (
A FLOAT NOT NULL);
INSERT INTO R VALUES (1);
INSERT INTO R VALUES (2);
INSERT INTO R VALUES (3);
INSERT INTO R VALUES (4);
INSERT INTO R VALUES (5);
INSERT INTO R VALUES (6);
-- Returns Median(R)
select SUM(A) / CAST(COUNT(A) AS FLOAT)
from R R1
where ((select count(A) from R R2 where R1.A > R2.A) =
(select count(A) from R R2 where R1.A < R2.A)) OR
((select count(A) from R R2 where R1.A > R2.A) + 1 =
(select count(A) from R R2 where R1.A < R2.A)) OR
((select count(A) from R R2 where R1.A > R2.A) =
(select count(A) from R R2 where R1.A < R2.A) + 1) ;
DECLARE @Obs int
DECLARE @RowAsc table
(
ID INT IDENTITY,
Observation FLOAT
)
INSERT INTO @RowAsc
SELECT Observations FROM MyTable
ORDER BY 1
SELECT @Obs=COUNT(*)/2 FROM @RowAsc
SELECT Observation AS Median FROM @RowAsc WHERE ID=@Obs
I tried several alternatives, but because my data records have repeated values, the ROW_NUMBER versions were not an option for me. So here is the query I used (a version with NTILE):
SELECT distinct
CustomerId,
(
MAX(CASE WHEN Percent50_Asc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId) +
MIN(CASE WHEN Percent50_desc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId)
)/2 MEDIAN
FROM
(
SELECT
CustomerId,
TotalDue,
NTILE(2) OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC) AS Percent50_Asc,
NTILE(2) OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC) AS Percent50_desc
FROM Sales.SalesOrderHeader SOH
) x
ORDER BY CustomerId;
For your question, Jeff Atwood had already given the simple and effective solution. But if you are looking for an alternative approach to calculate the median, the SQL code below will help you.
create table employees(salary int);
insert into employees values(8); insert into employees values(23); insert into employees values(45);
insert into employees values(123); insert into employees values(93); insert into employees values(2342);
insert into employees values(2238);
select * from employees;
declare @odd_even int; declare @cnt int; declare @middle_no int;
set @cnt=(select count(*) from employees); set @middle_no=(@cnt/2)+1;
select @odd_even=case when (@cnt%2=0) THEN -1 ELSE 0 END;
select AVG(tbl.salary) from (select salary,ROW_NUMBER() over (order by salary) as rno from employees group by salary) tbl where tbl.rno=@middle_no or tbl.rno=@middle_no+@odd_even;
If you are looking to calculate the median in MySQL, this github link will be useful.
