Moving Average PER TICKER for each day - sql-server

I am trying to calculate the, say, 3-day moving average (in reality 30-day) volume for stocks.
I'm trying to get the average of the last 3 date entries (rather than today minus 3 days). I've been trying to do something with row_number() in SQL Server 2012, but with no success. Can anyone help out? Below is a template schema, and my rubbish attempt at the SQL. I have tried various incarnations of the SQL below with different GROUP BYs, but it's still not working. Many thanks!
select dt_eod, ticker, volume
from
(
    select dt_eod, ticker, avg(volume),
           row_number() over(partition by dt_eod order by max_close desc) rn
    from mytable
) src
where rn >= 1 and rn <= 3
order by dt_eod
Sample Schema:
CREATE TABLE yourtable
([dt_date] int, [ticker] varchar(1), [volume] int);
INSERT INTO yourtable
([dt_date], [ticker], [volume])
VALUES
(20121201, 'A', 5),
(20121201, 'B', 7),
(20121201, 'C', 6),
(20121202, 'A', 10),
(20121202, 'B', 8),
(20121202, 'C', 7),
(20121203, 'A', 10),
(20121203, 'B', 87),
(20121203, 'C', 74),
(20121204, 'A', 10),
(20121204, 'B', 86),
(20121204, 'C', 67),
(20121205, 'A', 100),
(20121205, 'B', 84),
(20121205, 'C', 70),
(20121206, 'A', 258),
(20121206, 'B', 864),
(20121206, 'C', 740);

Three day average for each row:
with top3Values as
(
    select t.ticker, t.dt_date, top3.volume
    from yourtable t
    outer apply
    (
        select top 3 top3.volume
        from yourtable top3
        where t.ticker = top3.ticker
          and t.dt_date >= top3.dt_date
        order by top3.dt_date desc
    ) top3
)
select ticker, dt_date, ThreeDayVolume = avg(volume)
from top3Values
group by ticker, dt_date
order by ticker, dt_date
SQL Fiddle demo.
Latest value:
with tickers as
(
    select distinct ticker from yourtable
), top3Values as
(
    select t.ticker, top3.volume
    from tickers t
    outer apply
    (
        select top 3 top3.volume
        from yourtable top3
        where t.ticker = top3.ticker
        order by top3.dt_date desc
    ) top3
)
select ticker, ThreeDayVolume = avg(volume)
from top3Values
group by ticker
order by ticker
SQL Fiddle demo.
Realistically you wouldn't need to create the tickers CTE for the second query, as you'd base it on a [ticker] table, and you'd probably have some sort of date parameter in the query, but hopefully this will get you on the right track.
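For illustration, a sketch of what that might look like (the permanent tickers lookup table and the @asOfDate parameter are assumptions, not part of the sample schema):
declare @asOfDate int = 20121206;

with top3Values as
(
    select t.ticker, top3.volume
    from tickers t
    outer apply
    (
        select top 3 y.volume
        from yourtable y
        where y.ticker = t.ticker
          and y.dt_date <= @asOfDate -- the date parameter caps the window's end
        order by y.dt_date desc
    ) top3
)
select ticker, ThreeDayVolume = avg(volume)
from top3Values
group by ticker
order by ticker;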

You mentioned SQL 2012, which means that you can leverage a much simpler paradigm.
select dt_date, ticker,
       avg(1.0*volume) over (
           partition by ticker
           order by dt_date
           rows between 2 preceding and current row
       ) as moving_avg
from yourtable
I find this much more transparent about what is actually being accomplished. (The 1.0 * cast forces decimal division, so integer volumes don't get truncated to an integer average.)

You may want to look at yet another technique that is presented here: SQL-Server Moving Averages set-based algorithm with flexible window-periods and no self-joins.
The algorithm is quite speedy (much faster than APPLY, and it does not degrade in performance the way APPLY does as the window of data points expands), it is easily adaptable to your requirement, it works with pre-2012 SQL Server, and it overcomes the limitation of SQL 2012's windowing functionality that the window width must be hard-coded in the OVER clause.
For a stock-market type application with moving price averages, it is a common requirement to let the user vary the number of data points included in the average (from a UI selection, such as 7-day, 30-day, or 60-day), and SQL 2012's OVER clause cannot handle this variable window width without dynamic SQL.
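To make that last point concrete, here is a minimal dynamic-SQL sketch (my illustration, not taken from the linked article); the @days parameter is hypothetical:
declare @days int = 30, @sql nvarchar(max);

-- The frame size cannot be a variable in a static OVER clause,
-- so the ROWS clause is spliced into the SQL text at runtime.
set @sql = N'select dt_date, ticker,
       avg(1.0*volume) over (
           partition by ticker
           order by dt_date
           rows between ' + cast(@days - 1 as nvarchar(10)) + N' preceding and current row
       ) as moving_avg
from yourtable;';

exec sp_executesql @sql;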

Related

T-SQL - get only latest row for selected condition

I have a table of measurements with columns SERIAL_NBR, DATE_TIME, VALUE.
There is a lot of data, so when I need the last 48 hours for 2000 devices,
Select * from MY_TABLE where [TIME] >= DATEADD(hh, -48, @TimeNow)
takes a very long time.
Is there a way not to fetch all the rows for each device, but only the latest entry? Would this speed up query execution?
Assuming that there is a column named deviceId (change as per your needs), you can use top 1 with ties together with the window function row_number:
Select top 1 with ties *
from MY_TABLE
where [TIME] >= DATEADD(hh, -48, @TimeNow)
Order by row_number() over (
    partition by deviceId
    order by [TIME] desc
);
You can simply create a Common Table Expression that numbers the entries per device and then pick the latest one from there.
;WITH numbered
AS ( SELECT [SERIAL_NBR], [TIME], [VALUE],
            row_nr = ROW_NUMBER() OVER (PARTITION BY [SERIAL_NBR] ORDER BY [TIME] DESC)
     FROM MY_TABLE
     WHERE [TIME] >= DATEADD(hh, -48, @TimeNow) )
SELECT [SERIAL_NBR], [TIME], [VALUE]
FROM numbered
WHERE row_nr = 1 -- we want the latest record only
Depending on the amount of data and the indexes available, this might or might not be faster than Anthony Hancock's answer.
Similar to his answer, you might also try the following (from MSSQL's point of view, the query below and Anthony's query are pretty much identical and will probably end up with the same query plan):
SELECT M.[SERIAL_NBR], M.[TIME], M.[VALUE]
FROM MY_TABLE AS M
JOIN (SELECT [SERIAL_NBR], max_time = MAX([TIME])
      FROM MY_TABLE
      GROUP BY [SERIAL_NBR]) AS L -- latest
  ON L.[SERIAL_NBR] = M.[SERIAL_NBR]
 AND L.max_time = M.[TIME]
WHERE M.[TIME] >= DATEADD(hh, -48, @TimeNow)
Your listed columns and your code don't quite match up, so you'll probably have to change this code a little, but it sounds like for each SERIAL_NBR you want the record with the highest DATE_TIME in the last 48 hours. This should achieve that result for you.
SELECT SERIAL_NBR, DATE_TIME, VALUE
FROM MY_TABLE AS M
WHERE M.DATE_TIME >= DATEADD(hh, -48, @TimeNow)
  AND M.DATE_TIME = (SELECT MAX(_M.DATE_TIME) FROM MY_TABLE AS _M WHERE M.SERIAL_NBR = _M.SERIAL_NBR)
This will get you the details of the latest record per serial number:
Select t.SERIAL_NBR, q.FieldsYouWant
from MY_TABLE t
outer apply
(
    select top 1 t2.FieldsYouWant
    from MY_TABLE t2
    where t2.SERIAL_NBR = t.SERIAL_NBR
    order by t2.[TIME] desc
) q
where t.[TIME] >= DATEADD(hh, -48, @TimeNow)
Also, it's worth sticking DATEADD(hh, -48, @TimeNow) into a variable rather than calculating it inline.
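For instance (a sketch; @TimeNow is assumed to be supplied by the caller):
-- Compute the cutoff once rather than in every comparison.
declare @cutoff datetime = DATEADD(hh, -48, @TimeNow);

select [SERIAL_NBR], [TIME], [VALUE]
from MY_TABLE
where [TIME] >= @cutoff;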

Find the date when a bit column toggled state

I have this requirement.
My table contains a series of rows with serial numbers, several bit columns, and a date-time.
To simplify, I will focus on one bit column. In essence, I need to know the most recent date on which this bit toggled.
Ex: The following table depicts the bit values for 7 serials for the latest 6 days (10 to 5).
SQL Fiddle schema + query
I have successfully managed to get the result on a sample, but it is taking ages on the real table, which contains over 30 million records and approx 300K serial numbers.
Pseudo -->
For each serial:
    Get the (max date) bit value as A (the latest bit value, e.g. 1)
    Get the (max date) NOT A as B (find the most recent date the bit held the opposite value, e.g. 0)
    Get the (min date) > B
Group by SNO
I am sure an optimised approach exists.
For completeness, the dataset contains rows that I need to filter out, etc. However, I can add these later once the basic query executes more efficiently.
Thanks for your time!
with cte as
(
    select *, rn = ROW_NUMBER() OVER (ORDER BY sno, Device_date) -- order by date within serial so rn+1 is the next reading
    from dbo.TestCape2
)
select MAX(y.Device_date) as MaxDate,
       y.SNo
from cte x
inner join cte as y
    on x.rn = y.rn + 1
   and x.SNo = y.SNo
   and x.Cape <> y.Cape
group by y.SNo
order by SNo;
And if you're using SQL Server 2012 and up, you can make use of LAG, which looks at the previous row.
select max(Device_date) as MaxDate,
       SNo
from (
    select SNo,
           Device_date,
           Cape,
           LAG(Cape, 1, 0) OVER (PARTITION BY Sno ORDER BY Device_date) AS PrevCape,
           LAG(Sno, 1, 0) OVER (PARTITION BY Sno ORDER BY Device_date) AS PrevSno
    from dbo.TestCape2) t
where sno = PrevSno
  and t.Cape <> t.PrevCape
group by sno
order by sno;

How to combine PIVOT and aggregation?

Here is my sample schema and data (http://sqlfiddle.com/#!3/0d8b7/3/0):
CREATE TABLE cpv
(
ClientId INT,
CodeName VARCHAR(20),
Value VARCHAR(30),
LastModified DATETIME
);
INSERT INTO cpv (ClientId,CodeName,Value,LastModified)
VALUES
(1000, 'PropA', 'A', '2014-05-15 17:02:00'),
(1000, 'PropB', 'B', '2014-05-15 17:01:00'),
(1000, 'PropC', 'C', '2014-05-15 17:01:00'),
(2000, 'PropA', 'D', '2014-05-15 17:02:00'),
(2000, 'PropB', 'E', '2014-05-15 17:05:00');
I need to reshape it into:
ClientId  PropA  PropB  PropC  LastModified
1000      A      B      C      '2014-05-15 17:02:00'
2000      D      E      NULL   '2014-05-15 17:05:00'
There are two operations involved here:
aggregation of the LastModified - taking the Max within the same ClientId
pivoting the CodeName column
I have no idea how to combine them.
This SQL Fiddle demonstrates pivoting the CodeName column:
SELECT PropA,PropB,PropC
FROM (
SELECT CodeName,Value FROM cpv
) src
PIVOT (
MAX(Value)
FOR CodeName IN (PropA,PropB,PropC)
) p
But it does not group by ClientId, nor does it take the maximum of LastModified.
This SQL Fiddle demonstrates grouping by the ClientId and aggregating LastModified:
SELECT ClientId,MAX(LastModified) LastModified
FROM cpv
GROUP BY ClientId
But it totally ignores the CodeName and Value columns.
How can I group by ClientId, aggregate by taking the Maximum LastModified within the group and also pivot the CodeName column, again within each group?
EDIT
The answer is available here.
Try this:
;with cte as
(
    SELECT ClientId, PropA, PropB, PropC
    FROM (
        SELECT ClientId, CodeName, Value FROM cpv
    ) src
    PIVOT (
        MAX(Value)
        FOR CodeName IN (PropA, PropB, PropC)
    ) p
)
SELECT DISTINCT cte.ClientId, PropA, PropB, PropC,
       MAX(LastModified) OVER (PARTITION BY cte.ClientId) AS MaxLastModified
FROM cte
INNER JOIN cpv ON cte.ClientId = cpv.ClientId
Demo here.
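As an aside (not part of the original answer), the same result can be produced without PIVOT at all, using conditional aggregation, which folds the pivot and the MAX(LastModified) into a single GROUP BY:
SELECT ClientId,
       MAX(CASE WHEN CodeName = 'PropA' THEN Value END) AS PropA,
       MAX(CASE WHEN CodeName = 'PropB' THEN Value END) AS PropB,
       MAX(CASE WHEN CodeName = 'PropC' THEN Value END) AS PropC,
       MAX(LastModified) AS LastModified
FROM cpv
GROUP BY ClientId;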

T-SQL for a normalized average

I'm looking for a way to calculate a useful average for a given set of values which may contain huge spikes (e.g. 21, 54, 34, 14, 20, 300, 23 or 1, 1, 1, 1, 200, 1, 100). The spikes can throw things off when using the standard average calculation.
I looked into using the median, but this doesn't really give the desired result.
I would like to implement this in T-SQL.
Any ideas?
This way you can take away the highest and the lowest 25% before calculating the average:
declare @t table (col1 int)
insert @t
select 21 union all
select 54 union all
select 34 union all
select 14 union all
select 20 union all
select 300 union all
select 23 union all
select 1 union all
select 1 union all
select 1 union all
select 1 union all
select 200 union all
select 1 union all
select 100

select avg(col1) from (
    select top 67 percent col1 from (
        select top 75 percent col1 from @t order by col1
    ) a order by col1 desc) b
Use a median filter:
SELECT AVG(m.value)
FROM (
    SELECT TOP 1 value AS median
    FROM (
        SELECT TOP 50 PERCENT value
        FROM mytable
        ORDER BY value
    ) q
    ORDER BY value DESC
) q
JOIN mytable m
    ON ABS(LOG10(m.value) - LOG10(q.median)) <= @filter_level
Create a GROUP BY using a logarithmic rule (for example, bucket numbers that differ by less than a factor of 10, or use any other log base).
Filter out non-representative groups (for example, those with fewer than 3 rows) using HAVING.
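A sketch of that idea against the same hypothetical mytable (values assumed positive, since LOG10 requires it):
SELECT AVG(1.0 * value) AS filtered_avg
FROM (
    SELECT value, FLOOR(LOG10(value)) AS bucket -- order-of-magnitude bucket
    FROM mytable
    WHERE value > 0
) b
WHERE bucket IN (
    SELECT FLOOR(LOG10(value))
    FROM mytable
    WHERE value > 0
    GROUP BY FLOOR(LOG10(value))
    HAVING COUNT(*) >= 3 -- drop non-representative buckets
);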
The danger in doing this is that you can't be certain that all those spikes are insignificant and worth discarding. One person's noise is another person's Black Swan.
If you're worried about large values skewing your view of the data needlessly, you'd be better off using a measure like median that's less sensitive to outliers. It's harder to calculate than mean, but it'll give you a measure of centrality that's not swayed as much by spikes.
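On SQL Server 2012 and later, the median itself is directly available via PERCENTILE_CONT (a sketch against the same hypothetical mytable):
SELECT DISTINCT
       PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY value) OVER () AS median
FROM mytable;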
You may consider using a window function such as OVER with PARTITION BY. This will allow you to fine-tune exclusions within specific groups of rows (such as by name, date, or hour). In this example, I borrow the rows from t-clausen.dk's example and expand them by adding a name so we can demonstrate windowing.
-- Set boundaries, like the TOP PERCENT used in the aforementioned example
DECLARE @UBOUND FLOAT, @LBOUND FLOAT
SET @UBOUND = 0.8 --(80%)
SET @LBOUND = 0.2 --(20%)
--Build a CTE table
;WITH tb_example AS (
select [Val]=21,[fname]='Bill' union all
select 54,'Tom' union all
select 34,'Tom' union all
select 14,'Bill' union all
select 20,'Bill' union all
select 300,'Tom' union all
select 23,'Bill' union all
select 1,'Tom' union all
select 1,'Tom' union all
select 1,'Bill' union all
select 1,'Tom' union all
select 200,'Bill' union all
select 1,'Tom' union all
select 12,'Tom' union all
select 8,'Tom' union all
select 11,'Bill' union all
select 100,'Bill'
)
--Outer query applies criteria of your choice to remove spikes
SELECT fname,AVG(Val) FROM (
-- Inner query applies windowed aggregate values for outer query processing
SELECT *
,ROW_NUMBER() OVER (PARTITION BY fname order by Val) RowNum
,COUNT(*) OVER (PARTITION BY fname) RowCnt
,MAX(Val) OVER (PARTITION BY fname) MaxVal
,MIN(Val) OVER (PARTITION BY fname) MinVal
FROM tb_example
) TB
WHERE
-- You can use the bounds to eliminate the top and bottom 20%
RowNum BETWEEN (RowCnt*@LBOUND) and (RowCnt*@UBOUND) -- Limits window
-- Or you may chose to simply eliminate the Max and MIN values
OR (Val > MinVal AND Val < MaxVal) -- Removes Lowest and Highest values
GROUP BY fname
In this case, I use both criteria and AVG the Val by fname. But the sky is the limit with how you choose to mitigate spikes with this technique.

SQL Select Statement For Calculating A Running Average Column

I am trying to add a running average column to a SELECT statement, based on a column's values from the n previous rows in the same result set.
Let me explain
Id  Number  Average
1   1       NULL
2   3       NULL
3   2       NULL
4   4       2    <----- Average of (1, 3, 2), Numbers from the previous 3 rows
5   6       3    <----- Average of (3, 2, 4), Numbers from the previous 3 rows
.   .       .
.   .       .
The first 3 rows of the Average column are NULL because there are not yet 3 previous rows. Row 4 of the Average column shows the average of the Number column from the previous 3 rows.
I need some help trying to construct a SQL Select statement that will do this.
This should do it:
--Test Data
CREATE TABLE RowsToAverage
(
ID int NOT NULL,
Number int NOT NULL
)
INSERT RowsToAverage(ID, Number)
SELECT 1, 1
UNION ALL
SELECT 2, 3
UNION ALL
SELECT 3, 2
UNION ALL
SELECT 4, 4
UNION ALL
SELECT 5, 6
UNION ALL
SELECT 6, 8
UNION ALL
SELECT 7, 10
--The query
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM RowsToAverage rta
)
SELECT nr.ID, nr.Number,
CASE
WHEN nr.RowNumber <=3 THEN NULL
ELSE ( SELECT avg(Number)
FROM NumberedRows
WHERE RowNumber < nr.RowNumber
AND RowNumber >= nr.RowNumber - 3
)
END AS MovingAverage
FROM NumberedRows nr
Assuming that the Id column is sequential, here's a simplified query for a table named "MyTable":
SELECT
b.Id,
b.Number,
(
SELECT
AVG(a.Number)
FROM
MyTable a
WHERE
a.id >= (b.Id - 3)
AND a.id < b.Id
AND b.Id > 3
) as Average
FROM
MyTable b;
Edit: I missed the point that it should average the three previous records...
For a general running average, I think something like this would work (the 1.0 * cast avoids integer division):
SELECT
    id, number,
    SUM(1.0 * number) OVER (ORDER BY ID) /
    ROW_NUMBER() OVER (ORDER BY ID) AS [RunningAverage]
FROM myTable
ORDER BY ID
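As a side note, on SQL Server 2012 and later the original "previous three rows" requirement is a one-liner with a framed window (a sketch against the RowsToAverage table above; note it yields NULL for the first row and a partial average for rows 2 and 3, rather than NULL for all three):
SELECT ID, Number,
       AVG(1.0 * Number) OVER (ORDER BY ID
           ROWS BETWEEN 3 PRECEDING AND 1 PRECEDING) AS MovingAverage
FROM RowsToAverage
ORDER BY ID;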
A simple self-join would seem to perform much better than a row-referencing subquery.
Generate 10k rows of test data:
drop table test10k
create table test10k (Id int, Number int, constraint test10k_cpk primary key clustered (id))
;WITH digits AS (
SELECT 0 as Number
UNION SELECT 1
UNION SELECT 2
UNION SELECT 3
UNION SELECT 4
UNION SELECT 5
UNION SELECT 6
UNION SELECT 7
UNION SELECT 8
UNION SELECT 9
)
,numbers as (
SELECT
(thousands.Number * 1000)
+ (hundreds.Number * 100)
+ (tens.Number * 10)
+ ones.Number AS Number
FROM digits AS ones
CROSS JOIN digits AS tens
CROSS JOIN digits AS hundreds
CROSS JOIN digits AS thousands
)
insert test10k (Id, Number)
select Number, Number
from numbers
I would pull the special case of the first 3 rows out of the main query; you can UNION ALL those back in if you really want them in the row set. Self-join query:
;WITH NumberedRows
AS
(
    SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
    FROM test10k rta
)
SELECT nr.ID, nr.Number,
       avg(trailing.Number) as MovingAverage
FROM NumberedRows nr
JOIN NumberedRows as trailing
    ON trailing.RowNumber between nr.RowNumber - 3 and nr.RowNumber - 1
WHERE nr.RowNumber > 3 -- skip the first 3 rows, which have no full window
GROUP BY nr.id, nr.Number
On my machine this takes about 10 seconds; the subquery approach that Aaron Alton demonstrated takes about 45 seconds (after I changed it to reflect my test source table):
;WITH NumberedRows
AS
(
SELECT rta.*, row_number() OVER (ORDER BY rta.ID ASC) AS RowNumber
FROM test10k rta
)
SELECT nr.ID, nr.Number,
CASE
WHEN nr.RowNumber <=3 THEN NULL
ELSE ( SELECT avg(Number)
FROM NumberedRows
WHERE RowNumber < nr.RowNumber
AND RowNumber >= nr.RowNumber - 3
)
END AS MovingAverage
FROM NumberedRows nr
If you do a SET STATISTICS PROFILE ON, you can see that the self-join has 10k executes on the table spool, while the subquery has 10k executes on the filter, aggregate, and other steps.
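For reference, that profiling is enabled like so (run before the query under test):
SET STATISTICS PROFILE ON; -- returns per-operator rows and executes alongside each result set
-- ... run the query under test here ...
SET STATISTICS PROFILE OFF;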
Check out some solutions here. I'm sure that you could adapt one of them easily enough.
If you want this to be truly performant, and aren't afraid to dig into a seldom-used area of SQL Server, you should look into writing a custom aggregate function. SQL Server 2005 and 2008 brought CLR integration to the table, including the ability to write user-defined aggregate functions. A custom running-total aggregate would be by far the most efficient way to calculate a running average like this.
Alternatively, you can denormalize and store precalculated running values, as described here:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/01/23/denormalizing-to-enforce-business-rules-running-totals.aspx
Select performance is then as fast as it gets; of course, modifications become slower.
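In outline, the idea looks something like this (a sketch; the table and column names are illustrative, and the running columns would be maintained by the writing code or a trigger):
CREATE TABLE Readings
(
    ID           int    NOT NULL PRIMARY KEY,
    Number       int    NOT NULL,
    RunningSum   bigint NOT NULL, -- sum of Number up to and including this row
    RunningCount int    NOT NULL  -- number of rows up to and including this row
);

-- Reading a running average then becomes a plain column computation:
SELECT ID, Number, 1.0 * RunningSum / RunningCount AS RunningAverage
FROM Readings;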
