I have a data source with daily sales per product.
I want to create a field that calculates the average daily sales over the previous 7 days, for each product and day (e.g. on day 10 for product A, it will give me the average sales for product A on days 3 - 9; on day 15 for product B, I'll see the average sales of B on days 8 - 14).
Is this possible?
Example data (I have the first 3 columns and need to generate the fourth):
Date Product Sales 7-Day Average
1/11 A 983 201
2/11 A 650 983
3/11 A 328 817
4/11 A 728 654
5/11 A 246 672
6/11 A 613 587
7/11 A 575 591
8/11 A 601 589
9/11 A 462 534
10/11 A 979 508
11/11 A 148 601
12/11 A 238 518
13/11 A 53 517
14/11 A 500 437
15/11 A 684 426
16/11 A 261 438
17/11 A 69 409
18/11 A 159 279
19/11 A 964 281
20/11 A 429 384
21/11 A 731 438
1/11 B 790 471
2/11 B 265 486
3/11 B 94 487
4/11 B 66 490
5/11 B 124 477
6/11 B 555 357
7/11 B 190 375
8/11 B 232 298
9/11 B 747 218
10/11 B 557 287
11/11 B 432 353
12/11 B 526 405
13/11 B 690 463
14/11 B 350 482
15/11 B 512 505
16/11 B 273 545
17/11 B 679 477
18/11 B 164 495
19/11 B 799 456
20/11 B 749 495
21/11 B 391 504
I haven't really tried anything; I couldn't figure out how to get started with this.
This may not be the perfect solution, but it does give your expected result in a crude way.
First, cross-join the data source with itself.
Then use the following calculated field to get the average over the previous 7 days:
(CASE WHEN Date (Table 2) BETWEEN DATETIME_SUB(Date (Table 1), INTERVAL 7 DAY) AND DATETIME_SUB(Date (Table 1), INTERVAL 1 DAY) THEN Sales (Table 2) ELSE 0 END)/7
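As a rough cross-check of the self-join idea outside Data Studio, here is a minimal Python sketch using the standard-library sqlite3 module. The table name sales, the integer day column and the choice of rows are assumptions made for illustration (the figures are the first few rows of the example data). Note that it uses AVG over the matching rows instead of dividing by 7, so days with fewer than 7 prior days are averaged over the days that exist; that appears to be what the example column does (817 on 3/11 is roughly (983 + 650) / 2).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: day is a plain integer day number, not a real date.
conn.execute("CREATE TABLE sales (day INTEGER, product TEXT, sales REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, "A", 983), (2, "A", 650), (3, "A", 328), (4, "A", 728),
     (1, "B", 790), (2, "B", 265), (3, "B", 94)],
)

# Join each row (t1) to the same product's rows from the 7 days before it (t2).
# LEFT JOIN keeps days that have no history; their average comes out as NULL.
query = """
SELECT t1.product, t1.day, AVG(t2.sales) AS prev7_avg
FROM sales AS t1
LEFT JOIN sales AS t2
  ON t2.product = t1.product
 AND t2.day BETWEEN t1.day - 7 AND t1.day - 1
GROUP BY t1.product, t1.day
ORDER BY t1.product, t1.day
"""
result = conn.execute(query).fetchall()

for product, day, avg in result:
    print(product, day, avg)
```

If the average really must be over a fixed 7 days regardless of how much history exists, replacing AVG(t2.sales) with SUM(t2.sales) / 7.0 would mirror the calculated field more literally.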
I have a problem that I find very hard to solve:
I need to calculate a column R_t in SQL where for each row, the sum of the "previous" calculated values SUM(R_t-1) is required as input. The calculation is done grouped over a ProjectID column. I have no clue how to proceed.
The formula for the calculation I am trying to achieve is R_t = ([ContractValue]_t - SUM(R_{t-1})) / [RemainingHours]_t * [HoursRegistered]_t, where "t" denotes time and SUM(R_{t-1}) is the sum of R from time 0 to time t-1 (taken as 0 for t = 0).
Time is always consecutive and always begins at t = 0, but the number of time periods may differ across [ProjectID], i.e. one project may have t = {0,1,2} and another t = {0,1,2,3,4,5}. The time period will never "jump" from 5 to 7.
The expected output for ProjectID 101 (using the data below) is:
R_0 = (500,000 - 0) / 500 * 65 = 65,000
R_1 = (500,000 - (65,000)) / 435 * 100 = 100,000
R_2 = (500,000 - (65,000 + 100,000)) / 335 * 85 = 85,000
R_3 = (500,000 - (65,000 + 100,000 + 85,000)) / 250 * 69 = 69,000
etc...
This calculation is done for each ProjectID.
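To make the recurrence concrete, here is a small Python sketch (not SQL) that applies the formula row by row while keeping a running total per project. The function name running_r and the tuple layout are made up for illustration; the rows are a subset of the input data from this question.

```python
from collections import defaultdict

# (time, project_id, contract_value, hours_registered, remaining_hours)
rows = [
    (0, 101, 500000, 65, 500),
    (1, 101, 500000, 100, 435),
    (2, 101, 500000, 85, 335),
    (3, 101, 500000, 69, 250),
    (4, 101, 450000, 100, 331),
    (0, 110, 120000, 10, 90),
    (1, 110, 120000, 10, 80),
    (2, 110, 130000, 10, 70),
]

def running_r(rows):
    """Return {(project_id, time): R_t} using R_t = (value - sum of earlier R) / remaining * registered."""
    total = defaultdict(float)  # sum of R so far, per project
    out = {}
    # Process each project in time order so earlier R values are known first.
    for t, pid, value, registered, remaining in sorted(rows, key=lambda x: (x[1], x[0])):
        r_t = (value - total[pid]) / remaining * registered
        out[(pid, t)] = r_t
        total[pid] += r_t
    return out

r = running_r(rows)
for key in sorted(r):
    print(key, round(r[key]))
```

Each R_t depends on every earlier R of the same project, which is why a plain windowed SUM over the input columns is not enough on its own: the values being summed do not exist until the previous rows have been computed.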
My question is how to formulate this in a SQL query. My first thought was to create a recursive CTE, but I am not sure it is the right way to proceed. Recursive CTEs are (from my understanding) made for handling hierarchical structures, which this isn't really.
My other thought was to calculate SUM(R_{t-1}) using window functions, i.e. SUM OVER (PARTITION BY ORDER BY) with a LAG, but the recursiveness really gives me trouble and I keep running my head against the wall when I try.
Below is a query for creating the input data:
CREATE TABLE [dbo].[InputForRecursiveCalculation]
(
[Time] int NULL,
ProjectID [int],
ContractValue float,
ContractHours float,
HoursRegistered float,
RemainingHours float
)
GO
INSERT INTO [dbo].[InputForRecursiveCalculation]
(
[Time]
,[ProjectID]
,[ContractValue]
,[ContractHours]
,[HoursRegistered]
,[RemainingHours]
)
VALUES
(0,101,500000,500,65,500),
(1,101,500000,500,100,435),
(2,101,500000,500,85,335),
(3,101,500000,500,69,250),
(4,101,450000,650,100,331),
(5,101,450000,650,80,231),
(6,101,450000,650,90,151),
(7,101,450000,650,45,61),
(8,101,450000,650,16,16),
(0,110,120000,90,10,90),
(1,110,120000,90,10,80),
(2,110,130000,90,10,70),
(3,110,130000,90,10,60),
(4,110,130000,90,10,50),
(5,110,130000,90,10,40),
(6,110,130000,90,10,30),
(7,110,130000,90,10,20),
(8,110,130000,90,10,10)
GO
For those of you who dare to download something from a complete stranger, I have created an Excel file demonstrating the calculation (please download the file, as you will not be able to see the actual formula in the HTML representation shown when first clicking the link):
https://www.dropbox.com/s/3rxz72lbvooyc4y/Calculation%20example.xlsx?dl=0
Best regards,
Victor
I think this will be useful for you. There is an additional column, SumR, that holds the running sum of R for each ProjectID (up to and including the current row).
;with recu as
(
select
Time,
ProjectId,
ContractValue,
ContractHours,
HoursRegistered,
RemainingHours,
cast((ContractValue - 0)*HoursRegistered/RemainingHours as numeric(15,0)) as R,
cast((ContractValue - 0)*HoursRegistered/RemainingHours as numeric(15,0)) as SumR
from
InputForRecursiveCalculation
where
Time=0
union all
select
input.Time,
input.ProjectId,
input.ContractValue,
input.ContractHours,
input.HoursRegistered,
input.RemainingHours,
cast((input.ContractValue - prev.SumR)*input.HoursRegistered/input.RemainingHours as numeric(15,0)),
cast((input.ContractValue - prev.SumR)*input.HoursRegistered/input.RemainingHours + prev.SumR as numeric(15,0))
from
recu prev
inner join
InputForRecursiveCalculation input
on input.ProjectId = prev.ProjectId
and input.Time = prev.Time + 1
)
select
*
from
recu
order by
ProjectID,
Time
RESULTS:
Time ProjectId ContractValue ContractHours HoursRegistered RemainingHours R SumR
----------- ----------- ---------------------- ---------------------- ---------------------- ---------------------- --------------------------------------- ---------------------------------------
0 101 500000 500 65 500 65000 65000
1 101 500000 500 100 435 100000 165000
2 101 500000 500 85 335 85000 250000
3 101 500000 500 69 250 69000 319000
4 101 450000 650 100 331 39577 358577
5 101 450000 650 80 231 31662 390239
6 101 450000 650 90 151 35619 425858
7 101 450000 650 45 61 17810 443668
8 101 450000 650 16 16 6332 450000
0 110 120000 90 10 90 13333 13333
1 110 120000 90 10 80 13333 26666
2 110 130000 90 10 70 14762 41428
3 110 130000 90 10 60 14762 56190
4 110 130000 90 10 50 14762 70952
5 110 130000 90 10 40 14762 85714
6 110 130000 90 10 30 14762 100476
7 110 130000 90 10 20 14762 115238
8 110 130000 90 10 10 14762 130000
I am using the command below in Hive and getting the correct result.
select acct_id,collect_list(expr_dt) from experiences
> group by acct_id;
Output:
900 ["2015-03-31"]
707 ["2015-03-31","2014-12-10"]
903 ["2015-03-31"]
-435 ["2015-03-31"]
718 ["2015-03-31","2014-06-03"]
I want to get the max date for each account.
When I try to execute the query below, I get an error.
select acct_id,max(collect_list(expr_dt)) from experiences
> group by acct_id;
The error is:
SemanticException [Error 10128]: Line 1:19 Not yet supported place for
UDAF 'collect_list'
I want to do the whole operation in a single query.
If your goal is only to find the max expr_dt for each acct_id, you can use max directly, without collect_list.
input:
hive> select * from experiences;
OK
900 2015-03-31
707 2015-03-31
707 2014-12-10
903 2015-03-31
-435 2015-03-31
718 2015-03-31
718 2014-06-03
query:
hive> select acct_id,max(expr_dt) from experiences group by acct_id;
output:
Total MapReduce CPU Time Spent: 4 seconds 30 msec
OK
-435 2015-03-31
707 2015-03-31
718 2015-03-31
900 2015-03-31
903 2015-03-31
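The same GROUP BY with MAX can be tried outside Hive; below is a minimal sketch in Python using the standard-library sqlite3 module, with the rows from the input above. It assumes expr_dt is stored as text in YYYY-MM-DD form, in which case string order and date order agree, so MAX picks the latest date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumption: expr_dt is ISO-formatted text (YYYY-MM-DD).
conn.execute("CREATE TABLE experiences (acct_id INTEGER, expr_dt TEXT)")
conn.executemany(
    "INSERT INTO experiences VALUES (?, ?)",
    [(900, "2015-03-31"), (707, "2015-03-31"), (707, "2014-12-10"),
     (903, "2015-03-31"), (-435, "2015-03-31"),
     (718, "2015-03-31"), (718, "2014-06-03")],
)

# One row per account: the aggregate replaces the collected list entirely.
latest = dict(conn.execute(
    "SELECT acct_id, MAX(expr_dt) FROM experiences GROUP BY acct_id"
))

for acct_id in sorted(latest):
    print(acct_id, latest[acct_id])
```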
SQL is not my forte, and I am not sure how to ask the question, so some help would be much appreciated.
I need to extract a record that falls between a range of numbers. So I have a number, e.g. 230, that would return a rate of 60, based on the table below.
MinR MaxR Rate
1 3000 60.00
3001 5000 50.00
5001 7000 48.00
7001 10000 45.00
10000 999999 43.00
Logically I have tried MinR >=237 and MaxR <=237, to no avail.
Is there a simple statement to achieve this, or should I be tackling this more programmatically (cursor, IF..THEN, etc.)?
Many thanks
Graham
You can use BETWEEN as follows:
SELECT Rate
FROM YourTable
WHERE 230 BETWEEN MinR AND MaxR - 1
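I subtracted 1 from MaxR so that you do not get two records for one input where ranges share a boundary (10000 in your table). Note that this also excludes an input exactly equal to MaxR where the ranges do not share a boundary (for example 3000), so it only works cleanly if every range starts at the previous range's MaxR.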
You're almost there, you just have your logic backwards. Let's look at MinR >=237 and MaxR <=237 and plug in the numbers from the first row:
1 >= 237 AND 3000 <= 237
Is that condition satisfied? Obviously not: 1 is not greater than or equal to 237. It works if you do it the other way around:
MinR <= 237 AND MaxR >= 237
or, to improve readability (and to avoid this kind of mistake in the future):
237 BETWEEN MinR And MaxR
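For anyone who wants to try the range lookup quickly, here is a minimal sketch in Python with the standard-library sqlite3 module. The table name rates and the helper lookup are made up for illustration; the rows are the ones from the question. It also shows that the sample table overlaps at 10000, where an inclusive BETWEEN matches two rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rates (MinR INTEGER, MaxR INTEGER, Rate REAL)")
conn.executemany(
    "INSERT INTO rates VALUES (?, ?, ?)",
    [(1, 3000, 60.0), (3001, 5000, 50.0), (5001, 7000, 48.0),
     (7001, 10000, 45.0), (10000, 999999, 43.0)],
)

def lookup(value):
    """Return every Rate whose inclusive [MinR, MaxR] range contains value."""
    cur = conn.execute(
        "SELECT Rate FROM rates WHERE ? BETWEEN MinR AND MaxR ORDER BY MinR",
        (value,),
    )
    return [row[0] for row in cur]

print(lookup(230))    # one match
print(lookup(3000))   # upper bound of the first range is included
print(lookup(10000))  # two matches: the last two ranges both contain 10000
```

If the overlap is unintended, changing the last row's MinR to 10001 is probably a cleaner fix than adjusting the query.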
I have a table which indexes the locations of words in a bunch of documents.
I want to identify the most common bigrams in the set.
How would you do this in MSSQL 2008?
the table has the following structure:
LocationID -> DocID -> WordID -> Location
I have thought about trying to do some kind of complicated join... and I'm just doing my head in.
Is there a simple way of doing this?
I think I'd better edit this on Monday in order to bump it up in the questions.
Sample Data
LocationID DocID WordID Location
21952 534 27 155
21953 534 109 156
21954 534 4 157
21955 534 45 158
21956 534 37 159
21957 534 110 160
21958 534 70 161
It's been years since I've written SQL, so my syntax may be a bit off; however, I believe the logic is correct.
SELECT CAST(i.WordID AS varchar(20)) + '|' + CAST(j.WordID AS varchar(20)) AS bigram, COUNT(*) AS freq
FROM [index] AS i
JOIN [index] AS j
ON j.DocID = i.DocID AND j.Location = i.Location + 1
GROUP BY i.WordID, j.WordID
ORDER BY freq DESC
You can also add the actual word IDs to the select list if that's useful, and add a join to whatever table you've got that dereferences WordID to actual words.
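The self-join logic can be checked on a small data set; here is a minimal sketch in Python with the standard-library sqlite3 module. The table name word_locations is an assumption, the rows for DocID 534 come from the sample data, and the two rows for DocID 535 are hypothetical, added only so that one bigram occurs twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE word_locations "
    "(LocationID INTEGER, DocID INTEGER, WordID INTEGER, Location INTEGER)"
)
conn.executemany(
    "INSERT INTO word_locations VALUES (?, ?, ?, ?)",
    [(21952, 534, 27, 155), (21953, 534, 109, 156),
     (21954, 534, 4, 157), (21955, 534, 45, 158),
     # Hypothetical second document repeating the pair (27, 109).
     (30001, 535, 27, 10), (30002, 535, 109, 11)],
)

# Pair each word with the word at the next location in the same document,
# then count how often each (first, second) pair occurs.
query = """
SELECT a.WordID, b.WordID, COUNT(*) AS freq
FROM word_locations AS a
JOIN word_locations AS b
  ON b.DocID = a.DocID
 AND b.Location = a.Location + 1
GROUP BY a.WordID, b.WordID
ORDER BY freq DESC, a.WordID, b.WordID
"""
bigrams = conn.execute(query).fetchall()

for first, second, freq in bigrams:
    print(first, second, freq)
```

Grouping by the two WordID columns instead of a concatenated string avoids any ambiguity in the separator and keeps the IDs available for a later join to the word table.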