Let's say I have a table of 10000 observations:
Obs X Y Z
1
2
3
...
10000
For each observation, I have a macro, mymacro(X, Y, Z), which uses X, Y and Z as inputs. The macro creates a table with 1 observation and 4 new variables: var1, var2, var3 and var4.
I would like to know how to loop through the 10000 observations in my initial set so that the result looks like:
Obs X Y Z Var1 Var2 Var3 Var4
1
2
3
...
10000
Update:
The calculation of Var1, Var2, Var3, Var4:
I have a reference table:
Z       25      26      27      28      29      30
0   10,000  10,000  10,000  10,000  10,000  10,000
1   10,000  10,000  10,000  10,000  10,000  10,000
2   10,000  10,000  10,000  10,000  10,000  10,000
3   10,000  10,000  10,000  10,000  10,000  10,000
4    9,269   9,322   9,322   9,381   9,381   9,436
5    8,508   8,619   8,619   8,743   8,743   8,850
6    7,731   7,914   7,914   8,102   8,102   8,258
7    6,805   7,040   7,040   7,280   7,280   7,484
8    5,864   6,137   6,137   6,421   6,421   6,655
9    5,025   5,328   5,328   5,629   5,629   5,929
10   4,359   4,648   4,648   4,934   4,934   5,320
And my HAVE data set looks like:
Obs X Y Z
1 27 4 9
2
3
10000
So for the first observation (X = 27, Y = 4, Z = 9):
Var1 = (8,619 + 7,914 + 7,040 + 6,137 + 5,328) / 9,322
Var2 = (8,743 + 8,102 + 7,280 + 6,421 + 5,629) / 9,381
So that:
Var1 = the sum of all numbers in column 27 (X), from row 5 (Y+1) to row 9 (Z), divided by the value in column 27 (X), row 4 (Y)
Var2 = the sum of all numbers in column 28 (X+1), from row 5 (Y+1) to row 9 (Z), divided by the value in column 28 (X+1), row 4 (Y)
I would convert the reference table to a form that lets you do the calculations for all observations at once. So make your reference table into a tall structure, either by transposing the existing table or just reading it that way to start with:
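data ref_tall;
  input z @;                  * trailing @ holds the line so the loop can keep reading it;
  do col=25 to 30;
    input value :comma9. @;   * read the next value from the same held line;
    output;
  end;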
datalines;
0 10,000 10,000 10,000 10,000 10,000 10,000
1 10,000 10,000 10,000 10,000 10,000 10,000
2 10,000 10,000 10,000 10,000 10,000 10,000
3 10,000 10,000 10,000 10,000 10,000 10,000
4 9,269 9,322 9,322 9,381 9,381 9,436
5 8,508 8,619 8,619 8,743 8,743 8,850
6 7,731 7,914 7,914 8,102 8,102 8,258
7 6,805 7,040 7,040 7,280 7,280 7,484
8 5,864 6,137 6,137 6,421 6,421 6,655
9 5,025 5,328 5,328 5,629 5,629 5,929
10 4,359 4,648 4,648 4,934 4,934 5,320
;
Now take your list table HAVE:
data have;
input id x y z;
datalines;
1 27 4 9
2 25 2 4
;
And combine it with the reference table and make your calculations:
proc sql ;
create table want1 as
select a.id
, sum(b.value)/min(c.value) as var1
from have a
left join ref_tall b
on a.x=b.col
and b.z between a.y+1 and a.z
left join ref_tall c
on a.x=c.col
and c.z = a.y
group by a.id
;
create table want2 as
select a.id
, sum(d.value)/min(e.value) as var2
from have a
left join ref_tall d
on a.x+1=d.col
and d.z between a.y+1 and a.z
left join ref_tall e
on a.x+1=e.col
and e.z = a.y
group by a.id
;
create table want as
select *
from want1 natural join want2 natural join have
;
quit;
Results:
Obs id x y z var1 var2
1 1 27 4 9 3.75864 3.85620
2 2 25 2 4 1.92690 1.93220
The reference table can be loaded into an array, which makes the specified computations easy to perform. The reference values can then be accessed using a direct address reference.
Example
The reference table data was moved into a data set so the values can be changed over time or reloaded from some source such as Excel. The reference values can be loaded into an array for use during a DATA step.
* reference information in data set, x property column names are _<num>;
data ref;
input z (_25-_30) (comma9. &);
datalines;
0 10,000 10,000 10,000 10,000 10,000 10,000
1 10,000 10,000 10,000 10,000 10,000 10,000
2 10,000 10,000 10,000 10,000 10,000 10,000
3 10,000 10,000 10,000 10,000 10,000 10,000
4 9,269 9,322 9,322 9,381 9,381 9,436
5 8,508 8,619 8,619 8,743 8,743 8,850
6 7,731 7,914 7,914 8,102 8,102 8,258
7 6,805 7,040 7,040 7,280 7,280 7,484
8 5,864 6,137 6,137 6,421 6,421 6,655
9 5,025 5,328 5,328 5,629 5,629 5,929
10 4,359 4,648 4,648 4,934 4,934 5,320
;
* computation parameters, might be a thousand of them specified;
data have;
input id x y z;
datalines;
1 27 4 9
;
* perform computation for each parameters specified;
data want;
set have;
array ref[0:10,1:30] _temporary_;
if _n_ = 1 then do ref_row = 0 by 1 until (last_ref);
* load reference data into an array for direct addressing during computation;
set ref end=last_ref;
array ref_cols _25-_30;
do index = 1 to dim(ref_cols);
colname = vname(ref_cols[index]);
colnum = input(substr(colname,2),8.);
ref[ref_row,colnum] = ref_cols[index];
end;
end;
* perform computation for parameters specified;
array vars var1-var4;
do index = 1 to dim(vars);
ref_column = x + index - 1 ; * column x, then x+1, then x+2, then x+3;
numerator = 0; * algorithm against reference data;
do ref_row = y+1 to z;
numerator + ref[ref_row,ref_column];
end;
denominator = ref[y,ref_column];
vars[index] = numerator / denominator; * result;
end;
keep id x y z numerator denominator var1-var4;
run;
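To cross-check the arithmetic outside SAS, here is a minimal Python sketch of the same direct-addressing idea. The names REF_COLUMNS, REF_ROWS and ratios are made up for this illustration; the reference values are the ones from the question, and the inputs are the two example observations (27, 4, 9) and (25, 2, 4).

```python
# Reference grid copied from the question: rows are Z = 0..10, columns are 25..30.
REF_COLUMNS = [25, 26, 27, 28, 29, 30]
REF_ROWS = [
    [10000, 10000, 10000, 10000, 10000, 10000],  # Z = 0
    [10000, 10000, 10000, 10000, 10000, 10000],  # Z = 1
    [10000, 10000, 10000, 10000, 10000, 10000],  # Z = 2
    [10000, 10000, 10000, 10000, 10000, 10000],  # Z = 3
    [9269, 9322, 9322, 9381, 9381, 9436],        # Z = 4
    [8508, 8619, 8619, 8743, 8743, 8850],        # Z = 5
    [7731, 7914, 7914, 8102, 8102, 8258],        # Z = 6
    [6805, 7040, 7040, 7280, 7280, 7484],        # Z = 7
    [5864, 6137, 6137, 6421, 6421, 6655],        # Z = 8
    [5025, 5328, 5328, 5629, 5629, 5929],        # Z = 9
    [4359, 4648, 4648, 4934, 4934, 5320],        # Z = 10
]


def ratios(x, y, z, n_vars=4):
    """Return [var1, ..., var_n]; var_k uses reference column x + k - 1."""
    results = []
    for offset in range(n_vars):
        # Hypothetical helper logic: ValueError if the column is not in the grid.
        col = REF_COLUMNS.index(x + offset)
        # Sum rows y+1 .. z of that column, then divide by row y.
        numerator = sum(REF_ROWS[row][col] for row in range(y + 1, z + 1))
        denominator = REF_ROWS[y][col]
        results.append(numerator / denominator)
    return results


first = ratios(27, 4, 9)
second = ratios(25, 2, 4)
print([round(v, 5) for v in first])
print([round(v, 5) for v in second])
```

The first two entries of each list should agree with the var1 and var2 values shown in the results of the join-based answer.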
An example of my dataset would be:
ZoneA: 0-100
ZoneB: 100-200
Name SubName startValueA endValueA startValueB endValueB
A X 0 25 0 100
A X 25 35 0 100
A X 35 80 0 100
A X 80 95 0 100
A X 95 120 0 100
A Y 120 145 100 200
A Y 145 160 100 200
A Y 160 175 100 200
A Y 175 190 100 200
A Y 190 200 100 200
Essentially what I'm desiring is this:
Name SubName startValueA endValueA startValueB endValueB Percent
A X 0 25 0 100 1
A X 25 35 0 100 1
A X 35 80 0 100 1
A X 80 95 0 100 1
A X 95 100 0 100 .2 <--- (100-95)/(120-95)
A X 100 120 100 200 .8 <--- (120-100)/(120-95)
A Y 120 145 100 200 1
A Y 145 160 100 200 1
A Y 160 175 100 200 1
A Y 175 190 100 200 1
A Y 190 200 100 200 1
So a row is added where ValueA crosses over ValueB, and then the resulting percent of each is calculated. Basically I'm trying to figure out how much of valueA belongs in each Zone as defined by valueB. I have the first row done pretty simply with something along the lines of:
case
when endValueA <= endValueB then 1
else ((endValueB - startValueA)/(endValueA - startValueA))
end
I'm just not sure how to get the additional row added in with the inverse percent.
Thanks in advance for the help!
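One possible reading of the requirement, sketched in Python rather than SQL: clip each ValueA interval against every zone and emit one row per overlap, with the percent being the overlap length divided by the interval length. The names ZONES and split_by_zone are invented for this sketch, and it assumes the zone boundaries are known up front (0-100 and 100-200, as in the example) and that every interval has endValueA > startValueA.

```python
ZONES = [(0, 100), (100, 200)]  # zone boundaries taken from the example above


def split_by_zone(start, end, zones=ZONES):
    """Return one (start, end, zone_start, zone_end, percent) tuple per zone overlapped."""
    length = end - start  # assumed > 0
    pieces = []
    for zone_start, zone_end in zones:
        lo = max(start, zone_start)
        hi = min(end, zone_end)
        if hi > lo:  # the interval overlaps this zone
            pieces.append((lo, hi, zone_start, zone_end, (hi - lo) / length))
    return pieces


crossing = split_by_zone(95, 120)   # the row that crosses the 100 boundary
inside = split_by_zone(120, 145)    # a row that lies entirely in one zone
print(crossing)
print(inside)
```

In SQL the same idea would probably become a join between the data and a zone table on the overlap condition, but that translation is not shown here.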
I have a problem that I find very hard to solve:
I need to calculate a column R_t in SQL where for each row, the sum of the "previous" calculated values SUM(R_t-1) is required as input. The calculation is done grouped over a ProjectID column. I have no clue how to proceed.
The formula for the calculation I am trying to achieve is R_t = ([ContractValue]_t - SUM(R_{t-1})) / [RemainingHours]_t * [HoursRegistered]_t, where "t" denotes time and SUM(R_{t-1}) is the sum of R from time 0 to time t-1 (taken as 0 when t = 0).
Time is always consecutive and always begins at t = 0, but the number of time periods may differ across [ProjectID], e.g. one project may have t = {0,1,2} and another t = {0,1,2,3,4,5}. The time period will never "jump", say from 5 to 7.
The expected output for ProjectID 101 (using the data below) is
R_0 = (500,000 - 0) / 500 * 65 = 65,000
R_1 = (500,000 - (65,000)) / 435 * 100 = 100,000
R_2 = (500,000 - (65,000 + 100,000)) / 335 * 85 = 85,000
R_3 = (500,000 - (65,000 + 100,000 + 85,000)) / 250 * 69 = 69,000
etc...
This calculation is done for each ProjectID.
My question is how to formulate this in a SQL query. My first thought was to create a recursive CTE, but I am not sure it is the right way to proceed. A recursive CTE is (from my understanding) made for handling more hierarchical structures, which this isn't really.
My other thought was to calculate SUM(R_t-1) using window functions, i.e. SUM OVER (PARTITION BY ... ORDER BY ...) with a LAG, but the recursiveness really gives me trouble and I keep running my head against the wall.
Below a query for creating the input data
CREATE TABLE [dbo].[InputForRecursiveCalculation]
(
[Time] int NULL,
ProjectID [int],
ContractValue float,
ContractHours float,
HoursRegistered float,
RemainingHours float
)
GO
INSERT INTO [dbo].[InputForRecursiveCalculation]
(
[Time]
,[ProjectID]
,[ContractValue]
,[ContractHours]
,[HoursRegistered]
,[RemainingHours]
)
VALUES
(0,101,500000,500,65,500),
(1,101,500000,500,100,435),
(2,101,500000,500,85,335),
(3,101,500000,500,69,250),
(4,101,450000,650,100,331),
(5,101,450000,650,80,231),
(6,101,450000,650,90,151),
(7,101,450000,650,45,61),
(8,101,450000,650,16,16),
(0,110,120000,90,10,90),
(1,110,120000,90,10,80),
(2,110,130000,90,10,70),
(3,110,130000,90,10,60),
(4,110,130000,90,10,50),
(5,110,130000,90,10,40),
(6,110,130000,90,10,30),
(7,110,130000,90,10,20),
(8,110,130000,90,10,10)
GO
For those of you who dare to download something from a complete stranger, I have created an Excel file demonstrating the calculation (please download the file, as you will not be able to see the actual formulas in the HTML representation shown when first clicking the link):
https://www.dropbox.com/s/3rxz72lbvooyc4y/Calculation%20example.xlsx?dl=0
Best regards,
Victor
I think this will be useful for you. There is an additional column, SumR, that holds the running sum of R over the current and all previous rows (per ProjectID).
;with recu as
(
select
Time,
ProjectId,
ContractValue,
ContractHours,
HoursRegistered,
RemainingHours,
cast((ContractValue - 0)*HoursRegistered/RemainingHours as numeric(15,0)) as R,
cast((ContractValue - 0)*HoursRegistered/RemainingHours as numeric(15,0)) as SumR
from
InputForRecursiveCalculation
where
Time=0
union all
select
input.Time,
input.ProjectId,
input.ContractValue,
input.ContractHours,
input.HoursRegistered,
input.RemainingHours,
cast((input.ContractValue - prev.SumR)*input.HoursRegistered/input.RemainingHours as numeric(15,0)),
cast((input.ContractValue - prev.SumR)*input.HoursRegistered/input.RemainingHours + prev.SumR as numeric(15,0))
from
recu prev
inner join
InputForRecursiveCalculation input
on input.ProjectId = prev.ProjectId
and input.Time = prev.Time + 1
)
select
*
from
recu
order by
ProjectID,
Time
RESULTS:
Time ProjectId ContractValue ContractHours HoursRegistered RemainingHours R SumR
----------- ----------- ---------------------- ---------------------- ---------------------- ---------------------- --------------------------------------- ---------------------------------------
0 101 500000 500 65 500 65000 65000
1 101 500000 500 100 435 100000 165000
2 101 500000 500 85 335 85000 250000
3 101 500000 500 69 250 69000 319000
4 101 450000 650 100 331 39577 358577
5 101 450000 650 80 231 31662 390239
6 101 450000 650 90 151 35619 425858
7 101 450000 650 45 61 17810 443668
8 101 450000 650 16 16 6332 450000
0 110 120000 90 10 90 13333 13333
1 110 120000 90 10 80 13333 26666
2 110 130000 90 10 70 14762 41428
3 110 130000 90 10 60 14762 56190
4 110 130000 90 10 50 14762 70952
5 110 130000 90 10 40 14762 85714
6 110 130000 90 10 30 14762 100476
7 110 130000 90 10 20 14762 115238
8 110 130000 90 10 10 14762 130000
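As a sanity check on the recurrence, here is a minimal Python sketch that walks the rows in (ProjectID, Time) order and carries the running sum forward. The names ROWS and running_r are made up for illustration, and the data is copied from the question's INSERT statement (ContractHours is omitted because the formula does not use it). Unlike the CTE, it does not round to whole numbers at each step, so intermediate values can differ from the table by a fraction.

```python
from collections import defaultdict

# (Time, ProjectID, ContractValue, HoursRegistered, RemainingHours)
ROWS = [
    (0, 101, 500000, 65, 500), (1, 101, 500000, 100, 435),
    (2, 101, 500000, 85, 335), (3, 101, 500000, 69, 250),
    (4, 101, 450000, 100, 331), (5, 101, 450000, 80, 231),
    (6, 101, 450000, 90, 151), (7, 101, 450000, 45, 61),
    (8, 101, 450000, 16, 16),
    (0, 110, 120000, 10, 90), (1, 110, 120000, 10, 80),
    (2, 110, 130000, 10, 70), (3, 110, 130000, 10, 60),
    (4, 110, 130000, 10, 50), (5, 110, 130000, 10, 40),
    (6, 110, 130000, 10, 30), (7, 110, 130000, 10, 20),
    (8, 110, 130000, 10, 10),
]


def running_r(rows):
    """Return {(project_id, time): (r, sum_r)} for the recurrence in the question."""
    totals = defaultdict(float)  # sum of R so far, per project
    out = {}
    for time, project, value, registered, remaining in sorted(
        rows, key=lambda row: (row[1], row[0])
    ):
        r = (value - totals[project]) / remaining * registered
        totals[project] += r
        out[(project, time)] = (r, totals[project])
    return out


result = running_r(ROWS)
for (project, time), (r, sum_r) in result.items():
    print(project, time, round(r), round(sum_r))
```

Because each R depends on the sum of the earlier ones, the rows have to be processed sequentially within a project, which is the same dependency the recursive CTE expresses by joining on Time = prev.Time + 1.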
I have a dataset in the below format:
Date 1 Date 1 Date 1 Date 2 Date 2 Date 3 Date 3
Product 1 10 20 10 5 10 20 30
Product 2 5 5 10 10 10 5 30
Product 3 30 10 5 10 30 30 40
Product 4 5 10 10 20 5 10 20
and I am trying to sum the sales of the products by the date, to create the below:
Date 1 Date 2 Date 3
Product 1 40 15 50
Product 3 45 40 70
Product 4 25 25 30
Product 2 20 20 35
The products in the second table will often be in a different order, so a simple SUMIF will not suffice.
I've attempted a combination of SUM, INDEX and MATCH, as well as SUM with nested IF function, but no amount of Googling or trial and error is getting me there. I keep just bringing back the values in one cell, but not managing to sum.
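With the following setup (source data in A1:H5, with the dates in B1:H1 and the product names in column A; summary table with the date headers in row 10 and the product names in column A from row 11 down),
I used the following formula:
=SUMIF($B$1:$H$1,B$10,INDIRECT("$B" & MATCH($A11,$A$1:$A$5,0) & ":$H" &MATCH($A11,$A$1:$A$5,0)))
to get what was wanted. MATCH finds the source row for the product, INDIRECT builds that row's B:H range, and SUMIF adds the cells whose header matches the date. I put the formula in B11 and then copied it across and down.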
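The same lookup-then-conditional-sum logic can be sketched in Python to check the expected totals. The names DATES, SALES and totals_by_date are invented for this sketch; the figures are the ones from the question.

```python
from collections import defaultdict

# Header row with repeated dates, and one list of sales per product.
DATES = ["Date 1", "Date 1", "Date 1", "Date 2", "Date 2", "Date 3", "Date 3"]
SALES = {
    "Product 1": [10, 20, 10, 5, 10, 20, 30],
    "Product 2": [5, 5, 10, 10, 10, 5, 30],
    "Product 3": [30, 10, 5, 10, 30, 30, 40],
    "Product 4": [5, 10, 10, 20, 5, 10, 20],
}


def totals_by_date(product):
    """Sum one product's row by date header (the lookup by name plays the role of MATCH)."""
    totals = defaultdict(int)
    for date, amount in zip(DATES, SALES[product]):
        totals[date] += amount
    return dict(totals)


# The summary can list products in any order, as in the question.
summary = {name: totals_by_date(name)
           for name in ["Product 1", "Product 3", "Product 4", "Product 2"]}
for name, totals in summary.items():
    print(name, totals)
```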
I have the following example dataset, which consists of the number of fish caught per check of a net. The nets are not checked at uniform intervals. Each row gives the day of the check in Julian days, as well as the number of days the net had been fishing since it was last checked (or since its deployment, in the case of the first check).
http://textuploader.com/9ybp
Site_Number Check_Day_Julian Set_Duration_Days Fish_Caught
2 5 3 100
2 10 5 70
2 12 2 65
2 15 3 22
100 4 3 45
100 10 6 20
100 18 8 8
450 10 10 10
450 14 4 4
In any case, I would like to turn the raw data above into the following format:
http://textuploader.com/9y3t
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2 0 0 100 100 100 70 70 70 70 70 65 65 22 22 22 0 0 0
100 0 45 45 45 20 20 20 20 20 20 8 8 8 8 8 8 8 8
450 10 10 10 10 10 10 10 10 10 10 4 4 4 4 0 0 0 0
This is a matrix which assigns the # of fish caught during the period to EACH of the days that were within that period. The columns of the matrix are Julian days, the rows are site numbers.
I have tried to do this with some matrix functions, but I have had much difficulty populating the days that fall within a fishing period but for which I do not have a row of data.
I had posted my small bit of code here, but upon reflection, my approach is quite archaic and a bit off point. Can anyone suggest a method to convert the data into the matrix provided? I've been scratching my head and googling all day but now I am stumped.
Cheers,
C
Two answers, the second one is faster but a bit low level.
Solution #1:
library(IRanges)
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
cov <- coverage(split(ir, Site_Number),
weight=split(Fish_Caught, Site_Number),
width=max(end(ir)))
do.call(rbind, lapply(cov, as.vector))
})
Solution #2:
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
site <- factor(Site_Number, unique(Site_Number))
m <- matrix(0, length(levels(site)), max(end(ir)))
ind <- cbind(rep(site, width(ir)), as.integer(ir))
m[ind] <- rep(Fish_Caught, width(ir))
m
})
I don't see a super obvious matrix transformation here. This is all I've got, assuming the raw data is in a data.frame called dd:
dd$Site_Number<-factor(dd$Site_Number)
mm<-matrix(0, nrow=nlevels(dd$Site_Number), ncol=max(dd$Check_Day_Julian))
for(i in 1:nrow(dd)) {
# a check on day d after n days of fishing covers days (d-n+1) through d
mm[as.numeric(dd[i,1]), (dd[i,2]-dd[i,3]+1):dd[i,2] ] <- dd[i,4]
}
mm
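For comparison, here is a minimal Python sketch of the same fill, using the sample data from the question. The names CHECKS and daily_matrix are made up for illustration. It assumes a check on day d after n days of fishing covers days d - n + 1 through d, which is the convention implied by the desired output.

```python
# (Site_Number, Check_Day_Julian, Set_Duration_Days, Fish_Caught)
CHECKS = [
    (2, 5, 3, 100), (2, 10, 5, 70), (2, 12, 2, 65), (2, 15, 3, 22),
    (100, 4, 3, 45), (100, 10, 6, 20), (100, 18, 8, 8),
    (450, 10, 10, 10), (450, 14, 4, 4),
]


def daily_matrix(checks):
    """Return {site: [catch for day 1, day 2, ...]} with 0 for uncovered days."""
    n_days = max(check_day for _, check_day, _, _ in checks)
    matrix = {}
    for site, check_day, duration, caught in checks:
        row = matrix.setdefault(site, [0] * n_days)
        # Assign the catch to every day in the period ending on check_day.
        for day in range(check_day - duration + 1, check_day + 1):
            row[day - 1] = caught  # day 1 is stored at index 0
    return matrix


matrix = daily_matrix(CHECKS)
for site, row in matrix.items():
    print(site, row)
```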