Loop though each observation in SAS - loops

Let say I have a table of 10000 observations:
Obs X Y Z
1
2
3
...
10000
For each observation, I create a macro: mymacro(X, Y, Z) where I use X, Y, Z like inputs. My macro create a table with 1 observation, 4 new variables var1, var2, var3, var4.
I would like to know how to loop through 10000 observations in my initial set, and the result would be like:
Obs X Y Z Var1 Var2 Var3 Var4
1
2
3
...
10000
Update:
The calculation of Var1, Var2, Var3, Var4:
I have a reference table:
Z 25 26 27 28 29 30
0 10 000 10 000 10 000 10 000 10 000 10 000
1 10 000 10 000 10 000 10 000 10 000 10 000
2 10 000 10 000 10 000 10 000 10 000 10 000
3 10 000 10 000 10 000 10 000 10 000 10 000
4 9 269 9 322 9 322 9 381 9 381 9 436
5 8 508 8 619 8 619 8 743 8 743 8 850
6 7 731 7 914 7 914 8 102 8 102 8 258
7 6 805 7 040 7 040 7 280 7 280 7 484
8 5 864 6 137 6 137 6 421 6 421 6 655
9 5 025 5 328 5 328 5 629 5 629 5 929
10 4 359 4 648 4 648 4 934 4 934 5 320
And my have set is like:
Obs X Y Z
1 27 4 9
2
3
10000
So for the first observation (27, 4, 9):
Var1 = (8 619+ 7 914+ 7 040 + 6 137 + 5 328)/ 9 322
Var2 = (8 743+ 8 102+ 7 280+ 6 421 + 5 629 )/ 9 381
So that:
Var1 = Sum of all number in column 27 (X), from the observation 5 (Z+1) to the observation 9 (Z), and divided by the value in the (column 27 (X) - observation 4 (Z))
Var2 = Sum of all number in column 28 (X+1), from the observation 5 (Z+1) to the observation 9 (Z), and divided by the value in the (column 28 (X+1) - observation 4 (Z))

I would convert the reference table to a form that lets you do the calculations for all observations at once. So make your reference table into a tall structure, either by transposing the existing table or just reading it that way to start with:
data ref_tall;
input z #;
do col=25 to 30 ;
input value :comma9. #;
output;
end;
datalines;
0 10,000 10,000 10,000 10,000 10,000 10,000
1 10,000 10,000 10,000 10,000 10,000 10,000
2 10,000 10,000 10,000 10,000 10,000 10,000
3 10,000 10,000 10,000 10,000 10,000 10,000
4 9,269 9,322 9,322 9,381 9,381 9,436
5 8,508 8,619 8,619 8,743 8,743 8,850
6 7,731 7,914 7,914 8,102 8,102 8,258
7 6,805 7,040 7,040 7,280 7,280 7,484
8 5,864 6,137 6,137 6,421 6,421 6,655
9 5,025 5,328 5,328 5,629 5,629 5,929
10 4,359 4,648 4,648 4,934 4,934 5,320
;
Now take your list table HAVE:
data have;
input id x y z;
datalines;
1 27 4 9
2 25 2 4
;
And combine it with the reference table and make your calculations:
proc sql ;
create table want1 as
select a.id
, sum(b.value)/min(c.value) as var1
from have a
left join ref_tall b
on a.x=b.col
and b.z between a.y+1 and a.z
left join ref_tall c
on a.x=c.col
and c.z = a.y
group by a.id
;
create table want2 as
select a.id
, sum(d.value)/min(e.value) as var2
from have a
left join ref_tall d
on a.x+1=d.col
and d.z between a.y+1 and a.z
left join ref_tall e
on a.x+1=e.col
and e.z = a.y
group by a.id
;
create table want as
select *
from want1 natural join want2 natural join have
;
quit;
Results:
Obs id x y z var1 var2
1 1 27 4 9 3.75864 3.85620
2 2 25 2 4 1.92690 1.93220

The reference table can be established in an array that makes performing the specified computations easy. The reference values can than be accessed using a direct address reference.
Example
The reference table data was moved into a data set so the values can be changed over time or reloaded from some source such as Excel. The reference values can be loaded into an array for use during a DATA step.
* reference information in data set, x property column names are _<num>;
data ref;
input z (_25-_30) (comma9. &);
datalines;
0 10,000 10,000 10,000 10,000 10,000 10,000
1 10,000 10,000 10,000 10,000 10,000 10,000
2 10,000 10,000 10,000 10,000 10,000 10,000
3 10,000 10,000 10,000 10,000 10,000 10,000
4 9,269 9,322 9,322 9,381 9,381 9,436
5 8,508 8,619 8,619 8,743 8,743 8,850
6 7,731 7,914 7,914 8,102 8,102 8,258
7 6,805 7,040 7,040 7,280 7,280 7,484
8 5,864 6,137 6,137 6,421 6,421 6,655
9 5,025 5,328 5,328 5,629 5,629 5,929
10 4,359 4,648 4,648 4,934 4,934 5,320
;
* computation parameters, might be a thousand of them specified;
data have;
input id x y z;
datalines;
1 27 4 9
;
* perform computation for each parameters specified;
data want;
set have;
array ref[0:10,1:30] _temporary_;
if _n_ = 1 then do ref_row = 0 by 1 until (last_ref);
* load reference data into an array for direct addressing during computation;
set ref end=last_ref;
array ref_cols _25-_30;
do index = 1 to dim(ref_cols);
colname = vname(ref_cols[index]);
colnum = input(substr(colname,2),8.);
ref[ref_row,colnum] = ref_cols[index];
end;
end;
* perform computation for parameters specified;
array vars var1-var4;
do index = 1 to dim(vars);
ref_column = x + index - 1 ; * column x, then x+1, then x+2, then x+3;
numerator = 0; * algorithm against reference data;
do ref_row = y+1 to z;
numerator + ref[ref_row,ref_column];
end;
denominator = ref[y,ref_column];
vars[index] = numerator / denominator; * result;
end;
keep id x y z numerator denominator var1-var4;
run;

Related

Issues Regarding SAS

I was working on a homework problem regarding using arrays and looping to create a new variable to identify the date of when the maximum blood lead value was obtained but got stuck. For context, here is the homework problem:
In 1990 a study was done on the blood lead levels of children in Boston. The following variables for twenty-five children from the study have been entered on multiple lines per subject in the file lead_sum2018.txt in a list format:
Line 1
ID Number (numeric, values 1-25)
Date of Birth (mmddyy8. format)
Day of Blood Sample 1 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 1 (numeric, initial possible range: -9 to 12)
Line 2
ID Number (numeric, values 1-25)
Day of Blood Sample 2 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 2 (numeric, initial possible range: -9 to 12)
Line 3
ID Number (numeric, values 1-25)
Day of Blood Sample 3 (numeric, initial possible range: -9 to 31)
Month of Blood Sample 3 (numeric, initial possible range: -9 to 12)
Line 4
ID Number (numeric, values 1-25)
Blood Lead Level Sample 1 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 2 (numeric, possible range: 0.01 – 20.00)
Blood Lead Level Sample 3 (numeric, possible range: 0.01 – 20.00)
Sex (character, ‘M’ or ‘F’)
All blood samples were drawn in 1990. However, during data entry the order of blood samples was scrambled so that the first blood sample in the data file (blood sample 1) may not correspond to the first blood sample taken on a subject, it could be the first, second or third. In addition, some of the months and days and days of blood sampling were not written on the forms. At data entry, missing month and missing day values were each coded as -9.
The team of investigators for this project has made the following decisions regarding the missing values. Any missing days are to set equal to 15, any missing months are to be set equal to 6. Any analyses that are done on this data set need to follow those decisions. Be sure to implement the SAS syntax as indicated for each question. For example, use SAS arrays and loops if the item states that these must be used.
Here is the data that the HW references (it is in list format and was contained in a separate file called lead_sum2018.txt):
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 6 6
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
5 12/26/80 21 5
5 3 7
5 -9 12
5 4.35 4.79 5.14 M
6 06/20/81 7 10
6 11 3
6 22 1
6 1.24 1.16 0.71 F
7 06/22/81 19 6
7 3 12
7 29 8
7 3.1 3.21 3.58 F
8 05/24/82 26 7
8 31 1
8 9 10
8 2.99 2.37 2.4 M
9 10/11/82 2 7
9 25 5
9 28 3
9 2.4 1.96 2.71 F
10 . 10 8
10 30 12
10 28 2
10 2.72 2.87 1.97 F
11 11/16/83 19 4
11 15 11
11 7 -9
11 4.8 4.5 4.96 M
12 03/02/84 17 6
12 11 2
12 17 11
12 2.38 2.6 2.88 F
13 04/19/84 2 12
13 -9 6
13 1 7
13 1.99 1.20 1.21 M
14 02/07/85 4 5
14 17 5
14 21 11
14 1.61 1.93 2.32 F
15 07/06/85 5 2
15 16 1
15 14 6
15 3.93 4 4.08 M
16 09/10/85 12 10
16 11 -9
16 23 6
16 3.29 2.88 2.97 M
17 11/05/85 12 7
17 18 1
17 11 11
17 1.31 0.98 1.04 F
18 12/07/85 16 2
18 18 4
18 -9 6
18 2.56 2.78 2.88 M
19 03/02/86 19 4
19 11 3
19 19 2
19 0.79 0.68 0.72 M
20 08/19/86 21 5
20 15 12
20 -9 4
20 0.66 1.15 1.42 F
21 02/22/87 16 12
21 17 9
21 13 4
21 2.92 3.27 3.23 M
22 10/11/87 7 6
22 1 12
22 -9 3
22 1.43 1.42 1.78 F
23 05/12/88 12 2
23 21 4
23 17 12
23 0.55 0.89 1.38 M
24 08/07/88 17 6
24 27 11
24 6 2
24 0.31 0.42 0.15 F
25 01/12/89 4 7
25 15 -9
25 23 1
25 1.69 1.58 1.53 M
A) Input the data and in the data step:
1) make sure that Date of Birth variable is recorded as a SAS date;
2) use SAS arrays and looping to create a SAS date variable for each of the three blood samples and to address the missing data in accordance to the decisions of the investigators. Hint: use a single array and do loop to recode the missing values for day and month, separately, and an array/do loop for creating the SAS date variable;
3) use a SAS function to create a variable for the highest, i.e., maximum, blood lead value for each child;
4) use SAS arrays and looping to identify the date on which this largest value was obtained and create a new variable for the date of the largest blood lead value;
5) determine the age of the child in years when the largest blood lead value was obtained (rounded to two decimal places);
6) create a new variable based on the age of the child in years when the largest lead value was obtained (call it, “agecat”) that takes on three levels: for children less than 4 years old, agecat should equal 1; for children at least 4 years old, but less than 8, agecat should equal 2; and for children at least 8 years of age, agecat should be 3.;
7) print out the variables for the date of birth, date of the largest lead level, age at blood sample for the largest blood lead level, agecat, sex, and the largest blood lead level (Only print out these requested variables). All dates should be formatted to use the mmddyy10. format on the output.
The code I used in response to this was:
libname HW3 'C:\Users\johns\Desktop\SAS';
filename HW3new 'C:\Users\johns\Desktop\SAS\lead_sum2018.txt';
data one;
infile HW3new;
informat dob mmddyy8.;
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
array dbs{3} dbs1 dbs2 dbs3;
array mbs{3} mbs1 mbs2 mbs3;
do i=1 to 3;
if dbs{i}=-9 then dbs{i}=15;
end;
do i=4 to 6;
if mbs{i}=-9 then mbs{i}=6;
end;
array date{3} mdy1 mdy2 mdy3;
do i=1 to 3;
date{i}=mdy(mbs{i}, dbs{i}, 1990);
end;
maxbls=max(of bls1-bls3);
array bls{3} bls1 bls2 bls3;
array maxdte{3} maxdte1 maxdte2 maxdte3;
do i=1 to i=3;
if bls{i}=maxbls then maxdte=i;
end;
agemax=maxdte-dob;
ageest=round(agemax/365.25,2);
if agemax=. then agecat=.;
else if agemax < 4 then agecat=1;
else if 4 <= agemax < 8 then agecat=2;
else if agemax ge 8 then agecat=3;
run;
I received this error:
22 maxbls=max(of bls1-bls3);
23 array bls{3} bls1 bls2 bls3;
24 array maxdte{3} maxdte1 maxdte2 maxdte3;
25 do i=1 to i=3;
26 if bls{i}=maxbls then maxdte=i;
ERROR: Illegal reference to the array maxdte.
27 end;
Does anyone have any tip is regards to this issue? What did I do wrong? Was I supposed to create an additional array for the date of when the maximum blood lead sample value was collected? Thanks!
**I'm stuck on #4 of Part A, but I included the other parts for context. Thanks!
**Edits: I included the data that I had to read into SAS and the file name of the file it came from
Just from looking at the code immediately prior to the error, you have a problem on this line:
26 if bls{i}=maxbls then maxdte=i;
You are getting the error because you are attempting to assign a value to the array maxdte. Arrays cannot be assigned values like that (unless you are using the deprecated do over syntax...) Instead, choose an element of the array and assign the value to the element. E.g. you could do:
26 if bls{i}=maxbls then maxdte{1}=i;
Or instead of a literal 1, you could use a variable containing the relevant array index.
You are not properly handling ID field from lines #2-4
input #1 id dob dbs1 mbs1
#2 dbs2 mbs2
#3 dbs3 mbs3
#4 bls1 bls2 bls3 sex;
For example you need to skip field 1 on line 2-3 or read the ids into array perhaps to check they are all the same.
input #1 id dob dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex;
This example show how to check that you have 4 lines with the same ID and if you do read the rest of the variables or execute LOSTCARD. ID 3 has a missing record;
353 data ex;
354 infile cards n=4 stopover;
355 input #1 id #2 id2 #3 id3 #4 id4 #;
356 if id eq id2 eq id3 eq id4
357 then input #1 id dob:mmddyy. dbs1 mbs1
358 #2 id2 dbs2 mbs2
359 #3 id3 dbs3 mbs3
360 #4 id4 bls1 bls2 bls3 sex :$1.;
361 else lostcard;
362 format dob mmddyy.;
363 cards;
NOTE: LOST CARD.
RULE: ----+----1----+----2----+----3----+----4----+----5----+----6----+----7----+----8----+----9----+----0
372 3 01/03/80 11 7
373 3 27 2
374 3 3.24 3.4 3.83 M
375 4 08/01/80 5 12
NOTE: LOST CARD.
376 4 28 -9
NOTE: LOST CARD.
377 4 3 4
NOTE: The data set WORK.EX has 3 observations and 15 variables.
data ex;
infile cards n=4 stopover;
input #1 id #2 id2 #3 id3 #4 id4 #;
if id eq id2 eq id3 eq id4
then input #1 id dob:mmddyy. dbs1 mbs1
#2 id2 dbs2 mbs2
#3 id3 dbs3 mbs3
#4 id4 bls1 bls2 bls3 sex :$1.;
else lostcard;
format dob mmddyy.;
cards;
1 04/30/78 6 10
1 -9 7
1 14 1
1 1.62 1.35 1.47 F
2 05/19/79 27 11
2 20 -9
2 5 6
2 1.71 1.31 1.76 F
3 01/03/80 11 7
3 27 2
3 3.24 3.4 3.83 M
4 08/01/80 5 12
4 28 -9
4 3 4
4 3.1 3.69 3.27 M
;;;;
run;
proc print;
run;

Sort array in descending order but keep NaNs at the end

I have an array:
Data = [10 20 30 50 40 60 NaN NaN 70 80; 1 2 3 4 5 6 7 8 9 10];
The first row represents the data and the second one contains the indices for the first row.
I want to arrange the first row in descending order, but with the correct indices.
Therefore, my output should be like
N_Data = [80 70 60 50 40 30 20 10 NaN NaN; 10 9 6 4 5 3 2 1 8 7];
I don't really care about the NaNs so it is ok to eliminate them from the array N_Data
You can just sort the negative of your data in the default ascending order which will automatically place the NaN values at the end. You can then negate the result to get the original values
result = -sortrows(-Data.').'
% 80 70 60 50 40 30 20 10 NaN NaN
% 10 9 6 4 5 3 2 1 8 7
And if you want to remove NaN values
result = result(:, any(isnan(result), 1));
Here you can find two options, inspired by this answers applied for your case:
%// Option 1
Data = [10 20 30 50 40 60 NaN NaN 70 80; 1 2 3 4 5 6 7 8 9 10]
Data(isnan(Data)) = -Inf;
N_Data = sortrows(Data.',-1).'
N_Data(isinf(N_Data)) = NaN
%// Option 2
Data = [10 20 30 50 40 60 NaN NaN 70 80; 1 2 3 4 5 6 7 8 9 10]
[sortedData, sortedIndices] = sort(Data(1,:),'descend')
N_Data = [sortedData; sortedIndices]
mask = isnan(sortedData)
N_Data = [ N_Data( :,~mask), N_Data( :, mask) ]
N_Data =
80 70 60 50 40 30 20 10 NaN NaN
10 9 6 4 5 3 2 1 7 8
If you don't care about the NaNs, the options can be shortened to:
%// Option 1
Data(isnan(Data)) = -Inf;
N_Data = sortrows(Data.',-1).'
N_Data(:,isinf(N_Data(1,:))) = []
%// Option 2
[sortedData, sortedIndices] = sort(Data(1,:),'descend')
N_Data = [sortedData; sortedIndices]
N_Data = N_Data( :,~isnan(sortedData) )
Explanation:
%// N_Data = N_Data( :,~isnan(sortedData) )
x = isnan(sortedData) %// creates logical mask, where NaNs exist
y = ~x %// negates mask, to mask values which are not Nan
N_Data = N_Data( :,y ) %// selects all rows and masked columns without NaN
N_Data =
80 70 60 50 40 30 20 10
10 9 6 4 5 3 2 1
Try using sortrows(Data.',-1).'.
It sorts rows and you want columns sorted, hence the .'. The -1 option is for descending order.

Adding and multiplying tables' data by values in another table

Say I have a table of subtractions and divisions sorted by date:
tblFactors
dt sub divide
2014-07-01 1 1
2014-06-01 0 5
2014-05-01 2 1
2014-05-01 0 3
I have another table of values, sorted by date:
tblValues
dt val
2014-07-05 4
2014-06-15 5
2014-05-15 21
2014-04-14 31
2014-03-15 71
I need to perform some sequential calculations. For the first value in tblFactors, I need to subtract 1 from every val where tblValues.dt < '2014-07-01'.
Next, I need to process the second row in tblFactors. There is nothing to subtract. However, the divide = 5 means that I need to divide every val by 5 where tblValues.dt < '2014-06-01'. The tricky thing is that I need to do this on the modified val from the row before (divide 20 / 5, not 21 / 5).
Each row in tblFactors would process in this manner, giving a sequence like this:
tblFactors: Row 1 Row 2 Row 3 Row 4
Dt Original Val Subtract 1 Divide by 5 Subtract 2 Divide by 3
7/5/2014 4
6/15/2014 5 4
5/15/2014 21 20 4
4/14/2014 31 30 6 4
3/25/2014 71 70 14 12 4
This would leave me with:
qryValues
dt val
2014-07-05 4
2014-06-15 4
2014-05-15 4
2014-04-14 4
2014-03-15 4
Right now I'm doing vector multiplications over loops in R. I was wondering if there was a clever way to accomplish this in the native sql. I tried doing some aggregations but I've had limited success.

Assigning a single value to all cells within a specified time period, matrix format

I have the following example dataset which consists of the # of fish caught per check of a net. The nets are not checked at uniform intervals. The day of the check is denoted in julian days as well as the number of days the net had been fishing since last checked (or since it's deployment in the case of the first check)
http://textuploader.com/9ybp
Site_Number Check_Day_Julian Set_Duration_Days Fish_Caught
2 5 3 100
2 10 5 70
2 12 2 65
2 15 3 22
100 4 3 45
100 10 6 20
100 18 8 8
450 10 10 10
450 14 4 4
In any case, I would like to turn the raw data above into the following format:
http://textuploader.com/9y3t
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
2 0 0 100 100 100 70 70 70 70 70 65 65 22 22 22 0 0 0
100 0 45 45 45 20 20 20 20 20 20 8 8 8 8 8 8 8 8
450 10 10 10 10 10 10 10 10 10 10 4 4 4 4 0 0 0 0
This is a matrix which assigns the # of fish caught during the period to EACH of the days that were within that period. The columns of the matrix are Julian days, the rows are site numbers.
I have tried to do this with some matrix functions but I have had much difficulty trying to populate all the fields that are within the time period, but I do not necessarily have a row of data for?
I had posted my small bit of code here, but upon reflection, my approach is quite archaic and a bit off point. Can anyone suggest a method to convert the data into the matrix provided? I've been scratching my head and googling all day but now I am stumped.
Cheers,
C
Two answers, the second one is faster but a bit low level.
Solution #1:
library(IRanges)
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
cov <- coverage(split(ir, Site_Number),
weight=split(Fish_Caught, Site_Number),
width=max(end(ir)))
do.call(rbind, lapply(cov, as.vector))
})
Solution #2:
with(d, {
ir <- IRanges(end=Check_Day_Julian, width=Set_Duration_Days)
site <- factor(Site_Number, unique(Site_Number))
m <- matrix(0, length(levels(site)), max(end(ir)))
ind <- cbind(rep(site, width(ir)), as.integer(ir))
m[ind] <- rep(Fish_Caught, width(ir))
m
})
I don't see a super obvious matrix transformation here. This is all i've got assuming the raw data is in a data.frame called dd
dd$Site_Number<-factor(dd$Site_Number)
mm<-matrix(0, nrow=nlevels(dd$Site_Number), ncol=18)
for(i in 1:nrow(dd)) {
mm[as.numeric(dd[i,1]), (dd[i,2]-dd[i,3]):dd[i,2] ] <- dd[i,4]
}
mm

combine row and columns in matlab for series

everyone
I have problem when make iteration for two variable but, it's combine just in one vector or array
At first i write my input for iteration w(0) as w
w=[1 50];
for number 1, I use array
e=0:1:(n-1);
f=0:2:(2*n-2); %for 50 in column 2.
I try to use this code
w=[1 50];
ww=kron(ones((n),1),w)
e=0:1:(n-1);
f=0:2:(2*n-2);
r=[e',f']
x=ww+r
and the output is
ww =
1 50
1 50
1 50
1 50
1 50
1 50
r =
0 0
1 2
2 4
3 6
4 8
5 10
x =
1 50
2 52
3 54
4 56
5 58
6 60
I want x is output just in one array, in example
x =
1
50
2
52
3
54
4
56
5
58
6
60
where w=[1 50] can be use difference addition for iteration
Apply this to your x matrix:
x = reshape(x.',[],1);
See reshape doc for details.
Here is a simple method to create your vector from scratch:
x = [1:6;50:2:60];
x(:)
Or with your variables:
x = [e; f];
x(:)

Resources