I am trying to apply log, square, cubic and log-odds transformations to my input data so that I can compare which transformation performs best in univariate regression.
I have tried the following code on a dataset with 1,000 variables. It either returns an error, runs out of memory, or simply fails to execute. Are there any limitations on transforming variables en masse in this way using arrays?
/*Create a table for reference*/
DATA input_data;
ARRAY var_[*] var_1-var_1000;
DO i = 1 to 1000;
DO i = 1 to 1000;
var_(i)= i*j;
output;
END;
END;
RUN;
/*Log, square, cubic, logit transform all variables*/
DATA input_transform;
SET input_data;
ARRAY var[*] var_1-var_1000;
ARRAY log[*] log_1-log_1000;
ARRAY logit[*] logit_1-logit_1000;
ARRAY sq[*] sq_1-sq_1000;
ARRAY cubic[*] cubic_1-cubic_1000;
DO i = 1 to 1000;
log(i) = log(var(i));
logit(i) = log((var(i))/(1-var(i)));
sq(i) = var(i)**2;
cubic(i) = var(i)**3;
END;
RUN;
Expected output: a new dataset with 5,000 variables, each holding the respective transformation.
You are using I as the index variable for both of your two nested DO loops. That is probably messing them up.
Also your first data step is writing 1,000,000 observations of 1,002 variables with only the lower left triangle of the "array" filled in. Do you really want the OUTPUT statement inside the loop?
Hypothetically there are no issues with this, as long as your code is correct. Here's an example and the log.
option notes;
%let size=1000;
/*Create a table for reference*/
DATA input_data;
ARRAY var_[*] var_1-var_&size.;
DO i = 1 to &size.;
DO j = 1 to &size.;
var_(j)= i*j;
END;
output;
END;
RUN;
/*Log, square, cubic, logit transform all variables*/
DATA input_transform;
SET input_data;
ARRAY _var[*] var_1-var_&size.;
ARRAY _log[*] log_1-log_&size.;
ARRAY _logit[*] logit_1-logit_&size.;
ARRAY _sq[*] sq_1-sq_&size.;
ARRAY _cubic[*] cubic_1-cubic_&size.;
DO i = 1 to &size.;
_log(i) = log(_var(i));
_logit(i) = sqrt(_var(i));
_sq(i) = _var(i)**2;
_cubic(i) = _var(i)**3;
END;
RUN;
and the log:
1576 option notes;
1577 %let size=1000;
1578
1579 /*Create a table for reference*/
1580 DATA input_data;
1581 ARRAY var_[*] var_1-var_&size.;
1582
1583 DO i = 1 to &size.;
1584 DO j = 1 to &size.;
1585 var_(j)= i*j;
1586 END;
1587 output;
1588 END;
1589 RUN;
NOTE: The data set WORK.INPUT_DATA has 1000 observations and 1002
variables.
NOTE: DATA statement used (Total process time):
real time 0.03 seconds
cpu time 0.03 seconds
1590
1591 /*Log, square, cubic, logit transform all variables*/
1592 DATA input_transform;
1593 SET input_data;
1594 ARRAY _var[*] var_1-var_&size.;
1595 ARRAY _log[*] log_1-log_&size.;
1596 ARRAY _logit[*] logit_1-logit_&size.;
1597 ARRAY _sq[*] sq_1-sq_&size.;
1598 ARRAY _cubic[*] cubic_1-cubic_&size.;
1599
1600 DO i = 1 to &size.;
1601 _log(i) = log(_var(i));
1602 _logit(i) = sqrt(_var(i));
1603 _sq(i) = _var(i)**2;
1604 _cubic(i) = _var(i)**3;
1605 END;
1606 RUN;
NOTE: There were 1000 observations read from the data set
WORK.INPUT_DATA.
NOTE: The data set WORK.INPUT_TRANSFORM has 1000 observations and 5002
variables.
NOTE: DATA statement used (Total process time):
real time 0.12 seconds
cpu time 0.10 seconds
I'm looking at running PROC FREQ on the following snippet of data:
SampleData
I want to find the proportion of MutYes to MutNo by gene, comparing control and cancer.
Here's the code that I've got so far:
Proc Freq data=polysorted;
by Gene;
weight Status;
table MutYes*MutNo /chisq ;
run;
My question is: how do I need to rearrange the data to make this work correctly? Right now, it's giving me:
ERROR: Variable Status in list does not match type prescribed for this list.
I'm trying to get a layout like this:
Layout
clearly outlining the proportion of MutYes to MutNo by control and cancer for each gene.
You need a variable (resp) with the values Yes/No and a WEIGHT variable (Y) for the counts:
data mut;
do gene = 'ATPhase6','ATPhase8';
do status = 'Control','PC';
do resp = 'Yes','No';
input Y @@;
output;
end;
end;
end;
cards;
29 236
21 169
6 259
13 177
;;;;
run;
proc print;
run;
proc freq data=mut order=data;
by gene;
table status*resp / cmh;
weight y;
run;
I have tried to write a user-defined function MinDis() and apply it in a data step. The function computes the minimum distance from one point to each element of a numeric array. The code follows:
proc fcmp outlib = work.funcs.Math;
function MinDis(Exp,Arr[*]);
array dis[2] /symbols;
call dynamic_array(dis,dim(Arr));
do i = 1 to dim(Arr);
dis[i] = abs((Exp - Arr[i]));
end;
return(min(of dis[*]));
endsub;
quit;
option cmplib=work.funcs ;
data MinDis;
input LamdazLower LamdazUpper @@;
cards;
2.50 10.0
2.51 10.8
2.49 9.97
2.75 9.50
;
run;
data _null_;
set ;
array _PTN_ [14] _temporary_ (0.5,1,1.5,2,2.5,3,4,5,6,7,8,9,10,12);
StdLamZLow = MinDis(LamdazLower,_PTN_);
StdLamZUpp = MinDis(LamdazUpper,_PTN_);
put _all_;
run;
It compiled without errors but gave the wrong results. StdLamZLow only gets the minimum distance from LamdazLower to the first two elements of the array _PTN_.
When I declare dis with a dimension of 999 or something very large and remove the call dynamic_array statement, I get the right result. But I would really like to know why min(of dis[*]) treats dis as a 2-element array.
By the way, how can I use an implied DO loop (do over ...) instead of an explicit DO loop? I have tried several times but haven't succeeded yet.
Thanks for any hints.
I think this happens because of the dynamic array. The MIN function sees only the static length of the dis array (2 in your case). So you should try computing the minimum of the array without calling the MIN function:
proc fcmp outlib = work.funcs.Math;
function MinDis(Exp,Arr[*]);
length=dim(Arr);
array dis[2] /nosymbols;
call dynamic_array(dis,length);
dis[1]=abs((Exp - Arr[1]));
min=dis[1];
do i = 1 to length;
dis[i] = abs((Exp - Arr[i]));
if dis[i] < min then min=dis[i];
end;
return(min);
endsub;
quit;
Output:
LamdazLower=2.5 LamdazUpper=10 StdLamZLow=0 StdLamZUpp=0
LamdazLower=2.51 LamdazUpper=10.8 StdLamZLow=0.01 StdLamZUpp=0.8
LamdazLower=2.49 LamdazUpper=9.97 StdLamZLow=0.01 StdLamZUpp=0.03
LamdazLower=2.75 LamdazUpper=9.5 StdLamZLow=0.25 StdLamZUpp=0.5
In addition, regarding implied DO loops (here on 6th page), if I understood the question correctly:
data temp;
array dis dis1-dis4;
do over dis;
dis=2;
put _all_;
end;
run;
Output:
_I_=1 dis1=2 dis2=. dis3=. dis4=. _ERROR_=0 _N_=1
_I_=2 dis1=2 dis2=2 dis3=. dis4=. _ERROR_=0 _N_=1
_I_=3 dis1=2 dis2=2 dis3=2 dis4=. _ERROR_=0 _N_=1
_I_=4 dis1=2 dis2=2 dis3=2 dis4=2 _ERROR_=0 _N_=1
I have an existing collection of variables a_0,...,a_45 where a_i represents the amount of stuff I have on day i. I'd like to create a new collection of variables b_0,...,b_45 to represent the incremental change in stuff I have on day i (i.e. b_k=a_k-a_(k-1) ). My approach:
data test;
set dataset;
array a a_0-a_45;
array b b_0-b_45;
b(1)=a(1);
do i=2 to 45;
b(i)=a(i)-a(i-1);
end;
run;
However my b variables just come out missing.
What initial values do you have for a_1 to a_45 before you start the loop? As you are not initialising them (except for a_0 ≡ a(1)), every b(i) term will be a difference of two a terms, of which at least one will be missing, unless these variables are populated in your input dataset.
Here is some sample code showing that the delta computation is correct when the variable names in the data set align with the variables named in the array statement in the data step.
Sample data
data have(keep=product_id note a_:);
do product_id = 1 to 100;
length note $15;
array amount a_0-a_45;
call missing(of amount(*));
if (ranuni(123) < 0.5) then do;
note = 'static deltas';
static_delta = ceil(5 * ranuni(123));
amount(1) = static_delta;
do inventory_day = 2 to dim(amount);
amount(inventory_day) = amount(inventory_day-1) + static_delta;
end;
end;
else do;
note = 'random deltas';
amount(1) = ceil(5 * ranuni(123));
do inventory_day = 2 to dim(amount);
amount(inventory_day) = max ( 0, amount(inventory_day-1) + floor(10 * ranuni(123)) - 5 );
end;
end;
OUTPUT;
end;
run;
Compute deltas
data want;
set have;
array amount a_0-a_45;
array delta b_0-b_45;
delta(1) = amount(1);
do i=2 to dim(amount);
delta(i) = amount(i) - amount(i-1);
end;
drop i;
format a_: b_: 4.;
run;
As Richard already suggested in his comment while I was writing this code, the only error in your code is the loop bound: it should run from 2 to 46, because there are 46 elements in the array. The code below should work for you.
%macro f();
data dataset;
%do i = 0 %to 45;
a_&i. = ranuni(2);
%end;
run;
%mend;
%f();
data test;
set dataset;
array a1 a_0-a_45;
array b1 b_0-b_45;
/* This line will help in avoiding b_0 to have a missing value */
b1(1)=a1(1);
do i=2 to 46;
b1(i)=a1(i)-a1(i-1);
end;
run;
I have an array with totals for 210 days. I need to find the sum of all 90 day ranges. The new array is med_sum. So med_sum(1) =sum(of Total(32)-total(121)), then med_sum(2)=sum(of total(33)-total(122)), and so on, 90 different times all the way to med_sum(90)=sum(of total(121)-total(210)).
Below is the syntax, but the sum(of) function isn't allowing me to do this and errors out. I have tried quite a few different options but have been unable to find anything that works.
Thank you in advance!!
data work.total_base_3;
set work.total_base_2;
array med_total(*) total1-total210;
array med_sum(*) avg1-avg90;
do i = 1 to 90;
med_sum(i)=sum(of med_total(i+31)-med_total(i+120));
end;
run;
You cannot use array references in variable lists, just actual variable names. So you want to generate 90 sums of 90 values with the window sliding. In essence you want
avg1 = sum(of total32 - total121);
avg2 = sum(of total33 - total122);
avg3 = sum(of total34 - total123);
You could use macro logic to just generate that series of statements. But if you look at the relationship between the variables you can see that
med_sum(n+1) = sum(med_sum(n),med_total(n+1+120),-1*med_total(n+31));
So your loop will look something like:
med_sum(1) = sum(of total32-total121);
do n=1 to dim(med_sum)-1;
med_sum(n+1) = sum(med_sum(n),med_total(n+1+120),-1*med_total(n+31));
end;
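The macro route mentioned above could be sketched roughly as follows. The macro name window_sums and its parameters n, first and width are hypothetical, chosen only for illustration; the data set and variable names are taken from the question.

```sas
/* Hypothetical helper: emits avg1 = sum(of total32-total121); and so on. */
%macro window_sums(n=90, first=32, width=90);
  %local k lo hi;
  %do k = 1 %to &n;
    %let lo = %eval(&first + &k - 1);   /* first variable of window k */
    %let hi = %eval(&lo + &width - 1);  /* last variable of window k  */
    avg&k = sum(of total&lo.-total&hi.);
  %end;
%mend window_sums;

data work.total_base_3;
  set work.total_base_2;
  %window_sums()
run;
```

With the defaults, the first generated statement covers total32-total121 and the last covers total121-total210. The running-sum loop above does less arithmetic per row, so the macro version is mainly useful when readability of the generated statements matters more.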
Here's a sample that you should be able to extend to your data (in your case you would change the 3's to 90's):
Case 1, data in rows:
data test;
keep obs;
do i=1 to 10;
obs = i;
output;
end;
run;
data test1;
set test;
keep obs sum;
array x[3];
retain x;
x[mod(_n_ -1,3)+1] = obs;
if (_n_ >= 3)then do;
sum = 0;
do i = 1 to 3;
sum= sum + x[i];
end;
end;
run;
Case 2, data in columns (use the test dataset from above):
proc transpose data=test out=testrow;
var obs;
run;
data test2;
set testrow;
array med_total(*) col1-col10;
array med_sum(8) ;
do i = 3 to 10;
med_sum[i-2]=0;
do j = 1 +(i-3) to i;
med_sum(i-2)=med_sum[i-2] + med_total(j);
end;
end;
run;
My problem is following:
I have a dataset with two columns
number_of_years payment
4 100
5 123
2 52
and I would like to create new variable (or set of variables and then to sum them) and add value based on the value in the column number_of_years.
New variable should get following value:
number_of_years payment new_variable
4 100 100*1.01**4 + 100*1.01**3 + 100*1.01**2 + 100*1.01**1
5 123 123*1.01**5 + 123*1.01**4 + 123*1.01**3 + 123*1.01**2 + 123*1.01**1
2 52 52*1.01**2 + 52*1.01**1
etc.
My original idea was to put the value from the column number_of_years into a macro variable, loop over its value to create additional columns and then sum them, but it does not work.
data uprava;
set work.data_diskontace;
%let value1=number_of_years;
%macro spocti(n);
%do i=1 %to &n;
new_variable&i = payment*1.01**&i;
%end;
%mend doit;
%spocti(value1);
run;
Thank you for any suggestion which way to go.
You should use regular DO loops instead of macro loops, because the number of iterations is dynamic: it depends on the value of the number_of_years variable in each row.
data uprava;
set work.data_diskontace;
new_variable = 0;
do i = 1 to number_of_years;
new_variable = new_variable + payment*1.01**i;
end;
run;
There is no need for macros. This is a finite geometric series, and for n years its sum has the closed form payment * 1.01 * (1 - 1.01**n) / (1 - 1.01).
The simplest solution is:
data have;
input years payment;
cards;
4 100
5 123
2 52
;
run;
data want;
set have;
new_variable = (1.01*(1-1.01**years)/(1-1.01))*payment;
run;
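As a quick sanity check, the closed form can be compared against the explicit loop on the same rows. This is only a sketch: the data set name check and the variables loop_sum, closed_form and diff are made up for illustration, and it assumes the have data set created above.

```sas
data check;
  set have;
  /* Explicit sum: payment*1.01**1 + ... + payment*1.01**years */
  loop_sum = 0;
  do i = 1 to years;
    loop_sum = loop_sum + payment*1.01**i;
  end;
  /* Closed form of the same finite geometric series */
  closed_form = payment*1.01*(1 - 1.01**years)/(1 - 1.01);
  /* Should be zero apart from floating-point rounding */
  diff = loop_sum - closed_form;
  drop i;
run;

proc print data=check;
run;
```

If diff is not negligibly small for some row, the likely cause is a non-integer or missing value in years, which the DO loop and the closed form handle differently.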