I am completely unfamiliar with macros/do loops/arrays in SAS, but I have been trying to read up on them. It is not going well.
I have a dataset that has 148,176 rows, 9 columns. I want to run all 148176 combinations one by one through my program (so each row one by one) and have it spit out each result as one long list. I should have 148176 values at the end.
Before working with the macro piece, I just used macro variables so the user could input each value, like so:
%let classIin = 1;
%let classIIin = 0.8;
Now I would like to replace each number of the above %let statements with a variable from the 9 columns (each column would correspond to one of the above macro variables, there are 9 I just didn't list them all).
I started trying to write this code, but I am really confused and I know I am missing key things about this process. If anyone has some helpful video tutorials I should watch, I am happy to do that, because nothing I am finding is helping me much so far.
In the following, "AA" and "AB" are two of the column names in Work.MasterPlanList, but I'm not sure if I can call forth variables in this way.
%macro masterlist;
%do i=1 %to 148176;
Data Work.test;
Set work.MasterPlanList(firstobs=&i obs=&i);
call symputx ('classIin', AA)
call symputx ('classIIin', AB)
%end;
%mend;
Then I would theoretically call in the %macro in my code, but the other problem is that I need each variable from this list at different times in my code. Is that an issue or will my macro work by looking at row 1, go through my whole code/calculation set, spit out value 1, then go back to the beginning and look at row 2, go through the code/calc, value 2, etc. etc. etc. until 148176?
It is hard to answer without more specifics of the calculations you are doing. For example you could possibly just do all of your calculations in a data step and never use macro variables or macros.
But if you have structured your analysis for one set of parameters as a macro, then you can use the dataset to generate multiple calls to the macro. That said, 150K calls to a long, complex macro is quite a lot.
Say you had a macro called %MYMACRO that had 2 input parameters. And you had a SAS dataset with 2 variables with the values for those parameters. You could then use CALL EXECUTE() or other code generation methods to generate one macro call per observation.
For code generation on this scale I find that using a data step to write the code to a file is easier to understand and debug than using CALL EXECUTE, especially if you name your dataset variables with the same names as the macro parameters.
filename code temp;
data _null_;
set my_metadata ;
file code ;
put '%mymacro(' var1= ',' var2= ')';
run;
%include code /source2;
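For readers who think more easily in a general-purpose language, the code-generation idea above can be sketched in Python: one line of macro-call text is built per row of parameters. This is only an illustration of the technique, not SAS code; the `rows` list and the `macro_call` helper are hypothetical stand-ins for the metadata dataset and the PUT statement, and the exact spacing SAS writes may differ.

```python
# Hypothetical parameter rows standing in for the observations of my_metadata.
rows = [
    {"var1": 1, "var2": 0.8},
    {"var1": 2, "var2": 0.5},
]


def macro_call(macro_name, params):
    """Build one '%name(key=value,...)' call from a dict of parameters."""
    args = ",".join(f"{key}={value}" for key, value in params.items())
    return f"%{macro_name}({args})"


# One generated call per row, analogous to one PUT per observation.
generated = [macro_call("mymacro", row) for row in rows]
print("\n".join(generated))
```

Each generated line is plain text; in the SAS version it is the %INCLUDE that turns that text into executed macro calls.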
Is it possible to replace a column with an array?
I have a column from 1....24 and I want to replace it with an array that has 24 distinct elements in it.
How would I go about doing this?
Thanks!
Sounds like you want to transpose the table.
data have;
do i=1 to 24;
output;
end;
run;
proc transpose data=have out=want;
run;
Check out the documentation on PROC TRANSPOSE for more information and options. http://support.sas.com/documentation/cdl/en/proc/70377/HTML/default/viewer.htm#n1xno5xgs39b70n0zydov0owajj8.htm
There are a few other ways beyond PROC TRANSPOSE. I would suggest PROC TRANSPOSE for most use cases, but there are cases where one of these alternatives is appropriate.
The simplest way is to load it into the array using the set statement. This is a common way to set up a temporary array that is used in very similar fashion to a hash table - prior to hash tables being in base SAS, in fact, this was how a lot of that work was done.
Note below the use of do while even though I'm searching for a not condition; you cannot use do until here due to the timing of that statement (try it and see if you can figure out why).
data want;
array ages[19] _temporary_;
do _n_ = 1 by 1 while (not eof); *iterate over `class` and load the array;
set sashelp.class(keep=age) end=eof;
ages[_n_] = age;
end;
call missing(age); *not really needed;
call sortn(of ages[*]); *sort ascending;
set sashelp.class; *now the data step loop pulls one record at a time;
age_rank = whichn(age, of ages[*]); *calculates the rank of age;
run;
Of course don't use _temporary_ if you want to store the variables from that array in the dataset. And remember that array is a one-data-step construct, it never persists; you'd have to redeclare the array each data step you want to use it in, but the variables would already exist obviously.
Finally, if you want fewer rows, use an explicit OUTPUT statement that runs only after the boundary condition is reached, so that only one row is written per [whatever].
There are other options; you could even load it into a hash table and then unload it into an array, if you had a reason to do that (I can't think of one, but who knows).
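As a language-neutral illustration of the array lookup in the answer above (load a whole column into memory, sort it, then find each value's first position), here is a rough Python sketch. The `ages` list is made-up sample data rather than the contents of sashelp.class, and `age_rank` is a hypothetical helper that mimics what WHICHN does on the sorted array.

```python
# Hypothetical ages standing in for the AGE column of sashelp.class.
ages = [14, 13, 13, 14, 12, 15]

# Load the whole column into one in-memory list and sort it ascending
# (the role played by the temporary array and CALL SORTN).
sorted_ages = sorted(ages)


def age_rank(age):
    """1-based position of the first match, like WHICHN on the sorted array."""
    return sorted_ages.index(age) + 1


# Second pass over the data: one rank per record; ties share the lowest rank.
ranks = [age_rank(age) for age in ages]
print(ranks)
```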
I am trying to convert MATLAB code to C. The MATLAB code uses a singular value decomposition (SVD) of 3x3 matrices, which I implemented in C using Numerical Recipes. The MATLAB code later works with the right singular vectors, which in some of the cases I tested differ between MATLAB and C: either the second and third columns are swapped or some values have opposite signs. In other cases the values are identical. Here are some examples:
Expl1: (Identical values without considering round off error)
Matlab:
-0.3939 0.9010 0.1819
0.6583 0.1385 0.7399
0.6414 0.4112 -0.6477
C:
-0.3939 0.9010 0.1819
0.6584 0.1385 0.7398
0.6414 0.4112 -0.6477
Expl2: (swapped 2nd and 3rd columns)
Matlab:
-0.0309 0.1010 0.9944
-0.0073 -0.9949 0.1008
0.9995 -0.0042 0.0315
C:
-0.0309 0.9944 0.1010
-0.0074 0.1008 -0.9949
0.9995 0.0315 -0.0042
Expl3:(opposite values)
Matlab:
-0.1712 -0.8130 -0.5566
-0.8861 -0.1199 0.4476
0.4306 -0.5698 0.6999
C:
-0.1712 0.8130 0.5566
-0.8861 0.1199 -0.4477
0.4307 0.5698 -0.6999
Would this difference cause erroneous results?
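The right singular vectors of a matrix are unique up to multiplication by a unit-phase factor if it has distinct singular values. When considering real singular vectors, this comes down to a change of sign. A sign flip in a right singular vector goes together with the same flip in the corresponding left singular vector, so the product UΣVᵀ is unchanged.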
Also, since singular vectors correspond to certain singular values (diagonal entries of Σ), their order can be changed when the position of the singular values on the diagonal of Σ is changed.
Whether these changes cause erroneous results depends heavily on what you intend to do with the right singular vectors later on in your code.
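One way to check that two SVD implementations agree apart from these harmless ambiguities is to match columns up to sign and order. The following is a minimal Python sketch of that check using the matrices from Expl2 and Expl3 in the question; the helper names and the tolerance of 1e-3 are assumptions chosen for this illustration.

```python
def column(matrix, j):
    """Return column j of a matrix stored as a list of rows."""
    return [row[j] for row in matrix]


def same_up_to_sign(u, v, tol=1e-3):
    """True if v equals u or -u within tol, component by component."""
    plus = all(abs(a - b) <= tol for a, b in zip(u, v))
    minus = all(abs(a + b) <= tol for a, b in zip(u, v))
    return plus or minus


def match_columns(a, b, tol=1e-3):
    """For each column of a, index of the first column of b equal up to sign, else None."""
    n = len(a[0])
    result = []
    for j in range(n):
        found = None
        for k in range(n):
            if same_up_to_sign(column(a, j), column(b, k), tol):
                found = k
                break
        result.append(found)
    return result


# Expl2 from the question: second and third columns swapped.
matlab2 = [[-0.0309, 0.1010, 0.9944], [-0.0073, -0.9949, 0.1008], [0.9995, -0.0042, 0.0315]]
c2 = [[-0.0309, 0.9944, 0.1010], [-0.0074, 0.1008, -0.9949], [0.9995, 0.0315, -0.0042]]

# Expl3 from the question: second and third columns have opposite signs.
matlab3 = [[-0.1712, -0.8130, -0.5566], [-0.8861, -0.1199, 0.4476], [0.4306, -0.5698, 0.6999]]
c3 = [[-0.1712, 0.8130, 0.5566], [-0.8861, 0.1199, -0.4477], [0.4307, 0.5698, -0.6999]]

print(match_columns(matlab2, c2))
print(match_columns(matlab3, c3))
```

If every column finds a partner, the two results describe the same singular directions; whether the remaining sign and order differences matter still depends on how the vectors are used afterwards.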
I am supposed to integrate data of acceleration and time to get velocity using a user defined script.
What I have so far is:
function myIntegral=myCumulativeTrapz(X,Y)
myIntegral=0.5*(Y+(Y+1))*((X+1)-X)
When I hit run, I get this error:
Error: File: myCumulativeTrapz.m Line: 27 Column: 1
Function definitions are not permitted in this context.
If the script for integration was successful, I would then put
velocity=myCumulativeTrapz(data_resultant_acc(:,1), data_resultant_acc(:,2))
in the command window. (Data_resultant_acc is an array where time is in the first column and acceleration is in the second column.)
Can someone help me out and tell me why is this not working?
The error message is shown because, in MATLAB releases before R2016b, a file can't contain both functions and commands that are outside of any function. So, if you have something like
data_resultant_acc = rand(10,2);
velocity=myCumulativeTrapz(data_resultant_acc(:,1), data_resultant_acc(:,2));
function myIntegral=myCumulativeTrapz(X,Y)
myIntegral=0.5*(Y+(Y+1))*((X+1)-X)
end
change that to
function myProject
data_resultant_acc = rand(10,2);
velocity=myCumulativeTrapz(data_resultant_acc(:,1), data_resultant_acc(:,2));
end
function myIntegral=myCumulativeTrapz(X,Y)
myIntegral=0.5*(Y+(Y+1))*((X+1)-X)
end
thus making myProject the top-level function that will be executed when you run the file (for best results, the file name should be the name of that function).
After that, you will discover that 0.5*(Y+(Y+1))*((X+1)-X) is not a valid formula, for multiple reasons. Since both X and Y are column vectors, the first one should be transposed before multiplication. Also, you are adding 1 to vector components instead of shifting index by 1. A correct way to do the index shift is below:
myIntegral=0.5*(Y(1:end-1)+Y(2:end))'*(X(2:end)-X(1:end-1));
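Here the colon ranges (1:end-1 and 2:end) create vectors that omit either the very last or the very first entry. The average of two such vectors gives the averages of adjacent values. The difference gives the difference of adjacent values. Note that the transposed product adds up all the trapezoids, so this expression returns the total integral as a single number; for a running integral, such as the velocity at each time step, use cumsum on the element-wise product of the same two vectors.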
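To make the index shift concrete, here is a small Python sketch of a cumulative trapezoidal integral. It is an illustration of the rule, not a translation of the asker's file; the function name `cumulative_trapz` and the sample data (acceleration a = 2t, so velocity should be t squared) are assumptions made for the example.

```python
def cumulative_trapz(x, y):
    """Running trapezoidal integral of y over x; the first entry is 0."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if not x:
        return []
    total = 0.0
    result = [0.0]
    for i in range(1, len(x)):
        # Area of one trapezoid: average of adjacent y values times the x step.
        total += 0.5 * (y[i - 1] + y[i]) * (x[i] - x[i - 1])
        result.append(total)
    return result


# Hypothetical sample data: a = 2t, so the exact velocity is t**2.
times = [0.0, 1.0, 2.0, 3.0]
acceleration = [0.0, 2.0, 4.0, 6.0]
velocity = cumulative_trapz(times, acceleration)
print(velocity)
```

The last entry of the result is the total integral, which is what the single-expression formula above computes.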
I'm new to SAS, coming from Python, Java and C++. In those languages, the proper thing to do when writing or repeating large statements is to encapsulate them in a variable that is defined once and reused several times in the code.
I.e. instead of writing the same where statement over and over each time two similar datasets are merged, I want to write:
WHERE_CONDITION_VARIABLE = 'X in (10, 100, 1000, 10000 ......100000000);
data output;
merge in1 in2;
WHERE WHERE_CONDITION_VARIABLE;
run;
data output2;
merge in3 in4;
WHERE WHERE_CONDITION_VARIABLE;
run;
Unfortunately, I haven't been able to figure out how to define a variable such as WHERE_CONDITION_VARIABLE to streamline the code. Is what I'm asking possible to do in SAS?
You can use macro variables.
You define them like this:
%let WHERE_CONDITION_VARIABLE = X in (10, 100, 1000);
And reference them like this:
&WHERE_CONDITION_VARIABLE
SAS has a lot of options for avoiding repeated code; in that way it's actually a lot like Python, although the method for accomplishing it is a little different because you have a separate compilation step (so you can't just write WHERE followed by a variable name, as in your example).
First off, you have the macro variable. If you're just repeating text several times, you can define it in a macro variable, like so:
%let condition=X in (1,10,100,1000);
Macro variables are treated as if they were text you had written. They do not need quotation marks or other text qualifiers, unless they are intended to contain them as legal code, ie:
%let condition=X in ('A','B','C');
would be legal, but
%let condition="X in ('A','B','C')";
would probably not be what you want (unless you want that to be evaluated as a string, anyhow).
Through macro variables, you also have the ability to generate larger amounts of code in a datastep and then include it. For example, if you have a dataset containing a list of conditions, you could apply them this way:
data conditions;
format condition $50.;
input condition $50.; *formatted input reads the whole line, not just the first word;
datalines4;
if x = 15 then y=5;
if x = 20 then y=10;
if x = 20 and z = 5 then y=15;
if x = 20 and z = 10 then y=20;
;;;;
run;
proc sql;
select condition into :condlist separated by ' ' from conditions;
quit;
data want;
set have;
&condlist;
run;
That would take the conditions from "conditions" dataset and push it into a macro variable "&condlist". The PROC SQL call is the easiest way to get it into a macro variable, but there are others; CALL SYMPUT also can do it in a data step, or you can write it to a text file and then %include the text file as code as well. This is more commonly used in advanced programming by generating calls to a macro, with the conditions dataset providing the macro parameters; in this case you might have a macro
%macro cond(x=,y=,z=);
if x=&x and z=&z then y=&y;
%mend cond;
Then you could generate calls to cond from a dataset with just x,y,z values:
proc sql;
select cats('%cond(x=',x,',y=',y,',z=',z,')') into :condlist separated by ' ' from conditions;
quit;
and use it in the same way.
Macro programming in general is a good solution for avoiding code creep; a macro is written once and then can be run multiple times with different parameters. A macro can be anywhere from one line of code (like above) executed inside a data step, to hundreds of lines containing multiple DATA and PROC steps. Macro programming is a complex topic in and of itself, and worth reading more on.
You can also write a function in SAS. PROC FCMP (function compile) allows you to write fairly complex functions and execute them in your data step or even your PROC statements. http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu07.pdf is a good place to start with FCMP if you have 9.2; if you have 9.3, I haven't seen any papers yet (but there may be some out there) showing the newer things in FCMP. FCMP is fairly new so there are still a lot of changes in each iteration of SAS.
Here's an example of FCMP to do your condition:
proc fcmp
outlib=work.funcs.Test; /* where will the functions be saved */
function condition(x); /* declare a function returning a number */
if x in (1,10,100,1000) then return(1);
else return(0);
endsub;
quit;
data have;
do x = 1,5,10,20,100,150,1000,1500;
output;
end;
run;
options cmplib=work.funcs;
data want;
set have;
if condition(x) then output;
run;
You also have the CALL EXECUTE routine, which allows you to directly execute code from a dataset. Using the same CONDITIONS dataset:
data _null_;
set conditions end=eof;
if _n_ = 1 then call execute('data want; set have;');
call execute(condition);
if eof then call execute('run;');
run;
That would effectively construct a data step that executes immediately following your data null step with the same code as in the macro variable example. Call execute works a little differently, so while in this example there shouldn't be any difference, there are a few issues with timing that can cause problems (or can be advantageous); which you use depends on the circumstance. Particularly for CALL EXECUTE, read up on the documentation and online papers (SUGI papers most commonly) to find out more details.
In addition to directly executing code via macro variables or CALL EXECUTE, you have a lot of other ways of performing tasks to avoid wallpaper code. For example, to more easily perform the if statements above, you might be able to use a format. Formats convert one value to another value; most commonly you might have something like 'DOLLAR6.2' which would give you $3.50 from the number 3.5. However, formats can also be used to replace if-this-then-that expressions. If there were only X and Y (and no Z conditions), then you could do this, given this conditions dataset:
data conditions;
input x y;
datalines;
1 5
2 10
3 20
4 50
5 100
;;;;
run;
data for_fmt;
set conditions;
rename x=start y=label;
fmtname='XTOY';
type='i'; *type=i means numeric informat, so numeric to numeric conversion. Informat = to numeric, Format= to character.;
run;
proc format cntlin=for_fmt;
quit;
data want;
set have;
y = input(x,XTOY.);
run;
There you have one line of code converting x to y. (Of course there is a bit of code setting up the format, but it can be separated from the main code and included in the set-up portion of your code, like a .h file in C.)
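Since the question comes from a Python background, the format-as-lookup idea is roughly what a dictionary does there. The sketch below uses the x and y pairs from the conditions dataset above; the `have` rows are hypothetical, and returning None for an unmapped x is an assumption of this sketch, not a statement about what the SAS informat does with unmatched values.

```python
# The x -> y pairs from the conditions dataset above.
x_to_y = {1: 5, 2: 10, 3: 20, 4: 50, 5: 100}

# Hypothetical input rows standing in for the HAVE dataset.
have = [{"x": 1}, {"x": 3}, {"x": 7}]

# One lookup per row replaces a chain of if-then statements.
# Unmapped x values get None here (an assumption of this sketch).
want = [{**row, "y": x_to_y.get(row["x"])} for row in have]
print(want)
```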
You also have hash table lookups, which are really helpful when you have more complex conversions - either 1 to many or many to 1. They work just like they sound - you load the hash table into memory and perform lookups. http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf is one good place to start.
Finally, one good way to avoid repeating code is to use fewer separate datasets. SAS data steps and procedures have the "BY" statement available, which means they treat each different value of the BY variable(s) as effectively a separate dataset. The variable names and lengths need to match, as it is still technically one dataset, but if you have many datasets of similar data, and want to perform the same action to each, you can perform them once with a BY statement rather than multiple times.
For example, say you had the dataset SASHELP.CARS. You might want to calculate something separately for each make of car. You could either do:
data acura;
set sashelp.cars;
if make='Acura'; *values in MAKE are mixed case and comparisons are case sensitive;
run;
data honda;
set sashelp.cars;
if make='Honda';
run;
And then run your code on each dataset separately. However, a more SASsy way to do it is to use the BY statement:
proc means data=sashelp.cars;
by make;
var mpg_city mpg_highway;
run;
Now you get a separate page for each make. You can use the BY statement in data step processing as well; you get variables FIRST.make and LAST.make which tell you if you're on the first record of a new MAKE or the last record of a MAKE (the record just before a change in value), which allow you to do things based on where you are in a dataset's BY group. For example, if first.make then counter=0; would allow you to have a counter that is reset each time you have a new value in make. The only caveat for BY groups is that you have to sort your dataset by the BY variable prior to using it (or have an index on that variable, or both). This is really helpful for analysis of bootstrap samples or other processes where you have many nearly-identical datasets and perform identical actions on them.
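For a reader coming from Python, BY-group processing on sorted data behaves much like `itertools.groupby`. The following sketch is only an analogy under that assumption: the `cars` rows are made-up sample values, not the real contents of SASHELP.CARS, and the counter reset plays the role of the first.make check described above.

```python
from itertools import groupby

# Hypothetical rows, already sorted by make (as a BY statement requires).
cars = [
    {"make": "Acura", "mpg_city": 17},
    {"make": "Acura", "mpg_city": 24},
    {"make": "Honda", "mpg_city": 26},
    {"make": "Honda", "mpg_city": 32},
    {"make": "Honda", "mpg_city": 36},
]

summary = {}
for make, group in groupby(cars, key=lambda row: row["make"]):
    counter = 0  # reset at the start of each group, like "if first.make then counter=0"
    total = 0
    for row in group:
        counter += 1
        total += row["mpg_city"]
    # Reached the end of the group, the point where last.make would be true.
    summary[make] = {"n": counter, "mean_mpg_city": total / counter}

print(summary)
```

As with the BY statement, `groupby` only groups adjacent rows, so the input has to be sorted by the grouping key first.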
I am assuming you want to put all the WHERE condition values in a bucket and then use them through an index-like structure, as you would in Python.
If that's the case, then you may want to have a look at the INTO clause of PROC SQL.
With INTO you drop all of your X values into one or more macro variables.
Then you can reference them whenever you want.