Replacing an entire column with an array - SAS - arrays

Is it possible to replace a column with an array?
I have a column with values 1 through 24 and I want to replace it with an array that has 24 distinct elements in it.
How would I go about doing this?
Thanks!

Sounds like you want to transpose the table.
data have;
    do i=1 to 24;
        output;
    end;
run;

proc transpose data=have out=want;
run;
Check out the documentation on PROC TRANSPOSE for more information and options. http://support.sas.com/documentation/cdl/en/proc/70377/HTML/default/viewer.htm#n1xno5xgs39b70n0zydov0owajj8.htm
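As a quick sketch of two of those options (the PREFIX value and VAR choice here are illustrative assumptions, not from the question):

proc transpose data=have out=want(drop=_name_) prefix=element;
    var i; /* transpose the 24 values of i into element1-element24 */
run;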

There are a few other ways beyond PROC TRANSPOSE; I would suggest that for most use cases, but there are cases where one of these alternatives is appropriate.
The simplest way is to load it into the array using the set statement. This is a common way to set up a temporary array that is used in very similar fashion to a hash table - prior to hash tables being in base SAS, in fact, this was how a lot of that work was done.
Note below the use of do while even though I'm searching for a not condition; you cannot use do until here due to the timing of that statement (try it and see if you can figure out why).
data want;
    array ages[19] _temporary_;
    do _n_ = 1 by 1 while (not eof); *iterate over `class` and load the array;
        set sashelp.class(keep=age) end=eof;
        ages[_n_] = age;
    end;
    call missing(age); *not really needed;
    call sortn(of ages[*]); *sort ascending;
    set sashelp.class; *now the data step loop pulls one record at a time;
    age_rank = whichn(age, of ages[*]); *calculates the rank of age;
run;
Of course, don't use _temporary_ if you want to store the variables from that array in the dataset. And remember that an array is a one-data-step construct; it never persists. You'd have to redeclare the array in each data step you want to use it in, though the variables themselves would already exist.
Finally, if you want fewer rows, use a selective OUTPUT statement (after reaching the boundary condition) to write only one row per group.
There are other options; you could even load it into a hash table and then unload it into an array, if you had a reason to do that (I can't think of one, but who knows).
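To make the redeclaration point concrete, here is a minimal sketch (the dataset and variable names are illustrative):

data step1;
    array vals[3] v1-v3 (10 20 30); /* no _temporary_, so v1-v3 are written out */
run;

data step2;
    set step1;
    array vals[3] v1-v3; /* the array must be redeclared; v1-v3 already exist */
    total = sum(of vals[*]);
run;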

Related

Random sample from another table's column

I am trying to figure out how to populate a "fake" column by choosing randomly from another table column.
So far this was easy using an array and the rantbl() function as there were not a lot of modalities.
data want;
    set have;
    array values[2] $10 _temporary_ ('NO','YES');
    value = values[rantbl(0,0.5,0.5)];
    array start_dates[4] _temporary_ (1735689600,1780358400,1798848000,1798848000);
    START_DATE = start_dates[rantbl(0,0.25,0.25,0.25,0.25)];
    format START_DATE datetime20.;
run;
However, my question is: what happens if there are, for example, more than 150 modalities in the other table? Is there a way to put into an array all the modalities that are in another table? Or better, to populate the new "fake" column with modalities from another table's column according to the distribution of those modalities in that table?
I'm not entirely sure, but here's how I interpret your request and how I would solve it.
You have a table one. You want to create a new data set want with an additional column. This column should have values sampled from a pool of values given in column y of yet another data set, two. You want to simulate the new column in the want data set according to the distribution of y in the two data set.
So, in the example below, there should be a .5 chance of simulating y = 3, and .25 each for 1 and 2.
I think the way to go is not using arrays at all. See if this helps you.
data one;
    do x = 1 to 1e4;
        output;
    end;
run;

data two;
    input y;
    datalines;
1
2
3
3
;

data want;
    set one;
    p = ceil(rand('uniform')*n);
    set two(keep = y) nobs = n point = p;
run;
To verify that the new column resembles the distribution from the two data set:
proc freq data = want;
    tables y / nocum;
run;
There are probably a dozen good ways to do this, with the ideal one depending on various details of your data - in particular, on how performance-sensitive this is.
The most SASsy way to do this, I would say, is to use PROC SURVEYSELECT. This generates a random sample of the size you want, and then merges it on. It is not the fastest way, but it is very easy to understand and is fast-ish as long as you aren't talking humongous data sizes.
data _null_;
    set sashelp.cars nobs=nobs_cars;
    call symputx('nobs_cars', nobs_cars);
    stop;
run;

proc surveyselect data=sashelp.class sampsize=&nobs_cars out=names(keep=name)
    seed=7 method=urs outhits outorder=random;
run;

data want;
    merge sashelp.cars names;
run;
In this example, we are taking the dataset sashelp.cars, and appending an owner's name to each car, which we choose at random from the dataset sashelp.class.
What we're doing here is first determining how many records we need - the number of observations in the to-be-merged-to dataset. This step can be skipped if you know that already, but it takes basically zero time no matter what the dataset size.
Second, we use PROC SURVEYSELECT to generate the random list. We use method=urs to ask for unrestricted random sampling, i.e., sampling with replacement: we take 428 (in this case) separate pulls, each time with every row equally likely to be chosen. We use outhits and outorder=random to get a dataset with one row per desired output row, in random order (without outhits it gives one row per input dataset row plus a number-of-times-sampled variable, and without outorder=random it gives the rows in sorted order). sampsize is given our macro variable storing the number of observations in the eventual output dataset.
Third, we do a side-by-side merge (with no BY statement, intentionally). Please note that in some installations, the MERGENOBY system option is set to give a warning or error for this particular usage; if so, you may need to do this slightly differently, though it is easy to achieve identical results using two SET statements, as sketched below.
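For reference, the two-SET-statement variant pairs rows one-for-one and sidesteps MERGENOBY entirely:

data want;
    set sashelp.cars;
    set names; /* reads one row of names alongside each row of cars */
run;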

SAS, array code, two indices, dropping records

I am looking through some code and wondering what this does. Below are the code comments. I'm still not sure what this code does even with the code comments. I have used arrays but not familiar with this code. It looks like this code dedupes by using two indices. Is that correct? So if there is a combination of CCS_DR_IDX and TXN_IDX, it will delete those records?
Now handle cases where the dollar matches. If ccs_dr_idx has already been used then delete the record. Dropped txns here will be added back in with the claim data called missing.
PROC SORT DATA=OUT.REQ_1_9_F_AMT_MATCH;
    BY CCS_DR_IDX DATEDIF;
RUN;

DATA OUT.REQ_1_9_F_AMT_MATCH_V2;
    SET OUT.REQ_1_9_F_AMT_MATCH;
    ARRAY id_one{40000} id_one1-id_one40000;
    ARRAY id_two{40000} id_two1-id_two40000;
    RETAIN id_one1-id_one40000 id_two1-id_two40000;
    IF _n_=1 THEN i=1;
    ELSE i+1;
    DO j=1 TO i;
        IF CCS_DR_IDX=id_one{j} THEN DELETE;
    END;
    DO k=1 TO i;
        IF TXN_IDX=id_two{k} THEN DELETE;
    END;
    id_one{i}=CCS_DR_IDX;
    id_two{i}=TXN_IDX;
    DROP i j k id_one1-id_one40000 id_two1-id_two40000;
RUN;
The sort is
BY CCS_DR_IDX DATEDIF;
The filtering or selecting occurs when control reaches the bottom of the data step and the row is implicitly OUTPUT. That occurs only if CCS_DR_IDX and TXN_IDX form a combination where neither value has appeared previously.
Since you have sorted by CCS_DR_IDX, there is implicit grouping, and at most one record per CCS_DR_IDX will be output; for the first group it must be the first record in the group. Each successive row in a CCS_DR_IDX group, after that output, will match an entry in id_one and be tossed away by DELETE.
When you start processing the next CCS_DR_IDX group, rows will be processed until you reach the next TXN_IDX that is distinct from those tracked in id_two. Because the sort had a second key, DATEDIF, you can say the output is "a selection of the first-occurring combinations of unique pair items CCS_DR_IDX and TXN_IDX" (somewhat akin to pair-sampling without repeats).
There could be a case where some CCS_DR_IDX is not in the output -- that would happen when the group contains only TXN_IDX values that occurred in prior CCS_DR_IDX groups.
Without seeing the data model and combination reasons (probably some sort of cartesian join) it's hard to make a less vague statement of what is being selected.
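For comparison, here is a hedged sketch of the same keep-if-neither-key-was-seen logic using two hash objects instead of the 40000-slot arrays (untested, since the underlying data isn't shown):

data out.req_1_9_f_amt_match_v2;
    set out.req_1_9_f_amt_match; /* already sorted by CCS_DR_IDX DATEDIF */
    if _n_ = 1 then do;
        declare hash seen_one();
        seen_one.defineKey('ccs_dr_idx');
        seen_one.defineDone();
        declare hash seen_two();
        seen_two.defineKey('txn_idx');
        seen_two.defineDone();
    end;
    /* keep the row only if neither index value has appeared before */
    if seen_one.check() ne 0 and seen_two.check() ne 0;
    seen_one.add();
    seen_two.add();
run;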

macro to run iterations from data table in existing program

I am completely unfamiliar with macros/do loops/arrays in SAS, but I have been trying to read up on them. It is not going well.
I have a dataset that has 148,176 rows, 9 columns. I want to run all 148176 combinations one by one through my program (so each row one by one) and have it spit out each result as one long list. I should have 148176 values at the end.
Before working with the macro piece, I just used macro variables so the user could input each value, like so:
%let classIin = 1;
%let classIIin = 0.8;
Now I would like to replace each number of the above %let statements with a variable from the 9 columns (each column would correspond to one of the above macro variables, there are 9 I just didn't list them all).
I started trying to write this code, but I am really confused and I know I am missing key things about this process. If anyone has some helpful video tutorials I should watch, I am happy to do that, because nothing I am finding is helping me much so far.
In the following, "AA" and "AB" are two of the column names in Work.MasterPlanList, but I'm not sure if I can call forth variables in this way.
%macro masterlist;
    %do i=1 %to 148176;
        Data Work.test;
            Set work.MasterPlanList(firstobs=&i obs=&i);
            call symputx ('classIin', AA)
            call symputx ('classIIin', AB)
    %end;
%mend;
Then I would theoretically call the macro in my code, but the other problem is that I need each variable from this list at different times in my code. Is that an issue, or will my macro work by looking at row 1, going through my whole code/calculation set, spitting out value 1, then going back to the beginning and looking at row 2, and so on, until 148176?
It is hard to answer without more specifics of the calculations you are doing. For example you could possibly just do all of your calculations in a data step and never use macro variables or macros.
But if you have structured your analysis for one set of parameters as a macro, then you can use the dataset to generate multiple calls to that macro - although 150K calls to a long, complex macro is quite a lot.
Say you had a macro called %MYMACRO that had 2 input parameters. And you had a SAS dataset with 2 variables with the values for those parameters. You could then use CALL EXECUTE() or other code generation methods to generate one macro call per observation.
For code generation on this scale I find that using a data step to write the code is easier to understand and debug than using CALL EXECUTE. Especially if you name your dataset variable with the same names as the macro parameters.
filename code temp;

data _null_;
    set my_metadata;
    file code;
    put '%mymacro(' var1= ',' var2= ')';
run;

%include code / source2;
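The CALL EXECUTE route mentioned above would look roughly like this sketch (again assuming the dataset variables share the macro parameter names):

data _null_;
    set my_metadata;
    call execute(cats('%mymacro(var1=', var1, ',var2=', var2, ')'));
run;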

Saving parts of Matlab cell array

I am using Matlab for some data collection, and I want to save the data after each trial (just in case something goes wrong). The data is organized as a cell array of cell arrays, basically in the format
data{target}{trial} = zeros(1000,19)
But the actual data gets up to >150 MB by the end of the collection, so saving everything after each trial becomes prohibitively slow.
So now I am looking at opting for the matfile approach (http://www.mathworks.de/de/help/matlab/ref/matfile.html), which would allow me to only save parts of the data. The problem: this doesn't support cells of cell arrays, which means I couldn't change/update the data for a single trial; I would have to re-save the entire target's data (100 trials).
So, my question:
Is there another different method I can use to save parts of the cell array to speed up saving?
(OR)
Is there a better way to format my data that would work with this saving process?
A not very elegant but possibly effective solution is to use trial as part of the variable name. That is, use not a cell array of cell arrays (data{target}{trial}), but just different cell arrays such as data_1{target}, data_2{target}, where 1, 2 are the values of the trial counter.
You could do that with eval: for example
trial = 1; % change this value in a for loop
eval(['data_' num2str(trial) '{target} = zeros(1000,19);']); % fill data_1{target}
You can then save the data for each trial in a different file. For example, this
eval([ 'save temp_save_file_' num2str(trial) ' data_' num2str(trial)])
saves data_1 in file temp_save_file_1, etc.
Update:
Actually, it does appear to be possible to index into cell arrays, just not into cells inside cell arrays. Hence, if you store your data slightly differently, it seems you can use matfile to update only part of it. See this example:
x = cell(3,4);
save x;
matObj = matfile('x.mat','writable',true);
matObj.x(3,4) = {eye(10)};
Note that this gives me a version warning, but it seems to work.
Hope this does the trick. However, still look into the next part of my answer as it may help you even more.
For calculations it is usually not required to save to disk after every iteration. An easy way to get a speedup (at the cost of a little more risk) is to save only after every n trials.
Like this for example:
maxTrial = 99;
saveEvery = 10;
for trial = 1:maxTrial
    myFun; % Do your calculations here
    if trial == maxTrial || mod(trial, saveEvery) == 0
        save % Put your save command here
    end
end
If your data is always at (or within) a certain size, you can also choose to store your data in a matrix rather than a cell array, then you can use indexing to save only part of the file.
In response to @Luis, I will post another way to deal with the situation.
It is indeed an option to save data in named variables or files, but to save a named variable in a named file seems too much.
If you only change the name of the file, you can save everything without using eval:
assuming you are dealing with trial 't':
filename = ['temp_save_file_' num2str(t)];
If you really want, you can use sprintf to zero-pad the number, writing 001 for example.
Now you can simply use this:
save(filename, 'myData')
To read the data back, construct the filename again and do something like this:
totalData = {}; %Initialize your total data
And then read them as you wrote them (inside a loop):
load(filename)
totalData{t} = myData;

Defining variables in sas to clean up code

I'm new to SAS coming from python, java and C++. From these languages, the proper thing to do when writing/repeating large statements is to encapsulate them in a variable that is defined once and repeated several times in the code.
I.e. instead of writing the same where statement over and over each time two similar datasets are merged, I want to write:
WHERE_CONDITION_VARIABLE = 'X in (10, 100, 1000, 10000 ......100000000);
data output;
    merge in1 in2;
    WHERE WHERE_CONDITION_VARIABLE;
run;

data output2;
    merge in3 in4;
    WHERE WHERE_CONDITION_VARIABLE;
run;
Unfortunately, I haven't been able to figure out how to define a variable such as WHERE_CONDITION_VARIABLE to streamline the code. Is what I'm asking possible to do in SAS?
You can use macro variables.
You define them like this:
%let WHERE_CONDITION_VARIABLE = X in (10, 100, 1000);
And reference them like this:
&WHERE_CONDITION_VARIABLE
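Put together, a minimal sketch using the names from your question:

%let where_condition = X in (10, 100, 1000);

data output;
    merge in1 in2;
    where &where_condition;
run;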
SAS has a lot of options for avoiding repeated code; in that way it's actually a lot like python, although the method for accomplishing it is a little different, since SAS has a separate macro compilation step (so you can't just write WHERE the way you ask directly).
First off, you have the macro variable. If you're just repeating text several times, you can define it in a macro variable, like so:
%let condition=X in (1,10,100,1000);
Macro variables are treated as if they were text you had written. They do not need quotation marks or other text qualifiers unless the qualifiers are intended to be part of the legal code, i.e.:
%let condition=X in ('A','B','C');
would be legal, but
%let condition="X in ('A','B','C')";
would probably not be what you want (unless you want that to be evaluated as a string, anyhow).
Through macro variables, you also have the ability to generate larger amounts of code in a datastep and then include it. For example, if you have a dataset containing a list of conditions, you could apply them this way:
data conditions;
    infile datalines4 truncover;
    input condition $char50.; /* $char50. reads the whole line, including embedded blanks */
datalines4;
if x = 15 then y=5;
if x = 20 then y=10;
if x = 20 and z = 5 then y=15;
if x = 20 and z = 10 then y=20;
;;;;
run;
proc sql;
    select condition into :condlist separated by ' ' from conditions;
quit;

data want;
    set have;
    &condlist;
run;
That would take the conditions from "conditions" dataset and push it into a macro variable "&condlist". The PROC SQL call is the easiest way to get it into a macro variable, but there are others; CALL SYMPUT also can do it in a data step, or you can write it to a text file and then %include the text file as code as well. This is more commonly used in advanced programming by generating calls to a macro, with the conditions dataset providing the macro parameters; in this case you might have a macro
%macro cond(x=,y=,z=);
    if x=&x and z=&z then y=&y;
%mend cond;
Then you could generate calls to cond from a dataset with just x,y,z values:
proc sql;
    select cats('%cond(x=',x,',y=',y,',z=',z,')')
        into :condlist separated by ' '
        from conditions;
quit;
and use it in the same way.
Macro programming in general is a good solution for avoiding code creep; a macro is written once and then can be run multiple times with different parameters. A macro can be anywhere from one line of code (like above) executed inside a data step, to hundreds of lines containing multiple DATA and PROC steps. Macro programming is a complex topic in and of itself, and worth reading more on.
You can also write a function in SAS. PROC FCMP (function compile) allows you to write fairly complex functions and execute them in your data step or even your PROC statements. http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu07.pdf is a good place to start with FCMP if you have 9.2; if you have 9.3, I haven't seen any papers yet (but there may be some out there) showing the newer things in FCMP. FCMP is fairly new so there are still a lot of changes in each iteration of SAS.
Here's an example of FCMP to do your condition:
proc fcmp outlib=work.funcs.Test; /* where the functions will be saved */
    function condition(x); /* declare a function returning a number */
        if x in (1,10,100,1000) then return(1);
        else return(0);
    endsub;
quit;

data have;
    do x = 1,5,10,20,100,150,1000,1500;
        output;
    end;
run;

options cmplib=work.funcs;

data want;
    set have;
    if condition(x) then output;
run;
You also have the CALL EXECUTE statement, which allows you to directly execute code from a dataset. Using the same CONDITIONS dataset:
data _null_;
    set conditions end=eof;
    if _n_ = 1 then call execute('data want; set have;');
    call execute(condition);
    if eof then call execute('run;');
run;
That would effectively construct a data step, executed immediately following your DATA _NULL_ step, with the same code as in the macro variable example. CALL EXECUTE works a little differently, so while in this example there shouldn't be any difference, there are a few issues with timing that can cause problems (or can be advantageous); which approach you use depends on the circumstance. Particularly for CALL EXECUTE, read the documentation and online papers (SUGI papers most commonly) for more details.
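One concrete timing issue: a macro call handed to CALL EXECUTE is expanded while the generating step is still running, which can make macro statements inside it resolve earlier than you expect. The usual remedy, sketched here with the %cond macro from above (assuming a conditions dataset with x, y, and z columns), is to wrap the macro name in %NRSTR so expansion waits until the generated code actually runs:

data _null_;
    set conditions;
    call execute(cats('%nrstr(%cond)(x=', x, ',y=', y, ',z=', z, ')'));
run;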
In addition to directly executing code via macro variables or CALL EXECUTE, you have a lot of other ways of performing tasks to avoid wallpaper code. For example, to more easily perform the if statements above, you might be able to use a format. Formats convert one value to another value; most commonly you might have something like 'DOLLAR6.2' which would give you $3.50 from the number 3.5. However, formats can also be used to replace if-this-then-that expressions. If there were only X and Y (and no Z conditions), then you could do this, given this conditions dataset:
data conditions;
    input x y;
    datalines;
1 5
2 10
3 20
4 50
5 100
;
run;

data for_fmt;
    set conditions;
    rename x=start y=label;
    fmtname='XTOY';
    type='i'; * type=i means numeric informat, i.e. numeric-to-numeric conversion. Informat = to numeric, format = to character;
run;

proc format cntlin=for_fmt;
quit;

data want;
    set have;
    y = input(x, XTOY.);
run;
There you have one line of code converting x to y. (Of course there is a bit of code setting up the format, but it can be separated from the main code and included in the set-up portion of your program, like a .h file in C.)
You also have hash table lookups, which are really helpful when you have more complex conversions - either 1 to many or many to 1. They work just like they sound - you load the hash table into memory and perform lookups. http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf is one good place to start.
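A minimal sketch of a hash lookup in a data step (the lookup dataset and its key/data columns are assumptions for illustration):

data want;
    if _n_ = 1 then do;
        if 0 then set lookup; /* establish x and y attributes in the PDV */
        declare hash lk(dataset:'lookup'); /* load the lookup table into memory */
        lk.defineKey('x');
        lk.defineData('y');
        lk.defineDone();
    end;
    set have;
    if lk.find() ne 0 then call missing(y); /* no match: leave y missing */
run;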
Finally, one good way to avoid repeating code is to use fewer separate datasets. SAS data steps and procedures have the "BY" statement available, which means they treat each different value of the BY variable(s) as effectively a separate dataset. The variable names and lengths need to match, as it is still technically one dataset, but if you have many datasets of similar data, and want to perform the same action to each, you can perform them once with a BY statement rather than multiple times.
For example, say you had the dataset SASHELP.CARS. You might want to calculate something separately for each make of car. You could either do:
data acura;
    set sashelp.cars;
    if make='Acura';
run;

data honda;
    set sashelp.cars;
    if make='Honda';
run;
And then run your code on each dataset separately. However, a more SASsy way to do it is to use the BY statement:
proc means data=sashelp.cars;
    by make;
    var mpg_city mpg_highway;
run;
Now you get a separate page for each make. You can use the BY statement in data step processing as well; you get variables FIRST.make and LAST.make, which tell you whether you're on the first record of a new MAKE or the last record of one (the record just before a change in value). These let you act based on where you are in a dataset's BY group; for example, if first.make then counter=0; gives you a counter that resets each time MAKE takes a new value, as sketched below. The only caveat for BY groups is that you must sort the dataset by the BY variable(s) first (or have an index on them, or both). This is really helpful for analyzing bootstrap samples or other processes where you have many nearly identical datasets and perform identical actions on them.
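A short sketch of that counter pattern (sorted first, since BY groups require it):

proc sort data=sashelp.cars out=cars_sorted;
    by make;
run;

data counted;
    set cars_sorted;
    by make;
    if first.make then counter = 0; /* reset at each new make */
    counter + 1; /* sum statement, so counter is retained across rows */
run;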
I am assuming you want to put all the WHERE condition values in a bucket and then use them via index-like access (as in Python).
If that's the case, you may want to have a look at SELECT ... INTO.
With INTO you can drop all of your X values into macro variables and then take them whenever you want.
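Presumably this refers to SELECT ... INTO with a numbered range of macro variables, roughly like this sketch (the dataset and column names are assumptions):

proc sql noprint;
    select x into :x1-:x150 from have; /* one macro variable per row, &x1 through &x150 */
quit;
%put First value: &x1;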
