Random sample from another table's column - arrays

I am trying to figure out how to populate a "fake" column by choosing randomly from another table column.
So far this was easy using an array and the rantbl() function as there were not a lot of modalities.
data want;
set have;
array values[2] $10 _temporary_ ('NO','YES');
value=values[rantbl(0,0.5,0.5)];
array start_dates[4] _temporary_ (1735689600,1780358400,1798848000,1798848000);
START_DATE=start_dates[rantbl(0,0.25,0.25,0.25,0.25)];
format START_DATE datetime20.;
run;
However, my question is what happens if there are, for example, more than 150 modalities in the other table? Is there a way to put all the modalities from another table into an array? Or better, to populate the new "fake" column with modalities from another table's column, following the distribution of those modalities in the other table?

I'm not entirely sure, but here's how I interpret your request and how I would solve it.
You have a table one. You want to create a new data set want with an additional column. This column should have values that are sampled from a pool of values given in yet another data set two in column y. You want to simulate the new column in the want data set according to the distribution of y in the two data set.
So, in the example below, there should be a .5 chance of simulating y = 3 and a .25 chance each for 1 and 2.
I think the way to go is not using arrays at all. See if this helps you.
data one;
do x = 1 to 1e4;
output;
end;
run;
data two;
input y;
datalines;
1
2
3
3
;
data want;
set one;
p = ceil(rand('uniform')*n);
set two(keep = y) nobs = n point = p;
run;
To verify that the new column resembles the distribution from the two data set:
proc freq data = want;
tables y / nocum;
run;
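If you need the same sample on every run, one option is to seed the random number stream. The following is only a sketch of the same point= technique with CALL STREAMINIT added; the seed 12345 and the data set name want_seeded are arbitrary choices of mine, not part of the answer above.

```sas
/* Small versions of the two input tables, so the sketch runs on its own */
data one;
  do x = 1 to 1e4;
    output;
  end;
run;

data two;
  input y;
  datalines;
1
2
3
3
;

/* want_seeded and the seed value are hypothetical choices for illustration */
data want_seeded;
  if _n_ = 1 then call streaminit(12345); /* fix the seed once, before the first rand() call */
  set one;
  p = ceil(rand('uniform') * n);          /* random row number between 1 and n */
  set two(keep = y) nobs = n point = p;   /* read row p of two directly */
run;
```

Because every row of two is equally likely to be picked, values that occur more often in two are drawn proportionally more often.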

There are probably a dozen good ways to do this; which one is ideal depends on various details of your data, in particular how performance-sensitive this is.
The most SASsy way to do this, I would say, is to use PROC SURVEYSELECT. This generates a random sample of the size you want, and then merges it on. It is not the fastest way, but it is very easy to understand and is fast-ish as long as you aren't talking humungous data sizes.
data _null_;
set sashelp.cars nobs=nobs_cars;
call symputx('nobs_cars',nobs_Cars);
stop;
run;
proc surveyselect data=sashelp.class sampsize=&nobs_Cars out=names(keep=name)
seed=7 method=urs outhits outorder=random;
run;
data want;
merge sashelp.cars names;
run;
In this example, we are taking the dataset sashelp.cars, and appending an owner's name to each car, which we choose at random from the dataset sashelp.class.
What we're doing here is first determining how many records we need - the number of observations in the to-be-merged-to dataset. This step can be skipped if you know that already, but it takes basically zero time no matter what the dataset size.
Second, we use proc surveyselect to generate the random list. We use method=urs to ask for simple random sampling with replacement, meaning we take 428 (in this case) separate pulls, each time with every row being equally likely to be chosen. We use outhits and outorder=random to get a dataset with one row per desired output dataset row and in a random order (without outhits it gives one row per input dataset row plus a number-of-times-sampled variable, and without outorder=random it gives them in sorted order). sampsize is used with our created macro variable that stores the number of observations in the eventual output dataset.
Third, we do a side by side merge (with no by statement, intentionally). Please note that in some installations, options mergenoby is set to give a warning or error for this particular usage; if so you may need to do this slightly differently, though it is easy to do so using two set statements (set sashelp.cars; set names;) to achieve the identical results.
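For the case where a merge without a BY statement is not allowed, the two-set-statement variant mentioned above could look roughly like this. It is a sketch that repeats the earlier steps so it runs on its own; the output name want2 is my own choice.

```sas
/* Count the rows in the target dataset */
data _null_;
  set sashelp.cars nobs=nobs_cars;
  call symputx('nobs_cars', nobs_cars);
  stop;
run;

/* Draw one random name per target row, with replacement */
proc surveyselect data=sashelp.class sampsize=&nobs_cars out=names(keep=name)
  seed=7 method=urs outhits outorder=random;
run;

/* Side-by-side combination without MERGE: each SET reads its next row */
data want2;   /* want2 is a hypothetical name */
  set sashelp.cars;
  set names;
run;
```

The step stops as soon as either SET statement runs out of rows, which is why the sample size has to match the row count of the target dataset.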

Related

SAS, array code, two indices, dropping records

I am looking through some code and wondering what it does. Below are the code comments. I'm still not sure what this code does even with the comments. I have used arrays, but I am not familiar with this code. It looks like it dedupes by using two indices. Is that correct? So if there is a combination of CCS_DR_IDX and TXN_IDX, it will delete those records?
Now handle cases where the dollar matches. If ccs_dr_idx has already been used then delete the record. Dropped txns here will be added back in with the claim data called missing.
PROC SORT DATA=OUT.REQ_1_9_F_AMT_MATCH; BY CCS_DR_IDX DATEDIF; RUN;
DATA OUT.REQ_1_9_F_AMT_MATCH_V2;
SET OUT.REQ_1_9_F_AMT_MATCH;
ARRAY id_one{40000} id_one1-id_one40000;
ARRAY id_two{40000} id_two1-id_two40000;
RETAIN id_one1-id_one40000 id_two1-id_two40000;
IF _n_=1 then i=1;
else i+1;
do j=1 to i;
if CCS_DR_IDX=id_one{j} then delete;
end;
do k = 1 to i;
if TXN_IDX = id_two{k} then delete;
end;
id_one{i}=CCS_DR_IDX;
id_two{i}=TXN_IDX;
drop i j k id_one1-id_one40000 id_two1-id_two40000;
run;
The sort is
BY CCS_DR_IDX DATEDIF;
The filtering or selecting occurs when control reaches the bottom of the data step and implicitly OUTPUTs. That occurs only if CCS_DR_IDX and TXN_IDX are a combination where neither has appeared previously.
Since you have sorted by CCS_DR_IDX you can know there is implicit grouping and there will be at most one record per CCS_DR_IDX output, and for the first group it must be the first record in the group. Each successive row in a CCS_DR_IDX group, post output, will match an entry in id_one and be tossed away by DELETE.
When you start processing the next CCS_DR_IDX group the rows will be processed until you reach the next distinct TXN_IDX with respect to those tracked in id_two. Because the sort had a second key DATEDIF you can say the output is "a selection of the first occurring combinations of unique pair items CCS_DR_IDX TXN_IDX" (somewhat akin to pair-sampling without repeats.)
There could be a case where some CCS_DR_IDX is not in the output -- that would happen when the group contains only TXN_IDX values that occurred in prior CCS_DR_IDX groups.
Without seeing the data model and combination reasons (probably some sort of cartesian join) it's hard to make a less vague statement of what is being selected.
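To make the selection rule easier to see, here is a rough sketch of the same logic written with two hash objects instead of the 40000-element arrays. This is my own restatement, not the original author's code: the pairs dataset and its values are made up, and unlike the array version it does not automatically drop rows whose index is missing.

```sas
/* Made-up sample data for illustration */
data pairs;
  input CCS_DR_IDX TXN_IDX DATEDIF;
  datalines;
1 10 0
1 11 2
2 10 1
2 12 3
3 12 0
;

proc sort data=pairs;
  by CCS_DR_IDX DATEDIF;
run;

data kept;
  set pairs;
  if _n_ = 1 then do;
    declare hash seen_dr();            /* CCS_DR_IDX values already output */
    seen_dr.defineKey('CCS_DR_IDX');
    seen_dr.defineDone();
    declare hash seen_txn();           /* TXN_IDX values already output */
    seen_txn.defineKey('TXN_IDX');
    seen_txn.defineDone();
  end;
  /* check() returns 0 when the key is already stored */
  if seen_dr.check() = 0 or seen_txn.check() = 0 then delete;
  /* only rows that survive are remembered, as in the array version */
  seen_dr.add();
  seen_txn.add();
run;
```

With this sample, only the pairs (1,10) and (2,12) are kept. CCS_DR_IDX 3 never appears in the output because its only TXN_IDX was already used, which is the case described above.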

Replacing an entire column with an array - SAS

Is it possible to replace a column with an array?
I have a column from 1....24 and I want to replace it with an array that has 24 distinct elements in it.
How would I go about doing this?
Thanks!
Sounds like you want to transpose the table.
data have;
do i=1 to 24;
output;
end;
run;
proc transpose data=have out=want;
run;
Check out the documentation on PROC TRANSPOSE for more information and options. http://support.sas.com/documentation/cdl/en/proc/70377/HTML/default/viewer.htm#n1xno5xgs39b70n0zydov0owajj8.htm
There are a few other ways beyond proc transpose. I would suggest proc transpose for most use cases, but there are other cases where it's appropriate to do one of these.
The simplest way is to load it into the array using the set statement. This is a common way to set up a temporary array that is used in very similar fashion to a hash table - prior to hash tables being in base SAS, in fact, this was how a lot of that work was done.
Note below the use of do while even though I'm searching for a not condition; you cannot use do until here due to the timing of that statement (try it and see if you can figure out why).
data want;
array ages[19] _temporary_;
do _n_ = 1 by 1 while (not eof); *iterate over `class` and load the array;
set sashelp.class(keep=age) end=eof;
ages[_n_] = age;
end;
call missing(age); *not really needed;
call sortn(of ages[*]); *sort ascending;
set sashelp.class; *now the data step loop pulls one record at a time;
age_rank = whichn(age, of ages[*]); *calculates the rank of age;
run;
Of course don't use _temporary_ if you want to store the variables from that array in the dataset. And remember that array is a one-data-step construct, it never persists; you'd have to redeclare the array each data step you want to use it in, but the variables would already exist obviously.
Finally, if you want fewer rows, use an explicit OUTPUT statement (after reaching the boundary condition) so that only one row per [whatever] is written.
There are other options; you could even load it into a hash table and then unload it into an array, if you had a reason to do that (I can't think of one, but who knows).
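Putting the last few points together for the original question (a column of 24 values turned into 24 array variables on one row), a minimal sketch might look like this. The names have, want_wide and v1-v24 are assumptions for illustration.

```sas
/* Hypothetical input: one column i with 24 rows */
data have;
  do i = 1 to 24;
    output;
  end;
run;

data want_wide(keep=v1-v24);
  array v[24];                        /* not _temporary_, so v1-v24 are stored */
  do _n_ = 1 by 1 while (not eof);    /* read every row inside one iteration */
    set have end=eof;
    v[_n_] = i;
  end;
  output;                             /* write the single wide row */
  stop;                               /* prevent a second pass through the step */
run;
```

This assumes have has at most 24 rows; more rows would cause an array subscript out of range error.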

Creating a vector with unique observations from a variable in Stata

What I am mainly trying to do is to create a variable in which I can assign, within a stratum of my sample (defined by an 'id' variable, for instance), a name that is associated with the highest frequency (in the stratum) of this same name in another (string) variable. If tabulate* would work the way I need it to work, my code would run like this:
gen new_class_within_id=""
forvalues i=1/80 {
tab class_var, matcell(x) if id==`i'
svmat x
sum x2
local name =x1 if x2==r(max)
replace new_class_within_id=`name' if id==`i'
}
That would be the general idea if tabulate would permit storing the unique observation names in a matrix -- the code might have some unintended errors too, of course. But while it does not seem to be possible using the above code, I thought that I could use mkmat if I would be able to store, in the loop, the unique observations inside a vector with some additional coding. Would that be possible? Also, is there an easier way to perform what I want to do?
*Firstly, I thought that using tabulate and extracting the results into a matrix would do the work that I need, but tabulate does not allow me to extract the names of the observations, just the frequencies. tabulate seemed nice because in its output it shows the unique observations of a variable in a column, but I could not find a way to extract those observations the way the output shows.
I think I understand your question, but maybe I don't. Some code:
clear
set more off
input ///
id str1 anothvar
1 a
1 a
1 a
1 b
1 m
2 c
2 c
2 m
2 a
2 z
end
list, sepby(id)
*-----
bysort id anothvar : gen count = _N
bysort id (count): gen newvar = anothvar[_N]
list, sepby(id)
More work needs to be done if you have missings and/or ties.

Array multiplication in Excel

In my excel document I have two sheets. The first is a data set and the second is a matrix of the relationship between two of the variables in my data set. Each possibility of the variable is a column in my matrix. I'm trying to get the sum of the products of the elements in two different arrays. Right now I'm using the formula {=SUM(N3:N20 * F3:F20)} and manually changing the columns each time. But my data set is over 800 items...
Ideally I'd like to know how to write a program that reads the value of the variable in my dataset, looks up the correct columns in the matrix, multiplies them together, sums the products, and puts the result in the correct place in my data set. However, just knowing the result for all the possible combinations of columns would also save me a lot of time. It's an 18x18 matrix. Thanks for any feedback!
Your question is a little ambiguous, but as far as I understand it, you want to multiply different pairs of columns in the same sheet and put the results into the next sheet. Is that right? If so, please post images of your work (all sheets). This is possible in Excel alone, without any VBA code.
You can also use =SUMPRODUCT(N3:N20,F3:F20) for your formula instead of {=SUM(N3:N20 * F3:F20)}

Defining variables in sas to clean up code

I'm new to SAS coming from python, java and C++. From these languages, the proper thing to do when writing/repeating large statements is to encapsulate them in a variable that is defined once and repeated several times in the code.
I.e. instead of writing the same where statement over and over each time two similar datasets are merged, I want to write:
WHERE_CONDITION_VARIABLE = 'X in (10, 100, 1000, 10000 ......100000000);
data output;
merge in1 in2;
WHERE WHERE_CONDITION_VARIABLE;
run;
data output2;
merge in3 in4;
WHERE WHERE_CONDITION_VARIABLE;
run;
Unfortunately, I haven't been able to figure out how to define a variable such as WHERE_CONDITION_VARIABLE to streamline the code. Is what I'm asking possible to do in SAS?
You can use macro variables.
You define them like this:
%let WHERE_CONDITION_VARIABLE = X in (10, 100, 1000);
And reference them like this:
&WHERE_CONDITION_VARIABLE
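Applied to the merge from the question, that could look roughly like the sketch below. The contents of in1 and in2, the id key, and the BY statement are assumptions of mine so that the example runs; the question does not show them.

```sas
%let WHERE_CONDITION_VARIABLE = X in (10, 100, 1000);

/* Hypothetical inputs; both contain X because WHERE applies to every input */
data in1;
  input id X a;
  datalines;
1 10 1
2 20 2
3 100 3
;

data in2;
  input id X b;
  datalines;
1 10 7
2 20 8
3 100 9
;

data output;
  merge in1 in2;
  by id;                              /* assumed key */
  where &WHERE_CONDITION_VARIABLE;    /* resolves to: where X in (10, 100, 1000); */
run;
```

The macro variable is replaced by its text before the data step is compiled, so the step behaves exactly as if the condition had been typed in place.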
SAS has a lot of options for avoiding repeating code; in that way it's actually a lot like python, although the method for accomplishing it is a little different as you do have a separate compilation step (so you can't just say WHERE like you ask directly).
First off, you have the macro variable. If you're just repeating text several times, you can define it in a macro variable, like so:
%let condition=X in (1,10,100,1000);
Macro variables are treated as if they were text you had written. They do not need quotation marks or other text qualifiers, unless they are intended to contain them as legal code, i.e.:
%let condition=X in ('A','B','C');
would be legal, but
%let condition="X in ('A','B','C')";
would probably not be what you want (unless you want that to be evaluated as a string, anyhow).
Through macro variables, you also have the ability to generate larger amounts of code in a datastep and then include it. For example, if you have a dataset containing a list of conditions, you could apply them this way:
data conditions;
format condition $50.;
input condition $50.; *read the whole line, not just the first word;
datalines4;
if x = 15 then y=5;
if x = 20 then y=10;
if x = 20 and z = 5 then y=15;
if x = 20 and z = 10 then y=20;
;;;;
run;
proc sql;
select condition into :condlist separated by ' ' from conditions;
quit;
data want;
set have;
&condlist;
run;
That would take the conditions from "conditions" dataset and push it into a macro variable "&condlist". The PROC SQL call is the easiest way to get it into a macro variable, but there are others; CALL SYMPUT also can do it in a data step, or you can write it to a text file and then %include the text file as code as well. This is more commonly used in advanced programming by generating calls to a macro, with the conditions dataset providing the macro parameters; in this case you might have a macro
%macro cond(x=,y=,z=);
if x=&x and z=&z then y=&y;
%mend cond;
Then you could generate calls to cond from a dataset with just x,y,z values:
proc sql;
select cats('%cond(x=',x,',y=',y,',z=',z,')') into :condlist separated by ' ' from conditions;
quit;
and use it in the same way.
Macro programming in general is a good solution for avoiding code creep; a macro is written once and then can be run multiple times with different parameters. A macro can be anywhere from one line of code (like above) executed inside a data step, to hundreds of lines containing multiple DATA and PROC steps. Macro programming is a complex topic in and of itself, and worth reading more on.
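As a small illustration of a one-line macro executed inside a data step, here is a sketch that calls cond directly rather than generating the calls from a dataset. The have dataset and its values are made up.

```sas
%macro cond(x=,y=,z=);
  if x=&x and z=&z then y=&y;
%mend cond;

/* Made-up input data */
data have;
  input x z;
  datalines;
20 5
20 10
15 1
;

data want;
  set have;
  /* each call expands to one IF statement at compile time */
  %cond(x=20, y=15, z=5)
  %cond(x=20, y=20, z=10)
run;
```

Here y should end up as 15 for the first row, 20 for the second, and missing for the third, since no condition matches it.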
You can also write a function in SAS. PROC FCMP (function compile) allows you to write fairly complex functions and execute them in your data step or even your PROC statements. http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu07.pdf is a good place to start with FCMP if you have 9.2; if you have 9.3, I haven't seen any papers yet (but there may be some out there) showing the newer things in FCMP. FCMP is fairly new so there are still a lot of changes in each iteration of SAS.
Here's an example of FCMP to do your condition:
proc fcmp
outlib=work.funcs.Test; /* where will the functions be saved */
function condition(x); /* declare a function returning a number */
if x in (1,10,100,1000) then return(1);
else return(0);
endsub;
quit;
data have;
do x = 1,5,10,20,100,150,1000,1500;
output;
end;
run;
options cmplib=work.funcs;
data want;
set have;
if condition(x) then output;
run;
You also have the CALL EXECUTE statement, which allows you to directly execute code from a dataset. Using the same CONDITIONS dataset:
data _null_;
set conditions end=eof;
if _n_ = 1 then call execute('data want; set have;');
call execute(condition);
if eof then call execute('run;');
run;
That would effectively construct a data step that executes immediately following your data _null_ step with the same code as in the macro variable example. Call execute works a little differently, so while in this example there shouldn't be any difference, there are a few issues with timing that can cause problems (or can be advantageous); which you use depends on the circumstance. Particularly for CALL EXECUTE, read up on the documentation and online papers (SUGI papers most commonly) to find out more details.
In addition to directly executing code via macro variables or CALL EXECUTE, you have a lot of other ways of performing tasks to avoid wallpaper code. For example, to more easily perform the if statements above, you might be able to use a format. Formats convert one value to another value; most commonly you might have something like 'DOLLAR6.2' which would give you $3.50 from the number 3.5. However, formats can also be used to replace if-this-then-that expressions. If there were only X and Y (and no Z conditions), then you could do this, given this conditions dataset:
data conditions;
input x y;
datalines;
1 5
2 10
3 20
4 50
5 100
;;;;
run;
data for_fmt;
set conditions;
rename x=start y=label;
fmtname='XTOY';
type='i'; *type=i means numeric informat, so numeric to numeric conversion. Informat = to numeric, Format= to character.;
run;
proc format cntlin=for_fmt;
quit;
data want;
set have;
y = input(x,XTOY.);
run;
There you have one line of code converting x to y. (Of course there is a bit of code setting up the format, but it can be separated from the main code, and included in the set-up portion of your code, like a .h file in C).
You also have hash table lookups, which are really helpful when you have more complex conversions - either 1 to many or many to 1. They work just like they sound - you load the hash table into memory and perform lookups. http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf is one good place to start.
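For comparison with the format approach, a minimal hash lookup for the same x-to-y conversion might look like the following. It is a sketch under my own assumptions: the have dataset is made up, and unmatched rows are simply given a missing y.

```sas
data conditions;
  input x y;
  datalines;
1 5
2 10
3 20
4 50
5 100
;

/* Made-up data to look up; x = 6 has no match */
data have;
  do x = 1 to 6;
    output;
  end;
run;

data want;
  if _n_ = 1 then do;
    if 0 then set conditions;              /* define x and y without reading a row */
    declare hash h(dataset: 'conditions'); /* load the lookup table into memory */
    h.defineKey('x');
    h.defineData('y');
    h.defineDone();
  end;
  set have;
  if h.find() ne 0 then call missing(y);   /* clear y when x is not in the table */
run;
```

The call missing matters because y would otherwise keep the value from the previous successful lookup.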
Finally, one good way to avoid repeating code is to use fewer separate datasets. SAS data steps and procedures have the "BY" statement available, which means they treat each different value of the BY variable(s) as effectively a separate dataset. The variable names and lengths need to match, as it is still technically one dataset, but if you have many datasets of similar data, and want to perform the same action to each, you can perform them once with a BY statement rather than multiple times.
For example, say you had the dataset SASHELP.CARS. You might want to calculate something separately for each make of car. You could either do:
data acura;
set sashelp.cars;
if make='Acura'; *values in sashelp.cars are mixed case;
run;
data honda;
set sashelp.cars;
if make='Honda';
run;
And then run your code on each dataset separately. However, a more SASsy way to do it is to use the BY statement:
proc means data=sashelp.cars;
by make;
var mpg_city mpg_highway;
run;
Now you get a separate page for each make. You can use the BY statement in data step processing as well; you get variables FIRST.make and LAST.make which tell you if you're on the first record of a new MAKE or the last record of a MAKE (the record just before a change in value), which allow you to do things based on where you are in a dataset's BY group (for example, if first.make then counter=0; would allow you to have a counter that is reset each time you have a new value in make. ) The only caveat for BY groups is you have to sort your dataset by the BY variable prior to using it (or have an index on that variable, or both). This is really helpful for analysis of bootstrap samples or other processes where you have many nearly-identical datasets and perform identical actions on them.
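A minimal sketch of the FIRST./LAST. counter described above, counting cars per make. The dataset names cars_sorted and counts are my own; the explicit sort is there only so the BY statement is safe regardless of how the input is ordered.

```sas
proc sort data=sashelp.cars out=cars_sorted;
  by make;
run;

data counts(keep=make counter);
  set cars_sorted;
  by make;
  if first.make then counter = 0;  /* reset at the start of each make */
  counter + 1;                     /* sum statement: counter is retained across rows */
  if last.make then output;        /* one row per make, holding the final count */
run;
```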
I am assuming you want to collect all the WHERE condition values in one place and then pick them out by position, like an indexed list in Python.
If that's the case, then you may want to have a look at the INTO clause of the PROC SQL SELECT statement.
With INTO you can store all of your X values in one or more macro variables.
You can then reference them whenever you need them.
