I am looking through some code and wondering what this does. Below are the code comments. I'm still not sure what this code does even with the code comments. I have used arrays but not familiar with this code. It looks like this code dedupes by using two indices. Is that correct? So if there is a combination of CCS_DR_IDX and TXN_IDX, it will delete those records?
Now handle cases where the dollar matches. If ccs_dr_idx has already been used then delete the record. Dropped txns here will be added back in with the claim data called missing.
PROC SORT DATA=OUT.REQ_1_9_F_AMT_MATCH; BY CCS_DR_IDX DATEDIF; RUN;
DATA OUT.REQ_1_9_F_AMT_MATCH_V2;
SET OUT.REQ_1_9_F_AMT_MATCH;
ARRAY id_one{40000} id_one1-id_one40000;
ARRAY id_two{40000} id_two1-id_two40000;
RETAIN id_one1-id_one40000 id_two1-id_two40000;
IF _n_=1 then i=1;
else i+1;
do j=1 to i;
if CCS_DR_IDX=id_one{j} then delete;
end;
do k = 1 to i;
if TXN_IDX = id_two{k} then delete;
end;
id_one{i}=CCS_DR_IDX;
id_two{i}=TXN_IDX;
drop i j k id_one1-id_one40000 id_two1-id_two40000;
run;
The sort is
BY CCS_DR_IDX DATEDIF;
The filtering or selecting occurs when control reaches the bottom of the data step and implicitly OUTPUTs. That occurs only if CCS_DR_IDX and TXN_IDX is a combination where neither has appeared previously.
Since you have sorted by CCS_DR_IDX you can know there is implicit grouping and there will be at most one record per CCS_DR_IDX output, and for the first group it must be the first record in the group. Each successive row in a CCS_DR_IDX group, post output, will match an entry in id_one and be tossed away by DELETE.
When you start processing the next CCS_DR_IDX group the rows will be processed until you reach the next distinct TXN_ID with respect to those tracked in id_two. Because the sort had a second key DATDIF you can say the output is "a selection of the first occurring combinations of unique pair items CCS_DR_IDX TXN_ID" (somewhat akin to pair-sampling without repeats.)
There could be a case where some CCS_DR_IDX is not in the output -- that would happen when the group contains only TX_IDs that occurred in prior CCS_DR_IDXs.
Without seeing the data model and combination reasons (probably some sort of cartesian join) it's hard to make a less vague statement of what is being selected.
Is it possible to replace a column with an array?
I have a column from 1....24 and I want to replace it with an array that has 24 distinct elements in it.
How would I go about doing this?
Thanks!
Sounds like you want to transpose the table.
data have;
do i=1 to 24;
output;
end;
run;
proc transpose data=have out=want;
run;
Check out the documentation on PROC TRANSPOSE for more information and options. http://support.sas.com/documentation/cdl/en/proc/70377/HTML/default/viewer.htm#n1xno5xgs39b70n0zydov0owajj8.htm
There are a few other ways beyond proc transpose; I would suggest that for most use cases, but there are other cases where it's appropriate to do one of these.
The simplest way is to load it into the array using the set statement. This is a common way to set up a temporary array that is used in very similar fashion to a hash table - prior to hash tables being in base SAS, in fact, this was how a lot of that work was done.
Note below the use of do while even though I'm searching for a not condition; you cannot use do until here due to the timing of that statement (try it and see if you can figure out why).
data want;
array ages[19] _temporary_;
do _n_ = 1 by 1 while (not eof); *iterate over `class` and load the array;
set sashelp.class(keep=age) end=eof;
ages[_n_] = age;
end;
call missing(age); *not really needed;
call sortn(of ages[*]); *sort ascending;
set sashelp.class; *now the data step loop pulls one record at a time;
age_rank = whichn(age, of ages[*]); *calculates the rank of age;
run;
Of course don't use _temporary_ if you want to store the variables from that array in the dataset. And remember that array is a one-data-step construct, it never persists; you'd have to redeclare the array each data step you want to use it in, but the variables would already exist obviously.
Finally, if you want fewer rows, output selectively (after reaching the boundary condition) would be used to output only one row per [whatever].
There are other options; you could even load it into a hash table and then unload it into an array, if you had a reason to do that (I can't think of one, but who knows).
What I am mainly trying to do is to create a variable in which I can assign, within a stratum of my sample (defined by an 'id' variable, for instance), a name that is associated with the highest frequency (in the stratum) of this same name in another (string) variable. If tabulate* would work the way I need it to work, my code would run like this:
gen new_class_within_id=""
forvarlues i=1/80 {
tab class_var, matcell(x) if id==`i'
svmat x
sum x2
local name =x1 if x2==r(max)
replace new_class_within_id=`name' if id==`i'
}
That would be the general idea if tabulate would permit storing the unique observation names in a matrix -- the code might have some unintended errors too, of course. But while it does not seem to be possible using the above code, I thought that I could use mkmat if I would be able to store, in the loop, the unique observations inside a vector with some additional coding. Would that be possible? Also, is there an easier way to perform what I want to do?
*Firstly, I thought that using tabulate and extracting the results into a matrix would do the work that I need, but tabulate does not allow me to extract the names of the observations, just the frequencies. tabulate seemed nice because in its output it shows the unique observations of a variable in a column, but I could not find a way to extract those observations the way the output shows.
I think I understand your question, but maybe I don't. Some code:
clear
set more off
input ///
id str1 anothvar
1 a
1 a
1 a
1 b
1 m
2 c
2 c
2 m
2 a
2 z
end
list, sepby(id)
*-----
bysort id anothvar : gen count = _N
bysort id (count): gen newvar = anothvar[_N]
list, sepby(id)
More work needs to be done if you have missings and/or ties.
In my excel document I have two sheets. The first is a data set and the second is a matrix of the relationship between two of the variables in my data set. Each possibility of the variable is a column in my matrix. I'm trying to get the sum of the products of the elements in two different arrays. Right now I'm using the formula {=SUM(N3:N20 * F3:F20)} and manually changing the columns each time. But my data set is over 800 items...
Ideally I'd like to know how to write a program that reads the value of the variable in my dataset looks up the correct columns in the matrix, multiplies them together, sums the products, and puts the result in the correct place in my data set. However, just knowing the result for all the possible combinations of columns would also save me alot of time. Its an 18x18 matrix. Thanks for any feedback!
Your question is a little bit ambiguous but as far as i understand your question you want to multiply different sets of two columns in the same sheet and put their result into the next sheet, is it so? if so, please post images of your work (all sheets). Your answer is possible even in Excel only without any vba code, thanks.
you can also use =SUMPRODUCT(N3:N20,F3:F20) for your formula instead of {=SUM(N3:N20 * F3:F20)}
I'm new to SAS coming from python, java and C++. From these languages, the proper thing to do when writing/repeating large statements is to encapsulate them in a variable that is defined once and repeated several times in the code.
I.e. instead of writing the same where statement over and over each time two similar datasets are merged, I want to write:
WHERE_CONDITION_VARIABLE = 'X in (10, 100, 1000, 10000 ......100000000);
data output;
merge in1 in2;
WHERE WHERE_CONDITION_VARIABLE;
run;
data output2;
merge in3 in4;
WHERE WHERE_CONDITION_VARIABLE;
run;
Unfortunately, I haven't been able to figure out how to define a variable such as WHERE_CONDITION_VARIABLE to streamline the code. Is what I'm asking possible to do in SAS?
You can use macro variables.
You define them like this:
%let WHERE_CONDITION_VARIABLE = X in (10, 100, 1000);
And reference them like this:
&WHERE_CONDITION_VARIABLE
SAS has a lot of options for avoiding repeating code; in that way it's actually a lot like python, although the method for accomplishing it is a little different as you do have a separate compilation step (so you can't just say WHERE like you ask directly).
First off, you have the macro variable. If you're just repeating text several times, you can define it in a macro variable, like so:
%let condition=X in (1,10,100,1000);
Macro variables are treated as if they were text you had written. They do not need quotation marks or other text qualifiers, unless they are intended to contain them as legal code, ie:
%let condition=X in ('A','B','C');
would be legal, but
%let condition="X in ('A','B','C')";
would probably not be what you want (unless you want that to be evaluated as a string, anyhow).
Through macro variables, you also have the ability to generate larger amounts of code in a datastep and then include it. For example, if you have a dataset containing a list of conditions, you could apply them this way:
data conditions;
format condition $50.;
input condition $;
datalines4;
if x = 15 then y=5;
if x = 20 then y=10;
if x = 20 and z = 5 then y=15;
if x = 20 and z = 10 then y=20;
;;;;
run;
proc sql;
select condition into :condlist separated by ' ' from conditions;
quit;
data want;
set have;
&condlist;
run;
That would take the conditions from "conditions" dataset and push it into a macro variable "&condlist". The PROC SQL call is the easiest way to get it into a macro variable, but there are others; CALL SYMPUT also can do it in a data step, or you can write it to a text file and then %include the text file as code as well. This is more commonly used in advanced programming by generating calls to a macro, with the conditions dataset providing the macro parameters; in this case you might have a macro
%macro cond(x=,y=,z=);
if x=&x and z=&z then y=&y;
%mend cond;
Then you could generate calls to cond from a dataset with just x,y,z values:
proc sql;
select cats('%cond(x=',x,',y=',y,',z=',z,')') into :condlist separated by ' ' from conditions;
quit;
and use it in the same way.
Macro programming in general is a good solution for avoiding code creep; a macro is written once and then can be run multiple times with different parameters. A macro can be anywhere from one line of code (like above) executed inside a data step, to hundreds of lines containing multiple DATA and PROC steps. Macro programming is a complex topic in and of itself, and worth reading more on.
You can also write a function in SAS. PROC FCMP (function compile) allows you to write fairly complex functions and execute them in your data step or even your PROC statements. http://www.lexjansen.com/pharmasug/2011/tu/pharmasug-2011-tu07.pdf is a good place to start with FCMP if you have 9.2; if you have 9.3, I haven't seen any papers yet (but there may be some out there) showing the newer things in FCMP. FCMP is fairly new so there are still a lot of changes in each iteration of SAS.
Here's an example of FCMP to do your condition:
proc fcmp
outlib=work.funcs.Test; /* where will the functions be saved */
function condition(x); /* declare a function returning a number */
if x in (1,10,100,1000) then return(1);
else return(0);
endsub;
quit;
data have;
do x = 1,5,10,20,100,150,1000,1500;
output;
end;
run;
options cmplib=work.funcs;
data want;
set have;
if condition(x) then output;
run;
You also have the CALL EXECUTE statement, which allows you to directly execute code from a dataset. Using the same CONDITIONS dataset:
data _null_;
set conditions end=eof;
if _n_ = 1 then call execute('data want; set have;');
call execute(condition);
if eof then call execute('run;');
run;
That would effectively construct a data step that executes immediately following your data null step with the same code as in the macro variable example. Call execute works a little differently, so while in this example there shouldn't be any difference, there are a few issues with timing that can cause problems (or can be advantageous); which you use depends on the circumstance. Particularly for CALL EXECUTE, read up on the documentation and online papers (SUGI papers most commonly) to find out more details.
In addition to directly executing code via macro variables or CALL EXECUTE, you have a lot of other ways of performing tasks to avoid wallpaper code. For example, to more easily perform the if statements above, you might be able to use a format. Formats convert one value to another value; most commonly you might have something like 'DOLLAR6.2' which would give you $3.50 from the number 3.5. However, formats can also be used to replace if-this-then-that expressions. If there were only X and Y (and no Z conditions), then you could do this, given this conditions dataset:
data conditions;
input x y;
datalines;
1 5
2 10
3 20
4 50
5 100
;;;;
run;
data for_fmt;
set conditions;
rename x=start y=label;
fmtname='XTOY';
type='i'; *type=i means numeric informat, so numeric to numeric conversion. Informat = to numeric, Format= to character.;
run;
proc format cntlin=for_fmt;
quit;
data want;
set have;
y = input(x,XTOY.);
run;
There you have one line of code converting x to y. (Of course there is a bit of code setting up the format, but it can be separated from the main code, and included in the set-up portion of your code, like a .h file in c).
You also have hash table lookups, which are really helpful when you have more complex conversions - either 1 to many or many to 1. They work just like they sound - you load the hash table into memory and perform lookups. http://support.sas.com/rnd/base/datastep/dot/hash-getting-started.pdf is one good place to start.
Finally, one good way to avoid repeating code is to use fewer separate datasets. SAS data steps and procedures have the "BY" statement available, which means they treat each different value of the BY variable(s) as effectively a separate dataset. The variable names and lengths need to match, as it is still technically one dataset, but if you have many datasets of similar data, and want to perform the same action to each, you can perform them once with a BY statement rather than multiple times.
For example, say you had the dataset SASHELP.CARS. You might want to calculate something separately for each make of car. You could either do:
data acura;
set sashelp.cars;
if make='ACURA';
run;
data honda;
set sashelp.cars;
if make='HONDA';
run;
And then run your code on each dataset separately. However, a more SASsy way to do it is to use the BY statement:
proc means data=sashelp.cars;
by make;
var mpg_city mpg_highway;
run;
Now you get a separate page for each make. You can use the BY statement in data step processing as well; you get variables FIRST.make and LAST.make which tell you if you're on the first record of a new MAKE or the last record of a MAKE (the record just before a change in value), which allow you to do things based on where you are in a dataset's BY group (for example, if first.make then counter=0; would allow you to have a counter that is reset each time you have a new value in make. ) The only caveat for BY groups is you have to sort your dataset by the BY variable prior to using it (or have an index on that variable, or both). This is really helpful for analysis of bootstrap samples or other processes where you have many nearly-identical datasets and perform identical actions on them.
I am assuming you want to put all the WHERE Conditions variables to be put in a bucket and then utilizing them based on index like structure (Python).
If that's the case then you may want to have a look at "INTO".
In "INTO" you will drop all of your X's.
And then you can take them whenever you want.