Multiple strings to integers in SAS - arrays

I have a series of string variables (x, y, z) whose observations I need to change from strings (x= less than 1 mile, more than 1 mile less then 5 miles, etc.) to integers (xrecode= 1, 2, etc).
Is there any automated way to do this? I need a method that gets away from writing "this value equals 1, that value equals 2, ..." by hand (DO loops, arrays, macros welcome).

You can use an INFORMAT to convert from text to integer.
proc format ;
invalue distance
'less than 1 mile'=1
'more than 1 mile'=2
'less then 5 miles'=3
;
quit;
You can apply the same operation to multiple similar columns by looping over an ARRAY.
data want ;
set have ;
array in x y z ;
array out nx ny nz ;
do i=1 to dim(in);
out(i)=input(in(i),distance.);
end;
run;
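As a hedged, self-contained sketch of the informat-plus-array approach above: the sample rows, the array names src and dst, and the upcase / other= handling are assumptions added for illustration, not part of the original answer. The upcase option upper-cases the incoming text before matching, so the labels are written in upper case, and other=. maps unrecognised text to missing.

```sas
/* Hypothetical sample data; the values are illustrative only */
data have;
  infile datalines dsd truncover;
  input x :$40. y :$40. z :$40.;
  datalines;
less than 1 mile,more than 1 mile,less then 5 miles
Less Than 1 Mile,unknown,more than 1 mile
;

proc format;
  /* UPCASE converts the input text to upper case before it is compared */
  invalue distance (upcase)
    'LESS THAN 1 MILE'  = 1
    'MORE THAN 1 MILE'  = 2
    'LESS THEN 5 MILES' = 3
    other               = .;
run;

data want;
  set have;
  array src {*} x y z;        /* character inputs */
  array dst {*} nx ny nz;     /* numeric outputs, created by this statement */
  do i = 1 to dim(src);
    dst{i} = input(src{i}, distance.);
  end;
  drop i;
run;
```

Adding another column then only means extending the two array lists; the informat itself stays unchanged.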

Reeza is right. Proc format for instance:
proc format;
value ToForm
low-1 = 'less than one'
1-5 = 'one to five'
5-high = 'over five'
;quit;
data wanted;
set begin;
format val_to_format ToForm.;
run;
For more on proc format, see the SAS documentation for PROC FORMAT.

Related

Reset a temporary array in SAS

After I declare an array, I'd like to reset its values for the rest of the code.
array cutoffs[4] _temporary_ (1 2 3 4); /*works well*/
... use of the array
array cutoffs[3] _temporary_ (3.5 5 7.5); /*Error*/
... use of the updated array
The error is as follows:
ERROR 124-185: The variable cutoffs has already been defined.
This error is very clear, but I wonder how I could redefine the array without changing its name (which would be tedious).
I tried several syntaxes but couldn't work it out myself, and I found no resources on Google or Stack Overflow.
How can I do it?
EDIT: the main purpose is that I created a function (with proc fcmp) that takes arrays as parameters and cuts the value (like R's cut function). The function is to be used on a lot of columns but with different cutoffs, and I don't want to tediously create an array for each and every column.
Here is a macro version of your FCMP function:
%macro cut2string(var,cutoffs,values);
%local i;
%if &var. lt %scan(&cutoffs.,1,%str( )) %then "%scan(&values.,1,%str( ))";
%else %if &var. ge %scan(&cutoffs.,-1,%str( )) %then "%scan(&values.,-1,%str( ))";
%else %do i=1 %to %sysfunc(countw(&cutoffs.,%str( )));
%if &var. ge %scan(&cutoffs.,&i.,%str( )) and &var. lt %scan(&cutoffs.,%eval(&i.+1),%str( )) %then "%scan(&values.,%eval(&i.+1),%str( ))";
%end;
%mend;
And here is how you would call it, using the same example as you used in your linked page:
data Work.nonsales2;
/*set Work.nonsales;*/
salary_string = %cut2string(30000, 20000 100000 500000, <20k 20k-100k 100k-500k >500k);
run;
You could use keyword parameter instead of positional to make your calls clearer:
%macro cut2string(var=,cutoffs=,values=);
...
salary_string = %cut2string(var=30000,cutoffs=20000 100000 500000,values=<20k 20k-100k 100k-500k >500k);
HOWEVER now that I see the code, this should really be a format in SAS:
proc format;
value cutoffs
low-<20000='<20k'
20000-<100000='20k-100k'
100000-<500000='100k-500k'
500000-high='>500k'
;
run;
data work.nonsales2;
salarystrings=put(30000,cutoffs.);
run;
You can change the values of the cutoffs array one by one.
array cutoffs{4} _temporary_ (1 2 3 4); /*works well*/
... use of the array
cutoffs[1]=3.5;
cutoffs{2}=5;
cutoffs{3}=7.5;
cutoffs{4}=.;
or you could just use another name for the array the second time.
With that said, the way you are using this seems a bit strange.
EDIT: you could consider rewriting your proc fcmp function to expect the list of values as a character string (e.g. '3.5,5,7.5') instead of an array and do away with arrays entirely.
Your proc fcmp would change from something like
do i=1 to dim(array);
val=array{i};
...
end;
to something like:
do i=1 to countw(array,',');
val=input(scan(array,i,','),best32.);
...
end;
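A minimal sketch of what such a string-based function could look like, assuming a hypothetical function name cut_idx and a hypothetical package work.funcs.demo (neither comes from the original post). It returns the 1-based bin index of a value given a comma-separated list of ascending cutoffs, so no array has to be declared per column.

```sas
proc fcmp outlib=work.funcs.demo;
  /* cutoffs is a comma-separated list of ascending numbers, e.g. '3.5,5,7.5' */
  function cut_idx(value, cutoffs $);
    n = countw(cutoffs, ',');
    do i = 1 to n;
      /* the first cutoff that exceeds the value decides the bin */
      if value < input(scan(cutoffs, i, ','), best32.) then return (i);
    end;
    return (n + 1);   /* value is at or above the last cutoff */
  endsub;
run;

options cmplib=work.funcs;

data _null_;
  do v = 1, 4, 6, 9;
    bin = cut_idx(v, '3.5,5,7.5');
    put v= bin=;
  end;
run;
```

With these cutoffs the values 1, 4, 6 and 9 should land in bins 1, 2, 3 and 4 respectively. Each call can pass a different cutoff string, which is the point of dropping the array parameter.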
Why not use a macro instead of function?
%macro cut(invar,outvar,cutoffs,categories,dlm=%str( ));
/*
"CUT" continuous variable into categories
by generating SELECT code that can be used in
a data step.
The list of CATEGORIES must have one more entry than the list of CUTOFFS
*/
%local i ;
select ;
%do i=1 %to %sysfunc(countw(&cutoffs,&dlm));
when (&invar <= %scan(&cutoffs,&i,&dlm)) &outvar=%scan(&categories,&i,&dlm) ;
%end;
otherwise &outvar= %scan(&categories,-1,&dlm);
end;
%mend ;
Here is an example that creates both a numeric and a character output variable. For character variables, either define the variable before using the macro or make sure the value for the first category is long enough for all values.
Let's test it.
data test ;
input x @@;
%cut(invar=x,outvar=y,cutoffs=3.5 5 7,categories=1 2 3 4)
%cut(invar=x,outvar=z,cutoffs=3.5|5|7,categories="One "|"Two"|"Three"|"Four",dlm=|)
cards;
2 3.5 4 5 6 7.4 8
;
If you turn on the MPRINT option you can see the generated code in the SAS log.
2275 %cut(invar=x,outvar=y,cutoffs=3.5 5 7,categories=1 2 3 4)
MPRINT(CUT): select ;
MPRINT(CUT): when (x <= 3.5) y=1 ;
MPRINT(CUT): when (x <= 5) y=2 ;
MPRINT(CUT): when (x <= 7) y=3 ;
MPRINT(CUT): otherwise y= 4;
MPRINT(CUT): end;
2276 %cut(invar=x,outvar=z,cutoffs=3.5|5|7,categories="One "|"Two "|"Three"|"Four ",dlm=|)
MPRINT(CUT): select ;
MPRINT(CUT): when (x <= 3.5) z="One " ;
MPRINT(CUT): when (x <= 5) z="Two" ;
MPRINT(CUT): when (x <= 7) z="Three" ;
MPRINT(CUT): otherwise z= "Four";
MPRINT(CUT): end;
Results
Obs x y z
1 2.0 1 One
2 3.5 1 One
3 4.0 2 Two
4 5.0 2 Two
5 6.0 3 Three
6 7.4 4 Four
7 8.0 4 Four

SAS how to create variable with corresponding values efficiently

I am trying to complete the following.
Variable Letter has three values (a, b, c). I would like to create a variable Letter_2 with values corresponding to the values of Letter, namely (1, 2, 3).
I know I can do this using three IF Then statements.
if Letter='a' then Letter_2='1';
if Letter='b' then Letter_2='2';
if Letter='c' then Letter_2='3';
Suppose I have 15 values for the variable Letter, and 15 corresponding values for the replacement. Is there a way to do it efficiently without typing the same If Then statement 15 times?
I am new to SAS. Any clue will be appreciated.
Lisa
Looks like an application for a FORMAT.
First define the format.
proc format ;
value $lookup 'a'='1' 'b'='2' 'c'='3' ;
run;
Then use it to re-code your variable.
data want;
set have;
letter_2 = put(letter,$lookup.);
run;
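With 15 or more pairs, the format does not have to be typed out by hand. As a sketch, assuming the pairs live in a dataset (here a hypothetical lookup_pairs, with made-up sample rows), PROC FORMAT can build the same $lookup format from it through the CNTLIN= option. The control dataset needs the variables fmtname, start and label; type 'C' marks a character format.

```sas
/* Hypothetical lookup table: one row per (old value, new value) pair */
data lookup_pairs;
  retain fmtname 'lookup' type 'C';
  input start $ label $;
  datalines;
a 1
b 2
c 3
;

/* Build the $lookup format from the dataset instead of a VALUE statement */
proc format cntlin=lookup_pairs;
run;

/* Hypothetical input data */
data have;
  input letter $;
  datalines;
a
c
b
;

data want;
  set have;
  letter_2 = put(letter, $lookup.);
run;
```

Adding a sixteenth pair then means adding one row to the lookup table and rerunning PROC FORMAT.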
Or perhaps you could use two temporary arrays and the WHICHC() function?
data have;
input letter $10. ;
cards;
car
apple
box
;;;;
data want ;
set have ;
array from (3) $10 _temporary_ ('apple','box','car');
array to (3) $10 _temporary_ ('first','second','third');
if whichc(letter,of from(*)) then
letter_2 = to(whichc(letter,of from(*)))
;
run;

Get rid of kth smallest and largest values of a dataset in SAS

I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest values of foo (later I would do a similar thing with bar) before proceeding, but I'm unsure of the most effective way to do this. I know there are functions smallest and largest, but I don't understand how I can use them to get the smallest 4 or largest 4 from an existing dataset. Alternatively, I could just remove the min and max 4 times, but that sounds needlessly tedious and time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
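data class_final;
/* take nobs from the unfiltered input: class_ranks has already lost rows, but the ranks still run up to the original count */
if 0 then set sashelp.class nobs=nobs;
set class_ranks;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;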
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
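One way to avoid needing the observation count at all is to rank twice, once ascending and once descending. This is a sketch under assumptions: the dataset and variable names (ranked_low, ranked_both, height_lo, and so on) are made up for illustration, and ties=low is one possible tie rule, under which tied values share the lowest rank and are therefore kept or dropped together.

```sas
/* Ascending ranks: the smallest value gets rank 1 */
proc rank data=sashelp.class out=ranked_low ties=low;
  var height weight;
  ranks height_lo weight_lo;
run;

/* Descending ranks: the largest value gets rank 1 */
proc rank data=ranked_low out=ranked_both ties=low descending;
  var height weight;
  ranks height_hi weight_hi;
run;

/* Keep rows that are outside the bottom 4 and the top 4 on both variables */
data class_trimmed;
  set ranked_both;
  if height_lo > 4 and weight_lo > 4
     and height_hi > 4 and weight_hi > 4;
run;
```

Because of the tie rule, more than four rows can be dropped at either end when values repeat near the boundary.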
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro var (includes null values as distinct).
In your data step you can use first.foo and the macro var to selectively output only those observations that are not among the smallest or largest 4 values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!

Count the number of times a value occurs

I have 7 variables, 489 observations with variable values of 0-4.
What I need is the count percentage of use.
Answers 0,1 stand for non usage, and answers 2,3,4 stand for usage.
I created 7 additional vars and turned all the values above to:
1=usage - 0=non-usage.
Now, I don't know how to count and present how many "1" I have for each var and divide it by 489.
data LAB7;
set LAB3;
array v{*} v21-v27;
array VU{7};
DO i=1 to dim(v);
if v[i] in (0, 1) THEN VU[i]=0;
else VU[i]=1;
END;
run;
You can do this:
data usage;
set lab3 end=eof;
array v{*} v21-v27;
array n{7};
retain n: 0;
do i = 1 to dim(v);
if v[i] in (2, 3, 4) then n[i] + 1;
end;
if eof then do j = 1 to dim(v);
variable = vname(v[j]);
pct_usage = 100 * n[j] / _n_;
output;
end;
keep variable pct_usage;
run;
This creates an array of counters, one per variable, that are incremented by one whenever the corresponding variable is equal to 2, 3, or 4.
At the end of the data step, we output a record for each variable and record the percentage as the counter divided by the number of observations (_n_ when eof is true).
An alternative would be to use proc freq.
data indicators;
set lab3;
array v{*} v21-v27;
array ind{7};
do i = 1 to dim(v);
ind[i] = (v[i] in (2, 3, 4));
end;
run;
proc freq data = indicators;
tables ind: / out = usage;
run;
This creates binary indicator variables, one for each of the input variables, that are 1 when the input is 2, 3, or 4, and 0 otherwise. Counts and percentages are then obtained using proc freq.
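Since the mean of a 0/1 indicator is the proportion of ones, PROC MEANS can also report the usage share for all seven variables in one table. A small sketch, where the lab3 rows below are invented stand-in data and not the asker's 489 observations:

```sas
/* Hypothetical stand-in for lab3 */
data lab3;
  input v21-v27;
  datalines;
0 1 2 3 4 0 1
2 2 0 1 4 3 0
1 0 0 4 2 2 3
4 3 1 0 0 1 2
;

data indicators;
  set lab3;
  array v{*} v21-v27;
  array ind{7};
  do i = 1 to dim(v);
    ind[i] = (v[i] in (2, 3, 4));   /* 1 = usage, 0 = non-usage */
  end;
  drop i;
run;

/* N is the number of non-missing rows; MEAN is the share of rows flagged as usage */
proc means data=indicators n mean maxdec=3;
  var ind1-ind7;
run;
```

Multiplying the reported mean by 100 gives the usage percentage for each variable.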

Checking correct order of values

I have a data set that looks similar to the one below. Basically, I have current prices for three different sizes of an item type. If the sizes are priced correctly (ie small<medium<large) I want to flag them with a “Y” and continue to use the current price. If they are not priced correctly, I want to flag them with a “N” and use the recommended price. I know that this is probably a time to use array programming, but my array skills are, admittedly, a bit weak. There are hundreds of locations, but only one item type. I currently have the unique locations loaded in a macro variable list.
data have;
input type $ location $ size $ cur_price rec_price;
cards;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
x DC S 5 1
x DC M 5 2
x DC L 5 3
;
run;
proc sql;
select distinct location into :loc_list separated by ' ' from have;
quit;
Any help would be greatly appreciated.
Thanks.
Not sure why you'd want to use an array here...proc transpose and some data step logic
can easily solve this problem. Arrays are very useful (gotta admit, I'm not entirely
comfortable with them either), but in a situation where you have that many locations,
I think transpose is better.
Does the code below accomplish your goal?
/*sorts to get ready for transpose*/
proc sort data=have;
by location;
run;
/*transpose current price*/
proc transpose data=have out=cur_tran prefix=cur_price;
by location;
id size;
var cur_price;
run;
/*transpose recommended price*/
proc transpose data=have out=rec_tran prefix=rec_price;
by location;
id size;
var rec_price;
run;
/*merge back together*/
data merged;
merge cur_tran rec_tran;
by location;
run;
/*creates flags and new field for final price*/
data want;
set merged;
if cur_priceS<cur_priceM<cur_priceL then
do;
FLAG='Y';
priceS=cur_priceS;
priceM=cur_priceM;
priceL=cur_priceL;
end;
else do;
FLAG='N';
priceS=rec_priceS;
priceM=rec_priceM;
priceL=rec_priceL;
end;
run;
I don't see how arrays would help here. How about just checking using dif to queue the last record's price and verify it (could also retain the last price if you prefer). Make sure the dataset's properly sorted by type location descending size, then:
data want;
set have;
by type location descending size; *S > M > L alphabetically;
retain price_check;
if not first.location and dif(cur_price) le 0 then price_check=1;
*if dif <= 0 then the current record is not larger than the previous one;
else if first.location then price_check=0; *reset it;
if last.location;
keep type location price_check;
run;
Then merge that back to your original dataset by type location, and use the other price if price_check=1.
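The check-then-merge idea can be sketched end to end as follows. This is an illustration under assumptions rather than the answerer's exact code: the dataset names checks and final are hypothetical, only two of the sample locations are used, and lag() is called on every row (instead of a conditional dif()) so that its queue stays in step with the data.

```sas
data have;
  input type $ location $ size $ cur_price rec_price;
  datalines;
x NY S 4 1
x NY M 5 2
x NY L 6 3
x LA S 5 1
x LA M 4 2
x LA L 3 3
;

/* Descending size sorts S, M, L, so prices should strictly increase row by row */
proc sort data=have;
  by type location descending size;
run;

/* One row per location: price_check=1 when a price fails to increase */
data checks;
  set have;
  by type location descending size;
  retain price_check;
  prev = lag(cur_price);
  if first.location then price_check = 0;
  else if cur_price <= prev then price_check = 1;
  if last.location;
  keep type location price_check;
run;

/* Attach the flag to every row and pick the price to use */
data final;
  length flag $1;
  merge have checks;
  by type location;
  flag  = ifc(price_check = 1, 'N', 'Y');
  price = ifn(price_check = 1, rec_price, cur_price);
run;
```

With this sample, the NY rows keep their current prices (flag Y) and the LA rows fall back to the recommended prices (flag N).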
Alternatively you could do it in a single query which is almost a re-statement of your requirements.
Proc sql;
create table want as
select *
/* Basically, I have current prices for three different sizes of an item type.
If the sizes are priced correctly (ie small<medium<large) */
, case
when max ( case when size eq 'S' then cur_price end)
lt max ( case when size eq 'M' then cur_price end)
and max ( case when size eq 'M' then cur_price end)
lt max ( case when size eq 'L' then cur_price end)
/* I want to flag them with a “Y” and continue to use the current price */
then 'Y'
/* If they are not priced correctly,
I want to flag them with a “N” and use the recommended price. */
else 'N'
end as Cur_Price_Sizes_Correct
, case
when calculated Cur_Price_Sizes_Correct eq 'Y'
then cur_price
else rec_price
end as Price
From have
Group by Type
, Location
;
Quit;
