Thanks for taking the time to help me out.
Basically, I would like to generate a random number from 0 to 1, 15,000 times. If the generated value is below .25, I would like to display 0 in that spot in the table. If it is .25 or greater, I would like to keep the original value.
Any tips on what I should use in the part marked with asterisks? The code is pasted below in the format I used:
data rand_data;
call streaminit(123);
do i = 1 to 15000;
u = rand('Uniform');
***if u < .25 then do;
0
else***
output;
end;
run;
proc print data= rand_data;
run;
It sounds like you want to replace values that are less than 0.25 with 0.
if u < 0.25 then u=0;
output;
If instead you want to ignore the values that are less than 0.25 then make the OUTPUT statement conditional.
if u >= 0.25 then output;
Given your code structure, this could result in fewer than 15,000 observations.
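For reference, here is a minimal sketch of the first option dropped into the full step from the question. The PROC SQL check at the end is an addition for illustration, not part of the original answer.

```sas
data rand_data;
  call streaminit(123);            /* fixed seed so the run is repeatable */
  do i = 1 to 15000;
    u = rand('Uniform');
    if u < 0.25 then u = 0;        /* replace small values, keep the row */
    output;                        /* unconditional: always 15,000 rows */
  end;
run;

/* Illustrative check: total rows and how many were set to 0 */
proc sql;
  select count(*) as n_rows, sum(u = 0) as n_zero
  from rand_data;
quit;
```

Because the OUTPUT statement is unconditional here, the row count stays at 15,000; only the conditional-OUTPUT variant changes it.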
So I have this:
Initial database:
Variable1, variable2, value, percentvalue
Keyword1, a, 234, 0.7
Keyword1, a, 64, 0.18
Keyword1, a, 4, 0.05
Keyword1, a, 2, 0.025
Keyword1, a, 300, 0.84
Keyword2
Keyword2
Keyword3
Keyword4
Keyword4
and so on.
When I run this for a single keyword, it works:
data Filename1;
set filename0;
if variable1 = 'Keyword1' then do;
retain sumCol;
sumCol = sum(sumCol, percentvalue);
if sumCol>0.95 then DELETE;
output;
end;
This returns the first 3 rows of Keyword1, which is what I want.
But I need to do this for the entire table, which has about 600 keywords.
For now I'm testing with a single keyword to make sure it works the same way.
But when I run:
data Filename1;
set filename0;
array MyArrayVariable1{1} $ Keyword1;
do i=1 to dim(MyArrayVariable1);
if variable1 = MyArrayVariable1[i] then do;
retain sumCol;
sumCol = sum(sumCol, percentvalue);
if sumCol>0.95 then DELETE;
output;
end;
end;
run;
When I run it, it just returns an empty table instead of the selected rows.
And if I get rid of the output; statement, it returns the entire table without filtering anything.
Looks like you just want to use BY group processing.
data Filename1;
set filename0;
by variable1 ;
if first.variable1 then sumcol=0;
sumCol + percentvalue;
if sumCol<=0.95 then output;
run;
Note that using a SUM statement
sumCol + percentvalue;
is a simplified way to code these two statements in your original code.
retain sumCol;
sumCol = sum(sumCol, percentvalue);
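To try the BY-group version end to end, here is a small self-contained sketch. The Keyword1 rows are the ones from the question; the Keyword2 rows (and the value "b") are made up purely for illustration. BY-group processing assumes the input is already sorted by variable1.

```sas
/* Keyword1 rows are from the question; Keyword2 rows are hypothetical */
data filename0;
  infile datalines dsd;
  input variable1 :$8. variable2 :$1. value percentvalue;
  datalines;
Keyword1,a,234,0.7
Keyword1,a,64,0.18
Keyword1,a,4,0.05
Keyword1,a,2,0.025
Keyword1,a,300,0.84
Keyword2,b,10,0.5
Keyword2,b,20,0.4
Keyword2,b,30,0.1
;
run;

data Filename1;
  set filename0;
  by variable1;
  if first.variable1 then sumCol = 0;   /* restart the running total per keyword */
  sumCol + percentvalue;                /* sum statement: implicit RETAIN */
  if sumCol <= 0.95 then output;        /* keep rows while the total is within 95% */
run;
```

With this data the step keeps the first three Keyword1 rows and the first two Keyword2 rows. As for the array attempt in the question: `array MyArrayVariable1{1} $ Keyword1;` creates a new, blank character variable named Keyword1 rather than holding the literal text 'Keyword1', so the comparison never matches. With an explicit OUTPUT that never runs, the result is empty; without it, the implicit output writes every row.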
BY group processing with an I/O criterion based on a groupwise computation can also be succinctly coded in what is commonly called a DOW loop in the SAS community. One hallmark of the technique is to place the SET statement inside a DO loop.
Example:
data want;
do until (last.variable1);
SET have;
by variable1;
pctsum = sum(pctsum,percentvalue);
if pctsum <= 0.95 then OUTPUT;
end;
run;
NOTE:
I'm not sure of the role of your Variable2. Should it be part of a hierarchy wherein the pctsum is reset if the Variable2 value changes within a Variable1 group?
I'm trying to understand a friend's code to see if I can find some inspiration for my dissertation. He runs a section where he creates a dataset from 3 input datasets. What I don't understand is that he uses 3 set statements, and the last two use point = _N_.
What is the use of the following code?
data Other;
set One;
set Two point = _N_;
set Three point = _N_;
array Rating[*] Unrated;
array Amortising[*] '1'n;
array Rating_old[*] old_Unrated;
AM = 0;
do i = 1 to dim(Rating);
Rating[i] = Rating[i] + Rating_old[i] * Amortising[i];
end;
run;
The input datasets look like this
data one;
input Segment count weight ;
datalines;
1 0 0.1
99 1 0.2
;
run;
data two;
input block $ type '0'n '1'n '99'n;
datalines;
50 A 100% 10% 0%
50 S 100% 10% 0%
51 S 100% 10% 0%
52 S 100% 10% 0%
132 S 100% 12% 0%
;
run;
data three;
input DPD $ block type $ segment count weight;
datalines;
AM 50 S 1 0 0.1
Unrated 51 S 99 0.2
NPE 132 S 1 0.5
;
run;
Just looking to see what the point = _N_ would be used for!
In this program it does nothing. The program would run exactly the same without the point= option on the last two set statements.
The POINT= option lets you access observations directly by number. The _N_ automatic variable is incremented once for each iteration of the data step. So on the first iteration the step will read the first observation from each of the three inputs, which is exactly what would happen without the point= option.
Note that this program will stop when the first SET statement reads past the end of the file. Without the POINT= option it would stop when ANY of the three SET statements attempted to read past the end of its input file. You could get the same behaviour and avoid the ERRORs in the SAS log by using and testing the NOBS= option.
set One;
if _n_ <= nobs2 then set Two nobs=nobs2;
if _n_ <= nobs3 then set Three nobs=nobs3;
Given the datasets shown, it doesn't do anything.
However, if the ONE dataset had more rows than one or both of the other two datasets, it would avoid the data step stopping when it ran out of rows from the shortest dataset. For example, run this:
data Other;
set Two;
set One point = _N_;
set Three point = _N_;
array Rating[*] Unrated;
array Amortising[*] '1'n;
array Rating_old[*] old_Unrated;
AM = 0;
do i = 1 to dim(Rating);
Rating[i] = Rating[i] + Rating_old[i] * Amortising[i];
end;
run;
Just swapping TWO and ONE. Now you get 5 rows, while if you took off the point=_n_, you'd only get two still. So the program is likely being written to ensure all of ONE's rows are represented (similar to a left join in SQL except you're not joining to anything here). This would probably be more clearly written as a merge, even without a by statement if it's just a one-to-one merge. Usually, though, there's a valid merge key to merge on.
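As a rough sketch of the one-to-one merge mentioned above, here are two small made-up datasets (the names long, short, and combined and all their values are hypothetical, chosen so no variable names collide):

```sas
/* Hypothetical inputs: 5 rows and 2 rows, no shared variable names */
data long;
  input y;
  datalines;
100
200
300
400
500
;
run;

data short;
  input id x;
  datalines;
1 10
2 20
;
run;

/* MERGE without BY pairs rows by position and keeps going
   until the longest input is exhausted */
data combined;
  merge long short;
run;
```

Here combined has 5 rows, with id and x missing on rows 3 to 5. One caveat when applying this to real data: if the inputs share variable names, the value from the dataset listed later in the MERGE statement overwrites the earlier one.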
I have a series of string variables (x, y, z) whose observations I need to change from strings (x= less than 1 mile, more than 1 mile less then 5 miles, etc.) to integers (xrecode= 1, 2, etc).
Is there any automated way to do this? I need an automated method that gets away from this values equal 1, that value equals 2, ...(Do Loops, Arrays, Macros welcome)?
You can use an INFORMAT to convert from text to integer.
proc format ;
invalue distance
'less than 1 mile'=1
'more than 1 mile'=2
'less then 5 miles'=3
;
quit;
You can apply the same operation to multiple similar columns by looping over an ARRAY.
data want ;
set have ;
array in x y z ;
array out nx ny nz ;
do i=1 to dim(in);
out(i)=input(in(i),distance.);
end;
run;
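Here is a self-contained sketch of the same idea that can be run as is. The two sample rows in have, the array names src and rec, and the `other = .` line are additions for illustration; the labels are copied from the informat above.

```sas
proc format;
  invalue distance
    'less than 1 mile'  = 1
    'more than 1 mile'  = 2
    'less then 5 miles' = 3
    other               = .;   /* assumption: unmatched text becomes missing */
run;

/* Hypothetical sample data */
data have;
  length x y z $20;
  infile datalines dsd;
  input x y z;
  datalines;
less than 1 mile,more than 1 mile,less then 5 miles
more than 1 mile,less than 1 mile,unknown
;
run;

data want;
  set have;
  array src {*} x y z;          /* text columns */
  array rec {*} nx ny nz;       /* new numeric columns */
  do i = 1 to dim(src);
    rec[i] = input(src[i], distance.);
  end;
  drop i;
run;
```

The match is exact and case-sensitive, so the labels in the INVALUE statement need to be spelled the same way as the values in the data. Adding more response categories only means adding lines to the informat, not to the data step.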
Reeza is right. Proc format for instance:
proc format;
value ToForm
low-1 = 'less than one'
1-5 = 'one to five'
5-high = 'over five'
;quit;
data wanted;
set begin;
format val_to_format ToForm.;
run;
For more on proc format see: SAS documentation: proc format
I have a SAS data set grouped by clusters, as follows
data have;
input cluster date date9.;
cards;
1 1JAN2017
1 2JAN2017
1 7JAN2017
2 1JAN2017
2 3JAN2017
2 10JAN2017
;
run;
Within each cluster, I'd like to subtract from each date its previous date, so I get the dataset below:
data want;
input cluster date :date9. date_diff;
cards;
1 1JAN2017 0
1 2JAN2017 1
1 7JAN2017 5
2 1JAN2017 0
2 3JAN2017 2
2 10JAN2017 7
;
run;
I think perhaps I should be using a lag function similar to what I have written below.
DATA test;
SET have;
BY cluster;
if first.cluster then do;
date_diff = date - lag(date);
END;
RUN;
Any advice would be appreciated! Thanks
I like dif for this (lag plus subtract in one function). You have the if first backwards, I think, but dif and lag have the same restriction - what they're really doing is building a queue, so the lag or dif statement cannot be conditionally executed for most use cases. Here I flip it around and calculate the dif, then set it to missing if on first.cluster.
I also encourage you to use missing, not 0, for the first.cluster dif.
DATA test;
SET have;
BY cluster;
date_diff = dif(date);
if first.cluster then call missing(date_diff);
RUN;
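To see the queue behaviour for yourself, this small sketch runs LAG both ways on the data from the question. The dataset name lag_demo and the two new variable names are made up for illustration.

```sas
data have;
  input cluster date :date9.;
  format date date9.;
  datalines;
1 1JAN2017
1 2JAN2017
1 7JAN2017
2 1JAN2017
2 3JAN2017
2 10JAN2017
;
run;

data lag_demo;
  set have;
  by cluster;
  /* LAG executed only on the first row of each cluster:
     its queue is updated only on those rows */
  if first.cluster then conditional_lag = lag(date);
  /* LAG executed on every row: its queue always holds
     the previous row's date */
  every_row_lag = lag(date);
  format conditional_lag every_row_lag date9.;
run;
```

On the first row of cluster 2, conditional_lag is 01JAN2017 (the date stored the last time that LAG ran, on the first row of cluster 1), while every_row_lag is 07JAN2017, the actual previous row. That is why the lag or dif call should run unconditionally and the result be adjusted afterwards.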
Conditional lags are tricky. In this case (and in many), you don't actually want a conditional lag. You want to compute the lag for every record, and then you can use it conditionally. One way is:
data want ;
set have ;
by cluster ;
lagdate=lag(date) ;
if first.cluster then date_diff=0 ;
else date_diff=date-lagdate ;
run ;
So you compute lagdate for every record. Then you can conditionally compute date_diff.
I have a dataset sort of like this
obs| foo | bar | more
1 | 111 | 11 | 9
2 | 9 | 2 | 2
........
I need to throw out the 4 largest and 4 smallest values of foo before proceeding (later I would do a similar thing with bar), but I'm unsure of the most effective way to do this. I know there are functions smallest and largest, but I don't understand how I can use them to get the smallest 4 or largest 4 from an existing dataset. Alternatively I could just remove the min and max 4 times, but that sounds needlessly tedious and time consuming. Is there a better way?
PROC RANK will do this for you pretty easily. If you know the total count of observations, it's trivial - it's slightly harder if you don't.
proc rank data=sashelp.class out=class_ranks(where=(height_r>4 and weight_r>4));
ranks height_r weight_r;
var height weight;
run;
That removes any observation that is in the 4 smallest heights or weights, for example. The largest 4 would require knowing the maximum rank, or doing a second processing step.
data class_final;
if 0 then set sashelp.class nobs=nobs; /* nobs = row count of the original, unfiltered data */
set class_ranks;
if height_r lt (nobs-3) and weight_r lt (nobs-3);
run;
Of course if you're just removing the values then do it all in the data step and call missing the variable if the condition is met rather than deleting the observation.
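If the total count is not known in advance, one alternative is to rank the variable in both directions, so the cutoff is 4 on each side. This is only a sketch of a different variation, not the method above; the dataset and rank variable names are made up, and TIES=LOW is a choice made here so that tied values share the lowest rank of their group.

```sas
/* Ascending ranks: 1 = smallest height */
proc rank data=sashelp.class out=ranked_up ties=low;
  var height;
  ranks height_up;
run;

/* Descending ranks: 1 = largest height */
proc rank data=ranked_up out=ranked_both ties=low descending;
  var height;
  ranks height_down;
run;

/* Keep rows that are in neither the bottom 4 nor the top 4 */
data class_trimmed;
  set ranked_both;
  if height_up > 4 and height_down > 4;
run;
```

With TIES=LOW, a group of tied values that straddles the cutoff is dropped together, so more than 4 rows can be removed at either end when there are ties. This costs an extra PROC RANK pass but avoids needing NOBS= at all.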
You are going to need to make at least 2 passes through your dataset however you do this - one to find out what the top and bottom 4 values are, and one to exclude those observations.
You can use proc univariate to get the top and bottom 5 values, and then use the output from that to create a where filter for a subsequent data step. Here's an example:
ods _all_ close;
ods output extremeobs = extremeobs;
proc univariate data = sashelp.cars;
var MSRP INVOICE;
run;
ods listing;
data _null_;
do _N_ = 1 by 1 until (last.varname);
set extremeobs;
by varname notsorted;
if _n_ = 2 then call symput(cats(varname,'_top4'),high);
if _n_ = 4 then call symput(cats(varname,'_bottom4'),low);
end;
run;
data cars_filtered;
set sashelp.cars(where = ( &MSRP_BOTTOM4 < MSRP < &MSRP_TOP4
and &INVOICE_BOTTOM4 < INVOICE < &INVOICE_TOP4
)
);
run;
If there are multiple observations that tie for 4th largest / smallest this will filter out all of them.
You can use proc sql to place the number of distinct values of foo into a macro var (includes null values as distinct).
In your data step you can use first.foo and the macro var to selectively output only those observations that are not among the smallest or largest 4 distinct values.
proc sql noprint;
select count(distinct foo) + count(distinct case when foo is null then 1 end)
into :distinct_obs from have;
quit;
proc sort data = have; by foo; run;
data want;
set have;
by foo;
if first.foo then count+1;
if 4 < count < (&distinct_obs. - 3) then output;
drop count;
run;
I also found a way to do it that seems to work with IML (I'm practicing by trying to redo things different ways). I knew my maximum number of observations and had already sorted it by order of the variable of interest.
PROC IML;
EDIT data_set;
/* rows 1-4 and 51-54 of a 54-row dataset already sorted by the variable of interest */
DELETE point {1, 2, 3, 4, 51, 52, 53, 54};
PURGE;
close data_set;
quit;
I've not used IML very much but I stumbled upon this while reading documentation. Thank you to everyone who answered my question!