How to get the count of values greater than zero from a subset of an array in SAS - arrays

I want to get a data set with an array that saves the count of values greater than zero in a subset of an array.
My code:
%Macro Test(input_array, window);
array initial{*} &input_array;
array position[&window];
array cumulative[&window];
/* Fill array indicating position with value zero, previous value greater than zero */
do i = 1 to dim(initial) - 1;
if initial(i) gt 0 and initial(i+1) eq 0 then
position(i) = i + 1;
end;
/* Fill array indicating the count of values greater than zero until the index in the position array*/
%let j = 1;
%do %while (&j lt &window);
end_ = coalesce(of position&j - position&window);
if not missing(end_) then do;
gt_0_cnt = 0;
do k = &j to end_ - 1;
gt_0_cnt + ifn(initial(k) > 0,1,0);
end;
cumulative(end_ - 1) = gt_0_cnt;
end;
%let j = %eval(&j + end_);
%end;
%Mend;
DATA HAVE;
INPUT ID $ FM1-FM18;
DATALINES;
A 1 2 0 0 1 0 0 0 0 2 2 2 3 3 4 4 4 0
B 0 0 1 2 3 4 5 1 2 3 4 0 0 0 1 2 0 0
;
RUN;
DATA WANT;
SET HAVE;
%Test(FM:, 18);
RUN;
The output I need:
But I have a problem when trying to evaluate this expression
%let j = %eval(&j + end_)
I get the message ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was:
1 + end_
I don't know of any other way to get the desired result.
If someone can help me I will be grateful.

Doesn't seem like you need the macro language for this.
data want;
set have;
array fm fm:;
array cum cum_1-cum_18;
do _i = 1 to dim(fm);
if fm[_i] eq 0 then call missing(cum[_i]);
else do;
do count = 1 by 1 while (_i + count le dim(fm)); /* never index past the end of fm */
if fm[_i+count] eq 0 then leave; /* the run of non-zero values ends here */
end;
put _i= count=;
cum[_i+count-1] = count;
_i = _i + count - 1;
end;
end;
run;
Obviously you can specify the 18 max on the cum array through a macro parameter, or what the variable names are, but all of the stuff you're doing is perfectly doable through the data step language or simple macro variable parameters.
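To make the run-counting logic easier to check by hand, here is a minimal sketch in Python rather than SAS. The function name run_length_at_end is hypothetical, and None stands in for a SAS missing value. It illustrates the same idea (write each run's length at the run's last position) and is not a translation of the exact data step above.

```python
def run_length_at_end(values):
    """For each run of non-zero values, store the run length at the run's last index.

    All other positions are None (standing in for a SAS missing value).
    """
    result = [None] * len(values)
    count = 0
    for i, v in enumerate(values):
        if v != 0:
            count += 1
            # The run ends here if this is the last element or the next one is zero.
            if i == len(values) - 1 or values[i + 1] == 0:
                result[i] = count
                count = 0
    return result


# Row A from the question's DATALINES.
row_a = [1, 2, 0, 0, 1, 0, 0, 0, 0, 2, 2, 2, 3, 3, 4, 4, 4, 0]
print(run_length_at_end(row_a))
```

For row A this puts 2 at position 2, 1 at position 5 and 8 at position 17 (1-based), which is what the cum_ variables are meant to hold.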

Related

Proportion of the variables in each observation that satisfy certain conditions in SAS

HAVE is a SAS data set with 1700 observations and ~1,000 variables. There are three "types" of variables beyond the id. They are denoted by different prefixes. Here is a subset of the file:
id a_dog b_dog c_dog a_cat b_cat c_cat a_mouse b_mouse c_mouse ...
prsn1 1 -1 -2 2 2 0 1 4 1
prsn2 -1 -3 4 2 2 -1 0 -1 -1
...
I need to calculate the proportion of values that are above, below, or equal to zero for each respondent, by the type of variable (i.e., a_, b_, or c_). The solution should append these new variables to the file:
... prop_a_gt0 prop_a_lt0 prop_a_eq0 prop_b_gt0 prop_b_lt0 prop_b_eq0 prop_c_gt0 prop_c_lt0 prop_c_eq0
... 1.0000 0.0000 0.0000 0.6667 0.3333 0.0000 0.3333 0.3333 0.3333
... 0.3333 0.3333 0.3333 0.3333 0.6667 0.0000 0.3333 0.6667 0.0000
Note how prop_b_gt0, for example, is 0.6667 for prsn1 because two of the three b_ variables in the prsn1 row have values greater than 0.
I'm not sure how to accomplish this systematically. Perhaps there's a way to combine arrays with a proc sql step? Any solution welcome!
With an array you will need to loop through the array and count the number greater (and possibly count the number non-missing).
data want;
set have ;
array a a_: ;
numerator=0;
denominator=0;
do index=1 to dim(a);
numerator=sum(numerator,a[index]>0);
denominator=sum(denominator,not missing(a[index]));
end;
prop_a_gt0=numerator/denominator;
drop index numerator denominator;
run;
Just replicate the block of code for the B and C variables also.
For the case of more than three arrays (grouped by variable name prefix a_, b_, c_), a macro helps avoid the typos and stray edits that creep in when code is replicated by copy and paste.
Suppose a macro compute_proportions emits code that loops over a variable array defined in a DATA step. The generated code counts how many values fall into each state (<0, =0, >0) during the loop and calculates the proportions after looping.
* simulate data;
data have;
array a a_1-a_300; * for simplicity, presume 1 to 300 correspond to dog, cat, mouse, ...;
array b b_1-b_300;
array c c_1-c_300;
call streaminit(123);
do id = 1 to 10;
do _n_ = 1 to dim(a);
a (_n_) = ceil(rand('uniform', 9)) - 5;
b (_n_) = ceil(rand('uniform', 9)) - 5;
c (_n_) = ceil(rand('uniform', 9)) - 5;
end;
output;
end;
run;
%macro compute_proportions(array=, prefix=);
_lt = 0; %* <0 count;
_eq = 0; %* =0 count;
_gt = 0; %* >0 count;
_n = 0;
do _index = 1 to dim(&array);
_v = &array(_index);
if not missing(_v) then do;
_lt + _v < 0;
_eq + _v = 0;
_gt + _v > 0;
_n + 1;
end;
end;
if _n > 0 then do;
&prefix.prop_lt0 = _lt / _n;
&prefix.prop_eq0 = _eq / _n;
&prefix.prop_gt0 = _gt / _n;
end;
drop _lt _eq _gt _index _v _n;
%mend;
data want;
set have;
array a a_:; * all variables whose names start with a_ can be array referenced during step;
array b b_:;
array c c_:;
%compute_proportions (array=a, prefix=a_)
%compute_proportions (array=b, prefix=b_)
%compute_proportions (array=c, prefix=c_)
run;
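As a cross-check of the arithmetic, here is a small Python sketch of the same per-prefix counting, applied to the prsn1 row from the question. The function name proportions_by_prefix and the dict-based row are assumptions made for illustration only; None stands in for a SAS missing value and is excluded from the denominator.

```python
def proportions_by_prefix(row, prefixes):
    """Return prop_<prefix>gt0 / lt0 / eq0 for each prefix, ignoring missing (None) values."""
    out = {}
    for p in prefixes:
        vals = [v for name, v in row.items() if name.startswith(p) and v is not None]
        n = len(vals)
        if n == 0:
            continue  # nothing to divide by, so leave these proportions undefined
        out[f"prop_{p}gt0"] = sum(v > 0 for v in vals) / n
        out[f"prop_{p}lt0"] = sum(v < 0 for v in vals) / n
        out[f"prop_{p}eq0"] = sum(v == 0 for v in vals) / n
    return out


# The prsn1 row from the question.
prsn1 = {
    "a_dog": 1, "b_dog": -1, "c_dog": -2,
    "a_cat": 2, "b_cat": 2, "c_cat": 0,
    "a_mouse": 1, "b_mouse": 4, "c_mouse": 1,
}
for name, value in proportions_by_prefix(prsn1, ["a_", "b_", "c_"]).items():
    print(name, round(value, 4))
```

The b_ values for prsn1 are -1, 2 and 4, so two of three are positive, which reproduces the 0.6667 quoted in the question.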

SAS - Find and print first non-zero value from a dataset in columns

I have a data set with ID in rows and months in columns, as the one shown below.
I want to create an auxiliary column that records the first value that is not zero of each line.
ID M1 M2 M3 M4 M5 Auxiliary column
1 0 0 8 8 7 8
2 7 7 7 . . 7
3 0 0 0 0 9 9
4 0 9 9 9 8 9
5 1 1 1 1 1 1
6 0 2 2 1 1 2
Currently I am using this code, but I haven't been able to get the results I am looking for. Any ideas?
data new_ops04;
set new_ops03;
array MONTHS (24) M1-M24;
RETAIN AUXILIARY_COLUMN 0;
do i=1 to 24;
IF MONTHS(i) ne 0 and AUXILIARY_COLUMN = 0 THEN
AUXILIARY_COLUMN = MONTHS(i);
end;
drop i;
run;
Thanks a lot!
You're very close. Just drop the retain statement:
data new_ops04;
set new_ops03;
array MONTHS (24) M1-M24;
AUXILIARY_COLUMN = 0;
do i=1 to 24;
IF MONTHS(i) ne 0 and AUXILIARY_COLUMN = 0 THEN
AUXILIARY_COLUMN = MONTHS(i);
end;
drop i;
run;
You need to consider what happens if the first observation(s) are missing.
I would do this use case in proc sql. But your problem is that you are not stopping when you reach the first value. So:
flag = 0;
do i=1 to 24 until (flag); /* stop as soon as a value has been found */
if MONTHS(i) ne 0 then do;
AUXILIARY_COLUMN = MONTHS(i);
flag = 1;
end;
end;
drop i flag;
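The stop-at-first-hit idea can be sketched in Python as follows. The function name first_nonzero is hypothetical, and skipping missing values (None here, . in SAS) is an assumption about what the asker wants; the SAS code above treats a missing value as "not zero" and would pick it up.

```python
def first_nonzero(values):
    """Return the first value that is neither zero nor missing (None), else None."""
    for v in values:
        if v is not None and v != 0:
            return v  # stop at the first hit; later values are never inspected
    return None


# Rows taken from the table in the question (M1-M5 only).
rows = {
    1: [0, 0, 8, 8, 7],
    2: [7, 7, 7, None, None],
    3: [0, 0, 0, 0, 9],
    6: [0, 2, 2, 1, 1],
}
for row_id, months in rows.items():
    print(row_id, first_nonzero(months))
```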

Find where condition is true n times consecutively

I have an array (say of 1s and 0s) and I want to find the index, i, for the first location where 1 appears n times in a row.
For example,
x = [0 0 1 0 1 1 1 0 0 0] ;
i = 5, for n = 3, as this is the first time '1' appears three times in a row.
Note: I want to find where 1 appears n times in a row so
i = find(x,n,'first');
is incorrect as this would give me the index of the first n 1s.
Is it essentially a string search, e.g. findstr but with a vector?
You can do it with convolution as follows:
x = [0 0 1 0 1 1 1 0 0 0];
N = 3;
result = find(conv(x, ones(1,N), 'valid')==N, 1)
How it works
Convolve x with a vector of N ones and find the first time the result equals N. Convolution is computed with the 'valid' flag to avoid edge effects and thus obtain the correct value for the index.
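Convolving with a vector of ones is the same as taking a sliding sum over every window of N elements. Here is a minimal Python sketch of that sliding sum, with no libraries. The name first_run_start is hypothetical, the index it returns is 1-based to match MATLAB, and like the convolution it assumes x contains only 0s and 1s.

```python
def first_run_start(x, n):
    """Return the 1-based index where the first run of n consecutive ones starts, or None."""
    if n <= 0 or n > len(x):
        return None
    window = sum(x[:n])  # sum over the first window of n elements
    if window == n:
        return 1
    for end in range(n, len(x)):
        window += x[end] - x[end - n]  # slide the window one step to the right
        if window == n:
            # The window now covers 0-based positions end-n+1 .. end.
            return end - n + 2
    return None


x = [0, 0, 1, 0, 1, 1, 1, 0, 0, 0]
print(first_run_start(x, 3))  # 5, as in the question
```

Unlike the full convolution, this stops at the first match, so it does not compute the sums for the rest of the vector.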
Another answer that I have is to generate a buffer matrix where each row of this matrix is a neighbourhood of overlapping n elements of the array. Once you create this, index into your array and find the first row that has all 1s:
x = [0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
%// Solution
ind = bsxfun(@plus, (1:numel(x)-n+1).', 0:n-1); %'
out = find(all(x(ind),2), 1);
The first line is a bit tricky. We use bsxfun to generate a matrix of size m x n where m is the total number of overlapping neighbourhoods while n is the size of the window you are searching for. This generates a matrix where the first row is enumerated from 1 to n, the second row is enumerated from 2 to n+1, up until the very end which is from numel(x)-n+1 to numel(x). Given n = 3, we have:
>> ind
ind =
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
6 7 8
7 8 9
8 9 10
These are indices which we will use to index into our array x, and for your example it generates the following buffer matrix when we directly index into x:
>> x = [0 0 1 0 1 1 1 0 0 0];
>> x(ind)
ans =
0 0 1
0 1 0
1 0 1
0 1 1
1 1 1
1 1 0
1 0 0
0 0 0
Each row is an overlapping neighbourhood of n elements. We finally end by searching for the first row that gives us all 1s. This is done by using all and searching over every row independently with the 2 as the second parameter. all produces true if every element in a row is non-zero, or 1 in our case. We then combine with find to determine the first non-zero location that satisfies this constraint... and so:
>> out = find(all(x(ind), 2), 1)
out =
5
This tells us that the fifth location of x is where the beginning of this duplication occurs n times.
Based on Rayryeng's approach you can loop this as well. This will definitely be slower for short array sizes, but for very large array sizes this doesn't calculate every possibility, but stops as soon as the first match is found and thus will be faster. You could even use an if statement based on the initial array length to choose whether to use the bsxfun or the for loop. Note also that for loops are rather fast since the latest MATLAB engine update.
x = [0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
for idx = 1:numel(x)-n+1 %// the last window starts at numel(x)-n+1
if all(x(idx:idx+n-1))
break
end
end
Additionally, this can be used to find the first a occurrences:
x = [0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0 0 0 1 0 1 1 1 0 0 0]; %// Example data
n = 3; %// How many times we look for duplication
a = 2; %// number of desired matches
collect(1,a)=0; %// initialise output
kk = 1; %// initialise counter
for idx = 1:numel(x)-n+1 %// the last window starts at numel(x)-n+1
if all(x(idx:idx+n-1))
collect(kk) = idx;
if kk == a
break
end
kk = kk+1;
end
end
Which does the same but shuts down after a matches have been found. Again, this approach is only useful if your array is large.
Seeing you commented whether you can find the last occurrence: yes. Same trick as before, just run the loop backwards:
for idx = numel(x)-n+1:-1:1 %// start from the last possible window
if all(x(idx:idx+n-1))
break
end
end
One possibility with looping:
i = 0;
n = 3;
for idx = n : length(x)
idx_true = 1;
for sub_idx = (idx - n + 1) : idx
idx_true = idx_true & (x(sub_idx));
end
if(idx_true)
i = idx - n + 1;
break
end
end
if (i == 0)
disp('No index found.')
else
disp(i)
end

How to find the longest interval of 1's in a list [matlab]

I need to find the longest interval of 1's in a matrix, and the position of the first "1" in that interval.
For example, if I have a matrix: [1 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 ]
I need to have both the length of 7 and that the first 1's position is 11.
Any suggestions on how to proceed would be appreciated.
Using this answer as a basis, you can do as follows:
a = [1 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 ]
dsig = diff([0 a 0]);
startIndex = find(dsig > 0);
endIndex = find(dsig < 0) - 1;
duration = endIndex-startIndex+1;
duration
startIdx = startIndex(duration == max(duration))
endIdx = endIndex(duration == max(duration))
This outputs:
duration =
1 3 7
startIdx =
11
endIdx =
17
Please note that you should double-check that this works for cases other than your example. Nevertheless, I think this is a step in the right direction. If not, the linked answer has more information and alternatives.
If there are multiple intervals of ones with the same maximal length, indexing with the logical mask (duration == max(duration) above, le==ML below) returns the start position of each of them, not just the first.
A=round(rand(1,20)) %// test vector
[~,p2]=find(diff([0 A])==1); %// finds where a string of 1's starts
[~,p3]=find(diff([A 0])==-1); %// finds where a string of 1's ends
le=p3-p2+1; %// length of each interval of 1's
ML=max(le); %// length of longest interval
ML %// display ML
p2(le==ML) %// find where strings of maximum length begin (per Marcin's answer)
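The diff trick used in both answers can be spelled out step by step. Here is a minimal Python sketch: pad with zeros, take differences, and read off run starts (+1) and run ends (-1). The name longest_run_of_ones is hypothetical, positions are 1-based to match MATLAB, and on ties this sketch deliberately returns only the first longest run.

```python
def longest_run_of_ones(a):
    """Return (length, 1-based start) of the longest run of ones, or (0, None) if there is none."""
    padded = [0] + list(a) + [0]
    diff = [padded[i + 1] - padded[i] for i in range(len(padded) - 1)]
    starts = [i + 1 for i, d in enumerate(diff) if d == 1]  # 1-based start of each run
    ends = [i for i, d in enumerate(diff) if d == -1]       # 1-based end of each run
    if not starts:
        return 0, None
    lengths = [e - s + 1 for s, e in zip(starts, ends)]
    best = max(range(len(lengths)), key=lengths.__getitem__)  # first maximum on ties
    return lengths[best], starts[best]


a = [1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
print(longest_run_of_ones(a))  # (7, 11)
```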
I have thought of a brute force approach;
clc; clear all; close all;
A= [1 0 0 1 1 1 0 0 0 0 1 1 1 1 1 1 1 ];
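index = 1; %// start of the current run of 1's
globalIndex = 0; %// start of the longest run found so far
globalCount = 0; %// length of the longest run found so far
count = 0;
flag = 0; %// A flag to keep if the previous encounter was 0 or 1
for i = 1 : length(A)
if A(i) == 1
count = count + 1;
if flag == 0
index = i;
flag = 1;
end
end
if A(i) == 0 || i == length(A)
if count > globalCount
globalCount = count;
globalIndex = index; %// remember where the longest run began
end
flag = 0;
count = 0;
end
end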

input array of strings in matlab

Problem: Solve linear equations
I have a 3×3 matrix and I wanted to take 3 expressions as inputs which contain matrix cells like
2*b(1,1)+3*b(1,2)+3*b(1,3)
3*b(2,1)+4*b(2,3)+3*b(2,3)
and evaluate them with different cell values in matrix
0 1 0
1 0 0
1 0 0
0 1 0
0 1 0
1 0 0 etc.,
I used the following code, I got the result but I can only use the cell values. When I try to give expressions with numeral, it shows the following error:
*Warning: File: pro.m Line: 5 Column: 9 The expression on this line will generate an error when executed. The error will be: Error using
==> vertcat CAT arguments dimensions are not consistent.
??? Error using ==> pro at 5 Error using ==> vertcat CAT arguments dimensions are not consistent.*
Here is my code:
clc;
clear all;
close all;
cell = ['b(1,1)+b(1,2)';'b(2,1)+b(2 ,3)';'b(3,3)+b(3,2)'];
exp = cellstr(cell);
res = [0,0,0];
display(res);
display(exp);
a = zeros(3,3);
for i = 1:1:3
a(1,i) = 1;
if(i>1)
a(1,i-1) = 0;
end
for j = 1:1:3
a(2,j) = 1;
if(j>1)
a(2,j-1) = 0;
end
for k = 1:1:3
a(3,k) = 1;
if(k>1)
a(3,k-1) = 0;
end
b = a;
res(k) = eval(exp{k});
if res(1) == 1
if res(2) == 1
if res(3) == 1
display(res);
display(b);
break;
end
end
end
end
a(3,k)=0;
end
a(2,j) = 0;
end
;
Help how can I input strings with numerals and matrix cells...
This is not a valid expression to initialize a cell in Matlab:
cell = ['b(1,1)+b(1,2)';'b(2,1)+b(2 ,3)';'b(3,3)+b(3,2)'];
You have to use the curly brackets { }:
cell = {'b(1,1)+b(1,2)';'b(2,1)+b(2 ,3)';'b(3,3)+b(3,2)'};
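BTW, if you want to solve a linear system of the form AX+b=0, you can simply try X=-inv(A)*b (for larger or ill-conditioned systems, X=-(A\b) is usually preferred over forming the explicit inverse):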
% Define system
A = [2 3 1; 7 -1 1; 4 0 5];
b = [1 0 1].';
% Solve system
X = -inv(A)*b;
