Proportion of the variables in each observation that satisfy certain conditions in SAS - arrays

HAVE is a SAS data set with 1700 observations and ~1,000 variables. There are three "types" of variables beyond the id. They are denoted by different prefixes. Here is a subset of the file:
id a_dog b_dog c_dog a_cat b_cat c_cat a_mouse b_mouse c_mouse ...
prsn1 1 -1 -2 2 2 0 1 4 1
prsn2 -1 -3 4 2 2 -1 0 -1 -1
...
I need to calculate the proportion of values that are above, below, or equal to zero for each respondent, by the type of variable (i.e., a_, b_, or c_). The solution should append these new variables to the file:
... prop_a_gt0 prop_a_lt0 prop_a_eq0 prop_b_gt0 prop_b_lt0 prop_b_eq0 prop_c_gt0 prop_c_lt0 prop_c_eq0
... 1.0000 0.0000 0.0000 0.6667 0.3333 0.0000 0.3333 0.3333 0.3333
... 0.3333 0.3333 0.3333 0.3333 0.6667 0.0000 0.3333 0.6667 0.0000
Note how prop_b_gt0, for example, is 0.6667 for prsn1 because two of the three b_ variables in the prsn1 row have values greater than 0.
I'm not sure how to accomplish this systematically. Perhaps there's a way to combine arrays with a proc sql step? Any solution welcome!

With an array, you need to loop over its elements and count how many are greater than zero (and, to allow for missing values, how many are non-missing).
data want;
set have ;
array a a_: ;
numerator=0;
denominator=0;
do index=1 to dim(a);
numerator=sum(numerator,a[index]>0);
denominator=sum(denominator,not missing(a[index]));
end;
prop_a_gt0=numerator/denominator;
drop index numerator denominator;
run;
Just replicate the block of code for the B and C variables also.
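As a language-neutral cross-check of the counting logic, here is a minimal Python sketch that computes all three proportions for every prefix from a single observation. It assumes an observation is represented as a dict mapping variable names to values, with None standing in for a SAS missing value; the function name sign_proportions is hypothetical and not part of any answer here.

```python
def sign_proportions(row, prefixes=("a_", "b_", "c_")):
    """Proportion of non-missing values >0, <0 and =0 for each variable prefix."""
    out = {}
    for prefix in prefixes:
        # Collect the non-missing values of every variable with this prefix.
        values = [v for name, v in row.items()
                  if name.startswith(prefix) and v is not None]
        n = len(values)
        if n == 0:
            continue  # nothing to count: leave the proportions undefined
        out[f"prop_{prefix}gt0"] = sum(v > 0 for v in values) / n
        out[f"prop_{prefix}lt0"] = sum(v < 0 for v in values) / n
        out[f"prop_{prefix}eq0"] = sum(v == 0 for v in values) / n
    return out


# First row of the sample data in the question (prsn1).
prsn1 = {"id": "prsn1",
         "a_dog": 1, "b_dog": -1, "c_dog": -2,
         "a_cat": 2, "b_cat": 2, "c_cat": 0,
         "a_mouse": 1, "b_mouse": 4, "c_mouse": 1}
props = sign_proportions(prsn1)
print({name: round(p, 4) for name, p in props.items()})
```

Dividing by the non-missing count rather than by the array size mirrors the denominator used in the DATA step above.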

For the case of more than three arrays (grouped by variable name prefix a_, b_, c_), a macro helps avoid the typos and stray edits that creep in during copy and paste (code replication).
Suppose a macro compute_proportions emits code that loops over a variable array defined in a DATA step. The generated code counts how many values meet each condition during the loop and calculates the proportions after the loop.
* simulate data;
data have;
array a a_1-a_300; * for simplicity, presume 1 to 300 correspond to dog, cat, mouse, ...;
array b b_1-b_300;
array c c_1-c_300;
call streaminit(123);
do id = 1 to 10;
do _n_ = 1 to dim(a);
a (_n_) = ceil(9 * rand('uniform')) - 5; * integers from -4 to 4;
b (_n_) = ceil(9 * rand('uniform')) - 5;
c (_n_) = ceil(9 * rand('uniform')) - 5;
end;
output;
end;
run;
%macro compute_proportions(array=, prefix=);
  _lt = 0; %* <0 count;
  _eq = 0; %* =0 count;
  _gt = 0; %* >0 count;
  _n = 0; %* non-missing count;
  do _index = 1 to dim(&array);
    _v = &array(_index);
    if not missing(_v) then do;
      _lt + (_v < 0);
      _eq + (_v = 0);
      _gt + (_v > 0);
      _n + 1;
    end;
  end;
  if _n > 0 then do;
    prop_&prefix.lt0 = _lt / _n;
    prop_&prefix.eq0 = _eq / _n;
    prop_&prefix.gt0 = _gt / _n;
  end;
  drop _lt _eq _gt _index _v _n;
%mend;
data want;
set have;
array a a_:; * all variables whose names start with a_ can be array referenced during step;
array b b_:;
array c c_:;
%compute_proportions (array=a, prefix=a_)
%compute_proportions (array=b, prefix=b_)
%compute_proportions (array=c, prefix=c_)
run;

Related

How to get the count of values greater than zero from a subset of an array in SAS

I want to get a data set with an array that saves the count of values greater than zero in a subset of an array.
My code:
%Macro Test(input_array, window);
array initial{*} &input_array;
array position[&window];
array cumulative[&window];
/* Fill array indicating position with value zero, previous value greater than zero */
do i = 1 to dim(initial) - 1;
if initial(i) gt 0 and initial(i+1) eq 0 then
position(i) = i + 1;
end;
/* Fill array indicating the count of values greater than zero until the index in the position array*/
%let j = 1;
%do %while (&j lt &window);
end_ = coalesce(of position&j - position&window);
if not missing(end_) then do;
gt_0_cnt = 0;
do k = &j to end_ - 1;
gt_0_cnt + ifn(initial(k) > 0,1,0);
end;
cumulative(end_ - 1) = gt_0_cnt;
end;
%let j = %eval(&j + end_);
%end;
%Mend;
DATA HAVE;
INPUT ID $ FM1-FM18;
DATALINES;
A 1 2 0 0 1 0 0 0 0 2 2 2 3 3 4 4 4 0
B 0 0 1 2 3 4 5 1 2 3 4 0 0 0 1 2 0 0
;
RUN;
DATA WANT;
SET HAVE;
%Test(FM:, 18);
RUN;
The output I need:
But I have a problem when trying to evaluate this expression
%let j = %eval(&j + end_)
I get the message ERROR: A character operand was found in the %EVAL function or %IF condition where a numeric operand is required. The condition was:
1 + end_
I don't know of any other way to get the desired result.
If someone can help me I will be grateful.
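It doesn't seem like you need the macro language for this. The %EVAL error arises because macro statements are resolved before the DATA step runs, so to the macro processor end_ is just the text "end_", not the value of a DATA step variable.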
data want;
set have;
array fm fm:;
array cum cum_1-cum_18;
do _i = 1 to dim(fm);
if fm[_i] eq 0 then call missing(cum[_i]);
else do;
do count = 1 by 1 until (fm[_i+count] eq 0 or (count+_i eq dim(fm)));
end;
put _i= count=;
cum[_i+count-1] = count;
_i = _i + count - 1;
end;
end;
run;
You can of course set the size of the cum array (18 here) or the variable names through macro parameters, but everything you're doing is perfectly doable with DATA step code and simple macro variable parameters.
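The same run-counting idea can be sketched outside SAS. This minimal Python version assumes that a run is a stretch of consecutive values greater than zero and that the length of each run is recorded at the position where the run ends (None elsewhere), which is how the DATA step above fills cum. The function name run_lengths_at_run_end is hypothetical.

```python
def run_lengths_at_run_end(values):
    """For each run of consecutive positive values, store its length at the run's last index."""
    out = [None] * len(values)
    run = 0
    for i, v in enumerate(values):
        if v > 0:
            run += 1
            # The run ends here if this is the last element or the next one is not positive.
            if i == len(values) - 1 or not values[i + 1] > 0:
                out[i] = run
                run = 0
    return out


# Rows A and B from the DATALINES block in the question.
row_a = [1, 2, 0, 0, 1, 0, 0, 0, 0, 2, 2, 2, 3, 3, 4, 4, 4, 0]
row_b = [0, 0, 1, 2, 3, 4, 5, 1, 2, 3, 4, 0, 0, 0, 1, 2, 0, 0]
print(run_lengths_at_run_end(row_a))
print(run_lengths_at_run_end(row_b))
```

Checking the end of the list before looking at the next element keeps the sketch from indexing past the last value when a run reaches the end of the row.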

Forming a 'partial' identity-matrix according to a partially filled vector

I'm currently forming a matrix from a vector in MATLAB following the scheme described below:
Given is a vector x containing ones and zeros in an arbitrary order, e.g.
x = [0 1 1 0 1];
From this, I would like to form a matrix Y that is described as follows:
Y has m rows, where m is the number of ones in x (here: 3).
Each row of Y is filled with a one at the k-th entry, where k is the position of a one in vector x (here: k = 2,3,5)
For the example x from above, this would result in:
Y = [0 1 0 0 0;
0 0 1 0 0;
0 0 0 0 1]
This is identical to an identity matrix, that has its (x=0)th rows eliminated.
I'm currently achieving this via the following code:
x = [0,1,1,0,1]; %example from above
m = sum(x==1);
Y = zeros(m,numel(x));
p = 1;
for n = 1:numel(x)
if x(n) == 1
Y(p,n) = 1;
p = p+1;
end
end
It works but I'm kind of unhappy with it as it seems rather inefficient and inelegant. Any ideas for a smoother implementation, maybe using some matrix multiplications or so are welcome.
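Here are a few one-line alternatives. Note that each of them infers the number of columns of Y from the position of the last one in x, so if x can end in zeros you should pass the size explicitly, e.g. sparse(1:nnz(x), find(x), 1, nnz(x), numel(x)).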
Using sparse:
Y = full(sparse(1:nnz(x), find(x), 1));
Similar but with accumarray:
Y = accumarray([(1:nnz(x)).' find(x(:))], 1);
Using eye and indexing. This assumes Y is previously undefined:
Y(:,logical(x)) = eye(nnz(x));
Use find to obtain the indices of ones in x which are also the column subscripts of ones in Y. Find the number of rows of Y by adding all the elements of the vector x. Use these to initialise Y as a zero matrix. Now find the linear indices to place 1s using sub2ind. Use these indices to change the elements of Y to 1.
cols = find(x);
noofones = sum(x);
Y = zeros(noofones, size(x,2));
Y(sub2ind(size(Y), 1:noofones, cols)) = 1;
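The steps above (find the column of each one, allocate a zero matrix, then set one entry per row) can be sketched in plain Python for readers who want to check the logic outside MATLAB. This is only an illustration using nested lists and 0-based indices; the function name partial_identity is hypothetical.

```python
def partial_identity(x):
    """Return the identity matrix of size len(x) with the rows where x is 0 removed."""
    cols = [j for j, v in enumerate(x) if v]      # column index of each one in x
    Y = [[0] * len(x) for _ in cols]              # one row per one, len(x) columns
    for row, col in enumerate(cols):
        Y[row][col] = 1                           # place a single 1 in each row
    return Y


for row in partial_identity([0, 1, 1, 0, 1]):
    print(row)
```

Because the number of columns is taken from len(x), the result keeps its full width even when x ends in zeros.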
Here's an alternative using matrix multiplications:
x = [0,1,1,0,1];
I = eye(numel(x));
% construct identity matrix with zero rows
Y = I .* x; % uses implicit expansion from 2016b or later
Y = Y(logical(x), :); % take only non-zero rows of Y
Result:
Y =
0 1 0 0 0
0 0 1 0 0
0 0 0 0 1
Thanks to @SardarUsama's comment for simplifying the code a bit.
Thanks everybody for the nice alternatives! I tried out all your solutions and averaged execution times over 1e4 executions for random (1000-entry) x-vectors. Here are the results:
(7.3e-4 sec) full(sparse(1:nnz(x), find(x), 1));
(7.5e-4 sec) cols = find(x);
noofones = sum(x);
Y = zeros(noofones, size(x,2));
Y(sub2ind(size(Y), 1:noofones, cols)) = 1;
(7.7e-4 sec) Y = accumarray([(1:nnz(x)).' find(x(:))], 1);
(1.7e-3 sec) I = speye(numel(x));
Y = I .* x;
Y = full(Y(logical(x), :));
(3.1e-3 sec) Y(:,logical(x)) = eye(nnz(x));
From your comment "This is identical to an identity matrix, that has its (x=0)th rows eliminated.", well, you can also explicitly generate it as such:
Y = eye(length(x));
Y(x==0, :) = [];
Very slow option for long x, but it works slightly faster than full(sparse(... for x with 10 elements on my computer.

In matlab, find the frequency at which unique rows appear in a matrix

In Matlab, say I have the following matrix, which represents a population of 10 individuals:
pop = [0 0 0 0 0; 1 1 1 0 0; 1 1 1 1 1; 1 1 1 0 0; 0 0 0 0 0; 0 0 0 0 0; 1 0 0 0 0; 1 1 1 1 1; 0 0 0 0 0; 0 0 0 0 0];
Where rows of ones and zeros define 6 different 'types' of individuals.
a = [0 0 0 0 0];
b = [1 0 0 0 0];
c = [1 1 0 0 0];
d = [1 1 1 0 0];
e = [1 1 1 1 0];
f = [1 1 1 1 1];
I want to define the proportion/frequency of a, b, c, d, e and f in pop.
I want to end up with the following list:
a = 0.5;
b = 0.1;
c = 0;
d = 0.2;
e = 0;
f = 0.2;
One way I can think of is by summing the rows, then counting the number of times each appears, and then sorting and indexing
sum_pop = sum(pop')';
x = unique(sum_pop);
N = numel(x);
count = zeros(N,1);
for l = 1:N
count(l) = sum(sum_pop==x(l));
end
pop_frequency = [x(:) count/10];
But this doesn't quite get me what I want (i.e. when frequency = 0) and it seems there must be a faster way?
You can use pdist2 (Statistics Toolbox) to get all frequencies:
indiv = [a;b;c;d;e;f]; %// matrix with all individuals
result = mean(pdist2(pop, indiv)==0, 1);
This gives, in your example,
result =
0.5000 0.1000 0 0.2000 0 0.2000
Equivalently, you can use bsxfun to manually compute pdist2(pop, indiv)==0, as in Divakar's answer.
For the specific individuals in your example (that can be identified by the number of ones) you could also do
result = histc(sum(pop, 2), 0:size(pop,2)) / size(pop,1);
There is some functionality in unique that can be used for this. If
[q,w,e] = unique(pop,'rows');
q is the matrix of unique rows, and w holds, for each unique row, the index of a row of pop where it appears. The third output e contains indices into q such that pop = q(e,:). Armed with this, the rest of the problem should be straightforward: the relative frequency of each value in e is the frequency with which the corresponding row appears in pop.
The counting can be done with histc
histc(e,1:max(e))/length(e)
and the non-occurring rows can be found with
ismember(a,q,'rows')
There are of course other ways as well, maybe (probably) faster ones, or one-liners. I post this because it provides a way that is easy to understand, readable, and does not require any special toolboxes.
EDIT
This example gives expected output
a = [0,0,0,0,0;1,0,0,0,0;1,1,0,0,0;1,1,1,0,0;1,1,1,1,0;1,1,1,1,1]; % catenated a-f
[q,w,e] = unique(pop,'rows');
prob = histc(e,1:max(e))/length(e);
out = zeros(size(a,1),1);
out(ismember(a,q,'rows')) = prob;
Approach #1
With bsxfun -
A = cat(1,a,b,c,d,e,f)
out = squeeze(sum(all(bsxfun(@eq,pop,permute(A,[3 2 1])),2),1))/size(pop,1)
Output -
out =
0.5000
0.1000
0
0.2000
0
0.2000
Approach #2
If those elements are binary numbers, you can convert them into decimal format.
Thus, decimal format for pop becomes -
>> bi2de(pop)
ans =
0
7
31
7
0
0
1
31
0
0
And that of the concatenated array, A becomes -
>> bi2de(A)
ans =
0
1
3
7
15
31
Finally, you need to count the decimal formatted numbers from A in that of pop, which you can do with histc. Here's the code -
A = cat(1,a,b,c,d,e,f)
out = histc(bi2de(pop),bi2de(A))/size(pop,1)
Output -
out =
0.5000
0.1000
0
0.2000
0
0.2000
I think ismember is the most direct and general way to do this. If your groups were more complicated, this would be the way to go:
population = [0,0,0,0,0; 1,1,1,0,0; 1,1,1,1,1; 1,1,1,0,0; 0,0,0,0,0; 0,0,0,0,0; 1,0,0,0,0; 1,1,1,1,1; 0,0,0,0,0; 0,0,0,0,0];
groups = [0,0,0,0,0; 1,0,0,0,0; 1,1,0,0,0; 1,1,1,0,0; 1,1,1,1,0; 1,1,1,1,1];
[~, whichGroup] = ismember(population, groups, 'rows');
freqOfGroup = accumarray(whichGroup, 1, [size(groups, 1) 1]) / size(population, 1);
In your special case the groups can be represented by their sums, so if this generic solution is not fast enough, use the sum-histc simplification Luis used.
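For comparison, the generic "match whole rows, then count" idea can be sketched in plain Python, which makes it easy to confirm the expected frequencies by hand. This is an illustration only, assuming rows are given as lists of 0/1 values; the function name row_frequencies is hypothetical.

```python
from collections import Counter


def row_frequencies(population, groups):
    """Fraction of population rows equal to each group row (0.0 for groups that never occur)."""
    counts = Counter(tuple(row) for row in population)   # rows become hashable tuples
    n = len(population)
    return [counts[tuple(g)] / n for g in groups]


population = [[0, 0, 0, 0, 0], [1, 1, 1, 0, 0], [1, 1, 1, 1, 1], [1, 1, 1, 0, 0],
              [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 1, 1, 1],
              [0, 0, 0, 0, 0], [0, 0, 0, 0, 0]]
groups = [[0, 0, 0, 0, 0], [1, 0, 0, 0, 0], [1, 1, 0, 0, 0],
          [1, 1, 1, 0, 0], [1, 1, 1, 1, 0], [1, 1, 1, 1, 1]]
print(row_frequencies(population, groups))
```

The denominator is the number of individuals in the population, not the number of groups, so the frequencies of all groups that cover the population sum to 1.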

Bounding values of vector : Thresholding function

H = [1 1; 1 2; 2 -1; 2 0; -1 2; -2 1; -1 -1; -2 -2;]';
I need to threshold each value such that
H(i,j) = 0 if H(i,j) >= 1,
else H(i,j) = 1 if H(i,j) <= 0
I applied this code
a = H(1,1)
a(a<=0) = 1
a(a>=1) = 0
But this means that the already affected value in the first step may get changed again. What is the correct way of thresholding? The above code is giving incorrect answers. I should be getting
a = [0 0; 0 0; 0 1; 0 1; 1 0; 1 0; 1 1; 1 1]
Please help
EDIT
Based upon the answer now I am getting
0 0
0 0
1.0000 0.3443
0.8138 0.9919
0 0.7993
0.1386 1.0000
1.0000 1.0000
1.0000 1.0000
As can be seen, rows 3-6 are all incorrect. Please help
ind1 = H>=1; %// get indices before doing any change
ind2 = H<=0;
H(ind1) = 0; %// then do the changes
H(ind2) = 1;
If dealing with non-integer values, you should apply a certain tolerance in the comparisons:
tol = 1e-6; %// example tolerance
ind1 = H>=1-tol; %// get indices before doing any change
ind2 = H<=0+tol;
H(ind1) = 0; %// then do the changes
H(ind2) = 1;
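The key point, deciding every element from its original value before anything is overwritten, can also be checked with a small Python sketch. It is an illustration only: it assumes the matrix is a nested list, leaves values strictly between 0 and 1 unchanged, and uses a hypothetical function name threshold with an optional tol argument playing the role of the tolerance above.

```python
def threshold(H, tol=0.0):
    """Map values >= 1 to 0 and values <= 0 to 1, using only the original values."""
    def new_value(v):
        if v >= 1 - tol:
            return 0
        if v <= 0 + tol:
            return 1
        return v  # strictly between 0 and 1: left unchanged

    # A new matrix is built, so no element is ever re-tested after being changed.
    return [[new_value(v) for v in row] for row in H]


# The matrix from the question, written row by row after the transpose.
H = [[1, 1], [1, 2], [2, -1], [2, 0], [-1, 2], [-2, 1], [-1, -1], [-2, -2]]
for row in threshold(H):
    print(row)
```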

Carrying out of loop so as to do the following operation

Consider an area of size m*n, where m and n are unknown. I am extracting data from each point in the area. I scan along the x direction up to point m, then return to the start of the next row (m=0, n=1) and again scan along the x direction to the end of m. An example of the data is shown below; I get a value for each x,y coordinate during the scan. I can carry out an operation between the first two points in the x direction with:
p1 = A{1}; %%reading the data from the text file
p2 = A{2};
LA=[p1 p2];
for m=1:length(y)
p= LA(m,1);
t= LA(m,2);
%%and
q=LA(m+1,1)
r=LA(m+1,2)
I want to do the same for the y axis. That is, I want to operate between the first point at x=0 and y=1, then between x=2 and y=1, and so on. I hope that is clear.
g x y
2 0 0
3 1 0
2 2 0
4 3 0
1 4 0
2 m 0
3 0 1
2 1 1
4 2 1
5 3 1
.
.
.
.
2 m 1
Now I was thinking of a logic where I first find the size of n by counting the number of zeros:
NUMX = 0;
while y((NUMX+1),:) == 0
NUMX = NUMX + 1;
end
NU= NUMX;
And then I was thinking of applying the following loop
for m=1:NU:n-1
%%and
p= LA(m,1);
t= LA(m,2);
%%and
q=LA(m+1,1)
r=LA(m+1,2)
But it's showing an error. Please help!
??? Attempted to access del2(99794,:); index out of bounds because
size(del2)=[99793,1].
Here NUMX=198
Comment: The nomenclature in your question is inconsistent, making it difficult to understand what you are doing. The variable del2 you mention in the error message is nowhere to be seen.
1.) Let's start off by creating a minimal working example that illustrates the data structure and provides knowledge of the dimensions we want to retrieve later. Your matrix is not m x n but m*n x 3.
The following example will set up a matrix with data similar to what you have shown in your question:
M = zeros(8,3);
for J=1:4
for I=1:2
M((J-1)*2+I,1) = rand(1);
M((J-1)*2+I,2) = I;
M((J-1)*2+I,3) = J-1;
end
end
M =
0.469 1 0
0.012 2 0
0.337 1 1
0.162 2 1
0.794 1 2
0.311 2 2
0.529 1 3
0.166 2 3
2.) Next, let's determine the number of x and y, to use the nomenclature of your question:
NUMX = 0;
while M(NUMX+1,3) == 0
NUMX = NUMX + 1;
end
NUMY = size(M,1)/NUMX;
NUMX =
2
NUMY =
4
3.) The data processing you want to do still is unclear, but here are two approaches that can be used for different means:
(a)
COUNT = 1;
for K=1:NUMX:size(M,1)
A(COUNT,1) = M(K,1);
COUNT = COUNT + 1;
end
In this case, you step through the first column of M with a step-size corresponding to NUMX. This will result in all the values for x=1:
A =
0.469
0.337
0.794
0.529
(b) You can also use NUMX and NUMY to reorder M:
for J=1:NUMY
for I=1:NUMX
NEW_M(I,J) = M((J-1)*NUMX+I,1);
end
end
NEW_M =
0.469 0.337 0.794 0.529
0.012 0.162 0.311 0.166
The matrix NEW_M now is of size m x n, with the values of constant y in the columns and the values of constant x in the rows.
Concluding remark: It is unclear how you define m and n in your code, so your specific error message cannot be resolved here.
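Steps 2 and 3(b) above can be combined into one short routine. The following Python sketch is an illustration under stated assumptions, not a fix for the original error: it assumes the scan is a list of (g, x, y) tuples in scan order with the same number of x points in every row, and the function name scan_to_grid is hypothetical.

```python
def scan_to_grid(rows):
    """Reshape a row-by-row scan of (g, x, y) tuples into a grid indexed as grid[x][y]."""
    # Count how many leading points share the first y value: that is the number of x points.
    first_y = rows[0][2]
    num_x = 0
    while num_x < len(rows) and rows[num_x][2] == first_y:
        num_x += 1
    num_y = len(rows) // num_x
    # Point (i, j) sits at position j * num_x + i in the flat scan.
    return [[rows[j * num_x + i][0] for j in range(num_y)] for i in range(num_x)]


# The example matrix M from step 1, one tuple per row.
M = [(0.469, 1, 0), (0.012, 2, 0), (0.337, 1, 1), (0.162, 2, 1),
     (0.794, 1, 2), (0.311, 2, 2), (0.529, 1, 3), (0.166, 2, 3)]
for row in scan_to_grid(M):
    print(row)
```

The bounds check in the while loop stops the count at the end of the data, which matters when the scan contains only a single row of points.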
