Octave group statistic computation with accumarray and user defined function returning a third column's value - arrays

To be clear, the following is not my original problem, which involves much larger data inside a larger application and code base. I have reduced it to the simplest toy example, which helps for development and unit testing as well as for sharing on Stack Exchange. I am experienced in R but not in Octave (MATLAB). This code is for Octave version 4.0.0. I am stuck on translating group computations such as R's tapply() or by(), and on writing and calling user-defined functions (with a bit of additional processing beyond those built-ins), into the Octave language.
Starting state is an array a as shown:
a = [5 1 8 0; 2 1 9 0; 2 3 3 0; 5 3 9 0]
a =
5 1 8 0
2 1 9 0
2 3 3 0
5 3 9 0
The process I need to do is essentially just this: Group by column 1, find the min statistic in column 3, return the value stored in column 2 of the same row, and write the value to column 4. I want no optional packages to be used. The built-in accumarray and min functions together get me pretty close but I’ve not found the needed syntax. Matlab seems to have many versions of parameter passing syntaxes developed over different releases and please note my code needs to run in Octave 4.0.0.
Final state desired is same array a, but column 4 is updated as shown:
a =
5 1 8 1
2 1 9 3
2 3 3 3
5 3 9 1
My best few code snippets of near-misses and most interesting things among all my failed attempts (not shown, as there are many pages of attempts that do not work) are:
[x,y] = min(a(a(:,1)==5,3),[],1)
x = 8
y = 1
Notice that y is index of row within the group, but not row within the a array, which is fine and good as long as I later do a computation to translate indexes from group-relative to global-relative, and inside there read the value of a(y,2) which is the correct answer value for each row.
>> [x,y] = min(a(a(:,1)==2,3),[],1)
x = 3
y = 2
>> [~,y] = min(a(a(:,1)==2,3),[],1);
>> y
y = 2
Notice that y is all I need from min() since it’s the index of the row of interest.
>> accumarray(a(:,1), a(:,3), [], @([~,y]=min([],[],1)))
parse error:
syntax error
Notice that with some kind of syntax I need to pass to min() in its first parameter the group of values determined by parameters 1 and 2 of accumarray.
I ultimately need to have something like this happen within the group computations after min() returns row index y:
a(y,4) = a(y,2); % y is the desired row index found by min() within each group
So, I tried to write a function that’s named for possibly simpler syntax:
>> function idx = ccen(d)
[~,y]=min(d,[],1);
idx=a(y,2);
end
>> accumarray(a(:,1), a(:,3), [], @ccen)
error: 'a' undefined near line 3 column 5
error: called from
ccen at line 3 column 4
accumarray at line 345 column 14
Seems to me, that to my surprise, a is not accessible to function ccen. Now what can I do? Thank you for reading.

When declaring functions in MATLAB / Octave, any variables declared outside the scope (by default) are not accessible. This means that even though you have a declaration for a, when you create that function, a is not accessible within the scope of the function.
What you can do is modify ccen so that a is supplied to the function so it can access the variable when the function is being called. After, wrap an anonymous function around your call to ccen when calling accumarray. Anonymous functions however do have the luxury of capturing the scope of variables that aren't explicitly declared as input variables into the function:
So first:
function idx = ccen(a, d) %// Change
[~,y]=min(d,[],1);
idx=a(y,2);
end
And now...
out = accumarray(a(:,1), a(:,3), [], @(x) ccen(a,x)); %// Change last parameter
This call is acceptable because the anonymous function is capturing a at the time of creation. Notice how x in the anonymous function is what is piped in from the accumarray calls. You're simply forwarding that as the second parameter to ccen and keeping a constant. This doesn't change the way the function is being run.... it's just resolving a scope issue.
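As a minimal sketch of what "capturing at the time of creation" means (the names k and f are made up for illustration): the value of a captured variable is frozen inside the handle, so later changes to the workspace variable should not affect it.

```matlab
k = 2;
f = @(x) k * x;   % k is captured here, with the value 2
k = 10;           % changing k afterwards does not affect f
result = f(3);    % uses the captured k = 2
disp(result)      % displays 6
```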
I get the following in Octave:
octave:10> a = [5 1 8 0; 2 1 9 0; 2 3 3 0; 5 3 9 0]
a =
5 1 8 0
2 1 9 0
2 3 3 0
5 3 9 0
octave:11> function idx = ccen(a,d)
> [~,y]=min(d,[],1);
> idx=a(y,2);
> end
octave:12> out = accumarray(a(:,1), a(:,3), [], @(x) ccen(a,x))
out =
0
1
0
0
1
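Note that this output only shows the scope issue being resolved: y is still the row index within each group, so for group 2 ccen returns a(2,2) = 1 rather than a(3,2) = 3. A minimal sketch of one way to get the desired column 4, assuming the group labels in column 1 are positive integers (the names rows, pick and best are hypothetical), is to accumulate global row numbers instead of the values themselves:

```matlab
a = [5 1 8 0; 2 1 9 0; 2 3 3 0; 5 3 9 0];
rows = (1:size(a,1)).';                       % global row numbers
% For the rows r of one group, return the global row holding the minimum of column 3
pick = @(r) r(find(a(r,3) == min(a(r,3)), 1));
best = accumarray(a(:,1), rows, [], pick);    % best(g) = winning row of group g
a(:,4) = a(best(a(:,1)), 2);                  % copy column 2 of the winning row to the whole group
disp(a)
```

If a group has tied minima, which of the tied rows is picked depends on the order in which accumarray hands the rows to pick, so sort r inside pick if that matters.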

Related

get x elements from center of vector

How do I create a function (e.g. here, an anonymous one but I don't mind any) to get x elements from vec that are most centered (i.e. around the median)? In essence I want a function with same syntax as Matlab's randsample(n,k), but for non-random, with elements spanning around the center.
cntr=@(vec,x) vec(round(end*.5)+(-floor(x/2):floor(x/2))); %this function in question
cntr(1:10,3) % outputs 3 values around median 5.5 => [4 5 6];
cntr(1:11,5) % outputs => [4 5 6 7 8]
Note that vec is always sorted.
One part that I struggle with is not to output more than the limits of vec. For example, cntr(1:10, 10) should not throw an error.
edit: sorry to answer-ers for many updates of question
It's not a one-line anonymous function, but you can do this pretty simply with a couple calls to sort:
function vec = cntr(vec, x)
[~, index] = sort(abs(vec-median(vec)));
vec = vec(sort(index(1:min(x, end))));
end
The upside: it will still return the same set of values even if vec isn't sorted. Some examples:
>> cntr(1:10, 3)
ans =
4 5 6
>> cntr(1:11, 5)
ans =
4 5 6 7 8
>> cntr(1:10, 10) % No indexing errors
ans =
1 2 3 4 5 6 7 8 9 10
>> cntr([3 10 2 4 1 6 5 8 11 7 9], 5) % Unsorted version of example 2
ans =
4 6 5 8 7 % Same values, in their original order in vec
OLD ANSWER
NOTE: This applied to an earlier version of the question where a range of x values below and x values above the median were desired as output. Leaving it for posterity...
I broke it down into these steps (starting with a sorted vec):
Find the values in vec less than the median, get the last x indices of these, then take the first (smallest) of them. This is the starting index.
Find the values in vec greater than the median, get the first x indices of these, then take the last (largest) of them. This is the ending index.
Use the starting and ending indices to select the center portion of vec.
Here's the implementation of the above, using the functions find, min, and max:
cntr = @(vec, x) vec(min(find(vec < median(vec), x, 'last')):max(find(vec > median(vec), x)));
And a few tests:
>> cntr(1:10, 3) % 3 above and 3 below 5.5
ans =
3 4 5 6 7 8
>> cntr(1:11, 5) % 5 above and 5 below 6 (i.e. all of vec)
ans =
1 2 3 4 5 6 7 8 9 10 11
>> cntr(1:10, 10) % 10 above and 10 below 5.5 (i.e. all of vec, no indexing errors)
ans =
1 2 3 4 5 6 7 8 9 10
median requires sorting the array elements. Might as well sort manually, and pick out the middle block (edit: OP's comment indicates elements are already sorted, more justification for keeping it simple):
function data = cntr(data,x)
x = min(x,numel(data)); % don't pick more elements than exist
data = sort(data);
start = floor((numel(data)-x)/2) + 1;
data = data(start:start+x-1);
end
You could stick this into a single-line anonymous function with some tricks, but that just makes the code ugly. :)
Note that in the case of an uneven division (when we don't leave an even number of elements out), here we prioritize an element on the left. Here is what I mean:
0 0 0 0 0 0 0 0 0 0 0   => 11 elements, x=4
      \_____/
      picking these 4 values
This choice could be made more complex, for example shifting the interval left or right depending on which of those values is closest to the mean.
Given data (i.e. vec) is already sorted, the indexing operation can be kept to a single line:
cntr = @(data,x) data( floor((numel(data)-x)/2) + (1:x) );
The thing that is missing in that line is x = min(x,numel(data)), which we need to add twice because we can't change a variable in an anonymous function:
cntr = @(data,x) data( floor((numel(data)-min(x,numel(data)))/2) + (1:min(x,numel(data))) );
This we can simplify to:
cntr = @(data,x) data( floor(max(numel(data)-x,0)/2) + (1:min(x,numel(data))) );
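A quick check of the final anonymous function against the examples from the question, as a sketch (the last call, with x larger than numel(data), is an extra case added here for illustration):

```matlab
cntr = @(data,x) data( floor(max(numel(data)-x,0)/2) + (1:min(x,numel(data))) );
r1 = cntr(1:10, 3);    % [4 5 6]
r2 = cntr(1:11, 5);    % [4 5 6 7 8]
r3 = cntr(1:10, 10);   % all of 1:10, no indexing error
r4 = cntr(1:10, 20);   % x exceeds the length, still all of 1:10
```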

Matlab One Hot Encoding - convert column with categoricals into several columns of logicals

CONTEXT
I have a large number of columns with categoricals, all with different, unrankable choices. To make my life easier for analysis, I'd like to take each of them and convert it to several columns with logicals. For example:
1 GENRE
2 Pop
3 Classical
4 Jazz
...would turn into...
1  Pop  Classical  Jazz
2  1    0          0
3  0    1          0
4  0    0          1
PROBLEM
I've tried using ind2vec but this only works with numericals or logicals. I've also come across this but am not sure it works with categoricals. What is the right function to use in this case?
If you want to convert from a categorical vector to a logical array, you can use the unique function to generate column indices, then perform your encoding using any of the options from this related question:
% Sample data:
data = categorical({'Pop'; 'Classical'; 'Jazz'; 'Pop'; 'Pop'; 'Jazz'});
% Get unique categories and create indices:
[genre, ~, index] = unique(data)
genre =
Classical
Jazz
Pop
index =
3
1
2
3
3
2
% Create logical matrix:
mat = logical(accumarray([(1:numel(index)).' index], 1))
mat =
6×3 logical array
0 0 1
1 0 0
0 1 0
0 0 1
0 0 1
0 1 0
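As a sketch of an alternative that avoids accumarray, the same matrix can be built by comparing index against the column numbers with bsxfun. This version assumes the labels are held in a cell array of strings (labels is a hypothetical name) rather than a categorical, so it should also run where categorical is unavailable:

```matlab
labels = {'Pop'; 'Classical'; 'Jazz'; 'Pop'; 'Pop'; 'Jazz'};
[genre, ~, index] = unique(labels);            % genre is sorted: Classical, Jazz, Pop
mat = bsxfun(@eq, index(:), 1:numel(genre));   % row i is true in the column of its category
disp(mat)
```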
ind2vec can still be used if you first convert the categorical data to a cell array of strings with cellstr and then map each string to a numeric index.
This code may help (adapted from this, with only small changes):
data = categorical({'Pop'; 'Classical'; 'Jazz';});
GENRE = cellstr(data); %change categorical data into cell strings
[~, loc] = ismember(GENRE, unique(GENRE));
genre = ind2vec(loc')';
Gen=full(genre);
array2table(Gen, 'VariableNames', unique(GENRE))
Running this code returns:
ans =
Classical Jazz Pop
_________ ____ ___
0 0 1
1 0 0
0 1 0
You can call unique(GENRE) to check the categories (as cell strings). Meanwhile, logical(Gen) (or logical(full(genre))) contains the logical columns that you need.
P.S. A categorical array might be faster than a cell array of strings, but ind2vec doesn't work with it directly. unique and accumarray might be the better option.

How to delete rows from a matrix that contain more than 50% zeros MATLAB

I want to remove the rows in an array that contain more than 50% of null elements.
eg:
if the input is
1 0 0 0 5 0
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
0 0 4 0 1 0
I want to remove rows 1 and 5, but retain the rest. The output should look like:
2 3 5 4 3 1
3 0 0 4 3 0
2 0 9 8 2 1
I want to do this using matlab
Use logical indexing into the rows, based on the mean of the rows of A negated:
t = .5; % threshold
A(mean(A==0,2) > t, :) = [];
What this does:
Compare A with 0: turns zeros into true, and nonzeros into false.
Compute the mean of each row.
Compare that to the desired threshold.
Use the result as a logical index to delete unwanted rows.
Equivalently, you can keep the wanted rows instead of removing the unwanted ones. This may be faster depending on the proportion of rows:
A = A(mean(A~=0,2) >= 1-t, :);
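Applied to the example matrix from the question, as a minimal sketch (keep and B are hypothetical names). Row 3 contains exactly 50% zeros, so it is retained because the deletion test is strict:

```matlab
A = [1 0 0 0 5 0; 2 3 5 4 3 1; 3 0 0 4 3 0; 2 0 9 8 2 1; 0 0 4 0 1 0];
t = .5;                      % threshold
keep = mean(A==0,2) <= t;    % rows whose fraction of zeros does not exceed t
B = A(keep,:);
disp(B)
```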
You can also use the standardizeMissing function and rmmissing function together to achieve this:
>> [~,tf] = rmmissing(standardizeMissing(A,0),'MinNumMissing',floor(0.5*size(A,2))+1);
>> A(~tf,:)
The call to standardizeMissing replaces the 0 values with NaN (the standard missing indicator for double), then the rmmissing call identifies in the logical vector tf the rows that have more than 50% of their entries as 0 (i.e., those rows that have at least floor(0.5*size(A,2))+1 zero-valued entries). Then you can just negate the tf output and use it as an indexer. You can easily adapt the minimum number missing to satisfy whatever percentage criterion you want.
Also note that tf is a logical vector here that is only the size of the number of rows of A.
As I mentioned on Luis' answer, one downside to his approach is that it requires an intermediate logical array of the same size as A to be created, which can potentially incur a significant memory/performance penalty when working with large arrays.
An explicit looped approach with nnz (overly verbose, for clarity):
[nrows, ncols] = size(A);
maximum_ratio_of_zeros = 0.5;
minimum_ratio_of_nonzeros = 1 - maximum_ratio_of_zeros;
todelete = false(nrows, 1);
for ii = 1:nrows
if nnz(A(ii,:))/ncols < minimum_ratio_of_nonzeros
todelete(ii) = true;
end
end
A(todelete,:) = [];
Which returns the desired answer.

matlab making repetitions of data from one array into another array [duplicate]

I would appreciate your help a lot. I am a beginner in programming and I am using MATLAB. I have a 431x1 array of type double containing the numbers 1 to 6, for example: 1 4 5 3 2 6 6 3 3 5 4 1 ... I need to make a new array in which each element is repeated 11 times, for example: a(1:11)=1; a(12:22)=4; a(23:33)=5; or, to illustrate differently: a=[1 1 1 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4...];
I tried doing it in a loop but had some problems. Which approach would you suggest? Do you know of any function I could take advantage of?
First of all, it would help if you could format your code in separate blocks to make your question easier to read...
Let's say you had an array of length Nx1 as:
x = [1 2 3 4 5 ...]';
You could construct a loop and concatenate as:
for i = 1 : length(x)
y(1 + (i - 1) * 11 : 1 + i * 11) = x(i); % Copy to a moving block
end
y(end) = []; % Delete the superfluous one at the end
You could also look at functions like repmat in the MATLAB help for replicating arrays.
Try this (NRep is how many times you want it repeated):
x = [1, 2, 3, 4, 5];
NRep = 5;
y = reshape(repmat(x,[NRep,1]),[1,length(x)*NRep])
Since it's a little cumbersome to write that out, I also particularly like using this "hack":
x = [1, 2, 3, 4, 5];
NRep = 5;
y = kron(x, ones(1,NRep));
Hope that helps!
P.S.: This is designed for row vectors only. Though if you need column vectors it's easy to modify.
edit: Of course, if you're on R2015a or later you can just use y=repelem(x,NRep). I tend to forget about those because I work on older Matlabs (and sometimes it's not such a bad idea to be a bit backward compatible). Thanks to @rahnema1 for reminding me.
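For column vectors, a sketch of the corresponding modifications (xc is a hypothetical name for a column input; NRep is set to 3 here only to keep the output short):

```matlab
xc = [1; 4; 5];
NRep = 3;
y1 = kron(xc, ones(NRep,1));                   % each element stacked NRep times
y2 = reshape(repmat(xc.', [NRep,1]), [], 1);   % same result via repmat and reshape
disp(y1.')
```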

Using bsxfun with an anonymous function

After trying to understand the bsxfun function, I have tried to use it in a script to avoid looping. I am trying to check whether each individual element in an array is contained in a matrix, returning a matrix the same size as the initial array containing 1s and 0s respectively. The anonymous function I have created is:
myfunction = @(x,y) (sum(any(x == y)));
x is the matrix which will contain the 'accepted values', per se. y is the input array. So far I have tried using the bsxfun function in this way:
dummyvar = bsxfun(myfunction,dxcp,X)
I understand that myfunction is the handle of the anonymous function and that bsxfun can be used to accomplish this; I just do not understand the reason for the following error:
Non-singleton dimensions of the two input arrays must match each other.
I am using the following test data:
dxcp = [1 2 3 6 10 20];
X = [2 5 9 18];
and hope for the output to be:
dummyvar = [1,0,0,0]
Cheers, NZBRU.
EDIT: Reached 15 rep so I have updated the answer
Thanks again guys, I thought I would update this as I now understand how the solution provided by Divakar works. This might prevent confusion for others who have read my initial question and are unsure how bsxfun() works; I think writing it out helps me understand it better too.
Note: The following may be incorrect, I have just tried to understand how the function operates by looking at this one case.
The input into the bsxfun function was dxcp and X transposed. The function handle used was @eq so each element was compared.
%%// Given data
dxcp = [1 2 3 6 10 20];
X = [2 5 9 18];
The following code:
bsxfun(@eq,dxcp,X')
compared every value of dxcp, the first input variable, to every row of X'. The following matrix is the output of this:
dummyvar =
0 1 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
0 0 0 0 0 0
The first element was found by comparing 1 and 2: dxcp = [1 2 3 6 10 20]; X' = [2;5;9;18];
The next along the first row was found by comparing 2 and 2: dxcp = [1 2 3 6 10 20]; X' = [2;5;9;18];
This was repeated until all of the values of dxcp were compared to the first row of X'. Following this logic, the first element in the second row was calculated using the comparison between 1 and 5: dxcp = [1 2 3 6 10 20]; X' = [2;5;9;18];
The final solution provided was any(bsxfun(@eq,dxcp,X'),2) which is equivalent to: any(dummyvar,2). http://nf.nci.org.au/facilities/software/Matlab/techdoc/ref/any.html seems to explain the any function in detail well. Basically, say:
A = [1,2;0,0;0,1]
If the following code is run:
result = any(A,2)
Then the function any will check if each row contains one or several non-zero elements and return 1 if so. The result of this example would be:
result = [1;0;1];
Because the second input parameter is equal to 2. If the above line was changed to result = any(A,1) then it would check for each column.
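A minimal sketch of both directions on the same A (byRow and byCol are hypothetical names):

```matlab
A = [1,2;0,0;0,1];
byRow = any(A,2);   % one result per row:    [1;0;1]
byCol = any(A,1);   % one result per column: [1 1]
```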
Using this logic,
result = any(dummyvar,2)
was used to obtain the final result.
1
0
0
0
which if needed could be transposed to equal
[1,0,0,0]
Performance: after running the following code:
tic
dummyvar = ~any(bsxfun(@eq,dxcp,X'),2)'
toc
It was found that the duration was:
Elapsed time is 0.000085 seconds.
The alternative below:
tic
arrayfun(@(el) any(el == dxcp),X)
toc
using the arrayfun() function (which applies a function to each element of an array) resulted in a runtime of:
Elapsed time is 0.000260 seconds.
^The above run times are averages over 5 runs of each meaning that in this case bsxfun() is faster (on average).
You don't want every combination of elements thrown into your any(x == y) test, you want each element from dxcp tested to see if it exists in X. So here is the short version, which also needs no transposes. Vectorization should also be a bit faster than bsxfun.
arrayfun(@(el) any(el == X), dxcp)
The result is
ans =
0 1 0 0 0 0
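For comparison, a sketch using the built-in ismember, which performs the same membership test without an explicit anonymous function (inX, inDxcp and viaBsx are hypothetical names). It also hints at why the original bsxfun call failed: dxcp is 1x6 and X is 1x4, so their second dimensions differ and neither is a singleton, whereas transposing X makes the shapes 1x6 and 4x1.

```matlab
dxcp = [1 2 3 6 10 20];
X = [2 5 9 18];
inX    = ismember(dxcp, X);                  % which elements of dxcp occur in X
inDxcp = ismember(X, dxcp);                  % which elements of X occur in dxcp
viaBsx = any(bsxfun(@eq, dxcp, X'), 2)';     % same as inDxcp, via the 4x6 comparison matrix
```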

Resources