Mean of parts of an array in Octave

I have two arrays. One is a list of lengths within the other. For example
zarray = [1 2 3 4 5 6 7 8 9 10]
and
lengths = [1 3 2 1 3]
I want to average (take the mean) over parts of the first array, with the lengths given by the second. For this example, the result would be:
[mean([1]),mean([2,3,4]),mean([5,6]),mean([7]),mean([8,9,10])]
I am trying to avoid looping, for the sake of speed. I tried using mat2cell and cellfun as follows
zcell = mat2cell(zarray,[1],lengths);
zcellsum = cellfun('mean',zcell);
But the cellfun part is very slow. Is there a way to do this without looping or cellfun?

Here is a fully vectorized solution (no explicit for-loops, or hidden loops with ARRAYFUN, CELLFUN, ..). The idea is to use the extremely fast ACCUMARRAY function:
%# data
zarray = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];
%# generate subscripts: 1 2 2 2 3 3 4 5 5 5
endLocs = cumsum(lengths(:));
subs = zeros(endLocs(end),1);
subs([1;endLocs(1:end-1)+1]) = 1;
subs = cumsum(subs);
%# mean of each part
means = accumarray(subs, zarray) ./ lengths(:)
The result in this case:
means =
1
3
5.5
7
9
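As a side note, the subscript vector could also be built in a single step with repelem, assuming a release that provides it (older MATLAB and Octave versions do not). This is only a sketch of an alternative to the cumsum construction above, using the same data; it is not part of the timed comparison:

```matlab
% data from the question
zarray  = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];

% repeat group number k exactly lengths(k) times: 1 2 2 2 3 3 4 5 5 5
subs = repelem((1:numel(lengths)).', lengths(:));

% sum each group with accumarray, then divide by the group sizes
means = accumarray(subs, zarray(:)) ./ lengths(:);
```

Both inputs are forced to columns with (:), so means comes back as a column vector regardless of the orientation of zarray and lengths.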
Speed test:
Consider the following comparison of the different methods. I am using the TIMEIT function by Steve Eddins:
function [t,v] = testMeans()
%# generate test data
[arr,len] = genData();
%# define functions
f1 = @() func1(arr,len);
f2 = @() func2(arr,len);
f3 = @() func3(arr,len);
f4 = @() func4(arr,len);
%# timeit
t(1) = timeit( f1 );
t(2) = timeit( f2 );
t(3) = timeit( f3 );
t(4) = timeit( f4 );
%# return results to check their validity
v{1} = f1();
v{2} = f2();
v{3} = f3();
v{4} = f4();
end
function [arr,len] = genData()
%#arr = [1 2 3 4 5 6 7 8 9 10];
%#len = [1 3 2 1 3];
numArr = 10000; %# number of elements in array
numParts = 500; %# number of parts/regions
arr = rand(1,numArr);
len = zeros(1,numParts);
len(1:end-1) = diff(sort( randperm(numArr,numParts) ));
len(end) = numArr - sum(len);
end
function m = func1(arr, len)
%# @Drodbar: for-loop
idx = 1;
N = length(len);
m = zeros(1,N);
for i=1:N
m(i) = mean( arr(idx+(0:len(i)-1)) );
idx = idx + len(i);
end
end
function m = func2(arr, len)
%# @user1073959: MAT2CELL+CELLFUN
m = cellfun(@mean, mat2cell(arr, 1, len));
end
function m = func3(arr, len)
%# @Drodbar: ARRAYFUN+CELLFUN
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(len), len, 'UniformOutput',false);
m = cellfun(@(a) mean(arr(a)), idx);
end
function m = func4(arr, len)
%# @Amro: ACCUMARRAY
endLocs = cumsum(len(:));
subs = zeros(endLocs(end),1);
subs([1;endLocs(1:end-1)+1]) = 1;
subs = cumsum(subs);
m = accumarray(subs, arr) ./ len(:);
if isrow(len)
m = m';
end
end
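Below are the timings. Tests were performed on a WinXP 32-bit machine with MATLAB R2012a. My method is more than an order of magnitude faster than all the other methods (about 40 times in this run). The for-loop and MAT2CELL+CELLFUN versions are effectively tied for second place.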
>> [t,v] = testMeans();
>> t
t =
    0.013098    0.013074    0.022407    0.00031807
       |           |           |            \_________ @Amro: ACCUMARRAY (!)
       |           |           \______________________ @Drodbar: ARRAYFUN+CELLFUN
       |           \__________________________________ @user1073959: MAT2CELL+CELLFUN
       \______________________________________________ @Drodbar: FOR-loop
Furthermore, all results are correct and equal. The differences are on the order of eps, the machine precision (caused by different ways of accumulating round-off errors), and can therefore be ignored:
%#assert( isequal(v{:}) )
>> maxErr = max(max( diff(vertcat(v{:})) ))
maxErr =
3.3307e-16

Here is a solution using arrayfun and cellfun
zarray = [1 2 3 4 5 6 7 8 9 10];
lengths = [1 3 2 1 3];
% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);
Your desired output result:
means =
1.0000 3.0000 5.5000 7.0000 9.0000
Following @tmpearce's comment, I did a quick performance comparison between the solution above, which I wrapped in a function called subsetMeans1
function means = subsetMeans1( zarray, lengths)
% Generate the indexes for the elements contained within each length specified
% subset. idx would be {[1], [4, 3, 2], [6, 5], [7], [10, 9, 8]} in this case
idx = arrayfun(@(a,b) a-(0:b-1), cumsum(lengths), lengths,'UniformOutput',false);
means = cellfun( @(a) mean(zarray(a)), idx);
and a simple for loop alternative, function subsetMeans2.
function means = subsetMeans2( zarray, lengths)
% Method based on single loop
idx = 1;
N = length(lengths);
means = zeros( 1, N);
for i = 1:N
means(i) = mean( zarray(idx+(0:lengths(i)-1)) );
idx = idx+lengths(i);
end
I used the following test script, based on TIMEIT, which allows checking performance while varying the number of elements in the input vector and the number of elements per subset:
% Generate some data for the performance test
% Total of elements on the vector to test
nVec = 100000;
% Max of elements per subset
nSubset = 5;
% Data generation aux variables
lenghtsGen = randi( nSubset, 1, nVec);
accumLen = cumsum(lenghtsGen);
maxIdx = find( accumLen < nVec, 1, 'last' );
% % Original test data
% zarray = [1 2 3 4 5 6 7 8 9 10];
% lengths = [1 3 2 1 3];
% Vector to test
zarray = 1:nVec;
lengths = [ lenghtsGen(1:maxIdx) nVec-accumLen(maxIdx)] ;
% Double check that the lengths add up to nVec
assert ( sum(lengths) == nVec)
t1(1) = timeit(@() subsetMeans1( zarray, lengths));
t1(2) = timeit(@() subsetMeans2( zarray, lengths));
fprintf('Time spent subsetMeans1: %f\n',t1(1));
fprintf('Time spent subsetMeans2: %f\n',t1(2));
It turns out that the non-vectorised version without arrayfun and cellfun is faster, presumably due to the extra overhead of those functions:
Time spent subsetMeans1: 2.082457
Time spent subsetMeans2: 1.278473

Related

Finding multiple coincidences between two vectors

I have two vectors, and I'm trying to find ALL coincidences of one on the other within a certain tolerance without using a for loop.
By tolerance I mean for example if I have the number 3, with tolerance 2, I will want to keep values within 3±2, so (1,2,3,4,5).
A = [5 3 4 2]; B = [2 4 4 4 6 8];
I want to obtain a cell array containing on each cell the numbers of all the coincidences with a tolerance of 1 (or more) units. (A = B +- 1)
I have a solution with zero units (A = B), which would look something like this:
tol = 0;
[tf, ia] = ismembertol(B,A,tol,'DataScale',1); % For tol = 0, this is equivalent to using ismember
idx = 1:numel(B);
ib = accumarray(nonzeros(ia), idx(tf), [], @(x){x}) % This gives the cell array
The output is:
ib =
[]
[]
[2 3 4]
[1]
Which is as desired.
If I change the tolerance to 1, the code doesn't work as intended. It outputs instead:
tol = 1;
[tf, ia] = ismembertol(B,A,tol,'DataScale',1); % Same call as before, now with tol = 1
idx = 1:numel(B);
ib = accumarray(nonzeros(ia), idx(tf), [], @(x){x}) % This gives the cell array
ib =
[5]
[2 3 4]
[]
[1]
When I would expect to obtain:
ib =
[2 3 4 5]
[1 2 3 4]
[2 3 4]
[1]
What am I doing wrong? Is there an alternative solution?
Your problem is that, in the current state of your code, ismembertol only outputs 1 index per element of B found in A, so you lose information in the cases where an element can be found several times within tolerance.
As per the documentation, you can use the 'OutputAllIndices',true name-value pair to get what you want in ia with a single call to ismembertol:
A = [5 3 4 2]; B = [2 4 4 4 6 8];
tol = 0;
[tf, ia] = ismembertol(A,B,tol,'DataScale',1,'OutputAllIndices',true);
celldisp(ia) % tol = 0
ia{1} =
0
ia{2} =
0
ia{3} =
2
3
4
ia{4} =
1
celldisp(ia) % tol = 1
ia{1} =
2
3
4
5
ia{2} =
1
2
3
4
ia{3} =
2
3
4
ia{4} =
1
Here is a manual approach, just to provide another method. It computes an intermediate matrix of all absolute differences (using implicit expansion), and from the row and column indices of the entries that do not exceed the tolerance it builds the result:
A = [5 3 4 2];
B = [2 4 4 4 6 8];
tol = 1;
[ii, jj] = find(abs(A(:).'-B(:))<=tol);
ib = accumarray(jj, ii, [numel(A) 1], @(x){x});
Note that this approach
- may be memory-intensive, because of the intermediate matrix;
- can be made to work in old Matlab versions, because it doesn't use ismembertol; but then implicit expansion has to be replaced by explicitly calling bsxfun:
[ii, jj] = find(abs(bsxfun(@minus, A(:).', B(:)))<=tol);
ib = accumarray(jj, ii, [numel(A) 1], @(x){x});
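To cross-check the result on the example data without building the full difference matrix, one option is to handle one element of A at a time. This is a sketch, not one of the answers above; note that arrayfun is a hidden loop, so it trades speed for lower memory use:

```matlab
A = [5 3 4 2];
B = [2 4 4 4 6 8];
tol = 1;

% for each element a of A, list the indices of B lying within tol of a
ib = arrayfun(@(a) find(abs(B - a) <= tol), A, 'UniformOutput', false);
```

Here ib is a 1x4 cell array of row vectors, whereas the accumarray version returns a 4x1 cell array of column vectors; the indices inside each cell are the same.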

Finding number(s) that is(are) repeated consecutively most often

Given this array for example:
a = [1 2 2 2 1 3 2 1 4 4 4 5 1]
I want to find a way to check which numbers are repeated consecutively most often. In this example, the output should be [2 4] since both 2 and 4 are repeated three times consecutively.
Another example:
a = [1 1 2 3 1 1 5]
This should return [1 1] because there are separate instances of 1 being repeated twice.
This is my simple code. I know there is a better way to do this:
function val=longrun(a)
b = a(:)';
b = [b, max(b)+1];
val = [];
sum = 1;
max_occ = 0;
for i = 1:max(size(b))
q = b(i);
for j = i:size(b,2)
if (q == b(j))
sum = sum + 1;
else
if (sum > max_occ)
max_occ = sum;
val = [];
val = [val, q];
elseif (max_occ == sum)
val = [val, q];
end
sum = 1;
break;
end
end
end
if (size(a,2) == 1)
val = val'
end
end
Here's a vectorized way:
a = [1 2 2 2 1 3 2 1 4 4 4 5 1]; % input data
t = cumsum([true logical(diff(a))]); % assign a label to each run of equal values
[~, n, z] = mode(t); % maximum run length and corresponding labels
result = a(ismember(t,z{1})); % build result with repeated values
result = result(1:n:end); % remove repetitions
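An equivalent formulation, sketched here as an alternative that avoids the third output of mode, is to compute the start index and length of every run directly. The variable names starts and runLen are only illustrative, and a is assumed to be a row vector:

```matlab
a = [1 2 2 2 1 3 2 1 4 4 4 5 1];            % input data (row vector)

starts = find([true, diff(a) ~= 0]);        % index where each run begins
runLen = diff([starts, numel(a) + 1]);      % length of each run
result = a(starts(runLen == max(runLen)));  % values of the longest runs
```

For this input runLen is [1 3 1 1 1 1 3 1 1], so result is [2 4]. For a = [1 1 2 3 1 1 5] the same code gives [1 1].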
One solution could be:
%Dummy data
a = [1 2 2 2 1 3 2 1 4 4 4 5 5]
%Preallocation
x = ones(1,numel(a));
%Loop
for ii = 2:numel(a)
if a(ii-1) == a(ii)
x(ii) = x(ii-1)+1;
end
end
%Get the result
a(find(x==max(x)))
This uses a simple for loop: the counter x is incremented whenever the previous value in the vector a is identical to the current one.
Or you could also vectorize the process:
x = a(find(a-circshift(a,1,2)==0)); %compare a with a copy of itself shifted by 1 and keep only the repeated elements.
u = unique(x); %get the unique value of x
h = histc(x,u);
res = u(h==max(h)) %get the result

Find unique elements of multiple arrays

Let's say I have 3 arrays
X = [ 1 3 9 10 ];
Y = [ 1 9 11 20];
Z = [ 1 3 9 11 ];
Now I would like to find the values that appear only once, and which array they belong to.
I generalized EBH's answer to cover flexible number of arrays, arrays with different sizes and multidimensional arrays. This method also can only deal with integer-valued arrays:
function [uniq, id] = uniQ(varargin)
combo = [];
idx = [];
for ii = 1:nargin
combo = [combo; varargin{ii}(:)]; % merge the arrays
idx = [idx; ii*ones(numel(varargin{ii}), 1)];
end
counts = histcounts(combo, min(combo):max(combo)+1);
ids = find(counts == 1); % finding index of unique elements in combo
uniq = min(combo) - 1 + ids(:); % constructing array of unique elements in 'counts'
id = zeros(size(uniq));
for ii = 1:numel(uniq)
ids = find(combo == uniq(ii), 1); % finding index of unique elements in 'combo'
id(ii) = idx(ids); % assigning the corresponding index
end
And this is how it works:
[uniq, id] = uniQ([9, 4], 15, randi(12,3,3), magic(3))
uniq =
1
7
11
12
15
id =
4
4
3
3
2
If you are only dealing with integers and your vectors are equally sized (all with the same number of elements), you can use histcounts for a quick search for unique elements:
X = [1 -3 9 10];
Y = [1 9 11 20];
Z = [1 3 9 11];
XYZ = [X(:) Y(:) Z(:)]; % one matrix with all vectors as columns
counts = histcounts(XYZ,min(XYZ(:)):max(XYZ(:))+1);
R = min(XYZ(:)):max(XYZ(:)); % range of the data
unkelem = R(counts==1);
and then locate them using a loop with find:
pos = zeros(size(unkelem));
counter = 1;
for k = unkelem
[~,pos(counter)] = find(XYZ==k);
counter = counter+1;
end
result = [unkelem;pos]
and you get:
result =
-3 3 10 20
1 3 1 2
So -3, 3, 10 and 20 are unique, and they appear in vectors 1, 3, 1 and 2, respectively.
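Both answers rely on histcounts with unit-width bins, which is why they are limited to integers. A sketch that lifts this restriction by counting with unique and accumarray instead is shown below. It is not taken from either answer, and the names vals, src and counts are illustrative. It uses the arrays from the question:

```matlab
X = [1 3 9 10];
Y = [1 9 11 20];
Z = [1 3 9 11];

vals = [X(:); Y(:); Z(:)];                        % all values in one column
src  = [1*ones(numel(X),1); 2*ones(numel(Y),1); 3*ones(numel(Z),1)];  % source array of each value

[u, ~, k] = unique(vals);          % u: sorted distinct values, k: position of each value in u
counts = accumarray(k(:), 1);      % number of occurrences of each distinct value
uniq = u(counts == 1);             % values that occur exactly once
[~, loc] = ismember(uniq, vals);   % where each of them sits in vals
id = src(loc);                     % array number each one came from
```

With this data uniq is [10; 20] and id is [1; 2], that is, 10 comes from X and 20 from Y.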

MATLAB - Avoid repeated values in a vector inside cell arrays and take next

This is the problem:
I have a cell array of the form indx{ii}, where each indx{ii} is an array of size 1xNii (that is, the arrays have different sizes).
And another cell array of the form indy{jj}, where each indy{jj} is an array of the same size as the corresponding indx{jj}.
I would like to create a function that evaluates the values in the arrays of indx{:} and takes the first one that is not repeated; if it is a repeated value, it takes the next one.
I will try to explain with an example. Suppose we have indx and indy that are the cell arrays:
indx{1} = [1 3 2 7];
indx{2} = [3 8 5];
indx{3} = [3 6 2 9];
indx{4} = [1 3 4];
indx{5} = [3 1 4];
indy{1} = [0.12 0.21 0.31 0.44];
indy{2} = [0.22 0.34 0.54];
indy{3} = [0.13 0.23 0.36 0.41];
indy{4} = [0.12 0.16 0.22];
indy{5} = [0.14 0.19 0.26];
What I want the code to do is take the first value that is not repeated in indx, and the equivalent value in indy. So the answer for the example should be:
ans=
indx{1} = 1;
indx{2} = 3;
indx{3} = 6;
indx{4} = 4;
indx{5} = [];
indy{1} = 0.12;
indy{2} = 0.22;
indy{3} = 0.23;
indy{4} = 0.22;
indy{5} = [];
In ans, for indx{1} the code takes 1 because it is the first value and is not repeated, and it takes the equivalent value in indy. Then for indx{2} it takes 3 because it is the first value and has not been taken as the first value of any array before. But for indx{3} it takes 6, because the first value, 3, is repeated, and it takes the value equivalent to 6 in indy, which is 0.23. For indx{4} the first and second values have already been taken, so the code takes 4 and its equivalent in indy. And last, for indx{5}, since all values are already repeated, the code should take no value.
indx{1} = [1 3 2 7];
indx{2} = [3 8 5];
indx{3} = [3 6 2 9];
indx{4} = [1 3 4];
indx{5} = [3 1 4];
indy{1} = [0.12 0.21 0.31 0.44];
indy{2} = [0.22 0.34 0.54];
indy{3} = [0.13 0.23 0.36 0.41];
indy{4} = [0.12 0.16 0.22];
indy{5} = [0.14 0.19 0.26];
indx2 = NaN(numel(indx),1);
indx2(1) = indx{1}(1);
indy2 = NaN(numel(indy),1);
indy2(1) = indy{1}(1);
for ii = 2:numel(indx)
tmp1 = indx{ii}'; % get the original as array
tmp2 = indy{ii}';
if numel(tmp1)>numel(indx2)
tmp3 = [indx2;NaN(numel(tmp1)-numel(indx2),1)];
tmp4 = [indy2;NaN(numel(tmp1)-numel(indy2),1)];
else
tmp1 = [tmp1;NaN(numel(indx2)-numel(tmp1),1)];
tmp2 = [tmp2;NaN(numel(indx2)-numel(tmp2),1)];
tmp3 = indx2;
tmp4 = indy2;
end
tmp5 = ~ismember(tmp1,tmp3); % find first non equal one
tmp6 = find(tmp5,1,'first');
indx2(ii) = tmp1(tmp6); % save values
indy2(ii) = tmp2(tmp6);
end
N = numel(indx2);
indx2 = mat2cell(indx2, repmat(1,N,1));
N = numel(indy2);
indy2 = mat2cell(indy2, repmat(1,N,1));
indx2 =
[ 1]
[ 3]
[ 6]
[ 4]
[NaN]
What I have done here is first initialise the outputs to have as many entries as your original data has cells. Then I assign the first value of the first cell, since that one will always be unique: it is the first entry. After that I use a for loop that first converts the cell contents (two inputs, two outputs) to regular column arrays for processing with ismember, with which I find all the numbers in the next input cell that do not match the existing numbers in the output. Then find is employed to get the first non-matching number. Lastly, the numbers are assigned to the arrays if present.
As a comment on the usage of booleans with NaN, try NaN ~= NaN and NaN == NaN. The first will give you 1, whilst the second will give you 0. This property makes NaN the ideal choice of filler here; a filler of 0 would not work, because 0 == 0 results in 1:
A = [1,2,5,4,NaN];
B = [1,3,7,NaN,NaN];
ismember(A,B)
ans =
1 0 0 0 0
Thus the NaNs do not equal one another and will therefore not pollute your solution.
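The expected output in the question uses empty cells rather than NaN for "no value". A minimal sketch of the same idea that works directly on the cells and needs no padding is given below. The names picked, outx and outy are illustrative and not part of the answer above:

```matlab
indx = {[1 3 2 7], [3 8 5], [3 6 2 9], [1 3 4], [3 1 4]};
indy = {[0.12 0.21 0.31 0.44], [0.22 0.34 0.54], [0.13 0.23 0.36 0.41], ...
        [0.12 0.16 0.22], [0.14 0.19 0.26]};

picked = [];                 % values already taken from earlier cells
outx = cell(size(indx));     % cells stay empty when nothing can be taken
outy = cell(size(indy));
for ii = 1:numel(indx)
    % position of the first value that has not been taken yet
    k = find(~ismember(indx{ii}, picked), 1, 'first');
    if ~isempty(k)
        outx{ii} = indx{ii}(k);
        outy{ii} = indy{ii}(k);
        picked(end+1) = indx{ii}(k);
    end
end
```

For the example data this leaves outx as {1, 3, 6, 4, []} and outy as {0.12, 0.22, 0.23, 0.22, []}.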

Create all possible Mx1 vectors from an Nx1 vector in MATLAB

I am trying to create all possible 1xM vectors (words) from a 1xN vector (alphabet) in MATLAB, where N > M. For example, I want to create all possible 1x2 "words" from a 1x4 "alphabet" alphabet = [1 2 3 4];
I expect a result like:
[1 1]
[1 2]
[1 3]
[1 4]
[2 1]
[2 2]
...
I want to make M an input to my routine and I do not know it beforehand. Otherwise, I could easily do this using nested for-loops. Is there any way to do this?
Try
[d1 d2] = ndgrid(alphabet);
[d2(:) d1(:)]
To parameterize on M:
d = cell(M, 1);
[d{:}] = ndgrid(alphabet);
for i = 1:M
d{i} = d{i}(:);
end
[d{end:-1:1}]
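The same construction can be written without the explicit loop. The following sketch assumes alphabet is a vector and M >= 2; the variable name words is illustrative:

```matlab
alphabet = [1 2 3 4];
M = 3;

d = cell(1, M);
[d{:}] = ndgrid(alphabet);   % M grids, each of size 4x4x4 here

% flatten every grid to a column; reverse the order so the last column varies fastest
cols  = cellfun(@(g) g(:), fliplr(d), 'UniformOutput', false);
words = cell2mat(cols);      % numel(alphabet)^M rows, M columns
```

Each row of words is one word; the first rows are [1 1 1], [1 1 2], [1 1 3], and so on.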
In general, and in languages that don't have ndgrid in their library, the way to parameterize for-loop nesting is using recursion.
function result = cartesian(alphabet, M)
% Returns an M-by-numel(alphabet)^M matrix with one word per column
if M <= 1
result = alphabet(:).'; % base case: every letter is a word of length 1
else
recursed = cartesian(alphabet, M-1);
N = size(recursed,2); % number of words of length M-1
result = zeros(M, N * numel(alphabet));
for i=1:numel(alphabet)
result(1,1+(i-1)*N:i*N) = alphabet(i);
result(2:M,1+(i-1)*N:i*N) = recursed; % in MATLAB, this line can be vectorized with repmat... but in MATLAB you'd use ndgrid anyway
end
end
end
To get all k-letter combinations from an arbitrary alphabet, use
n = length(alphabet);
aux = dec2base(0:n^k-1,n)
aux2 = aux-'A';
ind = aux2<0;
aux2(ind) = aux(ind)-'0'
aux2(~ind) = aux2(~ind)+10;
words = alphabet(aux2+1)
The alphabet may consist of up to 36 elements (as per dec2base). Those elements may be numbers or characters.
How this works:
The numbers 0, 1, ... , n^k-1 when expressed in base n give all groups of k numbers taken from 0,...,n-1. dec2base does the conversion to base n, but gives the result in the form of strings, so we need to convert each character to the corresponding number (that's the part with aux and aux2). We then add 1 to make the numbers 1,..., n. Finally, we index alphabet with that to use the real letters or numbers of the alphabet.
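The base-n digits can also be extracted arithmetically, which skips the string conversion and therefore the 36-element limit. This is a sketch of that variant rather than the code above; the names codes, weights and digits are illustrative:

```matlab
alphabet = 'abc';
k = 2;
n = numel(alphabet);

codes   = (0:n^k-1).';       % one integer per word
weights = n.^(k-1:-1:0);     % place values, most significant digit first

% digit j of each code is floor(code / weights(j)) modulo n
digits = mod(floor(bsxfun(@rdivide, codes, weights)), n);
words  = alphabet(digits + 1);
```

For example, code 5 is 12 in base 3, so its row of digits is [1 2] and the corresponding word is 'bc'.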
Example with letters:
>> alphabet = 'abc';
>> k = 2;
>> words
words =
aa
ab
ac
ba
bb
bc
ca
cb
cc
Example with numbers:
>> alphabet = [1 3 5 7];
>> k = 2;
>> words
words =
1 1
1 3
1 5
1 7
3 1
3 3
3 5
3 7
5 1
5 3
5 5
5 7
7 1
7 3
7 5
7 7
Use the ndgrid function in MATLAB:
[a,b] = ndgrid(alphabet)
