Split vector in MATLAB - arrays

I'm trying to elegantly split a vector. For example,
vec = [1 2 3 4 5 6 7 8 9 10]
According to another vector of 0's and 1's of the same length where the 1's indicate where the vector should be split - or rather cut:
cut = [0 0 0 1 0 0 0 0 1 0]
Giving us a cell output similar to the following:
[1 2 3] [5 6 7 8] [10]

Solution code
You can use cumsum & accumarray for an efficient solution -
%// Create ID/labels for use with accumarray later on
id = cumsum(cut)+1
%// Mask to get valid values from cut and vec corresponding to ones in cut
mask = cut==0
%// Finally get the output with accumarray using masked IDs and vec values
out = accumarray(id(mask).',vec(mask).',[],#(x) {x})
Benchmarking
Here are some performance numbers when using a large input on the three most popular approaches listed to solve this problem -
N = 100000; %// Input Datasize
vec = randi(100,1,N); %// Random inputs
cut = randi(2,1,N)-1;
disp('-------------------- With CUMSUM + ACCUMARRAY')
tic
id = cumsum(cut)+1;
mask = cut==0;
out = accumarray(id(mask).',vec(mask).',[],#(x) {x});
toc
disp('-------------------- With FIND + ARRAYFUN')
tic
N = numel(vec);
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(#(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);
toc
disp('-------------------- With CUMSUM + ARRAYFUN')
tic
cutsum = cumsum(cut);
cutsum(cut == 1) = NaN; %Don't include the cut indices themselves
sumvals = unique(cutsum); % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = []; %Remove NaN values from sumvals
output = arrayfun(#(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);
toc
Runtimes
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.068102 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.117953 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 12.560973 seconds.
Special case scenario: In cases where you might have runs of 1's, you need to modify few things as listed next -
%// Mask to get valid values from cut and vec corresponding to ones in cut
mask = cut==0
%// Setup IDs differently this time. The idea is to have successive IDs.
id = cumsum(cut)+1
[~,~,id] = unique(id(mask))
%// Finally get the output with accumarray using masked IDs and vec values
out = accumarray(id(:),vec(mask).',[],#(x) {x})
Sample run with such a case -
>> vec
vec =
1 2 3 4 5 6 7 8 9 10
>> cut
cut =
1 0 0 1 1 0 0 0 1 0
>> celldisp(out)
out{1} =
2
3
out{2} =
6
7
8
out{3} =
10

For this problem, a handy function is cumsum, which can create a cumulative sum of the cut array. The code that produces an output cell array is as follows:
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [0 0 0 1 0 0 0 0 1 0];
cutsum = cumsum(cut);
cutsum(cut == 1) = NaN; %Don't include the cut indices themselves
sumvals = unique(cutsum); % Find the values to use in indexing vec for the output
sumvals(isnan(sumvals)) = []; %Remove NaN values from sumvals
output = {};
for i=1:numel(sumvals)
output{i} = vec(cutsum == sumvals(i)); %#ok<SAGROW>
end
As another answer shows, you can use arrayfun to create a cell array with the results. To apply that here, you'd replace the for loop (and the initialization of output) with the following line:
output = arrayfun(#(val) vec(cutsum == val), sumvals, 'UniformOutput', 0);
That's nice because it doesn't end up growing the output cell array.
The key feature of this routine is the variable cutsum, which ends up looking like this:
cutsum =
0 0 0 NaN 1 1 1 1 NaN 2
Then all we need to do is use it to create indices to pull the data out of the original vec array. We loop from zero to max and pull matching values. Notice that this routine handles some situations that may arise. For instance, it handles 1 values at the very beginning and very end of the cut array, and it gracefully handles repeated ones in the cut array without creating empty arrays in the output. This is because of the use of unique to create the set of values to search for in cutsum, and the fact that we throw out the NaN values in the sumvals array.
You could use -1 instead of NaN as the signal flag for the cut locations to not use, but I like NaN for readability. The -1 value would probably be more efficient, as all you'd have to do is truncate the first element from the sumvals array. It's just my preference to use NaN as a signal flag.
The output of this is a cell array with the results:
output{1} =
1 2 3
output{2} =
5 6 7 8
output{3} =
10
There are some odd conditions we need to handle. Consider the situation:
vec = [1 2 3 4 5 6 7 8 9 10 11 12 13 14];
cut = [1 0 0 1 1 0 0 0 0 1 0 0 0 1];
There are repeated 1's in there, as well as a 1 at the beginning and end. This routine properly handles all this without any empty sets:
output{1} =
2 3
output{2} =
6 7 8 9
output{3} =
11 12 13

You can do this with a combination of find and arrayfun:
vec = [1 2 3 4 5 6 7 8 9 10];
N = numel(vec);
cut = [0 0 0 1 0 0 0 0 1 0];
ind = find(cut);
ind_before = [ind-1 N]; ind_before(ind_before < 1) = 1;
ind_after = [1 ind+1]; ind_after(ind_after > N) = N;
out = arrayfun(#(x,y) vec(x:y), ind_after, ind_before, 'uni', 0);
We thus get:
>> celldisp(out)
out{1} =
1 2 3
out{2} =
5 6 7 8
out{3} =
10
So how does this work? Well, the first line defines your input vector, the second line finds how many elements are in this vector and the third line denotes your cut vector which defines where we need to cut in our vector. Next, we use find to determine the locations that are non-zero in cut which correspond to the split points in the vector. If you notice, the split points determine where we need to stop collecting elements and begin collecting elements.
However, we need to account for the beginning of the vector as well as the end. ind_after tells us the locations of where we need to start collecting values and ind_before tells us the locations of where we need to stop collecting values. To calculate these starting and ending positions, you simply take the result of find and add and subtract 1 respectively.
Each corresponding position in ind_after and ind_before tell us where we need to start and stop collecting values together. In order to accommodate for the beginning of the vector, ind_after needs to have the index of 1 inserted at the beginning because index 1 is where we should start collecting values at the beginning. Similarly, N needs to be inserted at the end of ind_before because this is where we need to stop collecting values at the end of the array.
Now for ind_after and ind_before, there is a degenerate case where the cut point may be at the end or beginning of the vector. If this is the case, then subtracting or adding by 1 will generate a start and stopping position that's out of bounds. We check for this in the 4th and 5th line of code and simply set these to 1 or N depending on whether we're at the beginning or end of the array.
The last line of code uses arrayfun and iterates through each pair of ind_after and ind_before to slice into our vector. Each result is placed into a cell array, and our output follows.
We can check for the degenerate case by placing a 1 at the beginning and end of cut and some values in between:
vec = [1 2 3 4 5 6 7 8 9 10];
cut = [1 0 0 1 0 0 0 1 0 1];
Using this example and the above code, we get:
>> celldisp(out)
out{1} =
1
out{2} =
2 3
out{3} =
5 6 7
out{4} =
9
out{5} =
10

Yet another way, but this time without any loops or accumulating at all...
lengths = diff(find([1 cut 1])) - 1; % assuming a row vector
lengths = lengths(lengths > 0);
data = vec(~cut);
result = mat2cell(data, 1, lengths); % also assuming a row vector
The diff(find(...)) construct gives us the distance from each marker to the next - we append boundary markers with [1 cut 1] to catch any runs of zeros which touch the ends. Each length is inclusive of its marker, though, so we subtract 1 to account for that, and remove any which just cover consecutive markers, so that we won't get any undesired empty cells in the output.
For the data, we mask out any elements corresponding to markers, so we just have the valid parts we want to partition up. Finally, with the data ready to split and the lengths into which to split it, that's precisely what mat2cell is for.
Also, using #Divakar's benchmark code;
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.272810 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.436276 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 17.112259 seconds.
-------------------- With mat2cell
Elapsed time is 0.084207 seconds.
...just sayin' ;)

Here's what you need:
function spl = Splitting(vec,cut)
n=1;
j=1;
for i=1:1:length(b)
if cut(i)==0
spl{n}(j)=vec(i);
j=j+1;
else
n=n+1;
j=1;
end
end
end
Despite how simple my method is, it's in 2nd place for performance:
-------------------- With CUMSUM + ACCUMARRAY
Elapsed time is 0.264428 seconds.
-------------------- With FIND + ARRAYFUN
Elapsed time is 0.407963 seconds.
-------------------- With CUMSUM + ARRAYFUN
Elapsed time is 18.337940 seconds.
-------------------- SIMPLE
Elapsed time is 0.271942 seconds.

Unfortunately there is no 'inverse concatenate' in MATLAB. If you wish to solve a question like this you can try the below code. It will give you what you looking for in the case where you have two split point to produce three vectors at the end. If you want more splits you will need to modify the code after the loop.
The results are in n vector form. To make them into cells, use num2cell on the results.
pos_of_one = 0;
% The loop finds the split points and puts their positions into a vector.
for kk = 1 : length(cut)
if cut(1,kk) == 1
pos_of_one = pos_of_one + 1;
A(1,one_pos) = kk;
end
end
F = vec(1 : A(1,1) - 1);
G = vec(A(1,1) + 1 : A(1,2) - 1);
H = vec(A(1,2) + 1 : end);

Related

Shuffle array while spacing repeating elements

I'm trying to write a function that shuffles an array, which contains repeating elements, but ensures that repeating elements are not too close to one another.
This code works but seems inefficient to me:
function shuffledArr = distShuffle(myArr, myDist)
% this function takes an array myArr and shuffles it, while ensuring that repeating
% elements are at least myDist elements away from on another
% flag to indicate whether there are repetitions within myDist
reps = 1;
while reps
% set to 0 to break while-loop, will be set to 1 if it doesn't meet condition
reps = 0;
% randomly shuffle array
shuffledArr = Shuffle(myArr);
% loop through each unique value, find its position, and calculate the distance to the next occurence
for x = 1:length(unique(myArr))
% check if there are any repetitions that are separated by myDist or less
if any(diff(find(shuffledArr == x)) <= myDist)
reps = 1;
break;
end
end
end
This seems suboptimal to me for three reasons:
1) It may not be necessary to repeatedly shuffle until a solution has been found.
2) This while loop will go on forever if there is no possible solution (i.e. setting myDist to be too high to find a configuration that fits). Any ideas on how to catch this in advance?
3) There must be an easier way to determine the distance between repeating elements in an array than what I did by looping through each unique value.
I would be grateful for answers to points 2 and 3, even if point 1 is correct and it is possible to do this in a single shuffle.
I think it is sufficient to check the following condition to prevent infinite loops:
[~,num, C] = mode(myArr);
N = numel(C);
assert( (myDist<=N) || (myDist-N+1) * (num-1) +N*num <= numel(myArr),...
'Shuffling impossible!');
Assume that myDist is 2 and we have the following data:
[4 6 5 1 6 7 4 6]
We can find the the mode , 6, with its occurence, 3. We arrange 6s separating them by 2 = myDist blanks:
6 _ _ 6 _ _6
There must be (3-1) * myDist = 4 numbers to fill the blanks. Now we have five more numbers so the array can be shuffled.
The problem becomes more complicated if we have multiple modes. For example for this array [4 6 5 1 6 7 4 6 4] we have N=2 modes: 6 and 4. They can be arranged as:
6 4 _ 6 4 _ 6 4
We have 2 blanks and three more numbers [ 5 1 7] that can be used to fill the blanks. If for example we had only one number [ 5] it was impossible to fill the blanks and we couldn't shuffle the array.
For the third point you can use sparse matrix to accelerate the computation (My initial testing in Octave shows that it is more efficient):
function shuffledArr = distShuffleSparse(myArr, myDist)
[U,~,idx] = unique(myArr);
reps = true;
while reps
S = Shuffle(idx);
shuffledBin = sparse ( 1:numel(idx), S, true, numel(idx) + myDist, numel(U) );
reps = any (diff(find(shuffledBin)) <= myDist);
end
shuffledArr = U(S);
end
Alternatively you can use sub2ind and sort instead of sparse matrix:
function shuffledArr = distShuffleSparse(myArr, myDist)
[U,~,idx] = unique(myArr);
reps = true;
while reps
S = Shuffle(idx);
f = sub2ind ( [numel(idx) + myDist, numel(U)] , 1:numel(idx), S );
reps = any (diff(sort(f)) <= myDist);
end
shuffledArr = U(S);
end
If you just want to find one possible solution you could use something like that:
x = [1 1 1 2 2 2 3 3 3 3 3 4 5 5 6 7 8 9];
n = numel(x);
dist = 3; %minimal distance
uni = unique(x); %get the unique value
his = histc(x,uni); %count the occurence of each element
s = [sortrows([uni;his].',2,'descend'), zeros(length(uni),1)];
xr = []; %the vector that will contains the solution
%the for loop that will maximize the distance of each element
for ii = 1:n
s(s(:,3)<0,3) = s(s(:,3)<0,3)+1;
s(1,3) = s(1,3)-dist;
s(1,2) = s(1,2)-1;
xr = [xr s(1,1)];
s = sortrows(s,[3,2],{'descend','descend'})
end
if any(s(:,2)~=0)
fprintf('failed, dist is too big')
end
Result:
xr = [3 1 2 5 3 1 2 4 3 6 7 8 3 9 5 1 2 3]
Explaination:
I create a vector s and at the beggining s is equal to:
s =
3 5 0
1 3 0
2 3 0
5 2 0
4 1 0
6 1 0
7 1 0
8 1 0
9 1 0
%col1 = unique element; col2 = occurence of each element, col3 = penalities
At each iteration of our for-loop we choose the element with the maximum occurence since this element will be harder to place in our array.
Then after the first iteration s is equal to:
s =
1 3 0 %1 is the next element that will be placed in our array.
2 3 0
5 2 0
4 1 0
6 1 0
7 1 0
8 1 0
9 1 0
3 4 -3 %3 has now 5-1 = 4 occurence and a penalities of -3 so it won't show up the next 3 iterations.
at the end every number of the second column should be equal to 0, if it's not the minimal distance was too big.

Replace +/- values around index - MATLAB

Following this question and the precious help I got from it, I've reached to the following issue:
Using indices of detected peaks and having computed the median of my signal +/-3 datapoints around these peaks, I need to replace my signal in a +/-5 window around the peak with the previously computed median.
I'm only able replace the datapoint at the peak with the median, but not the surrounding +/-5 data points...see figure. Black = original peak; Yellow = data point at peak changed to the median of +/-3 datapoints around it.
Original peak and changed peak
Unfortunately I have not been able to make it work by following suggestions on the previous question.
Any help will be very much appreciated!
Cheers,
M
Assuming you mean the following. Given the array
x = [0 1 2 3 4 5 35 5 4 3 2 1 0]
you want to replace 35 and surrounding +/- 5 entries with the median of 3,4,5,35,5,4,3, which is 4, so the resulting array should be
x = [0 4 4 4 4 4 4 4 4 4 4 4 0]
Following my answer in this question an intuitive approach is to simply replace the neighbors with the median value by offsetting the indicies. This can be accomplished as follows
[~,idx]=findpeaks(x);
med_sz = 3; % Take the median with respect to +/- this many neighbors
repl_sz = 5; % Replace neighbors +/- this distance from peak
if ~isempty(idx)
m = medfilt1(x,med_sz*2+1);
N = numel(x);
for offset = -repl_sz:repl_sz
idx_offset = idx + offset;
idx_valid = idx_offset >= 1 & idx_offset <= N;
x(idx_offset(idx_valid)) = m(idx(idx_valid));
end
end
Alternatively, if you want to avoid loops, an equivalent loopless implementation is
[~,idx]=findpeaks(x);
med_sz = 3;
repl_sz = 5;
if ~isempty(idx)
m = medfilt1(x,med_sz*2+1);
idx_repeat = repmat(idx,repl_sz*2+1,1);
idx_offset = idx_repeat + repmat((-repl_sz:repl_sz)',1,numel(idx));
idx_valid = idx_repeat >= 1 & idx_repeat <= numel(x);
idx_repeat = idx_repeat(idx_valid);
idx_offset = idx_offset(idx_valid);
x(idx_offset) = m(idx_repeat);
end

Sum up vector values till threshold, then start again

I have a vector a = [1 3 4 2 1 5 6 3 2]. Now I want to create a new vector 'b' with the cumsum of a, but after reaching a threshold, let's say 5, cumsum should reset and start again till it reaches the threshold again, so the new vector should look like this:
b = [1 4 4 2 3 5 6 3 5]
Any ideas?
You could build a sparse matrix that, when multiplied by the original vector, returns the cumulative sums. I haven't timed this solution versus others, but I strongly suspect this will be the fastest for large arrays of a.
% Original data
a = [1 3 4 2 1 5 6 3 2];
% Threshold
th = 5;
% Cumulative sum corrected by threshold
b = cumsum(a)/th;
% Group indices to be summed by checking for equality,
% rounded down, between each cumsum value and its next value. We add one to
% prevent NaNs from occuring in the next step.
c = cumsum(floor(b) ~= floor([0,b(1:end-1)]))+1;
% Build the sparse matrix, remove all values that are in the upper
% triangle.
S = tril(sparse(c.'./c == 1));
% In case you use matlab 2016a or older:
% S = tril(sparse(bsxfun(#rdivide,c.',c) == 1));
% Matrix multiplication to create o.
o = S*a.';
By normalizing the arguments of cumsum with the threshold and flooring you can get grouping indizes for accumarray, which then can do the cumsumming groupwise:
t = 5;
a = [1 3 4 2 1 5 6 3 2];
%// cumulative sum of normalized vector a
n = cumsum(a/t);
%// subs for accumarray
subs = floor( n ) + 1;
%// cumsum of every group
aout = accumarray( subs(:), (1:numel(subs)).', [], #(x) {cumsum(a(x))});
%// gather results;
b = [aout{:}]
One way is to use a loop. You create the first cumulative sum cs, and then as long as elements in cs are larger than your threshold th, you replace them with elements from the cumulative sum on the rest of the elements in a.
Because some elements in a might be larger than th, this loop will be infinite unless we also eliminate these elements too.
Here is a simple solution with a while loop:
a = [1 3 4 2 1 5 6 3 2];
th = 5;
cs = cumsum(a);
while any(cs>th & cs~=a) % if 'cs' has values larger that 'th',
% and there are any values smaller than th left in 'a'
% sum all the values in 'a' that are after 'cs' reached 'th',
% excluding values that are larger then 'th'
cs(cs>th & cs~=a) = cumsum(a(cs>th & cs~=a));
end
Calculate the cumulative sum and replace the indices value obeying your condition.
a = [1 3 4 2 1 5 6 3 2] ;
b = [1 4 4 2 3 5 6 3 5] ;
iwant = a ;
a_sum = cumsum(a) ;
iwant(a_sum<5) = a_sum(a_sum<5) ;

Vectorization- Matlab

Given a vector
X = [1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3]
I would like to generate a vector such
Y = [1 2 3 4 5 1 2 3 4 5 6 1 2 3 4 5]
So far what I have got is
idx = find(diff(X))
Y = [1:idx(1) 1:idx(2)-idx(1) 1:length(X)-idx(2)]
But I was wondering if there is a more elegant(robust) solution?
One approach with diff, find & cumsum for a generic case -
%// Initialize array of 1s with the same size as input array and an
%// intention of using cumsum on it after placing "appropriate" values
%// at "strategic" places for getting the final output.
out = ones(size(X))
%// Find starting indices of each "group", except the first group, and
%// by group here we mean run of identical numbers.
idx = find(diff(X))+1
%// Place differentiated and subtracted values of indices at starting locations
out(idx) = 1-diff([1 idx])
%// Perform cumulative summation for the final output
Y = cumsum(out)
Sample run -
X =
1 1 1 1 2 2 3 3 3 3 3 4 4 5
Y =
1 2 3 4 1 2 1 2 3 4 5 1 2 1
Just for fun, but customary bsxfun based alternative solution -
%// Logical mask with each column of ones for presence of each group elements
mask = bsxfun(#eq,X(:),unique(X(:).')) %//'
%// Cumulative summation along columns and use masked values for final output
vals = cumsum(mask,1)
Y = vals(mask)
Here's another approach:
Y = sum(triu(bsxfun(#eq, X, X.')), 1);
This works as follows:
Compare each element with all others (bsxfun(...)).
Keep only comparisons with current or previous elements (triu(...)).
Count, for each element, how many comparisons are true (sum(..., 1)); that is, how many elements, up to and including the current one, are equal to the current one.
Another method is using the function unique
like this:
[unqX ind Xout] = unique(X)
Y = [ind(1):ind(2) 1:ind(3)-ind(2) 1:length(X)-ind(3)]
Whether this is more elegant is up to you.
A more robust method will be:
[unqX ind Xout] = unique(X)
for ii = 1:length(unqX)-1
Y(ind(ii):ind(ii+1)-1) = 1:(ind(ii+1)-ind(ii));
end

matlab: eliminate elements from array

I have quite big array. To make things simple lets simplify it to:
A = [1 1 1 1 2 2 3 3 3 3 4 4 5 5 5 5 5 5 5 5];
So, there is a group of 1's (4 elements), 2's (2 elements), 3's (4 elements), 4's (2 elements) and 5's (8 elements). Now, I want to keep only columns, which belong to group of 3 or more elements. So it will be like:
B = [1 1 1 1 3 3 3 3 5 5 5 5 5 5 5 5];
I was doing it using for loop, scanning separately 1's, 2's, 3's and so on, but its extremely slow with big arrays...
Thanks for any suggestions how to do it in more efficient way :)
Art.
A general approach
If your vector is not necessarily sorted, then you need to run to count the number of occurrences of each element in the vector. You have histc just for that:
elem = unique(A);
counts = histc(A, elem);
B = A;
B(ismember(A, elem(counts < 3))) = []
The last line picks the elements that have less than 3 occurrences and deletes them.
An approach for a grouped vector
If your vector is "semi-sorted", that is if similar elements in the vector are grouped together (as in your example), you can speed things up a little by doing the following:
start_idx = find(diff([0, A]))
counts = diff([start_idx, numel(A) + 1]);
B = A;
B(ismember(A, A(start_idx(counts < 3)))) = []
Again, note that the vector need not to be entirely sorted, just that similar elements are adjacent to each other.
Here is my two-liner
counts = accumarray(A', 1);
B = A(ismember(A, find(counts>=3)));
accumarray is used to count the individual members of A. find extracts the ones that meet your '3 or more elements' criterion. Finally, ismember tells you where they are in A. Note that A needs not be sorted. Of course, accumarray only works for integer values in A.
What you are describing is called run-length encoding.
There is software for this in Matlab on the FileExchange. Or you can do it directly as follows:
len = diff([ 0 find(A(1:end-1) ~= A(2:end)) length(A) ]);
val = A(logical([ A(1:end-1) ~= A(2:end) 1 ]));
Once you have your run-length encoding you can remove elements based on the length. i.e.
idx = (len>=3)
len = len(idx);
val = val(idx);
And then decode to get the array you want:
i = cumsum(len);
j = zeros(1, i(end));
j(i(1:end-1)+1) = 1;
j(1) = 1;
B = val(cumsum(j));
Here's another way to do it using matlab built-ins.
% Set up
A=[1 1 1 1 2 2 3 3 3 3 4 4 5 5 5 5 5];
threshold=2;
% Get the unique elements of the array
uniqueElements=unique(A);
% Count haw many times each unique element occurs
counts=histc(A,uniqueElements);
% Write which elements should be kept
toKeep=uniqueElements(counts>threshold);
% Make a logical index
indexer=false(size(A));
for i=1:length(toKeep)
% For every unique element we want to keep select the indices in A that
% are equal
indexer=indexer|(toKeep(i)==A);
end
% Apply index
B=A(indexer);

Resources