How to deal with multiple minimum values when using "min" function - arrays

I am using MATLAB's "min" function to determine the index corresponding to the minimum value within an array (just a vector, actually). All's well and good, except I've found that when multiple entries in the array share the minimum value, [C, I] = min(A) returns only one of their indices, and it is not always the first (i.e., smallest) such index. The documentation says that it should be (so, if entry #4 and entry #13 in an array share the same minimum value, it should return I = 4), but that's not what's happening.
Does anyone know how to have the min function return the smallest/lowest index for a shared minimum value within an array/vector? Relatedly, can anyone explain why the function is not behaving as it seemingly should?

If that happens, the values are most likely not exactly the same. Consider
a = [1 2 3 4 2 4 3 1];
b = a;
b(1) = 1+eps; b(end) = 1-eps; % added a small error to the 1st and 8th element
[~,Ia] = min(a);
[~,Ib] = min(b);
where Ia is 1 and Ib would be 8.
A solution is to round off your inputs:
f = 0.1; % rounding off to 1 decimal place
c = round(b/f)*f;
[~,Ic] = min(c);
where Ic will be 1, as expected.
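Alternatively, if all you need is the lowest index attaining the (approximate) minimum, a tolerance-based search avoids having to choose a rounding granularity. A minimal sketch, where the tolerance value is an assumption you should tune to your data:
b = [1+eps 2 3 4 2 4 3 1-eps];
tol = 1e-10; % tolerance: adjust to the noise level of your data
Ifirst = find(b <= min(b) + tol, 1, 'first'); % lowest index within tol of the min; here 1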


What does "uniquetol" do, exactly?

The uniquetol function, introduced in R2015a, computes "unique elements within tolerance". Specifically,
C = uniquetol(A,tol) returns the unique elements in A using tolerance tol.
But the problem of finding unique elements with a given tolerance has several solutions. Which one is actually produced?
Let's see two examples:
Let A = [3 5 7 9] with absolute tolerance 2.5. The output can be [3 7], or it can be [5 9]. Both solutions satisfy the requirement.
For A = [1 3 5 7 9] with absolute tolerance 2.5, the output can be [1 5 9] or [3 7]. So even the number of elements in the output can vary.
See this nice discussion about the transitivity issue that lies at the heart of the problem: with tolerance 2.5, 3 is within tolerance of 5, and 5 is within tolerance of 7, yet 3 is not within tolerance of 7; "within tolerance" is not a transitive relation.
So, how does uniquetol work? What output does it produce among the several existing solutions?
To simplify, I consider the one-output, two-input version of uniquetol,
C = uniquetol(A, tol);
where the first input is a double vector A. In particular, this implies that:
The 'ByRows' option of uniquetol is not used.
The first input is a vector. If it were not, uniquetol would implicitly linearize it into a column vector, as usual.
The second input, which defines the tolerance, is interpreted as follows:
Two values, u and v, are within tolerance if abs(u-v) <= tol*max(abs(A(:)))
That is, the specified tolerance is relative by default. The actual tolerance used in the comparisons is obtained by scaling by the maximum absolute value in A.
With these considerations, it seems that the approach that uniquetol uses is:
Sort A.
Pick the first entry of sorted A, and set this as reference value (this value will have to be updated later).
Write the reference value into the output C.
Skip subsequent entries of sorted A until one is found that is not within tolerance of the reference value. When that entry is found, take it as the new reference value and go back to step 3.
Of course, I'm not saying that this is what uniquetol internally does. But the output seems to be the same. So this is functionally equivalent to what uniquetol does.
The following code implements the approach described above (inefficient code, just to illustrate the point).
% Inputs: A, tol
% Output: C
tol_scaled = tol*max(abs(A(:))); % scale tolerance
C = []; % initialize output; will be extended
ref = NaN; % initialize reference value to NaN. This will immediately cause
           % A(1) to become the new reference
for a = sort(A(:)).'
    if ~(a-ref <= tol_scaled)
        ref = a;
        C(end+1) = ref;
    end
end
To verify this, let's generate some random data and compare the output of uniquetol and of the above code:
clear
N = 1e3; % number of realizations
S = 1e5; % maximum input size
for n = 1:N
    % Generate inputs:
    s = randi(S); % input size
    A = (2*rand(1,s)-1) / rand; % random input of length s; positive and
                                % negative values; random scaling
    tol = .1*rand; % random tolerance (relative). Change value .1 as desired
    % Compute output:
    tol_scaled = tol*max(abs(A(:))); % scale tolerance
    C = []; % initialize output; will be extended
    ref = NaN; % initialize reference value to NaN. This will immediately cause
               % A(1) to become the new reference
    for a = sort(A(:)).'
        if ~(a-ref <= tol_scaled)
            ref = a;
            C(end+1) = ref;
        end
    end
    % Check if output is equal to that of uniquetol:
    assert(isequal(C, uniquetol(A, tol)))
end
In all my tests this has run without the assertion failing.
So, in summary, uniquetol seems to sort the input, pick its first entry, and keep skipping entries for as long as it can.
For the two examples in the question, the outputs are as follows. Note that the second input is specified as 2.5/9, where 9 is the maximum of the first input, to achieve an absolute tolerance of 2.5:
>> uniquetol([1 3 5 7 9], 2.5/9)
ans =
1 5 9
>> uniquetol([3 5 7 9], 2.5/9)
ans =
3 7

Compute the product of the next n elements in array

I would like to compute the product of the next n adjacent elements of a matrix. The number n of elements to be multiplied should be given in the function's input.
For example, for this input I should compute the product of every 3 consecutive elements, starting from the first:
[p, ind] = max_product([1 2 2 1 3 1],3);
This gives [1*2*2, 2*2*1, 2*1*3, 1*3*1] = [4,4,6,3].
Is there any practical way to do it? Currently I do this using:
for ii = 1:(length(v)-2)
    p = prod(v(ii:ii+n-1));
end
where v is the input vector and n is the number of elements to be multiplied; in this example n = 3, but n can take any positive integer value.
Depending on whether n is odd or even, or length(v) is odd or even, I sometimes get right answers but sometimes an error.
For example for arguments:
v = [1.35912281237829 -0.958120385352704 -0.553335935098461 1.44601450110386 1.43760259196739 0.0266423803393867 0.417039432979809 1.14033971399183 -0.418125096873537 -1.99362640306847 -0.589833539347417 -0.218969651537063 1.49863539349242 0.338844452879616 1.34169199365703 0.181185490389383 0.102817336496793 0.104835620599133 -2.70026800170358 1.46129128974515 0.64413523430416 0.921962619821458 0.568712984110933]
n = 7
I get the error:
Index exceeds matrix dimensions.
Error in max_product (line 6)
p = prod(v(ii:ii+n-1));
Is there any correct general way to do it?
Based on the solution in Fast numpy rolling_product, I'd like to suggest a MATLAB version of it, which leverages the movsum function introduced in R2016a.
The mathematical reasoning is that a product of numbers is equal to the exponential of the sum of their logarithms: prod(x) = exp(sum(log(x))).
A possible MATLAB implementation of the above may look like this:
function P = movprod(vec,window_sz)
P = exp(movsum(log(vec),[0 window_sz-1],'Endpoints','discard'));
if isreal(vec) % Ensures correct outputs when the input contains negative
    P = real(P); % and/or complex entries.
end
end
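A quick check on the question's example data (expected output shown as a comment):
p = movprod([1 2 2 1 3 1], 3)
% p =
%      4     4     6     3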
Several notes:
I haven't benchmarked this solution, and do not know how it compares in terms of performance to the other suggestions.
It should work correctly with vectors containing zero and/or negative and/or complex elements.
It can be easily expanded to accept a dimension to operate along (for array inputs), and any other customization afforded by movsum.
The 1st input is assumed to be either a double or a complex double row vector.
Outputs may require rounding.
Update
Inspired by the nicely thought-out answer of Dev-iL comes this handy solution, which does not require MATLAB R2016a or above:
out = real( exp(conv(log(a),ones(1,n),'valid')) )
The basic idea is to transform the multiplication into a sum, so that a moving sum can be used, which in turn can be realised by convolution.
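As a quick illustration of why the real(...) wrapper is needed, consider an input with negative entries, whose logarithms are complex (approximate values shown as comments):
a = [1 -2 2]; n = 2;
exp(conv(log(a),ones(1,n),'valid'))          % [-2.0000 + 0.0000i, -4.0000 - 0.0000i]
real( exp(conv(log(a),ones(1,n),'valid')) )  % [-2 -4]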
Old answers
This is one way using gallery to get a circulant matrix and indexing the relevant part of the resulting matrix before multiplying the elements:
a = [1 2 2 1 3 1]
n = 3
%// circulant matrix
tmp = gallery('circul', a(:))
%// product of relevant parts of matrix
out = prod(tmp(end-n+1:-1:1, end-n+1:end), 2)
out =
4
4
6
3
A more memory-efficient alternative, in case there are no zeros in the input:
a = [10 9 8 7 6 5 4 3 2 1]
n = 2
%// cumulative product
x = [1 cumprod(a)]
%// shifted by n and divided by itself
y = circshift( x,[0 -n] )./x
%// remove last elements
out = y(1:end-n)
out =
90 72 56 42 30 20 12 6 2
Your approach is correct. You should just change the for loop to for ii = 1:(length(v)-n+1) and then it will work fine (you should also index p, as in p(ii) = ..., if you want to keep all of the products rather than only the last one).
If you are not going to deal with large inputs, another approach is using gallery as explained in @thewaywewalk's answer.
I think the problem may be based on your indexing. The line that states for ii = 1:(length(v)-2) does not provide the correct range of ii.
Try this:
function out = max_product(in,sz)
% the parameter is named sz to avoid shadowing the built-in function "size"
sz = sz-1; % this is because we add sz to i later
out = zeros(length(in)-sz,1); % preallocate the output (a column vector)
for i = 1:length(in)-sz
    out(i) = prod(in(i:i+sz));
end
end
Your code works when restated like so:
p = zeros(1, length(v)-(n-1)); % preallocate to store all of the products
for ii = 1:(length(v)-(n-1))
    p(ii) = prod(v(ii:ii+(n-1)));
end
That should take care of the indexing problem.
Using bsxfun you can create a matrix in which each row contains n consecutive elements, then take the product along the second dimension. I think this is the most efficient way:
max_product = @(v, n) prod(v(bsxfun(@plus, (1 : n), (0 : numel(v)-n)')), 2);
p = max_product([1 2 2 1 3 1],3)
Update:
Some other solutions have been updated, and some, such as @Dev-iL's answer, outperform others. I can also suggest fftconv, which in Octave outperforms conv.
If you can upgrade to R2017a, you can use the new movprod function to compute a windowed product.
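A minimal sketch of that call for the question's trailing-window semantics (the [0 n-1] window and 'Endpoints','discard' options mirror the movsum-based solution above):
v = [1 2 2 1 3 1]; n = 3;
p = movprod(v, [0 n-1], 'Endpoints', 'discard') % [4 4 6 3]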

Comparing two arrays of pixel values, and store any matches

I want to compare the pixel values of two images, which I have stored in arrays.
Suppose the arrays are A and B. I want to compare the elements one by one, and if A[l] == B[k], then I want to store the match as a key-value pair in a third array, C, like so: C[l] = k.
Since the arrays are naturally quite large, the solution needs to finish within a reasonable amount of time (minutes) on a Core 2 Duo system.
This seems to work in under a second for 1024*720 matrices:
A = randi(255,737280,1);
B = randi(255,737280,1);
C = zeros(size(A));
[b_vals, b_inds] = unique(B,'first');
for l = 1:numel(b_vals)
    C(A == b_vals(l)) = b_inds(l);
end
First we find the unique values of B and the indices of the first occurrences of these values.
[b_vals, b_inds] = unique(B,'first');
We know that there can be no more than 256 unique values in a uint8 array, so we've reduced our loop from 1024*720 iterations to just 256 iterations.
We also know that for each occurrence of a particular value, say 209, in A, those locations in C will all have the same value: the location of the first occurrence of 209 in B, so we can set all of them at once. First we get locations of all of the occurrences of b_vals(l) in A:
A == b_vals(l)
then use that mask as a logical index into C.
C(A == b_vals(l))
All of these values will be equal to the corresponding index in B:
C(A == b_vals(l)) = b_inds(l);
Here is the updated code to consider all of the indices of a value in B (or at least as many as are necessary). If there are more occurrences of a value in A than in B, the indices wrap.
A = randi(255,737280,1);
B = randi(255,737280,1);
C = zeros(size(A));
b_vals = unique(B);
for l = 1:numel(b_vals)
    b_inds = find(B==b_vals(l)); %// find the indices of each unique value in B
    a_inds = find(A==b_vals(l)); %// find the indices of each unique value in A
    %// in case the length of a_inds is greater than the length of b_inds
    %// duplicate b_inds until it is larger (or equal)
    b_inds = repmat(b_inds,[ceil(numel(a_inds)/numel(b_inds)),1]);
    %// truncate b_inds to be the same length as a_inds (if necessary) and
    %// put b_inds into the proper places in C
    C(a_inds) = b_inds(1:numel(a_inds));
end
I haven't fully tested this code, but from my small samples it seems to work properly and on the full-size case, it only takes about twice as long as the previous code, or less than 2 seconds on my machine.
So, if I understand your question correctly, you want for each value of l=1:length(A) the (first) index k into B so that A(l) == B(k). Then:
C = arrayfun(@(val) find(B==val, 1, 'first'), A)
could give you your solution, as long as you're sure that every element will have a match. The above solution would fail otherwise, complaining that the function returned a non-scalar (because find returns [] if no match is found). You have two options:
Using a cell array to store the result instead of a numeric array. You would need to call arrayfun with 'UniformOutput', false at the end. Then, the values of A without matches in B would be those for which isempty(C{i}) is true.
Providing a default value for elements of A with no matches in B (e.g. 0 or NaN). I'm not sure about this one, but I think that you would need to add 'ErrorHandler', @(~,~) NaN to the arrayfun call. The error handler is a function that gets called when the function passed to arrayfun fails, and may either rethrow the error or compute a substitute value; hence the @(~,~) NaN. I am not sure that it would work, however, since in this case the error is in arrayfun and not in the passed function, but you can try it.
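As an aside (not part of the original answers), the two-output form of ismember performs exactly this first-match lookup in one vectorized call, returning 0 where no match exists:
[tf, C] = ismember(A, B); % C(l) is the lowest k with B(k) == A(l), or 0 if none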
If you have the images in arrays A & B
idx = A == B;
C = zeros(size(A));
C(idx) = A(idx);

Substitute a vector value with two values in MATLAB

I have to create a function that takes as input a vector v and three scalars a, b and c. The function replaces every element of v that is equal to a with a two element array [b,c].
For example, given v = [1,2,3,4] and a = 2, b = 5, c = 5, the output would be:
out = [1,5,5,3,4]
My first attempt was to try this:
v = [1,2,3,4];
v(2) = [5,5];
However, I get an error, so I do not understand how to put two values in the place of one in a vector, i.e. how to shift all the following values one position to the right so that the two new values fit in the vector (the size of the vector therefore increases by one). In addition, if several values of a exist in v, I'm not sure how to replace them all at once.
How can I do this in MATLAB?
Here's a solution using cell arrays:
% remember the indices where a occurs
ind = (v == a);
% split array such that each element of a cell array contains one element
v = mat2cell(v, 1, ones(1, numel(v)));
% replace appropriate cells with two-element array
v(ind) = {[b c]};
% concatenate
v = cell2mat(v);
Like rayryeng's solution, it can replace multiple occurrences of a.
The problem mentioned by siliconwafer, that the array changes size, is here solved by intermediately keeping the partial arrays in cells of a cell array. Converting back to an array concatenates these parts.
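Applied to the question's example, as a quick check:
v = [1,2,3,4]; a = 2; b = 5; c = 5;
ind = (v == a);
v = mat2cell(v, 1, ones(1, numel(v)));
v(ind) = {[b c]};
v = cell2mat(v) % [1 5 5 3 4]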
Something I would do is to first find the locations in v that are equal to a, which we will call ind. Then, create a new output vector whose size is equal to numel(v) + numel(ind), as we are replacing each value of a that is in v with two values, and use indexing to place our new values in.
Assuming that you have created a row vector v, do the following:
%// Find all locations that are equal to a
ind = find(v == a);
%// Allocate output vector
out = zeros(1, numel(v) + numel(ind));
%// Determine locations in output vector that we need to
%// modify to place the value b in
indx = ind + (0:numel(ind)-1);
%// Determine locations in output vector that we need to
%// modify to place the value c in
indy = indx + 1;
%// Place values of b and c into the output
out(indx) = b;
out(indy) = c;
%// Get the rest of the values in v that are not equal to a
%// and place them in their corresponding spots.
rest = true(1,numel(out));
rest([indx,indy]) = false;
out(rest) = v(v ~= a);
The indx and indy statements are the tricky part. For each location of v that is equal to a, the output must shift one extra position to the right relative to the original index: the first occurrence shifts everything after it by 1; the second occurrence is therefore offset by 2 from its original index, the third by 3, and so on. These shifted positions define where we place b. To place c, we simply take the indices generated for placing b and move them over to the right by 1.
What's left is to populate the output vector with those values that are not equal to a. We simply define a logical mask where the indices used to populate the output array have their locations set to false while the rest are set to true. We use this to index into the output and find those locations that are not equal to a to complete the assignment.
Example:
v = [1,2,3,4,5,4,4,5];
a = 4;
b = 10;
c = 11;
Using the above code, we get:
out =
1 2 3 10 11 5 10 11 10 11 5
This successfully replaces every value that is 4 in v with the tuple of [10,11].
I think that strrep deserves a mention here.
Although it's called string replacement and warns for non-char input, it still works perfectly fine for other numbers as well (including integers, doubles and even complex numbers).
v = [1,2,3,4]
a = 2, b = 5, c = 5
out = strrep(v, a, [b c])
Warning: Inputs must be character arrays or cell arrays of strings.
out =
1 5 5 3 4
You are not attempting to overwrite an existing value in the vector. You're attempting to change the size of the vector (meaning the number of rows or columns in the vector) because you're adding an element. This will always result in the vector being reallocated in memory.
Create a new vector, using the part of v before the index and the part after it.
Let's say your index is stored in the variable index.
index = 2;
newValues = [5, 5];
x = [ v(1:index-1), newValues, v(index+1:end) ]
x =
1 5 5 3 4
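A hypothetical generalization of this splice (not part of the original answer): apply it at every occurrence of a, iterating right to left so that the indices found up front remain valid as the vector grows.
v = [1,2,3,4,5,4,4,5]; a = 4; b = 10; c = 11;
for index = fliplr(find(v == a))
    v = [v(1:index-1), b, c, v(index+1:end)];
end
% v is now [1 2 3 10 11 5 10 11 10 11 5]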

Algorithm to split an array into P subarrays of balanced sum

I have a big array of length N, let's say something like:
2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1
I need to split this array into P subarrays (in this example, P=4 would be reasonable), such that the sum of the elements in each subarray is as close as possible to sigma, being:
sigma=(sum of all elements in original array)/P
In this example, sigma=15.
For the sake of clarity, one possible result would be:
2 4 6 | 7 6 3 3 | 3 4 3 4 | 4 4 3 3 1
(sums: 12,19,14,15)
I have written a very naive algorithm based on how I would do the divisions by hand, but I don't know how to impose the condition that a division whose sums are (14,14,14,14,19) is worse than one whose sums are (15,14,16,14,16).
Thank you in advance.
First, let’s formalize your optimization problem by specifying the input, output, and the measure for each possible solution (I hope this is in your interest):
Given an array A of positive integers and a positive integer P, separate the array A into P non-overlapping subarrays such that the difference between the sum of each subarray and the perfect sum of the subarrays (sum(A)/P) is minimal.
Input: Array A of positive integers; P is a positive integer.
Output: Array SA of P non-negative integers representing the length of each subarray of A where the sum of these subarray lengths is equal to the length of A.
Measure: the sum of abs(sum(sa_j) - sum(A)/P) over all subarrays sa_j is minimal, where sa_j = (A_{i+1}, ..., A_{i+SA_j}) with i = SA_1 + ... + SA_{j-1}, for j from 1 to P.
The input and output define the set of valid solutions. The measure defines a measure to compare multiple valid solutions. And since we’re looking for a solution with the least difference to the perfect solution (minimization problem), measure should also be minimal.
With this information, it is quite easy to implement the measure function (here in Python):
def measure(a, sa):
    sigma = sum(a)/len(sa)
    diff = 0
    i = 0
    for j in xrange(0, len(sa)):
        diff += abs(sum(a[i:i+sa[j]])-sigma)
        i += sa[j]
    return diff
print measure([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], [3,4,4,5]) # prints 8
Now finding an optimal solution is a little harder.
We can use the Backtracking algorithm for finding valid solutions and use the measure function to rate them. We basically try all possible combinations of P non-negative integer numbers that sum up to length(A) to represent all possible valid solutions. Although this ensures not to miss a valid solution, it is basically a brute-force approach with the benefit that we can omit some branches that cannot be any better than our yet best solution. E.g. in the example above, we wouldn’t need to test solutions with [9,…] (measure > 38) if we already have a solution with measure ≤ 38.
Following the pseudocode pattern from Wikipedia, our bt function looks as follows:
def bt(c):
    global P, optimum, optimum_diff
    if reject(P,c):
        return
    if accept(P,c):
        print "%r with %d" % (c, measure(P,c))
        if measure(P,c) < optimum_diff:
            optimum = c
            optimum_diff = measure(P,c)
        return
    s = first(P,c)
    while s is not None:
        bt(list(s))
        s = next(P,s)
The global variables P, optimum, and optimum_diff represent the problem instance holding the values for A, P, and sigma, as well as the optimal solution and its measure:
class MinimalSumOfSubArraySumsProblem:
    def __init__(self, a, p):
        self.a = a
        self.p = p
        self.sigma = sum(a)/p
Next we specify the reject and accept functions, which are quite straightforward:
def reject(P,c):
    return optimum_diff < measure(P,c)

def accept(P,c):
    return None not in c
This simply rejects any candidate whose measure is already more than our yet optimal solution. And we’re accepting any valid solution.
The measure function is also slightly changed due to the fact that c can now contain None values:
def measure(P, c):
    diff = 0
    i = 0
    for j in xrange(0, P.p):
        if c[j] is None:
            break
        diff += abs(sum(P.a[i:i+c[j]])-P.sigma)
        i += c[j]
    return diff
The remaining two functions, first and next, are a little more complicated:
def first(P,c):
    t = 0
    is_complete = True
    for i in xrange(0, len(c)):
        if c[i] is None:
            if i+1 < len(c):
                c[i] = 0
            else:
                c[i] = len(P.a) - t
            is_complete = False
            break
        else:
            t += c[i]
    if is_complete:
        return None
    return c

def next(P,s):
    t = 0
    for i in xrange(0, len(s)):
        t += s[i]
        if i+1 >= len(s) or s[i+1] is None:
            if t+1 > len(P.a):
                return None
            else:
                s[i] += 1
                return s
Basically, first replaces the next None value in the list with 0 if it's not the last value in the list, or with the remainder needed to represent a valid solution (a little optimization here) if it is the last value; it returns None if there is no None value in the list. next simply increments the rightmost integer by one, or returns None if an increment would breach the total limit.
Now all you need is to create a problem instance, initialize the global variables and call bt with the root:
P = MinimalSumOfSubArraySumsProblem([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4)
optimum = None
optimum_diff = float("inf")
bt([None]*P.p)
If I am not mistaken here, one more approach is dynamic programming.
You can define P[ pos, n ] as the smallest possible "penalty" accumulated up to position pos if n subarrays were created. Obviously there is some position pos' such that
P[pos', n-1] + penalty(pos', pos) = P[pos, n]
You can just minimize over pos' = 1..pos.
The naive implementation will run in O(N^2 * M), where N - size of the original array and M - number of divisions.
@Gumbo's answer is clear and actionable, but consumes lots of time when length(A) is bigger than 400 and P is bigger than 8. This is because that algorithm is a kind of brute force with pruning, as he said.
In fact, a very fast solution is using dynamic programming.
Given an array A of positive integers and a positive integer P, separate the array A into P non-overlapping subarrays such that the difference between the sum of each subarray and the perfect sum of the subarrays (sum(A)/P) is minimal.
Measure: sum over j of (S_j - sum(A)/P)^2, where S_j is the sum of the elements of subarray j and sum(A)/P is the average of the P subarray sums.
This makes sure the sums are balanced, because it follows the definition of standard deviation (up to a constant factor).
Presuming that array A has N elements, Q(i,j) denotes the minimum Measure value when splitting the last i elements of A into j subarrays, and D(i,j) denotes (sum(B)-sum(A)/P)^2 where array B consists of the i-th through j-th elements of A (0 <= i <= j < N).
The minimum measure of the question is Q(N,P). And we find that:
Q(N,P) = MIN{ Q(N-1,P-1)+D(0,0); Q(N-2,P-1)+D(0,1); ...; Q(P-1,P-1)+D(0,N-P) }
So it can be solved by dynamic programming:
Q(i,1) = D(N-i,N-1)
Q(i,j) = MIN{ Q(i-1,j-1)+D(N-i,N-i);
              Q(i-2,j-1)+D(N-i,N-i+1);
              ...;
              Q(j-1,j-1)+D(N-i,N-j) }
So the algorithm step is:
1. Calculate j=1:
   Q(1,1), Q(2,1), ..., Q(N,1)
2. Calculate j=2:
   Q(2,2) = MIN{ Q(1,1)+D(N-2,N-2) };
   Q(3,2) = MIN{ Q(2,1)+D(N-3,N-3); Q(1,1)+D(N-3,N-2) }
   Q(4,2) = MIN{ Q(3,1)+D(N-4,N-4); Q(2,1)+D(N-4,N-3); Q(1,1)+D(N-4,N-2) }
... Calculate j = ...
P. Calculate j=P:
   Q(P,P), Q(P+1,P), ..., Q(N,P)
The final minimum Measure value is stored as Q(N,P)!
To trace each subarray's length, store the MIN choice made when calculating Q(i,j) = MIN{Q+D...}. The algorithm needs O(N^2) space for D(i,j) and O(N^2*P) time to calculate Q(N,P), compared to the exponential time the pure brute-force algorithm consumes.
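For concreteness, here is a minimal MATLAB sketch of this dynamic program (the function name split_balanced is hypothetical, and Q is indexed from the front of A rather than from the back as above, which is equivalent):
function [parts, bestCost] = split_balanced(A, P)
% Sketch of the O(N^2*P) DP above: minimize the sum of squared deviations
% of the part sums from sum(A)/P over contiguous partitions into P parts.
N = numel(A);
sigma = sum(A)/P;
cs = [0 cumsum(A(:)')];                        % prefix sums: cs(i+1) = sum(A(1:i))
segcost = @(i,j) (cs(j+1) - cs(i) - sigma)^2;  % D-style cost of the part A(i:j)
Q = inf(N, P);   % Q(i,j): best cost for splitting A(1:i) into j parts
K = zeros(N, P); % split point realizing each minimum, for backtracking
for i = 1:N
    Q(i,1) = segcost(1, i);
end
for j = 2:P
    for i = j:N
        for k = j-1:i-1                        % last part is A(k+1:i)
            c = Q(k,j-1) + segcost(k+1, i);
            if c < Q(i,j)
                Q(i,j) = c;
                K(i,j) = k;
            end
        end
    end
end
bestCost = Q(N,P);
parts = cell(1,P);                             % recover the parts from K
i = N;
for j = P:-1:2
    parts{j} = A(K(i,j)+1:i);
    i = K(i,j);
end
parts{1} = A(1:i);
end
For the question's data, split_balanced([2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1], 4) returns a partition into 4 contiguous parts that is optimal under this squared-deviation measure.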
Working code below (written in PHP). This code decides the part quantity itself:
$main = array(2,4,6,1,6,3,2,3,4,3,4,1,4,7,3,1,2,1,3,4,1,7,2,4,1,2,3,1,1,1,1,4,5,7,8,9,8,0);
$pa = 0;
for ($i = 0; $i < count($main); $i++) {
    $p[] = $main[$i];
    if (abs(15 - array_sum($p)) < abs(15 - (array_sum($p) + $main[$i+1]))) {
        $pa = $pa + 1;
        $pi[] = $i + 1;
        $pc = count($pi);
        $ba = $pi[$pc-2];
        $part[$pa] = array_slice($main, $ba, count($p));
        unset($p);
    }
}
print_r($part);
for ($s = 1; $s < count($part); $s++) {
    echo '<br>';
    echo array_sum($part[$s]);
}
The code will output part sums like the ones below:
13
14
16
14
15
15
17
I'm wondering whether the following would work:
Go from the left, as soon as sum > sigma, branch into two, one including the value that pushes it over, and one that doesn't. Recursively process data to the right with rightSum = totalSum-leftSum and rightP = P-1.
So, at the start, sum = 60
2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1
Then for 2 4 6 7, sum = 19 > sigma, so split into:
2 4 6 | 7 6 3 3 3 4 3 4 4 4 3 3 1
2 4 6 7 | 6 3 3 3 4 3 4 4 4 3 3 1
Then we process 7 6 3 3 3 4 3 4 4 4 3 3 1 and 6 3 3 3 4 3 4 4 4 3 3 1 with P = 4-1 and sum = 60-12 and sum = 60-19 respectively.
This results in, I think, O(P*n).
It might be a problem when 1 or 2 values are by far the largest, but any value >= sigma can probably just be put in its own partition (preprocessing the array to find such values, and reducing sum accordingly, might be the best idea).
If it works, it should hopefully minimise sum-of-squared-error (or close to that), which seems like the desired measure.
I propose an algorithm based on backtracking. The main function randomly selects an element from the original array and adds it to one of the partition arrays. For each addition, it checks whether a better solution than the previous one is obtained, using a function that calculates the deviation after each new element is added. I also thought it would be good to add an iteration counter to the loops, so that if the desired solution cannot be reached, the program is forced to end. By desired solution I mean adding all elements while respecting the condition imposed by the if.
sum = CalculateSum(vector)
Read P
sigma = sum/P
initialize P vectors, with names vector_partition[i], i = 1..P
list_vector: initialize a list that points to these P vectors
initialize a differences_vector with dimension P
// that can easily be visualized like a vector of vectors
// construct a non-recursive backtracking algorithm
function Deviation(vector) // calculates the deviation of the elements of a vector
{
    dev = 0
    for i = 0 to Size(vector)-1 do
        dev += |vector[i+1]-vector[i]|
    return dev
}
iteration = 0
// fix some maximum number of iterations for the while loop
Read max_iteration
// the higher the number of iterations, the more accurate the solution
while (!IsEmpty(vector))
{
    for i = 1 to Size(list_vector) do
    {
        if (IsEmpty(vector)) break from while loop
        initial_deviation = Deviation(list_vector[i])
        el = SelectElement(vector) // you can implement that function using a
                                   // randomized choice of element
        difference_vector[i] = |sigma - CalculateSum(list_vector[i])|
        PutOnBackVector(vector_list[i], el)
        if (initial_deviation > Deviation(difference_vector))
            ExtractFromBackVectorAndPutOnSecondVector(list_vector, vector)
    }
    iteration++
    // prevent entering an infinite loop
    if (iteration > max_iteration) break from while loop
}
You can change this by adding, in the first if, some code which increments the calculated deviation by an amount:
additional_amount = 0
iteration = 0
while
{
    ...
    if (initial_deviation > Deviation(difference_vector) + additional_amount)
        ExtractFromBackVectorAndPutOnSecondVector(list_vector, vector)
    if (iteration > max_iteration)
    {
        iteration = 0
        additional_amount += 1/some_constant
    }
    iteration++
    // delete the second if from the first version
}
Your problem is very similar to, or the same as, the minimum makespan scheduling problem, depending on how you define your objective. In the case that you want to minimize the maximum |sum_i - sigma|, it is exactly that problem.
As referenced in the Wikipedia article, this problem is NP-complete for p > 2. Graham's list scheduling algorithm is optimal for p <= 3, and provides an approximation ratio of 2 - 1/p. You can check out the Wikipedia article for other algorithms and their approximation.
All the algorithms given on this page are either solving for a different objective, incorrect/suboptimal, or can be used to solve any problem in NP :)
This is very similar to the case of the one-dimensional bin packing problem, see http://www.cs.sunysb.edu/~algorith/files/bin-packing.shtml. In the associated book, The Algorithm Design Manual, Skiena suggests a first-fit decreasing approach: figure out your bin size (mean = sum / N), and then allocate the largest remaining object into the first bin that has room for it. You either reach a point where you have to start over-filling a bin, or, if you're lucky, you get a perfect fit. As Skiena states, "First-fit decreasing has an intuitive appeal to it, for we pack the bulky objects first and hope that little objects can fill up the cracks."
As a previous poster said, the problem looks like it's NP-complete, so you're not going to solve it perfectly in reasonable time, and you need to look for heuristics.
I recently needed this and did as follows:
Create an initial array of sub-arrays, of length equal to the given sub-array count. Each sub-array should have a sum property too, i.e. [[sum:0],[sum:0]...[sum:0]].
Sort the main array descending.
Find the sub-array with the smallest sum, insert the next item from the main array, and increment that sub-array's sum property by the inserted item's value.
Repeat the previous step until the end of the main array is reached.
Return the initial array.
This is the code in JS.
function groupTasks(tasks, groupCount){
  var sum = tasks.reduce((p,c) => p+c),
      initial = [...Array(groupCount)].map(sa => (sa = [], sa.sum = 0, sa));
  return tasks.sort((a,b) => b-a)
              .reduce((groups, task) => {
                var group = groups.reduce((p,c) => p.sum < c.sum ? p : c);
                group.push(task);
                group.sum += task;
                return groups;
              }, initial);
}
var tasks = [...Array(50)].map(_ => ~~(Math.random()*10)+1), // create an array of 50 random elements from 1 to 10
    result = groupTasks(tasks, 7); // distribute them into 7 sub-arrays with closest sums
console.log("input array:", JSON.stringify(tasks));
console.log(result.map(r => [JSON.stringify(r), "sum: " + r.sum]));
You can use a max-flow algorithm.
