Efficient way to generate histogram from very large dataset in MATLAB?

Efficient way to generate histogram from very large dataset in MATLAB? - arrays

I have two 2D arrays of size up to 35,000*35,000 each: indices and dotPs. From this, I want to create two 1D arrays such that pop contains the number of times each number appears in indices and nn contains the sum of elements in dotPs that correspond to those numbers. I have come up with the following (really dumb) way:
dotPs = [81.4285 9.2648 46.3184 5.7974 4.5016 2.6779 16.0092 41.1426;
9.2648 24.3525 11.4308 14.6598 17.9558 23.4246 19.4837 14.1173;
46.3184 11.4308 92.9264 9.2036 2.9957 0.1164 26.5770 26.0243;
5.7974 14.6598 9.2036 34.9984 16.2352 19.4568 31.8712 5.0732;
4.5016 17.9558 2.9957 16.2352 19.6595 16.0678 3.5750 16.7702;
2.6779 23.4246 0.1164 19.4568 16.0678 25.1084 6.6237 15.6188;
16.0092 19.4837 26.5770 31.8712 3.5750 6.6237 61.6045 16.6102;
41.1426 14.1173 26.0243 5.0732 16.7702 15.6188 16.6102 47.3289];
indices = [3 2 1 1 2 1 2 1;
2 2 1 2 2 1 2 2;
1 1 3 3 2 2 2 2;
1 2 3 4 3 3 4 2;
2 2 2 3 3 1 3 2;
1 1 2 3 1 8 2 2;
2 2 2 4 3 2 4 2;
1 2 2 2 2 2 2 2];
nn = zeros(1,8);
pop = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
j = uniqueInd(k);
[I,J]=find(indices==j);
if j == 0 || numel(I) == 0
continue
end
pop(j) = pop(j) + numel(I);
nn(j) = nn(j) + sum(sum(dotPs(I,J)));
end
Because of the find function, this is very slow. How can I do this more smartly so that it runs in a few seconds rather than several minutes?
Edit: added small dummy matrices for testing the code.

Both tasks can be done with the accumarray function:
pop = accumarray(indices(:), 1, [max(indices(:)) 1]).';
nn = accumarray(indices(:), dotPs(:), [max(indices(:)) 1]).';
This assumes that indices only contains positive integers.
EDIT:
From comments, only the lower part of the indices matrix without the diagonal should be used, and it is guaranteed to contain positive integers. In that case:
mask = tril(true(size(indices)), -1);
indices_masked = indices(mask);
dotPs_masked = dotPs(mask);
pop = accumarray(indices_masked, 1, [max(indices_masked) 1]).';
nn = accumarray(indices_masked, dotPs_masked, [max(indices_masked) 1]).';

First of all, note that the dimension of indices does not matter (e.g. if both indices and dotPs were 1D arrays or 3D arrays the result will be the same).
pop can be calculated by histcount function, but since you also need to calculate the sum of the corresponding elements of dotPs array the problem becomes harder.
Here is a possible solution with a for loop. The advantage of this solution is that I am not calling find function in a loop, so it should be faster:
%Example input
indices=randi(5,3,3);
dotPs=rand(3,3);
%Solution
[C,ia,ic]=unique(indices);
nn=zeros(size(C));
pop=zeros(size(C));
for i=1:numel(indices)
nn(ic(i))=nn(ic(i))+1;
pop(ic(i))=pop(ic(i))+dotPs(i);
end
This solution uses a vector ic to categorize each of the input values. After that, I go through each element and update nn(ic) and pop(ic).

For computing pop, you can use hist, for computing nn, I couldn't find a smart solution (but I found a solution without using find):
pop = hist(indices(:), max(indices(:)));
nn = zeros(1,8);
uniqueInd = unique(indices);
for k=1:numel(uniqueInd)
j = uniqueInd(k);
nn(j) = sum(dotPs(indices == j));
end
There must be a better solution for computing nn.
I found a smarter solution applying sorting.
I am not sure it's faster, because sorting 35,000*35,000 elements might take a long time.
Sort indices just for getting the index for sorting dotPs by indices.
Sort dotPs according to index returned by previous sort.
cumsumPop = Compute cumulative sum of pop (cumulative sum of the histogram of indices).
cumsumPs = Compute cumulative sum of sorted dotPs.
Now values of cumsumPop can be used as indices in cumsumPs.
Because cumsumPs is cumulative sum, we need to use diff
for getting the solution.
Here is the "smart" solution:
pop = hist(indices(:), max(indices(:)));
[sortedIndices, I] = sort(indices(:));
sortedDotPs = dotPs(I);
cumsumPop = cumsum(pop);
cumsumPs = cumsum(sortedDotPs);
nn = diff([0; cumsumPs(cumsumPop)]);
nn = nn';

Related

A matrix and a column vector containing indices, how to iterate with no loop?

I have a big matrix (500212x7) and a column vector like below
matrix F vector P
0001011 4
0001101 3
1101100 6
0000110 1
1110000 7
The vector contains indices considered within the matrix rows. P(1) is meant to point at F(1,4), P(2) at F(2,3) and so on.
I want to negate a bit in each row in F in a column pointed by P element (in the same row).
I thought of things like
F(:,P(1)) = ~F(:,P(1));
F(:,P(:)) = ~F(:,P(:));
but of course these scenarios won't produce the result I expect as the first line won't make P element change and the second one won't even let me start the program because a full vector cannot make an index.
The idea is I need to do this for all F and P rows (changing/incrementing "simultaneously") but take the value of P element.
I know this is easily achieved with for loop but due to large dimensions of the F array such a way to solve the problem is completely unacceptable.
Is there any kind of Matlab wizardry that lets solving such a task with the use of matrix operations?

I know this is easily achieved with for loop but due to large dimensions of the F array such a way to solve the problem is completely unacceptable.
You should never make such an assumption. First implement the loop, then check to see if it really is too slow for you or not, then worry about optimizing.
Here I'm comparing Luis' answer and the trival loop:
N = 500212;
F = rand(N,7) > 0.6;
P = randi(7,N,1);
timeit(#()method1(F,P))
timeit(#()method2(F,P))
function F = method1(F,P)
ind = (1:size(F,1)) + (P(:).'-1)*size(F,1); % create linear index
F(ind) = ~F(ind); % negate those entries
end
function F = method2(F,P)
for ii = 1:numel(P)
F(ii,P(ii)) = ~F(ii,P(ii));
end
end
Timings are 0.0065 s for Luis' answer, and 0.0023 s for the loop (MATLAB Online R2019a).
It is especially true for very large arrays, that loops are faster than vectorized code, if the vectorization requires creating an intermediate array. Memory access is expensive, using more of it makes the code slower.
Lessons: don't dismiss loops, don't prematurely try to optimize, and don't optimize without comparing.

Another solution:
xor( F, 1:7 == P )
Explanation:
1:7 == P generates one-hot arrays.
xor will cause a bit to retain its value against a 0, and flip it against a 1

Not sure if it qualifies as wizardry, but linear indexing does exactly what you want:
F = [0 0 0 1 0 1 1; 0 0 0 1 1 0 1; 1 1 0 1 1 0 0; 0 0 0 0 1 1 0; 1 1 1 0 0 0 0];
P = [4; 3; 6; 1; 7];
ind = (1:size(F,1)) + (P(:).'-1)*size(F,1); % create linear index
F(ind) = ~F(ind); % negate those entries

Compute the product of the next n elements in array

I would like to compute the product of the next n adjacent elements of a matrix. The number n of elements to be multiplied should be given in function's input.
For example for this input I should compute the product of every 3 consecutive elements, starting from the first.
[p, ind] = max_product([1 2 2 1 3 1],3);
This gives [1*2*2, 2*2*1, 2*1*3, 1*3*1] = [4,4,6,3].
Is there any practical way to do it? Now I do this using:
for ii = 1:(length(v)-2)
p = prod(v(ii:ii+n-1));
end
where v is the input vector and n is the number of elements to be multiplied.
in this example n=3 but can take any positive integer value.
Depending whether n is odd or even or length(v) is odd or even, I get sometimes right answers but sometimes an error.
For example for arguments:
v = [1.35912281237829 -0.958120385352704 -0.553335935098461 1.44601450110386 1.43760259196739 0.0266423803393867 0.417039432979809 1.14033971399183 -0.418125096873537 -1.99362640306847 -0.589833539347417 -0.218969651537063 1.49863539349242 0.338844452879616 1.34169199365703 0.181185490389383 0.102817336496793 0.104835620599133 -2.70026800170358 1.46129128974515 0.64413523430416 0.921962619821458 0.568712984110933]
n = 7
I get the error:
Index exceeds matrix dimensions.
Error in max_product (line 6)
p = prod(v(ii:ii+n-1));
Is there any correct general way to do it?

Based on the solution in Fast numpy rolling_product, I'd like to suggest a MATLAB version of it, which leverages the movsum function introduced in R2016a.
The mathematical reasoning is that a product of numbers is equal to the exponent of the sum of their logarithms:
A possible MATLAB implementation of the above may look like this:
function P = movprod(vec,window_sz)
P = exp(movsum(log(vec),[0 window_sz-1],'Endpoints','discard'));
if isreal(vec) % Ensures correct outputs when the input contains negative and/or
P = real(P); % complex entries.
end
end
Several notes:
I haven't benchmarked this solution, and do not know how it compares in terms of performance to the other suggestions.
It should work correctly with vectors containing zero and/or negative and/or complex elements.
It can be easily expanded to accept a dimension to operate along (for array inputs), and any other customization afforded by movsum.
The 1st input is assumed to be either a double or a complex double row vector.
Outputs may require rounding.

Update
Inspired by the nicely thought answer of Dev-iL comes this handy solution, which does not require Matlab R2016a or above:
out = real( exp(conv(log(a),ones(1,n),'valid')) )
The basic idea is to transform the multiplication to a sum and a moving average can be used, which in turn can be realised by convolution.
Old answers
This is one way using gallery to get a circulant matrix and indexing the relevant part of the resulting matrix before multiplying the elements:
a = [1 2 2 1 3 1]
n = 3
%// circulant matrix
tmp = gallery('circul', a(:))
%// product of relevant parts of matrix
out = prod(tmp(end-n+1:-1:1, end-n+1:end), 2)
out =
4
4
6
3
More memory efficient alternative in case there are no zeros in the input:
a = [10 9 8 7 6 5 4 3 2 1]
n = 2
%// cumulative product
x = [1 cumprod(a)]
%// shifted by n and divided by itself
y = circshift( x,[0 -n] )./x
%// remove last elements
out = y(1:end-n)
out =
90 72 56 42 30 20 12 6 2

Your approach is correct. You should just change the for loop to for ii = 1:(length(v)-n+1) and then it will work fine.
If you are not going to deal with large inputs, another approach is using gallery as explained in #thewaywewalk's answer.

I think the problem may be based on your indexing. The line that states for ii = 1:(length(v)-2) does not provide the correct range of ii.
Try this:
function out = max_product(in,size)
size = size-1; % this is because we add size to i later
out = zeros(length(in),1) % assuming that this is a column vector
for i = 1:length(in)-size
out(i) = prod(in(i:i+size));
end
Your code works when restated like so:
for ii = 1:(length(v)-(n-1))
p = prod(v(ii:ii+(n-1)));
end
That should take care of the indexing problem.

using bsxfun you create a matrix each row of it contains consecutive 3 elements then take prod of 2nd dimension of the matrix. I think this is most efficient way:
max_product = #(v, n) prod(v(bsxfun(#plus, (1 : n), (0 : numel(v)-n)')), 2);
p = max_product([1 2 2 1 3 1],3)
Update:
some other solutions updated, and some such as #Dev-iL 's answer outperform others, I can suggest fftconv that in Octave outperforms conv

If you can upgrade to R2017a, you can use the new movprod function to compute a windowed product.

How to increment some of elements in an array by specific values in MATLAB

Suppose we have an array
A = zeros([1,10]);
We have several indexes with possible duplicate say:
indSeq = [1,1,2,3,4,4,4];
How can we increase A(i) by the number of i in the index sequence i.e. A(1) = 2, A(2) = 1, A(3) = 1, A(4) = 3?
The code A(indSeq) = A(indSeq)+1 does not work.
I know that I can use the following for loop to achieve the goal, but I wonder if there is anyway that we can avoid for-loop? We can assume that the indSeq is sorted.
A for-loop solution:
for i=1:length(indSeq)
A(indSeq(i)) = A(indSeq(i))+1;
end;

You can use accumarray for such a label based counting job, like so -
accumarray(indSeq(:),1)
Benchmarking
As suggested in the other answer, you can also use hist/histc. Let's benchmark these two for a large datasize. The benchmarking code I used had -
%// Create huge random array filled with ints that are duplicated & sorted
maxn = 100000;
N = 10000000;
indSeq = sort(randi(maxn,1,N));
disp('--------------------- With HISTC')
tic,histc(indSeq,unique(indSeq));toc
disp('--------------------- With ACCUMARRAY')
tic,accumarray(indSeq(:),1);toc
Runtime output -
--------------------- With HISTC
Elapsed time is 1.028165 seconds.
--------------------- With ACCUMARRAY
Elapsed time is 0.220202 seconds.

This is run-length encoding, and the following code should do the trick for you.
A=zeros(1,10);
indSeq = [1,1,2,3,4,4,4,7,1];
indSeq=sort(indSeq); %// if your input is always sorted, you don't need to do this
pos = [1; find(diff(indSeq(:)))+1; numel(indSeq)+1];
A(indSeq(pos(1:end-1)))=diff(pos)
which returns
A =
3 1 1 3 0 0 1 0 0 0
This algorithm was written by Luis Mendo for MATL.

I think what you are looking for is the number of occurences of unique values of the array. This can be accomplished with:
[num, val] = hist(indSeq,unique(indSeq));
the output of your example is:
num = 2 1 1 3
val = 1 2 3 4
so num is the number of times val occurs. i.e. the number 1 occurs 2 times in your example

Is there a better/faster way of randomly shuffling a matrix in MATLAB?

In MATLAB, I am using the shake.m function (http://www.mathworks.com/matlabcentral/fileexchange/10067-shake) to randomly shuffle each column. For example:
a = [1 2 3; 4 5 6; 7 8 9]
a =
1 2 3
4 5 6
7 8 9
b = shake(a)
b =
7 8 6
1 5 9
4 2 3
This function does exactly what I want, however my columns are very long (>10,000,000) and so this takes a long time to run. Does anyone know of a faster way of achieving this? I have tried shaking each column vector separately but this isn't faster. Thanks!

You can use randperm like this, but I don't know if it will be any faster than shake:
[m,n]=size(a)
for c = 1:n
a(randperm(m),c) = a(:,c);
end
Or you can try switch the randperm around to see which is faster (should produce the same result):
[m,n]=size(a)
for c = 1:n
a(:,c) = a(randperm(m),c);
end
Otherwise how many rows do you have? If you have far fewer rows than columns, it's possible that we can assume each permutation will be repeated, so what about something like this:
[m,n]=size(a)
cols = randperm(n);
k = 5; %//This is a parameter you'll need to tweak...
set_size = floor(n/k);
for set = 1:set_size:n
set_cols = cols(set:(set+set_size-1))
a(:,set_cols) = a(randperm(m), set_cols);
end
which would massively reduce the number of calls to randperm. Breaking it up into k equal sized sets might not be optimal though, you might want to add some randomness to that as well. The basic idea here though is that there will only be factorial(m) different orderings, and if m is much smaller than n (e.g. m=5, n=100000 like your data), then these orderings will be repeated naturally. So instead of letting that occur by itself, rather manage the process and reduce the calls to randperm which would be producing the same result anyway.

Here's a simple vectorized approach. Note that it creates an auxiliary matrix (ind) the same size as a, so depending on your memory it may be usable or not.
[~, ind] = sort(rand(size(a))); %// create a random sorting for each column
b = a(bsxfun(#plus, ind, 0:size(a,1):numel(a)-1)); %// convert to linear index

Obtain shuffled indices using randperm
idx = randperm(size(a,1));
Use the indices to shuffle the vector:
m = size(a,1);
for i=1:m
b(:,i) = a(randperm(m,:);
end
Look at this answer: Matlab: How to random shuffle columns of matrix

Here's a no-loop approach as it processes all indices at once and I believe this is as random as one could get given the requirements of shuffling among each column only.
Code
%// Get sizes
[m,n] = size(a);
%// Create an array of randomly placed sequential indices from 1 to numel(a)
rand_idx = randperm(m*n);
%// segregate those indices into rows and cols for the size of input data, a
col = ceil(rand_idx/m);
row = rem(rand_idx,m);
row(row==0)=m;
%// Sort both these row and col indices based on col, such that we have col
%// as 1,1,1,1 ...2,2,2,....3,3,3,3 and so on, which would represent per col
%// indices for the input data. Use these indices to linearly index into a
[scol,ind1] = sort(col);
a(1:m*n) = a((scol-1)*m + row(ind1))
Final output is obtained in a itself.

Algorithm to split an array into P subarrays of balanced sum

I have an big array of length N, let's say something like:
2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1
I need to split this array into P subarrays (in this example, P=4 would be reasonable), such that the sum of the elements in each subarray is as close as possible to sigma, being:
sigma=(sum of all elements in original array)/P
In this example, sigma=15.
For the sake of clarity, one possible result would be:
2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1
(sums: 12,19,14,15)
I have written a very naive algorithm based in how I would do the divisions by hand, but I don't know how to impose the condition that a division whose sums are (14,14,14,14,19) is worse than one that is (15,14,16,14,16).
Thank you in advance.

First, let’s formalize your optimization problem by specifying the input, output, and the measure for each possible solution (I hope this is in your interest):
Given an array A of positive integers and a positive integer P, separate the array A into P non-overlapping subarrays such that the difference between the sum of each subarray and the perfect sum of the subarrays (sum(A)/P) is minimal.
Input: Array A of positive integers; P is a positive integer.
Output: Array SA of P non-negative integers representing the length of each subarray of A where the sum of these subarray lengths is equal to the length of A.
Measure: abs(sum(sa)-sum(A)/P) is minimal for each sa ∈ {sa | sa = (Ai, …, Ai+‍SAj) for i = (Σ SAj), j from 0 to P-1}.
The input and output define the set of valid solutions. The measure defines a measure to compare multiple valid solutions. And since we’re looking for a solution with the least difference to the perfect solution (minimization problem), measure should also be minimal.
With this information, it is quite easy to implement the measure function (here in Python):
def measure(a, sa):
sigma = sum(a)/len(sa)
diff = 0
i = 0
for j in xrange(0, len(sa)):
diff += abs(sum(a[i:i+sa[j]])-sigma)
i += sa[j]
return diff
print measure([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], [3,4,4,5]) # prints 8
Now finding an optimal solution is a little harder.
We can use the Backtracking algorithm for finding valid solutions and use the measure function to rate them. We basically try all possible combinations of P non-negative integer numbers that sum up to length(A) to represent all possible valid solutions. Although this ensures not to miss a valid solution, it is basically a brute-force approach with the benefit that we can omit some branches that cannot be any better than our yet best solution. E.g. in the example above, we wouldn’t need to test solutions with [9,…] (measure > 38) if we already have a solution with measure ≤ 38.
Following the pseudocode pattern from Wikipedia, our bt function looks as follows:
def bt(c):
global P, optimum, optimum_diff
if reject(P,c):
return
if accept(P,c):
print "%r with %d" % (c, measure(P,c))
if measure(P,c) < optimum_diff:
optimum = c
optimum_diff = measure(P,c)
return
s = first(P,c)
while s is not None:
bt(list(s))
s = next(P,s)
The global variables P, optimum, and optimum_diff represent the problem instance holding the values for A, P, and sigma, as well as the optimal solution and its measure:
class MinimalSumOfSubArraySumsProblem:
def __init__(self, a, p):
self.a = a
self.p = p
self.sigma = sum(a)/p
Next we specify the reject and accept functions that are quite straight forward:
def reject(P,c):
return optimum_diff < measure(P,c)
def accept(P,c):
return None not in c
This simply rejects any candidate whose measure is already more than our yet optimal solution. And we’re accepting any valid solution.
The measure function is also slightly changed due to the fact that c can now contain None values:
def measure(P, c):
diff = 0
i = 0
for j in xrange(0, P.p):
if c[j] is None:
break;
diff += abs(sum(P.a[i:i+c[j]])-P.sigma)
i += c[j]
return diff
The remaining two function first and next are a little more complicated:
def first(P,c):
t = 0
is_complete = True
for i in xrange(0, len(c)):
if c[i] is None:
if i+1 < len(c):
c[i] = 0
else:
c[i] = len(P.a) - t
is_complete = False
break;
else:
t += c[i]
if is_complete:
return None
return c
def next(P,s):
t = 0
for i in xrange(0, len(s)):
t += s[i]
if i+1 >= len(s) or s[i+1] is None:
if t+1 > len(P.a):
return None
else:
s[i] += 1
return s
Basically, first either replaces the next None value in the list with either 0 if it’s not the last value in the list or with the remainder to represent a valid solution (little optimization here) if it’s the last value in the list, or it return None if there is no None value in the list. next simply increments the rightmost integer by one or returns None if an increment would breach the total limit.
Now all you need is to create a problem instance, initialize the global variables and call bt with the root:
P = MinimalSumOfSubArraySumsProblem([2,4,6,7,6,3,3,3,4,3,4,4,4,3,3,1], 4)
optimum = None
optimum_diff = float("inf")
bt([None]*P.p)

If I am not mistaken here, one more approach is dynamic programming.
You can define P[ pos, n ] as the smallest possible "penalty" accumulated up to position pos if n subarrays were created. Obviously there is some position pos' such that
P[pos', n-1] + penalty(pos', pos) = P[pos, n]
You can just minimize over pos' = 1..pos.
The naive implementation will run in O(N^2 * M), where N - size of the original array and M - number of divisions.

#Gumbo 's answer is clear and actionable, but consumes lots of time when length(A) bigger than 400 and P bigger than 8. This is because that algorithm is kind of brute-forcing with benefits as he said.
In fact, a very fast solution is using dynamic programming.
Given an array A of positive integers and a positive integer P, separate the array A into P non-overlapping subarrays such that the difference between the sum of each subarray and the perfect sum of the subarrays (sum(A)/P) is minimal.
Measure: , where is sum of elements of subarray , is the average of P subarray' sums.
This can make sure the balance of sum, because it use the definition of Standard Deviation.
Persuming that array A has N elements; Q(i,j) means the minimum Measure value when split the last i elements of A into j subarrays. D(i,j) means (sum(B)-sum(A)/P)^2 when array B consists of the i~jth elements of A ( 0<=i<=j<N ).
The minimum measure of the question is to calculate Q(N,P). And we find that:
Q(N,P)=MIN{Q(N-1,P-1)+D(0,0); Q(N-2,P-1)+D(0,1); ...; Q(N-1,P-1)+D(0,N-P)}
So it like can be solved by dynamic programming.
Q(i,1) = D(N-i,N-1)
Q(i,j) = MIN{ Q(i-1,j-1)+D(N-i,N-i);
Q(i-2,j-1)+D(N-i,N-i+1);
...;
Q(j-1,j-1)+D(N-i,N-j)}
So the algorithm step is:
1. Cal j=1:
Q(1,1), Q(2,1)... Q(3,1)
2. Cal j=2:
Q(2,2) = MIN{Q(1,1)+D(N-2,N-2)};
Q(3,2) = MIN{Q(2,1)+D(N-3,N-3); Q(1,1)+D(N-3,N-2)}
Q(4,2) = MIN{Q(3,1)+D(N-4,N-4); Q(2,1)+D(N-4,N-3); Q(1,1)+D(N-4,N-2)}
... Cal j=...
P. Cal j=P:
Q(P,P), Q(P+1,P)...Q(N,P)
The final minimum Measure value is stored as Q(N,P)!
To trace each subarray's length, you can store the
MIN choice when calculate Q(i,j)=MIN{Q+D...}
space for D(i,j);
time for calculate Q(N,P)
compared to the pure brute-forcing algorithm consumes time.

Working code below (I used php language). This code decides part quantity itself;
$main = array(2,4,6,1,6,3,2,3,4,3,4,1,4,7,3,1,2,1,3,4,1,7,2,4,1,2,3,1,1,1,1,4,5,7,8,9,8,0);
$pa=0;
for($i=0;$i < count($main); $i++){
$p[]= $main[$i];
if(abs(15 - array_sum($p)) < abs(15 - (array_sum($p)+$main[$i+1])))
{
$pa=$pa+1;
$pi[] = $i+1;
$pc = count($pi);
$ba = $pi[$pc-2] ;
$part[$pa] = array_slice( $main, $ba, count($p));
unset($p);
}
}
print_r($part);
for($s=1;$s<count($part);$s++){
echo '<br>';
echo array_sum($part[$s]);
}
code will output part sums like as below
13
14
16
14
15
15
17

I'm wondering whether the following would work:
Go from the left, as soon as sum > sigma, branch into two, one including the value that pushes it over, and one that doesn't. Recursively process data to the right with rightSum = totalSum-leftSum and rightP = P-1.
So, at the start, sum = 60
2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1
Then for 2 4 6 7, sum = 19 > sigma, so split into:
2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1
2 4 6 7 6 3 3 3 4 3 4 4 4 3 3 1
Then we process 7 6 3 3 3 4 3 4 4 4 3 3 1 and 6 3 3 3 4 3 4 4 4 3 3 1 with P = 4-1 and sum = 60-12 and sum = 60-19 respectively.
This results in, I think, O(P*n).
It might be a problem when 1 or 2 values is by far the largest, but, for any value >= sigma, we can probably just put that in it's own partition (preprocessing the array to find these might be the best idea (and reduce sum appropriately)).
If it works, it should hopefully minimise sum-of-squared-error (or close to that), which seems like the desired measure.

I propose an algorithm based on backtracking. The main function chosen randomly select an element from the original array and adds it to an array partitioned. For each addition will check to obtain a better solution than the original. This will be achieved by using a function that calculates the deviation, distinguishing each adding a new element to the page. Anyway, I thought it would be good to add an original variables in loops that you can not reach desired solution will force the program ends. By desired solution I means to add all elements with respect of condition imposed by condition from if.
sum=CalculateSum(vector)
Read P
sigma=sum/P
initialize P vectors, with names vector_partition[i], i=1..P
list_vector initialize a list what pointed this P vectors
initialize a diferences_vector with dimension of P
//that can easy visualize like a vector of vectors
//construct a non-recursive backtracking algorithm
function Deviation(vector) //function for calculate deviation of elements from a vector
{
dev=0
for i=0 to Size(vector)-1 do
dev+=|vector[i+1]-vector[i]|
return dev
}
iteration=0
//fix some maximum number of iteration for while loop
Read max_iteration
//as the number of iterations will be higher the more it will get
//a more accurate solution
while(!IsEmpty(vector))
{
for i=1 to Size(list_vector) do
{
if(IsEmpty(vector)) break from while loop
initial_deviation=Deviation(list_vector[i])
el=SelectElement(vector) //you can implement that function using a randomized
//choice of element
difference_vector[i]=|sigma-CalculateSum(list_vector[i])|
PutOnBackVector(vector_list[i], el)
if(initial_deviation>Deviation(difference_vector))
ExtractFromBackVectorAndPutOnSecondVector(list_vector, vector)
}
iteration++
//prevent to enter in some infinite loop
if (iteration>max_iteration) break from while loop
}
You can change this by adding in first if some code witch increment with a amount the calculated deviation.
aditional_amount=0
iteration=0
while
{
...
if(initial_deviation>Deviation(difference_vector)+additional_amount)
ExtractFromBackVectorAndPutOnSecondVector(list_vector, vector)
if(iteration>max_iteration)
{
iteration=0
aditional_amout+=1/some_constant
}
iteration++
//delete second if from first version
}

Your problem is very similar to, or the same as, the minimum makespan scheduling problem, depending on how you define your objective. In the case that you want to minimize the maximum |sum_i - sigma|, it is exactly that problem.
As referenced in the Wikipedia article, this problem is NP-complete for p > 2. Graham's list scheduling algorithm is optimal for p <= 3, and provides an approximation ratio of 2 - 1/p. You can check out the Wikipedia article for other algorithms and their approximation.
All the algorithms given on this page are either solving for a different objective, incorrect/suboptimal, or can be used to solve any problem in NP :)

This is very similar to the case of the one-dimensional bin packing problem, see http://www.cs.sunysb.edu/~algorith/files/bin-packing.shtml. In the associated book, The Algorithm Design Manual, Skienna suggests a first-fit decreasing approach. I.e. figure out your bin size (mean = sum / N), and then allocate the largest remaining object into the first bin that has room for it. You either get to a point where you have to start over-filling a bin, or if you're lucky you get a perfect fit. As Skiena states "First-fit decreasing has an intuitive appeal to it, for we pack the bulky objects first and hope that little objects can fill up the cracks."
As a previous poster said, the problem looks like it's NP-complete, so you're not going to solve it perfectly in reasonable time, and you need to look for heuristics.

I recently needed this and did as follows;
create an initial sub-arrays array of length given sub arrays count. sub arrays should have a sum property too. ie [[sum:0],[sum:0]...[sum:0]]
sort the main array descending.
search for the sub-array with the smallest sum and insert one item from main array and increment the sub arrays sum property by the inserted item's value.
repeat item 3 up until the end of main array is reached.
return the initial array.
This is the code in JS.
function groupTasks(tasks,groupCount){
var sum = tasks.reduce((p,c) => p+c),
initial = [...Array(groupCount)].map(sa => (sa = [], sa.sum = 0, sa));
return tasks.sort((a,b) => b-a)
.reduce((groups,task) => { var group = groups.reduce((p,c) => p.sum < c.sum ? p : c);
group.push(task);
group.sum += task;
return groups;
},initial);
}
var tasks = [...Array(50)].map(_ => ~~(Math.random()*10)+1), // create an array of 100 random elements among 1 to 10
result = groupTasks(tasks,7); // distribute them into 10 sub arrays with closest sums
console.log("input array:", JSON.stringify(tasks));
console.log(result.map(r=> [JSON.stringify(r),"sum: " + r.sum]));

You can use Max Flow algorithm.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Efficient way to generate histogram from very large dataset in MATLAB? - arrays

Related

A matrix and a column vector containing indices, how to iterate with no loop?

Compute the product of the next n elements in array

How to increment some of elements in an array by specific values in MATLAB

Is there a better/faster way of randomly shuffling a matrix in MATLAB?

Algorithm to split an array into P subarrays of balanced sum

Categories

Resources