Random sampling of elements from an array based on a target condition

I have an array (let's call it ElmInfo) of size Nx2 representing a geometry. Column 1 holds the element number and column 2 the element volume. The element volumes vary widely. The sum of the volumes of all elements gives a total value V, which can be obtained in MATLAB as:
V=sum(ElmInfo(:,2));
I want to randomly sample elements from ElmInfo (with no repetition) in such a way that the volume of the sampled elements adds up to a target volume V1. Note: V1 is less than V, so the number of elements to be sampled is not known in advance. For example, one sampling run may select 10 elements while another selects 15.
There is no straightforward built-in MATLAB function that meets this target condition. How can I implement the code in MATLAB?

I finally got the answer to my question. Here is the solution I received from a contributor on MATLAB Central. For the convenience of the Stack Overflow community I am posting the answer here.
TotVol = sum(ElmInfo(:,2));
DefVf = 1.5; % This is the volume fraction I want to sample, in percent
% Target sample volume
DefVolm_target = TotVol*(DefVf/100);
% **************************************
n = size(ElmInfo,1); % number of elements
v = ElmInfo(:,2);    % element volumes
tol = 1e-6;          % acceptable deviation from the target volume
sample = [];
maxits = 10000;
for count = 1:maxits
    p = randperm(n);                         % random ordering of the elements
    s = cumsum(v(p));                        % running volume along that ordering
    k = find(abs(s - DefVolm_target) < tol); % is any prefix close enough to the target?
    if ~isempty(k)
        sample_indices = p(1:k(1));
        sample = v(sample_indices);
        fprintf('Sample found after %d iterations\n', count);
        break
    end
end
DefVol_sim = sum(sample);           % achieved sample volume
sampled_Elm = sort(sample_indices); % element numbers of the sampled elements
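Note that if no prefix lands within tol of the target in maxits attempts, sample_indices is left undefined. A minimal fallback sketch (my addition, not part of the original answer) is to keep the closest prefix found so far and accept its small volume error:

best_err = inf;
for count = 1:maxits
    p = randperm(n);
    s = cumsum(v(p));
    [err, k] = min(abs(s - DefVolm_target)); % closest prefix in this ordering
    if err < best_err
        best_err = err;
        sample_indices = p(1:k);
    end
    if best_err < tol, break; end % close enough, stop early
end
sampled_Elm = sort(sample_indices);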

Related

Define a vector with random steps

I want to create an array that has incremental random steps, and I've used this simple code:
t_inici=(0:10*rand:100);
The problem is that the random step stays the same between elements, because rand is evaluated only once when the range expression is built. Is there any simple way to draw a new random number for each step?
If you have a set number of points, say nPts, then you could do the following
nPts = 10;         % Could use 'randi' here for a random number of points
lims = [0, 10];    % Start and end points
x = rand(1, nPts); % Create random numbers
% Sort and scale x to fit your limits and be ordered
x = diff(lims) * ( sort(x) - min(x) ) / (max(x) - min(x)) + lims(1) % max-min works in base MATLAB (minmax needs a toolbox)
This approach always includes your end point, which a 0:dx:10 approach would not necessarily do.
If you had some maximum number of points, say nPtsMax, then you could do the following
nPtsMax = 1000; % Max number of points
lims = [0,10]; % Start and end points
% You could put 10* (or any other multiplier) in front of 'rand', as in your example
x = lims(1) + [0 cumsum(rand(1, nPtsMax))];
x(x > lims(2)) = []; % remove values above maximum limit
This approach may be slower, but is still fairly quick and better represents the behaviour in your question.
My first approach to this would be to randomly generate N-2 samples, where N is the desired number of samples, sort them, and add the extrema:
N=50;
endpoint=100;
initpoint=0;
randsamples=sort(rand(1, N-2)*(endpoint-initpoint)+initpoint);
t_inici=[initpoint randsamples endpoint];
However, I am not sure how "uniformly random" this is, as you are "faking" the two extreme data points to have the extrema included, which will somewhat distort pure randomness (I think). If you are not necessarily interested in including the extrema, then just remove the last line and generate N points. That will make sure they are indeed random (or as random as MATLAB can create them).
Here is an alternative solution that stays "uniformly random":
[initpoint, endpoint, coef] = deal(0, 100, 10);
t_inici(1) = initpoint;
while t_inici(end) < endpoint
    t_inici(end+1) = t_inici(end) + rand()*coef; % add a random step of up to coef
end
t_inici(end) = []; % drop the last entry, which overshot the end point
In my view, this fits your attempt well: the number of steps is unknown, the vector starts from 0, but it does not necessarily end at 100.
From your code it seems you want a uniformly random step that varies between each two entries. This implies that the number of entries that the vector will have is unknown in advance.
A way to do that is as follows. This is similar to Hunter Jiang's answer but adds entries in batches instead of one by one, in order to reduce the number of loop iterations.
Guess a number of required entries, n. Any value will do, but a large value will result in fewer iterations and will probably be more efficient.
Initialize result to the first value.
Generate n entries and concatenate them to the (temporary) result.
See if the current entries are already too many.
If they are, cut as needed and output (final) result. Else go back to step 3.
Code:
lower_value = 0;
upper_value = 100;
step_scale = 10;
n = 5*(upper_value-lower_value)/step_scale*2; % STEP 1. The number 5 here is arbitrary.
% It's probably more efficient to err with too many than with too few
result = lower_value; % STEP 2
done = false;
while ~done
    result = [result result(end)+cumsum(step_scale*rand(1,n))]; % STEP 3. Append
                                                                % n new entries
    ind_final = find(result>upper_value,1)-1; % STEP 4. Index of the last entry not
                                              % exceeding upper_value, if any entry does
    if ind_final % STEP 5. If non-empty, we're done
        result = result(1:ind_final);
        done = true;
    end
end

Compute the product of the next n elements in array

I would like to compute the product of the next n adjacent elements of a matrix. The number n of elements to be multiplied should be given as the function's input.
For example, for this input I should compute the product of every 3 consecutive elements, starting from the first:
[p, ind] = max_product([1 2 2 1 3 1],3);
This gives [1*2*2, 2*2*1, 2*1*3, 1*3*1] = [4,4,6,3].
Is there any practical way to do it? Now I do this using:
for ii = 1:(length(v)-2)
    p = prod(v(ii:ii+n-1));
end
where v is the input vector and n is the number of elements to be multiplied. In this example n=3, but n can take any positive integer value.
Depending on whether n is odd or even, or length(v) is odd or even, I sometimes get right answers but sometimes an error.
For example for arguments:
v = [1.35912281237829 -0.958120385352704 -0.553335935098461 1.44601450110386 1.43760259196739 0.0266423803393867 0.417039432979809 1.14033971399183 -0.418125096873537 -1.99362640306847 -0.589833539347417 -0.218969651537063 1.49863539349242 0.338844452879616 1.34169199365703 0.181185490389383 0.102817336496793 0.104835620599133 -2.70026800170358 1.46129128974515 0.64413523430416 0.921962619821458 0.568712984110933]
n = 7
I get the error:
Index exceeds matrix dimensions.
Error in max_product (line 6)
p = prod(v(ii:ii+n-1));
Is there any correct general way to do it?
Based on the solution in Fast numpy rolling_product, I'd like to suggest a MATLAB version of it, which leverages the movsum function introduced in R2016a.
The mathematical reasoning is that a product of numbers equals the exponential of the sum of their logarithms, i.e. a1*a2*...*ak = exp(log(a1) + log(a2) + ... + log(ak)).
A possible MATLAB implementation of the above may look like this:
function P = movprod(vec,window_sz)
    P = exp(movsum(log(vec),[0 window_sz-1],'Endpoints','discard'));
    if isreal(vec) % Ensures correct outputs when the input contains negative
        P = real(P); % entries, whose logarithms are complex.
    end
end
Several notes:
I haven't benchmarked this solution, and do not know how it compares in terms of performance to the other suggestions.
It should work correctly with vectors containing zero and/or negative and/or complex elements.
It can be easily expanded to accept a dimension to operate along (for array inputs), and any other customization afforded by movsum.
The 1st input is assumed to be either a double or a complex double row vector.
Outputs may require rounding.
Update
Inspired by Dev-iL's nicely thought-out answer, here comes this handy solution, which does not require MATLAB R2016a or above:
out = real( exp(conv(log(a),ones(1,n),'valid')) )
The basic idea is to transform the multiplication into a sum of logarithms, so that a moving sum can be used, which in turn can be realised by convolution.
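As a quick check, applying this to the example from the question reproduces the expected output:

a = [1 2 2 1 3 1];
n = 3;
out = real(exp(conv(log(a), ones(1,n), 'valid'))) % -> [4 4 6 3]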
Old answers
This is one way using gallery to get a circulant matrix and indexing the relevant part of the resulting matrix before multiplying the elements:
a = [1 2 2 1 3 1]
n = 3
%// circulant matrix
tmp = gallery('circul', a(:))
%// product of relevant parts of matrix
out = prod(tmp(end-n+1:-1:1, end-n+1:end), 2)
out =
4
4
6
3
More memory efficient alternative in case there are no zeros in the input:
a = [10 9 8 7 6 5 4 3 2 1]
n = 2
%// cumulative product
x = [1 cumprod(a)]
%// shifted by n and divided by itself
y = circshift( x,[0 -n] )./x
%// remove last elements
out = y(1:end-n)
out =
90 72 56 42 30 20 12 6 2
Your approach is correct. You should just change the for loop to for ii = 1:(length(v)-n+1) (and store each product, e.g. p(ii) = prod(v(ii:ii+n-1));) and then it will work fine.
If you are not going to deal with large inputs, another approach is using gallery as explained in @thewaywewalk's answer.
I think the problem may be based on your indexing. The line that states for ii = 1:(length(v)-2) does not provide the correct range of ii.
Try this:
function out = max_product(in, sz) % 'sz' avoids shadowing the built-in 'size'
    sz = sz - 1;                   % this is because we add sz to i later
    out = zeros(length(in)-sz, 1); % preallocate the output (a column vector)
    for i = 1:length(in)-sz
        out(i) = prod(in(i:i+sz));
    end
end
Your code works when restated like so:
for ii = 1:(length(v)-(n-1))
    p = prod(v(ii:ii+(n-1)));
end
That should take care of the indexing problem.
Using bsxfun you create a matrix whose rows contain n consecutive elements, then take prod along the 2nd dimension of the matrix. I think this is the most efficient way:
max_product = @(v, n) prod(v(bsxfun(@plus, (1 : n), (0 : numel(v)-n)')), 2);
p = max_product([1 2 2 1 3 1],3)
Update:
Some other solutions have been updated, and some, such as @Dev-iL's answer, outperform the others. I can also suggest fftconv, which in Octave outperforms conv.
If you can upgrade to R2017a, you can use the new movprod function to compute a windowed product.
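For example, a trailing-window call that matches the question's semantics might look like this (a sketch; movprod takes the same windowing and 'Endpoints' options as movsum):

v = [1 2 2 1 3 1];
n = 3;
p = movprod(v, [0 n-1], 'Endpoints', 'discard') % -> [4 4 6 3]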

Smallest sum of subarray with sum greater than a given value

Input: Array of N positive numbers and a value X such that N is small compared to X
Output: A subarray with the sum of all its numbers equal to Y > X, such that there is no other subarray whose sum is bigger than X but smaller than Y.
Is there a polynomial solution to this question? If so, can you present it?
As the other answers indicate, this is an NP-complete problem known as the "Knapsack Problem", so no polynomial-time solution is known (unless P = NP). But it has a pseudo-polynomial time algorithm. This explains what pseudo-polynomial means.
A visual explanation of the algorithm.
And some code.
If this is work related (I have met this problem a few times already, in various disguises) I suggest introducing additional restrictions to simplify it. If it is a general question, you may want to check other NP-complete problems as well. One such list.
Edit 1:
AliVar made a good point. The given problem searches for Y > X while the knapsack problem searches for Y < X, so the answer to this problem needs a few more steps. When we are trying to find the minimum sum Y > X, we are equivalently looking for the maximum complement sum S < (Total - X); that second part is the original knapsack problem. So:
Find the total.
Solve knapsack for S < (Total - X).
Subtract the list of items in the knapsack solution from the original input.
This should give you the minimum Y > X.
Let A be our array. Here is an O(X*N) algorithm:
initialize set S = {0}
initialize map<int, int> parent
best_sum = inf
best_parent = -1
for a in A
    Sn = {}
    for s in S
        t = s + a
        if t > X and t < best_sum
            best_sum = t
            best_parent = s
        end if
        if t <= X
            Sn.add(t)
            parent[t] = s
        end if
    end for
    S = S unite with Sn
end for
To recover the elements that make up the best sum, print the numbers:
Subarray = {best_sum - best_parent}
t = best_parent
while t in parent.keys()
    Subarray.add(t - parent[t])
    t = parent[t]
end while
print Subarray
The idea is similar to dynamic programming. We simply calculate all reachable sums (those that can be obtained as a subset sum) that are at most X. For each element a in the array A you can either choose to include it in the sum or not. At the update step S = S unite with Sn, S represents all sums in which a does not participate, while Sn holds all sums in which a does participate.
You could represent S as a boolean array, setting an item to true if that sum is in the set. Note that the length of this boolean array would be at most X.
Overall, the algorithm is O(X*N) with memory usage O(X).
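For concreteness, here is a minimal MATLAB sketch of this dynamic program (my translation, not code from the original answer), assuming positive integer inputs and that at least one subset sum exceeds X:

function [bestSum, subset] = min_sum_above(A, X)
% Sketch of the O(X*N) dynamic program described above.
    reach = false(1, X+1);     % reach(s+1) is true iff subset sum s is attainable
    reach(1) = true;           % the empty subset gives sum 0
    parentOf = -ones(1, X+1);  % predecessor sum used to reach each sum
    bestSum = inf;
    bestParent = -1;
    for a = A(:).'
        % find() takes a snapshot of the currently reachable sums, so sums
        % added below cannot reuse the same element; this plays the role of
        % the separate set Sn in the pseudocode
        for s = find(reach) - 1
            t = s + a;
            if t > X && t < bestSum
                bestSum = t;
                bestParent = s;
            elseif t <= X && ~reach(t+1)
                reach(t+1) = true;
                parentOf(t+1) = s;
            end
        end
    end
    % Walk the parent chain back to sum 0 to recover the chosen elements
    subset = bestSum - bestParent;
    t = bestParent;
    while t > 0
        subset(end+1) = t - parentOf(t+1); %#ok<AGROW>
        t = parentOf(t+1);
    end
end

For instance, min_sum_above([1 2 3], 4) returns bestSum = 5 with subset [3 2].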
I think this problem is NP-hard and the subset sum can be reduced to it. Here is my reduction:
For an instance of subset sum with set S={x1,...,xn}, it is desired to find a subset with sum t. Suppose d is the minimum distance between two non-equal xi and xj. Build S'={x1+d/n,...,xn+d/n} and feed it to your problem. Suppose that your problem finds an answer, i.e. a subset D' of S' with sum Y > t, which is the smallest sum with this property. Name the set of original members of D' as D. Three cases may happen:
1) Y = t + |D|*d/n which means D is the solution to the original subset sum problem.
2) Y > t + |D|*d/n which means no answer set can be found for the original problem.
3) Y < t + |D|*d/n. In this case assign t=Y and repeat the problem. Since the value for the new t is increased, this case will not repeat exponentially. Therefore, the procedure terminates in polynomial time.

Matlab: Sum corresponding values if index is within a range

I have been going crazy trying to figure out a way to speed this up. Right now my current code takes ~200 sec to loop over 77000 events. I was hoping someone might be able to help me speed this up, because I have to do about 500 of these.
Problem:
I have arrays (both 200000x1) that correspond to the energy and position of a hit over 77000 events. The range of each event is given by two arrays, event_start and event_end. First I look for the positions in a specific range, then I put the corresponding energies in their own array. To get what I need out of this information, I loop through each event and its corresponding start/end to sum up all the energies from each hit. My code is below:
indx_pos = find(pos>0.7 & pos<2.0);
energy = HitEnergy(indx_pos);
for i = 1:n_events
    Etotal(i) = sum(energy(find(indx_pos>=event_start(i) ...
                                & indx_pos<=event_end(i))));
end
Sample input & output:
% Sample input
% pos and energy same length
n_events = 3;
event_start = [1 3 7]';
event_end = [2 6 8]';
pos = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';
% Sample Output
Etotal = 0.0060
0.0800
0
Approach #1: Generic case
One approach with bsxfun and matrix-multiplication -
mask = bsxfun(@ge,indx_pos,event_start.') & bsxfun(@le,indx_pos,event_end.')
Etotal = energy.'*mask
This could be a bit memory-hungry if indx_pos has lots of elements in it.
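On MATLAB R2016b or newer, implicit expansion can replace the bsxfun calls; a minimal equivalent sketch:

mask = indx_pos >= event_start.' & indx_pos <= event_end.';
Etotal = energy.' * mask;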
Approach #2: Non-overlapping start/end ranges case
One can use accumarray for this special case like so -
%// Setup ID array for use in accumarray later on
loc(numel(pos))=0; %// Fast pre-allocation scheme
valids = event_end+1<=numel(pos);
loc(event_end(valids)+1) = -1*(1:sum(valids));
loc(event_start) = loc(event_start)+(1:numel(event_end));
id = cumsum(loc);
%// Set elements as zeros in HitEnergy that do not satisfy the criteria:
%// pos>0.7 & pos<2.0
HitEnergy_select = (pos>0.7 & pos<2.0).*HitEnergy(:);
%// Discard elements in HitEnergy_select & id that have IDs as zeros
HitEnergy_select = HitEnergy_select(id~=0);
id = id(id~=0);
%// Accumulate summations as done inside the loop in the original code
Etotal = accumarray(id(:),HitEnergy_select);
The problem is that for every event you are searching the entire vector indx_pos.
Constrain your search inside the loop to only the range from event_start(i) to event_end(i):
for i = 1:n_events
    I = event_start(i):event_end(i);
    posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
    Etotal(i) = sum(HitEnergy(I(posIIsWithinRange)));
end
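As a quick sanity check (my addition), running this loop on the sample input from the question, with Etotal preallocated, reproduces the expected output:

n_events = 3;
event_start = [1 3 7]'; event_end = [2 6 8]';
pos = [0.75 0.8 2.1 3.6 1.9 0.5 21.0 3.1]';
HitEnergy = [0.002 0.004 0.01 0.0005 0.08 0.1 1.7 0.007]';
Etotal = zeros(n_events, 1); % preallocate
for i = 1:n_events
    I = event_start(i):event_end(i);
    posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
    Etotal(i) = sum(HitEnergy(I(posIIsWithinRange)));
end
Etotal % -> [0.0060; 0.0800; 0]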
You could also use a vectorized version based on run length decoding and vectorizing the notion of colon. (Download the functions coloncatrld and runLengthDecode.)
I = coloncatrld(event_start, event_end);
energy = HitEnergy(I);
eventNum = runLengthDecode(event_end - event_start+1);
posIIsWithinRange = pos(I)>0.7 & pos(I)<2.0;
Etotal = accumarray(eventNum(posIIsWithinRange), energy(posIIsWithinRange), [n_events,1]);
This is similar to Divakar's Approach #2 with the addition that it should work for overlapping ranges too.

Optimize parameters of a pairwise distance function in Matlab

This question is related to matlab: find the index of common values at the same entry from two arrays.
Suppose that I have a 1000-by-10000 matrix that contains the values 0, 1, and 2. Each row is treated as a sample. I want to calculate the pairwise distance between those samples according to the formula d = 1 - (1/(2p)) * sum(a/c + b/d), where a, b, c, d can be treated as row vectors of length 10000 defined entrywise according to the cases below, and p = 10000. c and d are probabilities such that c + d = 1.
An example of how to find the values of a, b, c, d: suppose we want to find d between sample i and sample j; then I look at rows i and j.
If the kth entry of rows i and j has values 2 and 2, then a=2, b=0, c=1, d=0 (I guess I will assign 0/0 = 0 in this case).
If the kth entry of rows i and j has values 2 and 1 (or vice versa), then a=1, b=0, c=3/4, d=1/4.
Similar assignments apply to the remaining cases: (2,0) gives a=0, b=0, c=1/2, d=1/2; (1,1) gives a=1, b=1, c=1/2, d=1/2; (1,0) gives a=0, b=1, c=1/4, d=3/4; and (0,0) gives a=0, b=2, c=0, d=1.
The MATLAB code I have so far uses for loops over i and j, finds each of the cases above using find, and then creates two arrays for a/c and b/d. This is extremely slow. Is there a way to improve the efficiency?
Edit: the distance d is the formula given in this paper on page 13.
Provided those coefficients are fixed, I think I've successfully vectorised the distance function. Figuring out the formulae was fun. I flipped things around a bit to minimise division, and since I wasn't aware of pdist until @horchler's comment, you get it wrapped in loops with the constants factored out:
% m is the data
[n, p] = size(m);
distance = zeros(n);
for ii = 1:n
    for jj = ii+1:n
        a = min(m(ii,:), m(jj,:));
        b = 2 - max(m(ii,:), m(jj,:));
        c = 4 ./ (m(ii,:) + m(jj,:));     % reciprocal of the question's c = (mi+mj)/4
        c(c == Inf) = 0;                  % 0/0 := 0 convention
        d = 4 ./ (4 - m(ii,:) - m(jj,:)); % reciprocal of the question's d = 1 - c
        d(d == Inf) = 0;                  % 0/0 := 0 convention
        distance(ii,jj) = sum(a.*c + b.*d); % equals sum(a/c + b/d) in the question's notation
        % distance(jj,ii) = distance(ii,jj); % optional for the full matrix
    end
end
distance = 1 - (1 / (2 * p)) * distance;
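If the Statistics and Machine Learning Toolbox is available, the same distance can also be handed to pdist as a custom function. A sketch, assuming implicit expansion (R2016b+); the names pairwise_dist and distrow are mine, not from the original answer:

function D = pairwise_dist(m)
% Same distance computed via pdist with a custom distance function.
    p = size(m, 2);
    D = squareform(pdist(m, @(zi, zj) distrow(zi, zj, p)));
end

function d = distrow(zi, zj, p)
% zi is 1-by-p and zj is m2-by-p, as pdist requires of custom distances.
    a = min(zi, zj);
    b = 2 - max(zi, zj);
    c = 4 ./ (zi + zj);     c(~isfinite(c)) = 0; % 0/0 := 0 convention
    e = 4 ./ (4 - zi - zj); e(~isfinite(e)) = 0;
    d = 1 - sum(a.*c + b.*e, 2) / (2*p);
end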
