Vector search Algorithm - arrays

I have the following problem. Say I have a vector:
v = [1,2,3,4,5,1,2,3,4,...]
I want to sequentially sample points from the vector, that have an absolute maginute difference higher than a threshold from a previously sampled point. So say my threshold is 2.
I start at the index 1, and sample the first point 1. Then my condition is met at v[3], and I sample 3 (since 3-1 >= 2). Then 3, the new sampled point becomes the reference, that I check against. The next sampled point is 5 which is v[5] (5-3 >= 2). Then the next point is 1 which is v[6] (abs(1-5) >= 2).
Unfortunately my code in R, is taking too long. Basically I am scanning the array repeatedly and looking for matches. I think that this approach is naive though. I have a feeling that I can accomplish this task in a single pass through the array. I dont know how though. Any help appreciated. I guess the problem I am running into is that the location of the next sample point can be anywhere in the array, and I need to scan the array from the current point to the end to find it.
Thanks.

I don't see a way this can be done without a loop, so here is one:
my.sample <- function(x, thresh) {
out <- x
i <- 1
for (j in seq_along(x)[-1]) {
if (abs(x[i]-x[j]) >= thresh) {
i <- j
} else {
out[j] <- NA
}
}
out[!is.na(out)]
}
my.sample(x = c(1:5,1:4), thresh = 2)
# [1] 1 3 5 1 3

You can do this without a loop using a bit of recursion:
vsearch = function(v, x, fun=NULL) {
# v: input vector
# x: threshold level
if (!length(v) > 0) return(NULL)
y = v-rep(v[1], times=length(v))
if (!is.null(fun)) y = fun(y)
i = which(y >= x)
if (!length(i) > 0) return(NULL)
i = i[1]
return(c(v[i], vsearch(v[-(1:(i-1))], x, fun=fun)))
}
With your vector above:
> vsearch(c(1,2,3,4,5,1,2,3,4), 2, abs)
[1] 3 5 1 3

Related

Minimize (firstA_max - firstA_min) + (secondB_max - secondB_min)

Given n pairs of integers. Split into two subsets A and B to minimize sum(maximum difference among first values of A, maximum difference among second values of B).
Example : n = 4
{0, 0}; {5;5}; {1; 1}; {3; 4}
A = {{0; 0}; {1; 1}}
B = {{5; 5}; {3; 4}}
(maximum difference among first values of A, maximum difference among second values of B).
(maximum difference among first values of A) = fA_max - fA_min = 1 - 0 = 1
(maximum difference among second values of B) = sB_max - sB_min = 5 - 4 = 1
Therefore, the answer if 1 + 1 = 2. And this is the best way.
Obviously, maximum difference among the values equals to (maximum value - minimum value). Hence, what we need to do is find the minimum of (fA_max - fA_min) + (sB_max - sB_min)
Suppose the given array is arr[], first value if arr[].first and second value is arr[].second.
I think it is quite easy to solve this in quadratic complexity. You just need to sort the array by the first value. Then all the elements in subset A should be picked consecutively in the sorted array. So, you can loop for all ranges [L;R] of the sorted. Each range, try to add all elements in that range into subset A and add all the remains into subset B.
For more detail, this is my C++ code
int calc(pair<int, int> a[], int n){
int m = 1e9, M = -1e9, res = 2e9; //m and M are min and max of all the first values in subset A
for (int l = 1; l <= n; l++){
int g = m, G = M; //g and G are min and max of all the second values in subset B
for(int r = n; r >= l; r--) {
if (r - l + 1 < n){
res = min(res, a[r].first - a[l].first + G - g);
}
g = min(g, a[r].second);
G = max(G, a[r].second);
}
m = min(m, a[l].second);
M = max(M, a[l].second);
}
return res;
}
Now, I want to improve my algorithm down to loglinear complexity. Of course, sort the array by the first value. After that, if I fixed fA_min = a[i].first, then if the index i increase, the fA_max will increase while the (sB_max - sB_min) decrease.
But now I am still stuck here, is there any ways to solve this problem in loglinear complexity?
The following approach is an attempt to escape the n^2, using an argmin list for the second element of the tuples (lets say the y-part). Where the points are sorted regarding x.
One Observation is that there is an optimum solution where A includes index argmin[0] or argmin[n-1] or both.
in get_best_interval_min_max we focus once on including argmin[0] and the next smallest element on y and so one. The we do the same from the max element.
We get two dictionaries {(i,j):(profit, idx)}, telling us how much we gain in y when including points[i:j+1] in A, towards min or max on y. idx is the idx in the argmin array.
calculate the objective for each dict assuming max/min or y is not in A.
combine the results of both dictionaries, : (i1,j1): (v1, idx1) and (i2,j2): (v2, idx2). result : j2 - i1 + max_y - min_y - v1 - v2.
Constraint: idx1 < idx2. Because the indices in the argmin array can not intersect, otherwise some profit in y might be counted twice.
On average the dictionaries (dmin,dmax) are smaller than n, but in the worst case when x and y correlate [(i,i) for i in range(n)] they are exactly n, and we do not win any time. Anyhow on random instances this approach is much faster. Maybe someone can improve upon this.
import numpy as np
from random import randrange
import time
def get_best_interval_min_max(points):# sorted input according to x dim
L = len(points)
argmin_b = np.argsort([p[1] for p in points])
b_min,b_max = points[argmin_b[0]][1], points[argmin_b[L-1]][1]
arg = [argmin_b[0],argmin_b[0]]
res_min = dict()
for i in range(1,L):
res_min[tuple(arg)] = points[argmin_b[i]][1] - points[argmin_b[0]][1],i # the profit in b towards min
if arg[0] > argmin_b[i]: arg[0]=argmin_b[i]
elif arg[1] < argmin_b[i]: arg[1]=argmin_b[i]
arg = [argmin_b[L-1],argmin_b[L-1]]
res_max = dict()
for i in range(L-2,-1,-1):
res_max[tuple(arg)] = points[argmin_b[L-1]][1]-points[argmin_b[i]][1],i # the profit in b towards max
if arg[0]>argmin_b[i]: arg[0]=argmin_b[i]
elif arg[1]<argmin_b[i]: arg[1]=argmin_b[i]
# return the two dicts, difference along y,
return res_min, res_max, b_max-b_min
def argmin_algo(points):
# return the objective value, sets A and B, and the interval for A in points.
points.sort()
# get the profits for different intervals on the sorted array for max and min
dmin, dmax, y_diff = get_best_interval_min_max(points)
key = [None,None]
res_min = 2e9
# the best result when only the min/max b value is includes in A
for d in [dmin,dmax]:
for k,(v,i) in d.items():
res = points[k[1]][0]-points[k[0]][0] + y_diff - v
if res < res_min:
key = k
res_min = res
# combine the results for max and min.
for k1,(v1,i) in dmin.items():
for k2,(v2,j) in dmax.items():
if i > j: break # their argmin_b indices can not intersect!
idx_l, idx_h = min(k1[0], k2[0]), max(k1[1],k2[1]) # get index low and idx hight for combination
res = points[idx_h][0]-points[idx_l][0] -v1 -v2 + y_diff
if res < res_min:
key = (idx_l, idx_h) # new merged interval
res_min = res
return res_min, points[key[0]:key[1]+1], points[:key[0]]+points[key[1]+1:], key
def quadratic_algorithm(points):
points.sort()
m, M, res = 1e9, -1e9, 2e9
idx = (0,0)
for l in range(len(points)):
g, G = m, M
for r in range(len(points)-1,l-1,-1):
if r-l+1 < len(points):
res_n = points[r][0] - points[l][0] + G - g
if res_n < res:
res = res_n
idx = (l,r)
g = min(g, points[r][1])
G = max(G, points[r][1])
m = min(m, points[l][1])
M = max(M, points[l][1])
return res, points[idx[0]:idx[1]+1], points[:idx[0]]+points[idx[1]+1:], idx
# let's try it and compare running times to the quadratic_algorithm
# get some "random" points
c1=0
c2=0
for i in range(100):
points = [(randrange(100), randrange(100)) for i in range(1,200)]
points.sort() # sorted for x dimention
s = time.time()
r1 = argmin_algo(points)
e1 = time.time()
r2 = quadratic_algorithm(points)
e2 = time.time()
c1 += (e1-s)
c2 += (e2-e1)
if not r1[0] == r2[0]:
print(r1,r2)
raise Exception("Error, results are not equal")
print("time of argmin_algo", c1, "time of quadratic_algorithm",c2)
UPDATE: #Luka proved the algorithm described in this answer is not exact. But I will keep it here because it's a good performance heuristics and opens the way to many probabilistic methods.
I will describe a loglinear algorithm. I couldn't find a counter example. But I also couldn't find a proof :/
Let set A be ordered by first element and set B be ordered by second element. They are initially empty. Take floor(n/2) random points of your set of points and put in set A. Put the remaining points in set B. Define this as a partition.
Let's call a partition stable if you can't take an element of set A, put it in B and decrease the objective function and if you can't take an element of set B, put it in A and decrease the objective function. Otherwise, let's call the partition unstable.
For an unstable partition, the only moves that are interesting are the ones that take the first or the last element of A and move to B or take the first or the last element of B and move to A. So, we can find all interesting moves for a given unstable partition in O(1). If an interesting move decreases the objective function, do it. Go like that until the partition becomes stable. I conjecture that it takes at most O(n) moves for the partition to become stable. I also conjecture that at the moment the partition becomes stable, you will have a solution.

Find Minimum Score Possible

Problem statement:
We are given three arrays A1,A2,A3 of lengths n1,n2,n3. Each array contains some (or no) natural numbers (i.e > 0). These numbers denote the program execution times.
The task is to choose the first element from any array and then you can execute that program and remove it from that array.
For example:
if A1=[3,2] (n1=2),
A2=[7] (n2=1),
A3=[1] (n3=1)
then we can execute programs in various orders like [1,7,3,2] or [7,1,3,2] or [3,7,1,2] or [3,1,7,2] or [3,2,1,7] etc.
Now if we take S=[1,3,2,7] as the order of execution the waiting time of various programs would be
for S[0] waiting time = 0, since executed immediately,
for S[1] waiting time = 0+1 = 1, taking previous time into account, similarly,
for S[2] waiting time = 0+1+3 = 4
for S[3] waiting time = 0+1+3+2 = 6
Now the score of array is defined as sum of all wait times = 0 + 1 + 4 + 6 = 11, This is the minimum score we can get from any order of execution.
Our task is to find this minimum score.
How can we solve this problem? I tried with approach trying to pick minimum of three elements each time, but it is not correct because it gets stuck when two or three same elements are encountered.
One more example:
if A1=[23,10,18,43], A2=[7], A3=[13,42] minimum score would be 307.
The simplest way to solve this is with dynamic programming (which runs in cubic time).
For each array A: Suppose you take the first element from array A, i.e. A[0], as the next process. Your total cost is the wait-time contribution of A[0] (i.e., A[0] * (total_remaining_elements - 1)), plus the minimal wait time sum from A[1:] and the rest of the arrays.
Take the minimum cost over each possible first array A, and you'll get the minimum score.
Here's a Python implementation of that idea. It works with any number of arrays, not just three.
def dp_solve(arrays: List[List[int]]) -> int:
"""Given list of arrays representing dependent processing times,
return the smallest sum of wait_time_before_start for all job orders"""
arrays = [x for x in arrays if len(x) > 0] # Remove empty
#functools.lru_cache(100000)
def dp(remaining_elements: Tuple[int],
total_remaining: int) -> int:
"""Returns minimum wait time sum when suffixes of each array
have lengths in 'remaining_elements' """
if total_remaining == 0:
return 0
rem_elements_copy = list(remaining_elements)
best = 10 ** 20
for i, x in enumerate(remaining_elements):
if x == 0:
continue
cost_here = arrays[i][-x] * (total_remaining - 1)
if cost_here >= best:
continue
rem_elements_copy[i] -= 1
best = min(best,
dp(tuple(rem_elements_copy), total_remaining - 1)
+ cost_here)
rem_elements_copy[i] += 1
return best
return dp(tuple(map(len, arrays)), sum(map(len, arrays)))
Better solutions
The naive greedy strategy of 'smallest first element' doesn't work, because it can be worth it to do a longer job to get a much shorter job in the same list done, as the example of
A1 = [100, 1, 2, 3], A2 = [38], A3 = [34],
best solution = [100, 1, 2, 3, 34, 38]
by user3386109 in the comments demonstrates.
A more refined greedy strategy does work. Instead of the smallest first element, consider each possible prefix of the array. We want to pick the array with the smallest prefix, where prefixes are compared by average process time, and perform all the processes in that prefix in order.
A1 = [ 100, 1, 2, 3]
Prefix averages = [(100)/1, (100+1)/2, (100+1+2)/3, (100+1+2+3)/4]
= [ 100.0, 50.5, 34.333, 26.5]
A2=[38]
A3=[34]
Smallest prefix average in any array is 26.5, so pick
the prefix [100, 1, 2, 3] to complete first.
Then [34] is the next prefix, and [38] is the final prefix.
And here's a rough Python implementation of the greedy algorithm. This code computes subarray averages in a completely naive/brute-force way, so the algorithm is still quadratic (but an improvement over the dynamic programming method). Also, it computes 'maximum suffixes' instead of 'minimum prefixes' for ease of coding, but the two strategies are equivalent.
def greedy_solve(arrays: List[List[int]]) -> int:
"""Given list of arrays representing dependent processing times,
return the smallest sum of wait_time_before_start for all job orders"""
def max_suffix_avg(arr: List[int]):
"""Given arr, return value and length of max-average suffix"""
if len(arr) == 0:
return (-math.inf, 0)
best_len = 1
best = -math.inf
curr_sum = 0.0
for i, x in enumerate(reversed(arr), 1):
curr_sum += x
new_avg = curr_sum / i
if new_avg >= best:
best = new_avg
best_len = i
return (best, best_len)
arrays = [x for x in arrays if len(x) > 0] # Remove empty
total_time_sum = sum(sum(x) for x in arrays)
my_averages = [max_suffix_avg(arr) for arr in arrays]
total_cost = 0
while True:
largest_avg_idx = max(range(len(arrays)),
key=lambda y: my_averages[y][0])
_, n_to_remove = my_averages[largest_avg_idx]
if n_to_remove == 0:
break
for _ in range(n_to_remove):
total_time_sum -= arrays[largest_avg_idx].pop()
total_cost += total_time_sum
# Recompute the changed array's avg
my_averages[largest_avg_idx] = max_suffix_avg(arrays[largest_avg_idx])
return total_cost

Array not defined

I'm still confused why am not able to know the results of this small algorithm of my array. the array has almost 1000 number 1-D. am trying to find the peak and the index of each peak. I did found the peaks, but I can't find the index of them. Could you please help me out. I want to plot all my values regardless the indexes.
%clear all
%close all
%clc
%// not generally appreciated
%-----------------------------------
%message1.txt.
%-----------------------------------
% t=linspace(0,tmax,length(x)); %get all numbers
% t1_n=0:0.05:tmax;
x=load('ww.txt');
tmax= length(x) ;
tt= 0:tmax -1;
x4 = x(1:5:end);
t1_n = 1:5:tt;
x1_n_ref=0;
k=0;
for i=1:length(x4)
if x4(i)>170
if x1_n_ref-x4(i)<0
x1_n_ref=x4(i);
alpha=1;
elseif alpha==1 && x1_n_ref-x4(i)>0
k=k+1;
peak(k)=x1_n_ref; // This is my peak value. but I also want to know the index of it. which will represent the time.
%peak_time(k) = t1_n(i); // this is my issue.
alpha=2;
end
else
x1_n_ref=0;
end
end
%----------------------
figure(1)
% plot(t,x,'k','linewidth',2)
hold on
% subplot(2,1,1)
grid
plot( x4,'b'); % ,tt,x,'k'
legend('down-sampling by 5');
Here is you error:
tmax= length(x) ;
tt= 0:tmax -1;
x4 = x(1:5:end);
t1_n = 1:5:tt; % <---
tt is an array containing numbers 0 through tmax-1. Defining t1_n as t1_n = 1:5:tt will not create an array, but an empty matrix. Why? Expression t1_n = 1:5:tt will use only the first value of array tt, hence reduce to t1_n = 1:5:tt = 1:5:0 = <empty matrix>. Naturally, when you later on try to access t1_n as if it were an array (peak_time(k) = t1_n(i)), you'll get an error.
You probably want to exchange t1_n = 1:5:tt with
t1_n = 1:5:tmax;
You need to index the tt array correctly.
you can use
t1_n = tt(1:5:end); % note that this will give a zero based index, rather than a 1 based index, due to t1_n starting at 0. you can use t1_n = 1:tmax if you want 1 based (matlab style)
you can also cut down the code a little, there are some variables that dont seem to be used, or may not be necessary -- including the t1_n variable:
x=load('ww.txt');
tmax= length(x);
x4 = x(1:5:end);
xmin = 170
% now change the code
maxnopeaks = round(tmax/2);
peaks(maxnopeaks)=0; % preallocate the peaks for speed
index(maxnopeaks)=0; % preallocate index for speed
i = 0;
for n = 2 : tmax-1
if x(n) > xmin
if x(n) >= x(n-1) & x(n) >= x(n+1)
i = i+1;
peaks(i) = t(n);
index(i) = n;
end
end
end
% now trim the excess values (if any)
peaks = peaks(1:i);
index = index(1:i);

Gnuplot: Nested “plot” iteration (“plot for”) with dependent loop indices

I have recently attempted to concisely draw several graphs in a plot using gnuplot and the plot for ... syntax. In this case, I needed nested loops because I wanted to pass something like the following index combinations (simplified here) to the plot expression:
i = 0, j = 0
i = 1, j = 0
i = 1, j = 1
i = 2, j = 0
i = 2, j = 1
i = 2, j = 2
and so on.
So i loops from 0 to some upper limit N and for each iteration of i, j loops from 0 to i (so i <= j). I tried doing this with the following:
# f(i, j, x) = ...
N = 5
plot for [i=0:N] for [j=0:i] f(i, j, x) title sprintf('j = %d', j)
but this only gives five iterations with j = 0 every time (as shown by the title). So it seems that gnuplot only evaluates the for expressions once, fixing i = 0 at the beginning and not re-evaluating to keep up with changing i values. Something like this has already been hinted at in this answer (“in the plot for ... structure the second index cannot depend on the first one.”).
Is there a simple way to do what I want in gnuplot (i.e. use the combinations of indices given above with some kind of loop)? There is the do for { ... } structure since gnuplot 4.6, but that requires individual statements in its body, so it can’t be used to assemble a single plot statement. I suppose one could use multiplot to get around this, but I’d like to avoid multiplot if possible because it makes things more complicated than seems necessary.
I took your problem personally. For your specific problem you can use a mathematical trick. Remap your indices (i,j) to a single index k, such that
(0,0) -> (0)
(1,0) -> (1)
(1,1) -> (2)
(2,0) -> (3)
...
It can be shown that the relation between i and j and k is
k = i*(i+1)/2 + j
which can be inverted with a bit of algebra
i(k)=floor((sqrt(1+8.*k)-1.)/2.)
j(k)=k-i(k)*(i(k)+1)/2
Now, you can use a single index k in your loop
N = 5
kmax = N*(N+1)/2 + N
plot for [k=0:kmax] f(i(k), j(k), x) title sprintf('j = %d', j(k))

Finding blocks in arrays

I was looking over some interview questions, and I stumbled onto this one:
There's an m x n array. A block in the array is denoted by a 1 and a 0 indicates no block. You are supposed to find the number of objects in the array. A object is nothing but a set of blocks that are connected horizontally and/or vertically.
eg
0 1 0 0
0 1 0 0
0 1 1 0
0 0 0 0
0 1 1 0
Answer: There are 2 objects in this array. The L shape object and the object in the last row.
I'm having trouble coming up with an algorithm that would catch a 'u' (as below) shape. How should i approach this?
0 1 0 1
0 1 0 1
0 1 1 1
0 0 0 0
0 1 1 0
One approach would use Flood Fill. The algorithm would be something like this:
for row in block_array:
for block in row:
if BLOCK IS A ONE and BLOCK NOT VISITED:
FLOOD_FILL starting from BLOCK
You'd mark items visited during the flood fill process, and track shapes from there as well.
This works in C#
static void Main()
{
int[][] array = { new int[] { 0, 1, 0, 1 }, new int[] { 0, 1, 0, 1 }, new int[] { 0, 1, 1, 1 }, new int[] { 0, 0, 0, 0 }, new int[] { 0, 1, 1, 0 } };
Console.WriteLine(GetNumber(array));
Console.ReadKey();
}
static int GetNumber(int[][] array)
{
int objects = 0;
for (int i = 0; i < array.Length; i++)
for (int j = 0; j < array[i].Length; j++)
if (ClearObjects(array, i, j))
objects++;
return objects;
}
static bool ClearObjects(int[][] array, int x, int y)
{
if (x < 0 || y < 0 || x >= array.Length || y >= array[x].Length) return false;
if (array[x][y] == 1)
{
array[x][y] = 0;
ClearObjects(array, x - 1, y);
ClearObjects(array, x + 1, y);
ClearObjects(array, x, y - 1);
ClearObjects(array, x, y + 1);
return true;
}
return false;
}
I would use Disjoint sets (connected components).
At the begining, each (i,j) matrix element with value 1 is one element set itself.
Then you can iterate over each matrix element and for each element (i,j) you should join each adjacent position set {(i+1,j),(i-1,j),(i,j+1),(i,j-1)} to (i,j) set if its value is 1.
You can find an implementation of disjoint sets at Disjoint Sets in Python
At the end, the number of diffrent sets is the number of objects.
I would use a disjoint-set datastructure (otherwise known as union-find).
Briefly: for each connected component, build an "inverse tree" using a single link per element as a "parent" pointer. Following the parent pointers will eventually find the root of the tree, which is used for component identification (as it is the same for every member of the connected component). To merge neighboring components, make the root of one component the parent of the other (which will no longer be a root, as it now has a parent).
Two simple optimizations make this datastructure very efficient. One is, make all root queries "collapse" their paths to point directly to the root -- that way, the next query will only need one step. The other is, always use the "deeper" of the two trees as the new root; this requires a maintaining a "rank" score for each root.
In addition, in order to make evaluating neighbors more efficient, you might consider preprocessing your input on a row-by-row basis. That way, a contiguous segment of 1's on the same row can start life as a single connected component, and you can efficiently scan the segments of the previous row based on your neighbor criterion.
My two cents (slash) algorithm:
1. List only the 1's.
2. Group (collect connected ones).
In Haskell:
import Data.List (elemIndices, delete)
example1 =
[[0,1,0,0]
,[0,1,0,0]
,[0,1,1,0]
,[0,0,0,0]
,[0,1,1,0]]
example2 =
[[0,1,0,1]
,[0,1,0,1]
,[0,1,1,1]
,[0,0,0,0]
,[0,1,1,0]]
objects a ws = solve (mapIndexes a) [] where
mapIndexes s =
concatMap (\(y,xs)-> map (\x->(y,x)) xs) $ zip [0..] (map (elemIndices s) ws)
areConnected (y,x) (y',x') =
(y == y' && abs (x-x') == 1) || (x == x' && abs (y-y') == 1)
solve [] r = r
solve (x:xs) r =
let r' = solve' xs [x]
in solve (foldr delete xs r') (r':r)
solve' vs r =
let ys = filter (\y -> any (areConnected y) r) vs
in if null ys then r else solve' (foldr delete vs ys) (ys ++ r)
Output:
*Main> objects 1 example1
[[(4,2),(4,1)],[(2,2),(2,1),(1,1),(0,1)]]
(0.01 secs, 1085360 bytes)
*Main> objects 1 example2
[[(4,2),(4,1)],[(0,3),(1,3),(2,3),(2,2),(2,1),(1,1),(0,1)]]
(0.01 secs, 1613356 bytes)
Why not just look at all the adjacent cells of a given block? Start at some cell that has a 1 in it, keep track of the cells you have visited before, and keep looking through adjacent cells until you cannot find a 1 anymore. Then move onto cells that you have not looked yet and repeat the process.
Something like this should work:
while array has a 1 that's not marked:
Create a new object
Create a Queue
Add the 1 to the queue
While the queue is not empty:
get the 1 on top of the queue
Mark it
Add it to current object
look for its 4 neighbors
if any of them is a 1 and not marked yet, add it to queue

Resources