Algorithm: balanced K-D tree in O(kn log n) using arrays

I tried to implement a balanced K-D tree in O(kn log n). I used k presorted arrays (one sorted array per coordinate) to get O(kn log n), and medians to keep the tree balanced.
The problem I faced is that a median value chosen at one level, for example the median for the x axis, may appear again at a subsequent level, for example for the y axis.
I tried to solve this by splitting the y-sorted array into two arrays using the chosen x value as pivot, but that way does not yield a balanced tree.
Any idea how to get a balanced K-D tree in O(kn log n)?
EDIT
Quoted from the Wikipedia article https://en.wikipedia.org/wiki/K-d_tree:
Alternative algorithms for building a balanced k-d tree presort the
data prior to building the tree. They then maintain the order of the
presort during tree construction and hence eliminate the costly step
of finding the median at each level of subdivision. Two such
algorithms build a balanced k-d tree to sort triangles in order to
improve the execution time of ray tracing for three-dimensional
computer graphics. These algorithms presort n triangles prior to
building the k-d tree, then build the tree in O(n log n) time in the
best case.[5][6] An algorithm that builds a balanced k-d tree to sort
points has a worst-case complexity of O(kn log n).[7] This algorithm
presorts n points in each of k dimensions using an O(n log n) sort
such as Heapsort or Mergesort prior to building the tree. It then
maintains the order of these k presorts during tree construction and
thereby avoids finding the median at each level of subdivision.
Could anyone provide such an algorithm as described above?
EDIT
I came up with a way, but it doesn't work if the median's value on a specific axis is duplicated.
For example
x1 = [ (0, 7), (1, 3), (3, 0), (3, 1), (6, 2) ]
y1 = [ (3, 0), (3, 1), (6, 2), (1, 3), (0, 7) ]
The median on the x axis is 3.
So when we want to split y1 into y11 and y12, we have to use > and < to distribute its elements left and right, with the pivot as delimiter.
But if the median's value on that axis is duplicated, there is no guarantee that either comparison places every point correctly.
Consider the partition on the x axis; there is no problem with the x1 array itself. Completing the first partition step of the example above:
median = (3, 0)
pivot = 3 // the median value on the x axis

y11 = [], y12 = []
for (i = 0; i < y1.size; i++)
    if (y1[i].getX() < pivot)
        y11.add(y1[i])
    else if (y1[i].getX() > pivot)
        y12.add(y1[i])

This results in y11 = [ (1, 3), (0, 7) ] and y12 = [ (6, 2) ]: the point (3, 1) lands in neither array, because its x coordinate equals the pivot.
Any idea how to handle such case?
Or is there another presorting K-D tree algorithm in O(kn log n)?
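For reference, the failing case can be reproduced in a few lines of Python (a sketch mirroring the pseudocode above; the list names follow the example):

```python
# Reproduce the duplicate-median problem from the example above.
x1 = [(0, 7), (1, 3), (3, 0), (3, 1), (6, 2)]  # presorted by x
y1 = [(3, 0), (3, 1), (6, 2), (1, 3), (0, 7)]  # presorted by y
median = (3, 0)    # median of x1
pivot = median[0]  # 3, the median value on the x axis

# Strict < / > comparisons drop every point whose x equals the pivot:
y11 = [p for p in y1 if p[0] < pivot]
y12 = [p for p in y1 if p[0] > pivot]

print(y11)  # [(1, 3), (0, 7)]
print(y12)  # [(6, 2)]
# (3, 1) appears in neither list, so the subtree loses a point.
```

Any rule based on the single split coordinate alone has this problem whenever the median value is duplicated.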

To elaborate on my comment (and Anony-Mousse's answer, probably):
The key idea with pre-sorting when constructing K-D trees is to keep the order during the split. The overhead looks quite high; a comparative benchmark against re-sorting (and k-select) seems in order.
Some proof-of-principle Java source code:
package net.*.coder.greybeard.sandbox;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** finger exercise pre-sorting & split for KD-tree construction
 *  (re. https://stackoverflow.com/q/35225509/3789665) */
public class KDPreSort {
    /** K-dimensional key, dimensions fixed
     *  by number of coordinates in construction */
    static class KKey {
        public static KKey[] NONE = {};
        final Comparable[] coordinates;

        public KKey(Comparable ...coordinates) {
            this.coordinates = coordinates;
        }

        /** @return {@code Comparator<KKey>} for coordinate {@code n} */
        static Comparator<KKey> comparator(int n) { // could be cached
            return new Comparator<KDPreSort.KKey>() {
                @Override
                public int compare(KKey l, KKey r) {
                    return l.coordinates[n]
                        .compareTo(r.coordinates[n]);
                }
            };
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder(
                Arrays.deepToString(coordinates));
            sb.setCharAt(0, '(');
            sb.setCharAt(sb.length()-1, ')');
            return sb.toString();
        }
    }

    // static boolean trimLists = true; // introduced when ArrayList was used in interface

    /** @return two arrays of {@code KKey}s: comparing smaller than
     *  or equal to {@code pivot} (according to {@code comp}),
     *  and greater than pivot -
     *  in the same order as in {@code keys}. */
    static KKey[][] split(KKey[] keys, KKey pivot, Comparator<KKey> comp) {
        int length = keys.length;
        ArrayList<KKey>
            se = new ArrayList<>(length),
            g = new ArrayList<>(length);
        for (KKey k: keys) {
            // pick List to add to
            List<KKey> d = comp.compare(k, pivot) <= 0 ? se : g;
            d.add(k);
        }
        // if (trimLists) { se.trimToSize(); g.trimToSize(); }
        return new KKey[][] { se.toArray(KKey.NONE), g.toArray(KKey.NONE) };
    }

    /** @return two arrays of <em>k</em> arrays of {@code KKey}s:
     *  comparing smaller than or equal to {@code pivot}
     *  (according to {@code comp}), and greater than pivot,
     *  in the same order as in {@code keysByCoordinate}. */
    static KKey[][][]
            splits(KKey[][] keysByCoordinate, KKey pivot, Comparator<KKey> comp) {
        final int length = keysByCoordinate.length;
        KKey[][]
            se = new KKey[length][],
            g = new KKey[length][],
            splits;
        for (int i = 0 ; i < length ; i++) {
            splits = split(keysByCoordinate[i], pivot, comp);
            se[i] = splits[0];
            g[i] = splits[1];
        }
        return new KKey[][][] { se, g };
    }

    // demo
    public static void main(String[] args) {
        // from https://stackoverflow.com/q/17021379/3789665
        Integer [][] coPairs = {// {0, 7}, {1, 3}, {3, 0}, {3, 1}, {6, 2},
            {12, 21}, {13, 27}, {19, 5}, {39, 5}, {49, 63}, {43, 45},
            {41, 22}, {27, 7}, {20, 12}, {32, 11}, {24, 56},
        };
        KKey[] someKeys = new KKey[coPairs.length];
        for (int i = 0; i < coPairs.length; i++) {
            someKeys[i] = new KKey(coPairs[i]);
        }
        // presort
        Arrays.sort(someKeys, KKey.comparator(0));
        List<KKey> x = new ArrayList<>(Arrays.asList(someKeys));
        System.out.println("by x: " + x);
        KKey pivot = someKeys[someKeys.length/2];
        Arrays.sort(someKeys, KKey.comparator(1));
        System.out.println("by y: " + Arrays.deepToString(someKeys));
        // split by x
        KKey[][] allOrdered = new KKey[][] { x.toArray(KKey.NONE), someKeys },
            xSplits[] = splits(allOrdered, pivot, KKey.comparator(0));
        for (KKey[][] c: xSplits)
            System.out.println("split by x of " + pivot + ": "
                + Arrays.deepToString(c));
        // split "higher x" by y
        pivot = xSplits[1][1][xSplits[1][1].length/2];
        KKey[][] ySplits[] = splits(xSplits[1], pivot, KKey.comparator(1));
        for (KKey[][] c: ySplits)
            System.out.println("split by y of " + pivot + ": "
                + Arrays.deepToString(c));
    }
}
(I didn't find a suitable answer/implementation on SE without investing too much energy. The output was unconvincing with your example; with the longer one, I had to re-format it to believe it.
The code looks ugly, in all likelihood because it is: if so inclined, re-read about the licence of code posted on SE, and visit Code Review.)
(Consider that there's voting, accepting, and awarding of bounties, and re-visit Anony-Mousse's answer.)

When splitting the data, you need to retain the sort order.
E.g. using data (x,y) we build
x1 = [ (0, 7), (1, 3), (3, 0), (4, 2), (6, 1) ]
y1 = [ (3, 0), (6, 1), (4, 2), (1, 3), (0, 7) ]
If we split at x now, we need to filter both lists by the record at x=3, y=0.
I.e. split both lists, removing (3, 0): all items with x < 3 go to the respective first list, all with x > 3 to the second (order unchanged):
x1 -> filter to x11 = [ (0, 7), (1, 3) ] x12 = [ (4, 2), (6, 1) ]
y1 -> filter to y11 = [ (1, 3), (0, 7) ] y12 = [ (6, 1), (4, 2) ]
The point is to filter each sorted list by the x values while keeping the sort order (so this is O(n·k) at each of the O(log n) levels). If you used only x1 and reconstructed y11 and y12 from it, you would need to sort again; by necessity, that is the same as sorting once by x and once by y. Here we never sort again, we only select.
I do not think this is much better in practice. Sorting is cheaper than the extra memory.
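For completeness: the duplicate problem from the question disappears if the filter compares whole records (a "super key") instead of a single coordinate, so ties on the split axis are broken by the remaining coordinates. A minimal Python sketch of this order-preserving split (function and variable names are mine, not from the answer above):

```python
def split_preserving_order(keys, median, axis):
    """Partition keys around median, comparing the full tuple
    (split-axis coordinate first, remaining coordinates as tie-breakers)
    so duplicated coordinate values are placed deterministically.
    Relative order within each half is preserved."""
    def super_key(p):
        # rotate coordinates so the split axis comes first
        return p[axis:] + p[:axis]
    m = super_key(median)
    left = [p for p in keys if super_key(p) < m]
    right = [p for p in keys if super_key(p) > m]
    return left, right

x1 = [(0, 7), (1, 3), (3, 0), (3, 1), (6, 2)]  # presorted by x
y1 = [(3, 0), (3, 1), (6, 2), (1, 3), (0, 7)]  # presorted by y
median = (3, 0)

x11, x12 = split_preserving_order(x1, median, axis=0)
y11, y12 = split_preserving_order(y1, median, axis=0)
# Every point except the median lands in exactly one half,
# even though (3, 1) shares the median's x value:
print(y11)  # [(1, 3), (0, 7)]
print(y12)  # [(3, 1), (6, 2)]
```

Each half keeps its presorted order, so it can be reused at the next level without re-sorting.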

Related

minimum operations to make array left part equal to right part

Given an even-length array [a1, a2, ..., an], a beautiful array is one where a[i] == a[i + n/2] for 0 <= i < n/2. Define an operation as changing all array elements equal to value x to value y. What is the minimum number of operations required to make a given array beautiful? All elements are in the range [1, 100000]. Simply counting unmatched pairs (ignoring order) between the left and right halves gives wrong results in some cases, such as [1, 1, 2, 5, 2, 5, 5, 2]: the unmatched pairs are (1, 2), (1, 5), (2, 5), but after changing 2 -> 5, the pairs (1, 2) and (1, 5) become the same. So what is the correct method to solve this problem?
It is a graph question.
For every pair (a[i], a[i + n/2]) where a[i] != a[i + n/2], add an undirected edge between the two values. Note that you shouldn't add multiple edges between the same two values.
Now you essentially need to remove all the edges of the graph; the final answer is the number of operations performed.
Each operation removes one edge: after removing the edge between two vertices, merge the vertices and rewire their edges onto the merged vertex.
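Under this merging view, each connected component of s distinct values needs exactly s - 1 operations (a spanning tree of merges), so the answer is the number of values appearing in some unequal pair minus the number of components. A sketch with union-find (function names are mine):

```python
def min_operations(a):
    """Minimum value-rewrite operations to make a[i] == a[i + n//2]."""
    n = len(a)
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    merges = 0
    for i in range(n // 2):
        u, w = find(a[i]), find(a[i + n // 2])
        if u != w:
            parent[u] = w  # union: one operation merges two values
            merges += 1    # counts edges of a spanning forest
    return merges

print(min_operations([1, 1, 2, 5, 2, 5, 5, 2]))  # 2
```

For the example, the component {1, 2, 5} has size 3, so 2 operations suffice (e.g. 2 -> 5, then 1 -> 5).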

how to random generate a sequence of list that are unobserved before in python 3

Assume I have the code in Python 3
X, Y, Z = 10, 20, 30
data = [[1,3,6],[8,15,29],[8,9,19]] # observe data
Then how can I randomly generate n (not very large) elements that are not in data?
Condition: an element [a, b, c] must not be in data, and 0 < a < X, 0 < b < Y, 0 < c < Z.
[1, 3, 5] is good: it is not in data and its elements satisfy the Condition.
[11, 3, 6] is bad: it does not satisfy the Condition (11 > 10).
For example, when n = 4, I want a list of elements without duplicates:
newdata = [[1,6,6], [8,17,25], [2,6,11], [4,6,12]]
This should do it:
from random import randint

X, Y, Z = 10, 20, 30
data = [[1,3,6],[8,15,29],[8,9,19]]
n = 4

newdata = set()
while len(newdata) < n:
    # 0 < a < X means values 1 .. X-1 (randint is inclusive at both ends)
    l = [randint(1, X - 1), randint(1, Y - 1), randint(1, Z - 1)]
    if l not in data:
        newdata.add(tuple(l))
print(newdata)
Example result:
newdata = {(9, 9, 11), (9, 10, 4), (7, 6, 23), (2, 10, 4)}
It took a small effort, but this seems to work:
from random import *
from pprint import pprint

X, Y, Z = 10, 20, 30
data = [[1,3,6],[8,15,29],[8,9,19]]

while 1:
    newData = []
    try:
        n = int(input("How many lists do you want: "))
    except ValueError:
        print("Please enter an integer.\n")
        continue
    for i in range(n):
        newList = [randrange(1, X), randrange(1, Y), randrange(1, Z)]
        while (newList in data) or (newList in newData):
            newList = [randrange(1, X), randrange(1, Y), randrange(1, Z)]
        newData.append(newList)
    pprint(newData)
This works by creating an empty list, getting a value for n, then entering a loop of exactly n iterations. Each iteration creates a new list that satisfies the bounds; if that list is already in the observed data (or in the output so far), it retries until it isn't. It then appends the list to the output and repeats until the for loop ends after n iterations.
There may be a better way of doing it, but this does the trick.
In case X, Y, Z are not too large you can just create all possible combinations and then sample from this pool:
import itertools as it
import random

x, y, z = 10, 20, 30
pool = it.product(range(1, x), range(1, y), range(1, z))  # 0 < a < X etc.
data = [(1, 3, 6), (8, 15, 29), (8, 9, 19)]
pool = set(pool) - set(data)

n = 4
newdata = random.sample(sorted(pool), n)  # sample needs a sequence, not a set
For higher performance you can use NumPy and the fact that each tuple can be converted to an integer and back by simple enumeration (counting up in z, then y, then x order):
import numpy as np

x, y, z = 100, 200, 300
n = 1000
data = [[1,3,6],[8,15,29],[8,9,19]]

forbidden = [i[0]*y*z + i[1]*z + i[2] for i in data]
pool = np.arange(x*y*z)
mask = np.ones(pool.size, dtype=bool)
mask[forbidden] = False
pool = pool[mask]

newdata = np.random.choice(pool, n, replace=False)
newdata = [(i // (y*z), (i // z) % y, i % z) for i in newdata]

Binning then sorting arrays in each bin but keeping their indices together

I have two arrays whose indices are related: x[0] goes with y[0], so they need to stay organized. I have binned the x array into two bins as shown in the code below.
x = [1,4,7,0,5]
y = [.1,.7,.6,.8,.3]
binx = [0,4,9]
index = np.digitize(x,binx)
Giving me the following:
In [1]: index
Out[1]: array([1, 2, 2, 1, 2])
So far so good. (I think)
The y array is a parameter telling me how well measured each x data point is (.9 is better than .2), so I'm using the following code to keep the best of the y array:
y.sort()
ysorted = y[int(len(y) * .5):]
which gives me:
In [2]: ysorted
Out[2]: [0.6, 0.7, 0.8]
giving me the last 50% of the array. Again, this is what I want.
My question is how do I combine these two operations? From each bin, I need to get the best 50% and put these new values into a new x and new y array. Again, keeping the indices of each array organized. Or is there an easier way to do this? I hope this makes sense.
Many numpy functions have arg... variants that operate "by index" rather than "by value". In your case argsort does what you want:
x = np.asarray(x)  # convert the lists so fancy indexing works
y = np.asarray(y)
order = np.argsort(y)
# order is an array of indices such that y[order] is sorted
top50 = order[len(order) // 2:]
top50x = x[top50]
# top50x are the x values corresponding 1-to-1 to the best 50% of y
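To combine binning and selection, the same argsort idea can be applied within each bin that np.digitize assigns (a sketch using the arrays from the question; variable names are mine):

```python
import numpy as np

x = np.array([1, 4, 7, 0, 5])
y = np.array([.1, .7, .6, .8, .3])
binx = [0, 4, 9]
index = np.digitize(x, binx)          # bin id per point

newx, newy = [], []
for b in np.unique(index):
    in_bin = np.where(index == b)[0]  # indices of points in this bin
    order = in_bin[np.argsort(y[in_bin])]
    best = order[len(order) // 2:]    # top 50% by y, within this bin
    newx.extend(x[best])
    newy.extend(y[best])

print(newx, newy)
```

x and y stay aligned because both are indexed by the same best array.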
You could make a list of pairs from your x and y lists with the zip function:
x = [1,4,7,0,5]
y = [.1,.7,.6,.8,.3]
values = list(zip(x, y))  # list(...) is needed in Python 3, where zip is lazy
# [(1, 0.1), (4, 0.7), (7, 0.6), (0, 0.8), (5, 0.3)]
To sort such a list of pairs by a specific element of each pair, use sort's key parameter:
values.sort(key=lambda pair: pair[1])
# [(1, 0.1), (5, 0.3), (7, 0.6), (4, 0.7), (0, 0.8)]
Then you may do whatever you want with this sorted list of pairs.

Inplace changing position of an element in array by shifting others forward - NumPy

After searching I found no native way or existing solution to efficiently change the position of an element in a numpy array, which seems to me a quite natural operation. For example, if I want to move the 3rd element to the 1st position, it should work like this:
x = np.array([1,2,3,4,5])
f*(x, 3, 1)
print(x)
# array([1, 4, 2, 3, 5])
I'm looking for an f* function here. This is different from rolling all elements; also, for moves in big arrays I want to avoid the copying that insert and delete operations would incur.
Not sure about the efficiency, but here's an approach using masking -
def change_pos(in_arr, pick_idx, put_idx):
    range_arr = np.arange(in_arr.size)
    tmp = in_arr[pick_idx]
    in_arr[range_arr != put_idx] = in_arr[range_arr != pick_idx]
    in_arr[put_idx] = tmp
This would support both forward and backward movement.
Sample runs
1) Element moving backward -
In [542]: in_arr
Out[542]: array([4, 9, 3, 6, 8, 0, 2, 1])
*
In [543]: change_pos(in_arr,6,1)
In [544]: in_arr
Out[544]: array([4, 2, 9, 3, 6, 8, 0, 1])
^
2) Element moving forward -
In [546]: in_arr
Out[546]: array([4, 9, 3, 6, 8, 0, 2, 1])
*
In [547]: change_pos(in_arr,1,6)
In [548]: in_arr
Out[548]: array([4, 3, 6, 8, 0, 2, 9, 1])
^
With the small example, this wholesale copy tests faster than @Divakar's masked in-place copy:
def foo4(arr, i, j):
    L = arr.shape[0]
    idx = np.concatenate((np.arange(j), [i], np.arange(j, i), np.arange(i+1, L)))
    return arr[idx]
I didn't try to make it work for forward moves. An analogous in-place function runs at about the same speed as Divakar's:
def foo2(arr, i, j):
    L = arr.shape[0]
    tgt = np.arange(j, i+1)
    src = np.concatenate([[i], np.arange(j, i)])
    arr[tgt] = arr[src]
But timings could well be different if the array was much bigger and the swap involved a small block in the middle.
Since the data of an array is stored in a contiguous block of memory, elements cannot change place without some sort of copy. You'd have to implement the data as a linked list to get a no-copy form of movement.
It just occurred to me that there are some masked copyto and place functions that might make this sort of copy/movement faster, but I haven't worked with those much.
As https://stackoverflow.com/a/40228699/901925 shows, np.roll does:
idx = np.concatenate((np.arange(2, 5), np.arange(2)))
# array([2, 3, 4, 0, 1])
np.take(a, idx)  # or a[idx]
In the past I have found simple numpy indexing, i.e. a[:-1] = a[1:], to be faster than most alternatives (including np.roll()). Comparing the two other answers with an 'in place' shift I get:
for shift from 40000 to 100
    1.015 ms  divakar
    1.078 ms  hpaulj
    29.7 µs   in-place shift (34x faster)
for shift from 40000 to 39900
    0.975 ms  divakar
    0.985 ms  hpaulj
    3.47 µs   in-place shift (290x faster)
timing comparison using:
import timeit

init = '''
import numpy as np

def divakar(in_arr, pick_idx, put_idx):
    range_arr = np.arange(in_arr.size)
    tmp = in_arr[pick_idx]
    in_arr[range_arr != put_idx] = in_arr[range_arr != pick_idx]
    in_arr[put_idx] = tmp

def hpaulj(arr, fr, to):
    L = arr.shape[0]
    idx = np.concatenate((np.arange(to), [fr], np.arange(to, fr), np.arange(fr+1, L)))
    return arr[idx]

def paddyg(arr, fr, to):
    if fr >= arr.size or to >= arr.size:
        return None
    tmp = arr[fr].copy()
    if fr > to:
        arr[to+1:fr+1] = arr[to:fr]
    else:
        arr[fr:to] = arr[fr+1:to+1]
    arr[to] = tmp
    return arr

a = np.random.randint(0, 1000, (100000))
'''

fns = ['''
divakar(a, 40000, 100)
''', '''
hpaulj(a, 40000, 100)
''', '''
paddyg(a, 40000, 100)
''']

for f in fns:
    print(timeit.timeit(f, setup=init, number=1000))

indexing rows in matrix using matlab

Suppose I have an empty m-by-n-by-p cell array called "cellPoints", and a D-by-3 array called "cellIdx" where each row i holds subscripts into "cellPoints". Now I want to fill "cellPoints" so that cellPoints{x, y, z} contains the array of row numbers of "cellIdx" whose rows equal [x, y, z].
A naive implementation could be
for i = 1:size(cellIdx, 1)
    cellPoints{cellIdx(i, 1), cellIdx(i, 2), cellIdx(i, 3)} = ...
        [cellPoints{cellIdx(i, 1), cellIdx(i, 2), cellIdx(i, 3)}; i];
end
As an example, suppose
cellPoints = cell(10, 10, 10);% user defined, cannot change
cellIdx = [1, 3, 2;
           3, 2, 1;
           1, 3, 2;
           1, 4, 2]
Then
cellPoints{1, 3, 2} = [1;3];
cellPoints{3, 2, 1} = [2];
cellPoints{1, 4, 2} = [4];
and other indices of cellPoints should be empty
Since cellIdx is a large matrix and this is clearly inefficient, is there a better implementation?
I've tried using unique(cellIdx, 'rows') to find the unique rows of cellIdx and then filling cellPoints in a for-loop, but it was even slower than the above.
See if this is faster:
cellPoints = cell(10,10,10);  %// initialize to proper size
[~, jj, kk] = unique(cellIdx, 'rows', 'stable');
sz = size(cellPoints);
sz = [1 sz(1:end-1)];
csz = cumprod(sz).';  %// will be used to build linear index
ind = 1 + (cellIdx(jj,:)-1)*csz;  %// linear index to fill cellPoints
cellPoints(ind) = accumarray(kk, 1:numel(kk), [], @(x) {sort(x)});
Or remove sort from the last line if order within each cell is not important.
