Efficient data sifting for unique values (Python) - arrays

I have a 2D Numpy array that consists of (X,Y,Z,A) values, where (X,Y,Z) are Cartesian coordinates in 3D space, and A is some value at that location. As an example:
__X__|__Y__|__Z__|__A__
 13  |  7  | 21  | 1.5
  9  |  2  |  7  | 0.5
 15  |  3  |  9  | 1.1
 13  |  7  | 21  | 0.9
 13  |  7  | 21  | 1.7
 15  |  3  |  9  | 1.1
Is there an efficient way to find all the unique combinations of (X,Y), and add their values? For example, the total for (13,7) would be (1.5+0.9+1.7), or 4.1.

A scipy.sparse matrix takes this kind of information, but only for 2D:
sparse.coo_matrix((data, (row, col)))
where row and col are indices like your X,Y and Z. It sums duplicates.
The first step to doing that is a lexical sort of the indices. That puts points with matching coordinates next to each other.
The actual grouping and summing is done, I believe, in compiled code. Part of the difficulty in doing that fast in numpy terms is that there will be a variable number of elements in each group. Some will be unique; others might have 3 or more.
Python itertools has a groupby tool. Pandas also has grouping functions. I can also imagine using a defaultdict to group and sum values (a minimal sketch follows below).
The ufunc reduceat might also work, though it's easier to use in 1D than in 2D or 3D.
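Here's what that defaultdict idea might look like on the sample data (a minimal sketch of my own, plain Python, not timed against the other approaches):

from collections import defaultdict

# sample (X, Y, Z, A) rows from the question
rows = [(13, 7, 21, 1.5), (9, 2, 7, 0.5), (15, 3, 9, 1.1),
        (13, 7, 21, 0.9), (13, 7, 21, 1.7), (15, 3, 9, 1.1)]

totals = defaultdict(float)
for x, y, z, a in rows:
    totals[(x, y)] += a          # group on (X, Y), ignore Z

print(totals[(13, 7)])           # 4.1 (1.5 + 0.9 + 1.7), up to float rounding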
If you are ignoring the Z, the sparse coo_matrix approach may be easiest.
In [2]: X=np.array([13,9,15,13,13,15])
In [3]: Y=np.array([7,2,3,7,7,3])
In [4]: A=np.array([1.5,0.5,1.1,0.9,1.7,1.1])
In [5]: M=sparse.coo_matrix((A,(X,Y)))
In [15]: M.sum_duplicates()
In [16]: M.data
Out[16]: array([ 0.5, 2.2, 4.1])
In [17]: M.row
Out[17]: array([ 9, 15, 13])
In [18]: M.col
Out[18]: array([2, 3, 7])
In [19]: M
Out[19]:
<16x8 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
Here's what I had in mind with lexsort
In [32]: Z=np.array([21,7,9,21,21,9])
In [33]: xyz=np.stack([X,Y,Z],1)
In [34]: idx=np.lexsort([X,Y,Z])
In [35]: idx
Out[35]: array([1, 2, 5, 0, 3, 4], dtype=int32)
In [36]: xyz[idx,:]
Out[36]:
array([[ 9, 2, 7],
[15, 3, 9],
[15, 3, 9],
[13, 7, 21],
[13, 7, 21],
[13, 7, 21]])
In [37]: A[idx]
Out[37]: array([ 0.5, 1.1, 1.1, 1.5, 0.9, 1.7])
When sorted like this it becomes more evident that the Z coordinate is 'redundant', at least for this purpose.
Using reduceat to sum groups:
In [40]: np.add.reduceat(A[idx],[0,1,3])
Out[40]: array([ 0.5, 2.2, 4.1])
(for now I just eyeballed the [0,1,3] list)
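One way to compute those group-start indices instead of eyeballing them, as a quick sketch continuing the sorted example above (Z is redundant here, so checking where the (X, Y) pair changes is enough):

In [41]: sorted_xy = xyz[idx, :2]                        # sorted (X, Y) pairs
In [42]: new_group = np.any(np.diff(sorted_xy, axis=0) != 0, axis=1)
In [43]: starts = np.concatenate(([0], np.nonzero(new_group)[0] + 1))
In [44]: starts
Out[44]: array([0, 1, 3])
In [45]: np.add.reduceat(A[idx], starts)
Out[45]: array([ 0.5,  2.2,  4.1])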

Approach #1
Get each row as a view, thus reducing each row to a single scalar, then use np.unique to tag each row with an ID in the range 0..n-1, with n as the number of unique rows, and finally use np.bincount to sum the last column grouped by those IDs.
Here's the implementation -
def get_row_view(a):
    void_dt = np.dtype((np.void, a.dtype.itemsize * np.prod(a.shape[1:])))
    a = np.ascontiguousarray(a)
    return a.reshape(a.shape[0], -1).view(void_dt).ravel()

def groupby_cols_view(x):
    a = x[:,:2].astype(int)
    a1D = get_row_view(a)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]
Approach #2
Same as approach #1, but instead of working with the view, we generate an equivalent linear index for each row, thus reducing each row to a scalar. The rest of the workflow is the same as with the first approach.
The implementation -
def groupby_cols_linearindex(x):
    a = x[:,:2].astype(int)
    a1D = a[:,0] + a[:,1]*(a[:,0].max() - a[:,1].min() + 1)
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]
Sample runs
In [80]: data
Out[80]:
array([[ 2. , 5. , 1. , 0.40756048],
[ 3. , 4. , 6. , 0.78945661],
[ 1. , 3. , 0. , 0.03943097],
[ 2. , 5. , 7. , 0.43663582],
[ 4. , 5. , 0. , 0.14919507],
[ 1. , 3. , 3. , 0.03680583],
[ 1. , 4. , 8. , 0.36504428],
[ 3. , 4. , 2. , 0.8598825 ]])
In [81]: groupby_cols_view(data)
Out[81]:
array([[ 1. , 3. , 0.0762368 ],
[ 1. , 4. , 0.36504428],
[ 2. , 5. , 0.8441963 ],
[ 3. , 4. , 1.64933911],
[ 4. , 5. , 0.14919507]])
In [82]: groupby_cols_linearindex(data)
Out[82]:
array([[ 1. , 3. , 0.0762368 ],
[ 1. , 4. , 0.36504428],
[ 3. , 4. , 1.64933911],
[ 2. , 5. , 0.8441963 ],
[ 4. , 5. , 0.14919507]])
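As a variation on approach #2 (my own sketch, not part of the original answer), np.ravel_multi_index can build the per-row scalar for non-negative integer coordinates:

def groupby_cols_ravelindex(x):
    a = x[:,:2].astype(int)
    a -= a.min(0)                                     # shift so indices start at 0
    a1D = np.ravel_multi_index(a.T, a.max(0) + 1)     # one scalar per (X, Y) row
    _, indx, IDs = np.unique(a1D, return_index=1, return_inverse=1)
    return np.c_[x[indx,:2], np.bincount(IDs, x[:,-1])]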

Related

How to perform series of operations on a region of a NumPy array?

I have an array of the following structure which I'll refer to as x:
1 2 3 4
2 3 4 5
3 4 5 6
What I wish to do is perform a series of operations on this array, but only on a specific section of the array each time, while keeping the array structured the same.
I am aware of using np.where to locate values based on a condition such as:
loc = np.where(x >4)
Now performing the above returns:
(array([1, 2, 2], dtype=int64), array([3, 2, 3], dtype=int64))
But using x[loc] returns the raw values, which, whilst expected, is not what I'm looking for, as it only returns those values, not the whole array.
So my desired output is to have the initial array x same as above:
1 2 3 4
2 3 4 5
3 4 5 6
From that, perform a series of operations on values which satisfy a given condition, while also keeping the array intact. So for a given equation like:
5*x+1
it will only be performed on values greater than 5 (x > 5) and keep the array in the same structure, so it will result in:
1 2 3 4
2 3 4 26
3 4 26 31
How do I go about doing this?
Here's one way using masking shown as a sample case -
1) Setup sample input :
In [373]: np.random.seed(0)
In [374]: a = np.random.randint(0,9,(3,4))
In [375]: a
Out[375]:
array([[5, 0, 3, 3],
[7, 3, 5, 2],
[4, 7, 6, 8]])
2) Get the mask :
In [376]: mask = a>4
In [377]: mask
Out[377]:
array([[ True, False, False, False],
[ True, False, True, False],
[False, True, True, True]], dtype=bool)
3) Get masked values :
In [378]: a_masked = a[mask]
4) Update the masked places with computations on the masked values :
In [380]: a[mask] = 5*a_masked + 1
In [381]: a
Out[381]:
array([[26, 0, 3, 3],
[36, 3, 26, 2],
[ 4, 36, 31, 41]])
5) For more operations, get the masked values again and repeat -
In [382]: a_masked = a[mask]
In [383]: a[mask] = a_masked + 100
In [384]: a
Out[384]:
array([[126, 0, 3, 3],
[136, 3, 126, 2],
[ 4, 136, 131, 141]])
Alternative to 4 and 5: If you don't want to update a after each operation, we can update the array of masked values and write back to the input array at the very end. Thus, steps 4 and 5 would be replaced, as shown below -
In [386]: a # Input array
Out[386]:
array([[5, 0, 3, 3],
[7, 3, 5, 2],
[4, 7, 6, 8]])
In [387]: a_masked = a[mask]
In [388]: a_masked = 5*a_masked + 1 # operation #1
In [389]: a_masked = a_masked + 100 # operation #2
In [390]: a[mask] = a_masked # write back to input array
In [391]: a
Out[391]:
array([[126, 0, 3, 3],
[136, 3, 126, 2],
[ 4, 136, 131, 141]])
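Since the question already mentions np.where, here is a one-shot alternative (my addition, not part of the answer above): the three-argument form applies the formula where the condition holds and keeps the other elements in place. Note that the desired output shown in the question actually corresponds to the condition x > 4:

import numpy as np

x = np.array([[1, 2, 3, 4],
              [2, 3, 4, 5],
              [3, 4, 5, 6]])

out = np.where(x > 4, 5*x + 1, x)   # apply 5*x+1 only where the condition holds
print(out)
# [[ 1  2  3  4]
#  [ 2  3  4 26]
#  [ 3  4 26 31]]

The masking approach above is still preferable when you want to chain several operations on the same selection.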

Load text from csv file with int and float columns into ndarray

I have csv file as input :
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
It has a mix of int and float columns.
When I tried to import the file using numpy.loadtxt, what I got is a 2D array with every column as float:
r = np.loadtxt(open("text.csv", "rb"), delimiter=",", skiprows=0)
and I received output like:
array([[ 6. , 148. , 72. , ..., 0.627, 50. , 1. ],
[ 1. , 85. , 66. , ..., 0.351, 31. , 0. ],
[ 8. , 183. , 64. , ..., 0.672, 32. , 1. ],
...,
[ 5. , 121. , 72. , ..., 0.245, 30. , 0. ],
[ 1. , 126. , 60. , ..., 0.349, 47. , 1. ],
[ 1. , 93. , 70. , ..., 0.315, 23. , 0. ]])
which is perfect: a 2D array, with each row in a list instead of a tuple.
But looking at the datatypes, every column is treated as float, which is not correct.
What I am asking is: is there any way I can get output like:
Desired output
array([[ 6 , 148 , 72 , ..., 0.627, 50 , 1 ],
[ 1 , 85 , 66 , ..., 0.351, 31 , 0 ],
[ 8 , 183 , 64 , ..., 0.672, 32 , 1 ],
...,
[ 5 , 121 , 72 , ..., 0.245, 30 , 0 ],
[ 1 , 126 , 60 , ..., 0.349, 47 , 1 ],
[ 1 , 93 , 70 , ..., 0.315, 23 , 0 ]])
I did try this approach:
r = np.loadtxt(open("F:/idm/compressed/ANN-CI1/Diabetes.csv", "rb"), delimiter=",", skiprows=0, dtype=[('f0',int),('f1',int),('f2',int),('f3',int),('f4',int),('f5',float),('f6',float),('f7',int),('f8',int)])
Output
array([( 6, 148, 72, 35, 0, 33.6, 0.627, 50, 1),
( 1, 85, 66, 29, 0, 26.6, 0.351, 31, 0),
( 8, 183, 64, 0, 0, 23.3, 0.672, 32, 1),
( 1, 89, 66, 23, 94, 28.1, 0.167, 21, 0),
...,
( 1, 126, 60, 0, 0, 30.1, 0.349, 47, 1),
( 1, 93, 70, 31, 0, 30.4, 0.315, 23, 0)],
dtype=[('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4','<i4'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<i4'), ('f8', '<i4')])
Here you can see the dtype solves the problem, but now it's not in the form I require,
[[col1,col2,...,coln],] instead of [(col1,col2,...,coln),] ndarray
Thanks
------------------EDIT------------------------
The problem, and why I am asking, is that I am giving this 2D array as input to my binary classification network. When all values are int and in [[ ]] shape it converges, but in the current mixed case the output is either 0. or 1. with a very high learning error.
See https://github.com/naitikshukla/MachineLearning/blob/master/neural/demo_ann.py for the complete code.
In the input section, if I mark my current input and unmark lines 69-88, then the output will be both 0 and 1.
So I wanted to change the input to the correct datatypes and see if that solves my issue.
There are very good explanations below for why this is not possible; I will look for a workaround and see if I can use the current input for training and prediction.
It's impossible to create a numpy array like [[col1,col2,...,coln],] which contains different types of values.
A numpy array is homogeneous. In other words, a numpy array contains only values of one single type.
In [32]: sio = StringIO('''6,148,72,35,0,33.6,0.627,50,1
...: 1,85,66,29,0,26.6,0.351,31,0
...: 8,183,64,0,0,23.3,0.672,32,1
...: 1,89,66,23,94,28.1,0.167,21,0''')
In [33]: r = np.loadtxt(sio, delimiter=",", skiprows=0)
In [34]: r.shape
Out[34]: (4, 9)
In [41]: r.dtype
Out[41]: dtype('float64')
The line above creates a 2D array of floats; its shape is 4x9.
In [36]: r = np.loadtxt(sio, delimiter=",", skiprows=0, dtype=[('f0',int),('f1'
...: ,int),('f2',int),('f3',int),('f4',int),('f5',float),('f6',float),('f7'
...: ,int),('f8',int)])
In [38]: r.shape
Out[38]: (4,)
In [45]: r.dtype
Out[45]: dtype([('f0', '<i4'), ('f1', '<i4'), ('f2', '<i4'), ('f3', '<i4'), ('f4', '<i4'), ('f5', '<f8'), ('f6', '<f8'), ('f7', '<i4'), ('f8', '<i4')])
This code creates a 1-D structured array. Each element of this array is a structure that contains 9 items. It's still homogeneous.
In the first case you get a 2d array of floats. In the second, a 1d array with a structured dtype, a mix of ints and floats. What were columns in the first are now named fields. The structured records are marked with () instead of [].
Both forms are valid and useful. It just depends on what you need to do.
The structured form is more useful when some of the fields are strings or other things that don't fit the integer/float pattern. Usually you can work with the integers as floats without any loss of functionality.
What exactly is wrong with the first case, the all floats? Which is most important - named columns, or ranges of columns (e.g. 0:5, 5:8)?
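If the end goal is just feeding a network, one workable compromise (a sketch of my own, not from the answers above) is to keep the all-float 2D array and recover integer columns by slicing when needed:

import numpy as np
from io import StringIO

sio = StringIO('''6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0''')

r = np.loadtxt(sio, delimiter=",")     # 2D float array, shape (2, 9)
features = r[:, :-1]                   # stays float for the network
labels = r[:, -1].astype(int)          # last column recovered as int
print(features.shape, labels)          # (2, 8) [1 0]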

Fast response for subset queries

I have a database of 10,000 vectors of integers ranging from 1 to 1,000. The length of each vector can be up to 1,000. For example, it can look like this:
vec1: 1 2 56 78
vec2: 23 34 35 36 37 38
vec3: 1 2 3 4 5 7
vec4: 2 3 4 6 100
...
vec10000: 13 234
Now, I want to store this database in a way that is fast in response to a particular type of request. Each request will come in the form of an integer vector, up to 10,000 long:
query: 1 2 3 4 5 7 56 78 100
The response should be the indices of the vectors that are subsets of this query string. For example, in the above list, only vec1 and vec3 are subsets of the query, so the response in this case should be
response: 1 3
This database is not going to change so you can preprocess it in any possible way. You may specify that queries come in any protocol as well, as long as the information is the same. For example, it can come as a sorted list or a boolean table.
What is the best strategy to encode the database and the queries to achieve the fastest possible response?
Since you are using Python, this method seems easy. (It is implementable in any other language as well, but will involve modular arithmetic, big integers, etc.)
So, for each number from 1-1000, assign a prime number to it. So,
1 => 2
2 => 3
3 => 5
4 => 7
...
...
25 => 97
...
...
1000 => 7919
For every vector, use as its hash value the product of the primes assigned to all values in the vector.
e.g. if your vector is vec-x = {1,2,5,25}, then vec-x = 2 * 3 * 11 * 97.
Similarly, your query vector can be calculated as above. Let its value be Q.
If Q % vec-i == 0, it is a subset, else not.
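Here's a minimal Python sketch of that prime-product idea, treating each vector as a set of distinct values (the helper names are mine, not from the answer):

from math import prod
from itertools import count

def first_n_primes(n):
    """Generate the first n primes by simple trial division."""
    primes = []
    for candidate in count(2):
        if all(candidate % p for p in primes):
            primes.append(candidate)
            if len(primes) == n:
                return primes

PRIMES = first_n_primes(1000)            # PRIMES[k-1] is the prime assigned to value k

def signature(vec):
    """Product of the primes assigned to the distinct values in vec."""
    return prod(PRIMES[v - 1] for v in set(vec))

# preprocess the database once
db = [[1, 2, 56, 78], [23, 34, 35, 36, 37, 38], [1, 2, 3, 4, 5, 7],
      [2, 3, 4, 6, 100], [13, 234]]
db_sigs = [signature(v) for v in db]

def subset_query(query):
    q = signature(query)
    # vec-i is a subset of the query iff its signature divides the query's signature
    return [i + 1 for i, s in enumerate(db_sigs) if q % s == 0]   # 1-based, as in the question

print(subset_query([1, 2, 3, 4, 5, 7, 56, 78, 100]))   # [1, 3]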
What about just preprocessing your vector list into an indicator matrix and using matrix multiplication, something like:
import numpy as np

# generate 10000 random vectors with length in [0-1000]
# and elements in [0-1000]
vectors = [np.random.randint(1000, size=n)
           for n in np.random.randint(1000, size=10000)]

# generate indicator matrix
database = np.zeros((10000, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

def query(ints):
    tmp = np.zeros(1000, dtype='int8')
    tmp[ints] = 1
    return np.where(database.dot(tmp) == lengths)[0]
The dot product of a database row and the transformed query will be equal to the number of elements of the row that are in the query. If this number is equal to total number of elements in the row, then we've found a subset. Note that this uses 0-based indexing.
Here's this revised for your example data
vectors = [[1, 2, 56, 78],
           [23, 34, 35, 36, 37, 38],
           [1, 2, 3, 4, 5, 7],
           [2, 3, 4, 6, 100],
           [13, 234]]

database = np.zeros((5, 1000), dtype='int8')
for i, vector in enumerate(vectors):
    database[i, vector] = 1
lengths = database.sum(axis=1)

print(query([1, 2, 3, 4, 5, 7, 56, 78, 100]))
# [0 2]  (0-based indexing)

Easy way of printing two numpy arrays with each element in a different line?

Let's say I have a 1D numpy array x and another one y = x ** 2.
I am looking for an easier alternative to
for i in range(x.size):
    print(x[i], y[i])
With one array one can do print(*x, sep = '\n') which is easier than a for loop. I'm thinking of something like converting x and y to arrays of strings and then adding them up into an array z and then using print(*z, sep = '\n'). However, I tried to do that but numpy gives an error when the add operation is performed.
Edit: This is the function I use for this
def to_str(*args):
    return '\n'.join([' '.join([str(ls[i]) for ls in args]) for i in range(len(args[0]))]) + '\n'
>>> x = np.arange(10)
>>> y = x ** 2
>>> print(to_str(x,y))
0 0
1 1
2 4
3 9
4 16
5 25
6 36
7 49
8 64
9 81
>>>
or if something quick and dirty is enough:
print(np.array((x,y)).T)
You could do something along these lines -
# Input arrays
In [238]: x
Out[238]: array([14, 85, 79, 89, 41])
In [239]: y
Out[239]: array([13, 79, 13, 79, 11])
# Join arrays with " "
In [240]: z = [" ".join(item) for item in np.column_stack((x,y)).astype(str)]
# Finally print it
In [241]: print(*z, sep='\n')
14 13
85 79
79 13
89 79
41 11
# Original approach for printing
In [242]: for i in range(x.size):
     ...:     print(x[i], y[i])
     ...:
14 13
85 79
79 13
89 79
41 11
To make things a bit more compact, np.column_stack((x,y)) could be replaced by np.vstack((x,y)).T.
There are a few other methods to create z, as listed below -
z = [str(i)[1:-1] for i in zip(x,y)] # Prints commas between elems
z = [str(i)[1:-1] for i in np.column_stack((x,y))]
z = [str(i)[1:-1] for i in np.vstack((x,y)).T]
Here is one way without a loop:
print(np.array2string(np.column_stack((x, y)),separator=',').replace(' [ ','').replace('],', '').strip('[ ]'))
Demo:
In [86]: x
Out[86]: array([0, 1, 2, 3, 4])
In [87]: y
Out[87]: array([ 0, 1, 4, 9, 16])
In [85]: print(np.array2string(np.column_stack((x, y)),separator=',').replace(' [ ','').replace('],', '').strip('[ ]'))
0, 0
1, 1
2, 4
3, 9
4,16
There are 2 issues - combining the 2 arrays, and printing the result
In [1]: a = np.arange(4)
In [2]: b = a**2
In [3]: ab = [a,b] # join arrays in a simple list
In [4]: ab
Out[4]: [array([0, 1, 2, 3]), array([0, 1, 4, 9])]
In [6]: list(zip(*ab)) # 'transpose' that list
Out[6]: [(0, 0), (1, 1), (2, 4), (3, 9)]
That zip(*) is a useful tool or idiom.
I could use your print(*a, sep...) method with this
In [11]: print(*list(zip(*ab)), sep='\n')
(0, 0)
(1, 1)
(2, 4)
(3, 9)
Using sep is a neat py3 trick, but is rarely used. I'm not even sure how to do the equivalent with the older py2 print statement.
But if we convert the list of arrays into a 2d array we have more options.
In [12]: arr = np.array(ab)
In [13]: arr
Out[13]:
array([[0, 1, 2, 3],
[0, 1, 4, 9]])
In [14]: np.vstack(ab) # does the same thing
Out[14]:
array([[0, 1, 2, 3],
[0, 1, 4, 9]])
For simply looking at the 2 arrays together this arr is quite useful. And if the lines get too long, transpose it:
In [15]: arr.T
Out[15]:
array([[0, 0],
[1, 1],
[2, 4],
[3, 9]])
In [16]: print(arr.T)
[[0 0]
[1 1]
[2 4]
[3 9]]
Note that the array print format is different from that for nested lists. That's intentional.
The brackets seldom get in the way of understanding the display. They even help when the array becomes 3d and higher.
For printing a file that can be read by other programs, np.savetxt is quite useful. It lets me specify the delimiter, and the format for each column or line.
In [17]: np.savetxt('test.csv', arr.T, delimiter=',',fmt='%10d')
In ipython I can look at the file with a simple system call:
In [18]: cat test.csv
0, 0
1, 1
2, 4
3, 9
I can omit the delimiter parameter.
I can reload it with loadtxt
In [20]: np.loadtxt('test.csv',delimiter=',',dtype=int)
Out[20]:
array([[0, 0],
[1, 1],
[2, 4],
[3, 9]])
In Py3 it is hard to write savetxt to the screen. It operates on a byte string file, and sys.stdout is unicode. In Py2 np.savetxt(sys.stdout, ...) might work.
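One workaround that generally works in Py3 (it may depend on the numpy version; recent versions also tend to accept sys.stdout directly) is to hand savetxt the underlying binary buffer:

import sys
np.savetxt(sys.stdout.buffer, arr.T, delimiter=',', fmt='%10d')   # same rows, written to the terminal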
savetxt is not sophisticated. In this example, it is essentially doing the equivalent of:
In [21]: for row in arr.T:
    ...:     print('%10d,%10d'%tuple(row))
    ...:
0, 0
1, 1
2, 4
3, 9

Partition an array of numbers into sets by proximity

Let's say we have an array like
[37, 20, 16, 8, 5, 5, 3, 0]
What algorithm can I use so that I can specify the number of partitions and have the array broken into them?
For 2 partitions, it should be
[37] and [20, 16, 8, 5, 5, 3, 0]
For 3, it should be
[37],[20, 16] and [8, 5, 5, 3, 0]
I am able to break them down by proximity by simply subtracting each element from its right and left neighbours, but that doesn't ensure the correct number of partitions.
Any ideas?
My code is in ruby but any language/algo/pseudo-code will suffice.
Here's the Ruby code for Vikram's algorithm:
def partition(arr, clusters)
  # Return the array unchanged if clusters is out of range
  return arr if (clusters >= arr.size) || (clusters < 0)

  edges = {}
  # Get weights of edges
  arr.each_with_index do |a, i|
    break if i == (arr.length - 1)
    edges[i] = a - arr[i + 1]
  end

  # Sort edge weights in ascending order
  sorted_edges = edges.sort_by { |k, v| v }.collect { |k| k.first }

  # Maintain counter for joins happening
  prev_edge = arr.size + 1
  joins = 0
  sorted_edges.each do |edge|
    # If join is on right of previous, subtract the number of previous joins that happened on left
    if edge > prev_edge
      edge -= joins
    end
    joins += 1

    # Join the elements on the sides of edge
    arr[edge] = arr[edge, 2].flatten
    arr.delete_at(edge + 1)
    prev_edge = edge

    # Get out when the right number of clusters is reached
    break if arr.size == clusters
  end

  arr
end
(assuming the array is sorted in descending order)
37, 20, 16, 8, 5, 5, 3, 0
Calculate the differences between adjacent numbers:
17, 4, 8, 3, 0, 2, 3
Then sort them in descending order:
17, 8, 4, 3, 3, 2, 0
Then take the first few numbers. For example, for 4 partitions, take 3 numbers:
17, 8, 4
Now look at the original array and find the elements with these given differences (you should attach the index in the original array to each element in the difference array to make this easiest).
17 - difference between 37 and 20
8 - difference between 16 and 8
4 - difference between 20 and 16
Now print the stuff:
37 | 20 | 16 | 8, 5, 5, 3, 0
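Here's a rough numpy sketch of that gap-based split for a descending array (the function name is mine, not from the answer):

import numpy as np

def partition_by_gaps(arr, k):
    """Split a descending array into k groups at the k-1 largest adjacent gaps."""
    a = np.asarray(arr)
    gaps = a[:-1] - a[1:]                                   # differences between neighbours
    # positions of the k-1 largest gaps, kept in positional order
    cut_points = np.sort(np.argsort(gaps)[::-1][:k - 1]) + 1
    return [part.tolist() for part in np.split(a, cut_points)]

print(partition_by_gaps([37, 20, 16, 8, 5, 5, 3, 0], 3))
# [[37], [20, 16], [8, 5, 5, 3, 0]]
print(partition_by_gaps([37, 20, 16, 8, 5, 5, 3, 0], 4))
# [[37], [20], [16], [8, 5, 5, 3, 0]]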
I think your problem can be solved using k-clustering with Kruskal's algorithm. Kruskal's algorithm is used to find the clusters such that there is maximum spacing between them.
Algorithm:
Construct a path graph from your data set, like the following:
[37, 20, 16, 8, 5, 5, 3, 0]
path graph: - 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7
then the weight for each edge will be the difference between their values
edge(0,1) = abs(37-20) = 17
edge(1,2) = abs(20-16) = 4
edge(2,3) = abs(16-8) = 8
edge(3,4) = abs(8-5) = 3
edge(4,5) = abs(5-5) = 0
edge(5,6) = abs(5-3) = 2
edge(6,7) = abs(3-0) = 3
Use Kruskal on this graph till there are only k clusters remaining:
Sort the edges first according to weights, in ascending order:
(4,5),(5,6),(6,7),(3,4),(1,2),(2,3),(0,1)
Use Kruskal on it to find exactly k = 3 clusters:
iteration 1 : join (4,5) clusters = 7 clusters: [37,20,16,8,(5,5),3,0]
iteration 2 : join (5,6) clusters = 6 clusters: [37,20,16,8,(5,5,3),0]
iteration 3 : join (6,7) clusters = 5 clusters: [37,20,16,8,(5,5,3,0)]
iteration 4 : join (3,4) clusters = 4 clusters: [37,20,16,(8,5,5,3,0)]
iteration 5 : join (1,2) clusters = 3 clusters: [37,(20,16),(8,5,5,3,0)]
stop as clusters = 3
Reconstructed solution: [(37), (20, 16), (8, 5, 5, 3, 0)] is what you desired.
While #anatolyg's solution may be fine, you should also look at k-means clustering. It's usually done in higher dimensions, but ought to work fine in 1d.
You pick k; your examples are k=2 and k=3. The algorithm seeks to put the inputs into k sets that minimize the sum of distances squared from the set's elements to the centroid (mean position) of the set. This adds a bit of rigor to your rather fuzzy definition of the right result.
While getting an optimal result is NP hard, there is a simple greedy solution.
It's an iteration. Take a guess to get started. Either pick k elements at random to be the initial means or put all the elements randomly into k sets and compute their means. Some care is needed here because each of the k sets must have at least one element.
Additionally, because your integer sets can have repeats, you'll have to ensure the initial k means are distinct. This is easy enough: just pick from a set that has been uniquified.
Now iterate. For each element find its closest mean. If it's already in the set corresponding to that mean, leave it there. Else move it. After all elements have been considered, recompute the means. Repeat until no elements need to move.
The Wikipedia page on this is pretty good.
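Here's a minimal 1-D sketch of that iteration (my own illustration; the initial means are just a random sample of distinct values, as suggested above, and at least k distinct values are assumed):

import numpy as np

def kmeans_1d(values, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float)
    means = rng.choice(np.unique(x), size=k, replace=False)   # distinct starting means
    for _ in range(iters):
        # assign each element to its closest mean
        labels = np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)
        # recompute the means; keep the old one if a set went empty
        new_means = np.array([x[labels == j].mean() if np.any(labels == j) else means[j]
                              for j in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return [sorted(x[labels == j].tolist(), reverse=True) for j in range(k)]

# exact grouping depends on the random start; for this data the groups settle quickly
print(kmeans_1d([37, 20, 16, 8, 5, 5, 3, 0], 3))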
