1D Number Array Clustering - arrays

So let's say I have an array like this:
[1,1,2,3,10,11,13,67,71]
Is there a convenient way to partition the array into something like this?
[[1,1,2,3],[10,11,13],[67,71]]
I looked through similar questions yet most people suggested using k-means to cluster points, like scipy, which is quite confusing to use for a beginner like me. Also I think that k-means is more suitable for two or more dimensional clustering right? Are there any ways to partition an array of N numbers to many partitions/clustering depending on the numbers?
Some people also suggest rigid range partitioning, but it doesn't always render the results as
expected

Don't use multidimensional clustering algorithms for a one-dimensional problem. A single dimension is much more special than you naively think, because you can actually sort it, which makes things a lot easier.
In fact, it is usually not even called clustering, but e.g. segmentation or natural breaks optimization.
You might want to look at Jenks Natural Breaks Optimization and similar statistical methods. Kernel Density Estimation is also a good method to look at, with a strong statistical background. Local minima in density are be good places to split the data into clusters, with statistical reasons to do so. KDE is maybe the most sound method for clustering 1-dimensional data.
With KDE, it again becomes obvious that 1-dimensional data is much more well behaved. In 1D, you have local minima; but in 2D you may have saddle points and such "maybe" splitting points. See this Wikipedia illustration of a saddle point, as how such a point may or may not be appropriate for splitting clusters.
See this answer for an example how to do this in Python (green markers are the cluster modes; red markers a points where the data is cut; the y axis is a log-likelihood of the density):

This simple algorithm works:
points = [0.1, 0.31, 0.32, 0.45, 0.35, 0.40, 0.5 ]
clusters = []
eps = 0.2
points_sorted = sorted(points)
curr_point = points_sorted[0]
curr_cluster = [curr_point]
for point in points_sorted[1:]:
if point <= curr_point + eps:
curr_cluster.append(point)
else:
clusters.append(curr_cluster)
curr_cluster = [point]
curr_point = point
clusters.append(curr_cluster)
print(clusters)
The above example clusters points into a group, such that each element in a group is at most eps away from another element in the group. This is like the clustering algorithm DBSCAN with eps=0.2, min_samples=1. As others noted, 1d data allows you to solve the problem directly, instead of using the bigger guns like DBSCAN.
The above algorithm is 10-100x faster for some small datasets with <1000 elements I tested.

You may look for discretize algorithms. 1D discretization problem is a lot similar to what you are asking. They decide cut-off points, according to frequency, binning strategy etc.
weka uses following algorithms in its , discretization process.
weka.filters.supervised.attribute.Discretize
uses either Fayyad & Irani's MDL method or Kononeko's MDL criterion
weka.filters.unsupervised.attribute.Discretize
uses simple binning

CKwrap is a fast and straightforward k-means clustering function, though a bit light on documentation.
Example Usage
pip install ckwrap
import ckwrap
nums= np.array([1,1,2,3,10,11,13,67,71])
km = ckwrap.ckmeans(nums,3)
print(km.labels)
# [0 0 0 0 1 1 1 2 2]
buckets = [[],[],[]]
for i in range(len(nums)):
buckets[km.labels[i]].append(nums[i])
print(buckets)
# [[1, 1, 2, 3], [10, 11, 13], [67, 71]]
exit()
I expect the authors intended you to make use of the nd array functionality rather than create a list of lists.
other measures:
km.centers
km.k
km.sizes
km.totss
km.betweenss
km.withinss
The underlying algorithm is based on this article.

Late response and just for the record. You can partition a 1D array using Ckmeans.1d.dp.
This method guarantees optimality and it is O(n^2), where n is the num of observations. The implementation is in C++ and there is a wrapper in R.

The code for Has QUIT--Anony-Mousse's answer to Clustering values by their proximity in python (machine learning?)
When you have 1-dimensional data, sort it, and look for the largest
gaps
I only added that gaps need to be relatively large
import numpy as np
from scipy.signal import argrelextrema
# lst = [1,1,5,6,1,5,10,22,23,23,50,51,51,52,100,112,130,500,512,600,12000,12230]
lst = [1,1,2,3,10,11,13,67,71]
lst.sort()
diff = [lst[i] - lst[i-1] for i in range(1, len(lst))]
rel_diff = [diff[i]/lst[i] for i in range(len(diff))]
arg = argrelextrema(np.array(rel_diff), np.greater)[0]
last = 0
for x in arg:
print(f'{last}:{x + 1} {lst[last:x + 1]}')
last = x + 1
print(f'{last}: {lst[last:]}')
output:
0:2 [1, 1]
2:4 [2, 3]
4:7 [10, 11, 13]
7: [67, 71]

Related

Optimal combination of “baskets” to best approximate target basket?

Say I have some array of length n where arr[k] represents how much of object k I want. I also have some arbitrary number of arrays which I can sum integer multiples of in any combination - my goal being to minimise the sum of the absolute differences across each element.
So as a dumb example if my target was [2,1] and my options were A = [2,3] and B = [0,1], then I could take A - 2B and have a cost of 0
I’m wondering if there is an efficient algorithm for approximating something like this? It has a weird knapsack-y flavour to is it maybe just intractable for large n? It doesn’t seem very amenable to DP methods
This is the (NP-hard) closest vector problem. There's an algorithm due to Fincke and Pohst ("Improved methods for calculating vectors of short length in a lattice, including a complexity analysis"), but I haven't personally worked with it.

Sorting and making "genes" in output bitstrings from a genetic algorithm

I was wondering if anybody had suggestions as to how I could analyze an output bitstring that is being permuted by a genetic algorithm. In particular it would be nice if I could try to identify patterns of bits (I'm calling them genes here) that seem to yield a desirable cv score. The difficulty comes in trying to examine these datasets because there are a lot of them (I have probably already something like 30 million bitstrings that are 140 bits long and I'll probably hit over 100 million pretty quickly), so after I sort out the desirable data there is still ALOT of potential datasets and doing similarity comparisons by eye is out of the question. My questions are:
How should I compare for similarity between these bitstrings?
How can I identify "genes" in these bitstrings in an algorithmic (aka programmable) way?
As you want to extract common gene-patterns, what about looking at the intersection of the two strings. So if you have
set1 = 11011101110011...
set2 = 11001100000110...
# apply bitwise '=='
set1 && set2 == 11101110000010...
The result now shows what genes are the same, and could be used in further analysis.
For the similarity part you need to do an exclusive-or (XOR). The result of this bit-wise operation will give you the difference between two bit strings, and is probably the most efficient and easy way of doing it (for pair comparison). As an example:
>>> from bitarray import bitarray
>>> a = bitarray('0001100111')
>>> b = bitarray('0100110110')
>>> a ^ b
bitarray('0101010001')
Then you can either count the differences, inspect quickly where the differences lie, etc.
For the second part, it depends on the representation of course, and on the programming language (PL) chosen for the implementation. Most PL libraries will have a search function, that retrieves all or at least the first of the indexes where some pattern is found in a string (or bitstring, or bitstream...). You just have to refer to the documentation of your chosen PL to know more about the performance if you have more than one option for the task.

Artificial neural networks

I want to know whether Artificial Neural Networks can be applied to discrete values inputs? I know they can be applied to continuous valued inputs, but can they be applied to discrete valued ones? Also, will perform well for discrete valued inputs?
Yes, artificial neural networks may be applied to data featuring discrete-value input variables. In the most commonly used neural network architectures (which are numeric), discrete inputs are typically represented by a series of dummy variables, just as in statistical regression. Also, as with regression, one less than the number of distinct values dummy variables is needed. There are other methods, but this is the most straightforward.
Well, good question let me say!
First of all let me answer directly yes to your question!
The answer implies to consider few aspects about the use and the implementation of the network itself.
Than let me explain why:
The easiest way is to normalize input as usual, this is the first rule of thumb with NN,
than let the neural network compute the task, and once you have your output, invert the normalization to get the output in the original range but still continuous, to get back descrete values just consider the integer part of your output. It is easy, it works and is fine, DONE! A good result just depends on the topology you design for you network.
As a plus you could consider the use of "step" transfer function, instead of "tan-sigmoid", between layers just to strenght and mimic a sort of digitization forcing the output to be just 0 or 1. But you should reconsider also the starting normalization as well as the use of well tuned thresholds.
NB: this latter trick is not really necessary but could give some secondary benefits; maybe test it in a second stage of your development and look at the differences.
PS: Just let me suggest something that should apply to your issue; if you would be smart take into account the use of some fuzzy logic on your learning algorithm ;-)
Cheers!
I'm late on this question, but this may help someone.
Say you have a categorical output variable, for example 3 different categories (0, 1 and 2),
outputs
0
2
1
2
1
0
then becomes
1, 0, 0
0, 0, 1
0, 1, 0
0, 0, 1
0, 1, 0
1, 0, 0
A possible NN output result is
0.2, 0.3, 0.5 (winner is categ 2)
0.05, 0.9, 0.05 (winner is categ 1)
...
Then your NN hill have 3 output nodes in this case, so take the max value.
To improve this, use entropy as a error measure and a softmax activation on the output layer, so that the outputs sum up to 1.
The purpose of a neural network is to approximate complicated functions by interpolating samples. As such, they tend to be a poor fit for discrete data, unless that data can be expressed by thresholding a continuous function. Depending on your problem, there are likely to be much more effective learning methods.

Is there a supervised learning algorithm that takes tags as input, and produces a probability as output?

Let's say I want to determine the probability that I will upvote a question on SO, based only on which tags are present or absent.
Let's also imagine that I have plenty of data about past questions that I did or did not upvote.
Is there a machine learning algorithm that could take this historical data, train on it, and then be able to predict my upvote probability for future questions? Note that it must be the probability, not just some arbitrary score.
Let's assume that there will be up-to 7 tags associated with any given question, these being drawn from a superset of tens of thousands.
My hope is that it is able to make quite sophisticated connections between tags, rather than each tag simply contributing to the end result in a "linear" way (much as words do in a Bayesian spam filter).
So for example, it might be that the word "java" increases my upvote probability, except when it is present with "database", however "database" might increase my upvote probability when present with "ruby".
Oh, and it should be computationally reasonable (training within an hour or two on millions of questions).
What approaches should I research here?
Given that there probably aren't many tags per message, you could just create "n-gram" tags and apply naive Bayes. Regression trees would also produce an empirical probability at the leaf nodes, using +1 for upvote and 0 for no upvote. See http://www.stat.cmu.edu/~cshalizi/350-2006/lecture-10.pdf for some readable lecture notes and http://sites.google.com/site/rtranking/ for an open source implementation.
You can try several methods (linear regression, SMV, neural networks). The input vector should consist of all possible tags, where each tag represents one dimension.
Then each record in a training set has to be transformed to the input vector according to the tags. For example let's say you have different combinations of 4 tags in your training set (php, ruby, ms, sql) and you define an unweighted input vector [php, ruby, ms, sql]. Let's say you have the following 3 records whic are transformed to weighted input vectors:
php, sql -> [1, 0, 0, 1]
ruby -> [0, 1, 0, 0]
ms, sql -> [0, 0, 1, 1]
In case you use linear regression you use the following formula
y = k * X
where y represents an answer (upvote/downvote) in your case and by inserting known values (X - weighted input vectors).
How ta calculate weights in case you use linear regression you can read here but the point is to create binary input vectors which size is equal (or larger in case you take into account some other variables) to the number of all tags and then for each record you set weights for each tag (0 if it is not included or 1 otherwise).

Efficient algorithm for searching 3D coordinate in an array

I have a large array (>10^5 entries) of 3D coordinates r=(x, y, z), where x, y and z are floats. Which is the most efficient way to search a given coordinate r' in the array and give the array index. Note that the r' may not given with the same accuracy as r; say, if the array has stored coordinate (1.5, 0.5, 0.0) and r' is given as (1.49999, 0.49999, 0.0), the algorithm should rightly pick the coordinate. I am developing the code in C.
How can one use O(1) search capability of hash table for this purpose? Converting the coordinate into string is out of question due to accuracy related issue. Is there any particular data structure that would help in O(1) algorithm?
Thanks
OnRoadCoder
check R-trees, already implemented on some RDBMS, like SQLite, and (i think) Postgres
In order to have "fuzzy" searching as you're describing (so you can support slight inaccuracies), you will have to sacrifice on O(1) algorithms.
That being said, there are some very good algorithms for this. Space partitioning (such as using an Octree or KD-Tree) is a common, popular option.
If the range of values is limited, pick the precision you want. Now, the key (1,2,3) will point to a linked list (or a fancier data structure) of all points that are within Manhattan Distance of 3 * d (d = 0.5? - depends on details) from (1,2,3). You know your application best, so you can do a better job of choosing d. Optimization approach would depend on how the data is distributed.
EDIT:
The weakness here is - if you have many points concentrated within a single cube, then there is little that can be done using a hash table about guaranteeing O(1) ... more like O(n) :)
Some sort of tree-based data structure can guaranteed O(log n).
What you are asking sounds like Nearest Neighbour Search. One approach might be to code a kd-tree (or any space partition based technique) and use that to find the nearest point to your query. But you can also go with a hash based approach, which basically does what Ipthnc's answer describes, but tries to avoid bad performance for degenerate cases.

Resources