Similarity measures for vectors containing mixed data types (discrete and continuous)

I have a set of vectors each of which contain both textual and numeric elements. I am looking for similarity measures for such vectors and if possible their implemented frameworks. Any help much appreciated.

To me this is a data modeling problem rather than one of finding an appropriate similarity metric.
For instance, you can use Euclidean distance provided that you:
(1) re-scale your data (e.g., mean-centered and unit variance); and
(2) re-code the "textual" elements (by which I assume you mean discrete variables, such as a field storing gender with the values male and female).
So for instance, imagine a dataset composed of data vectors, each with four features (columns or fields):
minutes_per_session, sessions_per_week, registered_user, sex
The first two are continuous (a.k.a. "numeric") variables, i.e., proper values are 12.5, 4.7, and so on.
The last two are discrete and obviously require transformation.
Step 1: Re-coding the discrete variables
The common technique is to re-code each discrete feature into a set of new features, one feature for each value recorded for the original feature (and each new feature is given the name of one of the original feature's values).
Hence a single column storing the sex of each user, with values of M and F, would be transformed into two features (fields or columns), because sex has two possible values.
So the column of values for user sex:
['M']
['M']
['F']
['M']
['M']
['F']
['F']
['M']
['M']
['M']
becomes two columns:
[1, 0]
[1, 0]
[0, 1]
[1, 0]
[1, 0]
[0, 1]
[0, 1]
[1, 0]
[1, 0]
[1, 0]
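As a minimal sketch of this recoding step (assuming the raw column is a plain numpy array of strings; pandas.get_dummies or scikit-learn's OneHotEncoder would do the same job):
import numpy as np

sex = np.array(['M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M'])
values = np.array(['M', 'F'])                      # column order chosen to match the example above
one_hot = (sex[:, None] == values).astype(float)   # 10 rows, two indicator columns
print(one_hot)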
Step 2: Re-scaling the data (e.g., mean-centered and unit variance)
A randomly generated 2D array of synthetic data:
>>> A
array([[ 3.,  5.,  2.,  4.],
       [ 9.,  2.,  0.,  8.],
       [ 5.,  1.,  8.,  0.],
       [ 9.,  9.,  7.,  4.],
       [ 3.,  1.,  6.,  2.]])
For each column, calculate the mean,
then subtract that mean from each value in the column:
>>> A -= A.mean(axis=0)
>>> A
array([[-2.8,  1.4, -2.6,  0.4],
       [ 3.2, -1.6, -4.6,  4.4],
       [-0.8, -2.6,  3.4, -3.6],
       [ 3.2,  5.4,  2.4,  0.4],
       [-2.8, -2.6,  1.4, -1.6]])
For each column, now calculate the standard deviation,
then divide each value in that column by this std:
>>> A /= A.std(axis=0)
Verify:
>>> A.mean(axis=0)
array([ 0., -0., 0., -0.])
>>> A.std(axis=0)
array([ 1., 1., 1., 1.])
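For reference, scikit-learn packages these same two steps into one call (a minimal sketch, assuming scikit-learn is available; not part of the original answer):
import numpy as np
from sklearn.preprocessing import StandardScaler

A = np.array([[3., 5., 2., 4.],
              [9., 2., 0., 8.],
              [5., 1., 8., 0.],
              [9., 9., 7., 4.],
              [3., 1., 6., 2.]])
A_scaled = StandardScaler().fit_transform(A)   # mean-centered, unit variance per column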
So the original data set, which had four columns, now has six (each of the two discrete features expands into two indicator columns); pair-wise similarity can be measured by Euclidean distance, like so:
Take the first two data vectors (rows):
>>> v1, v2 = A[:2,:]
Euclidean distance, for a 2-feature space:
dist = ( (x2 - x1)**2 + (y2 - y1)**2 )**0.5
>>> sm = np.sum((v2 - v1)**2)**0.5
>>> sm
2.92
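Since the question also asks about implemented frameworks: once the data are recoded and rescaled, SciPy computes all pair-wise Euclidean distances in one call. A minimal sketch (the random array below is just a stand-in for the prepared data):
import numpy as np
from scipy.spatial.distance import pdist, squareform

A = np.random.rand(5, 6)                      # stand-in for the recoded, rescaled data
A = (A - A.mean(axis=0)) / A.std(axis=0)
D = squareform(pdist(A, metric='euclidean'))  # D[i, j] = distance between rows i and j
print(D[0, 1])                                # same quantity as the manual computation above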

A nice metric for textual data is the Levenshtein distance (or edit distance), which counts how much you have to change one string to obtain the other. A less computationally intensive alternative is the Hamming distance, which provides a similar metric but requires the strings to have the same length. Converting letters to their ASCII representation is unlikely to give relevant results (though it depends on your application and your use of the distance): is "Z" closer to "S" or to "A"?
Combined with a Euclidean distance for your numeric data (if you expect them to lie in the Euclidean plane... this might not be the case if they represent coordinates on Earth, angles, etc.), you can sum and weight all the squared distances to obtain a final metric.
For instance, you will get d(A,B) = sqrt( weight1*Levenshtein(textA, textB)^2 + weight2*Euclidean(numericA, numericB)^2)
Now the problem arises of how to set these weights. For instance, if you are measuring tiny numeric values in kilometers and you compute edit distances on very long strings, the numeric data will be almost irrelevant, so you would need to weight it more. This is domain specific, and only you can choose such weights, depending on your data and your application.
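A minimal sketch of that combined metric, with a hand-rolled Levenshtein distance (dedicated packages such as python-Levenshtein are faster); the weights are placeholders you would tune for your domain:
import numpy as np

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def mixed_distance(text_a, text_b, num_a, num_b, w_text=1.0, w_num=1.0):
    # d(A, B) = sqrt(w1 * Levenshtein^2 + w2 * Euclidean^2), as in the formula above.
    d_text = levenshtein(text_a, text_b)
    d_num = np.linalg.norm(np.asarray(num_a) - np.asarray(num_b))
    return (w_text * d_text**2 + w_num * d_num**2) ** 0.5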
In the end, it all depends on your application, which you did not specify, and on your data, whose meaning you did not mention. One application could be to build an acceleration structure; in that case, any not-too-stupid metric could work (including converting letters to ASCII numbers). Or it could be to query a database or to display these points, for which the choice would matter more. As for your data, the numeric values could represent coordinates on a plane or on the Earth (which would change the metric), and the textual data could be a single letter that you want to compare by how similar it sounds to another one, or a full text that could be off by a few letters from another text... Without more precision, it's hard to tell.

Related

Tensorflow-probability transform event shape of JointDistribution

I would like to create a distribution for n categorical variables C_1, ..., C_n whose event shape is n. Using JointDistributionSequentialAutoBatched, the event shape is a list [[], ..., []]. For example, for n=2:
import tensorflow_probability.python.distributions as tfd
probs = [
    [0.8, 0.2],      # C_1 in {0,1}
    [0.3, 0.3, 0.4]  # C_2 in {0,1,2}
]
D = tfd.JointDistributionSequentialAutoBatched([tfd.Categorical(probs=p) for p in probs])
>>> D
<tfp.distributions.JointDistributionSequentialAutoBatched 'JointDistributionSequentialAutoBatched' batch_shape=[] event_shape=[[], []] dtype=[int32, int32]>
How do I reshape it to get event shape [2]?
A few different approaches could work here:
Create a batch of Categorical distributions and then use tfd.Independent to reinterpret the batch dimension as the event:
vector_dist = tfd.Independent(
    tfd.Categorical(
        probs=[
            [0.8, 0.2, 0.0],  # C_1 in {0,1}
            [0.3, 0.3, 0.4]   # C_2 in {0,1,2}
        ]),
    reinterpreted_batch_ndims=1)
Here I added an extra zero to pad out probs so that both distributions can be represented by a single Categorical object.
Use the Blockwise distribution, which stuffs its component distributions into a single vector (as opposed to the JointDistribution classes, which return them as separate values):
vector_dist = tfd.Blockwise([tfd.Categorical(probs=p) for p in probs])
The closest to a direct answer to your question is to apply the Split bijector, whose inverse is Concat, to the joint distribution:
import tensorflow_probability as tfp
tfb = tfp.bijectors
D = tfd.JointDistributionSequentialAutoBatched(
    [tfd.Categorical(probs=[p]) for p in probs])
vector_dist = tfb.Invert(tfb.Split(2))(D)
Note that I had to awkwardly write probs=[p] instead of just probs=p. This is because the Concat bijector, like tf.concat, can't change the tensor rank of its argument---it can concatenate small vectors into a big vector, but not scalars into a vector---so we have to ensure that its inputs are themselves vectors. This could be avoided if TFP had a Stack bijector analogous to tf.stack / tf.unstack (it doesn't currently, but there's no reason this couldn't exist).
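A quick way to sanity-check the resulting event shape, sketched here for the Blockwise variant (assuming a standard TFP install; untested):
import tensorflow_probability as tfp
tfd = tfp.distributions

probs = [
    [0.8, 0.2],      # C_1 in {0,1}
    [0.3, 0.3, 0.4]  # C_2 in {0,1,2}
]
vector_dist = tfd.Blockwise([tfd.Categorical(probs=p) for p in probs])
print(vector_dist.event_shape)   # expected: [2]
print(vector_dist.sample())      # a length-2 vector, e.g. [0, 2]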

Why are my 1-D histograms not showing correctly?

I have two sets of data (x, y) corresponding to two 1-D histograms that are meant to be plotted next to each other as subplots. Both the x and y values are different, and hence they would be represented on different axes. The histogram heights (first item in hists) and the corresponding sequence of bins (second item in hists) are given for each subplot as follows:
Please note that each height corresponds to a bin in the sequence; the heights are already known for each bin. I just want to put the data in bar format using the hist function.
import numpy as np
import matplotlib.pyplot as plt

array_1 = np.array([ 8.20198063, 8.30645018, 8.30829034, 8.63297701, 0., 0., 10.43478942])
array_random_1 = np.array([ 8.23460584, 8.31556503, 8.3090378, 8.63147021, 0., 0., 10.41481862])
array_2 = np.array([10.4348338, 8.69943553, 8.68710347, 6.67854038])
array_random_2 = np.array([10.41597028, 8.76635268, 8.19516216, 6.68126994])
bins_1, bins_2 = [8.0, 8.6, 9.2, 9.8, 10.4, 11.0, 11.6, 12.2], [0.0, 0.25, 0.5, 0.75, 1.0]
Here is my attempt to plot these two subplots using the hist function from matplotlib:
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(12,3))
ax1.hist(array_1, bins_1, ec='blue', fc='none', lw=1.5, histtype='step', label='1')
ax1.hist(array_random_1, bins_1, ec='red', fc='none', lw=1.5, histtype='step', label='Random_1')
ax1.set_xlabel('X1')
ax1.set_ylabel('Y1')
ax2.hist(array_2, bins_2, ec='blue', fc='none', lw=1.5, histtype='step', label='2')
ax2.hist(array_random_2, bins_2, ec='red', fc='none', lw=1.5, histtype='step', label='Random_2')
ax2.set_xlabel('X2')
plt.show()
However, as you can see, the bars are not drawn to the correct heights (the blue bars are missing entirely) in the left panel, and everything is missing from the second panel. What is the issue in making these 1-D histograms? Does this mean that I cannot use hist for my purpose?
What I want is the following, which is doable using bar. How can I do it using hist?
From what I understood, in your code try replacing:
histtype='step'
with
histtype='bar'
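Applied to the first pair of calls from the question, the suggestion looks like this (a sketch; only histtype changes):
ax1.hist(array_1, bins_1, ec='blue', fc='none', lw=1.5, histtype='bar', label='1')
ax1.hist(array_random_1, bins_1, ec='red', fc='none', lw=1.5, histtype='bar', label='Random_1')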

Find edge points of numpy array for kmeans centroids initialization

I am working on implementing a kmeans algorithm in python.
I am testing out new ways of initializing my centroids and wanted to implement this and see what effect it would have on the clustering.
My idea is to select datapoints from my data set in a way that the centroids are initialized to edge points of my data.
A simple two-attribute example:
Let's say this is my input array:
input = np.array([[3,3], [1,1], [-1,-1], [3,-3], [-1,1], [-3,3], [1,-1], [-3,-3]])
From this array I would like to select the edge points, which would be [3,3], [-3,-3], [-3,3], and [3,-3]. So if my k is 4, these points would be selected.
The data sets that I am working with have 4 and 9 attributes and around 300 data points.
Note: I have not found a solution for when k is not equal to the number of edge points, but if k is greater than the number of edge points I think I would select these 4 points and then try to place the rest of them around the center point of the graph.
I have also thought about finding the max and min for each column and from there trying to find the edges of my data set, but I don't have an idea of an effective way of identifying the edges from those values.
If you believe this idea will not work I would love to hear what you have to say.
Questions
Does numpy have a function to get the indices of the data points on the edge of my data set?
If not, how would I go about finding these edge points in my data set?
Use scipy and pair-wise distances to find how far each point is from the others:
from scipy.spatial.distance import pdist, squareform
p=pdist(input)
Then, use squareform to turn the vector p into a matrix:
s=squareform(pdist(input))
Then, use numpy's argwhere to find the indices where the value is the maximum (i.e., the most extreme pair-wise distance), and look those indices up in the input array:
input[np.argwhere(s==np.max(p))]
array([[[ 3,  3],
        [-3, -3]],

       [[ 3, -3],
        [-3,  3]],

       [[-3,  3],
        [ 3, -3]],

       [[-3, -3],
        [ 3,  3]]])
The complete code would be:
import numpy as np
from scipy.spatial.distance import pdist, squareform

p = pdist(input)
s = squareform(p)
input[np.argwhere(s == np.max(p))]
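If you want the edge points themselves rather than index pairs, one way (a sketch building on the above; the input array is renamed X here to avoid shadowing the built-in input) is to take the unique indices returned by argwhere:
import numpy as np
from scipy.spatial.distance import pdist, squareform

X = np.array([[3, 3], [1, 1], [-1, -1], [3, -3], [-1, 1], [-3, 3], [1, -1], [-3, -3]])
s = squareform(pdist(X))
idx = np.unique(np.argwhere(s == s.max()))   # indices involved in the largest pair-wise distance
print(X[idx])                                # here: [3, 3], [3, -3], [-3, 3], [-3, -3]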

Why does deepcopy change values of numpy array?

I am having a problem in which the values in a Numpy array change after copying it with copy.deepcopy or numpy.copy; in fact, I get different values if I just print the array first before copying it.
I am using Python 3.5, Numpy 1.11.1, Scipy 0.18.0
My starting array is contained in a list of tuples; each tuple is a pair: a float (a time point) and a numpy array (the solution of an ODE at that time point), e.g.:
[(0.0, array([ 0., ... 0.])), ...
(3.0, array([ 0., ... 0.]))]
In this case, I want the array for the last time point.
When I call the following:
tandy = c1.IntegrateColony(3)
ylast = copy.deepcopy(tandy[-1][1])
print(ylast)
I get something that makes sense for the system I'm trying to simulate:
[7.14923891e-07 7.14923891e-07 ... 8.26478813e-01 8.85589634e-01]
However, with the following:
tandy = c1.IntegrateColony(3)
print(tandy[-1][1])
ylast = copy.deepcopy(tandy[-1][1])
print(ylast)
I get all zeros:
[0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00]
[ 0. 0. ... 0. 0.]
I should add that, with larger systems and different parameters, displaying tandy[k][1] (either with print() or just by calling it on the command line) shows non-zero values that are all very close to zero, i.e. < 1e-70, but that's still not sensible for the system.
With:
tandy = c1.IntegrateColony(3)
ylast = np.copy(tandy[-1][1])
print(ylast)
I get sensible output again:
[7.14923891e-07 7.14923891e-07 ... 8.26478813e-01 8.85589634e-01]
The function that generates tandy is the following (edited for clarity); it uses scipy.integrate.ode and the set_solout method to get the solution at intermediate time points:
def IntegrateColony(self, tmax=1):
    # I edited out the initialization of dCdt & first_step for clarity.
    y = ode(dCdt)
    y.set_integrator('dopri5', first_step=dt0, nsteps=2000)
    sol = []
    def solout(tcurrent, ytcurrent):
        sol.append((tcurrent, ytcurrent))
    y.set_solout(solout)
    y.set_initial_value(y=C0, t=0)
    yfinal = y.integrate(tmax)
    return sol
Although I could get the last time point by returning yfinal, I'd like to get the whole time course once I figure out why it's behaving the way it is.
Thanks for your suggestions!
Mickey
Edit:
If I print all of sol (print(tandy) or print(IntegrateColony(...))), it comes out as shown above (with the values in the arrays as 0), i.e.:
[(0.0, array([ 0., ... 0.])), ...
(3.0, array([ 0., ... 0.]))]
However, if I copy it with (y = copy.deepcopy(tandy); print(y)), the arrays take on values between 1e-7 and 1e+1.
If I do print(tandy[-1][1]) twice in a row, they're filled with zeros, but the format changes (from 0.0000 to 0.).
One other feature I noticed while following the suggestions in LutzL's and hpaulj's comments: if I run tandy = c1.IntegrateColony(3) in the console (running Spyder), the arrays are filled with zeros in the variable explorer. However, if I run the following in the console:
tandy = c1.IntegrateColony(3); ylast=copy.deepcopy(tandy)
Both the arrays in tandy and in ylast are filled with values in the range I would expect, and print(tandy[-1][1]) now gives:
[7.14923891e-07 7.14923891e-07 ... 8.26478813e-01 8.85589634e-01]
Even if I find a solution that stops this behavior, I'd appreciate anyone's insight about what's going on so I don't make the same mistakes again.
Thanks!
Edit:
Here's a simple case that gives this behavior:
import numpy as np
from scipy.integrate import ode

def testODEint(tmax=1):
    C0 = np.ones((3,))
    # C0 = 1  # This seems to behave the same
    def dCdt_simpleinputs(t, C):
        return C
    y = ode(dCdt_simpleinputs)
    y.set_integrator('dopri5')
    sol = []
    def solout(tcurrent, ytcurrent):
        sol.append((tcurrent, ytcurrent))            # Behaves oddly
        # sol.append((tcurrent, ytcurrent.copy()))   # LutzL's idea: works
    y.set_solout(solout)
    y.set_initial_value(y=C0, t=0)
    yfinal = y.integrate(tmax)
    return sol

tandy = testODEint(1)
ylast = np.copy(tandy[-1][1])
print(ylast)          # Expect same values as tandy[-1][1] below

tandy = testODEint(1)
tandy[-1][1]
print(tandy[-1][1])   # Expect same values as ylast above
When I run this, I get the following output for ylast and tandy[-1][1]:
[ 2.71828196 2.71828196 2.71828196]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00]
The code I was working on when I ran into this problem is an embarrassing mess, but if you want to take a look, an old version is here: https://github.com/mvondassow/BryozoanModel2
The details of why this is happening are tied to how ytcurrent is handled in integrate. But there are various contexts in Python where all values of a list end up the same - contrary to expectations.
For example:
In [159]: x
Out[159]: [0, 1, 2]

In [160]: x = []

In [161]: y = np.array([1, 2, 3])

In [162]: for i in range(3):
     ...:     y += i
     ...:     x.append(y)

In [163]: x
Out[163]: [array([4, 5, 6]), array([4, 5, 6]), array([4, 5, 6])]
All elements of x have the same value because they are all references to the same y, and thus show its final value.
But if I copy y before appending it to the list, I see the changes:
In [164]: x = []

In [165]: for i in range(3):
     ...:     y += i
     ...:     x.append(y.copy())

In [166]: x
Out[166]: [array([4, 5, 6]), array([5, 6, 7]), array([7, 8, 9])]
Now that does not explain why the print statement changes the values. But that whole solout callback mechanism is a bit obscure. I wonder if there are any warnings in scipy about pitfalls in defining such a callback?
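For completeness, the fix mentioned in the question's second edit (LutzL's suggestion) is simply to copy the buffer inside the callback, so each stored entry owns its own data:
def solout(tcurrent, ytcurrent):
    # The integrator reuses (and later overwrites) the ytcurrent buffer,
    # so store a copy rather than a reference to it.
    sol.append((tcurrent, ytcurrent.copy()))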

Looping through slices of Theano tensor

I have two 2D Theano tensors, call them x_1 and x_2, and suppose for the sake of example, both x_1 and x_2 have shape (1, 50). Now, to compute their mean squared error, I simply run:
T.sqr(x_1 - x_2).mean(axis = -1).
However, what I wanted to do was construct a new tensor that consists of their mean squared error in chunks of 10. In other words, since I'm more familiar with NumPy, what I had in mind was to create the following tensor M in Theano:
M = [theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1) for i in xrange(0, 50, 10)]
Now, since Theano doesn't have for loops, but instead uses scan (which map is a special case of), I thought I would try the following:
sequence = T.arange(0, 50, 10)
M = theano.map(lambda i: theano.tensor.sqr(x_1[:, i:i+10] - x_2[:, i:i+10]).mean(axis = -1), sequence)
However, this does not seem to work, as I get the error:
only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices
Is there a way to loop through the slices using theano.scan (or map)? Thanks in advance, as I'm new to Theano!
Similar to what can be done in numpy, a solution would be to reshape your (1, 50) tensor to a (1, 5, 10) tensor (or even a (5, 10) tensor), and then compute the mean along the last axis.
To illustrate this with numpy, suppose I want to compute means over slices of 2:
import numpy as np

x = np.array([0, 2, 0, 4, 0, 6])
x = x.reshape([3, 2])
np.mean(x, axis=1)
outputs
array([ 1., 2., 3.])
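Translated back to the Theano question (a sketch, assuming x_1 and x_2 have shape (1, 50) as stated, so there are five chunks of ten):
import theano.tensor as T

x_1 = T.matrix('x_1')   # shape (1, 50)
x_2 = T.matrix('x_2')   # shape (1, 50)

# Reshape so the last axis holds one chunk of 10, then average over it.
M = T.sqr(x_1 - x_2).reshape((1, 5, 10)).mean(axis=-1)   # shape (1, 5)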
