I have two sets of data (x, y) corresponding to two 1-D histograms that are meant to be plotted next to each other as subplots. The x and y values differ between the two, so each subplot has its own axes. The histogram heights and the corresponding sequences of bin edges are given for each subplot as follows:
*Please note that each height corresponds to a bin in the sequence; the heights are already known for each bin. I just want to draw the data in bar form using the hist function.
import numpy as np
import matplotlib.pyplot as plt

array_1 = np.array([ 8.20198063, 8.30645018, 8.30829034, 8.63297701, 0., 0., 10.43478942])
array_random_1 = np.array([ 8.23460584, 8.31556503, 8.3090378, 8.63147021, 0., 0., 10.41481862])
array_2 = np.array([10.4348338, 8.69943553, 8.68710347, 6.67854038])
array_random_2 = np.array([10.41597028, 8.76635268, 8.19516216, 6.68126994])
bins_1, bins_2 = [8.0, 8.6, 9.2, 9.8, 10.4, 11.0, 11.6, 12.2], [0.0, 0.25, 0.5, 0.75, 1.0]
Here is my attempt to plot these two subplots using matplotlib's hist function:
fig, (ax1, ax2) = plt.subplots(1, 2, sharex=False, sharey=False, figsize=(12,3))
ax1.hist(array_1, bins_1, ec='blue', fc='none', lw=1.5, histtype='step', label='1')
ax1.hist(array_random_1, bins_1, ec='red', fc='none', lw=1.5, histtype='step', label='Random_1')
ax1.set_xlabel('X1')
ax1.set_ylabel('Y1')
ax2.hist(array_2, bins_2, ec='blue', fc='none', lw=1.5, histtype='step', label='2')
ax2.hist(array_random_2, bins_2, ec='red', fc='none', lw=1.5, histtype='step', label='Random_2')
ax2.set_xlabel('X2')
plt.show()
However, as you can see, the bars are not drawn to the correct heights in the left panel (the blue bars are missing entirely), and everything is missing from the second panel. What is the issue in making these 1-D histograms? Does this mean that I cannot use hist for my purpose?
What I want is the following, which is doable using bar. How can I do it using hist?
From what I understood, in your code try replacing:
histtype='step'
with
histtype='bar'
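Applied to the first panel of the code in the question, that change would look like the following (a minimal sketch; only the histtype argument is different):
ax1.hist(array_1, bins_1, ec='blue', fc='none', lw=1.5, histtype='bar', label='1')
ax1.hist(array_random_1, bins_1, ec='red', fc='none', lw=1.5, histtype='bar', label='Random_1')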
I would like to create a distribution for n categorical variables C_1, ..., C_n whose event shape is n. Using JointDistributionSequentialAutoBatched, the event shape comes out as a list [[], ..., []]. For example, for n = 2:
import tensorflow_probability.python.distributions as tfd
probs = [
    [0.8, 0.2],      # C_1 in {0,1}
    [0.3, 0.3, 0.4]  # C_2 in {0,1,2}
]
D = tfd.JointDistributionSequentialAutoBatched([tfd.Categorical(probs=p) for p in probs])
>>> D
<tfp.distributions.JointDistributionSequentialAutoBatched 'JointDistributionSequentialAutoBatched' batch_shape=[] event_shape=[[], []] dtype=[int32, int32]>
How do I reshape it to get event shape [2]?
A few different approaches could work here:
Create a batch of Categorical distributions and then use tfd.Independent to reinterpret the batch dimension as the event:
vector_dist = tfd.Independent(
    tfd.Categorical(
        probs=[
            [0.8, 0.2, 0.0],  # C_1 in {0,1}
            [0.3, 0.3, 0.4]   # C_2 in {0,1,2}
        ]),
    reinterpreted_batch_ndims=1)
Here I added an extra zero to pad out probs so that both distributions can be represented by a single Categorical object.
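As a rough sanity check (the exact output may vary across TFP versions), the wrapped distribution should now report a vector-valued event:
print(vector_dist.event_shape)  # should show a length-2 event, i.e. [2]
print(vector_dist.sample())     # e.g. a length-2 integer vector such as [0, 2]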
Use the Blockwise distribution, which stuffs its component distributions into a single vector (as opposed to the JointDistribution classes, which return them as separate values):
vector_dist = tfd.Blockwise([tfd.Categorical(probs=p) for p in probs])
The closest to a direct answer to your question is to apply the Split bijector, whose inverse is Concat, to the joint distribution:
import tensorflow_probability as tfp
tfb = tfp.bijectors

D = tfd.JointDistributionSequentialAutoBatched(
    [tfd.Categorical(probs=[p]) for p in probs])
vector_dist = tfb.Invert(tfb.Split(2))(D)
Note that I had to awkwardly write probs=[p] instead of just probs=p. This is because the Concat bijector, like tf.concat, can't change the tensor rank of its argument---it can concatenate small vectors into a big vector, but not scalars into a vector---so we have to ensure that its inputs are themselves vectors. This could be avoided if TFP had a Stack bijector analogous to tf.stack / tf.unstack (it doesn't currently, but there's no reason this couldn't exist).
I have a very large 2d array of shape (186295, 2) with the first element of every 2-element sub-array being x and the second element being y. Here is how I produce the scatter plot by separating x and y components in matplotlib:
ax.scatter(A[:, 0]+np.random.uniform(-.02, .02, A.shape[0]), A[:, 1], s=2, color='b', alpha=0.5, zorder=3)
However, I would like:
all points with x-values in the range [8, 9.2] to be shown as a dot plot at the midpoint x = 8.6,
all points with x-values in the range [9.2, 10.4] to be shown as a dot plot at the midpoint x = 9.8,
all points with x-values in the range [10.4, 12.2] to be shown as a dot plot at the midpoint x = 11.3.
Your help is greatly appreciated,
You can use np.select:
Example:
import numpy as np
from matplotlib import pyplot as plt
n=100
x = np.random.uniform(8, 12, n)
y = np.random.uniform(.01, 1, n)
a = np.array(list(zip(x,y)))
fig,ax = plt.subplots(2, sharex=True)
ax[0].scatter(a[:,0], a[:,1])
ax[0].title.set_text('Scatter Plot')
conditions = [a[:,0]<=8, a[:,0]<=9.2, a[:,0]<=10.4, a[:,0]<=12.2, a[:,0]>12.2]
choices = [a[:,0], 8.6, 9.8, 11.3, a[:,0]]
a[:,0] = np.select(conditions, choices)
ax[1].scatter(a[:,0], a[:,1])
ax[1].title.set_text('Dot Plot')
Result:
Another possibility is using np.digitize which saves some typing as it uses a list of bins (upper bounds) instead of a list of conditions.
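For example, the same mapping with np.digitize might look like this (a sketch reusing the bin edges and midpoints from the question, applied to an array a laid out as in the example above):
edges = [8.0, 9.2, 10.4, 12.2]        # range boundaries
mids = np.array([8.6, 9.8, 11.3])     # midpoint for each range
idx = np.digitize(a[:, 0], edges)     # 0 below 8.0, 1..3 inside the ranges, 4 at or above 12.2
inside = (idx >= 1) & (idx <= 3)
a[inside, 0] = mids[idx[inside] - 1]  # snap in-range x values to their midpoints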
I am having a problem in which values in a NumPy array change after copying it with copy.deepcopy or numpy.copy; in fact, I get different values if I just print the array before copying it.
I am using Python 3.5, Numpy 1.11.1, Scipy 0.18.0
My starting array is contained in a list of tuples; each tuple is a pair: a float (a time point) and a NumPy array (the solution of an ODE at that time point), e.g.:
[(0.0, array([ 0., ... 0.])), ...
(3.0, array([ 0., ... 0.]))]
In this case, I want the array for the last time point.
When I call the following:
tandy = c1.IntegrateColony(3)
ylast = copy.deepcopy(tandy[-1][1])
print(ylast)
I get something that makes sense for the system I'm trying to simulate:
[7.14923891e-07 7.14923891e-07 ... 8.26478813e-01 8.85589634e-01]
However, with the following:
tandy = c1.IntegrateColony(3)
print(tandy[-1][1])
ylast = copy.deepcopy(tandy[-1][1])
print(ylast)
I get all zeros:
[0.00000000e+00 0.00000000e+00 ... 0.00000000e+00 0.00000000e+00]
[ 0. 0. ... 0. 0.]
I should add that, with larger systems and different parameters, displaying tandy[k][1] (either with print() or just by evaluating it at the command line) shows values that are non-zero but all extremely close to zero, i.e. < 1e-70, which is still not sensible for the system.
With:
tandy = c1.IntegrateColony(3)
ylast = np.copy(tandy[-1][1])
print(ylast)
I get sensible output again:
[7.14923891e-07 7.14923891e-07 ... 8.26478813e-01 8.85589634e-01]
The function that generates tandy is the following (edited for clarity); it uses scipy.integrate.ode and its set_solout method to get the solution at intermediate time points:
def IntegrateColony(self, tmax=1):
    # I edited out initialization of dCdt & first_step for clarity.
    y = ode(dCdt)
    y.set_integrator('dopri5', first_step=dt0, nsteps=2000)
    sol = []
    def solout(tcurrent, ytcurrent):
        sol.append((tcurrent, ytcurrent))
    y.set_solout(solout)
    y.set_initial_value(y=C0, t=0)
    yfinal = y.integrate(tmax)
    return sol
Although I could get the last time point by returning yfinal, I'd like to get the whole time course once I figure out why it's behaving the way it is.
Thanks for your suggestions!
Mickey
Edit:
If I print all of sol (with print(tandy) or print(IntegrateColony(...))), it comes out as shown above, with the values in the arrays all 0, i.e.:
[(0.0, array([ 0., ... 0.])), ...
(3.0, array([ 0., ... 0.]))]
However, if I copy it with (y = copy.deepcopy(tandy); print(y)), the arrays take on values between 1e-7 and 1e+1.
If I do print(tandy[-1][1]) twice in a row, they're filled with zeros, but the format changes (from 0.0000 to 0.).
One other feature I noticed while following the suggestions in LutzL's and hpaulj's comments: if I run tandy = c1.IntegrateColony(3) in the console (running Spyder), the arrays are filled with zeros in the variable explorer. However, if I run the following in the console:
tandy = c1.IntegrateColony(3); ylast=copy.deepcopy(tandy)
Both the arrays in tandy and in ylast are filled with values in the range I would expect, and print(tandy[-1][1]) now gives:
[7.14923891e-07 7.14923891e-07 ... 8.26478813e-01 8.85589634e-01]
Even if I find a solution that stops this behavior, I'd appreciate anyone's insight about what's going on so I don't make the same mistakes again.
Thanks!
Edit:
Here's a simple case that gives this behavior:
import numpy as np
from scipy.integrate import ode

def testODEint(tmax=1):
    C0 = np.ones((3,))
    # C0 = 1  # This seems to behave the same
    def dCdt_simpleinputs(t, C):
        return C
    y = ode(dCdt_simpleinputs)
    y.set_integrator('dopri5')
    sol = []
    def solout(tcurrent, ytcurrent):
        sol.append((tcurrent, ytcurrent))  # Behaves oddly
        # sol.append((tcurrent, ytcurrent.copy()))  # LutzL's idea: Works
    y.set_solout(solout)
    y.set_initial_value(y=C0, t=0)
    yfinal = y.integrate(tmax)
    return sol

tandy = testODEint(1)
ylast = np.copy(tandy[-1][1])
print(ylast)  # Expect same values as tandy[-1][1] below

tandy = testODEint(1)
tandy[-1][1]
print(tandy[-1][1])  # Expect same values as ylast above
When I run this, I get the following output for ylast and tandy[-1][1]:
[ 2.71828196 2.71828196 2.71828196]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00]
The code I was working on when I ran into this problem is an embarrassing mess, but if you want to take a look, an old version is here: https://github.com/mvondassow/BryozoanModel2
The details of why this is happening are tied to how ytcurrent is handled in integrate. But there are various contexts in Python where all values of a list end up the same - contrary to expectations.
For example:
In [159]: x
Out[159]: [0, 1, 2]
In [160]: x=[]
In [161]: y=np.array([1,2,3])
In [162]: for i in range(3):
     ...:     y += i
     ...:     x.append(y)
In [163]: x
Out[163]: [array([4, 5, 6]), array([4, 5, 6]), array([4, 5, 6])]
All elements of x have the same value, because they are all references to the same array y, and thus show its final value.
But if I copy y before appending it to the list, I see the changes:
In [164]: x=[]
In [165]: for i in range(3):
     ...:     y += i
     ...:     x.append(y.copy())
In [166]: x
Out[166]: [array([4, 5, 6]), array([5, 6, 7]), array([7, 8, 9])]
Now that does not explain why the print statement changes the values. But that whole solout callback mechanism is a bit obscure. I wonder if there are any warnings in scipy about pitfalls in defining such a callback?
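Applied to the solout callback in the question, the fix already noted in the question's edit (copying the array before storing it) would look like this:
def solout(tcurrent, ytcurrent):
    # store a snapshot; ytcurrent appears to be a buffer the integrator reuses in place
    sol.append((tcurrent, ytcurrent.copy()))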
I have a set of vectors each of which contain both textual and numeric elements. I am looking for similarity measures for such vectors and if possible their implemented frameworks. Any help much appreciated.
To me this is a data modeling problem rather than one of finding an appropriate similarity metric.
For instance, you can use Euclidean distance provided that you
re-scale your data (e.g., mean-centered & unit variance); and
re-code the "textual" elements (by which I assume you mean discrete variables such as a field storing gender with values of male and female).
so for instance, imagine a dataset comprised of data vectors each with four features (columns or fields):
minutes_per_session, sessions_per_week, registered_user, sex
The first two are continuous (aka "numeric") variables--i.e., proper values are 12.5, 4.7 and so on.
the second two are discrete and obviously require transformation.
step 1: recoding discrete variables
The common technique is to re-code each discrete feature into a sequence of features, one feature for each value recorded for that feature (and in which each new feature is given the name of a value of the original feature).
hence a single column storing the sex of each user, with values of M and F, would be transformed into two features (fields or columns) because sex has two possible values.
so the column of values for user sex:
['M']
['M']
['F']
['M']
['M']
['F']
['F']
['M']
['M']
['M']
becomes two columns
[1, 0]
[1, 0]
[0, 1]
[1, 0]
[1, 0]
[0, 1]
[0, 1]
[1, 0]
[1, 0]
[1, 0]
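A minimal NumPy sketch of this recoding (the column order here is chosen to match the example above):
import numpy as np

sex = np.array(['M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M'])
categories = np.array(['M', 'F'])                    # one output column per category
recoded = (sex[:, None] == categories).astype(int)   # rows like [1, 0] for M, [0, 1] for F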
step 2: re-scaling the data (e.g., mean-centered and unit-variance)
a random-generated 2D array for synthetic data:
array([[ 3., 5., 2., 4.],
[ 9., 2., 0., 8.],
[ 5., 1., 8., 0.],
[ 9., 9., 7., 4.],
[ 3., 1., 6., 2.]])
for each column: calculate the mean
then subtract the mean from each value in that column:
>>> A -= A.mean(axis=0)
>>> A
array([[-2.8, 1.4, -2.6, 0.4],
[ 3.2, -1.6, -4.6, 4.4],
[-0.8, -2.6, 3.4, -3.6],
[ 3.2, 5.4, 2.4, 0.4],
[-2.8, -2.6, 1.4, -1.6]])
for each column: now calculate the standard deviation
then divide each value in that column by this std:
>>> A /= A.std(axis=0)
verify:
>>> A.mean(axis=0)
array([ 0., -0., 0., -0.])
>>> A.std(axis=0)
array([ 1., 1., 1., 1.])
so the original array comprised of four columns now has six; pair-wise similarity can be measured by Euclidean distance, like so:
take the first two data vectors (rows):
>>> v1, v2 = A1[:2,:]
Euclidean distance, for a 2-feature space:
dist = ( (x2 - x1)**2 + (y2 - y1)**2 )**0.5
>>> sm = NP.sum((v2 - v1)**2)**.5
>>> sm
3.79
A nice metric for textual data is the Levenshtein distance (or edit distance), which counts how much you have to change one string to obtain the other. A less computationally intensive option is the Hamming distance, which provides a similar metric but requires the strings to have the same length. Converting letters to their ASCII representation is unlikely to give relevant results (or it depends on your application and your use of the distance): is "Z" closer to "S" or to "A"?
Combined with a Euclidean distance for your numeric data (if you expect them to lie in the Euclidean plane... this might not be the case if they represent coordinates on Earth, angles, etc.), you can sum and weight all the squared distances to obtain a final metric.
For instance, you will get d(A,B) = sqrt( weight1*Levenshtein(textA, textB)^2 + weight2*Euclidean(numericA, numericB)^2)
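A small sketch of that combined metric (the weights, the example values, and the simple Levenshtein implementation below are only illustrative):
import math

def levenshtein(s, t):
    # classic dynamic-programming edit distance
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]

def mixed_distance(text_a, text_b, num_a, num_b, weight1=1.0, weight2=1.0):
    # d(A,B) = sqrt(weight1*Levenshtein(textA, textB)^2 + weight2*Euclidean(numericA, numericB)^2)
    euclid = math.dist(num_a, num_b)
    return math.sqrt(weight1 * levenshtein(text_a, text_b) ** 2 + weight2 * euclid ** 2)

print(mixed_distance("kitten", "sitting", [1.0, 2.0], [1.5, 2.5]))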
Now the problem arises about how to set such weights. For instance, if you are measuring tiny numeric data in kilometers and you compute edit distances with very long strings, the numeric data will almost be irrelevant, so you would need to weigh them more. This is domain specific, and only you can choose such weights depending on your data and your applications.
In the end, everything depends on your application, which you did not specify, and on your data, whose meaning you did not describe. One application could be to build an acceleration structure; in such a case, any not-too-stupid metric could work (including converting letters to ASCII numbers). Or it could be to query a database or to display these points, for which the choice would matter more. As for your data, the numeric part could represent coordinates on a plane or on the Earth (which would change the metric), and the textual part could be a single letter that you want to compare by sound with another one, or a full text that may be off by a few letters from another text... Without more precision, it's hard to tell.
I have searched for an answer for my question on here but cannot find one, so I apologize in advance if it already exists!
What I am trying to do is create a 3D array of 3-D points in space (x, y, z). I know that for a 1-D vector you can specify an interval, like 1:5:20, to get a vector from 1 to 20 spaced by 5. What I would like to do is create a 3D array, most likely row by row for efficiency, where the spacing is given by a unit vector (ix, iy, iz). So, for example,
a(1,1,:) = [1, 1, 1]
uv = [0.5 0.5 0.5]
a(2,2,:) = [1.5, 1.5, 1.5]
etc. I know the numbers are not 'unit vectors', but the idea is there. Is there something along the lines of a = [1, 1, 1] : uv : [end, end, end] ???
You might be interested in a mesh grid.
An example:
[X,Y,Z] = meshgrid(1:0.1:2, 1:0.1:2, 1:0.1:2); %# they can be different
points = [X(:) Y(:) Z(:)];
plot3(points(:,1),points(:,2),points(:,3),'.')
box on, axis equal
xlabel x, ylabel y, zlabel z