I have a JaggedArray (awkward.array.jagged.JaggedArray) that contains indices pointing to positions in another JaggedArray. Both arrays have the same length, but each of the numpy.ndarrays that the JaggedArrays contain can be of a different length. I would like to sort the second array using the indices of the first array, at the same time dropping the elements of the second array that are not indexed from the first array. The first array can additionally contain values of -1 (these could be replaced by None if needed, but this is currently not the case), which mean that there is no match in the second array. In such a case, the corresponding position in the output should be set to a default value (e.g. 0).
Here's a practical example and how I solve this at the moment:
import uproot
import numpy as np
import awkward

def good_index(my_indices, my_values):
    my_list = []
    for index in my_indices:
        if index > -1:
            my_list.append(my_values[index])
        else:
            my_list.append(0)
    return my_list

indices = awkward.fromiter([[0, -1], [3, 1, -1], [-1, 0, -1]])
values = awkward.fromiter([[1.1, 1.2, 1.3], [2.1, 2.2, 2.3, 2.4], [3.1]])
new_map = awkward.fromiter(map(good_index, indices, values))
The resulting new_map is: [[1.1 0.0] [2.4 2.2 0.0] [0.0 3.1 0.0]].
Is there a more efficient/faster way of achieving this? I was thinking that one could use numpy functionality such as numpy.where, but due to the different lengths of the ndarrays this fails, at least in the ways that I tried.
If all of the subarrays in values are guaranteed to be non-empty (so that indexing with -1 returns the last subelement, not an error), then you can do this:
>>> almost = values[indices] # almost what you want; uses -1 as a real index
>>> almost.content = awkward.MaskedArray(indices.content < 0, almost.content)
>>> almost.fillna(0.0)
<JaggedArray [[1.1 0.0] [2.4 2.2 0.0] [0.0 3.1 0.0]] at 0x7fe54c713c88>
The last step is optional because without it, the missing elements are None, rather than 0.0.
If some of the subarrays in values are empty, you can pad them to ensure they have at least one subelement. All of the original subelements are indexed the same way they were before, since pad only increases the length, if need be.
>>> values = awkward.fromiter([[1.1, 1.2, 1.3], [], [2.1, 2.2, 2.3, 2.4], [], [3.1]])
>>> values.pad(1)
<JaggedArray [[1.1 1.2 1.3] [None] [2.1 2.2 2.3 2.4] [None] [3.1]] at 0x7fe54c713978>
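Putting the two steps together for the padded case (a sketch; it assumes an indices array matching the new outer length of values, and uses fillna(0.0) on the padded array so that indexing with -1 hits a real subelement):

padded = values.pad(1).fillna(0.0)        # padding None becomes a real 0.0
almost = padded[indices]                  # -1 now safely picks the last subelement
almost.content = awkward.MaskedArray(indices.content < 0, almost.content)
result = almost.fillna(0.0)               # masked (no-match) entries become 0.0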
I am looking for a way to loop over 1D fibers (rows, columns, and their multi-dimensional equivalents) along any dimension in an array with 3 or more dimensions.
In a 2D array this is fairly trivial, since the fibers are rows and columns, so just saying for row in A gets the job done. But for 3D arrays, for example, this expression iterates over 2D slices, not 1D fibers.
A working solution is the one below:
import numpy as np

A = np.arange(27).reshape((3, 3, 3))
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print(func(A[fiber_index]))
However, I am wondering whether there is something that is:
More idiomatic
Faster
Hope you can help!
I think you might be looking for numpy.apply_along_axis:
In [10]: def my_func(x):
    ...:     return x**2 + x
In [11]: np.apply_along_axis(my_func, 2, A)
Out[11]:
array([[[ 0, 2, 6],
[ 12, 20, 30],
[ 42, 56, 72]],
[[ 90, 110, 132],
[156, 182, 210],
[240, 272, 306]],
[[342, 380, 420],
[462, 506, 552],
[600, 650, 702]]])
Although many NumPy functions (including sum) have their own axis argument to specify which axis to use:
In [12]: np.sum(A, axis=2)
Out[12]:
array([[ 3, 12, 21],
[30, 39, 48],
[57, 66, 75]])
numpy provides a number of different ways of looping over 1 or more dimensions.
Your example:
func = np.sum
for fiber_index in np.ndindex(A.shape[:-1]):
    print(fiber_index)
    print(A[fiber_index])
produces something like:
(0, 0)
[0 1 2]
(0, 1)
[3 4 5]
(0, 2)
[6 7 8]
...
ndindex generates all index combinations over the first two dimensions, giving your function the 1D fiber on the last.
Look at the code for ndindex; it's instructive. I tried to extract its essence in https://stackoverflow.com/a/25097271/901925.
It uses as_strided to generate a dummy matrix over which an nditer iterates. It uses the 'multi_index' mode to generate an index set, rather than the elements of that dummy. The iteration itself is done with a __next__ method. This is the same style of indexing that is currently used in numpy compiled code.
Iterating Over Arrays (http://docs.scipy.org/doc/numpy-dev/reference/arrays.nditer.html) has a good explanation, including an example of doing so in Cython.
Many functions, among them sum, max, and product, let you specify which axis (or axes) you want to iterate over. Your example, with sum, can be written as:
np.sum(A, axis=-1)
np.sum(A, axis=(1,2)) # sum over 2 axes
An equivalent is
np.add.reduce(A, axis=-1)
np.add is a ufunc, and reduce specifies an iteration mode. There are many other ufuncs, and other iteration methods such as accumulate and reduceat. You can also define your own ufunc.
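For example, with the same A (just to illustrate the other iteration methods; the index list [0, 2] is arbitrary):

np.add.accumulate(A, axis=-1)            # running sums along the last axis
np.add.reduceat(A, [0, 2], axis=-1)      # sums over index ranges [0:2) and [2:]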
xnx suggests
np.apply_along_axis(np.sum, 2, A)
It's worth digging through apply_along_axis to see how it steps through the dimensions of A. In your example, it steps over all possible i,j in a while loop, calculating:
outarr[(i,j)] = np.sum(A[(i, j, slice(None))])
Including slice objects in the indexing tuple is a nice trick. Note that apply_along_axis edits a list, and then converts it to a tuple for indexing; that's because tuples are immutable.
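In isolation the trick looks like this (a toy illustration):

idx = [0, 0, slice(None)]     # build the index as a mutable list
A[tuple(idx)]                 # equivalent to A[0, 0, :]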
Your iteration can be applied along any axis by rolling that axis to the end. This is a 'cheap' operation since it just changes the strides.
def with_ndindex(A, func, ax=-1):
    # apply func along axis ax
    A = np.rollaxis(A, ax, A.ndim)  # roll ax to the end (changes strides only)
    shape = A.shape[:-1]
    B = np.empty(shape, dtype=A.dtype)
    for ii in np.ndindex(shape):
        B[ii] = func(A[ii])
    return B
I did some timings on 3x3x3, 10x10x10 and 100x100x100 A arrays. This np.ndindex approach is consistently a third faster than the apply_along_axis approach. Direct use of np.sum(A, -1) is much faster.
So if func is limited to operating on a 1D fiber (unlike sum), then the ndindex approach is a good choice.
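As a quick check (using the A and with_ndindex from above):

>>> np.array_equal(with_ndindex(A, np.sum), np.sum(A, -1))
True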
In Efficient sorted Cartesian product of 2 sorted array of integers, a lazy algorithm is suggested for generating the ordered Cartesian product of two sorted integer arrays.
I am curious to know if there is a generalisation of this algorithm to more arrays.
For example, say we have 5 sorted arrays of doubles:
(0.7, 0.2, 0.1)
(0.6, 0.3, 0.1)
(0.5, 0.25, 0.25)
(0.4, 0.35, 0.25)
(0.35, 0.35, 0.3)
I am interested in generating the ordered Cartesian product without having to calculate all possible combinations.
I would appreciate any ideas on how a lazy Cartesian product algorithm might extend to more than two dimensions.
This problem appears to be an enumeration instance of uniform-cost search (see for example https://en.wikipedia.org/wiki/Dijkstra%27s_algorithm). Your state space is defined by the set of current indexes pointing into your sorted arrays. The successor function is an enumeration of the possible index increments for every array. For your example of 5 arrays, the initial state is (0, 0, 0, 0, 0).
There is no goal-state check function, as we need to go through all possibilities. The result is guaranteed to be sorted if all the input arrays are sorted: with positive values sorted in decreasing order (as in your example), incrementing any index can only decrease the product, so each successor's cost never exceeds its parent's, and the max-heap yields states in non-increasing order of product.
Assuming we have m arrays of length n each, the complexity of this method is O(n^m * log(n(m-1))).
Here is a sample implementation in python:
from heapq import heappush, heappop

def cost(s, lists):
    prod = 1
    for ith, x in zip(s, lists):
        prod *= x[ith]
    return prod

def successor(s, lists):
    successors = []
    for k, (i, x) in enumerate(zip(s, lists)):
        if i < len(x) - 1:
            t = list(s)
            t[k] += 1
            successors.append(tuple(t))
    return successors

def sorted_product(initial_state, lists):
    fringe = []
    explored = set()
    heappush(fringe, (-cost(initial_state, lists), initial_state))
    while fringe:
        best = heappop(fringe)[1]
        yield best
        for s in successor(best, lists):
            if s not in explored:
                heappush(fringe, (-cost(s, lists), s))
                explored.add(s)

if __name__ == '__main__':
    lists = ((0.7, 0.2, 0.1),
             (0.6, 0.3, 0.1),
             (0.5, 0.25, 0.25),
             (0.4, 0.35, 0.25),
             (0.35, 0.35, 0.3))
    init_state = tuple([0] * len(lists))
    for s in sorted_product(init_state, lists):
        s_output = [x[i] for i, x in zip(s, lists)]
        v = cost(s, lists)
        print('%s %s \t%s' % (s, s_output, v))
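For the example lists, the first state yielded is (0, 0, 0, 0, 0), whose product is 0.7 * 0.6 * 0.5 * 0.4 * 0.35 = 0.0294, and every following line has a product no greater than the one before it.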
So, if you have A = (A1, ..., An) and B = (B1, ..., Bn), then A < B if and only if
A1 * ... * An < B1 * ... * Bn
I'm assuming that every value is positive, because if we allow negatives, then:
(-50, -100, 1) > (1, 2, 3)
as -50 * (-100) * 1 = 5000 > 6 = 1 * 2 * 3
Even without negative values, the problem is still rather complex. You need a solution that includes a data structure with a depth of k. If (A1, ..., Ak) < (B1, ..., Bk), then we can assume that on the other dimensions, a combination (A1, ..., Ak, ..., An) is probably smaller than a combination (B1, ..., Bk, ..., Bn). Wherever this does not hold, the case is an exception to the rule. The data structure should hold:
k
the first k elements of A and B respectively
a description of the exceptions from the rule
For any such exception, there might be a combination (C1, ..., Ck) which is bigger than (B1, ..., Bk), but that bigger combination (C1, ..., Ck) might still have combinations using values of further dimensions where exceptions to the rule (A1, ..., Ak) < (C1, ..., Ck) are present.
So, if you already know that (A1, ..., Ak) < (B1, ..., Bk), then you first have to check whether there are exceptions, by finding the first l dimensions at which the comparison can flip when choosing the biggest possible values for A and the smallest possible values for B. If such an l exists, you should then find where the exception starts (which dimension, which index); this describes the exception. When you find an exception, you know that (A1, ..., Ak, ..., Al) > (B1, ..., Bk, ..., Bl), so here the rule is that A is bigger than B, and an exception to that rule would be present when B becomes bigger than A.
To reflect this, the data-structure would look like:
class Rule {
    int k;
    int[] smallerCombinationIndexes;
    int[] biggerCombinationIndexes;
    List<Rule> exceptions;
}
Whenever you find an exception to a rule, the exception is generated based on prior knowledge. Needless to say, the complexity greatly increases: you have exceptions for the rules, exceptions for the exceptions, and so on. The current approach would tell you, for two random combinations A and B, whether A is smaller than B, and it would also tell you, for combinations (A1, ..., Ak) and (B1, ..., Bk), at which key indexes the result of the comparison would change. Depending on your exact needs, this idea might be enough or might need extensions. So the answer to your question is: yes, you can extend the lazy algorithm to handle further dimensions, but you need to handle the exceptions to the rules to achieve that.
I have a set of vectors, each of which contains both textual and numeric elements. I am looking for similarity measures for such vectors and, if possible, their implemented frameworks. Any help much appreciated.
To me this is a data modeling problem rather than one of finding an appropriate similarity metric.
For instance, you can use Euclidean distance provided that you
re-scale your data (e.g., mean-centered & unit variance); and
re-code the "textual" elements (by which I assume you mean discrete variables such as a field storing gender with values of male and female)
So for instance, imagine a dataset comprised of data vectors, each with four features (columns or fields):
minutes_per_session, sessions_per_week, registered_user, sex
The first two are continuous (aka "numeric") variables, i.e., proper values are 12.5, 4.7, and so on.
The last two are discrete and obviously require transformation.
step 1: recoding discrete variables
The common technique is to re-code each discrete feature into a sequence of features, one feature for each value recorded for that feature (and in which each new feature is given the name of a value of the original feature).
Hence a single column storing the sex of each user, which might have values of M and F, would be transformed into two features (fields or columns), because sex has two possible values.
so the column of values for user sex:
['M']
['M']
['F']
['M']
['M']
['F']
['F']
['M']
['M']
['M']
becomes two columns
[1, 0]
[1, 0]
[0, 1]
[1, 0]
[1, 0]
[0, 1]
[0, 1]
[1, 0]
[1, 0]
[1, 0]
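A minimal NumPy sketch of this recoding (the explicit ['M', 'F'] category order is chosen to match the two columns shown above):

import numpy as np

sex = np.array(['M', 'M', 'F', 'M', 'M', 'F', 'F', 'M', 'M', 'M'])
categories = np.array(['M', 'F'])
# one column per category: 1 where the row's value matches that category
encoded = (sex[:, None] == categories[None, :]).astype(int)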
step 2: re-scaling the data (e.g., mean-centered and unit-variance)
a randomly generated 2D array of synthetic data:
>>> import numpy as np
>>> A = np.array([[ 3.,  5.,  2.,  4.],
...               [ 9.,  2.,  0.,  8.],
...               [ 5.,  1.,  8.,  0.],
...               [ 9.,  9.,  7.,  4.],
...               [ 3.,  1.,  6.,  2.]])
for each column: calculate the mean
then subtract the mean from each value in that column:
>>> A -= A.mean(axis=0)
>>> A
array([[-2.8, 1.4, -2.6, 0.4],
[ 3.2, -1.6, -4.6, 4.4],
[-0.8, -2.6, 3.4, -3.6],
[ 3.2, 5.4, 2.4, 0.4],
[-2.8, -2.6, 1.4, -1.6]])
for each column: now calculate the standard deviation
then divide each value in that column by this std:
>>> A /= A.std(axis=0)
verify:
>>> A.mean(axis=0)
array([ 0., -0., 0., -0.])
>>> A.std(axis=0)
array([ 1., 1., 1., 1.])
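The two steps can also be combined into one expression, equivalent to the above when applied to the original array:

>>> A = (A - A.mean(axis=0)) / A.std(axis=0)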
so the original array comprised of four columns now has six (after recoding the two discrete features); pair-wise similarity can be measured by Euclidean distance, like so:
take the first two data vectors (rows) of the recoded and re-scaled array (called A1 here):
>>> v1, v2 = A1[:2, :]
Euclidean distance, for a 2-feature space:
dist = ( (x2 - x1)**2 + (y2 - y1)**2 )**0.5
>>> sm = np.sum((v2 - v1)**2)**0.5
>>> sm
3.79
A nice metric for textual data is the Levenshtein distance (or edit distance), which counts how much you have to change one string to obtain the other. A less computationally intensive alternative is the Hamming distance, which provides a similar metric but requires the strings to have the same size. Converting letters to their ASCII representation is unlikely to give relevant results (though it depends on your application and your use of the distance): is "Z" closer to "S" or to "A"?
Combined with a Euclidean distance for your numeric data (if you expect them to lie in the Euclidean plane... this might not be the case if they represent coordinates on Earth, angles, etc.), you can sum and weight all the squared distances to obtain a final metric. For instance, you will get
d(A,B) = sqrt( weight1*Levenshtein(textA, textB)^2 + weight2*Euclidean(numericA, numericB)^2 )
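A minimal sketch of such a combined metric (the levenshtein helper and the default weights are illustrative, not from any particular framework):

import numpy as np

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def combined_distance(text_a, text_b, num_a, num_b, w_text=1.0, w_num=1.0):
    d_text = levenshtein(text_a, text_b)
    d_num = np.linalg.norm(np.asarray(num_a) - np.asarray(num_b))
    return (w_text * d_text**2 + w_num * d_num**2) ** 0.5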
Now the problem arises of how to set such weights. For instance, if you are measuring tiny numeric data in kilometers and you compute edit distances with very long strings, the numeric data will be almost irrelevant, so you would need to weigh them more. This is domain specific, and only you can choose such weights, depending on your data and your application.
In the end, it all depends on your application (which you did not specify) and on what your data represent (which you did not mention). One application could be to build an acceleration structure, in which case any not-too-stupid metric could work (including converting letters to ASCII numbers); or it could be to query a database or to display these points, for which the metric would matter more. As for your data: the numeric part could represent coordinates on a plane or on the Earth (which would change the metric), and the textual part could be a single letter that you want to compare by sound to another one, or a full text that could be off by a few letters from another text... Without more precision, it's hard to tell.
I have searched for an answer for my question on here but cannot find one, so I apologize in advance if it already exists!
What I am trying to do is create a 3D array of 3-d points in space (x,y,z). I know that in a 1D vector you can specify an interval, like 1:5:20, to get a vector from 1 to 20 spaced by 5. What I would like to do is create a 3D array, most likely built row by row for efficiency, where the spacing is given by a unit vector (ix, iy, iz). So, for example,
a(1,1,:) = [1, 1, 1]
uv = [0.5 0.5 0.5]
a(2,2,:) = [1.5, 1.5, 1.5]
etc. I know the numbers are not 'unit vectors', but the idea is there. Is there something along the lines of a = [1, 1, 1] : uv : [end, end, end]?
You might be interested in a mesh grid.
An example:
[X,Y,Z] = meshgrid(1:0.1:2, 1:0.1:2, 1:0.1:2); %# they can be different
points = [X(:) Y(:) Z(:)];
plot3(points(:,1),points(:,2),points(:,3),'.')
box on, axis equal
xlabel x, ylabel y, zlabel z
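If you specifically want the spacing to come from your uv vector, the three ranges passed to meshgrid need not be the same; here is a sketch reusing the question's variables (start and stop are assumed endpoints):

start = [1 1 1]; stop = [2 2 2]; uv = [0.5 0.5 0.5];
[X,Y,Z] = meshgrid(start(1):uv(1):stop(1), ...
                   start(2):uv(2):stop(2), ...
                   start(3):uv(3):stop(3));
points = [X(:) Y(:) Z(:)];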