Methods of creating a structured array

I have the following information and can produce a numpy array of the desired structure. Note that the values x and y have to be generated separately, since their ranges may differ, so I cannot use:
xy = np.random.random_integers(0,10,size=(N,2))
The extra list(...) conversion is necessary for this to work in Python 3.4, where zip returns an iterator; it is not needed in Python 2.7, but it is harmless there.
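For illustration, in Python 3 zip returns a lazy iterator, and without the list(...) call np.array would treat it as a single opaque object (a minimal sketch, not from the original post):
>>> z = zip(range(3), range(3))
>>> z                     # an iterator in Python 3, not a list
<zip object at 0x...>     # (address varies)
>>> list(z)               # materialize before handing it to np.array
[(0, 0), (1, 1), (2, 2)]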
The following works:
>>> # attempts to formulate [id,(x,y)] with specified dtype
>>> N = 10
>>> x = np.random.random_integers(0,10,size=N)
>>> y = np.random.random_integers(0,10,size=N)
>>> id = np.arange(N)
>>> dt = np.dtype([('ID','<i4'),('Shape',('<f8',(2,)))])
>>> arr = np.array(list(zip(id,np.hstack((x,y)))),dt)
>>> arr
array([(0, [7.0, 7.0]), (1, [7.0, 7.0]), (2, [5.0, 5.0]), (3, [0.0, 0.0]),
       (4, [6.0, 6.0]), (5, [6.0, 6.0]), (6, [7.0, 7.0]),
       (7, [10.0, 10.0]), (8, [3.0, 3.0]), (9, [7.0, 7.0])],
      dtype=[('ID', '<i4'), ('Shape', '<f8', (2,))])
I thought I could cleverly circumvent the above nasty bits by simply creating the array in the desired vertical structure and applying my dtype to it, hoping it would work. The stacked array is correct in its vertical form:
>>> a = np.vstack((id,x,y)).T
>>> a
array([[ 0,  7,  6],
       [ 1,  7,  7],
       [ 2,  5,  9],
       [ 3,  0,  1],
       [ 4,  6,  1],
       [ 5,  6,  6],
       [ 6,  7,  6],
       [ 7, 10,  9],
       [ 8,  3,  2],
       [ 9,  7,  8]])
I tried several ways of reformulating the above array so that my dtype would work (including vstacking a vstack, etc.), and I just can't figure it out. So my question is: how can I take the vstack version and get it into a format that meets my dtype requirements, without going through the procedure I used above? I am hoping it is obvious, but I have sliced, stacked and ellipsed myself into an endless loop.
SUMMARY
Many thanks to hpaulj. I have included two incarnations based upon his suggestions for others to consider. The pure numpy solution is substantially faster and a lot cleaner.
"""
Script: pnts_StackExch
Author: Dan.Patterson#carleton.ca
Modified: 2015-08-24
Purpose:
To provide some timing options on point creation in preparation for
point-to-point distance calculations using einsum.
Reference:
http://stackoverflow.com/questions/32224220/
methods-of-creating-a-structured-array
Functions:
decorators: profile_func, timing, arg_deco
main: make_pnts, einsum_0
"""
import numpy as np
import random
import time
from functools import wraps
np.set_printoptions(edgeitems=5, linewidth=75, precision=2, suppress=True, threshold=5)
# .... wrapper funcs .............
def delta_time(func):
    """timing decorator function"""
    import time
    #wraps(func)
    def wrapper(*args, **kwargs):
        print("\nTiming function for... {}".format(func.__name__))
        t0 = time.time()                  # start time
        result = func(*args, **kwargs)    # ... run the function ...
        t1 = time.time()                  # end time
        print("Results for... {}".format(func.__name__))
        print("  time taken ...{:12.9f} sec.".format(t1 - t0))
        #print("\n  print results inside wrapper or use <return> ... ")
        return result                     # return the result of the function
    return wrapper

def arg_deco(func):
    """This wrapper just prints some basic function information."""
    #wraps(func)
    def wrapper(*args, **kwargs):
        print("Function... {}".format(func.__name__))
        #print("File....... {}".format(func.__code__.co_filename))
        print("  args.... {}\n  kwargs. {}".format(args, kwargs))
        #print("  docs.... {}\n".format(func.__doc__))
        return func(*args, **kwargs)
    return wrapper
# .... main funcs ................
@delta_time
@arg_deco
def pnts_IdShape(N=1000000, x_min=0, x_max=10, y_min=0, y_max=10):
    """Make N points based upon a random normal distribution,
    with optional min/max values for Xs and Ys
    """
    dt = np.dtype([('ID', '<i4'), ('Shape', ('<f8', (2,)))])
    IDs = np.arange(0, N)
    Xs = np.random.random_integers(x_min, x_max, size=N)  # note below
    Ys = np.random.random_integers(y_min, y_max, size=N)
    a = np.array([(i, j) for i, j in zip(IDs, np.column_stack((Xs, Ys)))], dt)
    return IDs, Xs, Ys, a
@delta_time
@arg_deco
def alternate(N=1000000, x_min=0, x_max=10, y_min=0, y_max=10):
    """after hpaulj and his mods to the above and this. See docs
    """
    dt = np.dtype([('ID', '<i4'), ('Shape', ('<f8', (2,)))])
    IDs = np.arange(0, N)
    Xs = np.random.random_integers(x_min, x_max, size=N)
    Ys = np.random.random_integers(y_min, y_max, size=N)
    c_stack = np.column_stack((IDs, Xs, Ys))
    a = np.ones(N, dtype=dt)
    a['ID'] = c_stack[:, 0]
    a['Shape'] = c_stack[:, 1:]
    return IDs, Xs, Ys, a
if __name__ == "__main__":
    """time testing for various methods
    """
    id_1, xs_1, ys_1, a_1 = pnts_IdShape(N=1000000, x_min=0, x_max=10, y_min=0, y_max=10)
    id_2, xs_2, ys_2, a_2 = alternate(N=1000000, x_min=0, x_max=10, y_min=0, y_max=10)
Timing results for 1,000,000 points are as follows:

Timing function for... pnts_IdShape
Function... pnts_IdShape
  args.... ()
  kwargs. {'N': 1000000, 'y_max': 10, 'x_min': 0, 'x_max': 10, 'y_min': 0}
Results for... pnts_IdShape
  time taken ... 0.680652857 sec.

Timing function for... alternate
Function... alternate
  args.... ()
  kwargs. {'N': 1000000, 'y_max': 10, 'x_min': 0, 'x_max': 10, 'y_min': 0}
Results for... alternate
  time taken ... 0.060056925 sec.

There are two ways of filling a structured array (http://docs.scipy.org/doc/numpy/user/basics.rec.html#filling-structured-arrays) - by row (or rows, with a list of tuples), and by field.
To do this by field, create the empty structured array and assign values by field name:
In [19]: a = np.column_stack((id, x, y))      # same as your vstack().T

In [20]: Y = np.zeros(a.shape[0], dtype=dt)   # empty, ones, etc.

In [21]: Y['ID'] = a[:, 0]

In [22]: Y['Shape'] = a[:, 1:]                # a (2,) field takes a 2-column array

In [23]: Y
Out[23]:
array([(0, [8.0, 8.0]), (1, [8.0, 0.0]), (2, [6.0, 2.0]), (3, [8.0, 8.0]),
       (4, [3.0, 2.0]), (5, [6.0, 1.0]), (6, [5.0, 6.0]), (7, [7.0, 7.0]),
       (8, [6.0, 1.0]), (9, [6.0, 6.0])],
      dtype=[('ID', '<i4'), ('Shape', '<f8', (2,))])
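As a quick sanity check (a hypothetical follow-up, not in the original answer), the integer columns are cast on assignment because the field's dtype governs:
In [24]: Y['Shape'].dtype    # the int columns were cast to f8 when assigned
Out[24]: dtype('float64')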
On the surface,
arr = np.array(list(zip(id, np.hstack((x, y)))), dt)
looks like an OK way of constructing the list of tuples needed to fill the array. But the result duplicates the values of x instead of using y. The reason: np.hstack((x, y)) is a flat array of length 2N, so zip pairs each of the N ids with a single scalar, all drawn from the x half, and that scalar is then broadcast into both slots of the (2,) 'Shape' field.
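A quick check makes the broadcast visible (a sketch using the ten-point arrays from the question; the values follow the first output above):
>>> list(zip(id, np.hstack((x, y))))[:3]   # each ID pairs with ONE scalar, taken from x
[(0, 7), (1, 7), (2, 5)]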
You can take a view of an array like a if the dtype is compatible - the data buffer for 3 int columns is laid out the same way as one with 3 int fields:
a.view('i4,i4,i4')
But your dtype wants 'i4,f8,f8', a mix of 4- and 8-byte fields, and a mix of int and float. The a buffer would have to be transformed to achieve that, and view can't do it. (Don't even ask about .astype.)
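To see the incompatibility concretely (a sketch; the exact error message depends on the numpy version):
>>> a.view('i4,f8,f8')    # 3*4 bytes per row vs 4+8+8: itemsizes don't match
Traceback (most recent call last):
  ...
ValueError: ...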
Corrected list-of-tuples method:
In [35]: np.array([(i, j) for i, j in zip(id, np.column_stack((x, y)))], dt)
Out[35]:
array([(0, [8.0, 8.0]), (1, [8.0, 0.0]), (2, [6.0, 2.0]), (3, [8.0, 8.0]),
       (4, [3.0, 2.0]), (5, [6.0, 1.0]), (6, [5.0, 6.0]), (7, [7.0, 7.0]),
       (8, [6.0, 1.0]), (9, [6.0, 6.0])],
      dtype=[('ID', '<i4'), ('Shape', '<f8', (2,))])
The list comprehension produces a list like:
[(0, array([8, 8])),
(1, array([8, 0])),
(2, array([6, 2])),
....]
For each tuple in the list, element [0] goes into the first field of the dtype, and element [1] (a small array) goes into the 2nd.
The tuples could also be constructed with
[(i,[j,k]) for i,j,k in zip(id,x,y)]
dt1 = np.dtype([('ID','<i4'),('Shape',('<i4',(2,)))])
is a view-compatible dtype (still 3 integers):
In [42]: a.view(dtype=dt1)
Out[42]:
array([[(0, [8, 8])],
       [(1, [8, 0])],
       [(2, [6, 2])],
       [(3, [8, 8])],
       [(4, [3, 2])],
       [(5, [6, 1])],
       [(6, [5, 6])],
       [(7, [7, 7])],
       [(8, [6, 1])],
       [(9, [6, 6])]],
      dtype=[('ID', '<i4'), ('Shape', '<i4', (2,))])
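If the float version is still wanted from there, one possible follow-up (a sketch, not from the original answer; structured-to-structured casting details have varied across numpy versions) is to drop the extra axis and cast field by field:
In [43]: b = a.view(dtype=dt1).reshape(-1)   # hypothetical: flatten the (10,1) view
In [44]: b.astype(dt)                        # cast the 'Shape' field i4 -> f8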

Related

Dictionary output has array inside

I am trying one of the online tutorials: take a list of nine numbers and create a dictionary with statistics. Below is the code with the input data, and the result as well:
import numpy as np

a = [0, 1, 2, 3, 4, 5, 6, 7, 8]
arr = np.array(a).reshape(3, 3).astype(int)
result = {
    "mean": [],
    "variance": [],
    "standard deviation": [],
    "max": [],
    "min": [],
    "sum": []
}

# Creating a function1
def calculate1(a):
    calculate1 = arr.mean(axis=a)
    return calculate1

result["mean"].append(calculate1(0))
result["mean"].append(calculate1(1))
result["mean"].append(calculate1(None))

# Creating a function2
def calculate2(a):
    calculate2 = arr.var(axis=a)
    return calculate2

result["variance"].append(calculate2(0))
result["variance"].append(calculate2(1))
result["variance"].append(calculate2(None))

# Creating a function3
def calculate3(a):
    calculate3 = arr.std(axis=a)
    return calculate3

result["standard deviation"].append(calculate3(0))
result["standard deviation"].append(calculate3(1))
result["standard deviation"].append(calculate3(None))

# Creating a function4
def calculate4(a):
    calculate4 = arr.max(axis=a)
    return calculate4

result["max"].append(calculate4(0))
result["max"].append(calculate4(1))
result["max"].append(calculate4(None))

# Creating a function5
def calculate5(a):
    calculate5 = arr.min(axis=a)
    return calculate5

result["min"].append(calculate5(0))
result["min"].append(calculate5(1))
result["min"].append(calculate5(None))

# Creating a function6
def calculate6(a):
    calculate6 = arr.sum(axis=a)
    return calculate6

result["sum"].append(calculate6(0))
result["sum"].append(calculate6(1))
result["sum"].append(calculate6(None))

for k, v in result.items():
    print(k, v)
And here is the result:
mean [array([3., 4., 5.]), array([1., 4., 7.]), 4.0]
variance [array([6., 6., 6.]), array([0.66666667, 0.66666667, 0.66666667]), 6.666666666666667]
standard deviation [array([2.44948974, 2.44948974, 2.44948974]), array([0.81649658, 0.81649658, 0.81649658]), 2.581988897471611]
max [array([6, 7, 8]), array([2, 5, 8]), 8]
min [array([0, 1, 2]), array([0, 3, 6]), 0]
sum [array([ 9, 12, 15]), array([ 3, 12, 21]), 36]
I have two questions here:
1- Is there a way to combine the functions, or minimize their number down to one or something like that? Please note that I (have to) use functions.
2- The output is correct (in values); however, I am not sure why the word "array" is printed as well. When I check the type of the values inside the dictionary, it shows they are <class 'list'>, so where is this "array" word coming from?
I tried .tolist() and plenty of online suggestions, but nothing worked.
Any help or suggestion is highly appreciated.
You can store your functions inside a dict and then iterate over it:
from pprint import pprint

import numpy as np

def main():
    arr = np.random.rand(3, 3)
    functions = {
        "mean": lambda axis: arr.mean(axis=axis),
        "var": lambda axis: arr.var(axis=axis),
        "std": lambda axis: arr.std(axis=axis),
        "max": lambda axis: arr.max(axis=axis),
        "min": lambda axis: arr.min(axis=axis),
        "sum": lambda axis: arr.sum(axis=axis),
    }
    axes = (0, 1, None)
    result = {}
    for funcname, func in functions.items():
        result[funcname] = [func(axis).tolist() for axis in axes]
    # Alternatively:
    result = {
        funcname: [func(axis).tolist() for axis in axes]
        for funcname, func in functions.items()
    }
    pprint(result)

if __name__ == "__main__":
    main()
Prints:
{'max': [[0.33149413492721314, 0.9252576833729358, 0.9616249059176883],
[0.37580580905770067, 0.9616249059176883, 0.9252576833729358],
0.9616249059176883],
'mean': [[0.23391570323037428, 0.4063894010374775, 0.6764668740080081],
[0.20197437573445387, 0.4652236940918113, 0.6495739084495947],
0.43892399275862],
'min': [[0.0958037701384552, 0.13431354800720574, 0.37580580905770067],
[0.0958037701384552, 0.15959697173229104, 0.33149413492721314],
0.0958037701384552],
'std': [[0.10039824223253171, 0.3670404461719236, 0.23941075106262735],
[0.1239187264736742, 0.35412651334119355, 0.24424967197333333],
0.3170854368356986],
'sum': [[0.7017471096911229, 1.2191682031124325, 2.029400622024024],
[0.6059231272033616, 1.395671082275434, 1.948721725348784],
3.95031593482758],
'var': [[0.010079807043382115, 0.13471868912608476, 0.057317507724371324],
[0.015355850770857285, 0.12540558745119054, 0.05965790225908093],
0.10054317425328584]}
As for why "array" is printed: np.mean(arr, axis=1), for example, returns a numpy array, and that is how a numpy array renders when printed inside a list; .tolist() converts it to a plain Python list.
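A minimal illustration, using the 3x3 array from the question:
>>> import numpy as np
>>> arr = np.arange(9).reshape(3, 3)
>>> arr.mean(axis=1)
array([1., 4., 7.])
>>> arr.mean(axis=1).tolist()    # plain floats, no 'array' wrapper
[1.0, 4.0, 7.0]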

A way to fix some columns of a numpy array to be integer and the rest as float? [duplicate]

I have two different arrays, one with strings and another with ints. I want to concatenate them into one array where each column keeps its original datatype. My current solution for doing this (see below) converts the entire array to dtype = string, which seems very memory inefficient.
combined_array = np.concatenate((A, B), axis = 1)
Is it possible to have multiple dtypes in combined_array when A.dtype = string and B.dtype = int?
One approach might be to use a record array. The "columns" won't be like the columns of standard numpy arrays, but for most use cases, this is sufficient:
>>> a = numpy.array(['a', 'b', 'c', 'd', 'e'])
>>> b = numpy.arange(5)
>>> records = numpy.rec.fromarrays((a, b), names=('keys', 'data'))
>>> records
rec.array([('a', 0), ('b', 1), ('c', 2), ('d', 3), ('e', 4)],
          dtype=[('keys', '|S1'), ('data', '<i8')])
>>> records['keys']
rec.array(['a', 'b', 'c', 'd', 'e'],
          dtype='|S1')
>>> records['data']
array([0, 1, 2, 3, 4])
Note that you can also do something similar with a standard array by specifying the datatype of the array. This is known as a "structured array":
>>> arr = numpy.array([('a', 0), ('b', 1)],
...                   dtype=([('keys', '|S1'), ('data', 'i8')]))
>>> arr
array([('a', 0), ('b', 1)],
      dtype=[('keys', '|S1'), ('data', '<i8')])
The difference is that record arrays also allow attribute access to individual data fields. Standard structured arrays do not.
>>> records.keys
chararray(['a', 'b', 'c', 'd', 'e'],
          dtype='|S1')
>>> arr.keys
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'numpy.ndarray' object has no attribute 'keys'
A simple solution: convert your data to object 'O' type
z = np.zeros((2,2), dtype='U2')
o = np.ones((2,1), dtype='O')
np.hstack([o, z])
creates the array:
array([[1, '', ''],
       [1, '', '']], dtype=object)
Referring to the Numpy docs, there is a function named numpy.lib.recfunctions.merge_arrays which can be used to merge numpy arrays of different data types into either a structured array or a record array.
Example:
>>> from numpy.lib import recfunctions as rfn
>>> A = np.array([1, 2, 3])
>>> B = np.array(['a', 'b', 'c'])
>>> b = rfn.merge_arrays((A, B))
>>> b
array([(1, 'a'), (2, 'b'), (3, 'c')], dtype=[('f0', '<i4'), ('f1', '<U1')])
For more detail, please refer to the link above.
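The merged result gets default field names f0, f1, ..., which can be indexed as usual (a quick check, not part of the original answer):
>>> b['f0']
array([1, 2, 3])
>>> b['f1']
array(['a', 'b', 'c'], dtype='<U1')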

Associating two arrays in an RDD by index

I have an RDD that contains two arrays for each row, RDD[(Array[Int], Array[Double])]. For each row, the two arrays have the same size n; however, n differs from row to row and can be up to 200. The sample data is as follows:
(Array(1, 3, 5), Array(1.0, 1.0, 2.0))
(Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0))
(Array(2, 4), Array(1.0, 3.0))
. . .
I want to combine those two arrays according to the index within each line. So, the expected output is as follows:
((1,1.0), (3,1.0), (5,2.0))
((6,2.0), (3,1.0), (1,2.0), (9,1.0))
((2,1.0), (4,3.0))
This is my code:
val data = spark.sparkContext.parallelize(Seq(
  (Array(1, 3, 5), Array(1.0, 1.0, 2.0)),
  (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)),
  (Array(2, 4), Array(1.0, 3.0))
))
val pairArr = data.map{ x =>
  (x._1(0), x._2(0))
}
//pairArr: Array((1,1.0), (6,2.0), (2,1.0))
This code only takes the value of the first index in each row.
Can anybody give me direction how to get the expected output?
Thanks.
You need to zip the two elements in each tuple:
data.map(x => x._1.zip(x._2)).collect
// res1: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))
Or with pattern matching:
data.map{ case (x, y) => x.zip(y) }.collect
// res0: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))

replace numpy elements with non-scalar dictionary values

import pandas as pd
import numpy as np
column = np.array([5505, 5505, 5505, 34565, 34565, 65539, 65539])
column = pd.Series(column)
myDict = column.groupby(by = column ).groups
I am creating a dictionary from a pandas df using df.groupby(by=...), which has the form:
>>> myDict
{5505: Int64Index([0, 1, 2], dtype='int64'), 65539: Int64Index([5, 6], dtype='int64'), 34565: Int64Index([3, 4], dtype='int64')}
I have a numpy array, e.g.
myArray = np.array([34565, 34565, 5505,65539])
and I want to replace each of the array's elements with the dictionary's values.
I have tried several solutions that I found (e.g. here and here), but those examples have scalar dictionary values, and I always get the error "setting an array element with a sequence". How can I get around this problem?
My intended output is
np.array([3, 4, 3, 4, 0, 1, 2, 5, 6])
One approach based on np.searchsorted -
# Extract dict info
k = list(myDict.keys())
v = list(myDict.values())
# Use argsort of k to find search sorted indices from myArray in keys
# Index into the values of dict based on those indices for output
sidx = np.argsort(k)
idx = sidx[np.searchsorted(k,myArray,sorter=sidx)]
out_arr = np.concatenate([v[i] for i in idx])
Sample input, output -
In [369]: myDict
Out[369]:
{5505: Int64Index([0, 1, 2], dtype='int64'),
34565: Int64Index([3, 4], dtype='int64'),
65539: Int64Index([5, 6], dtype='int64')}
In [370]: myArray
Out[370]: array([34565, 34565, 5505, 65539])
In [371]: out_arr
Out[371]: array([3, 4, 3, 4, 0, 1, 2, 5, 6])
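For comparison, since every element of myArray is itself a key of myDict, a plain per-element lookup gives the same result (a sketch, not part of the original answer; the searchsorted route avoids this Python-level loop over myArray):
In [372]: np.concatenate([np.asarray(myDict[k]) for k in myArray])
Out[372]: array([3, 4, 3, 4, 0, 1, 2, 5, 6])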

How to sum up every column of a Scala array?

If I have an array of array (similar to a matrix) in Scala, what's the efficient way to sum up each column of the matrix? For example, if my array of array is like below:
val arr = Array(Array(1, 100, ...), Array(2, 200, ...), Array(3, 300, ...))
and I want to sum up each column (e.g., sum up the first element of all sub-arrays, sum up the second element of all sub-arrays, etc.) and get a new array like below:
newArr = Array(6, 600, ...)
How can I do this efficiently in Spark Scala?
There is a suitable .transpose method on List that can help here, although I can't say what its efficiency is like:
arr.toList.transpose.map(_.sum)
(then call .toArray if you specifically need the result as an array).
Using breeze Vector:
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.map(breeze.linalg.Vector(_)).reduce(_ + _)
res0: breeze.linalg.Vector[Int] = DenseVector(6, 600)
If your input is sparse you may consider using breeze.linalg.SparseVector.
In practice a linear algebra vector library, as mentioned by @zero323, will often be the better choice.
If you can't use a vector library, I suggest writing a function col2sum that can sum two columns -- even if they are not the same length -- and then use Array.reduce to extend this operation to N columns. Using reduce is valid because we know that sums are not dependent on order of operations (i.e. 1+2+3 == 3+2+1 == 3+1+2 == 6) :
def col2sum(x: Array[Int], y: Array[Int]): Array[Int] = {
  x.zipAll(y, 0, 0).map(pair => pair._1 + pair._2)
}

def colsum(a: Array[Array[Int]]): Array[Int] = {
  a.reduce(col2sum)
}
val z = Array(Array(1, 2, 3, 4, 5), Array(2, 4, 6, 8, 10), Array(1, 9));
colsum(z)
--> Array[Int] = Array(4, 15, 9, 12, 15)
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300 ))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.flatten.zipWithIndex.groupBy(c => (c._2 + 1) % 2)
         .map(a => a._1 -> a._2.foldLeft(0)((sum, i) => sum + i._1))
res40: scala.collection.immutable.Map[Int,Int] = Map(1 -> 6, 0 -> 600)
Flatten the array and zipWithIndex to pair each element with its index, groupBy to split the flattened data into column groups, and foldLeft to sum each group.
