Associating two arrays in an RDD by index

I have an RDD that contains two arrays for each row, RDD[(Array[Int], Array[Double])]. Within each row the two arrays have the same size n, but n differs from row to row and can be up to 200. Sample data:
(Array(1, 3, 5), Array(1.0, 1.0, 2.0))
(Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0))
(Array(2, 4), Array(1.0, 3.0))
. . .
I want to pair up the elements of those two arrays by index within each row. So the expected output is as follows:
((1,1.0), (3,1.0), (5,2.0))
((6,2.0), (3,1.0), (1,2.0), (9,1.0))
((2,1.0), (4,3.0))
This is my code:
val data = spark.sparkContext.parallelize(Seq(
  (Array(1, 3, 5), Array(1.0, 1.0, 2.0)),
  (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)),
  (Array(2, 4), Array(1.0, 3.0))
))
val pairArr = data.map { x =>
  (x._1(0), x._2(0))
}
// pairArr: Array((1,1.0), (6,2.0), (2,1.0))
This code only takes the value at the first index of each row.
Can anybody give me a direction on how to get the expected output?
Thanks.

You need to zip the two elements in each tuple:
data.map(x => x._1.zip(x._2)).collect
// res1: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))
Or with pattern matching:
data.map{ case (x, y) => x.zip(y) }.collect
// res0: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))
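Here collect is used only to show the result; the zipped rows can stay distributed. If you would rather have a flat RDD of individual (Int, Double) pairs than one array per row, a flatMap works as well (a minimal sketch, using the same data RDD as above):
data.flatMap { case (idx, vals) => idx.zip(vals) }
// an RDD[(Int, Double)]; collecting it would give
// Array((1,1.0), (3,1.0), (5,2.0), (6,2.0), (3,1.0), (1,2.0), (9,1.0), (2,1.0), (4,3.0))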

Related

scipy curve_fit with arrays TypeError: only length-1 arrays can be converted to Python scalars

I am trying to fit a curve with scipy to the energy eigenvalues calculated from a 4x4 Hamiltonian matrix. In the traceback below, "energies" is the function in which I define the Hamiltonian, "xdata" is an array defined after and outside the function that corresponds to k, and "e" is the array of energy eigenvalues that I get.
The error seems to occur at the Hamiltonian matrix; if I run the code without the curve_fit, everything works fine.
I have also tried using np.array as suggested in other questions I found here, but again it doesn't work.
If I give a single value to the curve fit, like xdata[0], the code works, but that doesn't help me much since I want the fit to use all values.
Does anyone know what the problem is? Thank you all in advance!
Traceback (most recent call last):
  File "fitest.py", line 70, in <module>
    popt, pcov = curve_fit(energies,xdata, e)#,
  File "/eb/software/Python/2.7.12-intel-2016b/lib/python2.7/site-packages/scipy/optimize/minpack.py", line 651, in curve_fit
    res = leastsq(func, p0, args=args, full_output=1, **kwargs)
  File "/eb/software/Python/2.7.12-intel-2016b/lib/python2.7/site-packages/scipy/optimize/minpack.py", line 377, in leastsq
    shape, dtype = _check_func('leastsq', 'func', func, x0, args, n)
  File "/eb/software/Python/2.7.12-intel-2016b/lib/python2.7/site-packages/scipy/optimize/minpack.py", line 26, in _check_func
    res = atleast_1d(thefunc(*((x0[:numinputs],) + args)))
  File "/eb/software/Python/2.7.12-intel-2016b/lib/python2.7/site-packages/scipy/optimize/minpack.py", line 453, in _general_function
    return function(xdata, *params) - ydata
  File "fitest.py", line 23, in energies
    [ 0.0, 0.0, 0.0, ep-2*V4*cos(kpt*d) ]],dtype=complex)
TypeError: only length-1 arrays can be converted to Python scalars
Code:
from numpy import sin, cos, array
from scipy.optimize import curve_fit
from numpy import *
from numpy.linalg import *

def energies(kpt, a=1.0, b=2.0, c=3.0, f=4.0):
    e1 = -15.0
    e2 = -10.0
    d = 1.0
    v0 = (-2.0 / d**2)
    V1 = a * v0
    V2 = b * v0
    V3 = c * v0
    V4 = d * v0
    basis = ('|S, s>', '|S,px>', '|S, py>', '|S,pz>')
    h = array([[ e1-2*V1*cos(kpt*d), -2*V2*1j*sin(kpt*d), 0.0, 0.0 ],
               [ 2*V2*1j*sin(kpt*d), e2-2*V3*cos(kpt*d), 0.0, 0.0],
               [ 0.0, 0.0, e2-2*V4*cos(kpt*d), 0.0],
               [ 0.0, 0.0, 0.0, e2-2*V4*cos(kpt*d) ]], dtype=complex)
    e, psi = eigh(h)
    return e

print energies(kpt=0.0)
k2 = 0.4 * 2 * pi / 2.05
print energies(kpt=k2)
xdata = array([0.0, k2])
print xdata
popt, pcov = curve_fit(energies, xdata, e)
print " "
print popt
print " "
Your problem has nothing to do with the fit; you run into the same problem if you simply run
print energies(xdata)
The reason for this error message is that you put the array kpt into h as an array element and then tell numpy to convert this array into a complex number. Numpy is kind enough to turn an array of length one into a scalar, which can then be converted into a complex number; this explains why you didn't get an error message with xdata[0]. You can easily reproduce the problem like this:
import numpy as np
#all fine with an array of length one
xa = np.asarray([1])
a = np.asarray([[xa, 2, 3], [4, 5, 6]])
print a
print a.astype(complex)
#can't apply dtype = complex to an array with two elements
xb = np.asarray([1, 2])
b = np.asarray([[xb, 2, 3], [4, 5, 6]])
print b
print b.astype(complex)
I don't know what you were trying to achieve with your energies function, so I can only speculate about what you were aiming at when constructing the h array. Maybe a 3D array like this?
kpt = np.asarray([1, 2, 3])
h = np.zeros(16 * len(kpt), dtype = complex).reshape(len(kpt), 4, 4)
h[:, 0, 0] = 2 * kpt + 1
h[:, 0, 1] = kpt ** 2
h[:, 3, 2] = np.sin(kpt)
print h

Create Array of Arrays with different sizes in scala [duplicate]

How do I create an array of multiple dimensions?
For example, I want an integer or double matrix, something like double[][] in Java.
I know for a fact that arrays changed in Scala 2.8 and that the old arrays are deprecated, but are there multiple ways to do it now and if yes, which is best?
Like so:
scala> Array.ofDim[Double](2, 2, 2)
res2: Array[Array[Array[Double]]] = Array(Array(Array(0.0, 0.0), Array(0.0, 0.0)), Array(Array(0.0, 0.0), Array(0.0, 0.0)))
scala> {val (x, y) = (2, 3); Array.tabulate(x, y)( (x, y) => x + y )}
res3: Array[Array[Int]] = Array(Array(0, 1, 2), Array(1, 2, 3))
The old pre-2.8 style is deprecated. The Array companion object exports the factory method ofDim:
val cube = Array.ofDim[Float](8, 8, 8)
See also: How to create and use a multi-dimensional array in Scala?
For an array of tuples rather than nested arrays:
var dd: Array[(Int, (Double, Double))] = Array((1, (0.0, 0.0)))
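Since the question title mentions different sizes: ofDim always builds a rectangular structure. For a jagged array of arrays you can construct the rows yourself, or generate them with tabulate (a minimal sketch; the row lengths here are arbitrary):
// Each inner array has its own length.
val jagged: Array[Array[Int]] = Array(Array(1, 2), Array(3, 4, 5), Array(6))
// Row i gets length i + 1, filled with zeros.
val triangle: Array[Array[Int]] = Array.tabulate(3)(i => Array.fill(i + 1)(0))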

Binning then sorting arrays in each bin but keeping their indices together

I have two arrays whose indices are related: x[0] belongs with y[0], and so on, so they need to stay aligned. I have binned the x array into two bins as shown in the code below.
x = [1,4,7,0,5]
y = [.1,.7,.6,.8,.3]
binx = [0,4,9]
index = np.digitize(x,binx)
Giving me the following:
In [1]: index
Out[1]: array([1, 2, 2, 1, 2])
So far so good. (I think)
The y array is a parameter telling me how well measured the x data point is, so .9 is better than .2, and I'm using the following code to pick out the best half of the y array:
y.sort()
ysorted = y[int(len(y) * .5):]
which gives me:
In [2]: ysorted
Out[2]: [0.6, 0.7, 0.8]
giving me the last 50% of the array. Again, this is what I want.
My question is how do I combine these two operations? From each bin, I need to get the best 50% and put these new values into a new x and new y array. Again, keeping the indices of each array organized. Or is there an easier way to do this? I hope this makes sense.
Many numpy functions have arg... variants that don't operate "by value" but rather "by index". In your case argsort does what you want:
order = np.argsort(y)
# order is an array of indices such that
# y[order] is sorted
top50 = order[len(order) // 2 :]
# x must be a numpy array for this fancy indexing; use np.asarray if it is a plain list
top50x = np.asarray(x)[top50]
# now top50x are the x corresponding 1-to-1 to the 50% best y
You should make a list of pairs from your x and y lists
It can be achieved with the zip function:
x = [1,4,7,0,5]
y = [.1,.7,.6,.8,.3]
values = list(zip(x, y))  # in Python 3, zip returns an iterator, hence the list()
values
[(1, 0.1), (4, 0.7), (7, 0.6), (0, 0.8), (5, 0.3)]
To sort such a list of pairs by a specific element of each pair you may use the sort's key parameter:
values.sort(key=lambda pair: pair[1])
values
[(1, 0.1), (5, 0.3), (7, 0.6), (4, 0.7), (0, 0.8)]
Then you may do whatever you want with this sorted list of pairs.

Multidimensional Array zip array in scala

I have two arrays like:
val one = Array(1, 2, 3, 4)
val two = Array(4, 5, 6, 7)
var three = one zip two map{case(a, b) => a * b}
That works fine.
But I have a multidimensional Array and a one-dimensional array now, like this:
val mulOne = Array(Array(1, 2, 3, 4),Array(5, 6, 7, 8),Array(9, 10, 11, 12))
val one_arr = Array(1, 2, 3, 4)
I would like to multiply them element-wise. How can I do this in Scala?
You could use:
val tmpArr = mulOne.map(_ zip one_arr).map(_.map{case(a,b) => a*b})
This would give you Array(Array(1*1, 2*2, 3*3, 4*4), Array(5*1, 6*2, 7*3, 8*4), Array(9*1, 10*2, 11*3, 12*4)).
Here mulOne.map(_ zip one_arr) zips each inner array of mulOne with the corresponding elements of one_arr, creating pairs like Array(Array((1,1), (2,2), (3,3), (4,4)), ..) (note: placeholder syntax is used). The next step, .map(_.map{case(a,b) => a*b}), multiplies the two elements of each pair, giving output like Array(Array(1, 4, 9, 16), ..).
Then you could use:
tmpArr.map(_.reduce(_ + _))
to get sum of all internal Arrays to get Array(30, 70, 110)
Try this
mulOne.map{x => (x, one_arr)}.map { case(one, two) => one zip two map{case(a, b) => a * b} }
mulOne.map{x => (x, one_arr)} => for every array inside mulOne, creates a tuple pairing it with the contents of one_arr.
.map { case(one, two) => one zip two map{case(a, b) => a * b} } then applies the same operation you performed in your first example to each tuple created in the first step.
Using a for comprehension, like this,
val prods =
  for {
    xs <- mulOne
    zx = one_arr zip xs
    (a, b) <- zx
  } yield a * b
and then prods.sum delivers the final result.
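Note that prods is a flat array of all the products, so prods.sum is the grand total across the rows (210 for this data). If you want one sum per inner array instead, keep the nesting (a small sketch using the same mulOne and one_arr):
val rowSums = for (xs <- mulOne) yield (one_arr zip xs).map { case (a, b) => a * b }.sum
// rowSums: Array(30, 70, 110)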

How to sum up every column of a Scala array?

If I have an array of arrays (similar to a matrix) in Scala, what's an efficient way to sum up each column of the matrix? For example, if my array of arrays is like below:
val arr = Array(Array(1, 100, ...), Array(2, 200, ...), Array(3, 300, ...))
and I want to sum up each column (e.g., sum up the first element of all sub-arrays, sum up the second element of all sub-arrays, etc.) and get a new array like below:
newArr = Array(6, 600, ...)
How can I do this efficiently in Spark Scala?
There is a suitable .transpose method on List that can help here, although I can't say what its efficiency is like:
arr.toList.transpose.map(_.sum)
(then call .toArray if you specifically need the result as an array).
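If you want to stay with arrays and skip the List round-trip, an index-based tabulate does the same job (a sketch that assumes all inner arrays have equal length):
val colSums = Array.tabulate(arr.head.length)(j => arr.map(_(j)).sum)
// for the sample arr: Array(6, 600)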
Using breeze Vector:
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.map(breeze.linalg.Vector(_)).reduce(_ + _)
res0: breeze.linalg.Vector[Int] = DenseVector(6, 600)
If your input is sparse you may consider using breeze.linalg.SparseVector.
In practice a linear algebra vector library, as mentioned by @zero323, will often be the better choice.
If you can't use a vector library, I suggest writing a function col2sum that can sum two columns -- even if they are not the same length -- and then use Array.reduce to extend this operation to N columns. Using reduce is valid because we know that sums are not dependent on order of operations (i.e. 1+2+3 == 3+2+1 == 3+1+2 == 6) :
def col2sum(x: Array[Int], y: Array[Int]): Array[Int] = {
  x.zipAll(y, 0, 0).map(pair => pair._1 + pair._2)
}
def colsum(a: Array[Array[Int]]): Array[Int] = {
  a.reduce(col2sum)
}
val z = Array(Array(1, 2, 3, 4, 5), Array(2, 4, 6, 8, 10), Array(1, 9))
colsum(z)
--> Array[Int] = Array(4, 15, 9, 12, 15)
scala> val arr = Array(Array(1, 100), Array(2, 200), Array(3, 300 ))
arr: Array[Array[Int]] = Array(Array(1, 100), Array(2, 200), Array(3, 300))
scala> arr.flatten.zipWithIndex.groupBy(c => (c._2 + 1) % 2)
           .map(a => a._1 -> a._2.foldLeft(0)((sum, i) => sum + i._1))
res40: scala.collection.immutable.Map[Int,Int] = Map(1 -> 6, 0 -> 600)
Flatten the array and use zipWithIndex to attach each element's index, groupBy the index modulo the row width (hard-coded to 2 here) to collect each column's values, then foldLeft to sum each column.
