Pyspark 2.1.0 wrapped array to array - arrays

I have a Spark (Python) dataframe with two columns: a user ID and then an array of arrays, which is represented in Spark as a wrapped array like so:
[WrappedArray(9, 10, 11, 12), WrappedArray(20, 21, 22, 23, 24, 25, 26)]
In its usual representation this would look like this:
[[9, 10, 11, 12], [20, 21, 22, 23, 24, 25, 26]]
I want to perform operations on each of the subarrays, for example take a third list and check whether any of its values is in the first sub-array, but I can't seem to find solutions for pyspark 2.0 (only Scala-specific older solutions like this and this).
How does one access (and in general work with) wrapped arrays? What is an efficient way to do what I described above?

You can treat each wrapped array as an ordinary Python list. In your example, if you want to find which elements of the second array are present in the first array, you could do something like this:
# Prepare data
data = [[10001, [9, 10, 11, 12], [20, 10, 9, 23, 24, 25, 26]],
        [10002, [8, 1, 2, 3], [49, 3, 6, 5, 6]],
       ]
rdd = sc.parallelize(data)
# Append a fourth item to each row: the elements of array2 that also occur in array1
df = rdd.map(
    lambda row: row + [
        [x for x in row[2] if x in row[1]]
    ]
).toDF(["userID", "array1", "array2", "commonElements"])
df.show()
Output:
+------+---------------+--------------------+--------------+
|userID| array1| array2|commonElements|
+------+---------------+--------------------+--------------+
| 10001|[9, 10, 11, 12]|[20, 10, 9, 23, 2...| [10, 9]|
| 10002| [8, 1, 2, 3]| [49, 3, 6, 5, 6]| [3]|
+------+---------------+--------------------+--------------+
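The comprehension above scans array1 once for every element of array2. For longer arrays a set lookup is usually cheaper. Here is a minimal plain-Python sketch of the per-row function only; the name common_elements is hypothetical, and it assumes each row is a list of [userID, array1, array2] as in the data above.

```python
def common_elements(row):
    """Return the row with a fourth item: values of row[2] that also occur in row[1]."""
    lookup = set(row[1])  # constant-time membership tests instead of scanning the list
    return row + [[x for x in row[2] if x in lookup]]


data = [
    [10001, [9, 10, 11, 12], [20, 10, 9, 23, 24, 25, 26]],
    [10002, [8, 1, 2, 3], [49, 3, 6, 5, 6]],
]
result = [common_elements(row) for row in data]
print(result[0][3])
print(result[1][3])
```

With Spark available, the same function could be passed to rdd.map(common_elements) in place of the lambda.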

Related

Formatting arrays with multiplication Numpy Python

I have the three arrays below; a and b are interleaved to make a_and_b. a is multiplied by a_multiplier and b is multiplied by b_multiplier. How can I modify a_and_b so that the multipliers are applied to it?
Code:
import numpy as np
a_multiplier = 3
b_multiplier = 5
a = np.array([5,32,1,4])
b = np.array([1,5,11,3])
a_and_b = np.array([5,1,32,5,1,11,4,3])
Expected Output:
[15, 5, 96, 25, 3, 55, 12, 15]
First, learn how to multiply an array by a scalar:
In [187]: a = np.array([5,32,1,4])
In [188]: a*3
Out[188]: array([15, 96, 3, 12])
In [189]: b = np.array([1,5,11,3])
In [190]: b*5
Out[190]: array([ 5, 25, 55, 15])
One way to combine the 2 arrays:
In [191]: np.stack((a*3, b*5),axis=1)
Out[191]:
array([[15, 5],
[96, 25],
[ 3, 55],
[12, 15]])
which can be easily turned into the desired 1d array:
In [192]: np.stack((a*3, b*5),axis=1).ravel()
Out[192]: array([15, 5, 96, 25, 3, 55, 12, 15])
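If a_and_b already exists and only that array is at hand, another option is to multiply it by an alternating pattern of the two multipliers. This is a sketch that assumes the elements strictly alternate a, b, a, b, ... as in the question; the names multipliers and scaled are made up for illustration.

```python
import numpy as np

a_multiplier = 3
b_multiplier = 5
a_and_b = np.array([5, 1, 32, 5, 1, 11, 4, 3])

# Elements alternate a, b, a, b, ... so the multipliers alternate too.
multipliers = np.tile([a_multiplier, b_multiplier], a_and_b.size // 2)
scaled = a_and_b * multipliers
print(scaled)
```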

What type of sorting is this?

I have written a simple sorting algorithm and would like to know what type it is.
It just maps the initial array's elements into an empty array with the indexes of the values of the initial elements.
my @arg = (5, 14, 12, 9, 1, 17, 3, 19, 20, 4, 6, 15, 8, 18, 7, 2, 10, 13, 11, 16);
my @out;
map { $out[$_] = $_ } @arg;
print join " ", @out;
Of course, the output array could also be compacted afterwards, since in the real world there can be holes in the indexes.
This example could also be extended to work with doubles. For that, I would suggest using other data structures (e.g. trees or linked lists).
UPDATE
Benchmarks:
             Rate uniqsort   bubble  mapping perlsort
uniqsort  82274/s       --     -29%     -87%     -90%
bubble   115925/s      41%       --     -81%     -86%
mapping  614399/s     647%     430%       --     -25%
perlsort 814352/s     890%     602%      33%       --
&uniqsort - using List::MoreUtils via uniq sort @arr;
&bubble - the basic bubble sort
&mapping - this one
&perlsort - using sort {$a<=>$b} @arr
First of all, that doesn't sort the values because it produces the following:
undef, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20
Note the extra element. It's a partially broken (no filtering of empty elements), partially specialized (no duplicates allowed) version of pigeonhole sort.
By the way,
@out[@arg] = @arg;
should be faster than
map { $out[$_] = $_ } @arg;
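For comparison, here is a minimal sketch of a pigeonhole-style sort that handles the two gaps mentioned above (duplicates and empty holes). It is written in Python rather than Perl, assumes non-negative integers, and the function name pigeonhole_sort is only illustrative.

```python
def pigeonhole_sort(values):
    """Sort non-negative integers by counting how often each value occurs."""
    if not values:
        return []
    counts = [0] * (max(values) + 1)  # one "hole" per possible value
    for v in values:
        counts[v] += 1                # duplicates just increase the count
    out = []
    for value, count in enumerate(counts):
        out.extend([value] * count)   # empty holes contribute nothing
    return out


print(pigeonhole_sort([5, 14, 12, 9, 1, 17, 3, 3, 20]))
```

Memory use grows with the largest value rather than with the number of elements, so this only pays off when the values are dense in a small range.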

Get coordinates in a 2D array? [duplicate]

This question already has answers here:
How do I get indices of N maximum values in a NumPy array?
I have this [116, 116] array, and I would like to get the coordinates/indices of the 10 maximum values present in that array.
How can I achieve that?
Thanks!
Let's create a test array arr as:
array([[ 1, 2, 141, 4, 5, 6],
[ 7, 143, 9, 10, 11, 12],
[ 13, 14, 15, 145, 17, 18],
[ 19, 20, 21, 22, 23, 24],
[ 25, 26, 27, 28, 29, 30]])
To find the coordinates of, e.g., the 3 maximum values, run:
ind = np.divmod(np.flip(np.argsort(arr, axis=None)[-3:]), arr.shape[1])
The result is a 2-tuple with row and column coordinates:
(array([2, 1, 0], dtype=int64), array([3, 1, 2], dtype=int64))
To test it, you can print indicated elements:
arr[ind]
getting:
array([145, 143, 141])
Now replace -3 with -10 and you will get the coordinates of the 10 maximum elements.
See this answer using np.argpartition.
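A minimal sketch of the np.argpartition variant, assuming the same test array arr as above. It avoids a full sort of the flattened array and orders only the n selected entries; the variable names are illustrative.

```python
import numpy as np

arr = np.array([[ 1,   2, 141,   4,  5,  6],
                [ 7, 143,   9,  10, 11, 12],
                [13,  14,  15, 145, 17, 18],
                [19,  20,  21,  22, 23, 24],
                [25,  26,  27,  28, 29, 30]])
n = 3
flat = arr.ravel()
top = np.argpartition(flat, -n)[-n:]     # flat indices of the n largest, in no particular order
top = top[np.argsort(flat[top])[::-1]]   # order just those n by value, largest first
rows, cols = np.unravel_index(top, arr.shape)
print(rows, cols)
print(arr[rows, cols])
```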

Translate VBA array into Python

USING IDLE/Python 3.5.1
May I first of all begin by saying I am a reasonably experienced programmer in VBA but am on day 2 of Python. I assure you I have conducted many searches on this question but the 30 or so documents I have read do not seem to explain my problem.
May I also please request that any answers given are properly formatted code for Python 3.5.1 rather than helpful pointers to other documentation or links?
The Problem
I am running a report and outputting results as I go. I need to store the results (presumably in an array) during this so that I can refer to them afterwards. The report (and the populating of the array) can be rerun multiple times so please bear that in mind if using concepts like 'append' when building the array. The array has dimensions [25,4] - a maximum of 25 records with four items in each.
Day X Y Z Total
1 2 3 4 9
2 3 4 5 12 ...
(Purists: The total needs to be recorded rather than calculated because of rounding.)
I could solve the problem myself if someone could translate this bit of code into Python (from VBA for illustration purposes). I do not want to import the arrays module unless it's the only way. Note: Variable l is a loop that makes the array get built twice to demonstrate that the array needs to be capable of rebuilding from scratch rather than being created just the once.
Sub sArray()
Dim a(25, 4)
For l = 1 To 2
For i = 1 To 25
For j = 1 To 4
a(i, j) = Int(100 * Rnd(1)) + 1
Debug.Print a(i, j);
Next j
Next i
Next l
End Sub
Thanks,
Tom
I am not sure I got your question correctly...
If you want to make an array (list is a better term in this case) of size [25,4], this is one way to go:
import random
a = [[int(100*random.random())+1 for j in range(4)] for i in range(25)]
>>> print(a)
[[74, 17, 36, 75],
[1, 79, 33, 90],
[58, 66, 47, 95],
[35, 40, 87, 38],
[43, 46, 34, 66],
[69, 34, 26, 49],
[56, 83, 44, 14],
[2, 44, 54, 97],
[50, 21, 39, 60],
[13, 94, 12, 48],
[36, 13, 2, 71],
[77, 44, 31, 11],
[56, 26, 30, 39],
[17, 13, 83, 84],
[54, 37, 34, 18],
[5, 54, 88, 100],
[22, 77, 70, 21],
[51, 88, 26, 97],
[69, 33, 86, 48],
[42, 66, 38, 78],
[71, 43, 96, 23],
[6, 46, 100, 29],
[32, 86, 15, 48],
[96, 84, 8, 56],
[29, 64, 69, 79]]
If you want to show that "the array needs to be capable of rebuilding from scratch rather than being created just the once" (why would you need this??), rebind the name inside the loop; each pass builds a completely new list:
for l in range(2):
    a = [[int(100*random.random())+1 for j in range(4)] for i in range(25)]
Also, the way of generating random numbers is odd (I have translated your method literally). To get the same result in Python, just use random.randint(1, 100), which returns a random integer from 1 to 100 inclusive (I think you do not want to have 0 there); change the upper bound to whatever number you like.
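If you prefer the VBA habit of dimensioning the array first and then assigning by index, that can be sketched as below. The names ROWS, COLS and build are hypothetical, and note that Python indices start at 0.

```python
import random

ROWS, COLS = 25, 4


def build():
    """Create a fresh ROWS x COLS list of lists, then fill it by index."""
    # The comprehension creates ROWS independent inner lists.
    # ([[0] * COLS] * ROWS would repeat one and the same inner list ROWS times.)
    a = [[0] * COLS for _ in range(ROWS)]
    for i in range(ROWS):
        for j in range(COLS):
            a[i][j] = random.randint(1, 100)  # indices run from 0, not 1
    return a


for _ in range(2):  # the list is rebuilt from scratch on each pass
    a = build()
print(len(a), len(a[0]))
```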
If I have correctly understood from your comments, this is what you want:
def report(g=25):
    array = []
    for _ in range(g):
        x = random.randint(1, 10)
        y = random.randint(1, 10)
        z = random.randint(1, 10)
        total = x + y + z
        row = [x, y, z, total]
        print(row)  # output each row as it is computed
        array.append(row)
    return array  # a new list is built on every call

>>> result = report()  # prints all the rows while computing and stores the "array" in result
[8, 4, 3, 15]
[10, 7, 4, 21]
[2, 4, 5, 11]
[8, 5, 8, 21]
[9, 7, 2, 18]
[2, 2, 3, 7]
[5, 8, 6, 19]
[8, 6, 1, 15]
[7, 6, 4, 17]
[7, 2, 10, 19]
[6, 5, 9, 20]
[3, 8, 8, 19]
[9, 1, 9, 19]
[1, 7, 7, 15]
[6, 6, 2, 14]
[9, 10, 1, 20]
[4, 6, 2, 12]
[6, 1, 6, 13]
[4, 1, 3, 8]
[5, 3, 5, 13]
[7, 5, 2, 14]
[9, 5, 7, 21]
[2, 5, 8, 15]
[3, 10, 4, 17]
[5, 6, 5, 16]

Performing complicated matrix manipulation operations with cblas_sgemm in order to carry out multiplication

I have 100 3x3x3 matrices that I would like to multiply with another large matrix of size 3x5x5 (similar to convolving one image with multiple filters, but not quite).
For the sake of explanation, this is what my large matrix looks like:
>>> x = np.arange(75).reshape(3, 5, 5)
>>> x
array([[[ 0, 1, 2, 3, 4],
[ 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]],
[[25, 26, 27, 28, 29],
[30, 31, 32, 33, 34],
[35, 36, 37, 38, 39],
[40, 41, 42, 43, 44],
[45, 46, 47, 48, 49]],
[[50, 51, 52, 53, 54],
[55, 56, 57, 58, 59],
[60, 61, 62, 63, 64],
[65, 66, 67, 68, 69],
[70, 71, 72, 73, 74]]])
In memory, I assume all sub matrices in the large matrix are stored in contiguous locations (please correct me if I'm wrong). What I want to do is, from this 3x5x5 matrix, I want to extract 3 5x3 columns from each sub-matrix of the large matrix and then join them horizontally to get a 5x9 matrix (I apologise if this part is not clear, I can explain in more detail if need be). If I were using numpy, I'd do:
>>> k = np.hstack(np.vstack(x)[:, 0:3].reshape(3, 5, 3))
>>> k
array([[ 0, 1, 2, 25, 26, 27, 50, 51, 52],
[ 5, 6, 7, 30, 31, 32, 55, 56, 57],
[10, 11, 12, 35, 36, 37, 60, 61, 62],
[15, 16, 17, 40, 41, 42, 65, 66, 67],
[20, 21, 22, 45, 46, 47, 70, 71, 72]])
However, I'm not using python so I do not have any access to the numpy functions that I need in order to reshape the data blocks into a form I want to carry out multiplication... I can only directly call the cblas_sgemm function (from the BLAS library) in C, where k corresponds to input B.
Here's my call to cblas_sgemm:
cblas_sgemm( CblasRowMajor, CblasNoTrans, CblasTrans,
100, 5, 9,
1.0,
A, 9,
B, 9, // this is actually wrong, since I don't know how to specify the right parameter
0.0,
result, 5);
Basically, the ldb attribute is the offender here, because my data is not blocked the way I need it to be. I have tried different things, but I am not able to get cblas_sgemm to understand how I want it to read and understand my data.
In short, I don't know how to tell cblas_sgemm to read x like k. Is there a way I can smartly reshape my data in python before sending it to C, so that cblas_sgemm can work the way I want it to?
I will transpose k by setting CblasTrans, so during multiplication, B is 9x5. My matrix A is of shape 100x9. Hope that helps.
Any help would be appreciated. Thanks!
"In short, I don't know how to tell cblas_sgemm to read x like k."
You can't. You'll have to make a copy.
Consider k:
In [20]: k
Out[20]:
array([[ 0, 1, 2, 25, 26, 27, 50, 51, 52],
[ 5, 6, 7, 30, 31, 32, 55, 56, 57],
[10, 11, 12, 35, 36, 37, 60, 61, 62],
[15, 16, 17, 40, 41, 42, 65, 66, 67],
[20, 21, 22, 45, 46, 47, 70, 71, 72]])
In a two-dimensional array, the spacing between elements in memory (the stride) must be constant along each axis. You know from how x was created that the consecutive elements in memory are 0, 1, 2, 3, 4, ..., but the first row of k contains 0, 1, 2, 25, 26, .... There is no gap between 1 and 2 (i.e. the memory address increases by the size of one element of the array), but there is a large jump in memory between 2 and 25. No single stride can describe both steps, so you'll have to make a copy to create k.
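The stride argument can be checked directly in numpy. This is a small sketch that assumes an 8-byte integer dtype (set explicitly so the byte counts are predictable); x2 is just an illustrative name for the sliced view.

```python
import numpy as np

x = np.arange(75, dtype=np.int64).reshape(3, 5, 5)
print(x.strides)   # bytes to step along each of the three axes

x2 = x[:, :, :3]   # slicing changes the shape, not the step sizes
print(x2.strides, np.shares_memory(x, x2))

# k mixes a 1-element step and a 23-element jump within one row,
# so numpy has to build it in freshly allocated memory.
k = np.hstack(np.vstack(x)[:, 0:3].reshape(3, 5, 3))
print(k.shape, np.shares_memory(x, k))
```

The slice x2 is a view on the memory of x, while k does not share memory with x at all.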
Having said that, there is an alternative method to efficiently achieve your desired final result using a bit of reshaping (without copying) and numpy's einsum function.
Here's an example. First define x and A:
In [52]: x = np.arange(75).reshape(3, 5, 5)
In [53]: A = np.arange(90).reshape(10, 9)
Here's my understanding of what you want to achieve; A.dot(k.T) is the desired result:
In [54]: k = np.hstack(np.vstack(x)[:, 0:3].reshape(3, 5, 3))
In [55]: A.dot(k.T)
Out[55]:
array([[ 1392, 1572, 1752, 1932, 2112],
[ 3498, 4083, 4668, 5253, 5838],
[ 5604, 6594, 7584, 8574, 9564],
[ 7710, 9105, 10500, 11895, 13290],
[ 9816, 11616, 13416, 15216, 17016],
[11922, 14127, 16332, 18537, 20742],
[14028, 16638, 19248, 21858, 24468],
[16134, 19149, 22164, 25179, 28194],
[18240, 21660, 25080, 28500, 31920],
[20346, 24171, 27996, 31821, 35646]])
Here's how you can get the same result by slicing x and reshaping A:
In [56]: x2 = x[:,:,:3]
In [57]: A2 = A.reshape(-1, 3, 3)
In [58]: np.einsum('ijk,jlk', A2, x2)
Out[58]:
array([[ 1392, 1572, 1752, 1932, 2112],
[ 3498, 4083, 4668, 5253, 5838],
[ 5604, 6594, 7584, 8574, 9564],
[ 7710, 9105, 10500, 11895, 13290],
[ 9816, 11616, 13416, 15216, 17016],
[11922, 14127, 16332, 18537, 20742],
[14028, 16638, 19248, 21858, 24468],
[16134, 19149, 22164, 25179, 28194],
[18240, 21660, 25080, 28500, 31920],
[20346, 24171, 27996, 31821, 35646]])
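If the multiplication has to stay in C, the copy can be made on the Python side before the data is handed over. The sketch below assumes float32 data (cblas_sgemm is single precision) and uses a hypothetical name k_packed for the contiguous 5x9 buffer; it checks the result against the hstack/vstack construction from the question.

```python
import numpy as np

x = np.arange(75, dtype=np.float32).reshape(3, 5, 5)
A = np.arange(90, dtype=np.float32).reshape(10, 9)

# Reorder the axes to (row, sub-matrix, column) and flatten the last two.
# The reshape of the transposed view is what makes the copy.
k_packed = np.ascontiguousarray(x[:, :, :3].transpose(1, 0, 2).reshape(5, 9))

k_ref = np.hstack(np.vstack(x)[:, 0:3].reshape(3, 5, 3))
print(np.array_equal(k_packed, k_ref))
print(k_packed.flags['C_CONTIGUOUS'], k_packed.dtype)
print(A.dot(k_packed.T)[0])
```

With a packed row-major 5x9 buffer like this passed as B, ldb = 9 together with CblasTrans should match the call shown in the question.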
