I have three arrays below: a and b combine to make a_and_b. a is multiplied by a_multiplier and b is multiplied by b_multiplier. How can I modify a_and_b so that the multipliers are applied to it?
Code:
import numpy as np
a_multiplier = 3
b_multiplier = 5
a = np.array([5,32,1,4])
b = np.array([1,5,11,3])
a_and_b = np.array([5,1,32,5,1,11,4,3])
Expected Output:
[15, 5, 96, 25, 3, 55, 12, 15]
First, learn how multiplication works:
In [187]: a = np.array([5,32,1,4])
In [188]: a*3
Out[188]: array([15, 96, 3, 12])
In [189]: b = np.array([1,5,11,3])
In [190]: b*5
Out[190]: array([ 5, 25, 55, 15])
One way to combine the two arrays:
In [191]: np.stack((a*3, b*5),axis=1)
Out[191]:
array([[15,  5],
       [96, 25],
       [ 3, 55],
       [12, 15]])
which can be easily turned into the desired 1d array:
In [192]: np.stack((a*3, b*5),axis=1).ravel()
Out[192]: array([15, 5, 96, 25, 3, 55, 12, 15])
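Since a_and_b interleaves a and b (even indices came from a, odd indices from b), another option is to apply the multipliers to a_and_b in place with strided slices. A minimal sketch, assuming the arrays from the question:
import numpy as np

a_multiplier = 3
b_multiplier = 5
a_and_b = np.array([5, 1, 32, 5, 1, 11, 4, 3])

a_and_b[0::2] *= a_multiplier  # elements that came from a
a_and_b[1::2] *= b_multiplier  # elements that came from b
print(a_and_b)                 # [15  5 96 25  3 55 12 15]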
I have the following array:
import numpy as np
single_array = np.array(
    [[ 1, 80, 80, 80],
     [ 2, 80, 80, 89],
     [ 3, 52, 50, 90],
     [ 4, 39, 34, 54],
     [ 5, 37, 47, 32],
     [ 6, 42, 42, 27],
     [ 7, 42, 52, 27],
     [ 8, 38, 33, 28],
     [ 9, 42, 37, 42]]
)
and want to create another array with all unique sums of 2 rows within this single_array so that 1+2 and 2+1 are treated as duplicates and are only included once.
First I would like to multiply each value in the 0th column by 10 (so I can identify which rows were combined), then I want to add up every pair of rows and append the sums to the new array.
Output should look like this:
double_array =
[[12 160 160 169]
 [13 132 130 170]
 [14 119 114 134]
 ...
 [98 80 70 70]]
Can I use itertools.combinations to get a 3D array with two unique combinations and then add the rows on the corresponding 3rd axis?
This
import numpy as np
from itertools import combinations
single_array = np.array(
    [[ 1, 80, 80, 80],
     [ 2, 80, 80, 89],
     [ 3, 52, 50, 90],
     [ 4, 39, 34, 54],
     [ 5, 37, 47, 32],
     [ 6, 42, 42, 27],
     [ 7, 42, 52, 27],
     [ 8, 38, 33, 28],
     [ 9, 42, 37, 42]]
)
np.vstack([single_array[i] * np.array([10, 1, 1, 1]) + single_array[j]
           for i, j in combinations(range(single_array.shape[0]), 2)])
does what you ask for in terms of specified input and output; I'm not sure if it's what you actually need. I don't think it will scale to big inputs.
A 3D array to find this sum would be ragged (first "layer" would be 9 deep, next one 8, etc.); you could maybe get around this with NaNs or masking. It also wouldn't scale that well for big inputs: you'd be allocating twice as much memory as you need, and then have to index out ragged layers to get your final output.
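For what it's worth, a fully vectorized variant that sidesteps both the Python-level loop and the ragged 3D array is to index the row pairs directly. A minimal sketch using np.triu_indices, assuming single_array as defined above:
import numpy as np

i, j = np.triu_indices(single_array.shape[0], k=1)  # all row pairs with i < j
scale = np.array([10, 1, 1, 1])                     # multiply the id column by 10
double_array = scale * single_array[i] + single_array[j]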
If you have to do this fast for big arrays, I suggest a pre-allocated output array and a for-loop with Numba:
import numpy as np
from numba import jit

@jit(nopython=True)
def unique_row_sums(a):
    n = a.shape[0]
    b = np.empty((n*(n-1)//2, a.shape[1]))
    s = np.array([10, 1, 1, 1])
    k = 0
    for i in range(n):
        for j in range(i+1, n):
            b[k] = s * a[i] + a[j]
            k += 1
    return b
In my not-too-careful testing with IPython's %timeit, this took about 4µs versus 152µs for the itertools-based version with your data, and should scale better.
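As a quick sanity check (a sketch, assuming single_array, unique_row_sums, and the imports above), the Numba version should agree with the itertools-based one:
from itertools import combinations

expected = np.vstack([single_array[i] * np.array([10, 1, 1, 1]) + single_array[j]
                      for i, j in combinations(range(single_array.shape[0]), 2)])
assert np.array_equal(unique_row_sums(single_array), expected)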
I have a large dask array containing approximately 300 million records and 3 numeric columns.
It looks roughly like this (first few records):
2345 947 23
12 234 924
9 8 0
349 276 345
etc...
I would like to add, say, 100 to all the values in the second column, so that I get the dask array below. Any ideas?
2345 1047 23
12 334 924
9 108 0
349 376 345
etc...
The easiest way might just be to switch it over to a DataFrame, do the assignment there, and then switch back to an array:
df = darr.to_dask_dataframe(columns=["a", "b", "c"])
df["b"] += 100
darr = df.to_dask_array()
darr.compute()
This also has the benefit of being fairly obvious as to what is happening.
I also took a shot at this using a generalized ufunc. I couldn't get da.apply_gufunc to work in combination with np.add.at, and I'm still working to grok ufuncs myself, so there's likely a faster or more compact way to do it, but this appears to work:
import numpy as np
import dask.array as da

darr = da.array([
    [2345, 947, 23],
    [12, 234, 924],
    [9, 8, 0],
    [349, 276, 345]])

def add_at(arr, at, val):
    np.add.at(arr, at, val)
    return arr

gufunc_add_at = da.gufunc(add_at,
                          signature="(i),(),()->(i)",
                          output_dtypes=darr.dtype,
                          vectorize=True)
gufunc_add_at(darr, 1, 100).compute()
This is a bit clunky, but it seems to work:
import dask.array as da
darr = da.array([
    [2345, 947, 23],
    [12, 234, 924],
    [9, 8, 0],
    [349, 276, 345]])

print(darr.compute())

x = darr[:, 0].reshape(4, 1).compute()
y = (darr[:, 1] + 100).reshape(4, 1).compute()
z = darr[:, 2].reshape(4, 1).compute()
t = da.stack([x, y, z], axis=1).reshape(4, 3)
t.compute()
Output:
[[2345  947   23]
 [  12  234  924]
 [   9    8    0]
 [ 349  276  345]]
array([[2345, 1047,   23],
       [  12,  334,  924],
       [   9,  108,    0],
       [ 349,  376,  345]])
This is possibly an improvement on my first answer:
import dask.array as da
from dask.array import from_array, add
from numpy import array

darr = da.array([
    [2345, 947, 23],
    [12, 234, 924],
    [9, 8, 0],
    [349, 276, 345]])
vector = from_array(array([[0],[100],[0]]))
add(darr.T, vector).T.compute()
Output:
array([[2345, 1047,   23],
       [  12,  334,  924],
       [   9,  108,    0],
       [ 349,  376,  345]])
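For what it's worth, the transposes can be avoided entirely by broadcasting a row vector against the columns instead; a minimal sketch of the same idea:
import numpy as np
import dask.array as da

darr = da.array([
    [2345, 947, 23],
    [12, 234, 924],
    [9, 8, 0],
    [349, 276, 345]])

offset = np.array([0, 100, 0])  # a (3,) vector broadcasts across rows, adding 100 to the middle column only
(darr + offset).compute()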
I'm pretty new to PySpark, with some Python experience. I'm already able to filter rows of a DataFrame and have written UDFs that calculate results from arrays in DataFrame cells, with an int or double as the result. Now I need an array as output, and after hours I haven't found a useful example.
Here is the problem:
The DataFrame has the following schema, where number is the number of entries in the arrays of the same DataFrame row:
DataFrame[number: int, code: array<string>, d1: array<double>, d2: array<double>]
Here is an example of the DataFrame called df1:
[4 ,['correct', 'correct', 'wrong', 'correct'], [33, 42, 35, 76], [12, 35, 15, 16]]
[2 ,['correct', 'wrong'], [47, 43], [13, 17]]
Now, only if there is a 'correct' in the i-th position of the code column of a DataFrame row do I want to keep the i-th position of d1 and d2. Additionally, I want a new column numberNew with the number of positions that are left over. The resulting structure and DataFrame df2 should look like this:
DataFrame[number: int, numberNew: int, code: array<string>, d1: array<double>, d2: array<double>]
[4 , 3, ['correct', 'correct', 'correct'], [33, 42, 76], [12, 35, 16]]
[2 , 1, ['correct'], [47], [13]]
Among several other things (and based on a solution that worked in plain Python), I tried the following code:
import pandas as pd

def filterDF(number, code, d1, d2):
    dataFiltered = []
    countNew = 0
    for i in range(number):
        if code[i] == 'correct':
            dataFiltered.append([d1[i], d2[i]])
            countNew += 1
    newTable = {'countNew': countNew, 'data': dataFiltered}
    newDf = pd.DataFrame(newTable)
    return newDf
from pyspark.sql.types import ArrayType
filterDFudf = sqlContext.udf.register("filterDF", filterDF, "Array<double>")
df2 = df1.select(df1.number, filterDFudf(df1.number, df1.code, df1.d1, df1.d2)).alias('dataNew')
I got a pretty long and not really helpful error message, which included the following:
TypeError: 'float' object has no attribute '__getitem__'
It would be fantastic if someone here could show me how to solve this.
As an alternative solution, you can also make use of a list comprehension in Python for your function:
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

def get_filtered_data(code, d1, d2):
    indices = [i for i, s in enumerate(code) if 'correct' in s]
    d1_ = [d1[index] for index in indices]
    d2_ = [d2[index] for index in indices]
    return [len(indices), d1_, d2_]

udf_get_filtered_data = udf(get_filtered_data, ArrayType(StringType()))
df = df.withColumn('filtered_data', udf_get_filtered_data('code', 'd1', 'd2'))
df.show() returns the following
+------+--------------------+----------------+----------------+--------------------+
|number| code| d1| d2| filtered_data|
+------+--------------------+----------------+----------------+--------------------+
| 4|[correct, correct...|[33, 42, 35, 76]|[12, 35, 15, 16]|[3, [33, 42, 76],...|
| 2| [correct, wrong]| [47, 43]| [13, 17]| [1, [47], [13]]|
+------+--------------------+----------------+----------------+--------------------+
By the way, if you use
dataFiltered.append([d1[i], d2[i]])
it will not give you the desired result you specified ([33, 42, 76] and [12, 35, 16]); rather, it will give you [33, 12], [42, 35], [76, 16].
The answer above gives you the correct results, with d1 and d2 as separate lists, as specified in the question.
You cannot return a Pandas DataFrame from a udf like this (there are other variants which support this, but they don't match your logic), and the schema doesn't match the output anyway. Redefine your function like this:
def filterDF(number, code, d1, d2):
    dataFiltered = []
    countNew = 0
    for i in range(number):
        if code[i] == 'correct':
            dataFiltered.append([d1[i], d2[i]])
            countNew += 1
    return (countNew, dataFiltered)

filterDFudf = sqlContext.udf.register(
    "filterDF", filterDF,
    "struct<countNew: long, data: array<array<long>>>"
)
Test:
df = sqlContext.createDataFrame([
    (4, ['correct', 'correct', 'wrong', 'correct'], [33, 42, 35, 76], [12, 35, 15, 16]),
    (2, ['correct', 'wrong'], [47, 43], [13, 17])
]).toDF("number", "code", "d1", "d2")
df.select(filterDFudf("number", "code", "d1", "d2")).show()
# +------------------------------+
# |filterDF(number, code, d1, d2)|
# +------------------------------+
# | [3, [[33, 12], [4...|
# | [1, [[47, 13]]]|
# +------------------------------+
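If you then want the new count back as a top-level numberNew column, as in the df2 layout from the question, you can select into the struct. A minimal sketch, assuming the df and filterDFudf defined above:
from pyspark.sql.functions import col

res = df.withColumn("res", filterDFudf("number", "code", "d1", "d2"))
df2 = res.select("number",
                 col("res.countNew").alias("numberNew"),  # pull the count out of the struct
                 "code", "d1", "d2")
df2.show()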
Imagine we have the following array of 3 arrays, covering the range 1 to 150:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10 ... 41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
[51, 52, 53, 54, 55, 56, 57, 58, 59, 60 ... 92, 93, 94, 95, 96, 97, 98, 99, 100, 107]
[71, 73, 84, 101, 102, 103, 104, 105, 106, 108 ... 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]
I want to build an array that stores in which array we find the values 1 to 150. The result must be then:
[1 1 1 ... 1 2 2 2 ... 2 3 2 3 2 ... 3 3 3 ... 3],
where each element corresponds to 1, 2, 3, ..., 150. The resulting array then gives the array membership of the elements 1 to 150. The code must work for any number of arrays (not only 3).
You can use an array comprehension. Here is an example with three vectors containing the range 1:10:
A = [1, 3, 4, 5, 7]
B = [2, 8, 9]
C = [6, 10]
Now we can write a comprehension using in with nested ternaries:
julia> [x in A ? 1 : x in B ? 2 : 3 for x in 1:10]
10-element Array{Int64,1}:
1
⋮
3
Perhaps also include a fallback error, in case a value is not found:
julia> [x in A ? 1 : x in B ? 2 : x in C ? 3 : error("not found") for x in 1:10]
10-element Array{Int64,1}:
1
⋮
3
Trade memory for search in this case: make a lookup array that records which array each value is in.
# example arrays
N = 100; A = rand(1:N, 30);
B = rand(1:N, 40);
C = rand(1:N, 35);
# record the array containing each value: A=1, B=2, C=3; not found=0
arrayin = zeros(Int32, max(maximum(A), maximum(B), maximum(C)));
arrayin[A] .= 1;
arrayin[B] .= 2;
arrayin[C] .= 3;