In PySpark, I defined a UDF as follows:
from pyspark.sql.functions import udf
from scipy.spatial.distance import cdist
def closest_point(point, points):
    """ Find closest point from a list of points. """
    return points[cdist([point], points).argmin()]
udf_closest_point = udf(closest_point)
dfC1 = dfC1.withColumn("closest", udf_closest_point(dfC1.point, dfC1.points))
And my data looks like this:
point = [0.2,0.5] or [0.1,0.6] - array of float
points = [[0,1],[1,0],[1,1],[0,0]] - array of array of float
closest = for example, '[0, 1]' - a string (one of the values from points, converted to a string)
What should I change so that my UDF returns an array of floats instead of a string?
When no return type is given, udf defaults to a string return type, which is why you get '[0, 1]'. Specify the return type of the UDF as an array of floats, ArrayType(FloatType()):
from pyspark.sql.types import ArrayType, FloatType
udf_closest_point = udf(closest_point, ArrayType(FloatType()))
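To illustrate the point about return types, here is a minimal sketch rather than the original code. It assumes NumPy in place of scipy's cdist for the distance calculation, and make_closest_point_udf is a hypothetical helper name. The function returns plain Python floats, which is what a column declared as ArrayType(FloatType()) is meant to hold.

```python
import numpy as np

def closest_point(point, points):
    """Return the entry of `points` nearest to `point` as a list of Python floats."""
    pts = np.asarray(points, dtype=float)
    # Euclidean distance from `point` to every row (NumPy stand-in for cdist)
    distances = np.linalg.norm(pts - np.asarray(point, dtype=float), axis=1)
    # Convert NumPy scalars to plain Python floats before handing them to Spark
    return [float(v) for v in pts[distances.argmin()]]

def make_closest_point_udf():
    """Hypothetical helper: wrap closest_point as a UDF (pyspark imported on call)."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, FloatType
    return udf(closest_point, ArrayType(FloatType()))

print(closest_point([0.1, 0.6], [[0, 1], [1, 0], [1, 1], [0, 0]]))
```

The plain function can be checked locally without a Spark session; only make_closest_point_udf needs pyspark installed.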
I need to query a column in a pyspark.sql.dataframe.DataFrame and create a string array from that column. I am using NumPy arrays to achieve this; however, the result I get is an array of arrays:
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
print(type(data_array),data_array)
for x in data_array:
    str = x[0]
    print(type(x))
The output I get from my first print is:
<class 'numpy.ndarray'> [['London']
['New York']
['Paris']
['Rome']
['Berlin']]
And from the second print I get:
<class 'numpy.ndarray'>
So my question: is it possible to get these values as a string array or, failing that, can I create a dynamic list to which I add the values of str from my for loop as strings?
Things I've tried:
Using asarray instead of array; as you can see, I get the same result.
data_array = list(data_array): I get a list, but it's not usable as it contains all the metadata too.
Open to suggestions and additional reading rather than full solutions.
Thanks.
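The power of posting: writing the question out led me to the answer. Appending each value to a list in the loop gives plain strings: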
import numpy as np
df = spark.read.load('parquetfiles/part-00000-e7dad738-8895-45e8-9926-39c9d677b999-c000.snappy.parquet', format='parquet')
data_array = np.asarray(df.select('name').collect())
cases = []
for x in data_array:
    name = x[0]  # each row is a one-element array; take the string out (avoids shadowing the built-in str)
    cases.append(name)
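The loop above can also be written without an explicit loop. This is a rough sketch, assuming the collected rows behave like one-element tuples; the rows list below is a hypothetical stand-in for df.select('name').collect(), so no Spark session is needed to try it.

```python
import numpy as np

# Hypothetical stand-in for df.select('name').collect(): one single-field row each
rows = [("London",), ("New York",), ("Paris",), ("Rome",), ("Berlin",)]

data_array = np.asarray(rows)       # shape (5, 1): an array of one-element rows
names = data_array[:, 0].tolist()   # take column 0, convert to a plain list of str

# Skipping NumPy entirely gives the same list
names_direct = [row[0] for row in rows]

print(names)
```

Slicing column 0 drops the inner dimension, which is what produced the "array of arrays" in the first place.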
How can I write code that shows the index at which Newdate1 and Newdate2 are located within Setups? The value for Newdate1 within Setups is the second index, which outputs 1 for result. However, the np.where function does not work. How could I do this without a for loop?
import numpy as np
Setups = np.array(['2017-09-15T07:11:00.000000000', '2017-09-15T11:25:00.000000000',
'2017-09-15T12:11:00.000000000', '2017-12-22T03:14:00.000000000',
'2017-12-22T03:26:00.000000000', '2017-12-22T03:31:00.000000000',
'2017-12-22T03:56:00.000000000'],dtype="datetime64[ns]")
Newdate1 = np.array(['2017-09-15T07:11:00.000000000'], dtype="datetime64[ns]")
Newdate2 = np.array(['2017-12-22T03:26:00.000000000'], dtype="datetime64[ns]")
result = np.where(Setups == Newdate1)
result2 = np.where(Setups == Newdate2)
Expected Output:
result: 1
result2: 4
Use np.in1d to test which elements of Setups appear in the array of dates you are searching for, then get the indices of the matches with np.where.
import numpy as np
Setups = np.array(['2017-09-15T07:11:00.000000000', '2017-09-15T11:25:00.000000000',
'2017-09-15T12:11:00.000000000', '2017-12-22T03:14:00.000000000',
'2017-12-22T03:26:00.000000000', '2017-12-22T03:31:00.000000000',
'2017-12-22T03:56:00.000000000'],dtype="datetime64[ns]")
newdates = np.array(['2017-09-15T07:11:00.000000000','2017-12-22T03:26:00.000000000'],dtype="datetime64[ns]")
print(np.where(np.in1d(Setups,newdates)))
output:
(array([0, 4]),)
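Newer NumPy releases recommend np.isin over np.in1d, so the same lookup can be sketched as below. The lowercase variable names are my own; np.flatnonzero is used here only to return a flat index array instead of the tuple that np.where gives.

```python
import numpy as np

setups = np.array(['2017-09-15T07:11:00', '2017-09-15T11:25:00',
                   '2017-09-15T12:11:00', '2017-12-22T03:14:00',
                   '2017-12-22T03:26:00', '2017-12-22T03:31:00',
                   '2017-12-22T03:56:00'], dtype="datetime64[ns]")
new_dates = np.array(['2017-09-15T07:11:00', '2017-12-22T03:26:00'],
                     dtype="datetime64[ns]")

mask = np.isin(setups, new_dates)   # True where a setup date is in new_dates
indices = np.flatnonzero(mask)      # positions of the True entries
print(indices)                      # [0 4]

# A single date can be compared directly; == broadcasts over setups
single = np.flatnonzero(setups == new_dates[1])
print(single)                       # [4]
```

Note that indices are zero-based, so the first element of setups is reported as 0.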
I have this code, and my aim is to calculate the sine of my raster raised to the power of 0.8.
import os
os.chdir('D:/NOA/Soil_Erosion/test_Project/Workspace/Input_Data_LS_Factor')
import rasterio
import math
data = rasterio.open('Slope_degrees_clipped.tif')
band = data.read(1) # array of float32 with size (3297,2537)
w = band.shape[0]
print(w)
h = band.shape[1]
print(h)
dtypes =data.dtypes[0]
band_calc = math.sin(band)**0.8 # the formula I would like to calculate
However, the following error pops up:
only size-1 arrays can be converted to Python scalars
Do you know how I can fix this?
P.S. I tried to vectorize it (np.vectorize()) but it does not work as it needs a real number.
When I use the np.ndarray.flatten(band) the same error occurs.
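The error comes from math.sin, which accepts a single number, while band is a whole 2-D array. A minimal sketch of the element-wise alternative follows; the small band array is a hypothetical stand-in for data.read(1), and the conversion to radians assumes the raster stores slope in degrees, as the file name suggests.

```python
import math
import numpy as np

# Hypothetical stand-in for data.read(1): slope angles in degrees
band = np.array([[0.0, 30.0], [45.0, 90.0]], dtype=np.float32)

# np.sin works element-wise on the whole array, but expects radians
band_calc = np.sin(np.radians(band)) ** 0.8

# Same calculation with math.sin, one element at a time, for comparison
band_loop = [[math.sin(math.radians(v)) ** 0.8 for v in row] for row in band]

print(band_calc.shape)
```

If the raster contained negative angles, the sine would be negative and raising it to 0.8 would yield NaN, so it may be worth checking the value range first.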
I found the solution on Geographic Information Systems:
import os
os.chdir('D:/NOA/Soil_Erosion/test_Project/Workspace/Input_Data_LS_Factor')
import rasterio
import math
data = rasterio.open('Slope_degrees_clipped.tif')
from rasterio.plot import show
show(data)
band = data.read(1) # array of float32 with size (3297,2537)
w = band.shape[0]
print(w)
h = band.shape[1]
print(h)
dtypes =data.dtypes[0]
# Calculate the sine of the raster raised to the power of 0.8
import numpy as np
band_calc2 = np.sin(band)**0.8 # np.sin is element-wise and expects radians; use np.sin(np.radians(band)) if the raster holds degrees
"""
another way to do it
band_calc = [ [] for i in range(len(band)) ]
for i,row in enumerate(band):
    for element in row:
        band_calc[i].append(math.sin(element*math.pi/180)**0.8)
"""
How do I convert a Python list into a NumPy array, retaining the mixed datatypes, but replacing the tuples (parentheses) with square brackets? Notice that the first three columns start off as int, float, float and the last column is a string. But in Block 3, all of them become strings!
Below is my output:
[(29606, 30.120779 , -97.309574 , 'DPCS')
(29606, 30.2312951 , -97.6918021 , 'DPCS')
(29606, 30.1682102 , -97.6160325 , 'DPCS')
(40880, 40.56634232, -83.10456486, 'RN')
(40880, 40.58765221, -83.14444627, 'RN')
(40880, 40.58286847, -83.12839945, 'RN')]
Block 2
[[29606, 30.120779, -97.309574, 'DPCS'], [29606, 30.2312951, -97.6918021, 'DPCS'], [29606, 30.1682102, -97.6160325, 'DPCS'], [40880, 40.5663423172498, -83.1045648601189, 'RN'], [40880, 40.5876522144065, -83.1444462730164, 'RN'], [40880, 40.5828684683826, -83.1283994529175, 'RN']]
Block 3
[['29606' '30.120779' '-97.309574' 'DPCS']
['29606' '30.2312951' '-97.6918021' 'DPCS']
['29606' '30.1682102' '-97.6160325' 'DPCS']
['40880' '40.5663423172498' '-83.1045648601189' 'RN']
['40880' '40.5876522144065' '-83.1444462730164' 'RN']
['40880' '40.5828684683826' '-83.1283994529175' 'RN']]
Process finished with exit code 0
The above comes from code:
import numpy
import pandas
from geopy.distance import great_circle
import utility_functions as uf
import timeit
from scipy.spatial.distance import cdist, euclidean
import itertools
from itertools import groupby
import numpy_indexed as npi
# normalization thresholds
DISTANCE_LOWER_THRESH = 0
DISTANCE_UPPER_THRESH = 50
#class for scoring and updating the matrix of scores between workers (rows) and patients (columns).
class WorkerPatientScores:
    def __init__(self, dist_weight=1):
        self.a = []
        self.a = ([(29606, 30.120779, -97.309574, 'DPCS'),
                   (29606, 30.2312951, -97.6918021, 'DPCS'),
                   (29606, 30.1682102, -97.6160325, 'DPCS'),
                   (40880, 40.5663423172498, -83.1045648601189, 'RN'),
                   (40880, 40.5876522144065, -83.1444462730164, 'RN'),
                   (40880, 40.5828684683826, -83.1283994529175, 'RN')])
        dt = numpy.dtype('int, float, float, object') # datatypes
        ndarray = numpy.array(self.a, dtype=dt)
        print(ndarray)
        ndarray2 = [[i[0], i[1], i[2], i[3]] for i in ndarray]
        print("Block 2")
        print(ndarray2)
        # Below removes previous datatypes
        ndarray3 = numpy.array(ndarray2)
        print("Block 3")
        print(ndarray3)
When I instead change the ndarray3 line to:
ndarray3 = numpy.array(ndarray2, dtype=dt)
I get the error:
ValueError: invalid literal for int() with base 10: 'DPCS'
ndarray is a valid structured array with 4 fields.
ndarray2 (misnamed) is a list of lists. You iterate on the elements (rows) of ndarray, and for each extract the field elements.
ndarray3 is built without a dtype, so NumPy falls back to the one type that can hold every element: string.
Note that self.a is a list of tuples. That's critical when creating a structured array.
alist = [(i[0], i[1], i[2], i[3]) for i in ndarray]
np.array(alist, dtype=dt)
should work. alist is a list of tuples.
ndarray.tolist() also produces that list of tuples.
np.array(..., object) works with either a list of lists or list of tuples.
Object dtype arrays have their place, but they aren't processed in the same way as structured arrays, nor in the same way as numeric arrays.
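The difference between the two kinds of array can be sketched as follows. This is only an illustration on two of the rows; the field names id, lat, lon and role are hypothetical labels, not names from the question.

```python
import numpy as np

rows = [(29606, 30.120779, -97.309574, 'DPCS'),
        (40880, 40.5663423172498, -83.1045648601189, 'RN')]

# Field names are hypothetical labels chosen for this example
dt = np.dtype([('id', 'i8'), ('lat', 'f8'), ('lon', 'f8'), ('role', 'O')])

structured = np.array(rows, dtype=dt)   # structured array: needs a list of tuples
print(structured['lat'])                # one typed field across all rows
print(structured.tolist() == rows)      # converts back to the list of tuples

as_lists = [list(r) for r in structured.tolist()]
objects = np.array(as_lists, dtype=object)  # 2-D array; each cell keeps its Python type
print(objects.shape, type(objects[0, 0]).__name__, type(objects[0, 3]).__name__)
```

The structured array is 1-D with four named fields, while the object array is 2-D with square-bracket rows, which is the layout the question asks for.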
I figured this out!
ndarray3 = numpy.array(ndarray2, dtype=object)
I was solving a problem using Python, and my code is:
from math import sin,pi
import numpy
import numpy as np
import pylab
N=20
x = np.linspace(0,1, N)
def v(x):
    return 100*sin(pi*x)
#set up initial condition
u0 = [0.0] # Boundary conditions at t= 0
for i in range(1,N):
    u0[i] = v(x[i])
I want to plot the results after updating v(x) over range(0, N). It looks simple, but it gives me this error:
Traceback (most recent call last):
  File "/home/universe/Desktop/Python/sample.py", line 13, in <module>
    u0[i] = v(x[i])
IndexError: list assignment index out of range
u0 starts with a single element, so u0[1] does not exist yet. You could change u0[i] = v(x[i]) to u0.append(v(x[i])). But it is more elegant to write:
u0 = [v(xi) for xi in x]
Indices i are bug magnets.
Since you are using numpy, I'd suggest using np.vectorize. That way you can pass the array x directly to the function and the function will return an array of the same size with the function applied on each element of the input array.
from math import sin,pi
import numpy
import numpy as np
import pylab
N=20
x = np.linspace(0,1, N)
def v(x):
    return 100*sin(pi*x)
vectorized_v = np.vectorize(v) #so that the function takes an array of x's and returns an array again
u0 = vectorized_v(x)
Out:
array([ 0.00000000e+00, 1.64594590e+01, 3.24699469e+01,
4.75947393e+01, 6.14212713e+01, 7.35723911e+01,
8.37166478e+01, 9.15773327e+01, 9.69400266e+01,
9.96584493e+01, 9.96584493e+01, 9.69400266e+01,
9.15773327e+01, 8.37166478e+01, 7.35723911e+01,
6.14212713e+01, 4.75947393e+01, 3.24699469e+01,
1.64594590e+01, 1.22464680e-14])
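As a side note, np.vectorize is documented as a convenience wrapper rather than a performance tool. For this particular function, a sketch that uses NumPy's own sin and pi avoids the wrapper altogether, since those already operate on whole arrays:

```python
import numpy as np

N = 20
x = np.linspace(0, 1, N)

# np.sin and np.pi work on the whole array at once, so no loop or wrapper is needed
u0 = 100 * np.sin(np.pi * x)

print(u0.shape)  # (20,)
```

The result is a NumPy array of the same length as x, ready to pass to a plotting call.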
u0 is a list with one element, so you can't assign values to indices that don't exist. Alternatively, use a dictionary u instead:
u = {}
u[0] = 0.0
for i in range(1,N):
    u[i] = v(x[i])