from snowflake.snowpark.functions import udf
import numpy as np
import math
@udf(packages=["numpy"])
def quantile_udf(x: float) -> float:
    return np.quantile(x, 0.95)
@udf(packages=["numpy"])
def mean_udf(x: float) -> float:
    return np.mean(x)
tf = df_operation.groupBy('STORE_ID').function(mean_udf("REG_SLS_U"))
tf is <function snowflake.snowpark.relational_grouped_dataframe.RelationalGroupedDataFrame.function.<locals>.<lambda>(*cols)>
How do I call the UDF on the grouped DataFrame and print the data in tf? The same code works with pandas.
Did you try tf.show()? It prints a small number of rows of your data. Alternatively, tf.collect() returns all the rows without printing them.
https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.DataFrame.show.html
https://docs.snowflake.com/en/developer-guide/snowpark/reference/python/api/snowflake.snowpark.DataFrame.collect.html
Looking at your code, it seems you would like to do something like this:
from snowflake.snowpark.functions import udf
import numpy as np
import math
@udf(packages=["numpy"])
def mean_udf(x: float) -> float:
    return np.mean(x)
df_groupby = df_operation.groupBy('STORE_ID')
tf = df_groupby.select(mean_udf("REG_SLS_U").as_("tf"))
tf is <snowflake.snowpark.dataframe.DataFrame at 0x131c1eb20>
Check out some examples here
I have this code, and my aim is to calculate the sine of my raster raised to the power of 0.8.
import os
os.chdir('D:/NOA/Soil_Erosion/test_Project/Workspace/Input_Data_LS_Factor')
import rasterio
import math
data = rasterio.open('Slope_degrees_clipped.tif')
band = data.read(1) # array of float32 with size (3297,2537)
w = band.shape[0]
print(w)
h = band.shape[1]
print(h)
dtypes =data.dtypes[0]
band_calc = math.sin(band)**0.8 # the formula I would like to calculate
However, the following error pops up:
TypeError: only size-1 arrays can be converted to Python scalars
Do you know how I should fix this?
P.S. I tried to vectorize it with np.vectorize(), but that does not work because it needs a real number.
The same error occurs when I use np.ndarray.flatten(band).
I found the solution on Geographic Information Systems:
import os
os.chdir('D:/NOA/Soil_Erosion/test_Project/Workspace/Input_Data_LS_Factor')
import rasterio
import math
data = rasterio.open('Slope_degrees_clipped.tif')
from rasterio.plot import show
show(data)
band = data.read(1) # array of float32 with size (3297,2537)
w = band.shape[0]
print(w)
h = band.shape[1]
print(h)
dtypes =data.dtypes[0]
# Calculate the sine of the raster raised to the power of 0.8
import numpy as np
band_calc2 = np.sin(band)**0.8 # the formula I would like to calculate
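A note on the line above: math.sin accepts a single number, while np.sin is applied element-wise to the whole array, which is why the error disappears. The sketch below uses a small made-up array in place of data.read(1), and it assumes the raster stores slope in degrees (the file name suggests so, but that is an assumption), so it converts to radians first.

```python
import numpy as np

# Hypothetical stand-in for band = data.read(1): slope values in degrees.
band = np.array([[0.0, 30.0], [45.0, 90.0]], dtype=np.float32)

# Assumption: the values are degrees, so convert them to radians first.
band_rad = np.deg2rad(band)

# np.sin operates on every element at once; no loop is needed.
band_calc = np.sin(band_rad) ** 0.8
print(band_calc)
```

If the raster already holds radians, drop the np.deg2rad step. Note that the loop variant in the answer converts degrees to radians while the np.sin line does not, so the two would give different results on the same input.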
"""
another way to do it
band_calc = [ [] for i in range(len(band)) ]
for i,row in enumerate(band):
    for element in row:
        band_calc[i].append(math.sin(element*math.pi/180)**0.8)
"""
I am trying to simply apply Numba's @njit (nopython mode) for speed, but I am running into errors I do not understand.
I want to declare an array of size n = 100 and, in the loop, set each array member with index i in range(0, 100) equal to r**2+5.
Why the big stack of errors from Numba?
# -*- coding: utf-8 -*-
"""
Spyder Editor
This is a temporary script file.
"""
import numpy as np
from numba import njit
n=100
r=.5
Values=np.zeros(n, dtype=np.float64)
@njit
def func(n):
    for i in range(0,n):
        Values[i]=r**2+5
    return(Values)
print(func(n))
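You could do it with a small modification to your code. Numba treats global variables as compile-time constants, so a global array such as Values cannot be written to inside a jitted function; create r and Values inside the function instead: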
import numpy as np
from numba import njit
@njit
def func(n):
    r = .5
    Values = np.zeros(n, dtype=np.float64)
    for i in range(0, n):
        Values[i] = r ** 2 + 5
    return (Values)
Or you could do it in a cleaner, more Pythonic way with a list comprehension, i.e. bulk assignment, as you called it.
@njit
def func1(n):
    vals = np.array([(0.5**2 + 5) for r in range(n)])
    return vals
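Since every element receives the same value, plain NumPy can also fill the array in one call, without a loop or a JIT. This is a minimal sketch of that alternative, not part of the answer above; func_full is a hypothetical name.

```python
import numpy as np

def func_full(n, r=0.5):
    # np.full allocates an array of length n and fills it with one value.
    return np.full(n, r ** 2 + 5, dtype=np.float64)

values = func_full(100)
print(values[:3])
```

For a constant fill like this, Numba is unlikely to bring any benefit; it pays off when the loop body depends on i or on other array elements.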
I am embarrassed by this. I would like to transform this array into a pandas DataFrame with one column, let's say called "feature", and one value, [135, 2270.24]:
array([[[135, 2270.24]]], dtype=object)
I tried this, but it raises ValueError: Must pass 2-d input
df = pd.DataFrame(C, columns = ['feature']), where C is the array.
I'm not entirely sure I follow exactly what you're asking for, but if my interpretation is correct, you're looking for something like this:
import pandas as pd
import numpy as np
# setup
val = np.array([[[135, 2270.24]]])
# logic
data = [{'feature': val[0][0]}]
df = pd.DataFrame(data)
Output df:
            feature
0  [135.0, 2270.24]
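The ValueError arises because the array has three dimensions, shape (1, 1, 2), while the DataFrame constructor expects at most two. A small sketch of the same idea that keeps the dtype=object array from the question (the name C is taken from the question):

```python
import numpy as np
import pandas as pd

# The array from the question: shape (1, 1, 2), dtype=object.
C = np.array([[[135, 2270.24]]], dtype=object)

# Drop the outermost axis; each remaining row becomes one cell of 'feature'.
df = pd.DataFrame({"feature": [row.tolist() for row in C[0]]})
print(df)
```

With dtype=object the integer 135 stays an integer; the answer's version builds the array without dtype=object, which is why its output shows 135.0.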
In PySpark, I defined a UDF as follows:
from pyspark.sql.functions import udf
from scipy.spatial.distance import cdist
def closest_point(point, points):
    """ Find closest point from a list of points. """
    return points[cdist([point], points).argmin()]
udf_closest_point = udf(closest_point)
dfC1 = dfC1.withColumn("closest", udf_closest_point(dfC1.point, dfC1.points))
And my data looks like this:
point = [0.2,0.5] or [0.1,0.6] - array of float
points = [[0,1],[1,0],[1,1],[0,0]] - array of array of float
closest = for example, '[0, 1]' - a string (one of the values
from points, converted to a string)
What should I change so that my UDF returns an array of floats instead of a string?
By default, udf() declares the return type as a string, which is why you get '[0, 1]' back. You can specify the return type of the UDF as an array of floats, ArrayType(FloatType()):
from pyspark.sql.types import ArrayType, FloatType
udf_closest_point = udf(closest_point, ArrayType(FloatType()))
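With an explicit return type, the values the function returns should be plain Python floats. As far as I know, PySpark does not convert a Python int or a NumPy scalar to the declared FloatType, so casting explicitly is a cheap safeguard. The sketch below is a variant of closest_point that does the cast; it swaps scipy's cdist for math.dist from the standard library so that it runs without Spark or SciPy, which is an illustrative substitution and not the original method.

```python
import math

def closest_point(point, points):
    """Return the entry of `points` nearest to `point` as a list of Python floats."""
    # math.dist computes the Euclidean distance between two coordinate sequences.
    nearest = min(points, key=lambda candidate: math.dist(point, candidate))
    # Cast explicitly so every coordinate is a Python float (assumed to be
    # what a declared ArrayType(FloatType()) expects).
    return [float(coordinate) for coordinate in nearest]

result = closest_point([0.1, 0.6], [[0, 1], [1, 0], [1, 1], [0, 0]])
print(result)
```

The function can then be registered exactly as in the answer, with udf(closest_point, ArrayType(FloatType())).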
I'm working on a Pandas DF question and I am having trouble converting some Pandas data into a usable format to create a Scatter Plot.
Here is the code below, please let me know what I am doing wrong and how I can correct it going forward. Honest criticism is needed as I am a beginner.
# Import Data
df = pd.read_csv(filepath + 'BaltimoreData.csv')
df = df.dropna()
print(df.head(20))
# These are two categories within the data
df.plot(df['Bachelors degree'], df['Median Income'])
# Plotting the Data
df.plot(kind = 'scatter', x = 'Bachelor degree', y = 'Median Income')
df.plot(kind = 'density')
Simply plot y against x as below, where df is your dataframe and x and y are your independent and dependent variables:
import matplotlib.pyplot as plt
import pandas
plt.scatter(x=df['Bachelors degree'], y=df['Median Income'])
plt.show()
You can also use the scatter plot built into pandas, passing the column names as strings rather than the columns themselves:
import pandas
import matplotlib.pyplot as plt
plt.style.use('ggplot')
df.plot.scatter(x='Bachelors degree', y='Median Income');
plt.show()