I am trying to use a UDTF in Snowpark, but I am not able to partition by a column.
The SQL query I want is something like this:
select mcount.result from CUSTOMER, table(map_count(name) over (partition by name)) mcount;
Here "map_count" is my JavaScript UDTF.
Below is the code snippet in Snowpark :
val session = Session.builder.configs(configs).create
val df = session.table("CUSTOMER")
val window = Window.partitionBy(col("name"))
val result = df.join(TableFunction("map_count"), col("name"))
//result.show()
Any suggestions on how to use a window PARTITION BY with a table function? Is this even supported in Snowpark?
Unfortunately, this is not currently supported in Snowpark. But we are working on it.
Today (Python version 0.8.0) it works as follows (the example calculates the median of a group/partition), i.e. it acts as a UDAF:
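from statistics import median
from snowflake.snowpark.functions import col, udtf
from snowflake.snowpark.types import *
class MyMedian:
    def __init__(self):
        # per-partition state: a new instance is created for each partition
        self.values = []
    def process(self, value: float):
        self.values.append(value)
        # no return value: emit nothing for the individual input rows
        for _ in range(0):
            yield
    def end_partition(self):
        # emit a single summary row once the whole partition has been seen
        yield ("partition_summary", median(self.values))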
output_schema = StructType([
StructField("label", StringType()),
StructField("median", FloatType())
])
my_median = udtf(
MyMedian,
output_schema=output_schema,
input_types=[FloatType()]
)
example_df = session.create_dataframe(
[["A", 2.0],
["A", 2.0],
["A", 4.0],
["B", -1.0],
["B", 0.0],
["B", 1.0]],
StructType([
StructField("Key", StringType()),
StructField("Value", FloatType())
])
)
example_df.show()
-------------------
|"KEY" |"VALUE" |
-------------------
|A |2.0 |
|A |2.0 |
|A |4.0 |
|B |-1.0 |
|B |0.0 |
|B |1.0 |
-------------------
Now the usage of my_median:
example_df.join_table_function(my_median("VALUE").over(partition_by=col("KEY")))\
.show()
------------------------------------------------
|"KEY" |"VALUE" |"LABEL" |"MEDIAN" |
------------------------------------------------
|A |NULL |partition_summary |2.0 |
|B |NULL |partition_summary |0.0 |
------------------------------------------------
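To make the call sequence concrete, here is a plain-Python sketch of the lifecycle a partitioned handler goes through: one handler instance per partition, process called once per row, end_partition called once at the end. It runs locally without Snowflake. The helper run_partitioned is hypothetical and only mimics the behaviour described above; it is not part of the Snowpark API.

```python
from itertools import groupby
from statistics import median


class MyMedian:
    """Same handler shape as above: collect values, emit one row per partition."""

    def __init__(self):
        self.values = []

    def process(self, value):
        self.values.append(value)
        return []  # emit nothing for individual input rows

    def end_partition(self):
        yield ("partition_summary", median(self.values))


def run_partitioned(rows, handler_cls):
    """Hypothetical local stand-in for .over(partition_by=...)."""
    out = []
    ordered = sorted(rows, key=lambda r: r[0])
    for key, group in groupby(ordered, key=lambda r: r[0]):
        handler = handler_cls()  # fresh state for each partition
        for _, value in group:
            out.extend((key, *row) for row in handler.process(value))
        out.extend((key, *row) for row in handler.end_partition())
    return out


rows = [("A", 2.0), ("A", 2.0), ("A", 4.0), ("B", -1.0), ("B", 0.0), ("B", 1.0)]
result = run_partitioned(rows, MyMedian)
print(result)
```

Because process emits nothing, only the rows produced by end_partition survive, which is why the output has one row per key.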
I think that for now the workaround is to invoke the function through SQL, as in the example below.
I created a dummy customer table and a dummy JavaScript table UDF.
Then I invoked it using SQL.
Once the DataFrame API supports this, the workaround will be unnecessary, and the DataFrame API will be cleaner.
import com.snowflake.snowpark.functions._
session.sql("ALTER SESSION SET QUERY_TAG='TEST_1'")
session.sql("""
CREATE OR REPLACE FUNCTION MAP_COUNT(NAME STRING) RETURNS TABLE (NUM FLOAT)
LANGUAGE JAVASCRIPT AS
$$
{
processRow: function (row, rowWriter, context) {
this.ccount = this.ccount + 1;
},
finalize: function (rowWriter, context) {
rowWriter.writeRow({NUM: this.ccount});
},
initialize: function(argumentInfo, context) {
this.ccount = 0;
}
}
$$;
""").show()
session.sql("""
CREATE OR REPLACE TABLE CUSTOMER (
CUST_ID INTEGER,
CUST_NAME TEXT
)
""").show()
session.sql("INSERT INTO CUSTOMER SELECT 1, 'John'").show()
session.sql("INSERT INTO CUSTOMER SELECT 2, 'John'").show()
session.sql("INSERT INTO CUSTOMER SELECT 3, 'John'").show()
session.sql("INSERT INTO CUSTOMER SELECT 4, 'Mary'").show()
session.sql("INSERT INTO CUSTOMER SELECT 5, 'Mary'").show()
import com.snowflake.snowpark.functions._
val df = session.table("CUSTOMER")
val window = Window.partitionBy(col("CUST_NAME"))
val res = session.sql("select CUST_NAME,NUM FROM CUSTOMER, TABLE(MAP_COUNT(CUST_NAME) OVER (PARTITION BY CUST_NAME ORDER BY CUST_NAME))")
res.show()
// Output will be
//-----------------------
//|"CUST_NAME" |"NUM" |
//-----------------------
//|Mary |2.0 |
//|John |3.0 |
I managed to extract a time series of prices from a web portal. The data arrives in JSON format, and I convert it into a pandas DataFrame.
Unfortunately, the data for the different bands comes in a text string, and I can't seem to extract it properly.
Below is the JSON data I extract
I convert it into a pandas DataFrame using this code
data = pd.DataFrame(r.json()['prices'])
and get them like this
I need to extract (for example) the data in the column ClosePrice out, so that I can do data analysis and cleansing on them.
I tried using
data['closePrice'].str.split(',', expand=True).rename(columns = lambda x: "string"+str(x+1))
but it doesn't really work.
Is there any way to either
a) when I convert the json to dataFrame, such that the prices within the closePrice, bidPrice etc are extracted in individual columns OR
b) if they were saved in the dataFrame, extract the text strings within them, such that I can extract the prices (e.g. the bid, ask and lastTraded) within the text string?
A relatively brute-force way, pieced together from other Stack Overflow answers.
# load and extract the json data
s = requests.Session()
r = s.post(url + '/session', json=data)
loc = <url>
dat1 = s.get(loc)
dat1 = pd.DataFrame(dat1.json()['prices'])
# convert the object list into individual columns
dat2 = pd.DataFrame()
dat2[['bidC','askC', 'lastP']] = pd.DataFrame(dat1.closePrice.values.tolist(), index= dat1.index)
dat2[['bidH','askH', 'lastH']] = pd.DataFrame(dat1.highPrice.values.tolist(), index= dat1.index)
dat2[['bidL','askL', 'lastL']] = pd.DataFrame(dat1.lowPrice.values.tolist(), index= dat1.index)
dat2[['bidO','askO', 'lastO']] = pd.DataFrame(dat1.openPrice.values.tolist(), index= dat1.index)
dat2['tStamp'] = pd.to_datetime(dat1.snapshotTime)
dat2['volume'] = dat1.lastTradedVolume
get the equivalent below
Use pandas.json_normalize to extract the data from the dict
import pandas as pd
data = r.json()
# print(data)
{'prices': [{'closePrice': {'ask': 1.16042, 'bid': 1.16027, 'lastTraded': None},
'highPrice': {'ask': 1.16052, 'bid': 1.16041, 'lastTraded': None},
'lastTradedVolume': 74,
'lowPrice': {'ask': 1.16038, 'bid': 1.16026, 'lastTraded': None},
'openPrice': {'ask': 1.16044, 'bid': 1.16038, 'lastTraded': None},
'snapshotTime': '2018/09/28 21:49:00',
'snapshotTimeUTC': '2018-09-28T20:49:00'}]}
df = pd.json_normalize(data['prices'])
Output:
| | lastTradedVolume | snapshotTime | snapshotTimeUTC | closePrice.ask | closePrice.bid | closePrice.lastTraded | highPrice.ask | highPrice.bid | highPrice.lastTraded | lowPrice.ask | lowPrice.bid | lowPrice.lastTraded | openPrice.ask | openPrice.bid | openPrice.lastTraded |
|---:|-------------------:|:--------------------|:--------------------|-----------------:|-----------------:|:------------------------|----------------:|----------------:|:-----------------------|---------------:|---------------:|:----------------------|----------------:|----------------:|:-----------------------|
| 0 | 74 | 2018/09/28 21:49:00 | 2018-09-28T20:49:00 | 1.16042 | 1.16027 | | 1.16052 | 1.16041 | | 1.16038 | 1.16026 | | 1.16044 | 1.16038 | |
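As a rough illustration of what the flattening amounts to, the sketch below walks each nested dict and joins the keys with a dot, which is how the column names in the output above are formed. The flatten helper is hypothetical and written only for this example; it is not pandas' implementation and ignores options such as record_path.

```python
def flatten(record, parent_key="", sep="."):
    """Recursively flatten nested dicts into a single-level dict."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


data = {"prices": [{
    "closePrice": {"ask": 1.16042, "bid": 1.16027, "lastTraded": None},
    "highPrice": {"ask": 1.16052, "bid": 1.16041, "lastTraded": None},
    "lastTradedVolume": 74,
    "lowPrice": {"ask": 1.16038, "bid": 1.16026, "lastTraded": None},
    "openPrice": {"ask": 1.16044, "bid": 1.16038, "lastTraded": None},
    "snapshotTime": "2018/09/28 21:49:00",
    "snapshotTimeUTC": "2018-09-28T20:49:00",
}]}

rows = [flatten(price) for price in data["prices"]]
print(sorted(rows[0]))
```

Each element of rows is a flat dict, so passing rows to pd.DataFrame would give one column per price component.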
This is in Spark 2.1. Given this input file:
order.json
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And the following dataframes:
val order = sqlContext.read.json("order.json")
val df2 = order.select(struct("*") as 'order)
val df3 = df2.groupBy("order.userId").agg( collect_list( $"order").as("array"))
df3 has the following content:
+------+---------------------------+
|userId|array |
+------+---------------------------+
|1 |[[1,202.3,1], [2,343.99,1]]|
|2 |[[3,399.99,2]] |
+------+---------------------------+
and structure:
root
|-- userId: long (nullable = true)
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
Now assuming I am given df3:
I would like to compute sum of array.price for each userId, taking advantage of having the array per userId rows.
I would add this computation as a new column in the resulting dataframe, as if I had done df3.withColumn( "sum", lit(0)), but with lit(0) replaced by my computation.
I assumed this would be straightforward, but I am stuck. I didn't find any way to access the array as a whole and do the computation per row (with a foldLeft, for example).
I would like to compute sum of array.price for each userId, taking advantage of having the array
Unfortunately, having an array works against you here. Neither Spark SQL nor the DataFrame DSL provides tools that can be used directly to handle this task on an array of arbitrary size without decomposing it (explode) first.
You can use a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("price")).sum)
df3.withColumn("totalPrice", totalPrice($"array"))
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
or convert to statically typed Dataset:
df3
.as[(Long, Seq[(Long, Double, Long)])]
.map{ case (id, xs) => (id, xs, xs.map(_._2).sum) }
.toDF("userId", "array", "totalPrice").show
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
As mentioned above, you can decompose and aggregate:
import org.apache.spark.sql.functions.{explode, first, sum}
df3
.withColumn("price", explode($"array.price"))
.groupBy($"userId")
.agg(sum($"price"), df3.columns.tail.map(c => first(c).alias(c)): _*)
+------+----------+--------------------+
|userId|sum(price)| array|
+------+----------+--------------------+
| 1| 546.29|[[1,202.3,1], [2,...|
| 2| 399.99| [[3,399.99,2]]|
+------+----------+--------------------+
but it is expensive and doesn't use the existing structure.
There is an ugly trick you could use:
import org.apache.spark.sql.functions.{coalesce, lit, max, size}
val totalPrice = (0 to df3.agg(max(size($"array"))).as[Int].first)
.map(i => coalesce($"array.price".getItem(i), lit(0.0)))
.foldLeft(lit(0.0))(_ + _)
df3.withColumn("totalPrice", totalPrice)
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
but it is more a curiosity than a real solution.
Spark 2.4.0 and above
You can now use the AGGREGATE functionality.
df3.createOrReplaceTempView("orders")
spark.sql(
"""
|SELECT
| *,
| AGGREGATE(`array`, CAST(0.0 AS DOUBLE), (accumulator, item) -> accumulator + item.price) AS totalPrice
|FROM
| orders
|""".stripMargin).show()
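For readers unfamiliar with the higher-order function, AGGREGATE is a left fold over the array: it starts from an initial value and applies the lambda to the accumulator and each element in turn. The sketch below shows the same fold in plain Python with functools.reduce on the sample orders; the dictionary layout and the total_price name are assumptions made for this illustration, not Spark API.

```python
from functools import reduce

# Sample data from order.json, already grouped per user as in df3.
orders_by_user = {
    1: [{"id": 1, "price": 202.30, "userid": 1},
        {"id": 2, "price": 343.99, "userid": 1}],
    2: [{"id": 3, "price": 399.99, "userid": 2}],
}


def total_price(items):
    """Left fold: start at 0.0 and add each item's price to the accumulator."""
    return reduce(lambda accumulator, item: accumulator + item["price"], items, 0.0)


totals = {user: total_price(items) for user, items in orders_by_user.items()}
print({user: round(total, 2) for user, total in totals.items()})
```

The initial value also fixes the result for an empty array, which here is 0.0.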
I use spark-shell to do the below operations.
I recently loaded a table with an array column in spark-sql.
Here is the DDL for the same:
create table test_emp_arr(
dept_id string,
dept_nm string,
emp_details Array<string>
)
the data looks something like this
+-------+-------+-------------------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+-------------------------------+
| 10|Finance|[Jon, Snow, Castle, Black, Ned]|
| 20| IT| [Ned, is, no, more]|
+-------+-------+-------------------------------+
I can query the emp_details column something like this :
sqlContext.sql("select emp_details[0] from emp_details").show
Problem
I want to query a range of elements in the collection :
Expected query to work
sqlContext.sql("select emp_details[0-2] from emp_details").show
or
sqlContext.sql("select emp_details[0:2] from emp_details").show
Expected output
+-------------------+
| emp_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
In pure Scala, if I have an array such as:
val emp_details = Array("Jon","Snow","Castle","Black")
I can get the elements in the range 0 to 2 using
emp_details.slice(0,3)
which returns
Array(Jon, Snow, Castle)
I am not able to apply the above operation of the array in spark-sql.
Thanks
Since Spark 2.4 you can use the slice function. In Python:
pyspark.sql.functions.slice(x, start, length)
Collection function: returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
...
New in version 2.4.
from pyspark.sql.functions import slice
df = spark.createDataFrame([
(10, "Finance", ["Jon", "Snow", "Castle", "Black", "Ned"]),
(20, "IT", ["Ned", "is", "no", "more"])
], ("dept_id", "dept_nm", "emp_details"))
df.select(slice("emp_details", 1, 3).alias("empt_details")).show()
+-------------------+
| empt_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
In Scala
def slice(x: Column, start: Int, length: Int): Column
Returns an array containing all the elements in x from index start (or starting from the end if start is negative) with the specified length.
import org.apache.spark.sql.functions.slice
val df = Seq(
(10, "Finance", Seq("Jon", "Snow", "Castle", "Black", "Ned")),
(20, "IT", Seq("Ned", "is", "no", "more"))
).toDF("dept_id", "dept_nm", "emp_details")
df.select(slice($"emp_details", 1, 3) as "empt_details").show
+-------------------+
| empt_details|
+-------------------+
|[Jon, Snow, Castle]|
| [Ned, is, no]|
+-------------------+
The same thing can be of course done in SQL
SELECT slice(emp_details, 1, 3) AS emp_details FROM df
Important:
Please note that, unlike Seq.slice, values are indexed from one (not zero) and the second argument is the length, not the end position.
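The indexing convention documented for slice (1-based start, counting from the end when start is negative, and a length as the last argument) can be sketched in plain Python as follows. The function spark_like_slice is a hypothetical stand-in written for this comparison; in particular, returning an empty list for an out-of-range start is an assumption about edge cases, so check the Spark documentation for your version.

```python
def spark_like_slice(xs, start, length):
    """Mimic slice(x, start, length): 1-based start, length instead of end."""
    if start == 0:
        raise ValueError("start must not be 0; indices are 1-based")
    if length < 0:
        raise ValueError("length must be non-negative")
    # Translate to a 0-based offset; negative start counts from the end.
    begin = start - 1 if start > 0 else len(xs) + start
    if begin < 0 or begin >= len(xs):
        return []  # assumed behaviour for an out-of-range start
    return xs[begin:begin + length]


names = ["Jon", "Snow", "Castle", "Black", "Ned"]
first_three = spark_like_slice(names, 1, 3)   # same elements as names[0:3]
last_two = spark_like_slice(names, -2, 2)     # same elements as names[-2:]
print(first_three, last_two)
```

So the Scala call emp_details.slice(0, 3) corresponds to slice(emp_details, 1, 3) in Spark SQL.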
Here is a solution using a user-defined function, which has the advantage of working for any slice size you want. It simply builds a UDF around the Scala built-in slice method:
import sqlContext.implicits._
import org.apache.spark.sql.functions._
val slice = udf((array : Seq[String], from : Int, to : Int) => array.slice(from,to))
Example with a sample of your data :
val df = sqlContext.sql("select array('Jon', 'Snow', 'Castle', 'Black', 'Ned') as emp_details")
df.withColumn("slice", slice($"emp_details", lit(0), lit(3))).show
Produces the expected output
+--------------------+-------------------+
| emp_details| slice|
+--------------------+-------------------+
|[Jon, Snow, Castl...|[Jon, Snow, Castle]|
+--------------------+-------------------+
You can also register the UDF in your sqlContext and use it like this
sqlContext.udf.register("slice", (array : Seq[String], from : Int, to : Int) => array.slice(from,to))
sqlContext.sql("select array('Jon','Snow','Castle','Black','Ned'),slice(array('Jon','Snow','Castle','Black','Ned'),0,3)")
You won't need lit anymore with this solution
Edit2: For those who want to avoid a udf at the expense of readability ;-)
If you really want to do it in one step, you will have to use Scala to create a lambda function returning a sequence of Column and wrap it in an array. This is a bit involved, but it's one step:
val df = List(List("Jon", "Snow", "Castle", "Black", "Ned")).toDF("emp_details")
df.withColumn("slice", array((0 until 3).map(i => $"emp_details"(i)):_*)).show(false)
+-------------------------------+-------------------+
|emp_details |slice |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+
The :_* works a bit of magic to pass a list to a so-called variadic function (array in this case, which constructs the SQL array). But I would advise against using this solution as is. Put the lambda function in a named function
def slice(from: Int, to: Int) = array((from until to).map(i => $"emp_details"(i)):_*)
for code readability. Note that in general, sticking to Column expressions (without using udf) performs better.
Edit: In order to do it in a SQL statement (as you ask in your question), following the same logic, you would generate the SQL query using Scala (not saying it's the most readable):
def sliceSql(emp_details: String, from: Int, to: Int): String = "Array(" + (from until to).map(i => emp_details + "[" + i.toString + "]").mkString(",") + ")"
val sqlQuery = "select emp_details, " + sliceSql("emp_details",0,3) + " as slice from emp_details"
sqlContext.sql(sqlQuery).show
+-------------------------------+-------------------+
|emp_details |slice |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
+-------------------------------+-------------------+
Note that you can replace until with to in order to specify the last element taken rather than the element at which the iteration stops.
You can use the function array to build a new Array out of the three values:
import org.apache.spark.sql.functions._
val input = sqlContext.sql("select emp_details from emp_details")
val arr: Column = col("emp_details")
val result = input.select(array(arr(0), arr(1), arr(2)) as "emp_details")
result.show()
// +-------------------+
// | emp_details|
// +-------------------+
// |[Jon, Snow, Castle]|
// | [Ned, is, no]|
// +-------------------+
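One difference from a true slice is worth spelling out: building the array from fixed indices always produces exactly three slots. Assuming Spark's default (non-ANSI) behaviour, where an out-of-range array index evaluates to NULL rather than raising an error, a row with fewer than three elements is padded with nulls instead of being shortened. The plain-Python sketch below illustrates that effect; get_item and fixed_index_slice are hypothetical helpers, not Spark functions.

```python
def get_item(xs, i):
    """Mimic arr(i) under the assumed non-ANSI behaviour: None when out of range."""
    return xs[i] if 0 <= i < len(xs) else None


def fixed_index_slice(xs, start, until):
    """Mimic array(arr(start), ..., arr(until - 1))."""
    return [get_item(xs, i) for i in range(start, until)]


rows = [["Jon", "Snow", "Castle", "Black", "Ned"], ["Ned", "is"]]
result = [fixed_index_slice(row, 0, 3) for row in rows]
print(result)
```

If the arrays in your data can be shorter than the requested range, the result may therefore need the nulls filtered out afterwards.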
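Use the selectExpr() and split() functions in Apache Spark.
For example (this assumes emp_details is a comma-delimited string column; each expression is passed as a separate argument):
fs.selectExpr("split(emp_details, ',')[0] as e1", "split(emp_details, ',')[1] as e2", "split(emp_details, ',')[2] as e3");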
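Here is my generic slice UDF, which supports arrays of any type. It is a little bit ugly because you need to know the element type in advance.
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
// null-safe wrapper around Seq.slice (0-based, end-exclusive)
def arraySlice(arr: Seq[AnyRef], from: Int, until: Int): Seq[AnyRef] =
  if (arr == null) null else arr.slice(from, until)
// the element type must be supplied so that the return type is known
def slice(elemType: DataType): UserDefinedFunction =
  udf(arraySlice _, ArrayType(elemType))
// UDF arguments must be Columns, so the bounds are wrapped in lit
fs.select(slice(StringType)($"emp_details", lit(1), lit(2)))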
For those of you who are stuck on Spark < 2.4 and don't have the slice function, here is a solution in PySpark (Scala would be very similar) that does not use UDFs. Instead it uses the Spark SQL functions concat_ws, substring_index, and split.
This will only work with string arrays. To make it work with arrays of other types, you will have to cast them into strings first, then cast back to the original type after you have 'sliced' the array.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
spark = (SparkSession.builder
.master('yarn')
.appName("array_slice")
.getOrCreate()
)
emp_details = [
["Jon", "Snow", "Castle", "Black", "Ned"],
["Ned", "is", "no", "more"]
]
df1 = spark.createDataFrame(
[tuple([emp]) for emp in emp_details],
["emp_details"]
)
df1.show(truncate=False)
+-------------------------------+
|emp_details |
+-------------------------------+
|[Jon, Snow, Castle, Black, Ned]|
|[Ned, is, no, more] |
+-------------------------------+
last_string = 2
df2 = (
df1
.withColumn('last_string', (F.lit(last_string)))
.withColumn('concat', F.concat_ws(" ", F.col('emp_details')))
.withColumn('slice', F.expr("substring_index(concat, ' ', last_string + 1)" ))
.withColumn('slice', F.split(F.col('slice'), ' '))
.select('emp_details', 'slice')
)
df2.show(truncate=False)
+-------------------------------+-------------------+
|emp_details |slice |
+-------------------------------+-------------------+
|[Jon, Snow, Castle, Black, Ned]|[Jon, Snow, Castle]|
|[Ned, is, no, more] |[Ned, is, no] |
+-------------------------------+-------------------+
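The same join, truncate, and split steps can be traced in plain Python, which also exposes a limitation: the trick only works when no element contains the delimiter. The helpers substring_index and string_slice below are hypothetical and only mimic the SQL functions for a positive count; they are not PySpark API.

```python
def substring_index(text, delim, count):
    """Mimic SQL substring_index for a positive count:
    everything before the count-th occurrence of delim."""
    return delim.join(text.split(delim)[:count])


def string_slice(xs, last_index, delim=" "):
    """Join the array, keep the first last_index + 1 tokens, split again."""
    joined = delim.join(xs)
    return substring_index(joined, delim, last_index + 1).split(delim)


ok = string_slice(["Jon", "Snow", "Castle", "Black", "Ned"], 2)
# Elements that contain the delimiter are split apart and the slice is wrong.
broken = string_slice(["Jon Snow", "Castle Black", "Ned"], 2)
print(ok, broken)
```

Choosing a delimiter that cannot occur in the data avoids the second case.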
Use nested split:
split(split(concat_ws(',',emp_details),concat(',',emp_details[3]))[0],',')
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> val spark=SparkSession.builder().getOrCreate()
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@1d637673
scala> val df = spark.read.json("file:///Users/gengmei/Desktop/test/test.json")
18/12/11 10:09:32 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
df: org.apache.spark.sql.DataFrame = [dept_id: bigint, dept_nm: string ... 1 more field]
scala> df.createOrReplaceTempView("raw_data")
scala> df.show()
+-------+-------+--------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+--------------------+
| 10|Finance|[Jon, Snow, Castl...|
| 20| IT| [Ned, is, no, more]|
+-------+-------+--------------------+
scala> val df2 = spark.sql(
| s"""
| |select dept_id,dept_nm,split(split(concat_ws(',',emp_details),concat(',',emp_details[3]))[0],',') as emp_details from raw_data
| """)
df2: org.apache.spark.sql.DataFrame = [dept_id: bigint, dept_nm: string ... 1 more field]
scala> df2.show()
+-------+-------+-------------------+
|dept_id|dept_nm| emp_details|
+-------+-------+-------------------+
| 10|Finance|[Jon, Snow, Castle]|
| 20| IT| [Ned, is, no]|
+-------+-------+-------------------+