Spark explode nested JSON with Array in Scala

Let's say I loaded a JSON file into Spark 1.6 via
sqlContext.read.json("/hdfs/")
It gives me a DataFrame with the following schema:
root
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
|-- checked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- color: array (nullable = true)
| |-- element: string (containsNull = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
The DF has only one row, with an array of all my items inside.
+--------------------+--------------------+--------------------+
| id_e| checked_e| color_e|
+--------------------+--------------------+--------------------+
|[0218797c-77a6-45...|[false, true, tru...|[null, null, null...|
+--------------------+--------------------+--------------------+
I want to have a DF with the arrays exploded into one item per line.
+--------------------+-----+-------+
| id|color|checked|
+--------------------+-----+-------+
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
...
So far I have achieved this by creating a temporary table from the array DF and using SQL with lateral view explode, along these lines:
val results = sqlContext.sql("""
  SELECT id, color, checked FROM temptable
  LATERAL VIEW explode(checked_e) temptable AS checked
  LATERAL VIEW explode(id_e) temptable AS id
  LATERAL VIEW explode(color_e) temptable AS color
""")
Is there any way to achieve this directly with DataFrame functions, without using SQL? I know there is something like df.explode(...), but I could not get it to work with my data.
EDIT: It seems explode isn't what I really wanted in the first place.
I want a new DataFrame that has each item of the arrays line by line. The explode function actually gives back far more rows than my initial dataset has.
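To illustrate the edit above, here is a minimal sketch (toy data, not from the original post, assuming the question's sqlContext is in scope) of why chaining explode over several array columns multiplies the row count instead of zipping the arrays element-wise:
import sqlContext.implicits._
import org.apache.spark.sql.functions.explode

// Toy data: one row, three arrays of two elements each.
val toy = Seq((Seq("a1", "a2"), Seq("b1", "b2"), Seq("c1", "c2")))
  .toDF("id_e", "checked_e", "color_e")

// Each explode multiplies the row count by the array length, so three
// chained explodes over length-2 arrays give 2 * 2 * 2 = 8 rows
// (a Cartesian product), not the 2 zipped rows one might expect.
toy.withColumn("id", explode($"id_e"))
  .withColumn("checked", explode($"checked_e"))
  .withColumn("color", explode($"color_e"))
  .count() // 8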

The following solution should work.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val data = Seq((Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)))
val df = sqlContext.createDataFrame(data)
val udf3 = udf[Seq[(Int, Int, Int)], Seq[Int], Seq[Int], Seq[Int]] {
  case (a, b, c) => (a, b, c).zipped.toSeq
}
val df3 = df.select(udf3($"_1", $"_2", $"_3").alias("udf3"))
val exploded = df3.select(explode($"udf3").alias("col3"))
exploded.withColumn("first", $"col3".getItem("_1"))
.withColumn("second", $"col3".getItem("_2"))
.withColumn("third", $"col3".getItem("_3")).show
That said, it is more straightforward to use plain Scala code directly, and it might be more efficient too; Spark cannot help much anyway if there is only one row.
val data = Seq((Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)))
val seqExploded = data.flatMap {
  case (a: Seq[Int], b: Seq[Int], c: Seq[Int]) => (a, b, c).zipped.toSeq
}
val dfTheSame = sqlContext.createDataFrame(seqExploded)
dfTheSame.show

It should be as simple as this:
df.withColumn("id", explode(col("id_e")))
.withColumn("checked", explode(col("checked_e")))
.withColumn("color", explode(col("color_e")))

Related

How to create a new array of substrings from string array column in a spark dataframe

I have a Spark dataframe. One of the columns is an array type consisting of text strings of varying lengths. I am looking for a way to add a new column that is an array of the unique left 8 characters of those strings.
df.printSchema()
root
(...)
|-- arr_agent: array (nullable = true)
| |-- element: string (containsNull = true)
example data from column "arr_agent":
["NRCANL2AXXX", "NRCANL2A"]
["UTRONL2U", "BKRBNL2AXXX", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "REUWNL2A002", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "UTRONL2UXXX", "BKRBNL2A"]
["MQBFDEFFYYY", "MQBFDEFFZZZ", "MQBFDEFF" ]
What I need to have in the new column:
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A"]
["NRCANL2A"]
["UTRONL2U", "BKRBNL2A", "REUWNL2A", "REUWNL2N"]
["UTRONL2U", "BKRBNL2A"]
["MQBFDEFF" ]
I already tried to define a udf that does it for me.
from pyspark.sql import functions as F
from pyspark.sql import types as T
def make_list_of_unique_prefixes(text_array, prefix_length=8):
    out_arr = set(t[0:prefix_length] for t in text_array)
    return out_arr
make_list_of_unique_prefixes_udf = F.udf(lambda x,y=8: make_list_of_unique_prefixes(x,y), T.ArrayType(T.StringType()))
But then calling:
df.withColumn("arr_prefix8s", F.collect_set( make_list_of_unique_prefixes_udf(F.col("arr_agent") )))
Throws an error
AnalysisException: grouping expressions sequence is empty,
Any tips would be appreciated.
thanks
You can solve this using higher-order functions, available from Spark 2.4+: transform with substring, then array_distinct:
from pyspark.sql import functions as F
n = 8
out = df.withColumn("New",F.expr(f"array_distinct(transform(arr_agent,x->substring(x,0,{n})))"))
out.show(truncate=False)
+-----------------------------------------------------+----------------------------------------+
|arr_agent |New |
+-----------------------------------------------------+----------------------------------------+
|[NRCANL2AXXX, NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, BKRBNL2AXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[NRCANL2A] |[NRCANL2A] |
|[UTRONL2U, REUWNL2A002, BKRBNL2A, REUWNL2A, REUWNL2N]|[UTRONL2U, REUWNL2A, BKRBNL2A, REUWNL2N]|
|[UTRONL2U, UTRONL2UXXX, BKRBNL2A] |[UTRONL2U, BKRBNL2A] |
|[MQBFDEFFYYY, MQBFDEFFZZZ, MQBFDEFF] |[MQBFDEFF] |
+-----------------------------------------------------+----------------------------------------+
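For reference, a roughly equivalent sketch in Scala (Spark 2.4+), assuming the same df and column name as above:
import org.apache.spark.sql.functions.expr

val n = 8
// transform truncates each element to its first n characters,
// array_distinct then drops the duplicates that truncation creates.
val out = df.withColumn(
  "New",
  expr(s"array_distinct(transform(arr_agent, x -> substring(x, 1, $n)))"))
out.show(false)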

Merging column with array from multiple rows

I'm trying to merge the data from a dataset as follows:
id   sms          longDescription   OtherFields
123  contentSms   ContentDesc       xxx
123  contentSms2  ContentDesc2      xxx
123  contentSms3  ContentDesc3      xxx
456  contentSms4  ContentDesc       xxx
The sms and longDescription columns have the following structure:
sms: array
 |-- element: struct
 |    |-- content: string
 |    |-- languageId: string
The aim is to group the rows with the same id and merge the columns sms and longDescription each into one array containing multiple structs (with the languageId as key):
id   sms                                    longDescription                            OtherFields
123  contentSms, contentSms2, contentSms3   ContentDesc, ContentDesc2, ContentDesc3    xxx
456  contentSms4                            ContentDesc                                xxx
I've tried using
x = df.select("*").groupBy("id").agg( collect_list("sms"))
but the result is :
collect_list(longDescription): array (nullable = false)
| |-- element: array (containsNull = false)
| | |-- element: struct (containsNull = true)
| | | |-- content: string (nullable = true)
| | | |-- languageId: string (nullable = true)
which is one level of array too many; the goal is to have a single array of structs, in order to get the following result:
sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]
You're looking for the flatten function:
x = df.groupBy("id").agg(flatten(collect_list("sms")))
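A slightly fuller sketch in Scala (the functions are the same in PySpark, Spark 2.4+), assuming the column names from the question's table; flatten collapses the array-of-arrays that collect_list produces back into a single array of structs:
import org.apache.spark.sql.functions.{collect_list, first, flatten}

val merged = df.groupBy("id").agg(
  // collect_list gives array<array<struct>>; flatten turns it into array<struct>
  flatten(collect_list("sms")).alias("sms"),
  flatten(collect_list("longDescription")).alias("longDescription"),
  // keep one value per id for the remaining columns
  first("OtherFields").alias("OtherFields"))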

Creating a column of array using another column of array in a Spark Dataframe (Scala)

I am new to both Scala and Spark. I am trying to transform input read from files as Double into Float (which is safe in this application) in order to reduce memory usage. I have been able to do that with a column of Double.
Current approach for a single element:
import org.apache.spark.sql.functions.{col, udf}
val tcast = udf((s: Double) => s.toFloat)
val myDF = Seq(
(1.0, Array(0.1, 2.1, 1.2)),
(8.0, Array(1.1, 2.1, 3.2)),
(9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").show
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").schema
But I am currently stuck on transforming an array of doubles to floats. Any help would be appreciated.
You can use selectExpr, like:
val myDF = Seq(
(1.0, Array(0.1, 2.1, 1.2)),
(8.0, Array(1.1, 2.1, 3.2)),
(9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")
myDF.printSchema()
// output:
root
|-- time: double (nullable = false)
|-- crds: array (nullable = true)
| |-- element: double (containsNull = false)
val df = myDF.selectExpr("cast(time as float) time", "cast(crds as array<float>) as crds")
df.show()
+----+---------------+
|time| crds|
+----+---------------+
| 1.0|[0.1, 2.1, 1.2]|
| 8.0|[1.1, 2.1, 3.2]|
| 9.0|[1.1, 1.1, 2.2]|
+----+---------------+
df.printSchema()
root
|-- time: float (nullable = false)
|-- crds: array (nullable = true)
| |-- element: float (containsNull = true)
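The same cast can also be expressed with the DataFrame DSL instead of selectExpr; a minimal sketch, relying on Column.cast accepting a DDL-style type string:
import org.apache.spark.sql.functions.col

// Cast the scalar column and the whole array column element-wise.
val dfFloat = myDF
  .withColumn("time", col("time").cast("float"))
  .withColumn("crds", col("crds").cast("array<float>"))
dfFloat.printSchema()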

How to extract array<bigint> in hive table in spark properly?

I have a Hive table which has a column (c4) of type array<bigint>. Now, I want to extract this column with Spark. So, here is the code snippet:
val query = """select c1, c2, c3, c4 from
some_table where some_condition"""
val rddHive = hiveContext.sql(query).rdd.map { row =>
  // is there any other way to extract wid_list? (String here seems not to work)
  // no compile error and no runtime error
  val w = if (row.isNullAt(3)) List() else row.getAs[scala.collection.mutable.WrappedArray[String]]("wid_list").toList
  w
}
-> rddHive: org.apache.spark.rdd.RDD[List[String]] = MapPartitionsRDD[7] at map at <console>:32
rddHive.map(x => x(0).getClass.getSimpleName).take(1)
-> Array[String] = Array[Long]
So, I extracted c4 with getAs[scala.collection.mutable.WrappedArray[String]], while the original data type is array<bigint>. However, there is no compile error and no runtime error, and the data I extracted is still of bigint (Long) type. So, what happened here (why is there no compile error or runtime error)? What is the proper way to extract array<bigint> as List[String] in Spark?
==================add more information====================
hiveContext.sql(query).printSchema
root
|-- c1: string (nullable = true)
|-- c2: integer (nullable = true)
|-- c3: string (nullable = true)
|-- c4: array (nullable = true)
| |-- element: long (containsNull = true)
hiveContext.sql(query).show(3)
+--------+----+----------------+--------------------+
| c1| c2| c3| c4|
+--------+----+----------------+--------------------+
| c1111| 1|5511798399.22222|[21772244666, 111...|
| c1112| 1|5511798399.88888|[11111111, 111111...|
| c1113| 2| 5555117114.3333|[77777777777, 112...|
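A likely explanation, plus a minimal sketch of an explicit conversion (an editorial sketch, not an answer from the original thread, assuming the schema shown above): getAs performs an unchecked cast, so the String element type is erased at runtime and the call succeeds; the elements remain Longs and would only fail with a ClassCastException once actually used as Strings. Reading the values with their real type and converting explicitly avoids the surprise:
val rddFixed = hiveContext.sql(query).rdd.map { row =>
  // getSeq reads the array with its actual element type (Long),
  // and the explicit toString produces a genuine List[String].
  if (row.isNullAt(3)) List.empty[String]
  else row.getSeq[Long](3).map(_.toString).toList
}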

How to sum values of a struct in a nested array in a Spark dataframe?

This is in Spark 2.1. Given this input file order.json:
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And the following dataframes:
val order = sqlContext.read.json("order.json")
val df2 = order.select(struct("*") as 'order)
val df3 = df2.groupBy("order.userId").agg( collect_list( $"order").as("array"))
df3 has the following content:
+------+---------------------------+
|userId|array |
+------+---------------------------+
|1 |[[1,202.3,1], [2,343.99,1]]|
|2 |[[3,399.99,2]] |
+------+---------------------------+
and structure:
root
|-- userId: long (nullable = true)
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
Now, assuming I am given df3:
I would like to compute the sum of array.price for each userId, taking advantage of having the array per userId row.
I would add this computation as a new column in the resulting dataframe, like if I had done df3.withColumn("sum", lit(0)), but with lit(0) replaced by my computation.
I assumed it would be straightforward, but I am stuck on both points. I didn't find any way to access the array as a whole and do the computation per row (with a foldLeft, for example).
I would like to compute sum of array.price for each userId, taking advantage of having the array
Unfortunately, having an array works against you here. Neither Spark SQL nor the DataFrame DSL provides tools that could be used directly to handle this task on an array of arbitrary size without decomposing (exploding) it first.
You can use a UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("price")).sum)
df3.withColumn("totalPrice", totalPrice($"array"))
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
or convert to a statically typed Dataset:
df3
.as[(Long, Seq[(Long, Double, Long)])]
.map{ case (id, xs) => (id, xs, xs.map(_._2).sum) }
.toDF("userId", "array", "totalPrice").show
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
As mentioned above, you can decompose and aggregate:
import org.apache.spark.sql.functions.{sum, first}
df3
.withColumn("price", explode($"array.price"))
.groupBy($"userId")
.agg(sum($"price"), df3.columns.tail.map(c => first(c).alias(c)): _*)
+------+----------+--------------------+
|userId|sum(price)| array|
+------+----------+--------------------+
| 1| 546.29|[[1,202.3,1], [2,...|
| 2| 399.99| [[3,399.99,2]]|
+------+----------+--------------------+
but it is expensive and doesn't use the existing structure.
There is an ugly trick you could use:
import org.apache.spark.sql.functions.{coalesce, lit, max, size}
val totalPrice = (0 to df3.agg(max(size($"array"))).as[Int].first)
.map(i => coalesce($"array.price".getItem(i), lit(0.0)))
.foldLeft(lit(0.0))(_ + _)
df3.withColumn("totalPrice", totalPrice)
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
but it is more a curiosity than a real solution.
Spark 2.4.0 and above
You can now use the AGGREGATE functionality.
df3.createOrReplaceTempView("orders")
spark.sql(
"""
|SELECT
| *,
| AGGREGATE(`array`, 0.0, (accumulator, item) -> accumulator + item.price) AS totalPrice
|FROM
| orders
|""".stripMargin).show()
