I have a Hive table which has a column (c4) of type array<bigint>. Now, I want to extract this column with Spark. So, here is the code snippet:
val query = """select c1, c2, c3, c4 from
some_table where some_condition"""
val rddHive = hiveContext.sql(query).rdd.map{ row =>
// is there any other way to extract wid_list? (String here seems not to work)
// no compile error and no runtime error
val w = if (row.isNullAt(3)) List() else row.getAs[scala.collection.mutable.WrappedArray[String]]("wid_list").toList
w
}
-> rddHive: org.apache.spark.rdd.RDD[List[String]] = MapPartitionsRDD[7] at map at <console>:32
rddHive.map(x => x(0).getClass.getSimpleName).take(1)
-> Array[String] = Array[Long]
So, I extract c4 with getAs[scala.collection.mutable.WrappedArray[String]], while the original data type is array<bigint>. However, there is no compile error or runtime error, and the data I extracted is still of bigint (Long) type. So, what happened here (why no compile error or runtime error)? What is the proper way to extract array<bigint> as List[String] in Spark?
================== additional information ==================
hiveContext.sql(query).printSchema
root
|-- c1: string (nullable = true)
|-- c2: integer (nullable = true)
|-- c3: string (nullable = true)
|-- c4: array (nullable = true)
| |-- element: long (containsNull = true)
hiveContext.sql(query).show(3)
+--------+----+----------------+--------------------+
| c1| c2| c3| c4|
+--------+----+----------------+--------------------+
| c1111| 1|5511798399.22222|[21772244666, 111...|
| c1112| 1|5511798399.88888|[11111111, 111111...|
| c1113| 2| 5555117114.3333|[77777777777, 112...|
+--------+----+----------------+--------------------+
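What happens here is type erasure: the type parameter of getAs is erased at runtime, so getAs[WrappedArray[String]] compiles down to a plain cast to WrappedArray, and the elements silently stay Long until the moment one of them is actually used as a String (at which point you would get a ClassCastException). A minimal sketch of a safe extraction, assuming the array column is really named wid_list as in the snippet above:

```scala
val rddHive = hiveContext.sql(query).rdd.map { row =>
  if (row.isNullAt(3)) List.empty[String]
  else
    // read the elements with their real runtime type (Long), then convert
    row.getAs[Seq[Long]]("wid_list").map(_.toString).toList
}
```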
Related
I'm trying to merge the data from a dataset as follows:

+---+-----------+---------------+-----------+
| id|        sms|longDescription|OtherFields|
+---+-----------+---------------+-----------+
|123| contentSms|    ContentDesc|        xxx|
|123|contentSms2|   ContentDesc2|        xxx|
|123|contentSms3|   ContentDesc3|        xxx|
|456|contentSms4|    ContentDesc|        xxx|
+---+-----------+---------------+-----------+
the sms and longDescription have the following structure:
sms:array
|----element:struct
|----content:string
|----languageId:string
The aim is to capture the rows with the same id and merge the columns sms and longDescription into one array with multiple structs (with the languageId as key):

+---+------------------------------------+---------------------------------------+-----------+
| id|                                 sms|                        longDescription|OtherFields|
+---+------------------------------------+---------------------------------------+-----------+
|123|contentSms, contentSms2, contentSms3|ContentDesc, ContentDesc2, ContentDesc3|        xxx|
|456|                         contentSms4|                            ContentDesc|        xxx|
+---+------------------------------------+---------------------------------------+-----------+
I've tried using
x = df.select("*").groupBy("id").agg( collect_list("sms"))
but the result (shown below for longDescription, collected the same way) is:
collect_list(longDescription): array (nullable = false)
| |-- element: array (containsNull = false)
| | |-- element: struct (containsNull = true)
| | | |-- content: string (nullable = true)
| | | |-- languageId: string (nullable = true)
which is an array too much, as the goal is to have an array of struct in order to have the following result:
sms: [{content: 'aze', languageId:'en-GB'},{content: 'rty', languageId:'fr-BE'},{content: 'poiu', languageId:'nl-BE'}]
You're looking for the flatten function:
x = df.groupBy("id").agg(flatten(collect_list("sms")))
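In Scala, the same idea applied to both columns might look like the sketch below (Spark 2.4+ for flatten; the first("OtherFields") call assumes that field is identical for all rows of a given id):

```scala
import org.apache.spark.sql.functions.{collect_list, first, flatten}

val merged = df.groupBy("id").agg(
  // collect_list of an array column yields array<array<struct>>; flatten removes one level
  flatten(collect_list("sms")).as("sms"),
  flatten(collect_list("longDescription")).as("longDescription"),
  first("OtherFields").as("OtherFields")
)
```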
I am new to both Scala and Spark. I am trying to transform an input read from files as Double into Float (which is safe in this application) so as to reduce the memory usage. I have been able to do that with a column of Double.
Current approach for a single element:
import org.apache.spark.sql.functions.{col, udf}
val tcast = udf((s: Double) => s.toFloat)
val myDF = Seq(
(1.0, Array(0.1, 2.1, 1.2)),
(8.0, Array(1.1, 2.1, 3.2)),
(9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").show
myDF.withColumn("timeF", tcast(col("time"))).drop("time").withColumnRenamed("timeF", "time").schema
But I am currently stuck with transforming the array of doubles to floats. Any help would be appreciated.
You can use selectExpr, like:
val myDF = Seq(
(1.0, Array(0.1, 2.1, 1.2)),
(8.0, Array(1.1, 2.1, 3.2)),
(9.0, Array(1.1, 1.1, 2.2))
).toDF("time", "crds")
myDF.printSchema()
// output:
root
|-- time: double (nullable = false)
|-- crds: array (nullable = true)
| |-- element: double (containsNull = false)
val df = myDF.selectExpr("cast(time as float) time", "cast(crds as array<float>) as crds")
df.show()
+----+---------------+
|time| crds|
+----+---------------+
| 1.0|[0.1, 2.1, 1.2]|
| 8.0|[1.1, 2.1, 3.2]|
| 9.0|[1.1, 1.1, 2.2]|
+----+---------------+
df.printSchema()
root
|-- time: float (nullable = false)
|-- crds: array (nullable = true)
| |-- element: float (containsNull = true)
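The same cast can also be expressed with the Column API instead of selectExpr; a sketch of the equivalent operation:

```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.{ArrayType, FloatType}

val df = myDF
  .withColumn("time", col("time").cast(FloatType))
  .withColumn("crds", col("crds").cast(ArrayType(FloatType)))
```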
I have a data frame:
+--------------------------------------+------------------------------------------------------------+
|item |item_codes |
+--------------------------------------+------------------------------------------------------------+
|loose fit long sleeve swim shirt women|["2237741011","1046622","1040660","7147440011","7141123011"]|
+--------------------------------------+------------------------------------------------------------+
And the schema looks like this:
root
|-- item: string (nullable = true)
|-- item_codes: string (nullable = true)
How can I convert the column item_codes from string to Array[String] in Scala?
You can remove quotes/square brackets using regexp_replace, followed by a split to generate the ArrayType column:
val df = Seq(
("abc", "[\"2237741011\",\"1046622\",\"1040660\",\"7147440011\",\"7141123011\"]")
).toDF("item", "item_codes")
df.
withColumn("item_codes", split(regexp_replace($"item_codes", """\[?\"\]?""", ""), "\\,")).
show(false)
// +----+------------------------------------------------------+
// |item|item_codes |
// +----+------------------------------------------------------+
// |abc |[2237741011, 1046622, 1040660, 7147440011, 7141123011]|
// +----+------------------------------------------------------+
You can use the split method after doing some "preprocessing"
val col_names = Seq("item", "item_codes")
val data = Seq(("loose fit long sleeve swim shirt women", """["2237741011","1046622","1040660","7147440011","7141123011"]"""))
val df = spark.createDataFrame(data).toDF(col_names: _*)
// chop off first 2 and last 2 character and split at ","
df.withColumn("item_codes", split(expr("substring(item_codes, 3, length(item_codes)-4)"), """","""")).printSchema
If your format can change, you might be more flexible using a regexp as Leo suggests: chop off everything that is not a digit or a comma, and split at ,
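Since item_codes is itself a valid JSON array, another option (Spark 2.4+, where from_json accepts an array at the root) is to parse it directly instead of doing string surgery; a sketch:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{ArrayType, StringType}

val parsed = df.withColumn("item_codes", from_json($"item_codes", ArrayType(StringType)))
```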
This is in Spark 2.1. Given this input file:
order.json:
{"id":1,"price":202.30,"userid":1}
{"id":2,"price":343.99,"userid":1}
{"id":3,"price":399.99,"userid":2}
And the following dataframes:
val order = sqlContext.read.json("order.json")
val df2 = order.select(struct("*") as 'order)
val df3 = df2.groupBy("order.userId").agg( collect_list( $"order").as("array"))
df3 has the following content:
+------+---------------------------+
|userId|array |
+------+---------------------------+
|1 |[[1,202.3,1], [2,343.99,1]]|
|2 |[[3,399.99,2]] |
+------+---------------------------+
and structure:
root
|-- userId: long (nullable = true)
|-- array: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: long (nullable = true)
| | |-- price: double (nullable = true)
| | |-- userid: long (nullable = true)
Now assuming I am given df3:
I would like to compute sum of array.price for each userId, taking advantage of having the array per userId rows.
I would add this computation in a new column in the resulting dataframe. Like if I had done df3.withColumn( "sum", lit(0)), but with lit(0) replaced by my computation.
I assumed it would be straightforward, but I am stuck on both. I didn't find any way to access the array as a whole and do the computation per row (with a foldLeft, for example).
I would like to compute sum of array.price for each userId, taking advantage of having the array
Unfortunately having an array works against you here. Neither Spark SQL nor DataFrame DSL provides tools that could be used directly to handle this task on array of an arbitrary size without decomposing (explode) first.
You can use an UDF:
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}
val totalPrice = udf((xs: Seq[Row]) => xs.map(_.getAs[Double]("price")).sum)
df3.withColumn("totalPrice", totalPrice($"array"))
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
or convert to statically typed Dataset:
df3
.as[(Long, Seq[(Long, Double, Long)])]
.map{ case (id, xs) => (id, xs, xs.map(_._2).sum) }
.toDF("userId", "array", "totalPrice").show
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
As mentioned above you decompose and aggregate:
import org.apache.spark.sql.functions.{sum, first}
df3
.withColumn("price", explode($"array.price"))
.groupBy($"userId")
.agg(sum($"price"), df3.columns.tail.map(c => first(c).alias(c)): _*)
+------+----------+--------------------+
|userId|sum(price)| array|
+------+----------+--------------------+
| 1| 546.29|[[1,202.3,1], [2,...|
| 2| 399.99| [[3,399.99,2]]|
+------+----------+--------------------+
but it is expensive and doesn't use the existing structure.
There is an ugly trick you could use:
import org.apache.spark.sql.functions.{coalesce, lit, max, size}
val totalPrice = (0 to df3.agg(max(size($"array"))).as[Int].first)
.map(i => coalesce($"array.price".getItem(i), lit(0.0)))
.foldLeft(lit(0.0))(_ + _)
df3.withColumn("totalPrice", totalPrice)
+------+--------------------+----------+
|userId| array|totalPrice|
+------+--------------------+----------+
| 1|[[1,202.3,1], [2,...| 546.29|
| 2| [[3,399.99,2]]| 399.99|
+------+--------------------+----------+
but it is more a curiosity than a real solution.
Spark 2.4.0 and above
You can now use the AGGREGATE functionality.
df3.createOrReplaceTempView("orders")
spark.sql(
"""
|SELECT
| *,
| AGGREGATE(`array`, 0.0, (accumulator, item) -> accumulator + item.price) AS totalPrice
|FROM
| orders
|""".stripMargin).show()
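From Spark 3.0 the same fold is also available in the DataFrame DSL as the aggregate function; a sketch:

```scala
import org.apache.spark.sql.functions.{aggregate, lit}

df3.withColumn(
  "totalPrice",
  // fold over the array column: start at 0.0, add each struct's price field
  aggregate($"array", lit(0.0), (acc, item) => acc + item.getField("price"))
)
```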
Let's say I loaded a JSON file into Spark 1.6 via
sqlContext.read.json("/hdfs/")
it gives me a Dataframe with following schema:
root
|-- id: array (nullable = true)
| |-- element: string (containsNull = true)
|-- checked: array (nullable = true)
| |-- element: string (containsNull = true)
|-- color: array (nullable = true)
| |-- element: string (containsNull = true)
|-- type: array (nullable = true)
| |-- element: string (containsNull = true)
The DF has only one row, with an array of all my items inside.
+--------------------+--------------------+--------------------+
| id_e| checked_e| color_e|
+--------------------+--------------------+--------------------+
|[0218797c-77a6-45...|[false, true, tru...|[null, null, null...|
+--------------------+--------------------+--------------------+
I want to have a DF with the arrays exploded into one item per line.
+--------------------+-----+-------+
| id|color|checked|
+--------------------+-----+-------+
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
|0218797c-77a6-45f...| null| false|
...
So far I achieved this by creating a temporary table from the array DF and using SQL with lateral view explode, along these lines:
val results = sqlContext.sql("""
  SELECT id, color, checked FROM temptable
  LATERAL VIEW explode(checked_e) temptable AS checked
  LATERAL VIEW explode(id_e) temptable AS id
  LATERAL VIEW explode(color_e) temptable AS color
""")
Is there any way to achieve this directly with Dataframe functions without using SQL? I know there is something like df.explode(...) but I could not get it to work with my data.
EDIT: It seems explode isn't what I really wanted in the first place.
I want a new dataframe that has each item of the arrays line by line. The explode function actually gives back way more lines than my initial dataset has.
The following solution should work.
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
val data = Seq((Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)))
val df = sqlContext.createDataFrame(data)
val udf3 = udf[Seq[(Int, Int, Int)], Seq[Int], Seq[Int], Seq[Int]]{
case (a, b, c) => (a,b, c).zipped.toSeq
}
val df3 = df.select(udf3($"_1", $"_2", $"_3").alias("udf3"))
val exploded = df3.select(explode($"udf3").alias("col3"))
exploded.withColumn("first", $"col3".getItem("_1"))
.withColumn("second", $"col3".getItem("_2"))
.withColumn("third", $"col3".getItem("_3")).show
It is more straightforward to use plain Scala code directly, and it might be more efficient too; Spark cannot help much anyway if there is only one row.
val data = Seq((Seq(1,2,3),Seq(4,5,6),Seq(7,8,9)))
val seqExploded = data.flatMap{
case (a: Seq[Int], b: Seq[Int], c: Seq[Int]) => (a, b, c).zipped.toSeq
}
val dfTheSame=sqlContext.createDataFrame(seqExploded)
dfTheSame.show
It should be as simple as this:
df.withColumn("id", explode(col("id_e")))
.withColumn("checked", explode(col("checked_e")))
.withColumn("color", explode(col("color_e")))
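One caveat: chaining explode over several columns produces a cross product of the arrays, which matches the "way more lines than my initial dataset" symptom from the EDIT. If the arrays should be paired element-wise, arrays_zip (Spark 2.4+) keeps them aligned; a sketch using the column names from the question:

```scala
import org.apache.spark.sql.functions.{arrays_zip, col, explode}

// zip the three arrays element-wise into one array of structs, then explode once
df.select(explode(arrays_zip(col("id_e"), col("checked_e"), col("color_e"))).as("z"))
  .select(
    col("z.id_e").as("id"),
    col("z.checked_e").as("checked"),
    col("z.color_e").as("color")
  )
```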