GenericRowWithSchema ClassCastException in Spark 3 Scala UDF for Array data - arrays

I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (in parquet, but shown in a JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
case class:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
.select("conditions")
.show(2)
Tried the following UDF function.
UDF code:
def updateArray = udf((arr: Seq[MyItem]) => {
for (i <- 0 to arr.size - 1) {
// Line 3
val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
val a = arr(i)
println(a.getAs[MyItem](0))
// TODO: How to make code = "XXXX" here
// a.code = "XXXX"
}
arr
})
Goal:
I need to set 'code' field value in each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
Also, I get the following error if I remove line 3 in the UDF (the cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: How to capture Array of Structs in a function and how to return a modified array of items?

Welcome to Stackoverflow!
There is a small json linting error in your data: I assumed that you wanted to close the [] square brackets of the list array. So, for this example I used the following data (which is the same as yours):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map{
case (MyClass(MyList(list))) => MyClass(MyList(list.map{
case (MyElement(item)) => MyElement(MyItem(code = "XXXX", item.category))
}))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you see, we're doing some simple pattern matching on the case classes we've defined and replacing the value of every code field with "XXXX". If you want to get JSON back, you can call the to_json function (from org.apache.spark.sql.functions) like so:
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally a very small remark about the data. If you have any control over how the data gets made, I would add the following suggestions:
The conditions JSON object seems to have no function here, since it just contains a single array called list. Consider making the conditions object the array itself, which would allow you to discard the list name. That would simplify your structure.
The element object does nothing except contain a single item. Consider removing one level of nesting there too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need of the MyElement and the MyList case classes! But very often we're not in control over what data we receive so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map{
case MyClass(conditions) => MyClass(conditions.map{
item => MyItem("XXXX", item.category)
})
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+

I was able to find a simpler solution with Spark 3.1+, using the withField method on Column that was added in that version.
Updated code:
val data = Seq(
MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)
import spark.implicits._
val ds = data.toDF()
val updatedDS = ds.withColumn(
"conditions",
transform(
col("conditions"),
x => x.withField("code", updateArray(x.getField("code")))))
updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
if(oldVal.contains("1234"))
"XXX"
else
oldVal
})
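For completeness, here is a minimal sketch of the UDF route the question originally asked about. As the ClassCastException shows, Spark hands each struct to a Scala UDF as a Row rather than as MyItem, so the sketch accepts Seq[Row] and rebuilds the items. The names MaskCodes and maskCodes are made up for illustration, and the case classes are assumed to be defined at the top level, as in the question.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])

object MaskCodes {
  // Struct elements arrive as Row, so take Seq[Row] and build new MyItem values.
  // Returning case classes lets Spark derive the array-of-struct result type.
  val maskCodes = udf((items: Seq[Row]) =>
    if (items == null) null
    else items.map(row => MyItem("XXXX", row.getAs[String]("category"))))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder().master("local[1]").appName("mask-codes").getOrCreate()
    import spark.implicits._

    val ds = Seq(MyClass(Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC")))).toDS()
    ds.withColumn("conditions", maskCodes(col("conditions"))).show(false)

    spark.stop()
  }
}
```

The array is never modified in place: the UDF returns a new sequence, which is what withColumn then writes back under the same column name.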

Related

Slicing the first row of a Dataframe into an Array[String]

import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf,SparkContext}
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.expressions.Window
import scala.runtime.ScalaRunTime.{array_apply, array_update}
import scala.collection.mutable.Map
object SimpleApp {
def main(args: Array[String]){
val conf = new SparkConf().setAppName("SimpleApp").setMaster("local")
val sc = new SparkContext(conf)
val input = "file:///home/shahid/Desktop/sample1.csv"
val hdfsOutput = "hdfs://localhost:9001/output.csv"
val localOutput = "file:///home/shahid/Desktop/output"
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").load(input)
var colLen = df.columns.length
val df1 = df.filter(!(col("_c1") === ""))
I am capturing the top row into a val named headerArr.
val headerArr = df1.head
I wanted this val to be Array[String].
println("class = "+headerArr.getClass)
What can I do to either typecast this headerArr into an Array[String] or get this top row directly into an Array[String].
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9001"), sc.hadoopConfiguration)
fs.delete(new org.apache.hadoop.fs.Path("/output.csv"),true)
df1.write.csv(hdfsOutput)
val fileTemp = new File("/home/shahid/Desktop/output/")
if (fileTemp.exists)
FileUtils.deleteDirectory(fileTemp)
df1.write.csv(localOutput)
sc.stop()
}
}
I have tried using df1.first also but both return the same type.
The result of the above code on the console is as follows :-
class = class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
Help needed.
Thank you for your time.
Given the following dataframe:
val df = spark.createDataFrame(Seq(("a", "hello"), ("b", "world"))).toDF("id", "word")
df.show()
+---+-----+
| id| word|
+---+-----+
| a|hello|
| b|world|
+---+-----+
You can get the first row as you already mentioned and then turn it into a Seq with toSeq. That Seq is backed by an array, but casting it to an array does not convert it: as the output shows, the value is still a WrappedArray. If you need a real array, call toArray, which copies the elements into an Array[Any]:
// returns: WrappedArray(a, hello), which is still not an Array
df.first.toSeq.asInstanceOf[Array[_]]
// returns: Array[Any]
df.first.toSeq.toArray
Casting is usually not good practice in a language with static typing as strong as Scala's, so you'd probably want to stick to the Seq unless you really need an Array.
Notice that thus far we have always ended up not with an array of strings but with an array of objects, since the Row object in Spark has to accommodate various types. If you want to get a collection of strings, you can iterate over the fields and extract the strings:
// returns: Vector(a, hello)
for (i <- 0 until df.first.length) yield df.first.getString(i)
This of course will cause a ClassCastException if the Row contains non-strings. Depending on your needs, you may also want to consider using a Try to silently drop non-strings within the for-comprehension:
import scala.util.Try
// same return type as before
// non-string members will be filtered out of the end result
for {
i <- 0 until df.first.length
field <- Try(df.first.getString(i)).toOption
} yield field
Until now we returned an IndexedSeq, which is suitable for efficient random access (i.e. has constant access time to any item in the collection) and in particular a Vector. Again, you may really need to return an Array. To return an Array[String] you may want to call toArray on the Vector, which unfortunately copies the whole thing.
You can skip this step and directly output an Array[String] by explicitly using flatMap instead of relying on the for-comprehension and using collection.breakOut (available in Scala 2.12 and earlier; it was removed in Scala 2.13):
// returns: Array[String] -- silently keeping strings only
0.until(df.first.length).
flatMap(i => Try(df.first.getString(i)).toOption)(collection.breakOut)
To learn more about builders and collection.breakOut you may want to have a read here.
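As a possible alternative that does not rely on collection.breakOut, the same two results can be sketched with plain collection methods. Here fields is a stand-in for df.first.toSeq (a Seq[Any]), and the object and method names are made up for illustration.

```scala
object RowFields {
  // Keep only the values that already are strings, dropping everything else.
  def stringsOnly(fields: Seq[Any]): Array[String] =
    fields.collect { case s: String => s }.toArray

  // Convert every value to its string form, leaving nulls as null.
  def allAsStrings(fields: Seq[Any]): Array[String] =
    fields.map(v => if (v == null) null else v.toString).toArray
}
```

With a real row this would be called as RowFields.stringsOnly(df.first.toSeq). The toArray call still copies once, which is usually negligible for a single header row.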
Well, my problem wasn't solved in the best way, but I found a workaround:
val headerArr = df1.first
var headerArray = new Array[String](colLen)
for(i <- 0 until colLen){
headerArray(i)=headerArr(i).toString
}
But still I am open for new suggestions.
For now, I take the first row as a GenericRowWithSchema and then copy its elements into an Array[String] in a loop.

Spark - Difference between array() and Array()

I was converting in the Spark Shell (1.6) a List of strings into an array like this:
val mapData = List("column1", "column2", "column3")
val values = array(mapData.map(col): _*)
The type of values is:
values: org.apache.spark.sql.Column = array(column1,column2,column3)
Everything fine, but when I start developing in Eclipse I got the error:
not found: value array
So I changed to this:
val values = Array(mapData.map(col): _*)
The problem I faced then was that the type of value now changed and the udf which was consuming it doesn't accept this new type:
values: Array[org.apache.spark.sql.Column] = Array(column1, column2,
column3)
Why am I not able to use array() in my IDE as in the shell (what import am I missing)? And why does array produce an org.apache.spark.sql.Column without the Array[] wrapper?
Edit: The udf function:
def replaceFirstMapOfArray =
udf((p: Seq[Map[String, String]], o: Seq[Map[String, String]]) =>
{
if((null != o && null !=p)){
if ( o.size == 1 ) p
else p ++ o.drop(1)
}else{
o
}
})
val mapData = List("column1", "column2", "column3")
val values = array(mapData.map(col): _*)
Here,
Array or List is a Scala collection of objects,
whereas array in array(mapData.map(col): _*) is a Spark function that creates a new column of array type from columns that share the same data type.
To use it, you need the following import:
import org.apache.spark.sql.functions.array
You can see here about the array
/**
* Creates a new array column. The input columns must all have the same data type.
* @group normal_funcs
* @since 1.4.0
*/
@scala.annotation.varargs
def array(cols: Column*): Column = withExpr {
CreateArray(cols.map(_.expr))
}
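To make the difference in types concrete, here is a small sketch that builds both values side by side. The object name ArrayVsArray is made up for illustration; the column names are the ones from the question.

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{array, col}

object ArrayVsArray {
  val names = List("column1", "column2", "column3")

  // Spark's array function: one Column whose value is an array of the inputs.
  val asColumn: Column = array(names.map(col): _*)

  // Scala's Array: an ordinary collection holding three separate Column objects.
  val asScalaArray: Array[Column] = Array(names.map(col): _*)
}
```

A UDF call such as replaceFirstMapOfArray(...) expects Column arguments, which is why only the first form can be passed to it directly.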

Converting datatypes in Spark/Scala

I have a variable in scala called a which is as below
scala> a
res17: Array[org.apache.spark.sql.Row] = Array([0_42], [big], [baller], [bitch], [shoe] ..)
It is an array of lists which contains a single word.
I would like to convert it to a single array consisting of sequence of strings like shown below
Array[Seq[String]] = Array(WrappedArray(0_42,big,baller,shoe,?,since,eluid.........
Well the reason why I am trying to create an array of single wrapped array is I want to run word2vec model in spark using MLLIB.
The fit() function in this only takes iterable string.
scala> val model = word2vec.fit(b)
<console>:41: error: inferred type arguments [String] do not conform to method fit's type parameter bounds [S <: Iterable[String]]
The sample data you're listing is not an array of lists, but an array of Rows. An array of a single WrappedArray you're trying to create also doesn't seem to serve any meaningful purpose.
If you want to create an array of all the word strings in your Array[Row] data structure, you can simply use a map like in the following:
val df = Seq(
("0_42"), ("big"), ("baller"), ("bitch"), ("shoe"), ("?"), ("since"), ("eliud"), ("win")
).toDF("word")
val a = df.rdd.collect
// a: Array[org.apache.spark.sql.Row] = Array(
// [0_42], [big], [baller], [bitch], [shoe], [?], [since], [eliud], [win]
// )
import org.apache.spark.sql.Row
val b = a.map{ case Row(w: String) => w }
// b: Array[String] = Array(0_42, big, baller, bitch, shoe, ?, since, eliud, win)
[UPDATE]
If you do want to create an array of a single WrappedArray, here's one approach:
val b = Array( a.map{ case Row(w: String) => w }.toSeq )
// b: Array[Seq[String]] = Array(WrappedArray(
// 0_42, big, baller, bitch, shoe, ?, since, eliud, win
// ))
I finally got it working by doing the following
val b = a.map{ case Row(word: String) => word }
val model = word2vec.fit( b.map(l=>Seq(l)))

Scala play api for JSON - getting Array of some case class from stringified JSON?

From our code, we call some service and get back stringified JSON as a result. The stringified JSON is of an array of "SomeItem", which just has four fields in it - 3 Longs and 1 String
Ex:
[
{"id":33,"count":40000,"someOtherCount":0,"someString":"stuffHere"},
{"id":35,"count":23000,"someOtherCount":0,"someString":"blah"},
...
]
I've been using the play API to read values out using implicit Writes / Reads. But I'm having trouble getting it to work for Arrays.
For example, I've been trying to parse the value out of the response and then convert it to an array of the SomeItem case class, but it's failing:
val sanityCheckValue: JsValue: Json.parse(response.body)
val Array[SomeItem] = Json.fromJson(sanityCheckValue)
I have
implicit val someItemReads = Json.reads[SomeItem]
But it looks like it's not working. I've tried to set up a Json.reads[Array[SomeItem]] as well, but no luck.
Should this be working? Any tips on how to get this to work?
import play.api.libs.json._
case class SomeItem(id: Long, count: Long, someOtherCount: Long, someString: String)
object SomeItem {
implicit val format = Json.format[SomeItem]
}
object PlayJson {
def main(args: Array[String]): Unit = {
val strJson =
"""
|[
| {"id":33,"count":40000,"someOtherCount":0,"someString":"stuffHere"},
| {"id":35,"count":23000,"someOtherCount":0,"someString":"blah"}
|]
""".stripMargin
val listOfSomeItems: Array[SomeItem] = Json.parse(strJson).as[Array[SomeItem]]
listOfSomeItems.foreach(println)
}
}
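Note that as[...] throws if the JSON does not match the expected shape. If the service response cannot be fully trusted, a variant using validate may be preferable. The following is a sketch only: ValidateExample and parseItems are made-up names, and Json.parse itself can still throw on malformed JSON.

```scala
import play.api.libs.json._

case class SomeItem(id: Long, count: Long, someOtherCount: Long, someString: String)

object SomeItem {
  // Macro-generated Reads; Play derives the Seq[SomeItem] reader from it.
  implicit val reads: Reads[SomeItem] = Json.reads[SomeItem]
}

object ValidateExample {
  // Returns the parsed items, or a description of the validation errors.
  def parseItems(body: String): Either[String, Seq[SomeItem]] =
    Json.parse(body).validate[Seq[SomeItem]] match {
      case JsSuccess(items, _) => Right(items)
      case JsError(errors)     => Left(errors.toString)
    }
}
```

Calling ValidateExample.parseItems(response.body) then yields either the items or an error description, without an exception for shape mismatches.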

How to use Slick's mapped tables with foreign keys?

I'm struggling with Slick's lifted embedding and mapped tables. The API feels strange to me, maybe just because it is structured in a way that's unfamiliar to me.
I want to build a Task/Todo-List. There are two entities:
Task: Each task has an optional reference to the next task. That way a linked list is built. The intention is that the user can order the tasks by priority. This order is represented by the references from task to task.
TaskList: Represents a TaskList with a label and a reference to the first Task of the list.
case class Task(id: Option[Long], title: String, nextTask: Option[Task])
case class TaskList(label: String, firstTask: Option[Task])
Now I tried to write a data access object (DAO) for these two entities.
import scala.slick.driver.H2Driver.simple._
import slick.lifted.MappedTypeMapper
implicit val session: Session = Database.threadLocalSession
val queryById = Tasks.createFinderBy( t => t.id )
def task(id: Long): Option[Task] = queryById(id).firstOption
private object Tasks extends Table[Task]("TASKS") {
def id = column[Long]("ID", O.PrimaryKey, O.AutoInc)
def title = column[String]("TITLE")
def nextTaskId = column[Option[Long]]("NEXT_TASK_ID")
def nextTask = foreignKey("NEXT_TASK_FK", nextTaskId, Tasks)(_.id)
def * = id ~ title ~ nextTask <> (Task, Task.unapply _)
}
private object TaskLists extends Table[TaskList]("TASKLISTS") {
def label = column[String]("LABEL", O.PrimaryKey)
def firstTaskId = column[Option[Long]]("FIRST_TASK_ID")
def firstTask = foreignKey("FIRST_TASK_FK", firstTaskId, Tasks)(_.id)
def * = label ~ firstTask <> (Task, Task.unapply _)
}
Unfortunately, it does not compile. The problems are in the * projection of both tables, at nextTask and firstTask respectively.
could not find implicit value for evidence parameter of type
scala.slick.lifted.TypeMapper[scala.slick.lifted.ForeignKeyQuery[SlickTaskRepository.this.Tasks.type,justf0rfun.bookmark.model.Task]]
could not find implicit value for evidence parameter of type scala.slick.lifted.TypeMapper[scala.slick.lifted.ForeignKeyQuery[SlickTaskRepository.this.Tasks.type,justf0rfun.bookmark.model.Task]]
I tried to solve that with the following TypeMapper, but that does not compile either.
implicit val taskMapper = MappedTypeMapper.base[Option[Long], Option[Task]](
option => option match {
case Some(id) => task(id)
case _ => None
},
option => option match {
case Some(task) => task.id
case _ => None
})
could not find implicit value for parameter tm: scala.slick.lifted.TypeMapper[Option[justf0rfun.bookmark.model.Task]]
not enough arguments for method base: (implicit tm: scala.slick.lifted.TypeMapper[Option[justf0rfun.bookmark.model.Task]])scala.slick.lifted.BaseTypeMapper[Option[Long]]. Unspecified value parameter tm.
Main question: How do I use Slick's lifted embedding and mapped tables the right way? How do I get this to work?
Thanks in advance.
The short answer is: Use ids instead of object references and use Slick queries to dereference ids. You can put the queries into methods for re-use.
That would make your case classes look like this:
case class Task(id: Option[Long], title: String, nextTaskId: Option[Long])
case class TaskList(label: String, firstTaskId: Option[Long])
I'll publish an article about this topic at some point and link it here.
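To illustrate the idea of dereferencing ids in a reusable method, here is a minimal sketch in plain Scala. It is not Slick code: findTask is a hypothetical stand-in for whatever query looks a task up by id (for example the task(id) method from the question), and TaskOrdering and orderedTasks are made-up names.

```scala
import scala.annotation.tailrec

case class Task(id: Option[Long], title: String, nextTaskId: Option[Long])
case class TaskList(label: String, firstTaskId: Option[Long])

object TaskOrdering {
  // Follows the chain of ids starting at firstTaskId and returns tasks in order.
  // `findTask` is a hypothetical lookup, e.g. backed by a database query.
  def orderedTasks(list: TaskList, findTask: Long => Option[Task]): List[Task] = {
    @tailrec
    def loop(nextId: Option[Long], seen: Set[Long], acc: List[Task]): List[Task] =
      nextId match {
        // Stop on an id that was already visited, to guard against cycles.
        case Some(id) if !seen(id) =>
          findTask(id) match {
            case Some(task) => loop(task.nextTaskId, seen + id, task :: acc)
            case None       => acc.reverse
          }
        case _ => acc.reverse
      }

    loop(list.firstTaskId, Set.empty, Nil)
  }
}
```

This issues one lookup per task, which is fine for short lists; for long lists, loading all tasks in a single query and following the ids in memory may be the better trade-off.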
