Slicing the first row of a Dataframe into an Array[String] - arrays

import org.apache.spark.sql.functions.broadcast
import org.apache.spark.sql.SparkSession._
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.SparkContext._
import org.apache.spark.{SparkConf,SparkContext}
import java.io.File
import org.apache.commons.io.FileUtils
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.expressions.Window
import scala.runtime.ScalaRunTime.{array_apply, array_update}
import scala.collection.mutable.Map
object SimpleApp {
def main(args: Array[String]){
val conf = new SparkConf().setAppName("SimpleApp").setMaster("local")
val sc = new SparkContext(conf)
val input = "file:///home/shahid/Desktop/sample1.csv"
val hdfsOutput = "hdfs://localhost:9001/output.csv"
val localOutput = "file:///home/shahid/Desktop/output"
val sqlContext = new SQLContext(sc)
val df = sqlContext.read.format("com.databricks.spark.csv").load(input)
var colLen = df.columns.length
val df1 = df.filter(!(col("_c1") === ""))
I am capturing the top row into a val named headerArr.
val headerArr = df1.head
I want this val to be an Array[String].
println("class = "+headerArr.getClass)
What can I do to either cast this headerArr to an Array[String], or read the top row directly into an Array[String]?
val fs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://localhost:9001"), sc.hadoopConfiguration)
fs.delete(new org.apache.hadoop.fs.Path("/output.csv"),true)
df1.write.csv(hdfsOutput)
val fileTemp = new File("/home/shahid/Desktop/output/")
if (fileTemp.exists)
FileUtils.deleteDirectory(fileTemp)
df1.write.csv(localOutput)
sc.stop()
}
}
I have also tried df1.first, but both return the same type.
The console output of the above code is:
class = class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
Any help appreciated. Thank you for your time. xD

Given the following dataframe:
val df = spark.createDataFrame(Seq(("a", "hello"), ("b", "world"))).toDF("id", "word")
df.show()
+---+-----+
| id| word|
+---+-----+
| a|hello|
| b|world|
+---+-----+
You can get the first row as you already mentioned and then turn the result into a Seq. In Scala 2.11/2.12 this Seq is a WrappedArray, which is backed by a plain array that you can reach without copying (note that casting the Seq itself to Array[_] would throw a ClassCastException, since a WrappedArray is not an Array):
// returns: WrappedArray(a, hello)
df.first.toSeq
// the underlying Array[Any], no copy (assumes the Seq really is a WrappedArray)
df.first.toSeq.asInstanceOf[scala.collection.mutable.WrappedArray[Any]].array
Casting is usually not a good practice in a language with strong static typing like Scala, so you'd probably want to stick to the Seq unless you really need an Array.
Notice that thus far we always ended up not with an array of strings but with an array of objects, since the Row object in Spark has to accommodate various types. If you want a collection of strings you can iterate the fields and extract the strings:
// grab the Row once: each call to df.first runs a Spark job
val row = df.first
// returns: Vector(a, hello)
for (i <- 0 until row.length) yield row.getString(i)
This of course will cause a ClassCastException if the Row contains non-strings. Depending on your needs, you may also want to consider using a Try to silently drop non-strings within the for-comprehension:
import scala.util.Try
// same return type as before
// non-string members will be filtered out of the end result
val row = df.first
for {
  i <- 0 until row.length
  field <- Try(row.getString(i)).toOption
} yield field
Until now we returned an IndexedSeq, which is suitable for efficient random access (i.e. has constant access time to any item in the collection) and in particular a Vector. Again, you may really need to return an Array. To return an Array[String] you may want to call toArray on the Vector, which unfortunately copies the whole thing.
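As a plain-Scala illustration of that last copying step (the Vector here stands in for the for-comprehension's result):

```scala
// A Vector standing in for the IndexedSeq produced by the for-comprehension
val fields: Vector[String] = Vector("a", "hello")

// toArray allocates a fresh Array[String] and copies every element into it
val arr: Array[String] = fields.toArray

println(arr.mkString(", "))  // a, hello
```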
You can skip this step and directly output an Array[String] by explicitly using flatMap instead of relying on the for-comprehension and using collection.breakOut:
// returns: Array[String] -- silently keeping strings only
val row = df.first
0.until(row.length).
  flatMap(i => Try(row.getString(i)).toOption)(collection.breakOut)
To learn more about builders and collection.breakOut you may want to have a read here.

Well, my problem wasn't solved in the best way, but I tried a way out:
val headerArr = df1.first
var headerArray = new Array[String](colLen)
for (i <- 0 until colLen) {
  headerArray(i) = headerArr(i).toString
}
But still I am open for new suggestions.
For now I am slicing the dataframe into a var of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema and then transferring the elements to an Array[String] with an iteration.
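That loop can be written without the mutable Array and the index variable; a minimal plain-Scala sketch (a Seq[Any] stands in for the Row, since Row(i) also returns Any):

```scala
// A Seq[Any] standing in for the Row returned by df1.first
val headerArr: Seq[Any] = Seq("a", "hello", 42)

// One pass: stringify every cell and collect straight into an Array[String]
val headerArray: Array[String] = headerArr.map(_.toString).toArray

println(headerArray.mkString(", "))  // a, hello, 42
```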

Related

GenericRowWithSchema ClassCastException in Spark 3 Scala UDF for Array data

I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (in parquet, but shown in a JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
case class:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
.select("conditions")
.show(2)
I tried the following UDF:
def updateArray = udf((arr: Seq[MyItem]) => {
  for (i <- 0 to arr.size - 1) {
    // Line 3: the cast that the error below refers to
    val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
    // val a = arr(i)  // without the cast above, this is what fails
    println(a.getAs[MyItem](0))
    // TODO: How to make code = "XXXX" here
    // a.code = "XXXX"
  }
  arr
})
Goal:
I need to set 'code' field value in each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
Also, I get the following error if I remove line 3 in the UDF (the cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: How to capture Array of Structs in a function and how to return a modified array of items?
Welcome to Stack Overflow!
There is a small JSON linting error in your data: I assumed that you wanted to close the [] square brackets of the list array. So for this example I used the following data (which is the same as yours):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map{
case (MyClass(MyList(list))) => MyClass(MyList(list.map{
case (MyElement(item)) => MyElement(MyItem(code = "XXXX", item.category))
}))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you see, we're doing some simple pattern matching on the case classes we've defined and successfully renaming all of the code fields' values to "XXXX". If you want to get a json back, you can call the to_json function like so:
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally, a very small remark about the data. If you have any control over how the data gets made, I would add the following suggestions:
The conditions JSON object seems to serve no function here, since it just contains a single array called list. Consider making conditions the array itself, which would allow you to discard the list name. That would simplify your structure.
The element object does nothing except contain a single item. Consider removing one level of abstraction there too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need for the MyElement and MyList case classes! But very often we're not in control of the data we receive, so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map{
case MyClass(conditions) => MyClass(conditions.map{
item => MyItem("XXXX", item.category)
})
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+
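The pattern-matching map above can be exercised without Spark at all, since it only touches the case classes; a minimal sketch of the same rewrite applied to a single value:

```scala
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])

val data = MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC")))

// Same transformation as in df.map above: mask every code, keep categories
val masked = data match {
  case MyClass(conditions) =>
    MyClass(conditions.map(item => MyItem("XXXX", item.category)))
}

println(masked)  // MyClass(List(MyItem(XXXX,ABC), MyItem(XXXX,EDC)))
```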
I was able to find a simple solution with Spark 3.1+, as new column functions were added in this version.
Updated code:
val data = Seq(
MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)
import spark.implicits._
import org.apache.spark.sql.functions.{col, transform}
val ds = data.toDF()
val updatedDS = ds.withColumn(
"conditions",
transform(
col("conditions"),
x => x.withField("code", updateArray(x.getField("code")))))
updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
if(oldVal.contains("1234"))
"XXX"
else
oldVal
})
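The masking rule inside that UDF can be pulled out and tested as a plain function, independent of Spark:

```scala
// Same logic as the UDF body: mask only codes containing "1234"
def maskCode(oldVal: String): String =
  if (oldVal.contains("1234")) "XXX"
  else oldVal

println(maskCode("1234"))  // XXX
println(maskCode("4550"))  // 4550
```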

What's the practical use of KeyedStream#max

I have a simple Flink application to illustrate the usage of KeyedStream#max
import com.huawei.flink.time.Box
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala.{StreamExecutionEnvironment, _}
object KeyStreamMaxTest {
val env = StreamExecutionEnvironment.getExecutionEnvironment
def main(args: Array[String]): Unit = {
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime)
env.setParallelism(1)
env.setMaxParallelism(1)
val ds = env.fromElements(("X,Red,10"), ("Y,Blue,10"), ("Z,Black, 22"), ("U,Green,22"), ("N,Blue,25"), ("M,Green,23"))
val ds2 = ds.map { line =>
val Array(name, color, size) = line.split(",")
Box(name.trim, color.trim, size.trim.toInt)
}.keyBy(_.color).max("size")
ds2.print()
env.execute()
}
}
The output is:
Box(X,Red,10)
Box(Y,Blue,10)
Box(Z,Black,22)
Box(U,Green,22)
Box(Y,Blue,25) -- I thought this should be ("N,Blue,25")
Box(U,Green,23)
It looks like Flink only replaces the size, but keeps the name and color unchanged.
I would ask: what is the practical use of this behavior? I would have thought it more natural to get the whole record that has the max size.
Sometimes all you need to know is the max value, for each key, of one field. I believe max is able to provide this information while doing less work than the more generally useful maxBy, which returns the whole record that has the max size.
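The difference can be sketched in plain Scala with a per-key reduction (a Box case class mirroring the one in the question): max-style only carries the aggregated field forward, while maxBy-style carries the whole record:

```scala
case class Box(name: String, color: String, size: Int)

// The two "Blue" records from the question, in arrival order
val blues = Seq(Box("Y", "Blue", 10), Box("N", "Blue", 25))

// KeyedStream#max-like: keep the first record, only update the aggregated field
val maxLike = blues.reduce((acc, b) => acc.copy(size = acc.size max b.size))

// KeyedStream#maxBy-like: keep the whole record with the largest size
val maxByLike = blues.maxBy(_.size)

println(maxLike)    // Box(Y,Blue,25)
println(maxByLike)  // Box(N,Blue,25)
```

This reproduces the observed output: max emits Box(Y,Blue,25) rather than the N record.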

Converting datatypes in Spark/Scala

I have a variable in Scala called a, shown below:
scala> a
res17: Array[org.apache.spark.sql.Row] = Array([0_42], [big], [baller], [bitch], [shoe] ..)
It is an array of lists, each containing a single word.
I would like to convert it to a single array consisting of a sequence of strings, as shown below:
Array[Seq[String]] = Array(WrappedArray(0_42,big,baller,shoe,?,since,eluid.........
The reason I am trying to create an array containing a single wrapped array is that I want to run a word2vec model in Spark using MLlib.
Its fit() function only accepts an iterable of strings (S <: Iterable[String]).
scala> val model = word2vec.fit(b)
<console>:41: error: inferred type arguments [String] do not conform to method fit's type parameter bounds [S <: Iterable[String]]
The sample data you're listing is not an array of lists, but an array of Rows. An array of a single WrappedArray you're trying to create also doesn't seem to serve any meaningful purpose.
If you want to create an array of all the word strings in your Array[Row] data structure, you can simply use a map like in the following:
val df = Seq(
("0_42"), ("big"), ("baller"), ("bitch"), ("shoe"), ("?"), ("since"), ("eliud"), ("win")
).toDF("word")
val a = df.rdd.collect
// a: Array[org.apache.spark.sql.Row] = Array(
// [0_42], [big], [baller], [bitch], [shoe], [?], [since], [eliud], [win]
// )
import org.apache.spark.sql.Row
val b = a.map{ case Row(w: String) => w }
// b: Array[String] = Array(0_42, big, baller, bitch, shoe, ?, since, eliud, win)
[UPDATE]
If you do want to create an array of a single WrappedArray, here's one approach:
val b = Array( a.map{ case Row(w: String) => w }.toSeq )
// b: Array[Seq[String]] = Array(WrappedArray(
// 0_42, big, baller, bitch, shoe, ?, since, eliud, win
// ))
I finally got it working by doing the following:
val db = a.map{ case Row(word: String) => word }
val model = word2vec.fit(db.map(l => Seq(l)))

Saving users and items features to HDFS in Spark Collaborative filtering RDD

I want to extract users and items features (latent factors) from the result of collaborative filtering using ALS in Spark. The code I have so far:
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.MatrixFactorizationModel
import org.apache.spark.mllib.recommendation.Rating
// Load and parse the data
val data = sc.textFile("myhdfs/inputdirectory/als.data")
val ratings = data.map(_.split(',') match { case Array(user, item, rate) =>
Rating(user.toInt, item.toInt, rate.toDouble)
})
// Build the recommendation model using ALS
val rank = 10
val numIterations = 10
val model = ALS.train(ratings, rank, numIterations, 0.01)
// extract users latent factors
val users = model.userFeatures
// extract items latent factors
val items = model.productFeatures
// save to HDFS
users.saveAsTextFile("myhdfs/outputdirectory/users") // does not work as expected
items.saveAsTextFile("myhdfs/outputdirectory/items") // does not work as expected
However, what gets written to HDFS is not what I expect. I expected each line to have a tuple (userId, Array_of_doubles). Instead I see the following:
[myname#host dir]$ hadoop fs -cat myhdfs/outputdirectory/users/*
(1,[D@3c3137b5)
(3,[D@505d9755)
(4,[D@241a409a)
(2,[D@c8c56dd)
.
.
It is dumping the hash value of the array instead of the entire array. I did the following to print the desired values:
for (user <- users) {
val (userId, lf) = user
val str = "user:" + userId + "\t" + lf.mkString(" ")
println(str)
}
This does print what I want but I can't then write to HDFS (this prints on the console).
What should I do to get the complete array written to HDFS properly?
Spark version is 1.2.1.
@JohnTitusJungao is right, and the following lines also work as expected:
users.saveAsTextFile("myhdfs/outputdirectory/users")
items.saveAsTextFile("myhdfs/outputdirectory/items")
And this is the reason: userFeatures returns an RDD[(Int, Array[Double])]. The arrays are rendered as the symbols you see in the output, e.g. [D@3c3137b5: [D for an array of doubles, followed by @ and a hex hash code, which is what Java's default toString produces for such objects. More on that here.
val users: RDD[(Int, Array[Double])] = model.userFeatures
To solve that you'll need to turn the array into a string:
val users: RDD[(Int, String)] = model.userFeatures.mapValues(_.mkString(","))
The same goes for items.
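A quick plain-Scala illustration of why the arrays print that way, and what mkString produces instead:

```scala
val lf: Array[Double] = Array(0.1, 0.2, 0.3)

// Arrays inherit Java's default toString: "[D@" followed by a hex hash code
println(lf.toString)       // e.g. [D@3c3137b5 (the hash varies per run)
// mkString renders the actual contents
println(lf.mkString(","))  // 0.1,0.2,0.3
```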

Yield an ArrayBuffer (or other mutable Collection type) from a for loop in Scala

Within the confines of a single matrix-related method that works with large multidimensional arrays, performance and memory usage are critical. We need to mutate elements of the array in place and are thus working with ArrayBuffers (not Arrays).
Given this use case, is there a way to use for .. yield that would generate an ArrayBuffer (or at least a mutable collection) instead of an immutable one?
The following code displays the intent - though it does not compile:
def classify(inarr: Array[Double], arrarr: Array[Array[Double]], labels: Array[String], K: Int): String = {
...
var diffmat: ArrayBuffer[ArrayBuffer[Double]] = for (row <- arrarr) yield {
(ArrayBuffer[Double]() /: (row zip inarr)) {
(outrow, cell) => outrow += cell._1 - cell._2
}
}
The compilation error is :
Expression Array[ArrayBuffer[Double]] does not conform to expected type ArrayBuffer[ArrayBuffer[Double]]
Ah... a case for the "magick sprinkles" of breakOut. Not only does it give you the collection type you want - it does it efficiently, without wasting an extra transformation.
object Foo {
import scala.collection.mutable.ArrayBuffer
import scala.collection.breakOut
val inarr: Array[Double] = Array()
val arrarr: Array[Array[Double]] = Array()
var diffmat: ArrayBuffer[ArrayBuffer[Double]] = (for (row <- arrarr) yield {
(ArrayBuffer[Double]() /: (row zip inarr)) {
(outrow, cell) => outrow += cell._1 - cell._2
}
})(breakOut)
}
The definitive writeup (IMHO) of this is Daniel Sobral's answer.
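One caveat worth noting: collection.breakOut was removed in Scala 2.13. A sketch that builds the same ArrayBuffer of ArrayBuffers explicitly, portable across 2.12 and 2.13 (sample inputs are made up for illustration):

```scala
import scala.collection.mutable.ArrayBuffer

val inarr: Array[Double] = Array(1.0, 2.0, 3.0)
val arrarr: Array[Array[Double]] = Array(Array(4.0, 5.0, 6.0), Array(7.0, 8.0, 9.0))

// Build each row of element-wise differences, then wrap the rows in the
// outer ArrayBuffer explicitly instead of relying on breakOut.
val diffmat: ArrayBuffer[ArrayBuffer[Double]] =
  ArrayBuffer(arrarr.map { row =>
    ArrayBuffer((row zip inarr).map { case (a, b) => a - b }: _*)
  }: _*)

println(diffmat)  // ArrayBuffer(ArrayBuffer(3.0, 3.0, 3.0), ArrayBuffer(6.0, 6.0, 6.0))
```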
