Spark - Difference between array() and Array() - arrays

In the Spark shell (1.6) I was converting a List of strings into an array like this:
val mapData = List("column1", "column2", "column3")
val values = array(mapData.map(col): _*)
The type of values is:
values: org.apache.spark.sql.Column = array(column1,column2,column3)
Everything was fine, but when I started developing in Eclipse I got the error:
not found: value array
So I changed to this:
val values = Array(mapData.map(col): _*)
The problem then was that the type of values changed, and the UDF consuming it no longer accepts the new type:
values: Array[org.apache.spark.sql.Column] = Array(column1, column2, column3)
Why am I not able to use array() in my IDE as I can in the shell (what import am I missing)? And why does array produce a single org.apache.spark.sql.Column rather than an Array[...] of columns?
Edit: The udf function:
def replaceFirstMapOfArray =
  udf((p: Seq[Map[String, String]], o: Seq[Map[String, String]]) => {
    if (null != o && null != p) {
      if (o.size == 1) p
      else p ++ o.drop(1)
    } else {
      o
    }
  })

val mapData = List("column1", "column2", "column3")
val values = array(mapData.map(col): _*)
Here, Array (or List) is a Scala collection of objects, whereas array in array(mapData.map(col): _*) is a Spark SQL function that creates a new column of array type from columns that share the same data type.
To use it you need this import:
import org.apache.spark.sql.functions.array
You can see the definition of array here (a corrected snippet for the IDE follows it):
/**
 * Creates a new array column. The input columns must all have the same data type.
 * @group normal_funcs
 * @since 1.4.0
 */
@scala.annotation.varargs
def array(cols: Column*): Column = withExpr {
  CreateArray(cols.map(_.expr))
}
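A minimal sketch of the corrected snippet (assuming a DataFrame df whose column1/column2/column3 columns hold maps, and the replaceFirstMapOfArray UDF from the question); the only change needed in the IDE is the import:

import org.apache.spark.sql.functions.{array, col}

val mapData = List("column1", "column2", "column3")

// array(...) returns a single Column of array type (the Spark SQL function),
// whereas Array(...) would return an Array[Column], i.e. a plain Scala collection of columns.
val values: org.apache.spark.sql.Column = array(mapData.map(col): _*)

// The single Column can then be passed to the UDF, e.g.:
// df.withColumn("merged", replaceFirstMapOfArray(values, values))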

Related

GenericRowWithSchema ClassCastException in Spark 3 Scala UDF for Array data

I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (stored in Parquet, shown here in JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
case class:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
.select("conditions")
.show(2)
I tried the following UDF function.
UDF code:
def updateArray = udf((arr: Seq[MyItem]) => {
  for (i <- 0 to arr.size - 1) {
    // Line 3: cast the element to GenericRowWithSchema
    val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
    // val a = arr(i)  // used instead of line 3, this triggers the error below
    println(a.getAs[MyItem](0))
    // TODO: How to make code = "XXXX" here
    // a.code = "XXXX"
  }
  arr
})
Goal:
I need to set the 'code' field value of each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
I also get the following error if I remove the cast on line 3 of the UDF (the cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: how do I receive an array of structs in a UDF, and how do I return a modified array of items?
Welcome to Stack Overflow!
There was a small JSON linting error in your data: I assumed you wanted to close the square brackets of the list array. So for this example I used the following data (which is otherwise the same as yours):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map {
  case MyClass(MyList(list)) => MyClass(MyList(list.map {
    case MyElement(item) => MyElement(MyItem(code = "XXXX", item.category))
  }))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you can see, we do some simple pattern matching on the case classes we defined and successfully replace every code field's value with "XXXX". If you want JSON back, you can call the to_json function like so:
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally, a very small remark about the data. If you have any control over how the data is produced, I would make the following suggestions:
The conditions JSON object seems to serve no purpose here, since it just contains a single array called list. Consider making conditions itself the array, which would let you drop the list name. That would simplify your structure.
The element object does nothing except contain a single item. Consider removing one level of nesting there too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need for the MyElement and MyList case classes! But very often we're not in control of the data we receive, so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map {
  case MyClass(conditions) => MyClass(conditions.map {
    item => MyItem("XXXX", item.category)
  })
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+
I was able to find a simple solution with Spark 3.1+, since new column functions were added in that version.
Updated code:
val data = Seq(
  MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
  MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)
import spark.implicits._
import org.apache.spark.sql.functions.{col, transform}
val ds = data.toDF()
val updatedDS = ds.withColumn(
  "conditions",
  transform(
    col("conditions"),
    x => x.withField("code", updateArray(x.getField("code")))))
updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
  if (oldVal.contains("1234")) "XXX"
  else oldVal
})
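For what it's worth, with the same Spark 3.1+ functions the masking can also be expressed without any UDF, using when/otherwise inside transform (a sketch, assuming the same ds as above):

import org.apache.spark.sql.functions.{col, lit, transform, when}

// Rewrite the code field of every struct in the array purely with column
// expressions; nothing is deserialized into case classes or Rows.
val noUdfDS = ds.withColumn(
  "conditions",
  transform(
    col("conditions"),
    x => x.withField(
      "code",
      when(x.getField("code").contains("1234"), lit("XXX"))
        .otherwise(x.getField("code")))))

noUdfDS.show()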

Create Tuple out of Array(Array[String]) of Varying Sizes using Scala

I am new to Scala and I am trying to make tuple pairs out of an RDD of type Array(Array[String]) that looks like:
(122abc,223cde,334vbn,445das),(221bca,321dsa),(231dsa,653asd,698poq,897qwa)
I am trying to create tuple pairs out of these arrays so that the first element of each array is the key and every other element of the array is a value. For example, the output would look like:
122abc 223cde
122abc 334vbn
122abc 445das
221bca 321dsa
231dsa 653asd
231dsa 698poq
231dsa 897qwa
I can't figure out how to separate the first element from each array and then map it to every other element.
If I'm reading it correctly, the core of your question has to do with separating the head (first element) of the inner arrays from the tail (remaining elements), which you can do with the head and tail methods. RDDs behave a lot like Scala collections, so you can do all of this with what looks like pure Scala code.
Given the following input RDD:
val input: RDD[Array[Array[String]]] = sc.parallelize(
  Seq(
    Array(
      Array("122abc","223cde","334vbn","445das"),
      Array("221bca","321dsa"),
      Array("231dsa","653asd","698poq","897qwa")
    )
  )
)
The following should do what you want:
val output: RDD[(String, String)] =
  input.flatMap { arrArrStr: Array[Array[String]] =>
    arrArrStr.flatMap { arrStrs: Array[String] =>
      arrStrs.tail.map { value => arrStrs.head -> value }
    }
  }
And in fact, because of how flatMap/map compose, you can rewrite it as a for-comprehension:
val output: RDD[(String, String)] =
  for {
    arrArrStr: Array[Array[String]] <- input
    arrStr: Array[String] <- arrArrStr
    str: String <- arrStr.tail
  } yield arrStr.head -> str
Which one you go with is ultimately a matter of personal preference (though in this case, I prefer the latter, as you don't have to indent code as much).
For verification:
output.collect().foreach(println)
Should print out:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
This is a classic fold operation; on a Spark RDD, though, fold requires the result type to match the element type, so you call aggregate instead:
// Start with an empty array of pairs.
data.aggregate(Array.empty[(String, String)])(
  // seqOp: `arr.drop(1).map(e => (arr.head, e))` pairs every element after
  // the first with the head of the row; append those tuples to the accumulator.
  (acc, arr) => acc ++ arr.drop(1).map(e => (arr.head, e)),
  // combOp: merge the accumulators built on different partitions.
  (acc1, acc2) => acc1 ++ acc2
)
The same solution in a non-Spark environment:
scala> val data = Array(Array("122abc","223cde","334vbn","445das"),Array("221bca","321dsa"),Array("231dsa","653asd","698poq","897qwa"))
scala> data.foldLeft(Array.empty[(String, String)]) { case (acc, arr) =>
| acc ++ arr.drop(1).map(e => (arr.head, e))
| }
res0: Array[(String, String)] = Array((122abc,223cde), (122abc,334vbn), (122abc,445das), (221bca,321dsa), (231dsa,653asd), (231dsa,698poq), (231dsa,897qwa))
Convert your input elements to a single Seq, then write a wrapper that turns it into a sequence of pairs of consecutive elements, e.g. Seq((item1, item2), (item2, item3), ...).
Try the code below.
val seqs = Seq("122abc","223cde","334vbn","445das")++
Seq("221bca","321dsa")++
Seq("231dsa","653asd","698poq","897qwa")
Write a wrapper to convert the seq into pairs of two:
def toPairs[A](xs: Seq[A]): Seq[(A,A)] = xs.zip(xs.tail)
Now pass your seq as the parameter and it will give you the pairs:
toPairs(seqs).mkString(" ")
After converting it to a string you will get output like:
res8: String = (122abc,223cde) (223cde,334vbn) (334vbn,445das) (445das,221bca) (221bca,321dsa) (321dsa,231dsa) (231dsa,653asd) (653asd,698poq) (698poq,897qwa)
Now you can convert your string however you want.
Using a DataFrame and explode:
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
val df2 = df.withColumn("key", 'arr(0))
  .withColumn("values", explode('arr))
  .filter('key =!= 'values)
  .drop('arr)
  .withColumn("tuple", struct('key, 'values))
df2.show(false)
df2.rdd.map( x => Row( (x(0),x(1)) )).collect.foreach(println)
Output:
+------+------+---------------+
|key |values|tuple |
+------+------+---------------+
|122abc|223cde|[122abc,223cde]|
|122abc|334vbn|[122abc,334vbn]|
|122abc|445das|[122abc,445das]|
|221bca|321dsa|[221bca,321dsa]|
|231dsa|653asd|[231dsa,653asd]|
|231dsa|698poq|[231dsa,698poq]|
|231dsa|897qwa|[231dsa,897qwa]|
+------+------+---------------+
[(122abc,223cde)]
[(122abc,334vbn)]
[(122abc,445das)]
[(221bca,321dsa)]
[(231dsa,653asd)]
[(231dsa,698poq)]
[(231dsa,897qwa)]
Update 1:
Using a paired RDD.
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
import scala.collection.mutable._
import org.apache.spark.rdd.PairRDDFunctions
val rdd1 = df.rdd.map( x => { val y = x.getAs[mutable.WrappedArray[String]]("arr")(0); (y, x) } )
val pair = new PairRDDFunctions(rdd1)
pair.flatMapValues( x => x.getAs[mutable.WrappedArray[String]]("arr") )
.filter( x=> x._1 != x._2)
.collect.foreach(println)
Results:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)

Scala convert WrappedArray or Array[Any] to Array[String]

I've been trying to convert an RDD to a DataFrame. For that, the types need to be defined, and not Any. I'm using Spark MLlib's PrefixSpan; that's where freqSequence.sequence comes from. I start with a DataFrame that contains session IDs, views, and purchases as string arrays:
viewsPurchasesGrouped: org.apache.spark.sql.DataFrame =
[session_id: decimal(29,0), view_product_ids: array[string], purchase_product_ids: array[string]]
I then calculate frequent patterns and need them in a dataframe so I can write them to a Hive table.
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row => Array(Array(row(1)), Array(row(2)) ))
val prefixSpan = new PrefixSpan()
.setMinSupport(0.001)
.setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)
val freqSequencesRdd = sc.parallelize(model.freqSequences.collect())
case class FreqSequences(views: Array[String], purchases: Array[String], support: Long)
val viewsPurchasesDf = freqSequencesRdd.map( fs => {
  val views = fs.sequence(0)(0)
  val purchases = fs.sequence(1)(0)
  val freq = fs.freq
  FreqSequences(views, purchases, freq)
})
viewsPurchasesDf.toDF() // optional
When I try to run this, views and purchases are "Any" instead of "Array[String]". I've desperately tried to convert them around, but the best I get is Array[Any]. I think I need to map the contents to a String; I've tried e.g. this: How to get an element in WrappedArray: result of Dataset.select("x").collect()? and this: How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in spark (scala), and thousands of other Stack Overflow questions...
I really don't know how to solve this. I guess I'm already converting the initial DataFrame/RDD too much, but I can't understand where.
I think the problem is that you have a DataFrame, which retains no static type information. When you take an item out of a Row, you have to tell it explicitly which type you expect to get.
Untested, but inferred from the information you gave:
import scala.collection.mutable.WrappedArray
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
Array(
Array(row.getAs[WrappedArray[String]](1).toArray),
Array(row.getAs[WrappedArray[String]](2).toArray)
)
)
I solved the problem. For reference, this works:
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
Array(
row.getSeq[Long](1).toArray,
row.getSeq[Long](2).toArray
)
)
val prefixSpan = new PrefixSpan()
.setMinSupport(0.001)
.setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)
case class FreqSequences(views: Long, purchases: Long, frequence: Long)
val ps_frequences = model.freqSequences.filter(fs => fs.sequence.length > 1).map( fs => {
  val views = fs.sequence(0)(0)
  val purchases = fs.sequence(1)(0)
  val freq = fs.freq
  FreqSequences(views, purchases, freq)
})
ps_frequences.toDF()

Converting datatypes in Spark/Scala

I have a variable in Scala called a, which looks like this:
scala> a
res17: Array[org.apache.spark.sql.Row] = Array([0_42], [big], [baller], [bitch], [shoe] ..)
It is an array of lists, each of which contains a single word.
I would like to convert it to a single array consisting of a sequence of strings, as shown below:
Array[Seq[String]] = Array(WrappedArray(0_42,big,baller,shoe,?,since,eluid.........
The reason I am trying to create an array of a single wrapped array is that I want to run a Word2Vec model in Spark using MLlib.
Its fit() function only accepts collections whose elements are Iterable[String]:
scala> val model = word2vec.fit(b)
<console>:41: error: inferred type arguments [String] do not conform to method fit's type parameter bounds [S <: Iterable[String]]
The sample data you're listing is not an array of lists but an array of Rows. The array containing a single WrappedArray that you're trying to create also doesn't seem to serve any meaningful purpose.
If you want to create an array of all the word strings in your Array[Row] data structure, you can simply use a map like in the following:
val df = Seq(
("0_42"), ("big"), ("baller"), ("bitch"), ("shoe"), ("?"), ("since"), ("eliud"), ("win")
).toDF("word")
val a = df.rdd.collect
// a: Array[org.apache.spark.sql.Row] = Array(
// [0_42], [big], [baller], [bitch], [shoe], [?], [since], [eliud], [win]
// )
import org.apache.spark.sql.Row
val b = a.map{ case Row(w: String) => w }
// b: Array[String] = Array(0_42, big, baller, bitch, shoe, ?, since, eliud, win)
[UPDATE]
If you do want to create an array of a single WrappedArray, here's one approach:
val b = Array( a.map{ case Row(w: String) => w }.toSeq )
// b: Array[Seq[String]] = Array(WrappedArray(
// 0_42, big, baller, bitch, shoe, ?, since, eliud, win
// ))
I finally got it working by doing the following:
val db = a.map{ case Row(word: String) => word }
val model = word2vec.fit(db.map(l => Seq(l)))

How do I remove duplicate values from my Multidimensional array in a Scala way?

I'm trying to extract some values from a String. The string contains several lines with values; the values on each line are a number, a first name, and a last name. I then want to filter by a given pattern and remove duplicate numbers.
This is my test:
test("Numbers should be unique") {
val s = Cool.prepareListAccordingToPattern(ALLOWED_PATTERN, "1234,örjan,nilsson\n4321,eva-lisa,nyman\n1234,eva,nilsson")
assert(s.length == 2, "Well that didn't work.. ")
info("Chopping seems to work. Filtered duplicate numbers. Expected 1234:4321, got: "+s(0)(0)+":"+s(1)(0))
}
The methods:
def prepareListAccordingToPattern(allowedPattern: String, s: String): Array[Array[String]] = {
  val lines = chop("\n", s)
  val choppedUp = lines.map(line =>
    chop(",", line)).filter(array =>
    array.length == 3 && array(0).matches(allowedPattern)
  )
  choppedUp
}

def chop(splitSymbol: String, toChop: String): Array[String] = {
  toChop.split(splitSymbol)
}
My test fails as expected since I receive back a multidimensional array with duplicates:
[0]["1234","örjan","nilsson"]
[1]["4321","eva-lisa","nyman"]
[2]["1234","eva","nilsson"]
What I would like to do is filter out the duplicated numbers, in this case "1234", so that I get back:
[0]["1234","örjan","nilsson"]
[1]["4321","eva-lisa","nyman"]
How should I do this in a Scala way? Or maybe I should attack the problem differently?
val arr = Array(
  Array("1234","örjan","nilsson"),
  Array("4321","eva-lisa","nyman"),
  Array("1234","eva","nilsson")
)
arr.groupBy( _(0)).map{ case (_, vs) => vs.head }.toArray
// Array(Array(1234, örjan, nilsson), Array(4321, eva-lisa, nyman))
If you have a collection of elements (in this case an Array of Array[String]) and want to keep a single element for every value of some property (in this case the property is the first string of each Array[String]), you should group the elements of the collection by that property (arr.groupBy(_(0))) and then somehow select one element from every group. Here we pick the first element (Array[String]) of every group.
If you want to select some other (not necessarily the first) element for every group, you can convert every element (Array[String]) to a key-value pair ((String, Array[String])), where the key is the value of the target property, and then convert this collection of pairs to a Map:
val myMap = arr.map{ a => a(0) -> a }.toMap
// Map(1234 -> Array(1234, eva, nilsson), 4321 -> Array(4321, eva-lisa, nyman))
myMap.values.toArray
// Array(Array(1234, eva, nilsson), Array(4321, eva-lisa, nyman))
In this case you'll get the last element from every group.
A bit implicit, but should work:
val arr = Array(
  Array("1234","örjan","nilsson"),
  Array("4321","eva-lisa","nyman"),
  Array("1234","eva","nilsson")
)
arr.view.reverse.map(x => x.head -> x).toMap.values
// Iterable[Array[String]] = MapLike(Array(1234, örjan, nilsson), Array(4321, eva-lisa, nyman))
The reverse is there so that "eva","nilsson" gets overridden by "örjan","nilsson", not vice versa.
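As a further option (not taken from the answers above, just a sketch against the same arr), you can keep the first occurrence of each number while preserving the input order by folding over the rows with a set of numbers already seen:

// Keep the first row for every number, in input order.
val deduped = arr.foldLeft((Set.empty[String], Vector.empty[Array[String]])) {
  case ((seen, acc), row) =>
    if (seen(row(0))) (seen, acc)        // duplicate number: drop the row
    else (seen + row(0), acc :+ row)     // first occurrence: keep it
}._2.toArray
// Array(Array(1234, örjan, nilsson), Array(4321, eva-lisa, nyman))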
