I would like to figure out the most pragmatic way to accept an array (or list) and append to the data structure. Then finally return the new data structure.
Something like this:
def template(array: Array[String]): Array[Nothing] = {
val staging_path = "s3//clone-staging/"
var path_list = Array()
//iterate through each of the items in the array and append to the new string.
for(outputString <- array){
var new_path = staging_path.toString + outputString
println(new_path)
//path_list I thought would add these new staging_path to the array
path_list +: new_path
}
path_list(4)
}
However, calling a single index of the data structure as a shanty way of checking existence, path_list(4) returns an Out of Bounds.
Thanks.
I think you just want to use map here:
val staging_path = "s3//clone-staging/"
val dirs = Array("one", "two", "three", "four", "five")
val paths = dirs.map(dir => staging_path + dir)
println(paths)
// result: paths: Array[String] = Array(s3//clone-staging/one, s3//clone-staging/two, s3//clone-staging/three, s3//clone-staging/four, s3//clone-staging/five)
println(paths.length)
// result: 5
In functional programming land you are generally trying to avoid mutations. Instead, think of it as transforming your input array into a new array.
I am new to scala and I am trying to make a Tuple pair out an RDD of type Array(Array[String]) that looks like:
(122abc,223cde,334vbn,445das),(221bca,321dsa),(231dsa,653asd,698poq,897qwa)
I am trying to create Tuple Pairs out of these arrays so that the first element of each array is key and and any other part of the array is a value. For example the output would look like:
122abc 223cde
122abc 334vbn
122abc 445das
221bca 321dsa
231dsa 653asd
231dsa 698poq
231dsa 897qwa
I can't figure out how to separate the first element from each array and then map it to every other element.
If I'm reading it correctly, the core of your question has to do with separating the head (first element) of the inner arrays from the tail (remaining elements), which you can use the head and tail methods. RDDs behave a lot like Scala lists, so you can do this all with what looks like pure Scala code.
Given the following input RDD:
val input: RDD[Array[Array[String]]] = sc.parallelize(
Seq(
Array(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
)
)
)
The following should do what you want:
val output: RDD[(String,String)] =
input.flatMap { arrArrStr: Array[Array[String]] =>
arrArrStr.flatMap { arrStrs: Array[String] =>
arrStrs.tail.map { value => arrStrs.head -> value }
}
}
And in fact, because of how the flatMap/map is composed, you could re-write it as a for-comprehension.:
val output: RDD[(String,String)] =
for {
arrArrStr: Array[Array[String]] <- input
arrStr: Array[String] <- arrArrStr
str: String <- arrStr.tail
} yield (arrStr.head -> str)
Which one you go with is ultimately a matter of personal preference (though in this case, I prefer the latter, as you don't have to indent code as much).
For verification:
output.collect().foreach(println)
Should print out:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
This is a classic fold operation; but folding in Spark is calling aggregate:
// Start with an empty array
data.aggregate(Array.empty[(String, String)]) {
// `arr.drop(1).map(e => (arr.head, e))` will create tuples of
// all elements in each row and the first element.
// Append this to the aggregate array.
case (acc, arr) => acc ++ arr.drop(1).map(e => (arr.head, e))
}
The solution is a non-Spark environment:
scala> val data = Array(Array("122abc","223cde","334vbn","445das"),Array("221bca","321dsa"),Array("231dsa","653asd","698poq","897qwa"))
scala> data.foldLeft(Array.empty[(String, String)]) { case (acc, arr) =>
| acc ++ arr.drop(1).map(e => (arr.head, e))
| }
res0: Array[(String, String)] = Array((122abc,223cde), (122abc,334vbn), (122abc,445das), (221bca,321dsa), (231dsa,653asd), (231dsa,698poq), (231dsa,897qwa))
Convert your input element to seq and all and then try to write the wrapper which will give you List(List(item1,item2), List(item1,item2),...)
Try below code
val seqs = Seq("122abc","223cde","334vbn","445das")++
Seq("221bca","321dsa")++
Seq("231dsa","653asd","698poq","897qwa")
Write a wrapper to convert seq into a pair of two
def toPairs[A](xs: Seq[A]): Seq[(A,A)] = xs.zip(xs.tail)
Now send your seq as params and it it will give your pair of two
toPairs(seqs).mkString(" ")
After making it to string you will get the output like
res8: String = (122abc,223cde) (223cde,334vbn) (334vbn,445das) (445das,221bca) (221bca,321dsa) (321dsa,231dsa) (231dsa,653asd) (653asd,698poq) (698poq,897qwa)
Now you can convert your string, however, you want.
Using df and explode.
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
val df2 = df.withColumn("key", 'arr(0)).withColumn("values",explode('arr)).filter('key =!= 'values).drop('arr).withColumn("tuple",struct('key,'values))
df2.show(false)
df2.rdd.map( x => Row( (x(0),x(1)) )).collect.foreach(println)
Output:
+------+------+---------------+
|key |values|tuple |
+------+------+---------------+
|122abc|223cde|[122abc,223cde]|
|122abc|334vbn|[122abc,334vbn]|
|122abc|445das|[122abc,445das]|
|221bca|321dsa|[221bca,321dsa]|
|231dsa|653asd|[231dsa,653asd]|
|231dsa|698poq|[231dsa,698poq]|
|231dsa|897qwa|[231dsa,897qwa]|
+------+------+---------------+
[(122abc,223cde)]
[(122abc,334vbn)]
[(122abc,445das)]
[(221bca,321dsa)]
[(231dsa,653asd)]
[(231dsa,698poq)]
[(231dsa,897qwa)]
Update1:
Using paired rdd
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
import scala.collection.mutable._
val rdd1 = df.rdd.map( x => { val y = x.getAs[mutable.WrappedArray[String]]("arr")(0); (y,x)} )
val pair = new PairRDDFunctions(rdd1)
pair.flatMapValues( x => x.getAs[mutable.WrappedArray[String]]("arr") )
.filter( x=> x._1 != x._2)
.collect.foreach(println)
Results:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
This question already has an answer here:
Function returns an empty List in Spark
(1 answer)
Closed 4 years ago.
I've a following code :-
case class event(imei: String, date: String, gpsdt: String, entrygpsdt: String,lastgpsdt: String)
object recalculate extends Serializable {
def main(args: Array[String]) {
val sc = SparkContext.getOrCreate(conf)
val rdd = sc.cassandraTable("db", "table").select("imei", "date", "gpsdt").where("imei=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
var lastgpsdt = "2018-04-06 10:10:10"
var updatedValues = new Array[event](rdd.count().toInt)
var index = 0
rdd.foreach(f => {
val imei = f.get[String]("imei")
val date = f.get[String]("date")
val gpsdt = f.get[String]("gpsdt")
updatedValues(index) = new event(imei, date, gpsdt,lastgpsdt)
println(updatedValues(index).toString())
index = index + 1
lastgpsdt = gpsdt
})
println("updates values are " + updatedValues.toString())
}}
So, here I'm trying to create an array of event class answer save values in array on each iteration and want to access the array outside foreach block. My issue is when I'm trying to access the array it gives null pointer exception and i checked it shows the array is empty. Although I have declared the array as var still why not able to access outside. Suggestions please, Thanks.
If you want to get Array[event] then I don't think that is the right approach
Here is what you can do for alternative
case class event(imei: String, date: String, gpsdt: String,
entrygpsdt: String,lastgpsdt: String)
val result = rdd.map(row => {
val imei = row.getString(0)
val date = row.getString(1)
val gpsdt = row.getString(2)
//create case class as you want
event(imei, date, gpsdt, lastgpsdt ,"2018-04-06 10:10:10")
})
.collect()
The result you obtain is Array[event]
Collect is also preferred only when your data size is small and can fit in a driver.
Hope this helps!
I was converting in the Spark Shell (1.6) a List of strings into an array like this:
val mapData = List("column1", "column2", "column3")
val values = array(mapData.map(col): _*)
The type of values is:
values: org.apache.spark.sql.Column = array(column1,column2,column3)
Everything fine, but when I start developing in Eclipse I got the error:
not found: value array
So I changed to this:
val values = Array(mapData.map(col): _*)
The problem I faced then was that the type of value now changed and the udf which was consuming it doesn't accept this new type:
values: Array[org.apache.spark.sql.Column] = Array(column1, column2,
column3)
Why I am not able to use array() in my IDE as in the Shell (what import am I missing)? and why array produce a org.apache.spark.sql.Column without the Array[] wrapper?
Edit: The udf function:
def replaceFirstMapOfArray =
udf((p: Seq[Map[String, String]], o: Seq[Map[String, String]]) =>
{
if((null != o && null !=p)){
if ( o.size == 1 ) p
else p ++ o.drop(1)
}else{
o
}
})
val mapData = List("column1", "column2", "column3")
val values = array(mapData.map(col): _*)
Here,
Array or List is the collection of objects
where as array in array(mapData.map(col): _*) is a spark function that creates a new column with type array for the same datatype columns.
For this to be used you need to import
import org.apache.spark.sql.functions.array
You can see here about the array
/**
* Creates a new array column. The input columns must all have the same data type.
* #group normal_funcs
* #since 1.4.0
*/
#scala.annotation.varargs
def array(cols: Column*): Column = withExpr {
CreateArray(cols.map(_.expr))
}
I have a variable in scala called a which is as below
scala> a
res17: Array[org.apache.spark.sql.Row] = Array([0_42], [big], [baller], [bitch], [shoe] ..)
It is an array of lists which contains a single word.
I would like to convert it to a single array consisting of sequence of strings like shown below
Array[Seq[String]] = Array(WrappedArray(0_42,big,baller,shoe,?,since,eluid.........
Well the reason why I am trying to create an array of single wrapped array is I want to run word2vec model in spark using MLLIB.
The fit() function in this only takes iterable string.
scala> val model = word2vec.fit(b)
<console>:41: error: inferred type arguments [String] do not conform to method fit's type parameter bounds [S <: Iterable[String]]
The sample data you're listing is not an array of lists, but an array of Rows. An array of a single WrappedArray you're trying to create also doesn't seem to serve any meaningful purpose.
If you want to create an array of all the word strings in your Array[Row] data structure, you can simply use a map like in the following:
val df = Seq(
("0_42"), ("big"), ("baller"), ("bitch"), ("shoe"), ("?"), ("since"), ("eliud"), ("win")
).toDF("word")
val a = df.rdd.collect
// a: Array[org.apache.spark.sql.Row] = Array(
// [0_42], [big], [baller], [bitch], [shoe], [?], [since], [eliud], [win]
// )
import org.apache.spark.sql.Row
val b = a.map{ case Row(w: String) => w }
// b: Array[String] = Array(0_42, big, baller, bitch, shoe, ?, since, eliud, win)
[UPDATE]
If you do want to create an array of a single WrappedArray, here's one approach:
val b = Array( a.map{ case Row(w: String) => w }.toSeq )
// b: Array[Seq[String]] = Array(WrappedArray(
// 0_42, big, baller, bitch, shoe, ?, since, eliud, win
// ))
I finally got it working by doing the following
val db=a.map{ case Row(word: String) => word }
val model = word2vec.fit( b.map(l=>Seq(l)))