I have an Array[Any] from Java JPA containing (two in this case, but consider any small number of) differently-typed things. I would like to represent these as tuples instead.
I have some quick and dirty conversion code, and wondered how it could be improved and perhaps made more generic.
val pair = query.getSingleOrNone // returns Option[Any] (actually a Java array)
pair collect { case array: Array[Any] =>
  (array(0).asInstanceOf[MyClass1], array(1).asInstanceOf[MyClass2])
}
How about this?
val pair = query.getSingleOrNone
pair collect { case Array(x: MyClass1, y: MyClass2, _*) => (x,y) }
// result would be Option[(MyClass1, MyClass2)]
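For a quick self-contained check, here's a minimal sketch with stand-in case classes (MyClass1 and MyClass2 stand for whatever your real entity types are):
case class MyClass1(n: Int)    // stand-in for the real JPA entity type
case class MyClass2(s: String) // likewise a stand-in
val pair: Option[Any] = Some(Array[Any](MyClass1(1), MyClass2("x")))
pair collect { case Array(x: MyClass1, y: MyClass2, _*) => (x, y) }
// => Some((MyClass1(1),MyClass2(x)))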
Use map { case Array(f1,f2) => (f1,f2) }.
Here is an example:
Array( "CA:California", "WA:Washington", "OR:Oregon").
map(s => s.split(":")).
map { case Array(f1,f2) => (f1,f2)}
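One caveat: map with a case like this throws a MatchError on any element that doesn't split into exactly two fields. If that's possible in your data, collect skips non-matching elements instead:
Array( "CA:California", "WA:Washington", "OR:Oregon").
map(s => s.split(":")).
collect { case Array(f1,f2) => (f1,f2)}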
My solution is as below:
val loginValues = line.split(",") // return an Array
val (ip, date, action, username) = (loginValues(0), loginValues(1).toLong, loginValues(2), loginValues(3))
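For example, with a hypothetical input line (the field layout is assumed from the variable names):
scala> val loginValues = "10.0.0.1,1512345678,login,alice".split(",") // made-up sample line
scala> val (ip, date, action, username) = (loginValues(0), loginValues(1).toLong, loginValues(2), loginValues(3))
ip: String = 10.0.0.1
date: Long = 1512345678
action: String = login
username: String = alice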
You can use Tuple.fromArray. Works for Scala 3.0.2, haven't checked earlier versions.
scala> val a = Array("a", "b")
val a: Array[String] = Array(a, b)
scala> Tuple.fromArray(a)
val res1: Tuple = (a,b)
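Note that the static type of the result is just Tuple. If you need a concrete tuple type you have to narrow it yourself, e.g. with a cast (safe here only because the array is known to hold exactly two strings):
scala> Tuple.fromArray(a).asInstanceOf[(String, String)]
val res2: (String, String) = (a,b)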
first array:
var keyColumns = "A,B".split(",")
second array:
var colValues = DataFrameTest.select("Y","Z").collect.map(row => row.toString)
colValues: Array[String]= Array([1,2],[3,4],[5,6])
I want something like this as a result:
Array([A=1,B=2],[A=3,B=4],[A=5,B=6])
so that later I can iterate over this Array and create my where clause like
where
(A=1 AND B=2) OR (A=3 AND B=4) OR (A=5 AND B=6)
First, don't convert structured data to strings: do .map(_.toSeq) after collect instead of mapping each row to its toString.
Then, something like this should work:
colValues
.map { _ zip keyColumns }
.map { _.map { case (v,k) => s"$k=$v" } }
.map { _.mkString("(", " AND ", ")") }
.mkString(" OR ")
You may find it helpful to run this step-by-step in REPL and see what each line does.
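For illustration, here's roughly how the intermediate values evolve in the REPL (a sketch assuming string values; exact printed types may differ by Scala version):
scala> val keyColumns = "A,B".split(",")
scala> val colValues = Array(Seq("1","2"), Seq("3","4"), Seq("5","6")) // stand-in for the collected rows
scala> colValues.map { _ zip keyColumns }
res0: Array[Seq[(String, String)]] = Array(List((1,A), (2,B)), List((3,A), (4,B)), List((5,A), (6,B)))
scala> res0.map { _.map { case (v,k) => s"$k=$v" } }
res1: Array[Seq[String]] = Array(List(A=1, B=2), List(A=3, B=4), List(A=5, B=6))
scala> res1.map { _.mkString("(", " AND ", ")") }.mkString(" OR ")
res2: String = (A=1 AND B=2) OR (A=3 AND B=4) OR (A=5 AND B=6)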
You can use a regular expression, like:
scala> val keyColumns = "A,B".split(",")
keyColumns: Array[String] = Array(A, B)
scala> val colValues = "[1,2] [3,4] [5,6]".split(" ")
colValues: Array[String] = Array([1,2], [3,4], [5,6])
scala> val pattern = """^\[(.{1}),(.{1})\]$""".r // here, (.{1}) is a capture group matching exactly one character (any character)
pattern: scala.util.matching.Regex = ^\[(.{1}),(.{1})\]$
scala> colValues.map { e => pattern.findFirstMatchIn(e).map { m => s"(${keyColumns(0)}=${m.group(1)} AND ${keyColumns(1)}=${m.group(2)})" }.getOrElse(e) }.mkString(" OR ")
res0: String = (A=1 AND B=2) OR (A=3 AND B=4) OR (A=5 AND B=6)
I would like to append each element of one list to the end of the corresponding element of another list.
I have:
val Cars_tmp :List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
=> Result : List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
val Values_tmp: List[String] = Cars_tmp.map(r => ((r.split("[|]")(1).toInt) / (r.split("[|]")(3).toInt)).toString).toList
=> Result : List[String] = List(2, 5)
I would like to have the following result (the first element of Values_tmp concatenated with the first element of Cars_tmp, the second element of Values_tmp with the second element of Cars_tmp, and so on), like below:
List("Cars|10|Paris|5|Type|New|2", "Cars|15|Paris|3|Type|New|5")
I tried to do this:
Values_tmp.foldLeft( Seq[String](), Cars_tmp) { case ((acc, rest), elmt) => ((rest :+ elmt)::acc) }
I have the following error:
<console>:28: error: type mismatch;
found : scala.collection.immutable.IndexedSeq[Any]
required: List[String]
Thank you for your help.
Try to avoid zip; it "fails" silently when the iterables do not have the same size. (In your code it seems obvious that the two lists have the same size, but for more complex code this is not obvious.)
You can compute the "value" you need and concatenate it on the fly:
val Cars_tmp: List[String] = List("Cars|10|Paris|5|Type|New|", "Cars|15|Paris|3|Type|New|")
def getValue(str: String): String = {
val Array(_, a, _, b, _, _) = str.split('|') // Note the single quote for the split.
(a.toInt / b.toInt).toString
}
Cars_tmp.map(str => str + getValue(str))
I proposed an implementation of getValue using the unapply of Array, but you can keep your implementation!
def getValue(r: String) = ((r.split("[|]")(1).toInt)/ (r.split("[|]")(3).toInt)).toString
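Either way, mapping over Cars_tmp produces the result you asked for (10/5 = 2 and 15/3 = 5):
Cars_tmp.map(str => str + getValue(str))
// => List("Cars|10|Paris|5|Type|New|2", "Cars|15|Paris|3|Type|New|5")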
I am new to Scala and I am trying to make tuple pairs out of an RDD of type Array[Array[String]] that looks like:
(122abc,223cde,334vbn,445das),(221bca,321dsa),(231dsa,653asd,698poq,897qwa)
I am trying to create tuple pairs out of these arrays so that the first element of each array is the key and every other element of the array is a value. For example, the output would look like:
122abc 223cde
122abc 334vbn
122abc 445das
221bca 321dsa
231dsa 653asd
231dsa 698poq
231dsa 897qwa
I can't figure out how to separate the first element from each array and then map it to every other element.
If I'm reading it correctly, the core of your question has to do with separating the head (first element) of the inner arrays from the tail (remaining elements), which you can do with the head and tail methods. RDDs behave a lot like Scala lists, so you can do this all with what looks like pure Scala code.
Given the following input RDD:
val input: RDD[Array[Array[String]]] = sc.parallelize(
Seq(
Array(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
)
)
)
The following should do what you want:
val output: RDD[(String,String)] =
input.flatMap { arrArrStr: Array[Array[String]] =>
arrArrStr.flatMap { arrStrs: Array[String] =>
arrStrs.tail.map { value => arrStrs.head -> value }
}
}
And in fact, because of how the flatMap/map calls compose, you could rewrite it as a for-comprehension:
val output: RDD[(String,String)] =
for {
arrArrStr: Array[Array[String]] <- input
arrStr: Array[String] <- arrArrStr
str: String <- arrStr.tail
} yield (arrStr.head -> str)
Which one you go with is ultimately a matter of personal preference (though in this case, I prefer the latter, as you don't have to indent code as much).
For verification:
output.collect().foreach(println)
Should print out:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
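One caveat, in case an empty inner array is possible in your real data: head and tail both throw on an empty array. A defensive variant of the inner flatMap (my addition, not needed for the sample input) can pattern match instead:
arrArrStr.flatMap {
  case Array(head, tail @ _*) => tail.map(value => head -> value) // pair head with each remaining element
  case _                      => Seq.empty                        // skip empty inner arrays
}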
This is a classic fold operation; but a fold that changes the element type is expressed in Spark as aggregate:
// Start with an empty array as the zero value.
data.aggregate(Array.empty[(String, String)])(
  // seqOp: `arr.drop(1).map(e => (arr.head, e))` creates tuples of
  // all elements in each row with the first element, and appends
  // them to the aggregate array.
  (acc, arr) => acc ++ arr.drop(1).map(e => (arr.head, e)),
  // combOp: merge the arrays built up on different partitions.
  (a, b) => a ++ b
)
The same solution in a non-Spark environment:
scala> val data = Array(Array("122abc","223cde","334vbn","445das"),Array("221bca","321dsa"),Array("231dsa","653asd","698poq","897qwa"))
scala> data.foldLeft(Array.empty[(String, String)]) { case (acc, arr) =>
| acc ++ arr.drop(1).map(e => (arr.head, e))
| }
res0: Array[(String, String)] = Array((122abc,223cde), (122abc,334vbn), (122abc,445das), (221bca,321dsa), (231dsa,653asd), (231dsa,698poq), (231dsa,897qwa))
Concatenate your input elements into a single Seq, then write a wrapper which will give you a sequence of pairs like Seq((item1,item2), (item2,item3), ...).
Try below code
val seqs = Seq("122abc","223cde","334vbn","445das")++
Seq("221bca","321dsa")++
Seq("231dsa","653asd","698poq","897qwa")
Write a wrapper to convert the seq into pairs of two:
def toPairs[A](xs: Seq[A]): Seq[(A,A)] = xs.zip(xs.tail)
Now pass your seq as the parameter and it will give you the pairs of two:
toPairs(seqs).mkString(" ")
After making it a string you will get output like:
res8: String = (122abc,223cde) (223cde,334vbn) (334vbn,445das) (445das,221bca) (221bca,321dsa) (321dsa,231dsa) (231dsa,653asd) (653asd,698poq) (698poq,897qwa)
Now you can convert the string however you want.
Using df and explode.
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
val df2 = df.withColumn("key", 'arr(0))
  .withColumn("values", explode('arr))
  .filter('key =!= 'values)
  .drop('arr)
  .withColumn("tuple", struct('key, 'values))
df2.show(false)
df2.rdd.map( x => Row( (x(0),x(1)) )).collect.foreach(println)
Output:
+------+------+---------------+
|key |values|tuple |
+------+------+---------------+
|122abc|223cde|[122abc,223cde]|
|122abc|334vbn|[122abc,334vbn]|
|122abc|445das|[122abc,445das]|
|221bca|321dsa|[221bca,321dsa]|
|231dsa|653asd|[231dsa,653asd]|
|231dsa|698poq|[231dsa,698poq]|
|231dsa|897qwa|[231dsa,897qwa]|
+------+------+---------------+
[(122abc,223cde)]
[(122abc,334vbn)]
[(122abc,445das)]
[(221bca,321dsa)]
[(231dsa,653asd)]
[(231dsa,698poq)]
[(231dsa,897qwa)]
Update1:
Using paired rdd
val df = Seq(
Array("122abc","223cde","334vbn","445das"),
Array("221bca","321dsa"),
Array("231dsa","653asd","698poq","897qwa")
).toDF("arr")
import scala.collection.mutable
import org.apache.spark.rdd.PairRDDFunctions

val rdd1 = df.rdd.map( x => { val y = x.getAs[mutable.WrappedArray[String]]("arr")(0); (y,x)} )
val pair = new PairRDDFunctions(rdd1)
pair.flatMapValues( x => x.getAs[mutable.WrappedArray[String]]("arr") )
.filter( x=> x._1 != x._2)
.collect.foreach(println)
Results:
(122abc,223cde)
(122abc,334vbn)
(122abc,445das)
(221bca,321dsa)
(231dsa,653asd)
(231dsa,698poq)
(231dsa,897qwa)
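As a side note, wrapping the RDD in PairRDDFunctions by hand isn't strictly necessary: Spark ships the implicit conversion on the RDD companion object, so the pair operations should be available directly on an RDD of tuples:
rdd1.flatMapValues( x => x.getAs[mutable.WrappedArray[String]]("arr") )
  .filter( x => x._1 != x._2 )
  .collect.foreach(println)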
I've been trying to convert an RDD to a DataFrame. For that, the types need to be concrete rather than Any. I'm using Spark MLlib's PrefixSpan; that's where freqSequence.sequence is from. I start with a dataframe that contains session IDs, views and purchases as string arrays:
viewsPurchasesGrouped: org.apache.spark.sql.DataFrame =
[session_id: decimal(29,0), view_product_ids: array<string>, purchase_product_ids: array<string>]
I then calculate frequent patterns and need them in a dataframe so I can write them to a Hive table.
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row => Array(Array(row(1)), Array(row(2)) ))
val prefixSpan = new PrefixSpan()
.setMinSupport(0.001)
.setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)
val freqSequencesRdd = sc.parallelize(model.freqSequences.collect())
case class FreqSequences(views: Array[String], purchases: Array[String], support: Long)
val viewsPurchasesDf = freqSequencesRdd.map( fs =>
{
val views = fs.sequence(0)(0)
val purchases = fs.sequence(1)(0)
val freq = fs.freq
FreqSequences(views, purchases, freq)
}
)
viewsPurchasesDf.toDF() // optional
When I try to run this, views and purchases are Any instead of Array[String]. I've desperately tried to convert them around, but the best I get is Array[Any]. I think I need to map the contents to a String; I've tried e.g. this: How to get an element in WrappedArray: result of Dataset.select("x").collect()?, this: How to cast a WrappedArray[WrappedArray[Float]] to Array[Array[Float]] in spark (scala), and thousands of other Stack Overflow questions...
I really don't know how to solve this. I guess I'm already converting the initial dataframe/RDD to much, but can't understand where.
I think the problem is that you have a DataFrame, which retains no static type information. When you take an item out of a Row, you have to tell it explicitly which type you expect to get.
Untested, but inferred from the information you gave:
import scala.collection.mutable.WrappedArray
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
Array(
Array(row.getAs[WrappedArray[String]](1).toArray),
Array(row.getAs[WrappedArray[String]](2).toArray)
)
)
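If the columns really are array<string>, Row.getSeq should do the same thing a little more directly (equally untested):
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
  Array(
    Array(row.getSeq[String](1).toArray),
    Array(row.getSeq[String](2).toArray)
  )
)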
I solved the problem. For reference, this works:
val viewsPurchasesRddString = viewsPurchasesGrouped.map( row =>
Array(
row.getSeq[Long](1).toArray,
row.getSeq[Long](2).toArray
)
)
val prefixSpan = new PrefixSpan()
.setMinSupport(0.001)
.setMaxPatternLength(2)
val model = prefixSpan.run(viewsPurchasesRddString)
case class FreqSequences(views: Long, purchases: Long, frequence: Long)
val ps_frequences = model.freqSequences.filter(fs => fs.sequence.length > 1).map( fs =>
{
val views = fs.sequence(0)(0)
val purchases = fs.sequence(1)(0)
val freq = fs.freq
FreqSequences(views, purchases, freq)
}
)
ps_frequences.toDF()
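And since the original goal was writing to a Hive table, from here something like the following should work (the table name is hypothetical):
ps_frequences.toDF().write.mode("overwrite").saveAsTable("my_db.freq_sequences") // hypothetical target table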
I have a variable in Scala called a, which is as below:
scala> a
res17: Array[org.apache.spark.sql.Row] = Array([0_42], [big], [baller], [bitch], [shoe] ..)
It is an array of lists, each of which contains a single word.
I would like to convert it to a single array consisting of a sequence of strings, as shown below:
Array[Seq[String]] = Array(WrappedArray(0_42,big,baller,shoe,?,since,eluid.........
The reason I am trying to create an array holding a single wrapped array is that I want to run a Word2Vec model in Spark using MLlib.
Its fit() function only accepts an RDD whose elements are an Iterable[String].
scala> val model = word2vec.fit(b)
<console>:41: error: inferred type arguments [String] do not conform to method fit's type parameter bounds [S <: Iterable[String]]
The sample data you're listing is not an array of lists, but an array of Rows. The array of a single WrappedArray you're trying to create also doesn't seem to serve any meaningful purpose.
If you want to create an array of all the word strings in your Array[Row] data structure, you can simply use a map like in the following:
val df = Seq(
("0_42"), ("big"), ("baller"), ("bitch"), ("shoe"), ("?"), ("since"), ("eliud"), ("win")
).toDF("word")
val a = df.rdd.collect
// a: Array[org.apache.spark.sql.Row] = Array(
// [0_42], [big], [baller], [bitch], [shoe], [?], [since], [eliud], [win]
// )
import org.apache.spark.sql.Row
val b = a.map{ case Row(w: String) => w }
// b: Array[String] = Array(0_42, big, baller, bitch, shoe, ?, since, eliud, win)
[UPDATE]
If you do want to create an array of a single WrappedArray, here's one approach:
val b = Array( a.map{ case Row(w: String) => w }.toSeq )
// b: Array[Seq[String]] = Array(WrappedArray(
// 0_42, big, baller, bitch, shoe, ?, since, eliud, win
// ))
I finally got it working by doing the following:
val b = a.map{ case Row(word: String) => word }
val model = word2vec.fit( b.map(l => Seq(l)) )
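One caveat on the last line: per the error message above, fit wants an RDD[S] with S <: Iterable[String], so if b here is a plain collected Array you would likely still need to parallelize it, something like:
val model = word2vec.fit( sc.parallelize(b.map(l => Seq(l))) )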