Split a DataFrame into arrays in Scala

I am trying to split a DataFrame into multiple arrays according to their id.
So I have a table
id|name
12|a
12|b
12|c
13|x
13|y
13|z
and I want to get multiple vectors that look like:
<a,b,c> <x,y,z>
So I have managed to get all the different IDs using:
val ids=dataframe.select("id").distinct.collect.flatMap(_.toSeq)
and that would return 12 and 13.
And I have tried to get for each one of them the names:
val namesArray=ids.map(id=>dataframe.where($"id"===id))
but that doesn't seem to be the way, since it returns the column names, and I need a way to get only the names out of it.

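If you already have a Dataset with your data, you can do the following:
import org.apache.spark.sql.functions.collect_list
import spark.implicits._ // needed for toDF; assumes a SparkSession named spark, as in spark-shell
val yourDataSet = sc.parallelize(List((12, "a"), (12, "b"), (13, "y"), (13, "z"))).toDF("id", "val")
val requiredDataSet = yourDataSet
.groupBy("id")
.agg(collect_list("val").as("vals")) // one array of values per id
.select("vals")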
Or you can achieve this by getting the underlying RDD and then transforming it:
val valueVectorRdd = dataframe.rdd
.map(row => (row.getInt(0), row.getString(1))) // (id, name) pairs
.groupByKey() // gathers the names under each id
.map { case (_, names) => names.toVector }
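The grouping logic itself can be tried without a Spark cluster. The following is a minimal sketch on plain Scala collections, assuming the rows have already been collected as (id, name) pairs; the names rows, namesById and vectors are made up for illustration and are not part of the answers above.

```scala
// Hypothetical sample rows mirroring the question's (id, name) table.
val rows = Seq((12, "a"), (12, "b"), (12, "c"), (13, "x"), (13, "y"), (13, "z"))

// Group the pairs by id, then keep only the names of each group as a Vector.
val namesById: Map[Int, Vector[String]] =
  rows.groupBy(_._1).map { case (id, group) => id -> group.map(_._2).toVector }

// One Vector per id, ordered by ascending id.
val vectors: Seq[Vector[String]] = namesById.toSeq.sortBy(_._1).map(_._2)
```

This only works once the data fits in driver memory; for large tables the groupBy/collect_list or RDD versions keep the work distributed.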

Related

Concatenate Values of all Elements of An Array of Maps in Spark SQL

I am new to Spark SQL and I have a column of type array with data like this:
[{"X":"A11"},{"X":"A12"},{"X":"A13"}]
The output I am looking for is a string field as
A11, A12, A13
I cannot explode the array as I need the data in one row.
Since the maximum length of the array in my case is 6, I got it to work using the case statement below.
case
when size(arr)=1 then array_join(map_values(map_concat(arr[0])),',')
when size(arr)=2 then array_join(map_values(map_concat(arr[0],arr[1])),',')
when size(arr)=3 then array_join(map_values(map_concat(arr[0],arr[1],arr[2])),',')
when size(arr)=4 then array_join(map_values(map_concat(arr[0],arr[1],arr[2],arr[3])),',')
when size(arr)=5 then array_join(map_values(map_concat(arr[0],arr[1],arr[2],arr[3],arr[4])),',')
when size(arr)=6 then array_join(map_values(map_concat(arr[0],arr[1],arr[2],arr[3],arr[4],arr[5])),',')
else
null
end
Is there a better way to do this?
Assuming that the source and result columns are col and values respectively, it can be implemented as follows:
from pyspark.sql import functions as F  # F.transform with a Python lambda needs PySpark 3.1+
data = [
([{"X": "A11"}, {"X": "A12"}, {"X": "A13"}],)
]
df = spark.createDataFrame(data, ['col'])
df = df.withColumn('values', F.array_join(F.flatten(F.transform('col', lambda x: F.map_values(x))), ','))
df.show(truncate=False)
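The flatten-then-join idea in that answer does not depend on the array length. As a rough illustration on plain Scala collections (not the Spark API), assuming one row's column value is available as a sequence of maps; arr and joined are hypothetical names:

```scala
// Hypothetical in-memory stand-in for one row's array-of-maps column.
val arr: Seq[Map[String, String]] =
  Seq(Map("X" -> "A11"), Map("X" -> "A12"), Map("X" -> "A13"))

// Take every map's values, flatten them into one sequence, and join with commas.
val joined: String = arr.flatMap(_.values).mkString(",")
```

Because flatMap walks however many maps the array holds, no case branch per array size is needed.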

How to get keys and values from dictionary in same order? [duplicate]

I am using a dictionary to store key-value pairs. It stores the key-value pairs correctly, but when I try to retrieve all keys and all values separately, they are printed in a different order each time.
Here is the example:
var sampleDic:[String:Any] = [:]
sampleDic["one"] = 1
sampleDic["two"] = 2
sampleDic["three"] = 3
print("keys: \(sampleDic.keys)")
print("Values: \(sampleDic.values)")
Sometimes it prints:
keys: ["one", "two", "three"]
Values: [1, 2, 3]
and sometimes it prints:
keys: ["three", "one", "two"]
Values: [3, 1, 2]
Each time I run the code, I get a different output.
How can I always get the keys and values in the same sequence? Both keys and values must be printed as arrays, in the order in which they were stored in the dictionary.
If this is not possible with dictionaries, is there an alternative way to store and get values like this?
A dictionary stores mappings between keys and values, not a list of key-value pairs, so it has no order.
If you want an order, you can impose one by sorting the key-value pairs, e.g. by key:
let kvp = Array(sampleDic).sorted { $0.key < $1.key }
This produces a [(key: String, value: Any)]. If you actually want a [String] and an [Any], you can write an unzip function (from an answer by Rob):
func unzip<K, V>(_ array: [(key: K, value: V)]) -> ([K], [V]) {
var keys = [K]()
var values = [V]()
keys.reserveCapacity(array.count)
values.reserveCapacity(array.count)
array.forEach { key, value in
keys.append(key)
values.append(value)
}
return (keys, values)
}
Now you can do:
let (keys, values) = unzip(kvp)
print("Keys:", keys)
print("Values:", values)

Kotlin array In array?

I want to make an array within an array in Kotlin and get an element by index.
For example, I make this array: [ (1, 12(real data is Bitmap)) , (2, 24(same)), (3, 36) ]
so that I can get array(index) = 12.
How can I make an array of this form and get data by index as above?
Maybe Map is what you need:
val map = mapOf(1 to 12, 2 to 24, 3 to 36)
val twelve = map[1]
It is a collection that holds pairs of objects (keys and values) and supports efficiently retrieving the value corresponding to each key.
To add data to a map, we can use the mutableMapOf function:
val map = mutableMapOf<Int, Bitmap>()
val bitmap: Bitmap = ...
map[4] = bitmap
If you just want an array of bytes, use byteArrayOf:
val array = byteArrayOf(12, 24, 36)
println(array[0]) // 12
ByteArray is the equivalent of Java's byte[].
Note: There is also intArrayOf, floatArrayOf, doubleArrayOf etc.
Since you asked for an array in an array as well:
val arrayOfArrays = arrayOf(byteArrayOf(1, 2, 3), byteArrayOf(24), byteArrayOf(36))
println(arrayOfArrays[0][1]) // 2
In this case the type of arrayOfArrays will be Array<ByteArray> and you need arrayOf to construct that.

Scala Join Arrays and Reduce on Value

I have two Arrays of Key/Value Pairs Array[(String, Int)] that I want to join, but only return the minimum value when there's a match on the key.
val a = Array(("personA", 1), ("personB", 4), ("personC", 5))
val b = Array(("personC", 4), ("personA", 2))
Goal: c = Array((personA, 1), (personC, 4))
val c = a.join(b).collect()
results in: c = Array((personA, (1, 2)), (personC, (5, 4)))
I've tried to achieve this using the join method but am having difficulties reducing the values after they have been joined into a single array: Array[(String, (Int, Int))].
Try this:
val a = Array(("personA", 1), ("personB", 4), ("personC", 5))
val b = Array(("personC", 4), ("personA", 2))
val bMap = b.toMap
val cMap = a.toMap.filterKeys(bMap.contains).map {
case(k, v) => k -> Math.min(v, bMap(k))
}
val c = cMap.toArray
The toMap method converts the Array[(String, Int)] into a Map[String, Int]; filterKeys is then used to retain only the keys (strings) in a.toMap that are also in b.toMap. The map operation then chooses the minimum value of the two available values for each key, and creates a new map associating each key with that minimum value. Finally, we convert the resulting map back to an Array[(String, Int)] using toArray.
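A variant of the same approach that avoids filterKeys (which is deprecated in Scala 2.13) is sketched below. It assumes plain Array[(String, Int)] inputs with unique keys, as in the question, and uses collect with a guard to filter and map in one pass.

```scala
val a = Array(("personA", 1), ("personB", 4), ("personC", 5))
val b = Array(("personC", 4), ("personA", 2))

// Look-up table for the second array.
val bMap: Map[String, Int] = b.toMap

// Keep only the keys present in both arrays, paired with the smaller value.
val c: Array[(String, Int)] = a.collect {
  case (k, v) if bMap.contains(k) => k -> math.min(v, bMap(k))
}
```

Unlike the toMap/toArray round trip, this keeps the element order of a.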
UPDATED
BTW: I'm not sure where you get the Array.join method from: Array doesn't have such a method, so a.join(b) doesn't work for me. However, I suspect that a and b might be Apache Spark PairRDD collections (or something similar). If that's the case, then you can join a and b, then map the values to the minimum of each pair (a reduce operation is not what you want) as follows:
a.join(b).mapValues(v => Math.min(v._1, v._2)).collect
collect converts the result into an Array[(String, Int)] as you require.

How do I algorithmically instantiate and manipulate a multidimensional array in Scala

I am trying to write a program to manage a database through a Scala GUI, and have been running into a lot of trouble formatting my data so that it can be put into a Table with the column headers populated. To do this, I have been told I need to use an Array[Array[Any]] instead of the ArrayBuffer[ArrayBuffer[String]] I have been using.
My problem is that the way I am trying to fill these arrays is modular: I am trying to use the same function to draw from different tables in a MySQL database, each of which has a different number of columns and entries.
I have been able to (I think) define a 2-D array with
val Data = new Array[Array[String]](numColumns)(numRows)
but I haven't found any ways of editing individual cells in this new array.
Data(i)(j)=Value //or
Data(i,j)=Value
do not work, and give me errors about "Update" functionality
I am sure this can't possibly be as complicated as I have been making it, so what is the easy way of managing these things in this language?
You don't need to read your data into an Array of Arrays. You just need to convert it to that format when you feed it to the Table constructor, which is easy, as demonstrated in my answer to your other question: How do I configure the Column names in a Scala Table?
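As a minimal sketch of such a conversion, assuming the rows were read into an ArrayBuffer[ArrayBuffer[String]] as described in the question (the sample values and the names rows and tableData are made up for illustration):

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical rows, as they might have been read from the database.
val rows = ArrayBuffer(
  ArrayBuffer("1", "Alice"),
  ArrayBuffer("2", "Bob")
)

// Widen each row to Array[Any], then turn the outer buffer into an Array.
val tableData: Array[Array[Any]] = rows.map(_.toArray[Any]).toArray
```

The buffers can keep growing while the data is read, and the conversion happens once, when the table is built.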
If you're creating a 2D array, the idiom you want is
val data = Array.ofDim[String](numColumns, numRows)
(There is also new Array[String](numColumns, numRows), but that's deprecated.)
You access element (i, j) of an Array data with data(i)(j) (remember they start from 0).
But in general you should avoid mutable collections (like Array, ArrayBuffer) unless there's a good reason. Try Vector instead.
Without knowing the format in which you're retrieving data from the database it's not possible to say how to put it into a collection.
Update:
You can alternatively put the type information on the left hand side, so the following are equivalent (decide for yourself which you prefer):
val a: Array[Array[String]] = Array.ofDim(2,2)
val a = Array.ofDim[String](2,2)
To explain the syntax for accessing / updating elements: as in Java, a multi-dimensional array is just an array of arrays. So here, a(i) is element i of a, which is an Array[String], and so a(i)(j) is element j of that array, which is a String.
Luigi's answer is great, but I'd like to shed some light on why your code isn't working.
val Data = new Array[Array[String]](numColumns)(numRows)
does not do what you expect it to do. The new Array[Array[String]](numColumns) part creates an array of arrays of strings with numColumns entries, with all entries (arrays of strings) being null, and returns it. The following (numRows) then just calls the apply function on that returned object, which returns the entry at index numRows of that array, which is null.
You can try that out in the Scala REPL: when you input
new Array[Array[String]](10)(9)
you get this as output:
res0: Array[String] = null
Luigi's solution, instead
Array.ofDim[String](2,2)
does the right thing:
res1: Array[Array[String]] = Array(Array(null, null), Array(null, null))
It's rather ugly, but you can update a multidimensional array with update
> val data = Array.ofDim[String](2,2)
data: Array[Array[String]] = Array(Array(null, null), Array(null, null))
> data(0).update(0, "foo")
> data
data: Array[Array[String]] = Array(Array(foo, null), Array(null, null))
Not sure about the efficiency of this technique.
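For comparison, the assignment syntax is shorthand for that same update call, so the two forms should behave identically. A small sketch, with arbitrary dimensions and values:

```scala
// A 2 x 3 array of Strings, initially filled with null.
val data = Array.ofDim[String](2, 3)

// Assignment syntax: the compiler rewrites this to data(0).update(1, "foo").
data(0)(1) = "foo"

// The explicit form of the same operation on another cell.
data(1).update(2, "bar")
```

Note that data(i, j) = value does not compile for an array of arrays, because the outer Array has no two-index update method.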
Luigi's answer is great, but I just wanted to point out another way of initialising an Array that is more idiomatic/functional – using tabulate. This takes a function that takes the array cell coordinates as input and produces the cell value:
scala> Array.tabulate[String](4, 4) _
res0: (Int, Int) => String => Array[Array[String]] = <function1>
scala> val data = Array.tabulate(4, 4) {case (x, y) => x * y }
data: Array[Array[Int]] = Array(Array(0, 0, 0, 0), Array(0, 1, 2, 3), Array(0, 2, 4, 6), Array(0, 3, 6, 9))
