Convert case class constructor parameters to String Array in Scala

I have a case class as follows:
case class MHealthUser(acc_Chest_X: Double, acc_Chest_Y: Double, acc_Chest_Z: Double, activityLabel: Int)
These form the schema of a Spark DataFrame, which is why I'm using a case class. I simply want to map these to an Array[String] so I can use the ParamValidators.inArray(attributes) method in Spark. I use the following code to map the constructor parameters to an array using reflection:
val attributes: Array[String] = MHealthUser.getClass.getConstructors.map(a => a.toString)
but this simply gives me an array of length 1 whereas I want an array of length 4, with the contents of the array being the dataset schema which I've defined, as a string. Otherwise I'm using the hard-coded values of the dataset schema, which is obviously inelegant.
In other words I want the output:
val attributes: Array[String] = Array("acc_Chest_X", "acc_Chest_Y", "acc_Chest_Z", "activityLabel")
I've been playing with this for a while and can't get it to work. Any ideas appreciated. Thanks!

I'd use ScalaReflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
ScalaReflection.schemaFor[MHealthUser].dataType match {
case s: StructType => s.fieldNames
case _ => Array[String]()
}
Outside Spark, see the question "Scala. Get field names list from case class".
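For the non-Spark case, here is a minimal sketch using plain Java reflection. Note that MHealthUser.getClass in the question inspects the companion object, not the case class itself, which is why it did not show the four parameters. The object name FieldNamesDemo is only for illustration, and the JVM does not formally guarantee the order of getDeclaredFields, although in practice it usually follows declaration order.

```scala
object FieldNamesDemo {
  case class MHealthUser(acc_Chest_X: Double, acc_Chest_Y: Double, acc_Chest_Z: Double, activityLabel: Int)

  // Reflect on the case class itself (classOf), not on its companion object.
  val attributes: Array[String] =
    classOf[MHealthUser].getDeclaredFields
      .filterNot(_.isSynthetic) // skip compiler-generated fields such as $outer
      .map(_.getName)

  def main(args: Array[String]): Unit =
    println(attributes.mkString(", "))
}
```

On Scala 2.13 and later, an existing instance also exposes productElementNames, which avoids reflection if you already have a value at hand.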

Related

Splitting or Breakup multidimensional arrays in scala spark along attributes

var date_columns = df.dtypes.filter(_._2 == "TimestampType")
This creates an array of (name, type) pairs containing only the timestamp-type columns, along with their data types:
Array[(String, String)] = Array((cutoffdate,TimestampType), (wrk_pkg_start_date,TimestampType), (wrk_pkg_end_date,TimestampType))
Now, how do I split this array so that only the column names are in an array:
dateColumns = [ cutoffdate , wrk_pkg_start_date , wrk_pkg_end_date ]
in Scala Spark? Without using for loops, please.
Just use collect for that:
var date_columns = df.dtypes.collect{ case (name, "TimestampType") => name }
collect filters the array and maps its elements in a single pass, using pattern matching.
See the Scala documentation for collect.
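To try this without a Spark session, here is a small sketch where a hand-written array stands in for df.dtypes (the "id" column and the object name CollectDemo are made up for illustration):

```scala
object CollectDemo {
  // Stand-in for df.dtypes, which has type Array[(String, String)].
  val dtypes: Array[(String, String)] = Array(
    ("id", "IntegerType"),
    ("cutoffdate", "TimestampType"),
    ("wrk_pkg_start_date", "TimestampType"),
    ("wrk_pkg_end_date", "TimestampType")
  )

  // Keep only pairs whose second element is "TimestampType", and return the name.
  val dateColumns: Array[String] =
    dtypes.collect { case (name, "TimestampType") => name }

  // The same result in two steps, with filter followed by map.
  val viaFilterMap: Array[String] =
    dtypes.filter(_._2 == "TimestampType").map(_._1)

  def main(args: Array[String]): Unit =
    println(dateColumns.mkString(", "))
}
```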

Recursively apply a function to elements of an array spark dataFrame

I wrote the following function which concatenates two strings and adds them in a dataframe new column:
def idCol(firstCol: String, secondCol: String, IdCol: String = FUNCTIONAL_ID): DataFrame = {
df.withColumn(IdCol,concat(col(firstCol),lit("."),col(secondCol))).dropDuplicates(IdCol)
}
My aim is to replace the separate string parameters with one array of strings, and then define the new column as the concatenation of the elements of that array. I am using an array on purpose, so that the function still works if the number of elements to concatenate changes.
Do you have any idea how to do this?
So the function would be changed as :
def idCol(cols:Array[String], IdCol: String = FUNCTIONAL_ID): DataFrame = {
df.withColumn(IdCol,concat(col(cols(0)),lit("."),col(cols(1)))).dropDuplicates(IdCol)
}
I want to bypass the cols(0), cols(1) and do a generic transformation which takes all elements of the array and separates them with the character ".".
You can use concat_ws which has the following definition:
def concat_ws(sep: String, exprs: Column*): Column
You need to convert your column names which are in String to Column type:
import org.apache.spark.sql.functions._
def idCol(cols:Array[String], IdCol: String = FUNCTIONAL_ID): DataFrame = {
val concatCols = cols.map(col(_))
df.withColumn(IdCol, concat_ws(".", concatCols : _*) ).dropDuplicates(IdCol)
}
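The `: _*` part is what passes a whole collection to a varargs parameter. Here is a Spark-free sketch of just that mechanism; joinWithSep is a hypothetical stand-in that only mimics the shape of concat_ws(sep: String, exprs: Column*), it is not a Spark function.

```scala
object VarargsDemo {
  // Hypothetical helper with the same parameter shape as concat_ws, but on plain strings.
  def joinWithSep(sep: String, parts: String*): String = parts.mkString(sep)

  val cols: Array[String] = Array("first", "second", "third")

  // `cols: _*` expands the array into individual varargs arguments,
  // so this works for any number of elements.
  val joined: String = joinWithSep(".", cols: _*)

  def main(args: Array[String]): Unit =
    println(joined)
}
```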

Type Mismatch in scala, Array[Int] and Array[Option[Int]]

First, I am new to Scala, So apologies if the following question is too simple.
I have written the following code to find, in the map, the values of the keys that I supply in an array.
def stringToCountMap(inputArray: Array[String], inputMap:Map[String,Int]) : Array[Int] = {
return inputArray.map(x => inputMap.get(x))
}
I got the following error,
type mismatch;
found : Array[Option[Int]]
required: Array[Int]
return inputArray.map(x => inputMap.get(x))
Question:
1) Can anyone explain what is Option[Int]?
2) What is my mistake here ?
Thanks in advance.
Option is Scala's option type (also called a nullable type). It represents a case where the value may not exist.
Consider a map that doesn't contain a requested key. How would you handle a request for the key? One option is to result in an error, such as by throwing an exception. Another is to return a special value that indicates there is no value. Map.get does the latter, using Option as the special type and None as the value. This means the return type of Map.get isn't the value type of the map (Int), but Option applied to the value type (Option[Int]).
To correct the type declaration, change the return type:
def stringToCountMap(inputArray: Array[String], inputMap:Map[String,Int]) : Array[Option[Int]] = {
inputArray.map(x => inputMap.get(x))
}
You can leave out the return type of stringToCountMap and let type inference handle it:
def stringToCountMap(inputArray: Array[String], inputMap:Map[String,Int]) = {
inputArray.map(x => inputMap.get(x))
}
As a consequence, missing keys from the input map get carried through:
scala> stringToCountMap(Array("a", "def"), Map("a" -> 1, "bc" -> 2))
res0: Array[Option[Int]] = Array(Some(1), None)
Option[T] is a wrapper around a value of type T. Its purpose is to prevent NullPointerException, that you might know from Java. A value of type Option[T] might either be None, which, as the name implies is an object that represents nothing, or it might be Some(x: T), which represents an existing value.
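inputMap.get(x) returns an Option[Int], since you have no guarantee that the x key exists in inputMap. If it does, it returns Some(value: Int), else it returns None. (inputMap(x), by contrast, returns a plain Int and throws a NoSuchElementException when the key is missing.)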
Calling stringToCountMap(Array("a", "b", "c"), Map("a" -> 1, "c" -> 2)) results in Array(Some(1), None, Some(2))
If you want an Array[Int] instead, you might do something like inputArray.map(x => inputMap.getOrElse(x, 0)). The getOrElse method on Map has two parameters, where the first one is the key, and the second one is the default value. inputArray.map(x => inputMap.get(x).getOrElse(0)) has the same effect, since calling getOrElse(default) on an Option either unwraps the Some object, or returns the default value.
Now, stringToCountMap(Array("a", "b", "c"), Map("a" -> 1, "c" -> 2)) results in Array(1, 0, 2).
You might also want to omit the keys that are missing from the input map. In that case, you might do inputArray.flatMap(x => inputMap.get(x)). flatMap is a function similar to map, but, as the name implies, it flattens the collections returned for each element into a single collection. For example, calling flatMap(x => x) on an Array[Array[Int]] would return an Array of all the values in the 2D array in a single row.
Here, Option is a collection, as well. If it is of type Some, it contains a single value, if it is None, it is an empty collection. Thus, in the resulting array you would only have the values of the keys present in the map, and the keys not present in the map are skipped.
Now, stringToCountMap(Array("a", "b", "c"), Map("a" -> 1, "c" -> 2)) results in Array(1, 2).
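Putting the three variants side by side, as a small sketch (the object and value names are made up for illustration; the data is the same as in the examples above):

```scala
object OptionLookupDemo {
  val inputMap: Map[String, Int] = Map("a" -> 1, "c" -> 2)
  val keys: Array[String] = Array("a", "b", "c")

  // map + get: missing keys are carried through as None.
  val asOptions: Array[Option[Int]] = keys.map(k => inputMap.get(k))

  // map + getOrElse: missing keys are replaced by a default value.
  val withDefault: Array[Int] = keys.map(k => inputMap.getOrElse(k, 0))

  // flatMap + get: missing keys are dropped from the result.
  val presentOnly: Array[Int] = keys.flatMap(k => inputMap.get(k))

  def main(args: Array[String]): Unit = {
    println(asOptions.mkString(", "))
    println(withDefault.mkString(", "))
    println(presentOnly.mkString(", "))
  }
}
```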
First, there is no need to use 'return' in scala to return any value.
def stringToCountMap(inputArray: Array[String], inputMap:Map[String,Int]) = {
inputArray.map(x => inputMap.get(x))
}
When you look up the value for a key with Map.get, the result is wrapped in an Option.
For example:
scala> val map = Map(1-> "a",2 -> "b")
map: scala.collection.immutable.Map[Int,String] = Map(1 -> a, 2 -> b)
scala> map.get(1)
res0: Option[String] = Some(a)
scala> map.get(3)
res1: Option[String] = None
In Java, looking up a key that does not exist returns null, which often leads to a NullPointerException later on. In Scala, Map.get returns None when there is no value, which avoids that.
In your method, you have declared the return type as Array[Int], but the function returns Array[Option[Int]]; that's why you get a compilation error.

Scala case class arguments instantiation from array

Consider a case class with a possibly large number of members; to illustrate the case assume two arguments, as in
case class C(s1: String, s2: String)
and therefore assume an array with size of at least that many arguments,
val a = Array("a1", "a2")
Then
scala> C(a(0), a(1))
res9: C = C(a1,a2)
However, is there an approach to case class instantiation where there is no need to refer to each element in the array for any (possibly large) number of predefined class members ?
No, you can't. You cannot guarantee your array size is at least the number of members of your case class.
You can use tuples though.
Suppose you have the case class mentioned above and a tuple that looks like this:
val t = ("a1", "a2")
Then you can do:
C.tupled(t)
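A minimal sketch of the tuple route, assuming Scala 2.x, plus one way to go from an array with the length check made explicit. The fromArray helper is hypothetical, and it still names each element, so it does not scale to a large number of members; it only shows where the size guarantee has to come from.

```scala
object FromArrayDemo {
  case class C(s1: String, s2: String)

  // From a tuple: the arity and element types are checked at compile time.
  val t: (String, String) = ("a1", "a2")
  val fromTuple: C = (C.apply _).tupled(t)

  // From an array: the length is only known at runtime, so the result is an Option.
  def fromArray(a: Array[String]): Option[C] = a match {
    case Array(s1, s2) => Some(C(s1, s2)) // matches arrays of exactly two elements
    case _             => None
  }

  def main(args: Array[String]): Unit = {
    println(fromTuple)
    println(fromArray(Array("a1", "a2")))
    println(fromArray(Array("only one")))
  }
}
```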
Having gathered bits and pieces from the other answers, a solution that uses Shapeless 2.0.0 is thus as follows,
import shapeless._
import HList._
import syntax.std.traversable._
val a = List("a1", 2) // List[Any]
val aa = a.toHList[String::Int::HNil]
val aaa = aa.get.tupled // (String, Int)
Then we can instantiate a given case class with
case class C(val s1: String, val i2: Int)
val ins = C.tupled(aaa)
and so
scala> ins.s1
res10: String = a1
scala> ins.i2
res11: Int = 2
The type argument passed to toHList has to be known at compile time, just like the member types of the case class being instantiated.
To convert a Seq to a tuple see this answer: https://stackoverflow.com/a/14727987/2483228
Once you have a tuple, serejja's answer (tupled on the companion object) will get you an instance of the case class.
Note that convention would have us spell a class name with a capital letter, so C rather than c.

How do I algorithmically instantiate and manipulate a multidimensional array in Scala

I am trying to write a program to manage a database through a Scala GUI, and have been running into a lot of trouble formatting my data in such a way that I can put it into a Table and have the column headers populate. To do this, I have been told I would need to use an Array[Array[Any]] instead of the ArrayBuffer[ArrayBuffer[String]] I have been using.
My problem is that the way I am trying to fill these arrays is modular: I am trying to use the same function to draw from different tables in a MySQL database, each of which has a different number of columns and entries.
I have been able to (I think) define a 2-D array with
val Data = new Array[Array[String]](numColumns)(numRows)
but I haven't found any ways of editing individual cells in this new array.
Data(i)(j)=Value //or
Data(i,j)=Value
do not work, and give me errors about "update" functionality.
I am sure this can't possibly be as complicated as I have been making it, so what is the easy way of managing these things in this language?
You don't need to read your data into an Array of Arrays. You just need to convert it to that format when you feed it to the Table constructor, which is easy, as demonstrated in my answer to your other question: How do I configure the Column names in a Scala Table?
If you're creating a 2D array, the idiom you want is
val data = Array.ofDim[String](numColumns, numRows)
(There is also new Array[String](numColumns, numRows), but that's deprecated.)
You access element (i, j) of an Array data with data(i)(j) (remember they start from 0).
But in general you should avoid mutable collections (like Array, ArrayBuffer) unless there's a good reason. Try Vector instead.
Without knowing the format in which you're retrieving data from the database it's not possible to say how to put it into a collection.
Update:
You can alternatively put the type information on the left hand side, so the following are equivalent (decide for yourself which you prefer):
val a: Array[Array[String]] = Array.ofDim(2,2)
val a = Array.ofDim[String](2,2)
To explain the syntax for accessing / updating elements: as in Java, a multi-dimensional array is just an array of arrays. So here, a(i) is element i of a, which is an Array[String], and so a(i)(j) is element j of that array, which is a String.
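As a small sketch of creating, updating and reading such an array (the sizes and the object name Array2DDemo are arbitrary choices for illustration):

```scala
object Array2DDemo {
  val numColumns = 2
  val numRows = 3

  // Allocates the outer array and every inner array; String cells start out as null.
  val data: Array[Array[String]] = Array.ofDim[String](numColumns, numRows)

  // Assignment syntax is sugar for data(0).update(1, "foo").
  data(0)(1) = "foo"

  // data(0) is an Array[String]; data(0)(1) is a String.
  val cell: String = data(0)(1)

  def main(args: Array[String]): Unit =
    println(data.map(_.mkString("[", ", ", "]")).mkString(" "))
}
```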
Luigi's answer is great, but I'd like to shed some light on why your code isn't working.
val Data = new Array[Array[String]](numColumns)(numRows)
does not do what you expect it to do. The new Array[Array[String]](numColumns) part does create an array of arrays of strings with numColumns entries, with all entries (arrays of strings) being null, and returns it. The following (numRows) then just calls the apply function on that returned object, which returns the entry at index numRows of that array, which is null.
You can try that out in the scala REPL: When you input
new Array[Array[String]](10)(9)
you get this as output:
res0: Array[String] = null
Luigi's solution, instead
Array.ofDim[String](2,2)
does the right thing:
res1: Array[Array[String]] = Array(Array(null, null), Array(null, null))
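The difference can also be checked outside the REPL. A minimal sketch, with the two steps of the failing expression written out separately (the names are made up for illustration):

```scala
object NullRowsDemo {
  // Only the outer array is allocated here; its three entries are all null.
  val outer: Array[Array[String]] = new Array[Array[String]](3)

  // A second argument list on the constructor call is just this lookup: outer.apply(2).
  val row: Array[String] = outer(2)

  // Array.ofDim allocates the inner arrays as well.
  val allocated: Array[Array[String]] = Array.ofDim[String](3, 2)

  def main(args: Array[String]): Unit = {
    println(row == null)
    println(allocated(2).length)
  }
}
```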
It's rather ugly, but you can update a multidimensional array with update
> val data = Array.ofDim[String](2,2)
data: Array[Array[String]] = Array(Array(null, null), Array(null, null))
> data(0).update(0, "foo")
> data
data: Array[Array[String]] = Array(Array(foo, null), Array(null, null))
Not sure about the efficiency of this technique.
Luigi's answer is great, but I just wanted to point out another way of initialising an Array that is more idiomatic/functional – using tabulate. This takes a function that takes the array cell coordinates as input and produces the cell value:
scala> Array.tabulate[String](4, 4) _
res0: ((Int, Int) => String) => Array[Array[String]] = <function1>
scala> val data = Array.tabulate(4, 4) {case (x, y) => x * y }
data: Array[Array[Int]] = Array(Array(0, 0, 0, 0), Array(0, 1, 2, 3), Array(0, 2, 4, 6), Array(0, 3, 6, 9))

Resources