Apache-spark: Equal column data structure, different outcome in UDF function - arrays

I have two array columns:
arrayColumns1: org.apache.spark.sql.Column = array("col1","col2")
arrayColumns2: org.apache.spark.sql.Column = array("col1","col2")
Both seem equal, but they come from different sources.
arrayColumns1 comes from converting Array("col1","col2") to a Column with this function:
def asLitArray[T](xs: Seq[T]) = array(xs map lit: _*)
arrayColumns2 comes from writing the array expression out directly.
Now when I try to use arrayColumns1 as input to a UDF:
.withColumn("udfFunction", udfFunction(arrayColumns1))
where
val udfFunction = udf(
  { xs: Seq[Double] =>
    DO_SOMETHING
    (output)
  }
)
It throws this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(col1,col2))' due to data type mismatch: argument 1 requires array<double> type, however, 'array('col1','col2')' is of array<string> type.;;
But when I use arrayColumns2 it works fine. What did I do wrong?
I'm using Spark 2.1 with Scala 2.11.

It does not make much sense to pass an array of literals to a UDF, because what you want to pass is the names of the columns, not literal values. Your first case fails because you are creating columns of type string: lit("col1") is a literal column whose content is the string "col1"; it does not reference the column col1.
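To see the difference, compare the two (a quick sketch):
import org.apache.spark.sql.functions.{col, lit}
lit("col1") // a constant Column: the string "col1" in every row
col("col1") // a Column referencing the DataFrame column named col1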
I would do it like this:
def asColArray(xs: Seq[String]) = array((xs.map(x => col(x))): _*)
val arrayColumns = asColArray(Array("col1","col2"))
df.withColumn("udfFunction",udfFunction(arrayColumns))
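For context, a minimal end-to-end sketch of how the pieces fit together (assuming a toy DataFrame with two double columns; sumCols is a hypothetical UDF, not from the question):
import org.apache.spark.sql.functions.{array, col, udf}
import spark.implicits._ // spark: the active SparkSession

val df = Seq((1.0, 2.0), (3.0, 4.0)).toDF("col1", "col2")

def asColArray(xs: Seq[String]) = array(xs.map(col): _*)

// the UDF now receives each row's values for the referenced columns
val sumCols = udf { xs: Seq[Double] => xs.sum }

df.withColumn("udfFunction", sumCols(asColArray(Seq("col1", "col2")))).show()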
If you really wanted to use literal values, you would need to do something like this:
val arrayColumns = asLitArray(Array(1.0, 2.0))
But this would give you a constant output from your UDF.

Related

CSV to Array of Objects

I am using a third party package for Spark that utilizes a "PointFeature" object. I am trying to take a csv file and put elements from each row into an Array of these PointFeature objects.
The PointFeature constructor for my implementation looks like this:
Feature(Point( _c1, _c2), _c3)
where _c1, _c2, and _c3 are the columns of my csv and represent doubles.
Here is my current attempt:
val points: Array[PointFeature[Double]] = for {
  line <- sc.textFile("file.csv")
  point <- Feature(Point(line._c1, line._c2), line._c3)
} yield point
My error shows up when referencing the columns
<console>:36: error: value _c1 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c2 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c3 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
This is obviously because I'm referencing a String as if it were an element of a DataFrame. I'm wondering if there is a way to work with DataFrames in this loop format, or a way to split each line into a List of doubles. Maybe I need an RDD?
Also, I'm not certain that this will yield an Array. Actually, I suspect it will return an RDD...
I am using Spark 2.1.0 on Amazon EMR
Here are some other Questions I have drawn from:
How to read csv file into an Array of arrays in scala
Splitting strings in Apache Spark using Scala
How to iterate records spark scala?
You could set up a Dataset[Feature] this way:
import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._

case class Feature(x: Double, y: Double, z: Double)

sparkSession.read.csv("file.csv")
  .toDF("x", "y", "z")
  .withColumn("x", 'x.cast(DoubleType))
  .withColumn("y", 'y.cast(DoubleType))
  .withColumn("z", 'z.cast(DoubleType))
  .as[Feature]
Then you can consume your strongly-typed DataSet[Feature] as you see fit.
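For example (a sketch, with ds standing for the Dataset built above), this also answers the Array concern from the question:
ds.filter(f => f.z > 0.0).show()            // typed filtering on the case class fields
val features: Array[Feature] = ds.collect() // collect gives you a plain Array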
I suggest taking this on in smaller steps.
Step One
Get your rows as an Array/List/whatever of Strings.
val lines = sc.textFile("file.txt") already gives you your rows as an RDD[String]; for a plain local collection you could use scala.io.Source.fromFile("file.txt").getLines().toList instead.
Step Two
Break your lines into their own lists of columns.
val splits = lines.map {l => l.split(",")}
Step Three
Extract your columns as vals that you can use:
val res = splits.map {
  case Array(col1, col2, col3) => // convert to doubles, put into a Feature/Point structure
  case _ => // handle the case where your csv is malformed
}
This can all be done in one go; I only split it up to show the logical steps: file → list of strings → list of lists of strings → list of Features.
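As a rough sketch of that one-go version (assuming the Feature and Point constructors from the question and comma-separated numeric columns):
val points: Array[PointFeature[Double]] =
  sc.textFile("file.csv")
    .map(_.split(","))
    .collect { case Array(c1, c2, c3) => // partial function: malformed rows are dropped
      Feature(Point(c1.toDouble, c2.toDouble), c3.toDouble)
    }
    .collect() // materializes the RDD on the driver as an Array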

Creating a typed array column from an empty array

I just want to solve the following problem: I want to filter out all rows of a DataFrame in which the strings contained in one column are contained in a blacklist, which is given as a (potentially empty) array of strings.
For example: if the blacklist contains "forty two" and "twenty three", all rows are filtered out from the DataFrame in which the respective column contains either "forty two" or "twenty three".
The following code executes successfully if the blacklist is not empty (for example Array("forty two")) and fails otherwise (Array.empty[String]):
//HELPERs
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))
def containsString(array: mutable.WrappedArray[String], value: String) = {array.contains(value)}
def arrayCol[T](arr: Array[T]) = {array(arr map lit: _*)}
df.filter(!containsStringUDF(arrayCol[String](blacklist),$"theStringColumn"))
The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type
It seems that empty arrays appear typeless to Spark. Is there a nice way to deal with this?
You are overthinking the problem. What you really need here is isin:
val blacklist = Seq("foo", "bar")
$"theStringColumn".isin(blacklist: _*)
Moreover, don't depend on the local type for ArrayType being a WrappedArray; just use Seq.
Finally, to answer your question, you can either:
array().cast("array<string>")
or:
import org.apache.spark.sql.types.{ArrayType, StringType}
array().cast(ArrayType(StringType))
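Applied to the helper from the question, one possible fix (a sketch, assuming a string blacklist) is to special-case the empty array:
import org.apache.spark.sql.functions.{array, lit}

def stringArrayCol(arr: Array[String]) =
  if (arr.isEmpty) array().cast("array<string>") // give the empty array an explicit element type
  else array(arr.map(lit): _*)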

Is it possible to define the length of a list in F#?

For example, I define a record as:
type a = {b : float; c : int list}
but I already know that this list must have a predefined size, say 2; if the list does not have exactly 2 elements, it should be a different type or an error, since no such type is defined.
Is it possible to define the size of the list, as in other languages where you must declare the size?
Depending on the application, this question can also apply to an array.
Maybe you should use an array instead of a list, since an array has a fixed length:
// create an array of length = 2, initialized with zeros.
let cValues : int[] = Array.create 2 0
cValues.IsFixedSize // returns true
EDIT: As others have suggested, a tuple might also be the way to go. For a pair (a tuple of length two), you can access the values using the fst and snd functions.
For a longer tuple, you can use pattern matching as shown below. If the tuple is too long to make this pattern matching approach practical, then you probably need a structure other than an F# tuple. Of course, one major requirement to consider is whether you need to store values of mixed types. A tuple or a record can store a mix of multiple types, whereas an array or list stores values of a single type.
let fourTuple = (5, 10, 2, 3)
let _,_,third,_ = fourTuple
printfn "%d" third // displays "2"
If an array or a tuple won't meet your requirements, then maybe you should use another record like this:
type ListOfC = {c1 : int; c2 : int}
type a' = {b' : float; c' : ListOfC}
Or you could create a new class that would meet your requirements, starting like the script below. Of course, this would probably not be considered idiomatic functional programming style. It's more like OOP.
type ListOfTwo(firstInt : int, secondInt : int) =
member this.First = firstInt
member this.Second = secondInt
let myListOfTwo = ListOfTwo(4, 5)
myListOfTwo.First
type a = {b : float; c : ListOfTwo }

How to convert Array of values in VBA to array of other value types, based on user input?

So I have a class that returns an array of Singles. The problem is that a new requirement came up where I need to be able to provide these values as a single string, which can be done with Join, except that Join requires the array to be made up of Strings, not Singles.
I guess I could write a new method in my class that provides the same array but with the values as strings. However, I was wondering if I could just convert the existing method to accept something like Optional ValType As VbVarType as a parameter and change the type of the values in the output array accordingly.
Is this doable in a relatively DRY way? I'd love to avoid code that looks like:
Select Case ValType
Case #something1
#omitted code
Case #something2
#omitted code
Case #something3
#omitted code
........
UPDATE: I guess what I am looking for is a function like CStr(), except that I'd like it to work on an array and have it accept a parameter describing the type to convert to.
Can you simply have an array of type Variant rather than Single? The following code seems to work ok:
Dim arrInt() As Variant
ReDim arrInt(0 To 3)
arrInt(0) = 0
arrInt(1) = 1
arrInt(2) = 2
arrInt(3) = 3
Debug.Print Join(arrInt, "#")

Creating an Array2D in F# (VS2010 Beta 1)

Consider the following code fragment in VS2010 Beta 1:
let array = Array2D.zeroCreate 1000 500
This produces an error, namely:
error FS0030: Value restriction. The value 'array' has been inferred to have generic type
    val array : '_a [,]
Either define 'array' as a simple data term, make it a function with explicit arguments or, if you do not intend for it to be generic, add a type annotation.
Can I explicitly set the type (in my case a grid of string)?
You can explicitly specify the type like this:
let array : string [,] = Array2D.zeroCreate 1000 500
For further information on the value restriction you might want to take a look at this F#-Wiki page.
You can also use init to create an array, though it might be slower.
let array = Array2D.init 1000 500 (fun _ _ -> "")
Zeroing out an array is not commonly seen in functional programming. It's much more common to pass an initialization function to init and just create the array with the values you want.
To create a 2-dimensional array containing empty strings:
let array = Array2D.create 1000 500 ""
