Convert a column of WrappedArrays in Scala to a column of Vector[Double] - arrays

I have a dataframe in Scala with 3 observations. One of the columns contains wrapped arrays, such that when I write:
df.select("column").collect()
I'll get back
Array[org.apache.spark.sql.Row] = Array([WrappedArray(0.8, 0.5, 0.6)], [WrappedArray(0.6, 0.55, 0.7)], [WrappedArray(0.3, 0.4, 0.5, 0.6)])
Is there a function to convert the wrapped arrays to vectors?

You can try this:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.functions.{col, udf}
// The array column is deserialized as a Seq (WrappedArray), so accept a Seq[Double] in the UDF
val vectorUDF = udf((array: Seq[Double]) => Vectors.dense(array.toArray))
df.withColumn("vector", vectorUDF(col("column"))).drop("column")
Vectors.dense() converts an Array[Double] into a Vector value.
Check the import depending on whether you are using the ml library (org.apache.spark.ml.linalg) or the mllib one (org.apache.spark.mllib.linalg).
Hope this works.
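If you are on Spark 3.1 or later, there is also a built-in helper that avoids a hand-rolled UDF. A minimal sketch, assuming the column holds array<double> and you want an ml Vector:
import org.apache.spark.ml.functions.array_to_vector
import org.apache.spark.sql.functions.col
// array_to_vector (Spark 3.1+) converts an array<double> column into an ml Vector column
val withVector = df.withColumn("vector", array_to_vector(col("column"))).drop("column")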

Related

Lower order Spark Dataframe Array concatenation of individual elements

I am using Spark 3.x higher-order array functions like this:
%scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Column
val arrayStructureData = Seq(
Row(1,List(2,5,1,3),List(0.1, 0.5, 0.7, 0.8)),
Row(2,List(2,1),List(0.2, 0.3)),
Row(1,List(1,5),List(0.4, 0.3)),
Row(2,List(3,2),List(0.0, 0.1))
)
// Just a single StructType for the Row
val arrayStructureSchema = new StructType()
.add("id",IntegerType)
.add("prop1", ArrayType(IntegerType))
.add("values", ArrayType(DoubleType))
val df = spark.createDataFrame(
spark.sparkContext.parallelize(arrayStructureData),arrayStructureSchema)
df.printSchema()
df.show()
val resDF = df.withColumn(
"jCols",
zip_with(
col("prop1"),
col("values"),
(left: Column, right: Column) => array(left, right)
)
)
resDF.show(false)
resDF.printSchema()
so as to concatenate individual array elements positionally across the two arrays to get a new sub-array. That works fine.
E.g.
[3, 2]| [0.0, 0.1]
returns:
[[3.0, 0.0], [2.0, 0.1]]
My question is: since I cannot immediately see how I would do this without zip_with, what would be the easiest way to do it? A UDF?
Nice reference: https://mungingdata.com/spark-3/array-exists-forall-transform-aggregate-zip_with/ but I am interested in the harder, bare-metal way.
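For reference, a UDF doing the same positional pairing might look like the sketch below; it assumes both arrays are non-null and of equal length, and the names zipArrays and resUdfDF are just for illustration:
import org.apache.spark.sql.functions.{col, udf}
// Pair the i-th element of prop1 with the i-th element of values,
// mirroring what zip_with(..., array(left, right)) produces
val zipArrays = udf { (ints: Seq[Int], doubles: Seq[Double]) =>
  ints.zip(doubles).map { case (i, d) => Array(i.toDouble, d) }
}
val resUdfDF = df.withColumn("jCols", zipArrays(col("prop1"), col("values")))
resUdfDF.show(false)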

How to convert 1D tensor to regular javascript array?

How to convert 1D tensor to regular Javascript array in Tensorflow.js?
My 1D Tensor is like this:
Tensor [-0.0005436, -0.0021222, 0.0006213, 0.0014624, 0.0012601, 0.0007024, -0.0001113, -0.0011119, -0.0021328, -0.0031764]
You can use .dataSync() to get the values of a tensor as a TypedArray, and if you want a standard JS array you can use Array.from(), which creates arrays out of array-like objects.
const tensor = tf.tensor1d([1, 2, 3]);
const values = tensor.dataSync();
const arr = Array.from(values);
console.log(values);
console.log(arr);
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@0.14.1/dist/tf.min.js"></script>
Keep in mind using .dataSync() blocks the UI thread until the values are ready, which can cause performance issues. If you want to load the values asynchronously you can use .data(), which returns a Promise resolving to the TypedArray.
To convert a tf.Tensor to a plain JS array there are also the array() and arraySync() methods.
e.g. tf.tensor([1, 2, 5]).arraySync()

Create an Array from features vector in Apache Spark/ Scala

I'm trying to create an array of all the features in a features Vector in Apache Spark and Scala. I need to do this in order to create a Breeze Matrix of the features for various computations in my algorithm. Currently the features are wrapped in a features vector and I want to extract each of them separately. I've been looking at the following question:
Applying IndexToString to features vector in Spark
Here's my current code: (data is a Spark DataFrame, all features are Doubles)
val featureCols = Array("feature1", "feature2", "feature3")
val featureAssembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val dataWithFeatures = featureAssembler.transform(data)
//now we slice the features back again
val featureSlicer = featureCols.map {
col => new VectorSlicer().setInputCol("features").setOutputCol(s"${col}_sliced").setNames(Array(s"${col}"))}
val output = featureSlicer.map(f => f.transform(dataWithFeatures).select(f.getOutputCol).as[Double].collect)
val array = output.flatten.toArray
However this fails with the following error: 'cannot resolve CAST("feature1" AS DOUBLE) due to data type mismatch - cannot cast VectorUDT to DoubleType'
This seems odd since I can do the following without an error:
val array: Array[Double] = dataWithFeatures.select("feature1").as[Double].collect()
Any ideas how to fix this, or whether there is a better way? It seems inefficient to create a sequence of DataFrames and perform the operation on each one separately.
Thanks!
Say the features column is the vector column assembled from all the other feature columns; you can select the features column, convert it to an RDD and then flatMap it:
Example data:
dataWithFeatures.show
+--------+--------+--------+-------------+
|feature1|feature2|feature3| features|
+--------+--------+--------+-------------+
| 1| 2| 3|[1.0,2.0,3.0]|
| 4| 5| 6|[4.0,5.0,6.0]|
+--------+--------+--------+-------------+
import org.apache.spark.ml.linalg.Vector
dataWithFeatures.select("features").rdd.flatMap(r => r.getAs[Vector](0).toArray).collect
// res19: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
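If you are on Spark 3.0 or later, vector_to_array in org.apache.spark.ml.functions keeps the whole thing in the DataFrame API. A minimal sketch, assuming the dataWithFeatures frame from above:
import org.apache.spark.ml.functions.vector_to_array
import org.apache.spark.sql.functions.col
// vector_to_array (Spark 3.0+) turns the Vector column into array<double>,
// which can then be flattened on the driver
val allValues: Array[Double] = dataWithFeatures
  .select(vector_to_array(col("features")))
  .collect()
  .flatMap(_.getSeq[Double](0))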

Convert case class constructor parameters to String Array in Scala

I have a case class as follows:
case class MHealthUser(acc_Chest_X: Double, acc_Chest_Y: Double, acc_Chest_Z: Double, activityLabel: Int)
These form the schema of a Spark DataFrame, which is why I'm using a case class. I simply want to map these to an Array[String] so I can use the ParamValidators.inArray(attributes) method in Spark. I use the following code to map the constructor parameters to an array using reflection:
val attributes: Array[String] = MHealthUser.getClass.getConstructors.map(a => a.toString)
but this simply gives me an array of length 1, whereas I want an array of length 4 whose contents are the field names of the schema I've defined, as strings. Otherwise I'm using hard-coded values for the dataset schema, which is obviously inelegant.
In other words I want the output:
val attributes: Array[String] = Array("acc_Chest_X", "acc_Chest_Y", "acc_Chest_Z", "activityLabel")
I've been playing with this for a while and can't get it to work. Any ideas appreciated. Thanks!
I'd use ScalaReflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
ScalaReflection.schemaFor[MHealthUser].dataType match {
case s: StructType => s.fieldNames
case _ => Array[String]()
}
Outside Spark, see: Scala. Get field names list from case class
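For a quick non-Spark sketch, plain Java reflection over the case class also yields the field names; note that getDeclaredFields usually follows declaration order on common JVMs, but that ordering is not guaranteed by the spec:
// Assumes the MHealthUser case class from the question
val attributes: Array[String] = classOf[MHealthUser].getDeclaredFields.map(_.getName)
// typically Array(acc_Chest_X, acc_Chest_Y, acc_Chest_Z, activityLabel)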

Subtracting elements at specified indices in array

I am a beginner in functional programming and Scala. I have an Array of arrays which contains Double values. I want to subtract elements (basically two arrays, see example) and I have been unable to find out online how to do this.
For example, consider
val instance = Array(Array(2.1, 3.4, 5.6),
Array(4.4, 7.8, 6.7))
I want to subtract 4.4 from 2.1, 7.8 from 3.4 and 6.7 from 5.6
Is this possible in Scala?
Apologies if the question seems very basic but any guidance in the right direction would be appreciated. Thank you for your time.
You can use .zip:
scala> instance(1).zip(instance(0)).map{ case (a,b) => a - b}
res3: Array[Double] = Array(2.3000000000000003, 4.4, 1.1000000000000005)
instance(1).zip(instance(0)) makes an array of tuples Array((4.4,2.1), (7.8,3.4), (6.7,5.6)) from the corresponding pairs in your arrays.
.map{ case (a,b) => a - b} or .map(x => x._1 - x._2) performs the subtraction for every tuple.
I would also recommend using a tuple instead of your top-level array:
val instance = (Array(2.1, 3.4, 5.6), Array(4.4, 7.8, 6.7))
So now, with additional definitions, it looks much better
scala> val (a,b) = instance
a: Array[Double] = Array(2.1, 3.4, 5.6)
b: Array[Double] = Array(4.4, 7.8, 6.7)
scala> val sub = (_: Double) - (_: Double) //defined it as function, not method
sub: (Double, Double) => Double = <function2>
scala> a zip b map sub.tupled
res20: Array[Double] = Array(2.3000000000000003, 4.4, 1.1000000000000005)
*sub.tupled allows the sub function to receive a tuple of two parameters instead of two separate parameters here.
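As a side note, the same zip/map pattern can be written directly over the original Array-of-Arrays shape; a small sketch (instanceArrays is just an illustrative name, and it assumes exactly two inner arrays of equal length):
val instanceArrays = Array(Array(2.1, 3.4, 5.6), Array(4.4, 7.8, 6.7))
// transpose pairs up corresponding elements, then each pair is reduced to a difference
val diff: Array[Double] = instanceArrays.transpose.map { case Array(a, b) => b - a }
// Array(2.3000000000000003, 4.4, 1.1000000000000005)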

Resources