Scala java.lang.ClassCastException during randomSplit - arrays

In the code below, cohort_counts_4 is a DataFrame with three columns: g, samplingRate and samplingRate1. In the rowDF variable I collect the columns samplingRate and samplingRate1 (which are percentages), and in the percentages variable I convert them to Array[Double].
When I run this, I get the runtime error below while building percentages. I need it to be Array[Double] because I have to call randomSplit in the next step.
java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.lang.Double.
Please let me know your thoughts.
Sample data of percentages:
percentages: Array[Seq[Double]] =
Array(WrappedArray(0.06449504858964898, 0.9355049514103511)
, WrappedArray(0.015861918725594032, 0.9841380812744059)
, WrappedArray(0.22082241578907924, 0.7791775842109208)
, WrappedArray(0.14416119376185044, 0.8558388062381496)
, WrappedArray(0.10958692395592619, 0.8904130760440738)
, WrappedArray(1.0, 0.0)
, WrappedArray(0.6531128743810083, 0.3468871256189917)
, WrappedArray(0.04880653326943304, 0.9511934667305669))
val cohortList = cohort_counts_4.select("g").collect().map(_(0)).toList
var cohort_list = new ListBuffer[org.apache.spark.sql.DataFrame]()
var total_rows: Int = 0
for (igroupid <- cohortList) {
  val sample_rate = cohort_counts_4.filter(col("g") === igroupid).select("samplingRate", "samplingRate1")
  cohort_list += sample_rate
  val curr_rows = sample_rate.count().toInt
  total_rows += curr_rows
}
val customers_new = cohort_list.reduce(_ union _)
val rowDF = customers_new.select(array(customers_new.columns.map(col):_*) as "row")
var percentages =Array(rowDF.collect.map(_(0)).asInstanceOf[Double])
// var percentages = rowDF.collect.map(_.getSeq[Double](0))
val uni = customers_2.select("x","g").distinct
.randomSplit(percentages)

I changed the code from
var percentages = Array(rowDF.collect.map(_(0)).asInstanceOf[Double])
to the following
var percentages = rowDF.collect.map(_(0).asInstanceOf[Seq[Double]]).flatten
and it worked.
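For reference, a minimal sketch of the working flow: the flatten is needed because each collected Row holds one Seq[Double] (hence the original cast of the whole Row value to Double failed), while randomSplit expects a flat Array[Double]. randomSplit also normalizes the weights if they do not sum to 1.
val percentages = rowDF.collect().map(_.getSeq[Double](0)).flatten
val splits = customers_2.select("x", "g").distinct.randomSplit(percentages)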

Related

Python: Replace a column in a dataframe by datetime values

I'm trying to replace a column in an array of 4 columns with datetime values that I have already converted. The problem is that it's difficult to keep the same form between the different formats (DataFrame, array, ...).
dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:, 0, 0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
time = ds.variables["time"][:]
time = pd.to_datetime(ds.variables["time"][:], origin=pd.Timestamp('1850-01-01'), unit='D')
# np.datetime64(ds.variables["time"][:], 'D')
x2 = pd.DataFrame(np.zeros((len(dataw), 4), float))
x = np.zeros((len(dataw), 4), float)
x[:, 0] = time
x[:, 1] = long
x[:, 2] = lat[:]
x[:, 3] = dataw[:] * 86400
x = pd.DataFrame(x)
x[:, 0] = pd.to_datetime(time, origin=pd.Timestamp('1850-01-01'), unit='D')
If I put the transformed dates directly into the array, the result looks like 1.32542e+18.
I tried
time = ds.variables["time"][:]
and included it in the array, and then used
x[:, 0] = pd.to_datetime(x[:, 0], origin=pd.Timestamp('1850-01-01'), unit='D')
but I get the error:
TypeError: unhashable type: 'slice'
I also tried directly:
time = pd.to_datetime(time, origin=pd.Timestamp('1850-01-01'), unit='D')
x[:, 0] = time[:]
TypeError: unhashable type: 'slice'
Try this instead:
import numpy as np
import pandas as pd

dataw = ds.variables["pr"][:]
dataw = np.array(dataw[:, 0, 0])
lat = ds.variables["lat"][:]
long = ds.variables["lon"][:]
# decode days-since-1850 into real datetimes
time = pd.to_datetime(ds.variables["time"][:], origin=pd.Timestamp('1850-01-01'), unit='D')
# use an object array so the datetime column can coexist with the float columns
x = np.zeros((len(dataw), 4), dtype=object)
x[:, 0] = time
x[:, 1] = long
x[:, 2] = lat
x[:, 3] = dataw * 86400
df = pd.DataFrame(x, columns=["Time", "Longitude", "Latitude", "Data"])
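As an aside, the TypeError: unhashable type: 'slice' in the question comes from applying NumPy-style indexing to a pandas DataFrame: once x = pd.DataFrame(x) has run, x[:, 0] treats the tuple (slice, 0) as a column key. Positional access on a DataFrame goes through .iloc, and whole columns are best assigned by label; a small sketch (the column label 0 comes from the default RangeIndex):
first_col = x.iloc[:, 0]  # positional read must go through .iloc on a DataFrame
x[0] = pd.to_datetime(ds.variables["time"][:], origin=pd.Timestamp('1850-01-01'), unit='D')  # replace column 0 by label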
Xarray makes the netcdf->pandas workflow quite straightforward:
import xarray as xr
ds = xr.open_dataset('file.nc', engine='netcdf4')
df = ds.to_dataframe()
Presuming your time variable follows CF conventions, xarray will automatically decode it into datetime objects.
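A short sketch of that workflow with the variables from the question (assuming the same file, that pr has time/lat/lon dimensions, and a CF-encoded time coordinate):
import xarray as xr

ds = xr.open_dataset('file.nc', engine='netcdf4')
# select the same grid point as dataw[:, 0, 0] and flatten to a table
df = ds['pr'].isel(lat=0, lon=0).to_dataframe().reset_index()
df['pr'] = df['pr'] * 86400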

Create array of "deep" struct (scalar) fields

How can I collapse the values of "deep" struct fields into arrays by just indexing?
In the example below, I can only do it for the "top-most" level, and for "deeper" levels I get the error:
"Expected one output from a curly brace or dot indexing expression, but there were XXX results."
The only workaround I found so far is to unfold the operation into several steps, but the deeper the structure the uglier this gets...
clc; clear variables;
% Dummy data
my_struc.points(1).fieldA = 100;
my_struc.points(2).fieldA = 200;
my_struc.points(3).fieldA = 300;
my_struc.points(1).fieldB.subfieldM = 10;
my_struc.points(2).fieldB.subfieldM = 20;
my_struc.points(3).fieldB.subfieldM = 30;
my_struc.points(1).fieldC.subfieldN.subsubfieldZ = 1;
my_struc.points(2).fieldC.subfieldN.subsubfieldZ = 2;
my_struc.points(3).fieldC.subfieldN.subsubfieldZ = 3;
my_struc.info = 'Note my_struc has other fields besides "points"';
% Get all fieldA values by just indexing (this works):
all_fieldA_values = [my_struc.points(:).fieldA]
% Get all subfieldM values by just indexing (doesn't work):
% all_subfieldM_values = [my_struc.points(:).fieldB.subfieldM]
% Ugly workaround:
temp_array_of_structs = [my_struc.points(:).fieldB];
all_subfieldM_values = [temp_array_of_structs.subfieldM]
% Get all subsubfieldZ values by just indexing (doesn't work):
% all_subsubfieldZ_values = [my_struc.points(:).fieldC.subfieldN.subsubfieldZ]
% Ugly workaround:
temp_array_of_structs1 = [my_struc.points(:).fieldC];
temp_array_of_structs2 = [temp_array_of_structs1.subfieldN];
all_subsubfieldZ_values = [temp_array_of_structs2.subsubfieldZ]
Output:
all_fieldA_values =
100 200 300
all_subfieldM_values =
10 20 30
all_subsubfieldZ_values =
1 2 3
Thanks for any help!
You can use arrayfun to get access to each individual 'point' and then access its data. This returns an array with the same dimensions as my_struc.points:
all_subfieldM_values = arrayfun(@(in) in.fieldB.subfieldM, my_struc.points)
all_subsubfieldZ_values = arrayfun(@(in) in.fieldC.subfieldN.subsubfieldZ, my_struc.points)
Not optimal, but at least it's one line.
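If a deep field held non-scalar values, the same pattern still applies with 'UniformOutput' set to false, which collects the results into a cell array instead (a variant sketch, not part of the original answer):
% each cell holds one point's subfieldM value, whatever its size
all_subfieldM_cells = arrayfun(@(in) in.fieldB.subfieldM, my_struc.points, 'UniformOutput', false)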

Creating a Random Feature Array in Spark DataFrames

When creating an ALS model, we can extract a userFactors DataFrame and an itemFactors DataFrame. These DataFrames contain a column with an Array.
I would like to generate some random data and union it to the userFactors DataFrame.
Here is my code:
val df1: DataFrame = Seq((123, 456, 4.0), (123, 789, 5.0), (234, 456, 4.5), (234, 789, 1.0)).toDF("user", "item", "rating")
val model1 = (new ALS()
.setImplicitPrefs(true)
.fit(df1))
val iF = model1.itemFactors
val uF = model1.userFactors
I then create a random DataFrame using a VectorAssembler with this function:
def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select("user", "userFeatures")
}
I then create the DataFrame with new user IDs and add the random vectors and bias:
val usersDf: DataFrame = Seq((567), (678)).toDF("user")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)
The problem arises when I union the two DataFrames.
usersFactorsNew.union(uF) produces the error:
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<type:tinyint,size:int,indices:array<int>,values:array<double>> <> array<float> at the second column of the second table;;
If I print the schema, the uF DataFrame has a feature vector of type Array[Float] and the usersFactorsNew DataFrame has a feature vector of type Vector.
My question is how to change the type of the Vector to an Array in order to perform the union.
I tried writing this udf with little success:
val toArr: org.apache.spark.ml.linalg.Vector => Array[Double] = _.toArray
val toArrUdf = udf(toArr)
Perhaps the VectorAssembler is not the best option for this task. However, at the moment, it is the only option I have found. I would love to get some recommendations for something better.
Instead of creating a dummy dataframe and using a VectorAssembler to generate a random feature vector, you can simply use a UDF directly. The userFactors from the ALS model will return an Array[Float], so the output from the UDF should match that.
import scala.util.Random
import org.apache.spark.sql.functions.{lit, udf}

val createRandomArray = udf((rank: Int) => {
  Array.fill(rank)(Random.nextFloat)
})
Note that this will give numbers in the interval [0.0, 1.0) (same as the rand() used in the question); if other numbers are required, adjust as needed.
Using a rank of 3 and the usersDf:
val usersFactorsNew = usersDf.withColumn("userFeatures", createRandomArray(lit(3)))
will give a dataframe as follows (of course with random feature values)
+----+----------------------------------------------------------+
|user|userFeatures |
+----+----------------------------------------------------------+
|567 |[0.6866711267486822,0.7257031656127676,0.983562255688249] |
|678 |[0.7013908820314967,0.41029552817665327,0.554591149586789]|
+----+----------------------------------------------------------+
Joining this dataframe with the uF dataframe should now be possible.
The reason the UDF did not work is probably that it returns an Array[Double] while you need an Array[Float] for the union. It should be possible to fix with a map(_.toFloat):
val toArr: org.apache.spark.ml.linalg.Vector => Array[Float] = _.toArray.map(_.toFloat)
val toArrUdf = udf(toArr)
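Applied before the union, a sketch reusing the names from the question (col comes from org.apache.spark.sql.functions, as elsewhere in this thread):
val usersFactorsArr = usersFactorsNew.select(col("user"), toArrUdf(col("userFeatures")).as("features"))
val combined = usersFactorsArr.union(uF)  // schemas now match by position: (int, array<float>)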
Your process is correct, and even the udf function works. All you need to do is change the last part of the makeNew function as follows:
def makeNew(df: DataFrame, rank: Int): DataFrame = {
  var df_dummy = df
  var i: Int = 0
  var inputCols: Array[String] = Array()
  for (i <- 0 to rank) {
    df_dummy = df_dummy.withColumn("feature".concat(i.toString), rand())
    inputCols = inputCols :+ "feature".concat(i.toString)
  }
  val assembler = new VectorAssembler()
    .setInputCols(inputCols)
    .setOutputCol("userFeatures")
  val output = assembler.transform(df_dummy)
  output.select(col("id"), toArrUdf(col("userFeatures")).as("features"))
}
and you should be good to go, so that when you do the following (I created usersDf with an id column rather than a user column)
val usersDf: DataFrame = Seq((567), (678)).toDF("id")
var usersFactorsNew: DataFrame = makeNew(usersDf, 20)
usersFactorsNew.union(uF).show(false)
you should be getting
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |features |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|567|[0.8259185719733708, 0.327713892339658, 0.049547223031371046, 0.056661808506210054, 0.5846626163454274, 0.038497936270104005, 0.8970865088803417, 0.8840660648882804, 0.837866669938156, 0.9395263094918058, 0.09179528484355126, 0.4915430644129799, 0.11083447052043116, 0.5122858182953718, 0.4302683812966408, 0.3862741815833828, 0.6189322403095068, 0.3000371006293433, 0.09331299668168902, 0.7421838728601371, 0.855867963988993]|
|678|[0.7686514248005568, 0.5473580740023187, 0.072945344124282, 0.36648594574355287, 0.9780202082328863, 0.5289221651923784, 0.3719451099963028, 0.2824660794505932, 0.4873197501260199, 0.9364676464120849, 0.011539929543513794, 0.5240615794930654, 0.6282546154521298, 0.995256022569878, 0.6659179561266975, 0.8990775317754092, 0.08650071017556926, 0.5190186149992805, 0.056345335742325475, 0.6465357505620791, 0.17913532817943245] |
|123|[0.04177388548851013, 0.26762014627456665, -0.19617630541324615, 0.34298020601272583, 0.19632814824581146, -0.2748605012893677, 0.07724890112876892, 0.4277132749557495, 0.1927199512720108, -0.40271613001823425] |
|234|[0.04139673709869385, 0.26520395278930664, -0.19440513849258423, 0.3398836553096771, 0.1945556253194809, -0.27237895131111145, 0.07655145972967148, 0.42385169863700867, 0.19098000228405, -0.39908021688461304] |
+---+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Google sheets results of Google function query to new range

I am trying to loop through dates starting from a selected date, going back four weeks at a time, for 12 periods.
I am doing something wrong in the arrays used to push the data to the Results sheet, and it is not working.
Thanks in advance for your help.
function loopDates() {
  var sss = SpreadsheetApp.getActiveSpreadsheet();
  var ss = sss.getSheetByName('Main');
  var results = sss.getSheetByName('Results');
  // gets values from Main sheet to calculate new Date
  var firstYear = ss.getRange('Main!C3').getValue();
  var firstMonth = ss.getRange('Main!E3').getValue();
  var firstDay = ss.getRange('Main!G3').getValue();
  var target = new Array(); // this is a new array to collect data
  var headers = []; // inner array
  var periods = 13;
  var counter = 0;
  // loop thru 12 dates and get just the dates for the header
  for (var i = 0; i < periods - 1; i++) {
    counter = counter + 1;
    // yearValue, monthValue, day: calculate new Date
    var date = [firstYear, firstMonth, firstDay];
    var d = new Date(firstYear, firstMonth, firstDay); // must have commas; month numbers start at 0
    var numberDays = 28;
    var countBack = 2 * numberDays * 24 * 60 * 60 * 1000 + 3 * 24 * 60 * 60 * 1000;
    // sign must be '+' since 'offset' is subtracted
    var NewDate = new Date(d.getTime() - countBack);
    ss.getRange('C5').setValue(new Date(NewDate)).setNumberFormat('mm dd yyyy');
    headers.push([NewDate[i]]); // inner array
  }
  target.push(headers); // outer array
  // Write to Results sheet
  var ss = sss.getSheetByName('Main').getRange('A1').setValue(counter);
  results.getRange("A1").offset(0, 1, target.length).setValue(target);
  return target;
}
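A hedged sketch of a corrected version (assuming the intent is one header row of 12 dates, each four weeks further back than the last, written to the Results sheet). The main fixes: push the Date object itself instead of NewDate[i], step back i periods rather than a fixed offset, and write with setValues, which expects a 2-D array:
function loopDatesFixed() {
  var sss = SpreadsheetApp.getActiveSpreadsheet();
  var main = sss.getSheetByName('Main');
  var results = sss.getSheetByName('Results');
  var start = new Date(main.getRange('C3').getValue(),
                       main.getRange('E3').getValue(),
                       main.getRange('G3').getValue());
  var headers = [];
  for (var i = 0; i < 12; i++) {
    // step back i periods of 28 days from the start date
    headers.push(new Date(start.getTime() - i * 28 * 24 * 60 * 60 * 1000));
  }
  // one row of 12 dates, starting in column B
  results.getRange(1, 2, 1, headers.length).setValues([headers]);
}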

How to use a self-made type in F#?

I made a type, but I don't know how to use it properly and I couldn't find any solution on Google.
type Sample =
{
TrackPosition : int
TubePosition : int
Barcode : string
}
let arraySamples = Array.create Scenario.Samples.NumberOfSamples **Sample**
BarcodeGenerieren.Samples.Sample
let mutable trackPosition = Scenario.Samples.StartTrackPositions
let mutable index = 1
for i in 1 .. Scenario.Samples.NumberOfSamples do
    let randomNumber = System.Random().Next(0,9999)
    if index > 24 then
        trackPosition <- trackPosition + 1
        index <- 1
    arraySamples.[index] <- **new Sample{TrackPosition= trackPosition, TubePosition = index, Barcode = sprintf "100%s%06d" ((trackPosition + 1) - Scenario.Samples.StartTrackPositions) randomNumber}**
So my question is, what should I changed so that it works, when I will give the type of the array and when I will give the sample with data to the array?
You have created what is referred to as a record type. You can initialise it with the following syntax
{TrackPosition = 0;TubePosition = 0;Barcode = "string"}
your syntax in the last line is almost correct - it should be
arraySamples.[index] <-
    { TrackPosition = trackPosition;
      TubePosition = index;
      Barcode = sprintf "100%d%06d" ((trackPosition + 1) - Scenario.Samples.StartTrackPositions) randomNumber }
The changes are:
eliminate new (and the Sample prefix; the field names already identify the record type)
replace , with ;
change %s to %d in the sprintf, since the first format argument is an int
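Putting it together, a hedged sketch of the whole loop, keeping the names from the question. Indexing by the loop counter i (rather than index, which resets every 24 samples) and a single shared Random instance are substitutions of mine; constructing System.Random inside a tight loop can repeat values:
let rng = System.Random()
let arraySamples : Sample[] = Array.zeroCreate Scenario.Samples.NumberOfSamples
let mutable trackPosition = Scenario.Samples.StartTrackPositions
let mutable index = 1
for i in 0 .. Scenario.Samples.NumberOfSamples - 1 do
    if index > 24 then
        trackPosition <- trackPosition + 1
        index <- 1
    arraySamples.[i] <-
        { TrackPosition = trackPosition
          TubePosition = index
          Barcode = sprintf "100%d%06d" ((trackPosition + 1) - Scenario.Samples.StartTrackPositions) (rng.Next(0, 9999)) }
    index <- index + 1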
