Recursively apply a function to elements of an array spark dataFrame - arrays

I wrote the following function, which concatenates two string columns and stores the result in a new DataFrame column:
def idCol(firstCol: String, secondCol: String, IdCol: String = FUNCTIONAL_ID): DataFrame = {
  df.withColumn(IdCol, concat(col(firstCol), lit("."), col(secondCol))).dropDuplicates(IdCol)
}
My aim is to replace the separate string parameters with one array of strings, and then define the new column as the concatenation of the elements of that array. I am using an array on purpose, so that the collection can change if the number of elements to concatenate changes.
Do you have any idea how to do this?
So far the function has been changed to:
def idCol(cols: Array[String], IdCol: String = FUNCTIONAL_ID): DataFrame = {
  df.withColumn(IdCol, concat(col(cols(0)), lit("."), col(cols(1)))).dropDuplicates(IdCol)
}
I want to avoid the hard-coded cols(0), cols(1) and write a generic transformation that takes all elements of the array and separates them with the character ".".

You can use concat_ws, which has the following signature:
def concat_ws(sep: String, exprs: Column*): Column
You need to convert your column names, which are Strings, to the Column type:
import org.apache.spark.sql.functions._
def idCol(cols: Array[String], IdCol: String = FUNCTIONAL_ID): DataFrame = {
  val concatCols = cols.map(col(_))
  df.withColumn(IdCol, concat_ws(".", concatCols: _*)).dropDuplicates(IdCol)
}
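To try the concat_ws approach end to end, here is a minimal sketch. Several things in it are assumptions and not part of the original post: the local SparkSession, the sample columns c1, c2, c3, the placeholder name FunctionalId (the real FUNCTIONAL_ID constant is not shown in the question), and passing the DataFrame in explicitly instead of closing over an outer df.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

object IdColSketch {
  // Hypothetical default; stands in for the FUNCTIONAL_ID constant from the question.
  val FunctionalId = "functional_id"

  // Builds the id column from any number of column names, joined by ".",
  // then drops rows that share the same id.
  def idCol(df: DataFrame, cols: Seq[String], idColName: String = FunctionalId): DataFrame =
    df.withColumn(idColName, concat_ws(".", cols.map(col(_)): _*))
      .dropDuplicates(idColName)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("idCol-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data: the first two rows produce the same id.
    val df = Seq(("a", "b", "c"), ("a", "b", "c"), ("x", "y", "z")).toDF("c1", "c2", "c3")
    idCol(df, Seq("c1", "c2", "c3")).show()

    spark.stop()
  }
}
```

One difference from the original concat-based version worth keeping in mind: concat returns null if any input is null, while concat_ws skips null inputs, so rows with a missing part still get an id built from the remaining parts.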

Related

Splitting or Breakup multidimensional arrays in scala spark along attributes

var date_columns = df.dtypes.filter(_._2 == "TimestampType")
This creates an array of (name, type) pairs containing only the timestamp-type columns, each with its data type:
Array[(String, String)] = Array((cutoffdate,TimestampType), (wrk_pkg_start_date,TimestampType), (wrk_pkg_end_date,TimestampType))
Now, how do I split this array so that only the column names end up in an array, like
dateColumns = [ cutoffdate , wrk_pkg_start_date , wrk_pkg_end_date ]
in Scala Spark? Without using for loops, please.
Just use collect for that:
var date_columns = df.dtypes.collect{ case (name, "TimestampType") => name }
collect filters an array with pattern matching and maps the matching elements in one step.
See the Scala documentation.
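A small sketch of the same idea on plain Scala collections, so it can be run without Spark. The dtypes array below is made-up sample data standing in for df.dtypes, and timestampColumns is a hypothetical helper name.

```scala
object CollectSketch {
  // collect applies a partial function: elements that match the pattern are
  // kept and mapped to the column name, everything else is dropped.
  def timestampColumns(dtypes: Array[(String, String)]): Array[String] =
    dtypes.collect { case (name, "TimestampType") => name }

  def main(args: Array[String]): Unit = {
    // Stand-in for df.dtypes, which is an Array of (column name, type name) pairs.
    val dtypes: Array[(String, String)] = Array(
      ("id", "IntegerType"),
      ("cutoffdate", "TimestampType"),
      ("wrk_pkg_start_date", "TimestampType"),
      ("label", "StringType")
    )

    val viaCollect = timestampColumns(dtypes)

    // Equivalent two-step version: filter on the type, then keep the name.
    val viaFilterMap = dtypes.filter(_._2 == "TimestampType").map(_._1)

    println(viaCollect.mkString(", ")) // prints: cutoffdate, wrk_pkg_start_date
    assert(viaCollect.sameElements(viaFilterMap))
  }
}
```

If you already have the filtered array from the question, appending .map(_._1) to it gives the same result.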

Groovy Convert string data to map

I have a large String which I want to convert to a Map in groovy.
The String data is an array of key-value pairs. Each key and each value is enclosed in square brackets [], and the pairs are separated by commas. Full data string here: https://pastebin.com/raw/4rBWRzMs
Some of the values can be empty e.g. '[]' or a list of values containing , and : characters e.g.
[1BLL:220,1BLE:641,2BLL:871,2BLE:475,SW:10029,KL:0,KD:78,ODT:148,AVB:358]
I only want to split on these characters if they are not enclosed in square brackets [].
The code I have tried is below, but it breaks when a value is a list of values. Is there a better way? Thanks.
String testData="[[DEVICE_PROVISIONED]: [1], [aaudio.hw_burst_min_usec]: [2000],[debug.hwui.use_buffer_age]: [false], [ro.boot.boottime][1BLL:220,1BLE:641,2BLL:871,2BLE:475,SW:10029,KL:0,KD:78,ODT:148,AVB:358], ro.boot.hardware]: [walleye],[dev.mnt.blk.postinstall]: [],[ro.boot.usbradioflag]: [0], [ro.boot.vbmeta.avb_version]: [1.0], [ro.boot.vbmeta.device]: [/dev/sda18], [ro.boot.vbmeta.device_state]: [unlocked]]"
def map = [:]
testData.replaceAll('\\[]','null').replaceAll("\\s","").replaceAll('\\[','').replaceAll(']','').split(",").each { param ->
    def nameAndValue = param.split(":")
    map[nameAndValue[0]] = nameAndValue[1]
}
I'd grep the key-value tuples from that format with a regex and build a map from there. Once this is done, it's easier to deal with further transformations. E.g.
def testData="[DEVICE_PROVISIONED]: [1], [aaudio.hw_burst_min_usec]: [2000],[debug.hwui.use_buffer_age]: [false], [ro.boot.boottime]: [1BLL:220,1BLE:641,2BLL:871,2BLE:475,SW:10029,KL:0,KD:78,ODT:148,AVB:358], [ro.boot.hardware]: [walleye],[dev.mnt.blk.postinstall]: [],[ro.boot.usbradioflag]: [0], [ro.boot.vbmeta.avb_version]: [1.0], [ro.boot.vbmeta.device]: [/dev/sda18], [ro.boot.vbmeta.device_state]: [unlocked]"
def map = [:]
(testData =~ /\s*\[(.*?)\]\s*:\s*\[(.*?)\]\s*,?\s*/).findAll{ _, k, v ->
    map.put(k,v)
}
println map.inspect()
// → ['DEVICE_PROVISIONED':'1', 'aaudio.hw_burst_min_usec':'2000', 'debug.hwui.use_buffer_age':'false', 'ro.boot.boottime':'1BLL:220,1BLE:641,2BLL:871,2BLE:475,SW:10029,KL:0,KD:78,ODT:148,AVB:358', 'ro.boot.hardware':'walleye', 'dev.mnt.blk.postinstall':'', 'ro.boot.usbradioflag':'0', 'ro.boot.vbmeta.avb_version':'1.0', 'ro.boot.vbmeta.device':'/dev/sda18', 'ro.boot.vbmeta.device_state':'unlocked']
Note that I have fixed some syntax errors in the testData and removed the outer []. If the original testData really does contain syntax that violates the rules given, then this will not work.

ValueError: setting an array element with a sequence for incorporating word2vec model in my pandas dataframe

I am getting a "ValueError: setting an array element with a sequence." error when I try to run my random forest classifier on heterogeneous data. The text data is fed to a word2vec model, and I extract a one-dimensional numpy array by taking the mean of the word2vec vectors of the words in each text row.
Here is a sample of the data I am working with:
col-A col-B ..... col-z
100 230 ...... [0.016612869501113892, -0.04279713928699493, .....]
where col-z is a numpy array with a fixed size of 300 in each row.
The following code calculates the mean of the word2vec vectors and creates the numpy arrays:
final_data = []
for i, row in df.iterrows():
    text_vectorized = []
    text = row['col-z']
    for word in text:
        try:
            text_vectorized.append(list(w2v_model[word]))
        except Exception as e:
            pass
    try:
        text_vectorized = np.asarray(text_vectorized, dtype='object')
        text_vectorized_mean = list(np.mean(text_vectorized, axis=0))
    except Exception as e:
        text_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(text_vectorized_mean)
    except:
        text_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(text_vectorized_mean, dtype='object')
    final_data.append(temp_row)
text_array = np.asarray(final_data, dtype='object')
After this, I convert text_array to pandas dataframe and concatenate it with my original dataframe with other numeric columns. But as soon as I try to feed this data into a classifier, it gives me the above error at this line:
--> array = np.array(array, dtype=dtype, order=order, copy=copy)
Why am I getting this error?
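You are trying to create an array from a mixed list containing both numeric values and another list. Try flattening the array first using .ravel(). Note that final_data is a plain Python list, which has no .ravel() method, so convert it to an array before flattening.
For example,
text_array = np.asarray(final_data, dtype='object').ravel()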

Convert case class constructor parameters to String Array in Scala

I have a case class as follows:
case class MHealthUser(acc_Chest_X: Double, acc_Chest_Y: Double, acc_Chest_Z: Double, activityLabel: Int)
These form the schema of a Spark DataFrame, which is why I'm using a case class. I simply want to map these to an Array[String] so I can use the ParamValidators.inArray(attributes) method in Spark. I use the following code to map the constructor parameters to an array using reflection:
val attributes: Array[String] = MHealthUser.getClass.getConstructors.map(a => a.toString)
but this simply gives me an array of length 1 whereas I want an array of length 4, with the contents of the array being the dataset schema which I've defined, as a string. Otherwise I'm using the hard-coded values of the dataset schema, which is obviously inelegant.
In other words I want the output:
val attributes: Array[String] = Array("acc_Chest_X", "acc_Chest_Y", "acc_Chest_Z", "activityLabel")
I've been playing with this for a while and can't get it to work. Any ideas appreciated. Thanks!
I'd use ScalaReflection:
import org.apache.spark.sql.catalyst.ScalaReflection
import org.apache.spark.sql.types.StructType
ScalaReflection.schemaFor[MHealthUser].dataType match {
case s: StructType => s.fieldNames
case _ => Array[String]()
}
Outside Spark, see the question "Scala. Get field names list from case class".
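As a possible alternative that stays on Spark's public API (the catalyst package is generally treated as internal), the field names can also be read from the schema of a product encoder. This is a sketch under the assumption that the case class is defined at the top level, so that a TypeTag is available; FieldNamesSketch is a hypothetical wrapper object.

```scala
import org.apache.spark.sql.Encoders

case class MHealthUser(acc_Chest_X: Double, acc_Chest_Y: Double, acc_Chest_Z: Double, activityLabel: Int)

object FieldNamesSketch {
  // Encoders.product derives an encoder for the case class; its schema is a
  // StructType whose field names are the constructor parameter names.
  val attributes: Array[String] = Encoders.product[MHealthUser].schema.fieldNames

  def main(args: Array[String]): Unit =
    println(attributes.mkString(", "))
}
```

The resulting array can then be passed to ParamValidators.inArray(attributes) as in the question.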

Storing multiple values from one text field into an array

I have one text field, but I want to store the multiple values entered in that text field (e.g. 1,2,3,4) in an array. So far, all it does is store everything as one element that still has the commas. How can I get rid of the commas and store each value separately?
You can use the global split function, which works on any Sequence (including String).
If you want to separate on commas only:
let array = split("x,y,z") { $0 == "," }
If you want to separate on either commas or spaces:
let array = split("x, y z") { contains(", ", $0) }
You can use the string method componentsSeparatedByString(separator: String) -> [String]
For example:
let example = "1,2,3,4"
let elements = example.componentsSeparatedByString(",") // elements is an array of Strings.
Or, in Objective-C:
NSArray *valueArr=[[yourTextfield stringValue] componentsSeparatedByString:@","];

Resources