Splitting or breaking up multidimensional arrays in Scala Spark along attributes

var date_columns = df.dtypes.filter(_._2 == "TimestampType")
This creates an array of (name, type) tuples containing only the timestamp-type column names along with their data types:
Array[(String, String)] = Array((cutoffdate,TimestampType), (wrk_pkg_start_date,TimestampType), (wrk_pkg_end_date,TimestampType))
Now, how do I split this array so that only the column names are in an array?
dateColumns = [ cutoffdate , wrk_pkg_start_date , wrk_pkg_end_date ]
I need this in Scala Spark, without using for loops please.

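Just use collect for that:
var date_columns = df.dtypes.collect{ case (name, "TimestampType") => name }
collect takes a partial function, so it filters the array by pattern matching and maps the matching elements in a single pass. Note that this is the Scala collections collect on the local array returned by df.dtypes, not Spark's DataFrame collect.
See the Scala documentation for details.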

Related

Get the index of the last occurrence of each string in an array

I have an array that is storing a large number of various names in string format. There can be duplicates.
let myArray = ["Jim","Tristan","Robert","Lexi","Michael","Robert","Jim"]
In this case I do NOT know what values will be in the array after grabbing the data from a parse server. So the data imported will be different every time. Just a list of random names.
Assuming I don't know all the strings in the array I need to find the index of the last occurrence of each string in the array.
Example:
If this is my array....
let myArray = ["john","john","blake","robert","john","blake"]
I want the last index of each occurrence so...
blake = 5
john = 4
robert = 3
What is the best way to do this in Swift?
Normally I would make a variable for each possible item in the array, then iterate through the array and count the items. In this case, though, there are thousands of items in the array and their values are unknown.
Create an array with elements and their indices:
zip(myArray, myArray.indices)
then reduce into a dictionary where keys are array elements and values are indices:
let result = zip(myArray, myArray.indices).reduce(into: [:]) { dict, tuple in
    dict[tuple.0] = tuple.1
}
(myArray.enumerated() returns offsets, not indices, but it would also have worked here in place of zip, since Array has zero-based Int indices.)
EDIT: The Dictionary(_:uniquingKeysWith:) approach (Jessy's answer) is a cleaner way to do it.
New Dev's answer is the way to go. Except, the standard library already has a solution that does that, so use that instead.
Dictionary(
    ["john", "john", "blake", "robert", "john", "blake"]
        .enumerated()
        .map { ($0.element, $0.offset) }
) { $1 }
Or if you've already got a collection elsewhere…
Dictionary(zip(collection, collection.indices)) { $1 }
Just for fun, the one-liner, and likely the shortest, solution (brevity over clarity, or was it the other way around? :P)
myArray.enumerated().reduce(into: [:]) { $0[$1.0] = $1.1 }

Recursively apply a function to elements of an array spark dataFrame

I wrote the following function which concatenates two strings and adds them in a dataframe new column:
def idCol(firstCol: String, secondCol: String, IdCol: String = FUNCTIONAL_ID): DataFrame = {
  df.withColumn(IdCol,concat(col(firstCol),lit("."),col(secondCol))).dropDuplicates(IdCol)
}
My aim is to replace the separate strings with one array of strings, and then define the new column as the concatenation of the elements of that array. I am using an array on purpose, to have a mutable data collection in case the number of elements to concatenate changes.
Do you have any idea how to do this?
So the function would be changed to:
def idCol(cols:Array[String], IdCol: String = FUNCTIONAL_ID): DataFrame = {
  df.withColumn(IdCol,concat(col(cols(0)),lit("."),col(cols(1)))).dropDuplicates(IdCol)
}
I want to get rid of cols(0) and cols(1) and do a generic transformation that takes all elements of the array and separates them with the character ".".
You can use concat_ws which has the following definition:
def concat_ws(sep: String, exprs: Column*): Column
You need to convert your column names from String to Column:
import org.apache.spark.sql.functions._
def idCol(cols:Array[String], IdCol: String = FUNCTIONAL_ID): DataFrame = {
  val concatCols = cols.map(col(_))
  df.withColumn(IdCol, concat_ws(".", concatCols : _*) ).dropDuplicates(IdCol)
}

ValueError: setting an array element with a sequence for incorporating word2vec model in my pandas dataframe

I am getting a "ValueError: setting an array element with a sequence." error when I try to run my random forest classifier on heterogeneous data. The text data is fed to a word2vec model, and I extract a one-dimensional numpy array for each text row by taking the mean of the word2vec vectors of the words in that row.
Here is a sample of the data I am working with:
col-A col-B ..... col-z
100 230 ...... [0.016612869501113892, -0.04279713928699493, .....]
where col-z is a numpy array with a fixed size of 300 in each row.
The following code calculates the mean of the word2vec vectors and creates the numpy arrays:
final_data = []
for i, row in df.iterrows():
    text_vectorized = []
    text = row['col-z']
    for word in text:
        try:
            text_vectorized.append(list(w2v_model[word]))
        except Exception as e:
            pass
    try:
        text_vectorized = np.asarray(text_vectorized, dtype='object')
        text_vectorized_mean = list(np.mean(text_vectorized, axis=0))
    except Exception as e:
        text_vectorized_mean = list(np.zeros(100))
        pass
    try:
        len(text_vectorized_mean)
    except:
        text_vectorized_mean = list(np.zeros(100))
    temp_row = np.asarray(text_vectorized_mean, dtype='object')
    final_data.append(temp_row)
text_array = np.asarray(final_data, dtype='object')
After this, I convert text_array to pandas dataframe and concatenate it with my original dataframe with other numeric columns. But as soon as I try to feed this data into a classifier, it gives me the above error at this line:
--> array = np.array(array, dtype=dtype, order=order, copy=copy)
Why am I getting this error?
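You are trying to create an array from a mixed list containing both numeric values and another list. Try flattening the array first using .ravel(). Since final_data is a plain Python list, which has no .ravel() method, convert it to an array before flattening.
For example,
text_array = np.asarray(final_data, dtype='object').ravel()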

How can I assign the values of one array to a multiple dimension array?

I am creating a multidimensional array and I need to assign the column names to it, but I keep getting the error:
Cannot convert the value of type String to expected argument of type [String]
I am new to Swift and don't really know what to do, so here is my code:
var data = [[[String]]]()
var rows = 3
var columns = 3
var column_names = ["Red", "Blue", "Green", "Orange"]
var index1 = 0
for index1 in 0...columns{
    data[index1] = column_names[index1]
}
The code var data = [[[String]]]() creates an array of arrays of arrays. You need 3 indexes if you want to be able to insert a string into it.
Assuming you only want a 2-dimensional array, you might use code like this instead:
var data = [[String]]()
var column_names = ["Red", "Blue", "Green", "Orange"]
let rows = 3
let columns = column_names.count
let empty_row = Array(repeating: "", count: columns)
data.append(column_names)
for _ in 1 ..< rows {
    data.append(empty_row)
}
print(data)
In the code above we create an empty 2 dimensional array. We then add an array of column names, followed by rows of empty strings.
Swift doesn't actually have a native n-dimensional array type. Instead, you create arrays that contain other arrays. Thus it's possible to create "jagged" arrays where the different sub-arrays have differing numbers of elements. In your case I'm assuming you want a 2-dimensional array with 3 rows of 4 columns, so that's what the code I wrote above creates.

Using fill with multi dim arrays

Is it possible to use fill to insert an array into an array of tuples in Ruby?
For example, I am trying to combine the following two arrays using zip, and then plan on transposing them. I am trying the following:
column_name_tuples = [["foo"], ["bar"]]
column_label_tuples = [["Foo Bar"]]
column_label_tuples.fill(column_name_tuples.size..column_label_tuples.size - 1) { [nil] }
This results in column labels being filled as follows
[["Foo Bar"], nil]
When in fact I need it to be filled like this so I can do a transpose afterwards
[["Foo Bar"], [nil]]
You can do it like so:
column_label_tuples.fill([nil], column_label_tuples.size,
  column_name_tuples.size - column_label_tuples.size)
  #=> now [["Foo Bar"], [nil]]
which reduces to:
column_label_tuples.fill([nil], 1, 2-1)
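To make the argument order concrete, here is a minimal sketch of the same padding using the block form of fill, which takes a start index and a length and calls the block once per slot. The helper name pad_with_nil_rows and the third name "baz" are made up for this illustration and are not part of the question or the answer above.

```ruby
# Hypothetical helper (not from the original answer): pads `rows` in place
# with [nil] rows until it has `target_size` elements, then returns it.
def pad_with_nil_rows(rows, target_size)
  missing = target_size - rows.size
  return rows unless missing > 0

  # Block form: fill(start, length) { ... }. The block runs once per slot,
  # so every new slot gets its own [nil] array instead of one shared object,
  # which is what fill([nil], start, length) would store in each slot.
  rows.fill(rows.size, missing) { [nil] }
end

column_name_tuples  = [["foo"], ["bar"], ["baz"]] # "baz" is made up
column_label_tuples = [["Foo Bar"]]

pad_with_nil_rows(column_label_tuples, column_name_tuples.size)
p column_label_tuples
#=> [["Foo Bar"], [nil], [nil]]

# Both arrays now have the same length, so they can be paired up.
p column_name_tuples.zip(column_label_tuples)
#=> [[["foo"], ["Foo Bar"]], [["bar"], [nil]], [["baz"], [nil]]]
```

With a single missing slot the two forms of fill give the same result. The difference only matters when several slots are filled and one of the filler arrays is mutated later.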

Resources