I am using a third-party package for Spark that utilizes a "PointFeature" object. I am trying to take a csv file and put elements from each row into an Array of these PointFeature objects.
The PointFeature constructor for my implementation looks like this:
Feature(Point( _c1, _c2), _c3)
where _c1, _c2, and _c3 are the columns of my csv and represent doubles.
Here is my current attempt:
val points: Array[PointFeature[Double]] = for{
line <- sc.textFile("file.csv")
point <- Feature(Point(line._c1,line._c2),line._c3)
} yield point
My error shows up when referencing the columns
<console>:36: error: value _c1 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c2 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
<console>:36: error: value _c3 is not a member of String
point <- Feature(Point(line._c1,line._c2),line._c3.toDouble)
^
This is obviously because I'm referencing a String as if it were an element of a DataFrame. I'm wondering if there is a way to work with DataFrames in this loop format, or a way to split each line into a List of doubles. Maybe I need an RDD?
Also, I'm not certain that this will yield an Array. Actually, I suspect it will return an RDD...
I am using Spark 2.1.0 on Amazon EMR.
Here are some other questions I have drawn from:
How to read csv file into an Array of arrays in scala
Splitting strings in Apache Spark using Scala
How to iterate records spark scala?
You could set up a Dataset[Feature] this way:
import org.apache.spark.sql.types.DoubleType
import sparkSession.implicits._   // needed for the 'x symbol syntax and .as[Feature]

case class Feature(x: Double, y: Double, z: Double)

sparkSession.read.csv("file.csv")
  .toDF("x", "y", "z")
  .withColumn("x", 'x.cast(DoubleType))
  .withColumn("y", 'y.cast(DoubleType))
  .withColumn("z", 'z.cast(DoubleType))
  .as[Feature]
Then you can consume your strongly-typed Dataset[Feature] as you see fit.
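For instance, since the original question asks for an Array, one way to consume it is to collect it on the driver. A minimal sketch, assuming the expression above is assigned to a val named featureDS (a name introduced here just for illustration):
// featureDS = the Dataset[Feature] built above (hypothetical binding)
val features: Array[Feature] = featureDS.collect()   // materializes the typed rows locally
features.take(3).foreach(println)                     // quick sanity check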
I suggest taking this on in smaller steps.
Step One
Get your rows as an Array/List/whatever of Strings.
val lines = sc.textFile("file.csv"), which gives you your rows as an RDD[String], or something like that.
Step Two
Break your lines into their own lists of columns.
val splits = lines.map {l => l.split(",")}
Step Three
Extract your columns as vals that you can use:
val res = splits.map {
  case Array(col1, col2, col3) => ??? // convert to doubles, put into the Feature/Point structure
  case _ => ??? // handle the case where your csv is malformed
}
This can all be done in one go; I only split it up to show the logical steps from file → list of strings → list of lists of strings → list of Features.
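For reference, a rough one-go sketch, assuming the third-party Feature/Point constructors behave as described in the question (untested against that library):
val points: Array[PointFeature[Double]] = sc.textFile("file.csv")
  .map(_.split(","))
  .flatMap {
    case Array(c1, c2, c3) => Some(Feature(Point(c1.toDouble, c2.toDouble), c3.toDouble)) // well-formed row
    case _                 => None                                                        // drop malformed rows
  }
  .collect()   // brings the distributed result back to the driver as an Array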
Related
Basically, I’m given a list of strings such as:
["structA.structB.myArr[6].myVar",
"structB.myArr1[4].myArr2[2].myVar",
"structC.myArr1[3][4].myVar",
"structA.myArr1[4]",
"structA.myVar"]
These strings describe variables/arrays from multiple structs. The integers in the brackets describe the size of each array. Given that a string has one or more arrays (1d or 2d), I want to generate a list of strings which go through each index combination in the arrays for that string. I thought of using for loops, but the issue is that I don't know how many arrays are in a given string before running the script. So I couldn't do something like
for i in range(0, idx1):
    for j in range(0, idx2):
        for k in range(0, idx3):
            arr.append("structA.myArr1[%i][%i].myArr[%i]" % (i, j, k))
but I don't know how to create a dynamic number of nested for loops based on how many indices there are, or how to create a dynamic append statement that changes for each string from the original list, since each string will have a different number of indices and the arrays will be in different locations of the string.
I was able to write a regex to find all the indices for each string in my list of strings:
indexArr = re.findall(r'\[(.*?)\]', myString)
# after looping, indexArr = [['6'], ['4', '2'], ['3', '4'], ['4']]
However, I'm really stuck on how to achieve the "dynamic for loops" or use recursion for this. I want my final list of strings to look like:
[
["structA.structB.myArr[0].myVar",
"structA.structB.myArr[1].myVar",
...
"structA.structB.myArr[5].myVar"],
["structB.myArr1[0].myArr2[0].myVar",
"structB.myArr1[0].myArr2[1].myVar",
"structB.myArr1[1].myArr2[0].myVar",
...
"structB.myArr1[3].myArr2[1].myVar"],
["structC.myArr1[0][0].myVar",
"structC.myArr1[0][1].myVar",
...
"structC.myArr1[2][3].myVar"],
["structA.myArr1[0]",
...
"structA.myArr1[3]"],
["structA.myVar"]  # this will only contain 1 string since there were no arrays
]
I am really stuck on this; any help is appreciated. Thank you so much.
The key is to use itertools.product to generate all possible combinations of a set of ranges and substitute them as array indices of an appropriately constructed string template.
import itertools
import re
def expand(code):
    p = re.compile(r'\[(.*?)\]')
    ranges = [range(int(s)) for s in p.findall(code)]    # one range per bracketed size
    template = p.sub("[{}]", code)                        # e.g. "structA.structB.myArr[{}].myVar"
    result = [template.format(*s) for s in itertools.product(*ranges)]
    return result
The result of expand("structA.structB.myArr[6].myVar") is
['structA.structB.myArr[0].myVar',
'structA.structB.myArr[1].myVar',
'structA.structB.myArr[2].myVar',
'structA.structB.myArr[3].myVar',
'structA.structB.myArr[4].myVar',
'structA.structB.myArr[5].myVar']
and expand("structB.myArr1[4].myArr2[2].myVar") is
['structB.myArr1[0].myArr2[0].myVar',
'structB.myArr1[0].myArr2[1].myVar',
'structB.myArr1[1].myArr2[0].myVar',
'structB.myArr1[1].myArr2[1].myVar',
'structB.myArr1[2].myArr2[0].myVar',
'structB.myArr1[2].myArr2[1].myVar',
'structB.myArr1[3].myArr2[0].myVar',
'structB.myArr1[3].myArr2[1].myVar']
and the corner case expand("structA.myVar") naturally works to produce
['structA.myVar']
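To build the full nested list from the question, you can then just map expand over the original strings — a small sketch:
# apply expand to every string in the original list
strings = ["structA.structB.myArr[6].myVar",
           "structB.myArr1[4].myArr2[2].myVar",
           "structC.myArr1[3][4].myVar",
           "structA.myArr1[4]",
           "structA.myVar"]
expanded = [expand(s) for s in strings]   # a list of lists, one sub-list per input string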
I am using a list of integers corresponding to the x,y indices of a gridded NetCDF array to extract specific values; the initial code was derived from here. My NetCDF file has a single variable, named TMAX2M, at a single timestep. My code written to execute this is as follows (please note that I have not shown the import of netCDF4 at the top of the script):
# grid point lists
lat = [914]
lon = [2141]
# Open netCDF File
fh = Dataset('/pathtofile/temperaturedataset.nc', mode='r')
# Variable Extraction
point_list = zip(lat,lon)
dataset_list = []
for i, j in point_list:
    dataset_list.append(fh.variables['TMAX2M'][i, j])
print(dataset_list)
The code executes, and the result is as follows:
masked_array(data=73, mask=False, fill_value=999999, dtype=int16)
The data value here is correct; however, I would like the output to contain only the integer stored in "data". The goal is to pass a number of x,y points, as seen in the example linked above, and join them into a single list.
Any suggestions on what to add to the code to make this achievable would be great.
The particular value from the x,y list at a single timestep within the dataset can be called as follows:
dataset_list = []
for i, j in point_list:
    dataset_list.append(fh.variables['TMAX2M'][:][i, j])
The previously linked example used [0,16] for the indexed variable; [:] can be used in this case.
I suggest converting to a NumPy array like this:
import numpy as np

for i, j in point_list:
    dataset_list.append(np.array(fh.variables['TMAX2M'][i, j]))
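If you only want the plain integers rather than masked-array/NumPy scalars, a minimal sketch (assuming the same fh and point_list as above) would be:
# collect plain Python ints instead of masked-array scalars
dataset_list = []
for i, j in point_list:
    value = fh.variables['TMAX2M'][i, j]
    dataset_list.append(int(value))   # int(...) or value.item() unwraps the scalar
print(dataset_list)                   # e.g. [73]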
I have two arrays of columns:
arrayColumns1: org.apache.spark.sql.Column = array("col1","col2")
arrayColumns2: org.apache.spark.sql.Column = array("col1","col2")
Both seem equal, but they come from different sources.
arrayColumns1 is from a conversion of Array("col1","col2") to an array, using this function:
def asLitArray[T](xs: Seq[T]) = array(xs map lit: _*)
arrayColumns2 is from writing the array expression out directly.
Now when I try to use arrayColumns1 as input to a UDF:
.withColumn("udfFunction", udfFunction(arrayColumns1))
where
val udfFunction = udf(
  { xs: Seq[Double] =>
    DO_SOMETHING
    (output)
  }
)
It throws me this error:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(col1,col2))' due to data type mismatch: argument 1 requires array<double> type, however, 'array('col1','col2')' is of array<string> type.;;
But when I use arrayColumns2 it works fine. What did I do wrong?
I'm using Spark 2.1 with Scala 2.11.
It does not make much sense to pass an array of literals to a UDF, because what you want to pass are the names of the columns, not literal values. The asLitArray version fails because you are creating columns of type string (lit("col1") is a literal column whose content is the string "col1"; it does not reference the column col1).
I would do it like this:
def asColArray(xs: Seq[String]) = array((xs.map(x => col(x))): _*)
val arrayColumns = asColArray(Array("col1","col2"))
df.withColumn("udfFunction",udfFunction(arrayColumns))
If you really want to use literal values, you would need to do something like this:
val arrayColumns = asLitArray(Array(1.0,2.0))
But this would give you a constant output from your UDF.
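Putting it together, a rough sketch using your df and its double columns col1 and col2, with a placeholder UDF body (the sum below just stands in for DO_SOMETHING):
import org.apache.spark.sql.functions.{array, col, udf}

def asColArray(xs: Seq[String]) = array((xs.map(x => col(x))): _*)

val udfFunction = udf { xs: Seq[Double] => xs.sum }   // placeholder logic

val result = df.withColumn("udfFunction", udfFunction(asColArray(Seq("col1", "col2"))))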
I just want to solve the following problem: I want to filter out all rows of a data frame in which the string contained in one column appears in a blacklist, which is given as a (potentially empty) array of strings.
For example: if the blacklist contains "fourty two" and "twenty three", all rows are filtered out from the dataframe in which the respective column contains either "fourty two" or "twenty three".
The following code will successfully execute if the blacklist is not empty (for example Array("fourty two")) and fail otherwise (Array.empty[String]):
//HELPERs
val containsStringUDF = udf(containsString(_: mutable.WrappedArray[String], _: String))
def containsString(array: mutable.WrappedArray[String], value: String) = {array.contains(value)}
def arrayCol[T](arr: Array[T]) = {array(arr map lit: _*)}
df.filter(!containsStringUDF(arrayCol[String](blacklist),$"theStringColumn"))
The error message is:
org.apache.spark.sql.AnalysisException: cannot resolve 'UDF(array(), theStringColumn)' due to data type mismatch: argument 1 requires array<string> type, however, 'array()' is of array<null> type
It seems that empty arrays appear typeless to Spark. Is there a nice way to deal with this?
You are overthinking the problem. What you really need here is isin:
val blacklist = Seq("foo", "bar")
$"theStringColumn".isin(blacklist: _*)
Moreover, don't depend on the local type for ArrayType being a WrappedArray. Just use Seq.
Finally, to answer your question, you can use either:
array().cast("array<string>")
or:
import org.apache.spark.sql.types.{ArrayType, StringType}
array().cast(ArrayType(StringType))
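Plugged into your original UDF-based approach, that cast would replace the untyped empty array — a sketch reusing the helpers from your question:
val typedBlacklist =
  if (blacklist.isEmpty) array().cast("array<string>")   // give the empty array an element type
  else arrayCol[String](blacklist)

df.filter(!containsStringUDF(typedBlacklist, $"theStringColumn"))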
I am trying to carry out the intersection of two arrays in MATLAB, but I cannot find the way.
The arrays that I want to intersect are:
and
I have tried: [dur, itimes, inewtimes] = intersect(array2, char(array1));
but no luck.
However, if I try to intersect array1 with array3 (see array3 below), [dur, itimes, inewtimes] = intersect(array3, char(array1)); the intersection is performed without any error.
Why can I not intersect array1 with array2? How could I do it? Thank you.
Just for ease of reading: your formats for the arrays are different, and you want to make them the same. There are many options for you. As @Visser suggested, you could convert the date/time into a long integer, which allows faster computation; or you can keep them as strings; or even convert them into character arrays (like what you have done with char(array1)).
This is my example:
A = {'00:00:00';'00:01:01'} % type is a cell array of strings
Z = ['00:00:00';'00:01:01'] % type is a char array
Q = {{'00:00:00'};{'00:01:01'}} % type is a cell of cells
A = cellstr(A) % converting a cell string to a cell string essentially does nothing
Z = cellstr(Z) % convert the char array to a cell string
Q = vertcat(Q{:,:}) % convert the cell of cells to a cell of strings
I = intersect (A,Z)
>>'00:00:00'
'00:01:01'
II = intersect (A,Q)
>>'00:00:00'
'00:01:01'
This keeps your dates as strings in case you want to export them back into a txt/csv file.
Your first array would look something like this:
array1 = linspace(0,1,86400); % creates 86400 seconds in 1 day
Your second array should be converted using datenum; then use cell2mat to make it a matrix. Lastly, use ismember to find the intersection:
InterSect = ismember(array2,array1);