I have the following code:
case class event(imei: String, date: String, gpsdt: String, entrygpsdt: String,lastgpsdt: String)
object recalculate extends Serializable {
def main(args: Array[String]) {
val sc = SparkContext.getOrCreate(conf)
val rdd = sc.cassandraTable("db", "table").select("imei", "date", "gpsdt").where("imei=? and date=? and gpsdt>? and gpsdt<?", entry(0), entry(1), entry(2), entry(3))
var lastgpsdt = "2018-04-06 10:10:10"
var updatedValues = new Array[event](rdd.count().toInt)
var index = 0
rdd.foreach(f => {
val imei = f.get[String]("imei")
val date = f.get[String]("date")
val gpsdt = f.get[String]("gpsdt")
updatedValues(index) = new event(imei, date, gpsdt,lastgpsdt)
println(updatedValues(index).toString())
index = index + 1
lastgpsdt = gpsdt
})
println("updates values are " + updatedValues.toString())
}}
I'm trying to build an array of event instances, storing one value per iteration, and then access the array outside the foreach block. When I access the array afterwards I get a NullPointerException, and on inspection the array contains no values. I declared the array as a var, so why can't I read what was written outside the block? Suggestions please, thanks.
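If you want an Array[event], filling a driver-side array from inside rdd.foreach is not the right approach: the closure is serialized and shipped to the executors, so each task updates its own copy of updatedValues and index, and the driver's array is never written to.
Here is an alternative: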
case class event(imei: String, date: String, gpsdt: String,
entrygpsdt: String,lastgpsdt: String)
val result = rdd.map(row => {
val imei = row.getString(0)
val date = row.getString(1)
val gpsdt = row.getString(2)
//create case class as you want
event(imei, date, gpsdt, lastgpsdt ,"2018-04-06 10:10:10")
})
.collect()
The result is an Array[event].
Use collect only when the data is small enough to fit in the driver's memory.
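The loop in the question also carried the previous row's gpsdt forward as lastgpsdt, which the map above replaces with a constant. A minimal sketch of that step on plain Scala collections, assuming the rows have already been collected to the driver and are ordered by gpsdt; the Event class (four fields, for brevity) and the sample rows are hypothetical:

```scala
// Hypothetical stand-in for the question's case class, reduced to four fields.
case class Event(imei: String, date: String, gpsdt: String, lastgpsdt: String)

// Assumed sample rows as (imei, date, gpsdt), already collected and ordered by gpsdt.
val rows = Seq(
  ("A1", "2018-04-06", "2018-04-06 10:11:00"),
  ("A1", "2018-04-06", "2018-04-06 10:12:00"),
  ("A1", "2018-04-06", "2018-04-06 10:13:00")
)

// Seed value used for the first row, taken from the question.
val initialLast = "2018-04-06 10:10:10"

// Previous gpsdt for each row: the seed first, then every gpsdt except the final one.
val previous = initialLast +: rows.map(_._3).dropRight(1)

// Pair each row with its predecessor's gpsdt instead of mutating a shared var.
val events = rows.zip(previous).map {
  case ((imei, date, gpsdt), last) => Event(imei, date, gpsdt, last)
}
```

Because nothing is mutated, the result does not depend on where or in what order the closure runs.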
Hope this helps!
I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (in parquet, but shown in a JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
case class:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
.select("conditions")
.show(2)
Tried the following UDF function.
UDF code:
def updateArray = udf((arr: Seq[MyItem]) => {
for (i <- 0 to arr.size - 1) {
// Line 3
val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
val a = arr(i)
println(a.getAs[MyItem](0))
// TODO: How to make code = "XXXX" here
// a.code = "XXXX"
}
arr
})
Goal:
I need to set 'code' field value in each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
Also I get the following error if remove the line 3 in the UDF (cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: How to capture Array of Structs in a function and how to return a modified array of items?
Welcome to Stackoverflow!
There is a small json linting error in your data: I assumed that you wanted to close the [] square brackets of the list array. So, for this example I used the following data (which is the same as yours):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map{
case (MyClass(MyList(list))) => MyClass(MyList(list.map{
case (MyElement(item)) => MyElement(MyItem(code = "XXXX", item.category))
}))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you can see, we're pattern matching on the case classes we've defined and replacing the value of every code field with "XXXX". If you want JSON back, you can call the to_json function like so:
import org.apache.spark.sql.functions.to_json
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally a very small remark about the data. If you have any control over how the data gets made, I would add the following suggestions:
The conditions JSON object seems to serve no purpose here, since it just contains a single array called list. Consider making conditions itself the array, which would let you drop the list name and simplify your structure.
The element object does nothing except wrap a single item. Consider removing that level of nesting too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need of the MyElement and the MyList case classes! But very often we're not in control over what data we receive so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map{
case MyClass(conditions) => MyClass(conditions.map{
item => MyItem("XXXX", item.category)
})
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+
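The function passed to map is plain Scala, so it can be checked without a Spark session. A minimal sketch using copy on the same two case classes; the helper name maskCodes is hypothetical:

```scala
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])

// Hypothetical helper: returns a new record with every code replaced by "XXXX".
// copy leaves all other fields (here, category) untouched.
def maskCodes(record: MyClass): MyClass =
  record.copy(conditions = record.conditions.map(_.copy(code = "XXXX")))

val masked = maskCodes(MyClass(Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
```

Using copy rather than rebuilding MyItem by position means the helper keeps working if more fields are added to the case class later.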
I found a simpler solution for Spark 3.1+, which combines the transform function with Column.withField.
Updated code:
val data = Seq(
MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)
import org.apache.spark.sql.functions.{col, transform, udf}
import spark.implicits._
val ds = data.toDF()
val updatedDS = ds.withColumn(
"conditions",
transform(
col("conditions"),
x => x.withField("code", updateArray(x.getField("code")))))
updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
if(oldVal.contains("1234"))
"XXX"
else
oldVal
})
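The body of this UDF is an ordinary function on strings, so its logic can be tested outside Spark. A minimal sketch, assuming null codes should pass through unchanged; the name maskCode is hypothetical and the null guard is an addition that the UDF above does not have:

```scala
// Hypothetical helper mirroring the UDF body, with an added null guard.
def maskCode(oldVal: String): String =
  Option(oldVal) match {
    case Some(v) if v.contains("1234") => "XXX"
    case _                             => oldVal // null or non-matching values pass through
  }
```

Without the guard, a row whose code is null would make oldVal.contains throw a NullPointerException inside the UDF.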
I am trying to filter the array 'employee_name', which consists of NaNs and one string element, so that only the string remains. The context is that I have a spreadsheet containing employees' birth dates, and I'm sending an email notification when there's a birthday two days from today. My variables look like this:
var ss = SpreadsheetApp.getActiveSpreadsheet().getSheetByName('Employees');
var range = ss.getRange(2, 1, ss.getLastRow()-1, 1); // column containing the birth dates
var birthdates = range.getValues(); // get the `values` of birth date column
var today = new Date ();
var today = new Date(today.getTime());
var secondDate = new Date(today.getTime() + 48 * 60 * 60 * 1000);
var employee_name = new Array(birthdates.length-1);
And the loop:
for (var i=0;i<=birthdates.length-1;i=i+1){
var fDate = new Date(birthdates[i][0]);
if (fDate.getDate() == secondDate.getDate() &&
fDate.getMonth() == secondDate.getMonth()){
//define variables for outgoing email
for (var j=0; j<=birthdates.length-1;j=j+1){
employee_name[j] = [NaN];
}
employee_name[i] = ss.getRange(i+2,6);
employee_name[i] = employee_name[i].getValues();
}
}
after which the array in question looks like this
Logger.log(employee_name);
[[[Mia-Angelica]], [NaN], [NaN], [NaN], ..., [NaN]]
I have already tried the filter(Boolean), but this isn't working:
employee_name_filtered = employee_name.filter(Boolean);
Logger.log(employee_name_filtered);
returns [[[Mia-Angelica]], [NaN], [NaN], [NaN], ..., [NaN]].
I have also tried filling the non-string array entries with numeric values (instead of NaN) and then apply
employee_name_filtered = employee_name.filter(isFinite);
Logger.log(employee_name_filtered);
returns [[1.0], [2.0], [3.0], ..., [72.0]], so this filter method is working, but then I would need the 'inverse' of that because I want to keep the string.
I need the array within array to store the values at the position of the counter variable where the condition's met (similar to How to store data in Array using For loop in Google apps script - pass array by value).
This is my first time posting a question on SO, so if I overlooked any 'rules' about posting, just let me know and I will provide additional info.
Any help will be appreciated!
EDIT:
what I would like to receive in the end is simply
[[Mia-Angelica]].
The array you are using is a two-dimensional array, meaning it's an array of arrays, so filter cannot be applied to it the way it is applied to a flat array: every element is itself an array, and an array is always truthy.
For this, I suggest you try the snippet below.
function cleanArray() {
var initialArray = [
['Mia-Angelica'],
['Space'],
['2'],
[NaN],
[NaN],
[NaN],
[NaN]
];
var finalArray = [];
for (let i = 0; i < initialArray.length; i++) {
// Drop NaN entries from each inner array; strings and numbers are kept.
var midArray = initialArray[i].filter(item => !Number.isNaN(item));
finalArray.push(midArray);
}
// Discard the inner arrays that are now empty.
console.log(finalArray.filter(item => item.length > 0));
}
Note
Please bear in mind that getValues will return an Object[][] which is a two-dimensional array of values.
Reference
Apps Script Range Class;
Array.prototype.filter().
I'm trying to persist the result of a quiz, which is shown in a table view cell, and I need to store the array of answers given (a String array), so I decided to use RealmSwift.
I created this class, and in the same file I also created a RealmString object so that arrays of String can be persisted in Realm:
class RealmString: Object {
dynamic var stringValue = ""
}
class Test: Object {
@objc dynamic var ID = UUID().uuidString
@objc dynamic var testScore : String = String()
@objc dynamic var testTitle : String = String()
@objc dynamic var testSubTitle : String = String()
@objc dynamic var dateOfExecution: String = String()
@objc dynamic var answersGiven: [String] {
get {
return _backingAnswersGiven.map { $0.stringValue }
}
set {
_backingAnswersGiven.removeAll()
_backingAnswersGiven.append(objectsIn: (newValue.map({ RealmString(value: [$0]) })))
}
}
let _backingAnswersGiven = List<RealmString>()
override static func ignoredProperties() -> [String] {
return ["answersGiven"]
}
override static func primaryKey() -> String? {
return "ID"
}
}
Now in the view controller:
I have a variable that stores the result (is an Int array that will take ten answers with values from 0 to 5, and these will later be converted to String)
i.e.: [0,2,2,3,4,5,2,1,0,2] -> ["0","2","2","3","4","5","2","1","0","2"]
and when an option is selected in a question the value is set with this function, everything works fine.
public var questionResults: [Int] = []
func setValueToQuestion(questionNumber: Int) {
questionResults[questionNumber] = optionChosen
}
When the test is completed successfully everything is saved in this way:
let test = Test()
test.ID = currentTest?.ID ?? UUID().uuidString
test.testTitle = testTitleLabel.text!
test.testScore = resultNumberLabel.text!
test.testSubTitle = resultLabel.text!
test.dateOfExecution = dateTimeString
test.answersGiven = questionResults.map({String($0)})
DBManager.sharedInstance.addData(object: test)
I tested the code in isolation, also adding breakpoints, and everything in the flow works except this line:
test.answersGiven = questionResults.map({String($0)})
that raises the error shown in the title: "Invalid array input: more values (1) than properties (0)."
I guess it can be an error of mapping maybe?
This value is then treated in the rest of flow as a simple swift array of String = [String]
There are a few issues which may be leading to that error.
First, the RealmString property is not persisted because it needs @objc:
dynamic var stringValue = ""
should be
@objc dynamic var stringValue = ""
Secondly, and this is important, Realm does not support primitives in Lists. Well, it kinda does but not very well.
EDIT: Release 10.7 added support for filters/queries as well as aggregate functions on primitives so the below info is no longer completely valid. However, it's still something to be aware of.
See my answer to this question but in a nutshell, you need another class to store the string in - kind of like your RealmString class.
class StringClass: Object {
@objc dynamic var myString = ""
}
and then change the Test object property to use the StringClass property
@objc dynamic var answersGiven = [StringClass]()
I also see you're using a backing var with a computed property, but I am not sure why. It may be simpler to use the var itself
let _backingAnswersGiven = List<RealmString>()
since the List collection already handles what's being computed.
For example, you can set the list to another list (which wipes out the current contents), and reading it with let myStringList = test._backingAnswersGiven returns all of the StringClass objects in the list without having to .map over them.
uri:23e6b806-7a39-4836-bae2-f369673defef offset:1
uri:z65e9d4e-a099-41a1-a9fe-3cf74xbb01a4 offset:2
uri:2beff8bf-1019-4265-9da4-30c696397e08 offset:3
uri:3b1df8bb-69f6-4892-a516-523fd285d659 offset:4
uri:4f961415-b847-4d2c-9107-87617671c47b offset:5
uri:015ba25c-c145-456a-bae7-ebe999cb8e0f offset:6
uri:z1f9592f-64d0-443d-ad0d-38c386dd0adb offset:7
The above is an array of arrays.
Each line is an element of the outer array, and each element is itself an array. I produced it by splitting each line on the comma, which removes the comma. What I am trying to do is extract only the uri and offset values and apply them to a case class.
case class output2(uri: String, offset: Int)
All I want is the actual values, so each instance of the case class, the uri and offset would be in the below format -
e1af5db7-3aad-4ab0-ac3a-55686fccf6ae
1
I'm trying to find a simple way to do this.
No need to split() each line on the comma. Make the comma part of the recognized input pattern.
val data = Array("uri:23e6b806-7a39-4836-bae2-f369673defef,offset:1"
,"uri:z65e9d4e-a099-41a1-a9fe-3cf74xbb01a4,offset:2"
,"poorly formatted data will be ignored"
,"uri:2beff8bf-1019-4265-9da4-30c696397e08,offset:3"
,"uri:3b1df8bb-69f6-4892-a516-523fd285d659,offset:4"
,"uri:4f961415-b847-4d2c-9107-87617671c47b,offset:5"
,"uri:015ba25c-c145-456a-bae7-ebe999cb8e0f,offset:6"
,"uri:z1f9592f-64d0-443d-ad0d-38c386dd0adb,offset:7")
case class Data(uri:String, offset:Int)
val dataRE = "uri:([^,]+),offset:(\\d+)".r
val rslt:Array[Data] = data.collect{case dataRE(uri, os) => Data(uri, os.toInt)}
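The same collect-with-a-regex approach also works if the fields are separated by whitespace, as the lines are displayed in the question. A minimal sketch under that assumption; the names Output2, lineRE and the sample input are hypothetical:

```scala
// Hypothetical case class, capitalised version of the question's output2.
case class Output2(uri: String, offset: Int)

// \S+ captures the uri up to the first whitespace; \d+ captures the offset digits.
val lineRE = """uri:(\S+)\s+offset:(\d+)""".r

val input = Seq(
  "uri:23e6b806-7a39-4836-bae2-f369673defef offset:1",
  "not a record",
  "uri:2beff8bf-1019-4265-9da4-30c696397e08 offset:3"
)

// A regex used as an extractor must match the whole string,
// so lines that do not fit the pattern are silently skipped by collect.
val parsed = input.collect { case lineRE(uri, off) => Output2(uri, off.toInt) }
```

Whether skipping malformed lines is acceptable depends on the data; counting the dropped lines (input.size - parsed.size) is a cheap sanity check.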
You can build your data while validating the GUID with a regex like:
val regexp = """uri:(\b[0-9a-f]{8}\b-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-\b[0-9a-f]{12}\b) offset:([0-9]+)""".r
val regexp(pattern, value) = "uri:23e6b806-7a39-4836-bae2-f369673defef offset:1"
output2(pattern, value.toInt)
I'd do it this way:
case class Output(uri: String, offset: Int)
import scala.io.Source
val lines = Source
.fromFile("input")
.getLines
.toList
def parseUri(s: String): Option[String] =
s.splitAt(s.indexOf(":") + 1)._2 match {
case "" => None
case uri => Some(uri)
}
def parseOffset(s: String): Option[Int] =
s.splitAt(s.indexOf(":") + 1)._2 match {
case "" => None
case offset => Some(offset.toInt)
}
def parseOutput(xs: Array[String]): Option[Output] = for {
uri <- parseUri(xs(0))
offset <- parseOffset(xs(1))
} yield {
Output(uri, offset)
}
lines.map(_.split(" ")).flatMap { x =>
parseOutput(x)
}
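One gap in the parser above is that offset.toInt throws on a non-numeric value, so a single bad line aborts the whole run instead of yielding None. A minimal sketch of a stricter variant, assuming the field must be literally named offset; this is a swapped-in alternative, not the code from the answer:

```scala
import scala.util.Try

// Alternative parseOffset: checks the key and turns a bad number into None.
def parseOffset(s: String): Option[Int] =
  s.split(":", 2) match {
    case Array("offset", value) => Try(value.toInt).toOption
    case _                      => None // wrong key or no colon at all
  }
```

The limit of 2 in split keeps a trailing empty value, so "offset:" reaches the Try and becomes None rather than falling through on array length.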
I'm trying to get a slice of an Array as Seq avoiding copy. I can make use of toSeq method.
val array = Array[AnyRef](
new Integer(1),
new Integer(2),
new Integer(3),
new Integer(4),
new Integer(5)
)
val seq = array.toSeq
array(1) = null
println(seq.mkString(",")) //1,null,3,4,5
It works fine: Ideone Live example. The array was not copied. But when I try to slice it
val array = Array[AnyRef](
new Integer(1),
new Integer(2),
new Integer(3),
new Integer(4),
new Integer(5)
)
val seq = array.toSeq.slice(0, 3)
array(1) = null
println(seq.mkString(",")) //1,2,3
As can be seen the copy is made: Ideone Live Example. I am trying to avoid it. Is there a way to do so in Scala?
Here is the code:
val a = (0 to 10).toArray
val b = a.toSeq.view.slice(1, 9)
a(5) = 12345
b.mkString(",") // res5: String = 1,2,3,4,12345,6,7,8
And here is a quote from Jurassic Park:
"Your scientists were so preoccupied with whether or not they could that they didn't stop to think if they should."
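One caveat about the view-based snippet above, stated as an assumption about the Scala version in use: on Scala 2.13 and later, Array.toSeq copies the array into an immutable sequence, so a view built on top of it no longer sees later writes to the array. A minimal sketch that takes the view directly from the array, which should behave the same on 2.12 and 2.13; note that the result is a view, not a Seq:

```scala
val a = (0 to 10).toArray

// View over the array itself: no elements are copied, reads go to `a`.
val window = a.view.slice(1, 9)

// Mutate the backing array after the view has been created.
a(5) = 12345

val rendered = window.mkString(",")
```

Calling toSeq or toList on the view would materialise a copy at that point, which is exactly what the question is trying to avoid.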