I use a MapStateDescriptor for my stateful computation. Here is the relevant code:
final val myMap = new MapStateDescriptor[String, List[String]]("myMap", classOf[String], classOf[List[String]])
During my computation I want to update my map by adding new elements to the List[String].
Is it possible?
Update #1
I have written the following def to manage my map:
def updateTagsMapState(mapKey: String, tagId: String, mapToUpdate: MapState[String, List[String]]): Unit = {
  if (mapToUpdate.contains(mapKey)) {
    val mapValues: List[String] = mapToUpdate.get(mapKey)
    val updatedMapValues: List[String] = tagId :: mapValues
    mapToUpdate.put(mapKey, updatedMapValues)
  } else {
    mapToUpdate.put(mapKey, List(tagId))
  }
}
Sure, it is. Depending on whether you are using a Scala List or a Java one there, you can do something like this to actually create the state from the descriptor:
lazy val stateMap = getRuntimeContext.getMapState(myMap)
Then you can simply do:
val list = Option(stateMap.get("someKey")).getOrElse(Nil) // get returns null if the key is absent
stateMap.put("someKey", "SomeVal" :: list)
Note that if you were working with a mutable data structure, you wouldn't necessarily need to call put again, since updating the data structure would also update the state. This does not hold for the RocksDB state backend, however, where the state is only persisted when you call put, so it is always advisable to update the state itself rather than just the underlying object.
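For completeness, here is a minimal sketch of how the descriptor from the question, the lazy state handle, and an update step could fit together inside a KeyedProcessFunction. The event type and its fields (MyEvent, key, tagId) are made up purely for illustration:

import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical event type, only for illustration.
case class MyEvent(key: String, tagId: String)

class TagCollector extends KeyedProcessFunction[String, MyEvent, (String, List[String])] {

  private val myMap = new MapStateDescriptor[String, List[String]](
    "myMap", classOf[String], classOf[List[String]])

  // Lazily obtain the state handle from the runtime context.
  private lazy val stateMap: MapState[String, List[String]] =
    getRuntimeContext.getMapState(myMap)

  override def processElement(
      event: MyEvent,
      ctx: KeyedProcessFunction[String, MyEvent, (String, List[String])]#Context,
      out: Collector[(String, List[String])]): Unit = {
    // Prepend the new tag, creating the list on first use (get returns null for a missing key).
    val current = Option(stateMap.get(event.key)).getOrElse(Nil)
    stateMap.put(event.key, event.tagId :: current)
    out.collect((event.key, stateMap.get(event.key)))
  }
}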
Related
I am writing a Spark 3 UDF to mask an attribute in an Array field.
My data (stored in Parquet, but shown here in JSON format):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
Case classes:
case class MyClass(conditions: Seq[MyItem])
case class MyItem(code: String, category: String)
Spark code:
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
import spark.implicits._
val rdd = spark.sparkContext.parallelize(data)
val ds = rdd.toDF().as[MyClass]
val maskedConditions: Column = updateArray.apply(col("conditions"))
ds.withColumn("conditions", maskedConditions)
.select("conditions")
.show(2)
I tried the following UDF:
def updateArray = udf((arr: Seq[MyItem]) => {
  for (i <- 0 to arr.size - 1) {
    // Line 3: the cast to GenericRowWithSchema
    val a = arr(i).asInstanceOf[org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema]
    // val a = arr(i) // without the cast above, this throws the ClassCastException shown below
    println(a.getAs[MyItem](0))
    // TODO: How to make code = "XXXX" here
    // a.code = "XXXX"
  }
  arr
})
Goal:
I need to set the 'code' field value in each array item to "XXXX" in a UDF.
Issue:
I am unable to modify the array fields.
I also get the following error if I remove line 3 in the UDF (the cast to GenericRowWithSchema).
Error:
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to MyItem
Question: how do I receive an array of structs in a function, and how do I return a modified array of items?
Welcome to Stack Overflow!
There is a small JSON linting error in your data: I assumed that you wanted to close the [] square brackets of the list array. So for this example I used the following data (which is the same as yours):
{"conditions":{"list":[{"element":{"code":"1234","category":"ABC"}},{"element":{"code":"4550","category":"EDC"}}]}}
You don't need UDFs for this: a simple map operation will be sufficient! The following code does what you want:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyElement(element: MyItem)
case class MyList(list: Seq[MyElement])
case class MyClass(conditions: MyList)
val df = spark.read.json("./someData.json").as[MyClass]
val transformedDF = df.map {
  case MyClass(MyList(list)) => MyClass(MyList(list.map {
    case MyElement(item) => MyElement(MyItem(code = "XXXX", item.category))
  }))
}
transformedDF.show(false)
+--------------------------------+
|conditions |
+--------------------------------+
|[[[[XXXX, ABC]], [[XXXX, EDC]]]]|
+--------------------------------+
As you can see, we're doing some simple pattern matching on the case classes we've defined and successfully replacing all of the code fields' values with "XXXX". If you want to get JSON back, you can call the to_json function like so:
transformedDF.select(to_json($"conditions")).show(false)
+----------------------------------------------------------------------------------------------------+
|structstojson(conditions) |
+----------------------------------------------------------------------------------------------------+
|{"list":[{"element":{"code":"XXXX","category":"ABC"}},{"element":{"code":"XXXX","category":"EDC"}}]}|
+----------------------------------------------------------------------------------------------------+
Finally, a very small remark about the data. If you have any control over how the data is produced, I would add the following suggestions:
The conditions JSON object seems to have no function here, since it just contains a single array called list. Consider making the conditions object the array itself, which would allow you to discard the list name. That would simplify your structure.
The element object does nothing except contain a single item. Consider removing one level of abstraction there too.
With these suggestions, your data would contain the same information but look something like:
{"conditions":[{"code":"1234","category":"ABC"},{"code":"4550","category":"EDC"}]}
With these suggestions, you would also remove the need for the MyElement and MyList case classes! But very often we're not in control of the data we receive, so this is just a small disclaimer :)
Hope this helps!
EDIT: After your addition of simplified data according to the above suggestions, the task gets even easier. Again, you only need a map operation here:
import spark.implicits._
import org.apache.spark.sql.Encoders
case class MyItem(code: String, category: String)
case class MyClass(conditions: Seq[MyItem])
val data = Seq(MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("4550", "EDC"))))
val df = data.toDF.as[MyClass]
val transformedDF = df.map {
  case MyClass(conditions) => MyClass(conditions.map { item =>
    MyItem("XXXX", item.category)
  })
}
transformedDF.show(false)
+--------------------------+
|conditions |
+--------------------------+
|[[XXXX, ABC], [XXXX, EDC]]|
+--------------------------+
I was able to find a simple solution with Spark 3.1+, which adds new functions (such as withField) for working with struct fields.
Updated code:
val data = Seq(
MyClass(conditions = Seq(MyItem("1234", "ABC"), MyItem("234", "KBC"))),
MyClass(conditions = Seq(MyItem("4550", "DTC"), MyItem("900", "RDT")))
)
import spark.implicits._
import org.apache.spark.sql.functions.{col, transform, udf}

val ds = data.toDF()
val updatedDS = ds.withColumn(
  "conditions",
  transform(
    col("conditions"),
    x => x.withField("code", updateArray(x.getField("code")))))
updatedDS.show()
UDF:
def updateArray = udf((oldVal: String) => {
  if (oldVal.contains("1234")) "XXX"
  else oldVal
})
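As a side note, with Spark 3.1+ the same masking can also be done without a UDF at all, using only built-in column functions. Below is a sketch of that alternative, reusing the ds Dataset and the column/field names from above; the when/otherwise expression simply takes the place of the UDF:

import org.apache.spark.sql.functions.{col, lit, transform, when}

val updatedNoUdf = ds.withColumn(
  "conditions",
  transform(col("conditions"), x =>
    x.withField(
      "code",
      // Mask codes containing "1234", leave everything else unchanged.
      when(x.getField("code").contains("1234"), lit("XXX"))
        .otherwise(x.getField("code")))))
updatedNoUdf.show()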
I partition the stream by key and manage a map state for each key, like this:
stream
.keyBy(_.userId)
.process(new MyStateFunc)
Every time, I have to read all the values under a key, calculate something, and update only a few of them. An example:
class MyStateFunc() ... {
  val state: ValueState[Map[String, String]] = ... // obtained from a descriptor in open()

  def process(event: MyModel, ...): Unit = {
    val stateAsMap = state.value()
    val updatedStateValues = updateAFewColumnsOfStateValByUsingIncomingEvent(event, stateAsMap)
    doCalculationByUsingSomeValuesOfState(updatedStateValues)
    state.update(updatedStateValues)
  }

  def updateAFewColumnsOfStateValByUsingIncomingEvent(event: MyModel, state: Map[String, String]): Map[String, String] = {
    val updatedState = scala.collection.mutable.Map.empty[String, String]
    event.foreach { case (status, newValue) =>
      updatedState.put(status, newValue)
    }
    state ++ updatedState
  }

  def doCalculationByUsingSomeValuesOfState(stateValues: Map[String, String]): Unit = {
    // do some stuff using some of the keys and values
  }
}
I'm not sure this is the most efficient way. Yes, I have to read all the values (or at least some of them) to do a calculation, but I also only need to update a few of them, not the whole Map stored for each key. I'm just wondering which one is more efficient: ValueState[Map[String, String]] or MapState[String, String]?
If I use MapState[String, String], I have to do something like the code below in order to update the related keys:
val state: MapState[String, String] = ... // obtained from a descriptor in open()

def process(event: MyModel, ...): Unit = {
  val stateAsMap = state.entries().asScala // read what is needed for the calculation
  event.foreach { case (status, newValue) =>
    state.put(status, newValue)
  }
}
I'm not sure whether updating the state like this for each event is efficient or not.
mapState.putAll(changeEvents)
Will this only overwrite the related keys instead of all of them?
Or is there another way to handle this?
If your state only has a few entries, then it likely doesn't matter much. If your map can have a significant number of entries, then using MapState (with RocksDB state backend) should significantly cut down on the serialization cost, as you're only updating a few entries versus the entire state.
Note that for efficiency you should iterate over the MapState once, doing your calculation and (occasionally) updating the entry, assuming that's possible.
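To make that concrete, here is a minimal sketch of the MapState variant inside a KeyedProcessFunction: it iterates the entries once for the calculation and writes back only the keys that actually change. The event and output types are made up for illustration and do not come from the original code:

import scala.collection.JavaConverters._
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical event: a user id plus the (status -> newValue) pairs to apply.
case class MyModel(userId: String, changes: Map[String, String])

class MyStateFunc extends KeyedProcessFunction[String, MyModel, String] {

  private var state: MapState[String, String] = _

  override def open(parameters: Configuration): Unit = {
    state = getRuntimeContext.getMapState(
      new MapStateDescriptor[String, String]("state", classOf[String], classOf[String]))
  }

  override def processElement(
      event: MyModel,
      ctx: KeyedProcessFunction[String, MyModel, String]#Context,
      out: Collector[String]): Unit = {
    // Single pass over the existing entries for the calculation.
    val entryCount = state.entries().asScala.size // stand-in for the real calculation
    // Only the changed keys are written back; with RocksDB each entry is
    // (de)serialized individually instead of the whole map.
    state.putAll(event.changes.asJava)
    out.collect(s"${ctx.getCurrentKey}: $entryCount entries before update")
  }
}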
Let's say I have an array of objects and I want to get a list of the ones that match a value, such as:
var locales = Locale.getAvailableLocales()
val filtered = locales.filter { l -> l.language == "en" }
except that, instead of comparing against a single value, I want to compare against another list, like:
val lang = listOf("en", "fr", "es")
How do I do that? I'm looking for a one-liner solution without any loops. Thanks!
Like this:
var locales = Locale.getAvailableLocales()
val filtered = locales.filter { l -> lang.contains(l.language) }
As pointed out in the comments, you can skip naming the lambda parameter and use the implicit it keyword instead, giving either of the following:
val filtered1 = locales.filter{ lang.contains(it.language) }
val filtered2 = locales.filter{ it.language in lang }
Just remember to use a suitable data structure for the languages, such as a Set, so that the contains() check has low time complexity.
I have Core Data setup in my application so that when there is no internet connection the app will save data locally. Once a connection is established it will trigger online-mode and then I need to send that data that I have stored inside Core Data up to my database.
My question is: how do you turn an entity like this:
<Response: 0x1c0486860> (entity: AQ; id: 0xd000000000580000 <x-coredata://D6656875-7954-486F-8C35-9DBF3CC64E34/AQ/p22> ; data: {
amparexPin = 3196;
dateFull = "\"10/5/2018\"";
finalScore = 34;
hasTech = No;
intervention = Before;
responseValues = "(\n 31,\n 99,\n 82,\n 150,\n 123,\n 66,\n 103,\n 125,\n 0,\n 14\n)";
timeStamp = "\"12:47\"";
who = "Luke";
})
into this:
amparexPin: 5123
timeStamp: 10:30
hasTech: No
intervention: Before
Basically a dictionary. I am trying to perform the same operation on each set of data in each entity. I know it sounds overly complicated, but it's quite imperative that each entity goes through the same filter/function before sending its data up to a database. Any help on this would be awesome!
I see two ways to go here. The first is to have a protocol with some "encode" function, toDictionary() -> [String: Any?], that each managed object class implements in an extension, and then call this function on each object before sending it.
The advantage of this approach is that you get more precise control of each mapping between the entity and the dictionary; the disadvantage is that you need to implement it for each entity in your Core Data model.
The other way is to make use of NSEntityDescription and KVC to extract all values in one function. This class holds a dictionary of all attributes, attributesByName, which can be used to extract all values using key-value coding. Depending on whether you need to map the data type of the values as well, you can get that from the NSAttributeDescription. (If you need to deal with relationships and more, there is also propertiesByName.)
A simplified example (written directly here, so no guarantee it compiles):
static func convertToDictionary(object: NSManagedObject) -> [String: Any?] {
    let entity = object.entity
    let attributes = entity.attributesByName
    var result: [String: Any?] = [:]
    for key in attributes.keys {
        result[key] = object.value(forKey: key)
    }
    return result
}
You can use this function:
func objectAsDictionary(obj: YourObject) -> [String: String] {
    var object: [String: String] = [String: String]()
    object["amparexPin"] = "\(obj.amparexPin)"
    object["timeStamp"] = "\(obj.timeStamp)"
    object["hasTech"] = "\(obj.hasTech)"
    object["intervention"] = "\(obj.intervention)"
    return object
}
I'm trying to subclass Array to implement a map method that returns instances of my Record class. The idea is a sort of "lazy" array that only instantiates objects as they are needed, to avoid allocating too many Ruby objects at once; I'm hoping to make better use of the garbage collector by instantiating only one object per iteration.
class LazyArray < Array
  def initialize(results)
    @results = results
  end

  def map(&block)
    record = Record.new(@results[i]) # how to get each item from @results for each iteration?
    # how do I pass the record instance to the block for each iteration?
  end
end
simple_array = [{name: 'foo'}, {name: 'bar'}]
lazy_array_instance = LazyArray.new(simple_array)
expect(lazy_array_instance).to be_an Array
expect(lazy_array_instance).to respond_to :map
lazy_array_instance.map do |record|
  expect(record).to be_a Record
end
How can I subclass Array so that I can return an instance of my Record class in each iteration?
From what I know, you shouldn't have to do anything like this at all. Using .lazy you can perform lazy evaluation of arrays:
simple_array_of_results.lazy.map do |record|
# do something with Record instance
end
Now, you've got some odd situation where you're doing something like -
SomeOperation(simple_array_of_results)
and either you want SomeOperation to do its thing lazily, or you want the output to be something lazy -
lazily_transformed_array_of_results = SomeOperation(simple_array_of_results)
page_of_results = lazily_transformed_array_of_results.take(10)
If that sounds right... I'd expect it to be as simple as:
SomeOperation(simple_array_of_results.lazy)
Does that work? array.lazy returns an object that responds to map, after all...
Edit:
...after reading your question again, it seems like what you actually want is something like:
SomeOperation(simple_array_of_results.lazy.collect{|r| SomeTransform(r)})
SomeTransform is whatever you're thinking of that takes that initial data and uses it to create your objects ("as needed" becoming "one at a time"). SomeOperation is whatever it is that needs to be passed something that responds to map.
So you have an array of simple attributes or some such and you want to instantiate an object before calling the map block. Sort of pre-processing on a value-by-value basis.
class Record
  attr_accessor :name
  def initialize(params = {})
    self.name = params[:name]
  end
end

require 'delegate'

class MapEnhanced < SimpleDelegator
  def map(&block)
    __getobj__.map do |attributes|
      object = Record.new(attributes)
      block.call(object)
    end
  end
end
array = MapEnhanced.new([{name: 'Joe'}, {name: 'Pete'}])
array.map { |record| record.name }
=> ["Joe", "Pete"]
An alternative (which will allow you to keep object.is_a? Array)
class MapEnhanced < Array
  alias_method :old_map, :map

  def map(&block)
    old_map do |attributes|
      object = Record.new(attributes)
      block.call(object)
    end
  end
end