Apache Flink MapState vs ValueState[Map[String, String]] usage - apache-flink

I partition the stream by key and manage a map state for each key, like this:
stream
  .keyBy(_.userId)
  .process(new MyStateFunc)
For every event, I have to read all the values under a key, calculate something, and update only a few of them. An example:
class MyStateFunc() ... {
  val state: ValueState[Map[String, String]] = ...

  def process(event: MyModel, ...): Unit = {
    val stateAsMap = state.value()
    val updatedStateValues = updateAFewColumnsOfStateValByUsingIncomingEvent(event, stateAsMap)
    doCalculationByUsingSomeValuesOfState(updatedStateValues)
    state.update(updatedStateValues)
  }

  def updateAFewColumnsOfStateValByUsingIncomingEvent(event: MyModel, state: Map[String, String]): Map[String, String] = {
    val updates = Map.newBuilder[String, String]
    event.foreach { case (status, newValue) =>
      updates += status -> newValue
    }
    state ++ updates.result()
  }

  def doCalculationByUsingSomeValuesOfState(stateValues: Map[String, String]): Unit = {
    // do some stuff using some of the keys and values
  }
}
I'm not sure this is the most efficient way. Yes, I have to read all of the values (or at least some of them) to do the calculation, but I only need to update a few of them, not the whole Map stored under each key. I'm just wondering which is more efficient: ValueState[Map[String, String]] or MapState[String, String]?
If I use MapState[String, String], I have to do something like the following to update the related keys:
val state: MapState[String, String] = ...

def process(event: MyModel, ...): Unit = {
  val stateAsMap = state.entries().asScala // read what the calculation needs
  event.foreach { case (status, newValue) =>
    state.put(status, newValue)
  }
}
I'm not sure whether updating the state entry by entry for each event is efficient or not.
mapState.putAll(changeEvents)
Will this only overwrite related keys instead of all of them?
Or is there another way to overcome this?

If your state only has a few entries, then it likely doesn't matter much. If your map can have a significant number of entries, then using MapState (with the RocksDB state backend) should significantly cut down on the serialization cost, since you are only updating a few entries rather than reserializing the entire map.
Note that for efficiency you should iterate over the MapState only once, doing your calculation and (occasionally) updating entries along the way, assuming that's possible.
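For illustration, a minimal sketch of what that single pass could look like, assuming a KeyedProcessFunction; needsUpdate and newValueFor are hypothetical stand-ins for your own logic, not part of the question:
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector
import scala.collection.JavaConverters._
import scala.collection.mutable

class MyStateFunc extends KeyedProcessFunction[String, MyModel, Long] {
  private var state: MapState[String, String] = _

  override def open(parameters: Configuration): Unit = {
    state = getRuntimeContext.getMapState(
      new MapStateDescriptor[String, String]("state", classOf[String], classOf[String]))
  }

  override def processElement(event: MyModel,
                              ctx: KeyedProcessFunction[String, MyModel, Long]#Context,
                              out: Collector[Long]): Unit = {
    val pending = mutable.Map.empty[String, String] // updates collected during the pass
    var touched = 0L
    // Single pass over the state: read every entry for the calculation,
    // but only remember the few entries that actually need a new value.
    for (entry <- state.entries().asScala) {
      touched += 1 // stand-in for the real calculation
      if (needsUpdate(event, entry.getKey)) { // hypothetical predicate
        pending += entry.getKey -> newValueFor(event, entry.getKey) // hypothetical
      }
    }
    // Write back only the changed entries; with RocksDB each put
    // serializes a single entry, not the whole map.
    pending.foreach { case (k, v) => state.put(k, v) }
    out.collect(touched)
  }

  private def needsUpdate(event: MyModel, key: String): Boolean = ??? // fill in
  private def newValueFor(event: MyModel, key: String): String = ??? // fill in
}
The updates are buffered and applied after the loop so the iteration never observes its own modifications, and with the RocksDB backend only the touched entries are (de)serialized, which is the whole point of preferring MapState here.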

Related

Flink MapStateDescriptor update List value

I use a MapStateDescriptor for my stateful computation. Some code:
final val myMap = new MapStateDescriptor[String, List[String]]("myMap", classOf[String], classOf[List[String]])
During my computation I want to update my map by adding new elements to the List[String].
Is it possible?
Update #1
I have written the following def to manage my map:
def updateTagsMapState(mapKey: String, tagId: String, mapToUpdate: MapState[String, List[String]]): Unit = {
  if (mapToUpdate.contains(mapKey)) {
    val mapValues: List[String] = mapToUpdate.get(mapKey)
    val updatedMapValues: List[String] = tagId :: mapValues
    mapToUpdate.put(mapKey, updatedMapValues)
  } else {
    mapToUpdate.put(mapKey, List(tagId))
  }
}
Sure, it is. Whether you are using a Scala List or a Java one there, you can do something like this to actually create the state from the descriptor:
lazy val stateMap = getRuntimeContext.getMapState(myMap)
Then you can simply do:
val list = stateMap.get("someKey")
stateMap.put("someKey", "SomeVal" :: list)
Note that if you were working with a mutable data structure, you wouldn't necessarily need to call put again, since updating the underlying object would also update the state. But this approach does not work with the RocksDB state backend, where the state is only persisted when you call put, so it is always advisable to update the state itself rather than just the underlying object.
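To tie the pieces together, here is a minimal sketch of how the descriptor and the updateTagsMapState def from the question could sit inside a keyed function. The (mapKey, tagId) tuple input and the TagCollector class name are made up for the example:
import org.apache.flink.api.common.state.{MapState, MapStateDescriptor}
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

class TagCollector extends KeyedProcessFunction[String, (String, String), List[String]] {
  final val myMap = new MapStateDescriptor[String, List[String]](
    "myMap", classOf[String], classOf[List[String]])

  lazy val stateMap: MapState[String, List[String]] = getRuntimeContext.getMapState(myMap)

  override def processElement(in: (String, String),
                              ctx: KeyedProcessFunction[String, (String, String), List[String]]#Context,
                              out: Collector[List[String]]): Unit = {
    val (mapKey, tagId) = in
    // Read-modify-put: always write the new immutable List back,
    // so the update also persists with the RocksDB state backend.
    if (stateMap.contains(mapKey)) {
      stateMap.put(mapKey, tagId :: stateMap.get(mapKey))
    } else {
      stateMap.put(mapKey, List(tagId))
    }
    out.collect(stateMap.get(mapKey))
  }
}
Because the List is immutable, every update goes through put, which is exactly what the RocksDB caveat above requires.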

Compare two big arrays value for value in Node.js

I have two arrays, one containing 200,000 product objects coming from a CSV file and one containing 200,000 product objects coming from a database.
Both arrays contain objects with the same fields, with one exception: the database objects have a unique ID as well.
I need to compare all 200,000 CSV objects with the 200,000 database objects. If a CSV object already exists in the database array, I put it in an "update" array together with the ID from the match; if it doesn't, I put it in a "new" array.
When done, I update all the "update" objects in the database and insert all the "new" ones. This goes fast (a few seconds).
The compare step, however, takes hours. I need to compare three values: the channel (string), the date (date) and the time (string). If all three are the same, it's a match. If one of them isn't, it's not a match.
This is the code I have:
const newProducts = [];
const updateProducts = [];

csvProducts.forEach((csvProduct) => {
  // check if there is a match
  const match = dbProducts.find((dbProduct) => {
    return dbProduct.channel === csvProduct.channel &&
      moment(dbProduct.date).isSame(moment(csvProduct.date), 'day') &&
      dbProduct.start_time === csvProduct.start_time;
  });
  if (match) {
    // we found a match, add it to the updateProducts array
    updateProducts.push({
      id: match.id,
      ...csvProduct
    });
    // remove the match from the dbProducts array to speed things up
    _.pull(dbProducts, match);
  } else {
    // no match, it's a new product
    newProducts.push(csvProduct);
  }
});
I am using the lodash and moment.js libraries.
The bottleneck is the check for whether there is a match. Any ideas on how to speed this up?
This is a job for the Map collection class. Arrays are a hassle because they must be searched linearly; Maps (and Sets) can be searched fast. And you want to do your matching in RAM rather than hitting your db for every single object in your incoming file.
So, first read every record in your database and construct a Map where each key is a string composed from {start_time, date, channel} and the value is the id. (Compose a string rather than using an object as the key: a JavaScript Map compares object keys by reference, so two separately built key objects would never match. I put the time first because I guess it's the attribute with the most distinct values; it's an attempt to make lookup faster.)
Something like this pseudocode.
const productsInDb = new Map()
for (const entry of database) {
  // build your keys EXACTLY the same way when you load your Map...
  const key = `${entry.start_time}|${moment(entry.date).format('YYYY-MM-DD')}|${entry.channel}`
  productsInDb.set(key, entry.id)
}
This will take a whole mess of RAM, but so what? It's what RAM is for.
Then do your matching more or less the way you did it in your example, but using your Map.
const newProducts = [];
const updateProducts = [];

csvProducts.forEach((csvProduct) => {
  // ...and build them the same way when you look up entries in the Map.
  const key = `${csvProduct.start_time}|${moment(csvProduct.date).format('YYYY-MM-DD')}|${csvProduct.channel}`
  const id = productsInDb.get(key)
  if (id) {
    // we found a match, add it to the updateProducts array
    updateProducts.push({
      id,
      ...csvProduct
    });
    // don't bother to update your Map here
    // unless you need to do something about dups in your csv file
  } else {
    // no match, it's a new product
    newProducts.push(csvProduct)
  }
});

How to buffer and drop a chunked bytestring with a delimiter?

Let's say you have a publisher using broadcast, with some fast and some slow subscribers, and you would like to be able to drop sets of messages for a slow subscriber without having to keep them in memory. The data consists of chunked ByteStrings, so dropping a single ByteString is not an option.
Each set of ByteStrings is followed by a terminator ByteString("\n"), so I would need to drop a set of ByteStrings ending with that.
Is that something you can do with a custom graph stage? Can it be done without aggregating and keeping the whole set in memory?
Avoid Custom Stages
Whenever possible, try to avoid custom stages: they are very tricky to get right, as well as being pretty verbose. Usually some combination of the standard akka-stream stages and plain old functions will do the trick.
Group Dropping
Presumably you have some criteria that you will use to decide which group of messages will be dropped:
type ShouldDropTester = () => Boolean
For demonstration purposes I will use a simple switch that drops every other group:
val dropEveryOther : ShouldDropTester =
  Iterator.from(1)
    .map(_ % 2 == 0)
    .next
We will also need a function that will take in a ShouldDropTester and use it to determine whether an individual ByteString should be dropped:
val endOfFile = ByteString("\n")
val dropGroupPredicate : ShouldDropTester => ByteString => Boolean =
  (shouldDropTester) => {
    var dropGroup = shouldDropTester()
    (byteString) =>
      if (byteString equals endOfFile) {
        val returnValue = dropGroup
        dropGroup = shouldDropTester()
        returnValue
      } else {
        dropGroup
      }
  }
Combining the above two functions will drop every other group of ByteStrings. This functionality can then be converted into a Flow:
val filterPredicateFunction : ByteString => Boolean =
  dropGroupPredicate(dropEveryOther)

val dropGroups : Flow[ByteString, ByteString, _] =
  Flow[ByteString] filterNot filterPredicateFunction
As required, the groups of messages do not need to be buffered; the predicate works on individual ByteStrings and therefore consumes a constant amount of memory regardless of file size. Note that the predicate returns true for ByteStrings that should be dropped, hence filterNot.
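A quick way to see the flow in action; this demo harness is not from the original answer, and it assumes Akka 2.6-style streams where the materializer is derived from the implicit ActorSystem:
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import akka.util.ByteString
import scala.concurrent.ExecutionContext.Implicits.global

object DropGroupsDemo extends App {
  implicit val system: ActorSystem = ActorSystem("drop-groups-demo")

  // Two groups of chunks, each terminated by ByteString("\n").
  // With dropEveryOther, the first group passes through and the second is dropped.
  val chunks = List("a1", "a2", "\n", "b1", "b2", "\n").map(ByteString(_))

  Source(chunks)
    .via(dropGroups) // the Flow defined above
    .runWith(Sink.foreach(bs => println(bs.utf8String)))
    .onComplete(_ => system.terminate())
}
Running this should print a1, a2 and the terminator, while the second group never reaches the sink.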

How to turn NSManagedObject values into a Dictionary with key and value pairs?

I have Core Data set up in my application so that when there is no internet connection the app saves data locally. Once a connection is established, the app switches to online mode and I need to send the data stored in Core Data up to my database.
My question is how do you turn an entity into a dictionary such as this:
<Response: 0x1c0486860> (entity: AQ; id: 0xd000000000580000 <x-coredata://D6656875-7954-486F-8C35-9DBF3CC64E34/AQ/p22> ; data: {
amparexPin = 3196;
dateFull = "\"10/5/2018\"";
finalScore = 34;
hasTech = No;
intervention = Before;
responseValues = "(\n 31,\n 99,\n 82,\n 150,\n 123,\n 66,\n 103,\n 125,\n 0,\n 14\n)";
timeStamp = "\"12:47\"";
who = "Luke";
})
into this:
amparexPin: 5123
timeStamp: 10:30
hasTech: No
intervention: Before
Basically a dictionary. I am trying to perform the same operation on each set of data in each entity. I know it sounds overly complicated, but it's quite imperative that each entity go through the same filter/function before its data is sent up to a database. Any help on this would be awesome!
I see two ways to go here. The first is to have a protocol with some "encode" function, toDictionary() -> [String: Any?], that each managed object class implements in an extension; you then call this function on each object before sending it.
The advantage of this way is that you get more precise control over each mapping between the entity and the dictionary; the disadvantage is that you need to implement it for each entity in your Core Data model.
The other way is to make use of NSEntityDescription and KVC to extract all values in one function. This class holds a dictionary of all attributes, attributesByName, that can be used to extract all values using key-value coding. If you need the data types of the values as well, you can get them from the NSAttributeDescription. (If you need to deal with relationships and more, there is also propertiesByName.)
A simplified example. (Written directly here so no guarantee it compiles)
static func convertToDictionary(object: NSManagedObject) -> [String: Any?] {
    let entity = object.entity
    let attributes = entity.attributesByName
    var result: [String: Any?] = [:]
    for key in attributes.keys {
        result[key] = object.value(forKey: key)
    }
    return result
}
You can use this function:
func objectAsDictionary(obj: YourObject) -> [String: String] {
    var object: [String: String] = [String: String]()
    object["amparexPin"] = "\(obj.amparexPin)"
    object["timeStamp"] = "\(obj.timeStamp)"
    object["hasTech"] = "\(obj.hasTech)"
    object["intervention"] = "\(obj.intervention)"
    return object
}

NGRX - can't set the state tree as I would like it to be

So I'm using ngrx for managing the state in my application. I tried to add a new property (selected shifts) which should look like this:
state: {
  shifts: {
    selectedShifts: [
      [employeeId]: [
        [shiftId]: shift
      ]
    ]
  }
}
At the moment, my state looks like this:
state: {
  selectedShifts: {
    [employeeId]: {
      [shiftId]: shift
    }
  }
}
So as you can see, my "selected shift" is a property, not an array, which makes it difficult to add/remove/query the state.
How do I compose the state to look the way I want it to?
This is what I tried in the reducer:
return {
  ...state,
  selectedShifts: {
    ...state.selectedShifts,
    [action.payload.employeeId]: {
      ...state.selectedShifts[action.payload.employeeId],
      [action.payload.shiftId]: action.payload.shift
    }
  }
};
Now when I try to return the state in the way I'd like to, this is the result:
state: {
selectedShifts: {
[action.payload.employeeId]:
[0]: {[action.payload.shiftId]: { shift }}
}
}
What am I missing here? When I try to replace the {} items that should be [], this error comes up: "," expected.
Oh yeah, I would like the index of the array to be the id of the specific shift, not [0], [1], ...
Is this possible at all?
Would it be a bad idea to change the indexes from numerics to the actual shift ids?
Array length misbehaves when you add data at arbitrary numeric index points. This can get you into problems with the array methods that use length (join, slice, indexOf, etc.) and the array methods that alter length (push, splice, etc.).
var fruits = [];
fruits.push('banana', 'apple', 'peach');
console.log(fruits.length); // 3
When setting a property on a JavaScript array when the property is a valid array index and that index is outside the current bounds of the array, the engine will update the array's length property accordingly:
fruits[5] = 'mango';
console.log(fruits[5]); // 'mango'
console.log(Object.keys(fruits)); // ['0', '1', '2', '5']
console.log(fruits.length); // 6
There is no problem selecting or updating state stored as an object; it's just a bit different from what you're probably used to. With a straight hashmap { objectId: Object }, finding the object to update or remove is as fast as it gets when changes are keyed by object id.
I know your problem is related to NGRX, but reading about Redux immutable update patterns will definitely help you here with adding, updating and removing objects in the state: https://redux.js.org/recipes/structuring-reducers/immutable-update-patterns
Generally you don't want to have arrays in state (at least not large arrays); object hashmaps are a lot better.
To get an array of the selected user shifts for your views, you could do something like the following. Note this is not a shift-indexed array, just an array of shifts under a userId property, derived from the original state:
state: {
  selectedShifts: {
    [employeeId]: {
      [shiftId]: shift
    }
  }
}
const getSelectedShiftsAsArray = this.store.select( getSelectedShifts() )
  .map(userShifts => {
    // get array of user ids
    const userIds = Object.keys(userShifts);
    const ret = {};
    for (const userId of userIds) {
      // convert Dictionary<Shift> into a Shift[]
      // get array of shift ids
      const shiftIds = Object.keys(userShifts[userId]);
      // map the array of shift ids into an array of shift objects
      const collectedShifts = shiftIds.map(shiftId => userShifts[userId][shiftId]);
      // return value for a userId
      ret[userId] = collectedShifts;
    }
    return ret;
  });
The code is completely untested and just for reference, one level up from pseudocode, but you could easily convert it into an NGRX selector. The state is there just for storage; how you model it for use in components is up to the selector functions and the components themselves.
If you really, really need them, you could also add:
ret[userId].shiftIds = shiftIds;
ret[userId].shifts = collectedShifts;
But it really depends on how you plan to use these.
From my personal experience I would separate the shift entities from selectedShifts, but how you organise your state is completely up to you.
state: {
  shifts: {
    // contains shift entities as an object property map, id: entity
    entities: Dictionary<Shift>,
    selectedShifts: {
      [employeeId]: number[] // contains ids for the selected shifts
    }
  }
}
Now updating, removing, or adding a shift is just a matter of setting updated data at the path shifts.entities[entityId].
Updating selectedShifts for an employeeId is then just checking whether the id is already there and appending it to the array if it wasn't. (If these arrays are humongous, I'd go with an object hash here too for fast access: <employeeId>: { shiftId: shiftId }.)
Check also:
redux: state as array of objects vs object keyed by id
