I want to initialize value state from my own DB (e.g. Elasticsearch) if the application cannot restore state from the state backend, but how can I get the current key during initializeState?
Here is the sample code:
@Override
public void initializeState(FunctionInitializationContext context) throws Exception {
    KeyedStateStore stateStore = context.getKeyedStateStore();
    ValueStateDescriptor<PickUpState> pickUpStateConfig = new ValueStateDescriptor<>("pickUpState", PickUpState.class);
    ValueState<PickUpState> state = stateStore.getState(pickUpStateConfig);
    pickUpState = state;
    if (!context.isRestored()) {
        // getting the current key here would be helpful
        String key = ...
        PickUpState upState = initStateFromEs(key);
        state.update(upState);
    }
}
Any reply will be helpful, thanks!
That's not possible because there is no current key when initializeState is called.
Each instance of a user function is multiplexed across many keys -- namely all of the keys in the key groups assigned to that task slot. initializeState is just called once, and it needs to perform whatever actions are needed for all of those keys. (And there is no way to determine which keys are relevant to a given instance.)
The assumption is that the state backend is always available. The only time this isn't true is if the remote filesystem storing the snapshot is unavailable, in which case there's nothing you can do about it -- except for rebuilding the state from scratch. E.g., you could use the state processor API to build a new state snapshot from the data in the database.
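For illustration, here is a rough Java sketch of that last idea, using the State Processor API available since Flink 1.9 (via the flink-state-processor-api dependency). The PickUpRow type, the readRowsFromEs helper, the key field, the operator uid, and the savepoint path are all assumptions, not part of the original question:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

public class BootstrapPickUpState {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical: read the rows exported from Elasticsearch (or any DB) into a DataSet.
        DataSet<PickUpRow> rows = readRowsFromEs(env);

        BootstrapTransformation<PickUpRow> transformation = OperatorTransformation
                .bootstrapWith(rows)
                .keyBy(new KeySelector<PickUpRow, String>() {
                    @Override
                    public String getKey(PickUpRow row) {
                        return row.key; // must match the key used by the streaming job
                    }
                })
                .transform(new KeyedStateBootstrapFunction<String, PickUpRow>() {
                    private transient ValueState<PickUpState> state;

                    @Override
                    public void open(Configuration parameters) {
                        // same descriptor name and type as in the streaming job
                        state = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("pickUpState", PickUpState.class));
                    }

                    @Override
                    public void processElement(PickUpRow row, Context ctx) throws Exception {
                        state.update(row.toPickUpState()); // hypothetical conversion
                    }
                });

        Savepoint.create(new MemoryStateBackend(), 128) // 128 = max parallelism of the job
                .withOperator("pickup-operator-uid", transformation)
                .write("hdfs:///savepoints/bootstrapped-pickup-state");

        env.execute("bootstrap pickUpState from ES");
    }
}

The streaming job can then be started from the written savepoint, and initializeState will find the bootstrapped state already in place.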
We have a requirement to create a kind of user session. Our front end is React, the backend is a .NET Core 6 API, and the DB is Postgres.
When one user clicks on a delete button, they should not be allowed to delete that item while another user is already using that item and performing some actions on it.
Can you suggest an approach, or any kind of service that is available, to achieve this? Please help.
I would say don't make it too complicated. A simple approach could be to add the properties 'BeingEditedByUserId' and 'ExclusiveEditLockEnd' (datetime) to the entity and check them when performing any action on the entity. When an action is performed on the entity, the user's id is assigned and a time slot (for example 10 minutes) is reserved for that user. If any other user tries to perform an action, you block them. Once the time slot has expired, anyone can edit again.
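The question targets .NET, but the heart of this approach is a single conditional update in Postgres; here is a rough JDBC-style sketch in Java (the items table and its column names are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class EditLockDao {

    // Returns true if the current user now holds the edit lock on the item.
    // The lock is granted if the row is unlocked, already held by this user, or expired.
    public boolean tryAcquireEditLock(Connection conn, long itemId, long userId) throws SQLException {
        String sql =
            "UPDATE items " +
            "SET being_edited_by_user_id = ?, exclusive_edit_lock_end = now() + interval '10 minutes' " +
            "WHERE id = ? " +
            "  AND (being_edited_by_user_id IS NULL " +
            "       OR being_edited_by_user_id = ? " +
            "       OR exclusive_edit_lock_end < now())";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, userId);
            ps.setLong(2, itemId);
            ps.setLong(3, userId);
            return ps.executeUpdate() == 1; // exactly one row updated means the lock is ours
        }
    }
}

Because the check and the claim happen in one statement, two concurrent users cannot both acquire the lock.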
I have had to do something similar in Java (also backed by a Postgres DB).
There are some pitfalls to avoid with a custom lock implementation, like forgetting to unlock when finished: there is no guarantee that a client makes a 'goodbye, unlock the table' call when they finish editing a page; they could simply close the browser tab, or have a power outage... Here is what I decided to do:
Decide whether the lock should be implemented in the API or in the DB.
Is this a distributed/scalable application? Does it run as a single instance or as multiple instances? If multiple, then you cannot (as easily) implement an API lock (you could use something like a shared cache, but that might be more trouble than it is worth).
Is there a record in the DB that could be used as a lock, guaranteed to exist for each editable item in the DB? I would assume so, but if the app is backed by multiple DBs, maybe not.
API locking is fairly easy; you just need to handle thread safety, as most (if not all) REST/SOAP... implementations are heavily multithreaded.
If you implement it at the DB level, consider looking into a 'row-level lock', which allows you to request a lock on a specific row in the DB and use it as a write lock.
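For the DB route with Postgres, a row-level lock taken with SELECT ... FOR UPDATE could look roughly like this (again a JDBC-style sketch; the items table and the DataSource are assumptions):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import javax.sql.DataSource;

public class RowLockExample {

    // Locks a single row for the duration of the transaction; with NOWAIT other writers
    // fail immediately instead of blocking, so we can report "item in use" to the caller.
    public boolean editWithRowLock(DataSource dataSource, long itemId) throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            conn.setAutoCommit(false);
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT id FROM items WHERE id = ? FOR UPDATE NOWAIT")) {
                ps.setLong(1, itemId);
                ps.executeQuery();       // throws if another transaction already holds the row lock
                // ... perform the protected changes on the same connection here ...
                conn.commit();           // commit releases the row lock
                return true;
            } catch (SQLException lockBusy) {
                conn.rollback();         // someone else is editing this item
                return false;
            }
        }
    }
}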
If you want to implement in the API, consider something like this:
class LockManager
{
    private static readonly object writeLock = new();

    // the `object` key is whatever you want to use as the ID of the resource being locked,
    // probably a UUID/GUID but it could be a String too
    // the `holder` is an ID of the person/system that owns the lock
    Dictionary<object, _lock> locks = new Dictionary<object, _lock>();

    _lock acquireLock(object id, String holder)
    {
        _lock lok = new _lock();
        lok.id = id;
        lok.holder = holder;
        lock (writeLock)
        {
            if (locks.ContainsKey(id))
            {
                if (locks[id].release <= DateTime.Now)
                {
                    // the previous lock has expired, so it can be discarded
                    locks.Remove(id);
                }
                else
                {
                    throw new InvalidOperationException("Resource is already locked, lock held by: " + locks[id].holder);
                }
            }
            lok.allocated = DateTime.Now;
            lok.release = lok.allocated.AddMinutes(5);
            locks[id] = lok; // register the new lock so later callers can see it
        }
        return lok;
    }

    void releaseLock(object id)
    {
        lock (writeLock)
        {
            locks.Remove(id);
        }
    }

    // called by .js code via an ajax call to renew the lock if the user is determined to be active
    void extendLock(object id)
    {
        lock (writeLock)
        {
            if (locks.ContainsKey(id))
            {
                locks[id].release = DateTime.Now.AddMinutes(5);
            }
        }
    }
}

class _lock
{
    public object id;
    public String holder;
    public DateTime allocated;
    public DateTime release;
}
This is what I did because it does not depend on the DB or the client, and it was easy to implement. It also does not require configuring any lock timeouts or cleanup tasks to release items with expired locks, as that is taken care of in the locking step.
I have two sources, Kafka and HBase. Kafka carries a data stream covering only the last 24 hours; HBase holds aggregated data from the beginning. My goal is to merge the two during stream processing whenever stream input (Kafka) for some session occurs. I have tried a couple of methods, but none were satisfactory because of performance.
After some searching, I came up with the idea below: caching with the state of a keyed process function.
1. Route the input to a keyed process function using the session information.
2. Check the keyed process function's state.
3. If the state is not initialized: query HBase, initialize the state, then go to step 5.
4. Otherwise (the state is initialized): go to step 5.
5. Do the business logic using the state.
While coding this idea, I ran into a performance issue: querying HBase synchronously is slow. So I tried an async version, but it is complicated.
I have faced two issues. One is a thread-safety issue between processElement and the HBase async worker thread; the other is that the Context of the process function expires at the end of processElement (not at the end of the HBase async worker).
val sourceStream = env.addSource(kafkaConsumer.setStartFromGroupOffsets())
sourceStream.keyBy(new KeySelector[InputMessage, KeyInfo]() {
    override def getKey(v: InputMessage): KeyInfo = v.toKeyInfo()
  })
  .process(new KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]() {
    var table: AsyncTable[AdvancedScanResultConsumer] = _
    var state: MapState[String, (String, Long)] = _

    override def open(parameters: Configuration): Unit = {
      val conn = ConnectionFactory.createAsyncConnection(hbaseConfInstance).join
      table = conn.getTable(TableName.valueOf("tablename"))
      state = getRuntimeContext.getMapState(stateDescripter)
    }

    def request(action: Consumer[CacheResult]): Unit = {
      if (!state.isEmpty) {
        action.accept(new CacheResult(state))
      } else { // state is empty, so load from HBase
        table.get(new Get(key)).thenAccept((hbaseResult: Result) => {
          // this is called by the worker thread
          hbaseResult.toState(state) // convert the HBase result into state
          action.accept(new CacheResult(state))
        })
      }
    }

    override def processElement(value: InputMessage,
                                ctx: KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]#Context,
                                out: Collector[OUTPUTTYPE]): Unit = {
      val businessAction = new Consumer[CacheResult]() {
        override def accept(t: CacheResult): Unit = {
          // .. do business logic here.
          out.collect( /* final result */ )
        }
      }
      request(businessAction)
    }
  }).addSink()
Is there any suggestion for making a KeyedProcessFunction work with an async call to a third-party system?
Or any other idea for combining Kafka and HBase in Flink?
I think your general assumptions are wrong. I faced a similar issue (regarding a quite different problem) and haven't resolved it yet. Keeping state in the program contradicts async functions, and Flink prevents using state in async code by design (which is a good thing). If you want to make your function async, you must get rid of the state. To achieve your goal, you probably need to redesign your solution. I don't know all the details of your problem, but you could think about splitting your process into more pipelines. E.g. you can create a pipeline that consumes data from HBase and passes it into a Kafka topic. Then another pipeline can consume the data sent by the pipeline gathering data from HBase. In such an approach you don't have to care about the state, because each pipeline does its own thing: it just consumes data and passes it further.
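As a hedged illustration of that multi-pipeline idea, the consuming job could key both Kafka topics (the live events plus a hypothetical hbase-export topic written by the other pipeline) and merge them in a KeyedCoProcessFunction, so the cached data lives in keyed state and no async call is needed inside the function. The type names, schemas, and topic names below are assumptions:

import java.util.Properties;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

public class MergeJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        Properties kafkaProps = new Properties(); // broker/group settings omitted

        // Live events, plus rows previously exported from HBase by the other pipeline.
        DataStream<InputMessage> events =
                env.addSource(new FlinkKafkaConsumer<>("events", new InputMessageSchema(), kafkaProps));
        DataStream<HbaseRow> history =
                env.addSource(new FlinkKafkaConsumer<>("hbase-export", new HbaseRowSchema(), kafkaProps));

        events.keyBy(InputMessage::sessionKey)
              .connect(history.keyBy(HbaseRow::sessionKey))
              .process(new KeyedCoProcessFunction<String, InputMessage, HbaseRow, OutputRecord>() {
                  private transient ValueState<HbaseRow> cached;

                  @Override
                  public void open(Configuration parameters) {
                      cached = getRuntimeContext().getState(
                              new ValueStateDescriptor<>("cached-history", HbaseRow.class));
                  }

                  @Override
                  public void processElement1(InputMessage msg, Context ctx,
                                              Collector<OutputRecord> out) throws Exception {
                      // Business logic combines the live event with the cached history, if any.
                      out.collect(OutputRecord.of(msg, cached.value()));
                  }

                  @Override
                  public void processElement2(HbaseRow row, Context ctx,
                                              Collector<OutputRecord> out) throws Exception {
                      cached.update(row); // populated synchronously, no async call inside the function
                  }
              })
              .print(); // replace with the real sink

        env.execute("merge kafka events with hbase-derived history");
    }
}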
In addition to this question, I'm still not clear on why the checkpoints of my Flink job grow and grow over time; at the moment, after about 7 days running, these checkpoints have never reached a plateau.
I'm using Flink version 1.10 at the moment, with the FS state backend, as my job cannot afford the latency cost of using RocksDB.
See the checkpoints evolve over 7 days:
Let's say that I have this configuration for the TTL of the states in all my stateful operators, set to one hour (or more than that, and a day in one case):
public static final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(Time.hours(1))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.cleanupFullSnapshot().build();
My expectation is that all the objects in the states will be cleaned up once the TTL expires, and therefore the checkpoint size should shrink, since we expect more or less the same amount of data every day.
On the other hand, we have a traffic curve with more incoming data during some hours of the day; late at night the traffic goes down, and all the expired objects in the states should be cleaned up, so the checkpoint size should shrink rather than stay the same until the traffic goes up again.
Let's look at a code sample for one use case:
DataStream<Event> stream = addSource(source);
KeyedStream<Event, String> keyedStream = stream
        .filter((FilterFunction<Event>) event -> /* apply filters here */)
        .name("Events filtered")
        .keyBy(k -> k.rType.equals("something") ? k.id1 : k.id2);

keyedStream.flatMap(new MyFlatMapFunction());
public class MyFlatMapFunction extends RichFlatMapFunction<Event, Event> {

    private final MapStateDescriptor<String, Event> descriptor = new MapStateDescriptor<>("prev_state", String.class, Event.class);
    private MapState<String, Event> previousState;

    @Override
    public void open(Configuration parameters) {
        /* ttlConfig described above */
        descriptor.enableTimeToLive(ttlConfig);
        previousState = getRuntimeContext().getMapState(descriptor);
    }

    @Override
    public void flatMap(Event event, Collector<Event> collector) throws Exception {
        final String key = event.rType.equals("something") ? event.id1 : event.id2;
        Event previous = previousState.get(key);
        if (previous != null) {
            /* something done here */
        } else /* something done here */
            previousState.put(key, previous);
        collector.collect(previous);
    }
}
More or less this is the structure of the use cases; some others use windows (time windows or session windows).
Questions:
What am I doing wrong here?
Are the states cleaned up when they expire, both in this scenario and in the rest of the use cases, which are similar?
What can help me fix the checkpoint size if something is working wrong?
Is this behaviour normal?
Kind regards!
In this stretch of code it appears that you are simply writing back the state that was already there, which only serves to reset the TTL timer. This might explain why the state isn't being expired.
Event previous = previousState.get(key);
if (previous != null) {
    /*something done here*/
} else
    previousState.put(key, previous);
It also appears that you should be using ValueState rather than MapState. ValueState effectively provides a sharded key/value store, where the keys are the keys used to partition the stream in the keyBy. MapState gives you a nested map for each key, rather than a single value. But since you are using the same key inside the flatMap that you used to key the stream originally, key-partitioned ValueState would appear to be all that you need.
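For illustration, a ValueState-based version of the flatMap could look roughly like this (a sketch: the business logic is only hinted at, as in the original question, and ttlConfig is the configuration described above):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class MyFlatMapFunction extends RichFlatMapFunction<Event, Event> {

    private transient ValueState<Event> previousState;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Event> descriptor = new ValueStateDescriptor<>("prev_state", Event.class);
        descriptor.enableTimeToLive(ttlConfig);
        previousState = getRuntimeContext().getState(descriptor);
    }

    @Override
    public void flatMap(Event event, Collector<Event> collector) throws Exception {
        // No explicit key handling: the state is already scoped to the key chosen in keyBy().
        Event previous = previousState.value();
        if (previous != null) {
            /* something done here */
            collector.collect(previous);
        } else {
            // Only write when nothing is stored yet, so an unchanged value
            // does not keep resetting the TTL timer on every element.
            previousState.update(event);
        }
    }
}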
My goal is to have a Flink streaming program that keeps the last N ids, where the id is extracted from an event. The sink is a Cassandra store so that the list of ids can be fetched at any time. It is important that Cassandra is updated immediately upon every event.
This can be implemented easily with mapWithState (see the code below). However, there is an important problem with this code: the state is keyed by userId, and some users might be active for some time and then never again. What I am worried about is that the state storage will grow forever.
How does one cleanup state for inactive keys?
case class MyEvent(userId: Int, id: String)

env
  .addSource(new FlinkKafkaConsumer010[MyEvent]("vips", new MyJsonDeserializationSchema(), kafkaConsumerProperties))
  .keyBy(_.userId)
  .mapWithState[(Int, Seq[String]), Seq[String]] { (in: MyEvent, currentIds: Option[Seq[String]]) =>
    val keepNIds = currentIds match {
      case None => Seq(in.id)
      case Some(cids) => (cids :+ in.id).takeRight(100)
    }
    ((in.userId, keepNIds), Some(keepNIds))
  }
  .addSink { in: (Int, Seq[String]) =>
    CassandraSink.appDatabase.idsTable.store(...)
  }
The growing state is an important and correct observation. This will definitely happen if your keyspace is moving.
Flink 1.2.0 added the ProcessFunction, which addresses this problem. A ProcessFunction is similar to a FlatMapFunction but has access to timer services. You can register timers that invoke the onTimer() callback when they expire, and the callback can be used to clean up the state.
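Here is a hedged sketch of that pattern written in Java with today's KeyedProcessFunction (MyEvent is assumed to be a POJO with public userId and id fields mirroring the question's case class; the 100-id cap comes from the question, while the 7-day inactivity timeout is an assumption):

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class LastIdsFunction extends KeyedProcessFunction<Integer, MyEvent, Tuple2<Integer, List<String>>> {

    private static final long INACTIVITY_TIMEOUT_MS = 7L * 24 * 60 * 60 * 1000; // assumed 7 days

    private transient ValueState<List<String>> idsState;
    private transient ValueState<Long> timerState;

    @Override
    public void open(Configuration parameters) {
        idsState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("ids", Types.LIST(Types.STRING)));
        timerState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("cleanup-timer", Long.class));
    }

    @Override
    public void processElement(MyEvent event, Context ctx,
                               Collector<Tuple2<Integer, List<String>>> out) throws Exception {
        List<String> ids = idsState.value();
        if (ids == null) {
            ids = new ArrayList<>();
        }
        ids.add(event.id);
        if (ids.size() > 100) {
            ids = new ArrayList<>(ids.subList(ids.size() - 100, ids.size()));
        }
        idsState.update(ids);
        out.collect(Tuple2.of(event.userId, ids));

        // (Re)schedule the cleanup timer: delete the previous one so only the latest fires.
        Long previousTimer = timerState.value();
        if (previousTimer != null) {
            ctx.timerService().deleteProcessingTimeTimer(previousTimer);
        }
        long cleanupAt = ctx.timerService().currentProcessingTime() + INACTIVITY_TIMEOUT_MS;
        ctx.timerService().registerProcessingTimeTimer(cleanupAt);
        timerState.update(cleanupAt);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx,
                        Collector<Tuple2<Integer, List<String>>> out) {
        // The key has been inactive for the whole timeout: drop its state.
        idsState.clear();
        timerState.clear();
    }
}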
I execute the method Datastore.delete(key) from my GWT web application, and the AsyncCallback's onSuccess() method is called. Then I refresh http://localhost:8888/_ah/admin immediately, and the entity I intended to delete still exists. Similarly, if I refresh my GWT web application immediately, the item I intended to delete still shows on the web page. Note that onSuccess() has been called.
So, how can I know when the entity has actually been deleted?
public void deleteALocation(int removedIndex, String symbol) {
    if (Window.confirm("Sure ?")) {
        System.out.println("XXXXXX " + symbol);
        loCalservice.deletoALocation(symbol, callback_delete_location);
    }
}

public AsyncCallback<String> callback_delete_location = new AsyncCallback<String>() {
    public void onFailure(Throwable caught) {
        Window.alert(caught.getMessage());
    }

    public void onSuccess(String result) {
        int removedIndex = ArryList_Location.indexOf(result);
        ArryList_Location.remove(removedIndex);
        LocationTable.removeRow(removedIndex + 1);
        //Window.alert(result+"!!!");
    }
};
Server:
public String deletoALocation(String name) {
    Transaction tx = Datastore.beginTransaction();
    Key key = Datastore.createKey(Location.class, name);
    Datastore.delete(tx, key);
    tx.commit();
    return name;
}
Sorry, I'm not good at English :-)
According to the docs
Returns the Key object (if one model instance is given) or a list of Key objects (if a list of instances is given) that correspond with the stored model instances.
If you need an example of a working delete function, this might help (line 108):
class DeletePost(BaseHandler):
    def get(self, post_id):
        iden = int(post_id)
        post = db.get(db.Key.from_path('Posts', iden))
        db.delete(post)
        return webapp2.redirect('/')
How do you check the existence of the entity? Via a query?
Queries on HRD are eventually consistent, meaning that if you add/delete/change an entity then immediately query for it you might not see the changes. The reason for this is that when you write (or delete) an entity, GAE asynchronously updates the index and entity in several phases. Since this takes some time it might happen that you don't see the changes immediately.
The linked article discusses ways to mitigate this limitation.
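For example, with the low-level Java Datastore API (the question uses Slim3, so this is only an approximation, and the 'Location' kind is assumed from the question), a lookup by key is strongly consistent and reflects the delete immediately, whereas a query over the kind might not:

import com.google.appengine.api.datastore.DatastoreService;
import com.google.appengine.api.datastore.DatastoreServiceFactory;
import com.google.appengine.api.datastore.EntityNotFoundException;
import com.google.appengine.api.datastore.Key;
import com.google.appengine.api.datastore.KeyFactory;

public class LocationChecker {

    // Returns true if the Location entity with this name still exists.
    public static boolean locationExists(String name) {
        DatastoreService datastore = DatastoreServiceFactory.getDatastoreService();
        Key key = KeyFactory.createKey("Location", name); // kind and name assumed from the question
        try {
            datastore.get(key);   // get() by key is strongly consistent
            return true;
        } catch (EntityNotFoundException e) {
            return false;         // the delete is already visible here
        }
    }
}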