I need the ability to remove keys from map state that are older than a fixed amount of time.
I currently keep the timestamp of each event in the keyed state map, and I'd like to have an asynchronous process that removes these stale keys.
I'm using RocksDB as the state backend, and I don't think the Java API of RocksDB supports opening a database with TTL, as noted here.
So my questions are:
Is it at all possible to have an async thread access the MapState, given that the state lives inside an operator function?
Is there a better practice in this case?
Thanks in advance,
One straightforward approach for expiring state in Flink is to use a ProcessFunction operator to hold the state. You can then use a timer (either a processing time timer or an event time timer, depending on what makes sense for your application) and clear the state in the onTimer method.
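For illustration, here is a minimal sketch of that pattern, assuming a keyed stream of a hypothetical MyEvent type with a getId() accessor. A processing-time timer periodically sweeps MapState entries older than a fixed TTL:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// MyEvent is a hypothetical event type with a getId() accessor.
public class StaleKeySweeper extends KeyedProcessFunction<String, MyEvent, MyEvent> {

    private static final long TTL_MILLIS = 60 * 60 * 1000; // one hour, for illustration

    private transient MapState<String, Long> lastSeen;

    @Override
    public void open(Configuration parameters) {
        lastSeen = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("last-seen", String.class, Long.class));
    }

    @Override
    public void processElement(MyEvent event, Context ctx, Collector<MyEvent> out) throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        lastSeen.put(event.getId(), now);
        // Timers are deduplicated per key and timestamp, so this is cheap to call repeatedly.
        ctx.timerService().registerProcessingTimeTimer(now + TTL_MILLIS);
        out.collect(event);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<MyEvent> out) throws Exception {
        // Sweep out entries that have not been updated within the TTL.
        long cutoff = timestamp - TTL_MILLIS;
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entries()) {
            if (entry.getValue() < cutoff) {
                expired.add(entry.getKey());
            }
        }
        for (String key : expired) {
            lastSeen.remove(key);
        }
    }
}
```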
As of Flink 1.6.0, the state TTL feature has been implemented. It allows you to explicitly define a TTL for records in the state backend. The catch is that removal happens lazily, only when a key is read; a key that is never accessed will stay there. This limitation will most likely be removed in a future version.
State Time-To-Live (TTL) Flink Documentation
State TTL for Apache Flink: How to Limit the Lifetime of State
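For reference, a minimal sketch of enabling TTL on a state descriptor (API as introduced in Flink 1.6; the values are illustrative):

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;

StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.days(1))                                   // time-to-live of one day
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)  // writes refresh the TTL
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .build();

ValueStateDescriptor<Long> descriptor = new ValueStateDescriptor<>("last-seen", Long.class);
descriptor.enableTimeToLive(ttlConfig);
```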
Related
We have a job in which all user features and information are stored in keyed state. Each user feature has its own state descriptor. But we are evolving our features, so sometimes a feature is abandoned in the next release/version because we no longer declare that feature's state descriptor in our code. My question is: how does Flink take care of such abandoned state? Will it simply no longer restore that abandoned state automatically?
If you are using Flink POJOs or Avro types, then Flink will automatically migrate the types and state for you. Otherwise, it will not, and you could implement a custom serializer instead. Or you could use the State Processor API to clean things up.
The external database consists of a set of rules for each key; these rules should be applied to each stream element in the Flink job. Because it is very expensive to make a DB call for each element to retrieve the rules, I want to fetch the rules from the database at initialization and store them in a local cache.
When rules are updated in the external database, a status change event is published to the Flink job which should be used to fetch the rules and refresh this cache.
What is the best way to achieve what I've described? I looked into keyed state but initializing all keys and refreshing the keys on update doesn't seem possible.
I think you can make use of BroadcastProcessFunction or KeyedBroadcastProcessFunction to achieve your use case. A detailed blog post is available here.
In short: you can define a source, such as Kafka, and publish the rules to it for the actual stream to consume. Connect the actual data stream with the rules stream. processBroadcastElement() then streams in the rules, where you can update the broadcast state. Finally, the updated state (rules) can be read in processElement(), the method that handles the actual events. A sketch follows after the points below.
Points to consider: broadcast state is always kept on the heap, never in the state store (RocksDB), so it has to be small enough to fit in memory. Each parallel instance copies all of the broadcast state into its checkpoints, so all checkpoints and savepoints will contain n (parallelism) copies of the broadcast state.
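A rough sketch of that wiring, with hypothetical Event and Rule types standing in for the application's own classes:

```java
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.util.Collector;

// Event and Rule are hypothetical application types; Rule has getKey() and apply() accessors.
public class RuleEnricher extends KeyedBroadcastProcessFunction<String, Event, Rule, Event> {

    // The same descriptor must be used when calling ruleStream.broadcast(RULE_DESCRIPTOR).
    public static final MapStateDescriptor<String, Rule> RULE_DESCRIPTOR =
            new MapStateDescriptor<>("rules", String.class, Rule.class);

    @Override
    public void processBroadcastElement(Rule rule, Context ctx, Collector<Event> out) throws Exception {
        // Rule updates arrive on the broadcast side and refresh the shared state.
        ctx.getBroadcastState(RULE_DESCRIPTOR).put(rule.getKey(), rule);
    }

    @Override
    public void processElement(Event event, ReadOnlyContext ctx, Collector<Event> out) throws Exception {
        // The data stream reads the (read-only) broadcast state to apply the current rule.
        Rule rule = ctx.getBroadcastState(RULE_DESCRIPTOR).get(event.getKey());
        if (rule != null) {
            out.collect(rule.apply(event));
        }
    }
}
```

The two streams would then be connected with something like dataStream.keyBy(Event::getKey).connect(ruleStream.broadcast(RuleEnricher.RULE_DESCRIPTOR)).process(new RuleEnricher()).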
A few different mechanisms in Flink may be relevant to this use case, depending on your detailed requirements.
Broadcast State
Jaya Ananthram has already covered the idea of using broadcast state in his answer. This makes sense if the rules should be applied globally, for every key, and if you can find a way to collect and broadcast the updates.
Note that the Context passed to processBroadcastElement() of a KeyedBroadcastProcessFunction contains the method applyToKeyedState(StateDescriptor<S, VS> stateDescriptor, KeyedStateFunction<KS, S> function). This means you can register a KeyedStateFunction that will be applied to the state of every key associated with the provided stateDescriptor, as sketched below.
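For instance, a broadcasted retraction could wipe the corresponding keyed state for every key. Here ruleStateDescriptor and Rule.isRetracted() are hypothetical:

```java
@Override
public void processBroadcastElement(Rule rule, Context ctx, Collector<Event> out) throws Exception {
    if (rule.isRetracted()) {
        // Applies the given function to the state of every key registered under this descriptor.
        ctx.applyToKeyedState(ruleStateDescriptor, (key, state) -> state.clear());
    }
}
```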
State Processor API
If you want to bootstrap state in a Flink savepoint from a database dump, you can do that with this library. You'll find a simple example of using the State Processor API to bootstrap state in this gist.
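A minimal sketch of such a bootstrap job, assuming the DataSet-based State Processor API (Flink 1.9+) and a hypothetical Rule POJO; the uid must match the operator in the streaming job that will consume the savepoint:

```java
import java.util.Arrays;

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.state.api.BootstrapTransformation;
import org.apache.flink.state.api.OperatorTransformation;
import org.apache.flink.state.api.Savepoint;
import org.apache.flink.state.api.functions.KeyedStateBootstrapFunction;

public class BootstrapRulesJob {

    // Hypothetical POJO representing one row of the database dump.
    public static class Rule {
        public String key;
        public String expression;
        public Rule() {}
        public Rule(String key, String expression) { this.key = key; this.expression = expression; }
    }

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // In practice this DataSet would be read from the database dump.
        DataSet<Rule> rules = env.fromCollection(Arrays.asList(
                new Rule("user-1", "limit > 100"),
                new Rule("user-2", "limit > 200")));

        BootstrapTransformation<Rule> transformation = OperatorTransformation
                .bootstrapWith(rules)
                .keyBy(new KeySelector<Rule, String>() {
                    @Override
                    public String getKey(Rule rule) {
                        return rule.key;
                    }
                })
                .transform(new RuleBootstrapper());

        Savepoint
                .create(new MemoryStateBackend(), 128)      // 128 = max parallelism
                .withOperator("rules-uid", transformation)  // must match the streaming operator's uid
                .write("file:///tmp/rules-savepoint");

        env.execute("bootstrap rules");
    }

    public static class RuleBootstrapper extends KeyedStateBootstrapFunction<String, Rule> {

        private transient ValueState<Rule> ruleState;

        @Override
        public void open(Configuration parameters) {
            ruleState = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("rule", Rule.class));
        }

        @Override
        public void processElement(Rule rule, Context ctx) throws Exception {
            ruleState.update(rule);
        }
    }
}
```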
Change Data Capture
The Table/SQL API supports Debezium, Canal, and Maxwell CDC streams, and Kafka upsert streams. This may be a solution. There's also flink-cdc-connectors.
Lookup Joins
Flink SQL can do temporal lookup joins against a JDBC database, with a configurable cache. Not sure this is relevant.
In essence, David's answer summarizes it well. If you are looking for more detail: not long ago I gave a webinar [1] on this topic, including running code examples [2].
[1] https://www.youtube.com/watch?v=cJS18iKLUIY
[2] https://github.com/knaufk/enrichments-with-flink
I am adding TTL to a ValueState in one ProcessFunction in one of my Flink apps. The app has several other kinds of state, both in this ProcessFunction and in other operators. I understand that adding TTL to a ValueState makes it not backwards compatible. However, I was wondering if I could use the AllowNonRestoredState option to restore the rest of the application's state from the snapshot and have Flink just skip restoring the state for the one ValueState I add TTL to. Essentially, I was hoping for a little more insight into what AllowNonRestoredState does. From the docs, it seems like it only works in situations where state was dropped altogether, not in cases where the state still exists but has been modified.
AllowedNonRestoredState simply allows a job to start from a state snapshot (a savepoint or checkpoint) that contains state that has nowhere to be restored to in the job being started. In other words, some state was dropped.
Instead of trying to get Flink to not restore state for a particular ValueState, you could leave the old ValueState alone, while also introducing a new ValueState (with state TTL). When reading the new ValueState, if it's null, you could then migrate forward the old value.
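A minimal sketch of that lazy-migration pattern, assuming a hypothetical counter state; names, types, and the TTL are illustrative:

```java
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class MigratingFunction extends KeyedProcessFunction<String, String, String> {

    private transient ValueState<Long> oldState;  // restored from the snapshot, no TTL
    private transient ValueState<Long> newState;  // new descriptor, with TTL enabled

    @Override
    public void open(Configuration parameters) {
        oldState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("counter", Long.class));

        StateTtlConfig ttl = StateTtlConfig.newBuilder(Time.days(7)).build();
        ValueStateDescriptor<Long> newDescriptor =
                new ValueStateDescriptor<>("counter-ttl", Long.class);
        newDescriptor.enableTimeToLive(ttl);
        newState = getRuntimeContext().getState(newDescriptor);
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        Long current = newState.value();
        if (current == null) {
            // First access for this key since the upgrade: migrate the old value forward.
            current = oldState.value();
            oldState.clear();
        }
        current = (current == null) ? 1L : current + 1;
        newState.update(current);
        out.collect(ctx.getCurrentKey() + " -> " + current);
    }
}
```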
However, I think it would be preferable to do a complete, one-time migration using the State Processor API (as I proposed here).
Apache Flink allows me to use state in a RichMapFunction. I am planning to build a continuously running job which analyses a stream of web events. Part of the processing will be the creation of a session context with session-scoped metrics (like nth of the session, duration etc.) and additionally a user context.
A session context will timeout after 30 minutes, but a user context may exist for a year to handle returning users.
There will be millions of sessions and users, so I would end up with millions of individual states. Every state is just a few KB in size.
Is this something that can be handled properly with the Flink states?
How does Flink actually clean up stale states?
Would it make sense to think about providing a custom backend to store the state in a KV cluster?
For large state I would recommend using Flink's RocksDBStateBackend. This state backend uses RocksDB to store state. Since RocksDB gracefully spills to disk, it is only limited by your available disk space. Thus, Flink should be able to handle your use case.
At the moment you need to register timers to clean up state. However, with the next Flink release, the community will add TTL support for state, which will then automatically clean up your state once it has expired.
Keeping your state close to your computation with periodic checkpoints which are persisted will keep your application fast. If every state access went to a remote KV cluster, it would considerably slow down the processing.
We have a Flink job that persists large keyed state in the RocksDB backend. We are using the incremental checkpointing strategy. As time goes by, the size of the state has become a problem. We have looked at the state TTL solution, but it does not support incremental RocksDB state.
What would be the best approach for this problem if I really need incremental checkpoints?
One approach that is often used is to manipulate the state in some kind of ProcessFunction, and use a timer to clear the state when it is no longer needed -- e.g., if it hasn't been accessed for several hours. ProcessFunctions are able to have both event-time and processing-time timers, so you can choose whichever is more appropriate for your use case.
See the expiring state exercise on the Flink training site for an example of using timers to clear state.
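Along the lines of that exercise, here is a minimal sketch: each element pushes a single per-key processing-time timer forward, and onTimer() clears the state once the key has been idle for the full timeout. Names and the timeout are illustrative:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

public class IdleStateCleaner extends KeyedProcessFunction<String, String, String> {

    private static final long TIMEOUT_MILLIS = 3 * 60 * 60 * 1000; // e.g. 3 hours idle

    private transient ValueState<String> payload;    // the large state to expire
    private transient ValueState<Long> expiryTimer;  // the currently registered timer, if any

    @Override
    public void open(Configuration parameters) {
        payload = getRuntimeContext().getState(
                new ValueStateDescriptor<>("payload", String.class));
        expiryTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("expiry-timer", Long.class));
    }

    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        payload.update(value);
        // Replace the previous timer so only one cleanup timer exists per key.
        Long previous = expiryTimer.value();
        if (previous != null) {
            ctx.timerService().deleteProcessingTimeTimer(previous);
        }
        long newTimer = ctx.timerService().currentProcessingTime() + TIMEOUT_MILLIS;
        ctx.timerService().registerProcessingTimeTimer(newTimer);
        expiryTimer.update(newTimer);
        out.collect(value);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
        // Key was idle for the full timeout: drop everything, including the timer bookkeeping.
        payload.clear();
        expiryTimer.clear();
    }
}
```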