We have a topology that uses states (ValueState and ListState) with TTL(StateTtlConfig) because we can not use Timers (We would generate hundred of millions of timers per day, and it does scale : a savepoint/checkpoint would take hours to be generated and might even get stuck while running).
However we need to update the value of the TTL at runtime depending of the type of some incoming events and other logic. Is this alright to recreate a new state with a new StateTtlConfig (and updated TTL time) and copy the values from "old" to "new in the processElement1() and processElement2() methods of a CoProcessFunction (instead of once in the open() like we usually do) ?
I guess the "old" state would be garbage collected (?).
Would this solution scale? be performant? generate any issue? anything bad?
I think your approach can work with the state re-creation in runtime to some extent but it is brittle. The problem, I can see, is that the old state meta information can linger somewhere depending on backend implementation.
For Heap (FS) backend, eventually the checkpoint/savepoint will have no records for the expired old state but the meta info can linger in memory while the job is running. It will go away if the job is restarted.
For RocksDB, the column family of the old state can linger. Moreover, the background cleanup runs only during compaction. If the table is too small, like the part which is in memory, this part (maybe even a bit on disk) will linger. It will go away after restart if cleanup on full snapshot is active (not for incremental checkpoints).
All in all, it depends on how often you have to create the new state and restart your job from savepoint/checkpoint.
I created a ticket to document what can be changed in TTL config and when,
so check some details in the issue.
I guess the "old" state would be garbage collected (?).
from the Flink documentation Cleanup of Expired State.
By default, expired values are explicitly removed on read, such as
ValueState#value, and periodically garbage collected in the background
if supported by the configured state backend. Background cleanup can
be disabled in the StateTtlConfig:
import org.apache.flink.api.common.state.StateTtlConfig;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.disableCleanupInBackground()
.build();
or execute the clean up after a full snapshot:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.cleanupFullSnapshot()
.build();
you can change the TTL at anytime according to the documentation. However, you have to restart the query (it I snot in run-time):
For existing jobs, this cleanup strategy can be activated or
deactivated anytime in StateTtlConfig, e.g. after restart from
savepoint.
But why don'y you see the timers on RocksDB like David said on the referenced answer?
Related
I would like to do a window aggregation with an early trigger logic (you can think that the aggregation is triggered either by window is closed, or by a specific event), and I read on the doc: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The doc mentioned that Note that using ProcessWindowFunction for simple aggregates such as count is quite inefficient. so the suggestion is to pair with incremental window aggregation.
My question is that AverageAggregate in the doc, the state is not saved anywhere, so if the application crashed, the averageAggregate will loose all the intermediate value, right?
So If that is the case, is there a way to do a window aggregation, still supports incremental aggregation, and has a state backend to recover from crash?
The AggregateFunction is indeed only describing the mechanism for combining the input events into some result, that specific class does not store any data.
The state is persisted for us by Flink behind the scene though, when we write something like this:
input
.keyBy(<key selector>)
.window(<window assigner>)
.aggregate(new AverageAggregate(), new MyProcessWindowFunction());
the .keyBy(<key selector>).window(<window assigner>) is indicating to Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of crash or restart, no data is lost (assuming state backend are configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
I'm configuring my StateTtlConfig for MapState and my interest is the objects into the state has for example 3 hours of life and then they should disappear from state and passed to the GC to be cleaned up and release some memory and the checkpoints should release some weight too I think. I had this configuration before and it seems like it was not working because the checkpoints where always growing up:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3)).cleanupFullSnapshot().build();
Then I realized the that configuration works only when reading states from a savepoints but not in my scenario. I'd change my TTL configuration to this one:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3))
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired).build();
Based on the idea that I want to clean all the states for all keys after a defined time.
My questions are:
I'm I doing the right configuration right now?
What is the best way to do it?
Thanks one more time.
Kind regards!!!
I don't know enough about your use case to recommend a specific expiration/cleanup policy, but I can offer a few notes.
My understanding is that cleanupFullSnapshot() specifies that in addition to whatever other cleanup is being done, a full cleanup will be done whenever taking a snapshot.
The FsStateBackend uses the incremental cleanup strategy. By default it checks 5 entries during each state access, and does no additional cleanup during record processing. If your workload is such that there are many more writes than reads, that might not be enough. If no access happens to the state, expired state will persist. Choosing cleanupIncrementally(10, false) will make the cleanup more aggressive, assuming you do have some level of state access going on.
It's not unusual for checkpoint sizes to grow, or to take longer than you'd expect to reach a plateau. Could it simply be that the keyspace is growing?
https://flink.apache.org/2019/05/19/state-ttl.html is a good resource for learning more about Flink's State TTL mechanism.
I am using Flink version 1.10.1 with rocksdb backend.
I know that rocksdb using memory from "managed memory" and I did not setup any specific value for managed memory. It is done by Flink.
When I monitor my application, free memory of taskmanagers always decreasing (I mean free memory of operating system measured via free -h). I suspect that the reason could be Rocksdb.
Question_1 => if ValueState's value expired, then rocksdb will remove from its memory and will delete from localstorage directory? (I have also limited storage capacity)
Question_2 => stream.keyBy(ipAddress), if this ipAddress will be hold by rocksdb (i am talking about keyBy itself not the state), does it always place in managed memory? If not, then flink heap memory will be increased?
Here is the general structure of my application:
streamA = source.filter(..);
streamA2 = source2.filter(..);
streamB = streamA.keyBy(ipAddr).window().process(); // contains value state
streamC = streamA.keyBy(ipAddr).flatMap(..); // contains value state
streamD = streamA2.keyBy(ipAddr).window.process(); // contains value state
streamE = streamA.union(streamA2).keyBy(ipAddr)....
Here is the state example from my application:
private transient ValueState<SampleObject> sampleState;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.minutes(10))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
ValueStateDescriptor<SampleObject> sampleValueStateDescriptor = new ValueStateDescriptor<>(
"sampleState",
TypeInformation.of(SampleObject.class)
);
sampleValueStateDescriptor.enableTimeToLive(ttlConfig);
Rocksdb configuration:
state.backend: rocksdb
state.backend.rocksdb.checkpoint.transfer.thread.num: 6
state.backend.rocksdb.localdir: /pathTo/checkpoint_only_local
Why I am using Rocksdb
I am using rocksdb because I have a huge key size(think of it ip address) that would not be handled by HeapState backend or other.
My application using rocksdb because I have a bunch of state in the user defined keyedprocessfunction for future decision. (each of state has `StateTtlConfig)
Note
My application does not need incremental checkpointing or anything about savepoint. I don't care about the saving all snapshot of my application.
Flink ValueState will be removed from storage after expired when using
Rocksdb?
Yes, but not immediately. (And in some earlier versions of the Flink, the answer was "it depends".)
In your state ttl config you haven't specified how you want state cleanup to be done. In this case, expired values are explicitly removed on read (such as ValueState#value) and are otherwise periodically garbage collected in the background. In the case of RocksDB, this background cleanup is done during compaction. In other words, the cleanup isn't immediate. The docs provide more details on how you can tune this -- you could configure the cleanup to be done more quickly, at the expense of some performance degradation.
A keyBy itself does not use any state. The key selector function is used to partition the stream, but the keys are not stored in connection with the keyBy. Only the windows and flatmap operations are keeping state, which is per-key state, and all of this keyed state will be in RocksDB (unless you have configured your timers to be on the heap, which is an option, in but Flink 1.10 timers are stored off-heap, in rocksdb, by default).
You could change the flatmap to a KeyedProcessFunction and use timers to explicitly clear state for state keys -- which would give you direct control over exactly when the state is cleared, rather than relying on the state TTL mechanism to eventually clear the state.
But it's more likely that the windows are building up considerable state. If you can switch to doing pre-aggregation (via reduce or aggregate) that may help a lot.
I have an always one application, listening to a Kafka stream, and processing events. Events are part of a session. And I need to do calculations based off of a sessions data. I am running into a problem trying to correctly run my calculations due to the length of my sessions. 90% of my sessions are done after 5 minutes. 99% are done after 1 hour. Sessions may last more than a day, due to this being a real-time system, there is no determined end. Session are unique, and show never collide.
I am looking for a way where I can process a window multiple times, either with an initial wait period and processing any later events after that, or a pure process per event type structure. I will need to keep all previous events around(ListState), as well as previously processed values(ValueState).
I previously thought allowedLateness would allow me to do this, but it seems the lateness is only considered for when the event should have been processed, it does not extend an actual window. GlobalWindows may also work, but I am unsure if there is a way to process a window multiple times. I believe I can used an evictor with GlobalWindows to purge the Windows after a period of inactivity(although admittedly, I did not research this yet, because I was unsure of how to trigger a GlobalWindow multiple times.
Any suggestions on how to achieve what I am looking to do would be greatly appreciated, I would also be happy to clarify any points needed.
If SessionWindows won't do the job, then you can use GlobalWindows with a custom Trigger and Evictor. The Trigger interface has onElement and timer-based callbacks that can fire whenever and as often as you like. If you go down this route, then yes, you'll also need to implement an Evictor to dispose of elements when they are no longer needed.
The documentation and the source code are helpful when trying to understand how this all fits together.
EDIT2: skip to the end, state is restored but it's not queryable new tl;dr "How do I make State that was queryable, still queryable after a restore from checkpoint?"
I have a keyed stream with check pointing enabled similar to this (I've tried this with in memory as well as HDFS with the same results)
env.enableCheckpointing(60000)
env.setStateBackend(new FsStateBackend("file:///flink-test"))
val stream = env.addSource(consumer)
.flatMap(new ValidationMap()).name("ValidationMap")
.keyBy(x => new Tuple3[String, String, String](x.account(), x.organization(), x.`type`()))
.flatMap(new Foo()).name(jobname)
Within this stream, I have a Managed Keyed State ValueState that I set as queryable.
val newValueStateDescriptor = new ValueStateDescriptor[java.util.ArrayList[java.util.ArrayList[Long]]]("foo", classOf[java.util.ArrayList[java.util.ArrayList[Long]]])
newValueStateDescriptor.setQueryable("foo")
valueState = getRuntimeContext.getState[java.util.ArrayList[java.util.ArrayList[Long]]](newValueStateDescriptor)
valueState.update(new java.util.ArrayList[java.util.ArrayList[Long]]())
This list is periodically appended to or removed from and the valueState is updated. When I make a request of the Queryable State I currently see correct values.
In my JobManager log I see check pointing every minute, and when I check the file system, I see files being created that are non-empty.
My setup has 3 JobManagers (2 in standby), 3 TaskManagers (all 3 in use).
I put a single data point into the system and read it out of QueryableState, everything looks good. Then I pick a single TaskManager (not even the one that processed the data, any of the 3) and I kill it, then restart it to simulate a crash.
I watch the job get retried 2 or 3 times until the TaskManager comes back online, and finally I see the same JobID running again in Flink, life seems good.
But, I then hit the Queryable State again, and I get an UnknownKvStateLocation exception.
I'm really not quite sure what I've done wrong here, things appear to be check pointing, but I never manage to get my ValueState back ? Maybe it's back but not Queryable?
EDIT:
Log snippet from JobManager implies things are restored
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state RESTARTING to CREATED."}
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Recovering checkpoints from ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Found 1 checkpoints in ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Trying to retrieve checkpoint 5."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","ndc":"", "message":"Restoring from latest valid checkpoint: Checkpoint 5 # 1496330912627 for dc7850a6866f181c2f07968d35fe3d46."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state CREATED to RUNNING."}
It really looks like it's restored, and when I inspect the file created in /flink-test I see some binary data but it contains the identifying names for my Queryable State ValueState. Any ideas on what to look for would be welcome.
EDIT2: State >is< restored, it's just not queryable!
The fact that a given piece of registered state has been made queryable is not (currently) part of what Flink records in checkpoints or savepoints. So after recovery, the state isn't queryable until a new StateDescriptor is provided.
For more, see this discussion on the flink-users mailing list.