I'm configuring a StateTtlConfig for MapState. I want the objects in the state to live for, say, 3 hours and then disappear from state so that the GC can clean them up and release some memory; I'd expect the checkpoints to shrink as well. I had this configuration before, and it did not seem to work because the checkpoints were always growing:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3)).cleanupFullSnapshot().build();
Then I realized that this configuration only takes effect when state is read back from a savepoint, which is not my scenario. So I changed my TTL configuration to this one:
private final StateTtlConfig ttlConfig = StateTtlConfig.newBuilder(org.apache.flink.api.common.time.Time.hours(3))
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired).build();
The idea is that I want the state for all keys to be cleaned up after a defined time.
My questions are:
Am I using the right configuration now?
What is the best way to do it?
Thanks one more time.
Kind regards!!!
I don't know enough about your use case to recommend a specific expiration/cleanup policy, but I can offer a few notes.
My understanding is that cleanupFullSnapshot() specifies that, in addition to whatever other cleanup is being done, expired entries are dropped whenever a full snapshot is taken. The local state itself is not cleaned up by this strategy.
The FsStateBackend uses the incremental cleanup strategy. By default it checks 5 entries during each state access, and does no additional cleanup during record processing. If your workload is such that there are many more writes than reads, that might not be enough. If no access happens to the state, expired state will persist. Choosing cleanupIncrementally(10, false) will make the cleanup more aggressive, assuming you do have some level of state access going on.
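To make the access-driven behavior described above concrete, here is a toy model in plain Java. It is not Flink's implementation; the class name IncrementalTtlMap and its scan queue are made up for illustration. It only mimics the idea that each state access checks a fixed number of entries, so expired entries linger until enough accesses have happened.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Toy model of access-driven incremental TTL cleanup; NOT Flink's implementation. */
class IncrementalTtlMap {
    private final Map<String, Long> expiresAt = new HashMap<>(); // key -> expiry timestamp
    private final Deque<String> scanQueue = new ArrayDeque<>();  // keys in round-robin scan order
    private final long ttlMillis;
    private final int cleanupSize; // entries checked per state access (hypothetical knob)

    IncrementalTtlMap(long ttlMillis, int cleanupSize) {
        this.ttlMillis = ttlMillis;
        this.cleanupSize = cleanupSize;
    }

    /** A write counts as a state access, so it also advances the cleanup scan. */
    void put(String key, long now) {
        if (expiresAt.put(key, now + ttlMillis) == null) {
            scanQueue.addLast(key); // first time we see this key
        }
        cleanupStep(now);
    }

    /** Checks at most cleanupSize entries and drops the expired ones. */
    private void cleanupStep(long now) {
        int toCheck = Math.min(cleanupSize, scanQueue.size());
        for (int i = 0; i < toCheck; i++) {
            String key = scanQueue.pollFirst();
            if (expiresAt.get(key) <= now) {
                expiresAt.remove(key);  // expired: physically removed
            } else {
                scanQueue.addLast(key); // still live: revisit on a later pass
            }
        }
    }

    int size() {
        return expiresAt.size();
    }

    public static void main(String[] args) {
        IncrementalTtlMap state = new IncrementalTtlMap(1000, 5);
        for (int i = 0; i < 20; i++) {
            state.put("k" + i, 0); // 20 entries, all expiring at t=1000
        }
        // Long after expiry but with no access, nothing has been removed yet.
        System.out.println("entries before any later access: " + state.size());
        state.put("fresh", 5000); // one access checks only 5 entries
        System.out.println("entries after one access: " + state.size());
    }
}
```

In this model a larger cleanupSize removes expired entries in fewer accesses, at the cost of more work per access, which is the same trade-off as raising the first argument of cleanupIncrementally.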
It's not unusual for checkpoint sizes to grow, or to take longer than you'd expect to reach a plateau. Could it simply be that the keyspace is growing?
https://flink.apache.org/2019/05/19/state-ttl.html is a good resource for learning more about Flink's State TTL mechanism.
We have an application that consumes events from a Kafka source. The logic for processing each element needs to take into account the events that were previously received (having the same partition key), without using time for windowing. The first implementation used a GlobalWindow, with an AggregateFunction for keeping the current state information and a trigger that would always fire in the onElement call. I am guessing that the alternative of using a KeyedProcessFunction that holds the state in a ValueState object would be more adequate, since we are not really taking timing into account, nor using any custom triggering. Is this assumption correct, and are there any downsides to either of these approaches?
I prefer using a KeyedProcessFunction in cases like this. It puts all of the related logic into one object -- rather than having to coordinate what's going on in a GlobalWindow, an AggregateFunction, and a Trigger (and perhaps also an Evictor). I find this results in implementations that are more maintainable and testable, plus you have more straightforward control over state management.
I don't see any advantages to a solution based on windows.
Let's say you are working on a big Flink project, and you key the stream by the client IP addresses of your customers.
Then you realize that you are applying the same filter in several different places in the code, like this:
public void calculationOne(){
kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processA).sink(...);
}
public void calculationTwo(){
kafkaSource.filter(isContainsSmthA).keyBy(clientip).process(processB).sink(...);
}
Assume there are many such kafkaSource.filter(isContainsSmthA) calls.
Does this structure lead to performance issues in Flink?
Would something like the following be much better?
public Stream filteredA(){
return kafkaSource.filter(isContainsSmthA);
}
public void calculationOne(){
filteredA().keyBy(clientip).process(processA).sink(...);
}
public void calculationTwo(){
filteredA().keyBy(clientip).process(processB).sink(...);
}
It depends a bit on how it should behave operationally.
The first way is more friendly to the Kafka cluster: all records are read once. The filter itself is a very cheap operation, so you don't need to worry too much about it. However, the big downside of this approach is that if one calculation is much slower than the others, it will slow them down. If you do not process historic events, it shouldn't matter, as you'd size your application cluster to keep up with all events anyway. Another current downside is that if you have a failure in calculationTwo, the tasks in calculationOne are restarted as well. The community is actively working to mitigate that, though.
The second way would allow only the affected source -> ... -> sink subtopology to be restarted. So if you expect frequent restarts or need to guarantee certain SLAs, this approach is better. An extension is to actually have separate Flink applications for each of these pipelines. You can share the same jar, but use different arguments to select the correct pipeline on submission. This approach also makes updating of applications much easier as you would only experience downtime for the pipeline that you actually modify.
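As a rough sketch of the "same jar, different arguments" idea, the entry point could look something like the following. This is plain Java with no Flink calls; the class name PipelineLauncher and the registry are hypothetical, and each Runnable only stands in for code that would build and execute one Flink topology.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: one jar, one entry point, pipeline chosen by a command-line argument. */
class PipelineLauncher {
    // Hypothetical registry; in a real job each Runnable would build and execute one topology.
    static final Map<String, Runnable> PIPELINES = new LinkedHashMap<>();

    static {
        PIPELINES.put("calculationOne", () -> System.out.println("building calculationOne"));
        PIPELINES.put("calculationTwo", () -> System.out.println("building calculationTwo"));
    }

    /** Returns the pipeline named by the single argument, or fails with a usage message. */
    static Runnable select(String[] args) {
        if (args.length != 1 || !PIPELINES.containsKey(args[0])) {
            throw new IllegalArgumentException(
                    "usage: <pipeline>, one of " + PIPELINES.keySet());
        }
        return PIPELINES.get(args[0]);
    }

    public static void main(String[] args) {
        select(args).run();
    }
}
```

Each submission then names exactly one pipeline, so a redeploy of calculationTwo does not touch the running calculationOne job.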
I might do something like below, where a simple wrapper operator can run data through two different functions, and generate two side outputs.
SingleOutputStreamOperator comboResults = kafkaSource
.filter(isContainsSmthA)
.keyBy(clientip)
.process(new MyWrapperFunction(processA, processB));
comboResults
.getSideOutput(processATag)
.sink(...);
comboResults
.getSideOutput(processBTag)
.sink(...);
Though I don't know how that compares with what Arvid suggested.
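MyWrapperFunction is not shown above, so here is a toy model of what such a wrapper does, in plain Java without any Flink API. The class WrapperModel and its two lists are stand-ins I made up: the lists play the role of the side outputs behind processATag and processBTag, and the two Function arguments play the role of processA and processB.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Toy model of a wrapper operator feeding two side outputs; plain Java, no Flink API. */
class WrapperModel<I, A, B> {
    private final Function<I, A> processA; // stands in for processA
    private final Function<I, B> processB; // stands in for processB
    final List<A> sideOutputA = new ArrayList<>(); // stands in for the processATag side output
    final List<B> sideOutputB = new ArrayList<>(); // stands in for the processBTag side output

    WrapperModel(Function<I, A> processA, Function<I, B> processB) {
        this.processA = processA;
        this.processB = processB;
    }

    /** Each element is seen once and handed to both functions. */
    void processElement(I element) {
        sideOutputA.add(processA.apply(element));
        sideOutputB.add(processB.apply(element));
    }

    public static void main(String[] args) {
        WrapperModel<String, Integer, String> wrapper =
                new WrapperModel<>(String::length, String::toUpperCase);
        wrapper.processElement("abc");
        System.out.println(wrapper.sideOutputA + " " + wrapper.sideOutputB);
    }
}
```

The point of the model is that the source, filter, and keyBy run once per record, while both functions still produce independent outputs.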
We have a topology that uses state (ValueState and ListState) with TTL (StateTtlConfig) because we cannot use timers. (We would generate hundreds of millions of timers per day, and that does not scale: a savepoint/checkpoint would take hours to be generated and might even get stuck while running.)
However, we need to update the TTL value at runtime depending on the type of some incoming events and other logic. Is it all right to recreate the state with a new StateTtlConfig (and an updated TTL) and copy the values from "old" to "new" in the processElement1() and processElement2() methods of a CoProcessFunction (instead of once in open(), as we usually do)?
I guess the "old" state would be garbage collected (?).
Would this solution scale? be performant? generate any issue? anything bad?
I think your approach of re-creating the state at runtime can work to some extent, but it is brittle. The problem I can see is that the old state's meta information can linger somewhere, depending on the backend implementation.
For Heap (FS) backend, eventually the checkpoint/savepoint will have no records for the expired old state but the meta info can linger in memory while the job is running. It will go away if the job is restarted.
For RocksDB, the column family of the old state can linger. Moreover, the background cleanup runs only during compaction. If the table is too small, like the part which is in memory, this part (maybe even a bit on disk) will linger. It will go away after restart if cleanup on full snapshot is active (not for incremental checkpoints).
All in all, it depends on how often you have to create the new state and restart your job from savepoint/checkpoint.
I created a ticket to document what can be changed in the TTL config and when, so check the issue for some details.
I guess the "old" state would be garbage collected (?).
from the Flink documentation Cleanup of Expired State.
By default, expired values are explicitly removed on read, such as
ValueState#value, and periodically garbage collected in the background
if supported by the configured state backend. Background cleanup can
be disabled in the StateTtlConfig:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.disableCleanupInBackground()
.build();
or execute the clean up after a full snapshot:
import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.time.Time;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.seconds(1))
.cleanupFullSnapshot()
.build();
You can change the TTL at any time according to the documentation. However, you have to restart the query (it is not done at runtime):
For existing jobs, this cleanup strategy can be activated or
deactivated anytime in StateTtlConfig, e.g. after restart from
savepoint.
But why don't you see the timers on RocksDB, as David said in the referenced answer?
I am using Flink version 1.10.1 with the RocksDB backend.
I know that RocksDB takes its memory from Flink's "managed memory". I did not set any specific value for managed memory; it is sized by Flink.
When I monitor my application, the free memory of the TaskManagers is always decreasing (I mean the operating system's free memory, measured via free -h). I suspect that the cause could be RocksDB.
Question_1 => If a ValueState's value has expired, will RocksDB remove it from memory and delete it from the local storage directory? (I also have limited storage capacity.)
Question_2 => With stream.keyBy(ipAddress), if the ipAddress is held by RocksDB (I am talking about the keyBy itself, not the state), is it always placed in managed memory? If not, will Flink's heap memory usage increase?
Here is the general structure of my application:
streamA = source.filter(..);
streamA2 = source2.filter(..);
streamB = streamA.keyBy(ipAddr).window().process(); // contains value state
streamC = streamA.keyBy(ipAddr).flatMap(..); // contains value state
streamD = streamA2.keyBy(ipAddr).window().process(); // contains value state
streamE = streamA.union(streamA2).keyBy(ipAddr)....
Here is the state example from my application:
private transient ValueState<SampleObject> sampleState;
StateTtlConfig ttlConfig = StateTtlConfig
.newBuilder(Time.minutes(10))
.setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
.setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
.build();
ValueStateDescriptor<SampleObject> sampleValueStateDescriptor = new ValueStateDescriptor<>(
"sampleState",
TypeInformation.of(SampleObject.class)
);
sampleValueStateDescriptor.enableTimeToLive(ttlConfig);
Rocksdb configuration:
state.backend: rocksdb
state.backend.rocksdb.checkpoint.transfer.thread.num: 6
state.backend.rocksdb.localdir: /pathTo/checkpoint_only_local
Why I am using Rocksdb
I am using RocksDB because I have a huge key space (think of IP addresses) that could not be handled by the heap state backend or the others.
My application also uses RocksDB because I keep a bunch of state in a user-defined KeyedProcessFunction for future decisions. (Each state has a StateTtlConfig.)
Note
My application does not need incremental checkpointing or anything related to savepoints. I don't care about keeping every snapshot of my application.
Will Flink ValueState be removed from storage after it expires when using RocksDB?
Yes, but not immediately. (And in some earlier versions of Flink, the answer was "it depends".)
In your state ttl config you haven't specified how you want state cleanup to be done. In this case, expired values are explicitly removed on read (such as ValueState#value) and are otherwise periodically garbage collected in the background. In the case of RocksDB, this background cleanup is done during compaction. In other words, the cleanup isn't immediate. The docs provide more details on how you can tune this -- you could configure the cleanup to be done more quickly, at the expense of some performance degradation.
A keyBy itself does not use any state. The key selector function is used to partition the stream, but the keys are not stored in connection with the keyBy. Only the window and flatMap operations keep state, which is per-key state, and all of this keyed state will be in RocksDB (unless you have configured your timers to be on the heap, which is an option, but in Flink 1.10 timers are stored off-heap, in RocksDB, by default).
You could change the flatMap to a KeyedProcessFunction and use timers to explicitly clear state for stale keys -- which would give you direct control over exactly when the state is cleared, rather than relying on the state TTL mechanism to eventually clear the state.
But it's more likely that the windows are building up considerable state. If you can switch to doing pre-aggregation (via reduce or aggregate) that may help a lot.
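To illustrate why pre-aggregation helps, here is a toy comparison in plain Java. It does not use Flink's window API; the two nested classes are made up for illustration. One models a window that buffers every element until it fires, the other models a window that folds each element into a single accumulator on arrival, which is roughly what reduce or aggregate lets a window do.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy comparison of per-window state with and without pre-aggregation; not Flink code. */
class WindowStateModel {

    /** Buffering window: keeps every element until the window fires. */
    static final class BufferingWindow {
        private final List<Long> elements = new ArrayList<>();

        void add(long value) {
            elements.add(value);
        }

        long fire() {
            long sum = 0;
            for (long value : elements) {
                sum += value;
            }
            return sum;
        }

        int storedValues() {
            return elements.size();
        }
    }

    /** Pre-aggregating window: folds each element into one accumulator on arrival. */
    static final class ReducingWindow {
        private long accumulator = 0;

        void add(long value) {
            accumulator += value;
        }

        long fire() {
            return accumulator;
        }

        int storedValues() {
            return 1; // only the accumulator is kept as state
        }
    }

    public static void main(String[] args) {
        BufferingWindow buffering = new BufferingWindow();
        ReducingWindow reducing = new ReducingWindow();
        for (long v = 1; v <= 1000; v++) {
            buffering.add(v);
            reducing.add(v);
        }
        System.out.println("same result: " + (buffering.fire() == reducing.fire()));
        System.out.println("stored values: " + buffering.storedValues()
                + " vs " + reducing.storedValues());
    }
}
```

Both variants produce the same result when the window fires, but the buffering one holds state proportional to the number of elements per key and window.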
EDIT2: Skip to the end: state is restored, but it's not queryable. New tl;dr: "How do I make state that was queryable still queryable after a restore from a checkpoint?"
I have a keyed stream with checkpointing enabled, similar to this (I've tried this with in-memory state as well as HDFS, with the same results):
env.enableCheckpointing(60000)
env.setStateBackend(new FsStateBackend("file:///flink-test"))
val stream = env.addSource(consumer)
.flatMap(new ValidationMap()).name("ValidationMap")
.keyBy(x => new Tuple3[String, String, String](x.account(), x.organization(), x.`type`()))
.flatMap(new Foo()).name(jobname)
Within this stream, I have a Managed Keyed State ValueState that I set as queryable.
val newValueStateDescriptor = new ValueStateDescriptor[java.util.ArrayList[java.util.ArrayList[Long]]]("foo", classOf[java.util.ArrayList[java.util.ArrayList[Long]]])
newValueStateDescriptor.setQueryable("foo")
valueState = getRuntimeContext.getState[java.util.ArrayList[java.util.ArrayList[Long]]](newValueStateDescriptor)
valueState.update(new java.util.ArrayList[java.util.ArrayList[Long]]())
This list is periodically appended to or removed from and the valueState is updated. When I make a request of the Queryable State I currently see correct values.
In my JobManager log I see checkpointing every minute, and when I check the file system, I see non-empty files being created.
My setup has 3 JobManagers (2 in standby), 3 TaskManagers (all 3 in use).
I put a single data point into the system and read it out of QueryableState, everything looks good. Then I pick a single TaskManager (not even the one that processed the data, any of the 3) and I kill it, then restart it to simulate a crash.
I watch the job get retried 2 or 3 times until the TaskManager comes back online, and finally I see the same JobID running again in Flink, life seems good.
But, I then hit the Queryable State again, and I get an UnknownKvStateLocation exception.
I'm really not quite sure what I've done wrong here. Things appear to be checkpointing, but I never manage to get my ValueState back. Maybe it's back but not queryable?
EDIT:
Log snippet from JobManager implies things are restored
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state RESTARTING to CREATED."}
{"level":"INFO","time":"2017-06-01 15:30:02,332","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Recovering checkpoints from ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Found 1 checkpoints in ZooKeeper."}
{"level":"INFO","time":"2017-06-01 15:30:02,333","class":"org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore","ndc":"", "message":"Trying to retrieve checkpoint 5."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.checkpoint.CheckpointCoordinator","ndc":"", "message":"Restoring from latest valid checkpoint: Checkpoint 5 # 1496330912627 for dc7850a6866f181c2f07968d35fe3d46."}
{"level":"INFO","time":"2017-06-01 15:30:02,340","class":"org.apache.flink.runtime.executiongraph.ExecutionGraph","ndc":"", "message":"Job Foo (dc7850a6866f181c2f07968d35fe3d46) switched from state CREATED to RUNNING."}
It really looks like it's restored, and when I inspect the file created in /flink-test I see some binary data but it contains the identifying names for my Queryable State ValueState. Any ideas on what to look for would be welcome.
EDIT2: State >is< restored, it's just not queryable!
The fact that a given piece of registered state has been made queryable is not (currently) part of what Flink records in checkpoints or savepoints. So after recovery, the state isn't queryable until a new StateDescriptor is provided.
For more, see this discussion on the flink-users mailing list.