TTL for state in ProcessWindowFunction - apache-flink

I would like to set a TTL on state in a ProcessWindowFunction. This state is shared across windows, and the TTL needs to be based on an attribute of the event itself, so I cannot compute it in the state descriptor. Also, onTimer is not available in a ProcessWindowFunction.
Is there any other way to achieve this?

If the time-to-live must be computed as a function of the event itself, then you can't use the state TTL mechanism.
The only alternative is to use timers with a KeyedProcessFunction rather than the window API. There's an example in the Flink documentation: https://ci.apache.org/projects/flink/flink-docs-stable/learn-flink/event_driven.html#example
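To make the timer idea concrete, here is a plain-Java model of the bookkeeping involved. It is not Flink API code: the class PerEventTtlModel and its methods are hypothetical stand-ins for what a KeyedProcessFunction would do with keyed state, timerService().registerEventTimeTimer(...), and the onTimer callback. The ttlMillis parameter is assumed to be read from the event.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model only: none of these types are Flink classes.
class PerEventTtlModel {
    // Stands in for keyed state: key -> latest value.
    private final Map<String, String> state = new HashMap<>();
    // Expiry currently in force for each key.
    private final Map<String, Long> expiryByKey = new HashMap<>();
    // Stands in for the timer service: timestamp -> keys with a timer at that time.
    private final TreeMap<Long, List<String>> timers = new TreeMap<>();

    // Analogue of processElement: store the value and register a timer at
    // eventTime + ttlMillis, where ttlMillis comes from the event itself.
    void processElement(String key, String value, long eventTime, long ttlMillis) {
        long expiry = eventTime + ttlMillis;
        state.put(key, value);
        expiryByKey.put(key, expiry);
        timers.computeIfAbsent(expiry, t -> new ArrayList<>()).add(key);
    }

    // Analogue of a watermark advancing: fire every timer at or before it.
    void advanceWatermark(long watermark) {
        while (!timers.isEmpty() && timers.firstKey() <= watermark) {
            Map.Entry<Long, List<String>> due = timers.pollFirstEntry();
            for (String key : due.getValue()) {
                onTimer(key, due.getKey());
            }
        }
    }

    // Analogue of onTimer: clear state only if this timer matches the latest
    // expiry, so a timer left over from an older event does not clear newer state.
    private void onTimer(String key, long timestamp) {
        Long expiry = expiryByKey.get(key);
        if (expiry != null && expiry == timestamp) {
            state.remove(key);
            expiryByKey.remove(key);
        }
    }

    // Read access for inspection; returns null once the state has expired.
    String get(String key) {
        return state.get(key);
    }
}
```

In a real job the two maps would be keyed state (for example ValueState) and the TreeMap would be Flink's own timer service. The check in onTimer is the part worth carrying over, since each new event registers a new timer while older timers for the same key may still be pending.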

Related

Cleanup configuration for ProcessWindowFunction's window state without TTL with RocksDB as backend

Flink offers TTL configuration for managed state and, when using RocksDB as backend, it executes cleanup in a custom compaction filter (if I understand correctly). However, in the case of keyed windowed state in a ProcessWindowFunction, the expectation is that we override the clear method and explicitly call something like context.windowState().*.clear()
If the state descriptor does not configure TTL, does cleanup still occur after the clear callback? If not, and cleanup for this type of state depends solely on sizes in RocksDB's levels, what's the default setting and is it configurable?
"If the state descriptor does not configure TTL, does cleanup still occur after the clear callback?"
Yes, unless the state descriptor was used to create state in the KeyedStateStore returned by ProcessWindowFunction.Context#globalState. This global state is the only state that is kept after windows are cleared. If you have an ever-growing key space, you should configure state TTL for any globalState you use; otherwise the globalState for stale keys will never be cleaned up.
FWIW, there's nothing RocksDB-specific about this. The answer is the same for any of the state backends.

Map State lifecycle in a keyed process function or windowed stream

Is MapState content cleared automatically after the window expires or when the onTimer function for that particular key is called, or does it have to be cleared manually, given that no TTL config is defined?
Any state you register yourself and that doesn't have TTL defined will be retained indefinitely.
Flink's built-in windows are cleaned up automatically, but windows you implement yourself in a KeyedProcessFunction using MapState need to be manually cleared when they are no longer useful.

Flink set timer on non keyed stream

Can Flink set timer on the non-keyed stream?
ProcessAllWindowFunction is a good option, but it cannot scale up the parallelism, which has to be 1.
I am looking for a non-keyed process function that can set a timer.
Flink's timers are only available within keyed process functions.
The standard answer to this question is to go ahead and key the stream, adding a field holding a random number to use as the key (if there isn't already a suitable way to implement a key selector).
If you can't live with the expense of a network shuffle, for event-time timers you could implement a custom operator that implements your logic in its processWatermark method.
And if you are looking for processing-time timers, you could roll your own.
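You could keyBy(_ => None), or keyBy() a constant, and still use timers. Note that with a single constant key every record goes to the same subtask, so this does not help with parallelism.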
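The random-key suggestion from the answer can be sketched in plain Java as follows. The names RandomKeySketch, Keyed, withRandomKey, and keyOf are hypothetical and none of this is Flink API; the point is only that the random number is assigned once and stored on the record, so that whatever later reads the key (a key selector, in Flink terms) returns the same value every time it is asked.

```java
import java.util.concurrent.ThreadLocalRandom;

// Plain-Java sketch only: none of these types are Flink classes.
class RandomKeySketch {
    // Hypothetical wrapper: the original event plus a bucket number assigned once.
    static final class Keyed<T> {
        final int bucket;
        final T value;

        Keyed(int bucket, T value) {
            this.bucket = bucket;
            this.value = value;
        }
    }

    // Assign the random bucket once, before keying the stream.
    // 'buckets' is an assumed tuning knob, e.g. a small multiple of the parallelism.
    static <T> Keyed<T> withRandomKey(T value, int buckets) {
        return new Keyed<>(ThreadLocalRandom.current().nextInt(buckets), value);
    }

    // What a key selector would return: a deterministic read of the stored
    // field, never a fresh random number.
    static <T> int keyOf(Keyed<T> record) {
        return record.bucket;
    }
}
```

In a job, the wrapping step would run in a map before keyBy, and the key selector would just read the stored bucket.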

About States and what is better for Flink

Let's assume that I have a job with max.parallelism=4 and a RichFlatMapFunction that works with MapState. What is the best way to create the MapStateDescriptor? Should I create it inside the RichFlatMapFunction, which means each instance of the class has its own descriptor, or create a single instance, for example a public static MapStateDescriptor descriptor in a separate class, and reference it from the RichFlatMapFunction? Doing it the second way I would have just one MapStateDescriptor instead of 4, or did I misunderstand something?
Kind regards!
A few points...
Since each of your RichFlatMapFunction sub-tasks can be running in a different JVM on a different server, how would they share a static MapStateDescriptor?
Note that Flink's "max parallelism" isn't the same as the default environment parallelism. In general you want to leave the max parallelism value alone, and (if necessary) set your environment parallelism equal to the number of slots in your cluster.
The MapStateDescriptor doesn't store state. It tells Flink how to create the state. Your RichFlatMapFunction's open() method is where you create the state using the state descriptor.
So net-net: don't bother using a static MapStateDescriptor, it won't help. Just create your state (as per many examples) in your open() method.

Flink add a TTL to an existing value state

For one of our Flink jobs, we found a piece of state that is causing a state leak. To fix this we need to add a TTL to that state, but we would like to keep the existing state (savepoint). If we add a TTL to a value state, would we still be able to use the existing savepoint? Thank you.
No, according to the docs this won't work:
"Trying to restore state, which was previously configured without TTL, using TTL enabled descriptor or vice versa will lead to compatibility failure and StateMigrationException."
However, you may be able to use the State Processor API to accomplish this.
Exactly how you should handle it depends on what kind of state it is, how it was serialized, and whether the operator has a UID.
