Huge checkpoint size using ValueState leading to event processing lag - apache-flink

I have a Flink application that deduplicates multiple streams.
It keys each stream by one string field and dedupes it using ValueState inside a RichFilterFunction:
public class DedupeWithState extends RichFilterFunction<Tuple2<String, Message>> {

    private ValueState<Boolean> seen;
    private final ValueStateDescriptor<Boolean> desc;

    public DedupeWithState(long cacheExpirationTimeMs) {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.milliseconds(cacheExpirationTimeMs))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        desc.enableTimeToLive(ttlConfig);
    }

    @Override
    public void open(Configuration conf) {
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public boolean filter(Tuple2<String, Message> stringMessageTuple2) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            return true;
        }
        return false;
    }
}
The application consumes 3 streams from Kafka, and each stream has its own dedupe function with a TTL of 4 hours.
DataStream<Tuple2<String, Message>> event1 = event1Input
        .keyBy(x -> x.f0)
        .filter(new DedupeWithState(14400000));

DataStream<Tuple2<String, Message>> event2 = event2Input
        .keyBy(x -> x.f0)
        .filter(new DedupeWithState(14400000));

DataStream<Tuple2<String, Message>> event3 = event3Input
        .keyBy(x -> x.f0)
        .filter(new DedupeWithState(14400000));
Backend properties are:
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: <azure blob store>
[Screenshot: checkpoint configuration as shown in the Web UI]
We are using Flink 1.13.6.
The QPS of each stream is event1 - 7k, event2 - 6k, event3 - 200
Key size is ~110 bytes
Checkpoint interval is 5 mins and incremental checkpoint is enabled.
As per the above configs (given that incremental checkpointing is enabled), each stream should add roughly the following state per 5-minute checkpoint:
event1 -> (7000 events/sec * 60 * 5) * 110 bytes = ~220 MB
event2 -> (6000 events/sec * 60 * 5) * 110 bytes = ~190 MB
event3 -> (200 events/sec * 60 * 5) * 110 bytes = ~6 MB
The issue is that the checkpoint size is huge. It starts at around 400 MB (as expected, roughly the sum above) but grows to 2-3 GB per checkpoint (see the "Checkpoint history" screenshot). This results in heavy backpressure in the dedupe function and overall lag in the system (see the "Checkpoint per operator" screenshot).

Maybe the state is not being cleaned up, since cleanup is done lazily (on read). From the initial State TTL release post (a bit old, but it may still stand):
When a state object is accessed in a read operation, Flink will check its timestamp and clear the state if it is expired (depending on the configured state visibility, the expired state is returned or not). Due to this lazy removal, expired state that is never accessed again will forever occupy storage space unless it is garbage collected.
I would try a MapState per stream (without keying by) instead of a ValueState per key as you have now, so that the same state is continuously accessed. Alternatively, you could set up a timer in DedupeWithState that accesses the state and forces the cleanup, or that simply clears it (you may need a ProcessFunction to be able to set up timers).
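If you want to keep the ValueState-with-TTL approach, one more knob worth trying (my suggestion, not part of the original answer) is RocksDB's background TTL cleanup, which removes expired entries during compaction instead of waiting for a read. A minimal sketch, reusing the TTL config from DedupeWithState:

// Same TTL config as in the question, plus background cleanup in the RocksDB
// compaction filter. The argument is how many state entries the filter
// processes before re-reading the current timestamp (1000 is the default).
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.milliseconds(cacheExpirationTimeMs))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        .cleanupInRocksdbCompactFilter(1000)
        .build();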

Try something like this -
/**
 * @author sucheth.shivakumar
 */
public class Check extends KeyedProcessFunction<String, Tuple2<String, Message>, Tuple2<String, Message>> {

    private ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) throws Exception {
        // no TTL config here; the timer below clears the state instead
        ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(Tuple2<String, Message> value, Context ctx, Collector<Tuple2<String, Message>> out)
            throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            // emit the first occurrence of this key
            out.collect(value);
            // schedule cleanup 4 hours from now (processing time)
            ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 14400000);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Message>> out)
            throws Exception {
        // fires after 4 hrs have passed and clears the state for this key
        seen.clear();
    }
}
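Wiring it in would look roughly like this (a sketch based on the streams from the question; the keyBy stays the same, only filter becomes process):

DataStream<Tuple2<String, Message>> event1 = event1Input
        .keyBy(x -> x.f0)
        .process(new Check());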

Related

Change in Behavior for EventTimeSessionWindows from Flink 1.11.1 to 1.14.0

I observed what appears to be a change in behavior for EventTimeSessionWindows when upgrading from 1.11.1 to 1.14.0. This was identified in a unit test.
For a window with a defined time gap of 10 seconds:
Publish KEY_1 with eventtime 1 second
Publish KEY_1 with eventtime 3 seconds
Publish KEY_1 with eventtime 2 seconds
Publish KEY_2 with eventtime 101 seconds
For Flink 1.11.1, the window for KEY_1 closes, reduces, and publishes, presumably because KEY_2 has an event time more than 10 seconds after the last message in KEY_1's window. KEY_2's window does not close. In the absence of KEY_2, the KEY_1 window would not close.
For Flink 1.14.0, the main difference is that the window for KEY_2 DOES close, even though there are no new messages after 111 seconds.
This appears to be a change in behavior. The nearest I could find was https://issues.apache.org/jira/browse/FLINK-20443 but that’s in 1.14.1. I also noticed https://issues.apache.org/jira/browse/FLINK-19777 which was in 1.11.3 but couldn't ascertain if that would have resulted in this behavior. Is there an explanation for this change in behavior? Is it expected or desirable? Is it because all pending windows are automatically closed based on an updated trigger behavior?
I tested the same behavior for ProcessingTimeSessionWindows and did not observe a similar change in behavior.
Thanks.
Jai
@Test
public void testEventTime() throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // configure your test environment
    env.setParallelism(1);
    env.getConfig().registerTypeWithKryoSerializer(Document.class, ProtobufSerializer.class);
    // values are collected in a static variable
    CollectSink.values.clear();
    // create a stream of custom elements and apply transformations
    SingleOutputStreamOperator<Tuple2<String, Document>> inputStream = buildStream(env, this.generateTestOrders());
    SingleOutputStreamOperator<Tuple2<String, Document>> intermediateStream = this.documentDebounceFunction.insertIntoPipeline(inputStream);
    intermediateStream.addSink(new CollectSink());
    // execute
    env.execute();
    // verify your results
    Assertions.assertEquals(1, CollectSink.values.size());
    Map<String, Long> expectedVersions = Maps.newHashMap();
    expectedVersions.put(KEY_1, 2L);
    for (Tuple2<String, Document> actual : CollectSink.values) {
        Assertions.assertEquals(expectedVersions.get(actual.f0), actual.f1.getVersion());
    }
}

// create a testing sink
private static class CollectSink implements SinkFunction<Tuple2<String, Document>> {
    // must be static
    public static final List<Tuple2<String, Document>> values = Collections.synchronizedList(new ArrayList<>());

    @Override
    public void invoke(Tuple2<String, Document> value, SinkFunction.Context context) {
        values.add(value);
    }
}
public List<Tuple2<String, Document>> generateTestOrders() {
    List<Tuple2<String, Document>> testMessages = Lists.newArrayList();
    // KEY_1
    testMessages.add(
        Tuple2.of(
            KEY_1,
            Document.newBuilder()
                .setVersion(1)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(1).build())
                .build()));
    testMessages.add(
        Tuple2.of(
            KEY_1,
            Document.newBuilder()
                .setVersion(2)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(3).build())
                .build()));
    testMessages.add(
        Tuple2.of(
            KEY_1,
            Document.newBuilder()
                .setVersion(3)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(2).build())
                .build()));
    // KEY_2 -- WAY IN THE FUTURE
    testMessages.add(
        Tuple2.of(
            KEY_2,
            Document.newBuilder()
                .setVersion(15)
                .setUpdatedAt(Timestamp.newBuilder().setSeconds(101).build())
                .build()));
    return ImmutableList.copyOf(testMessages);
}
private SingleOutputStreamOperator<Tuple2<String, Document>> buildStream(
        StreamExecutionEnvironment executionEnvironment,
        List<Tuple2<String, Document>> inputMessages) {
    inputMessages =
        inputMessages.stream()
            .sorted(
                Comparator.comparingInt(
                    msg -> (int) ProtobufTypeConversion.toMillis(msg.f1.getUpdatedAt())))
            .collect(Collectors.toList());
    WatermarkStrategy<Tuple2<String, Document>> watermarkStrategy =
        WatermarkStrategy.forMonotonousTimestamps();
    return executionEnvironment
        .fromCollection(
            inputMessages, TypeInformation.of(new TypeHint<Tuple2<String, Document>>() {}))
        .assignTimestampsAndWatermarks(
            watermarkStrategy.withTimestampAssigner(
                (event, timestamp) -> Timestamps.toMillis(event.f1.getUpdatedAt())));
}

Flink watermark not advancing at all? Stuck at -9223372036854775808

I'm encountering a similar issue to Flink EventTime Processing Watermark is always coming as -9223372036854725808. However, the suggested solutions (setting parallelism and disabling checkpointing) have no effect. In this example, I'm simply streaming 1000 events 1 second apart, and then comparing the event timestamp to ctx.timerService().currentWatermark():
>>> v=(61538659200000,0), watermark=-9223372036854775808
>>> v=(61538659201000,1), watermark=-9223372036854775808
>>> v=(61538660198000,998), watermark=-9223372036854775808
>>> v=(61538660199000,999), watermark=-9223372036854775808
public void watermarks() throws Exception {
    final var env = StreamExecutionEnvironment.createLocalEnvironment();
    env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.setMaxParallelism(1);
    final long startMs = new Date(2020, 1, 1).getTime();
    final var events = new ArrayList<Tuple2<Long, Integer>>();
    for (var ii = 0; ii < 1000; ++ii) {
        events.add(new Tuple2<Long, Integer>(startMs + ii * 1000, ii));
    }
    env.fromCollection(events)
        .assignTimestampsAndWatermarks(
            WatermarkStrategy.<Tuple2<Long, Integer>>forMonotonousTimestamps()
                .withTimestampAssigner((event, ts) -> event.f0))
        .setParallelism(1)
        .keyBy(row -> row.f1 % 2)
        .process(new ProcessFunction<Tuple2<Long, Integer>, String>() {
            @Override
            public void processElement(
                    final Tuple2<Long, Integer> value,
                    final Context ctx,
                    final Collector<String> out) throws Exception {
                out.collect("v=" + value + ", watermark=" + ctx.timerService().currentWatermark());
            }
        })
        .setParallelism(1)
        .print()
        .setParallelism(1);
    final var result = env.execute();
    System.out.println(result);
}
forMonotonousTimestamps is a periodic watermark generator that only generates watermarks when triggered by a timer. By default this timer fires every 200 msec (this is the autoWatermarkInterval). Your job doesn't run long enough for this timer to fire.
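If you just want the periodic generator to fire sooner in a short-running test, you can shrink that interval (a workaround sketch, not a production recommendation):

// fire the periodic watermark generator every 1 ms instead of every 200 ms
env.getConfig().setAutoWatermarkInterval(1);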
Bounded sources do generate a watermark with its timestamp set to MAX_WATERMARK when they reach the end of their input -- just before shutting down the job. You're not seeing this watermark in the output from your job because there are no events that follow it.
If you want to generate watermarks with every event, you can implement a custom watermark strategy that emits a watermark in the onEvent method of the WatermarkGenerator (docs). This is usually a bad idea in production, as you'll waste CPU cycles and network bandwidth on these extra watermarks, but sometimes it's helpful for testing.
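A minimal sketch of such a strategy for the Tuple2<Long, Integer> events above (the variable name is mine):

// Emits a watermark for every event -- fine for tests, wasteful in production.
WatermarkStrategy<Tuple2<Long, Integer>> perEventWatermarks =
        context -> new WatermarkGenerator<Tuple2<Long, Integer>>() {
            @Override
            public void onEvent(Tuple2<Long, Integer> event, long eventTimestamp, WatermarkOutput output) {
                // emit a watermark immediately for each event
                output.emitWatermark(new Watermark(eventTimestamp));
            }

            @Override
            public void onPeriodicEmit(WatermarkOutput output) {
                // nothing to do; watermarks are emitted per event
            }
        };

// then, in the job:
// .assignTimestampsAndWatermarks(perEventWatermarks.withTimestampAssigner((event, ts) -> event.f0))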
According to source code comments:
/**
 * Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
 * created {@link WatermarkGenerator}.
 *
 * <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
 * stream for that amount of time, then that partition is considered "idle" and will not hold
 * back the progress of watermarks in downstream operators.
 *
 * <p>Idleness can be important if some partitions have little data and might not have events
 * during some periods. Without idleness, these streams can stall the overall event time
 * progress of the application.
 */
default WatermarkStrategy<T> withIdleness(Duration idleTimeout) ...
So, you can try to use WatermarkStrategy.forMonotonousTimestamps().withIdleness(...).
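Applied to the job above, that would look something like this (the 30-second timeout is an arbitrary value for illustration):

.assignTimestampsAndWatermarks(
        WatermarkStrategy.<Tuple2<Long, Integer>>forMonotonousTimestamps()
                .withIdleness(Duration.ofSeconds(30))
                .withTimestampAssigner((event, ts) -> event.f0))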

Flink counter with timestamp

I was reading the Flink example CountWithTimestamp; below is a code snippet from the example:
@Override
public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // retrieve the current count
    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    }
    // update the state's count
    current.count++;
    // set the state's timestamp to the record's assigned event time timestamp
    current.lastModified = ctx.timestamp();
    // write the state back
    state.update(current);
    // schedule the next timer 60 seconds from the current event time
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // get the state for the key that scheduled the timer
    CountWithTimestamp result = state.value();
    // check if this is an outdated timer or the latest timer
    if (timestamp == result.lastModified + 60000) {
        // emit the state on timeout
        out.collect(new Tuple2<String, Long>(result.key, result.count));
    }
}
My question is: if I remove the if statement timestamp == result.lastModified + 60000 in onTimer (leaving the collect statement untouched), and instead add another if statement if (ctx.timestamp() < current.lastModified + 60000) { ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000); } at the beginning of processElement, would the semantics of the program be the same? Is there any preference for one version over the other if the semantics are the same?
You are correct to think that the implementation that deletes the timer has the same semantics. And in fact I recently changed the example used in our training materials to do just that, as I prefer this approach. The reason I find it preferable is that all of the complex business logic is then in one place (in processElement), and whenever onTimer is called, you know exactly what to do, no questions asked. Plus, it's more performant, as there are fewer timers to checkpoint and eventually trigger.
This example was written for the docs back before timers could be deleted, and hasn't been updated.
You can find the reworked example I mentioned in these slides -- https://training.ververica.com/decks/process-function/ -- once you get past the registration page.
FWIW, I also recently reworked the reference solution to the corresponding training exercise along the same lines: https://github.com/apache/flink-training/tree/master/long-ride-alerts.
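For reference, here is a sketch of the delete-the-timer variant described above, based on the same CountWithTimestamp example (since only one timer is ever live per key, onTimer needs no staleness check):

@Override
public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    } else {
        // cancel the timer registered by the previous event for this key
        ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000);
    }
    current.count++;
    current.lastModified = ctx.timestamp();
    state.update(current);
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // the only timer that can fire is the latest one, so just emit
    CountWithTimestamp result = state.value();
    out.collect(new Tuple2<>(result.key, result.count));
}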

MapState does not store the previous session with EventTimeSessionWindows in Flink java

I need to compare the previous session to averages from different sessions for the same user. I'm using MapState to keep the previous session, but somehow the MapState never contains any previous keys, so every session is treated as new. Here's my code:
SessionIdentificationProcessFunction (this is a function that gathers all the events that belong to the same session):
static SingleOutputStreamOperator<SessionEvent> sessionUser(KeyedStream<Event, String> stream) {
    return stream.window(EventTimeSessionWindows.withGap(Time.minutes(PropertyFileReader.getGAP_SECTION())))
        .allowedLateness(Time.minutes(PropertyFileReader.getLATENCY_ALLOWED()))
        .process(new SessionIdentificationProcessFunction<Event, SessionEvent, String, TimeWindow>() {

            @Override
            public void open(Configuration parameters) {
                /* state configured to live just one day to avoid garbage accumulation */
                StateTtlConfig ttlConfig = StateTtlConfig
                    .newBuilder(org.apache.flink.api.common.time.Time.days(1))
                    .cleanupFullSnapshot()
                    .build();
                MapStateDescriptor<String, SessionEvent> map_descriptor = new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);
                map_descriptor.enableTimeToLive(ttlConfig);
                previous_user_sessions_state = getRuntimeContext().getMapState(map_descriptor);
            }

            @Override
            public SessionEvent generateSessionRecord(String s, Context context, Iterable<Event> elements) {
                Comparator<Event> sortFunc = Comparator.comparing(o -> o.timestamp);
                Event start = StreamSupport.stream(elements.spliterator(), false).min(sortFunc).orElse(new Event());
                Event end = StreamSupport.stream(elements.spliterator(), false).max(sortFunc).orElse(new Event());
                SessionEvent session_user = (end.timestamp.equals(Timestamp.from(Instant.EPOCH))) ? new SessionEvent(start) : new SessionEvent(end);
                session_user.sessionEvents = StreamSupport.stream(elements.spliterator(), false).count();
                session_user.sessionDuration = sd; // sd is computed elsewhere in the original code
                try {
                    if (previous_user_sessions_state.contains(s)) {
                        SessionEvent previous = previous_user_sessions_state.get(s);
                        /* update the new session with values from the previous one (this branch is
                           never reached) and remove the old entry so the updated one can be inserted */
                        previous_user_sessions_state.remove(s);
                    } else {
                        /* always ends up here, creating a brand-new session */
                    }
                    previous_user_sessions_state.put(s, session_user);
                } catch (Exception e) {
                    e.printStackTrace();
                }
                return session_user;
            }
        })
        .name("User Sessions");
}
Without seeing how SessionIdentificationProcessFunction is implemented, I'm not sure exactly what's going wrong, but Flink's session windows are rather special, so it's not terribly surprising that this isn't working. Part of the problem is that any given session window has a very short lifetime before it is merged with another session window. (As each new event arrives it is initially assigned to its own session window, after which the set of all current session windows is processed and any possible merges are performed (based on the session gap).)
What I can recommend is rather than using getRuntimeContext().getMapState(), use context.globalState().getMapState() instead (where context is the ProcessWindowFunction.Context passed to the process() method of a ProcessWindowFunction). This globalState is a KeyedStateStore meant for precisely this purpose -- keeping keyed state that is global/shared among all window instances for that key.
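A minimal sketch of that suggestion, assuming a plain ProcessWindowFunction (the class and field names here are illustrative, and SessionEvent construction is elided):

public class SessionWithHistory
        extends ProcessWindowFunction<Event, SessionEvent, String, TimeWindow> {

    private static final MapStateDescriptor<String, SessionEvent> PREV_SESSIONS =
            new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);

    @Override
    public void process(String key, Context context, Iterable<Event> elements,
                        Collector<SessionEvent> out) throws Exception {
        // globalState() is keyed state shared across all window instances for this key
        MapState<String, SessionEvent> previousSessions =
                context.globalState().getMapState(PREV_SESSIONS);

        SessionEvent current = new SessionEvent(); // build from elements as in the question
        SessionEvent previous = previousSessions.get(key); // non-null once a session has closed
        if (previous != null) {
            // compare the current session against the previous one here
        }
        previousSessions.put(key, current);
        out.collect(current);
    }
}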

How to print an aggregated DataStream in flink?

I have a custom state calculation represented as Set<Long>, and it keeps getting updated as my DataStream<Set<Long>> sees new events from Kafka. Every time my state is updated, I want to print the updated state to stdout. I'm wondering how to do that in Flink; I'm a little confused by all the window and trigger operations, and I keep getting the following error.
Caused by: java.lang.RuntimeException: Record has Long.MIN_VALUE timestamp (= no timestamp marker). Is the time characteristic set to 'ProcessingTime', or did you forget to call 'DataStream.assignTimestampsAndWatermarks(...)'?
I just want to know how to print my aggregated stream DataStream<Set<Long>> to stdout, or write it back to another Kafka topic.
Below is the snippet of the code that throws the error.
StreamTableEnvironment bsTableEnv = StreamTableEnvironment.create(env, bsSettings);
DataStream<Set<Long>> stream = bsTableEnv.toAppendStream(kafkaSourceTable, Row.class);
stream
    .aggregate(new MyCustomAggregation(100))
    .process(new ProcessFunction<Set<Long>, Object>() {
        @Override
        public void processElement(Set<Long> value, Context ctx, Collector<Object> out) throws Exception {
            System.out.println(value.toString());
        }
    });
Keeping collections in state with Flink can be very expensive, because in some cases the collection will be frequently serialized and deserialized. When possible it is preferable to use Flink's built-in ListState and MapState types.
Here's an example illustrating a few things:
public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    env.fromElements(1L, 2L, 3L, 4L, 3L, 2L, 1L, 0L)
        .keyBy(x -> 1)
        .process(new KeyedProcessFunction<Integer, Long, List<Long>>() {
            private transient MapState<Long, Boolean> set;

            @Override
            public void open(Configuration parameters) throws Exception {
                set = getRuntimeContext().getMapState(new MapStateDescriptor<>("set", Long.class, Boolean.class));
            }

            @Override
            public void processElement(Long x, Context context, Collector<List<Long>> out) throws Exception {
                if (set.contains(x)) {
                    System.out.println("set contains " + x);
                } else {
                    set.put(x, true);
                    List<Long> list = new ArrayList<>();
                    Iterator<Long> iter = set.keys().iterator();
                    iter.forEachRemaining(list::add);
                    out.collect(list);
                }
            }
        })
        .print();

    env.execute();
}
Note that I wanted to use keyed state, but didn't have anything in the events to use as a key, so I just keyed the stream by a constant. This is normally not a good idea, as it prevents the processing from being done in parallel -- but since you are aggregating as a Set, that's not something you can do in parallel, so no harm done.
I'm representing the set of Longs as the keys of a MapState object. And when I want to output the set, I collect it as a List. When I just want to print something for debugging, I just use System.out.
What I see when I run this job in my IDE is this:
[1]
[1, 2]
[1, 2, 3]
[1, 2, 3, 4]
set contains 3
set contains 2
set contains 1
[0, 1, 2, 3, 4]
If you'd rather see what's in the MapState every second, you can use a Timer in the process function.
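For example, a sketch of that timer-based variant, extending the process function above (for a debugging sketch I don't bother de-duplicating the timers that each element registers):

@Override
public void processElement(Long x, Context context, Collector<List<Long>> out) throws Exception {
    // schedule a dump of the state roughly one second from now
    context.timerService().registerProcessingTimeTimer(
            context.timerService().currentProcessingTime() + 1000);
    if (!set.contains(x)) {
        set.put(x, true);
    }
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<List<Long>> out) throws Exception {
    // print the current contents of the MapState
    List<Long> list = new ArrayList<>();
    set.keys().iterator().forEachRemaining(list::add);
    System.out.println("state at " + timestamp + ": " + list);
    // keep the once-per-second cadence going
    ctx.timerService().registerProcessingTimeTimer(timestamp + 1000);
}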
