Flink - Memory Consumption always increasing - apache-flink

NOTE: As per David's suggestion, I have updated my Flink project to v1.12.3. Flink seems to have made some improvements in that version, and my problem is now resolved.
My current Flink application runs with 48 task slots on 3 nodes, using RocksDB for state management. (I do not care about Flink's savepoint and checkpoint mechanisms; I am just creating state with a TTL of roughly 5 minutes.)
However, memory consumption on all nodes keeps increasing, and I have to stop the Flink application via stop-cluster.sh and then start it again.
I have many keyed streams keyed by client IP address, and millions of users visit my site daily.
Some of the keyed streams use StateTtlConfig, while others use the onTimer mechanism.
My assumption about the memory consumption (or leak) is this: each call to registerProcessingTimeTimer creates an entry that is held in memory, and because there are many IP addresses there will be many such entries, so memory consumption keeps increasing. Is that right?
Should I drop the onTimer solution and use only StateTtlConfig? (I am using the onTimer method because with StateTtlConfig every state update also refreshes the TTL, which produces invalid data in my application.)
Examples of the state management:
// EXAMPLE FOR STATETTLCONFIG
public class State1 extends KeyedProcessFunction<Tuple, ..., ...> {
    private transient ValueState<Integer> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.minutes(2))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .cleanupInBackground()
                .build();
        ValueStateDescriptor<Integer> valueStateDesc = new ValueStateDescriptor<Integer>(
                ..
        );
        valueStateDesc.enableTimeToLive(ttlConfig);
        state = getRuntimeContext().getState(valueStateDesc);
    }

    @Override
    public void processElement(LogObject value, Context ctx, Collector<LogObject> out) throws Exception {
        Integer stateVal = state.value();
        // do something and update state
    }
}
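Since the question mentions RocksDB, one more TTL cleanup option may be worth noting (a sketch, not part of the original post; the compaction-filter cleanup is available in recent Flink releases): expired entries can also be dropped while RocksDB compacts its files, instead of relying only on the generic background cleanup above.

    StateTtlConfig ttlConfig = StateTtlConfig
            .newBuilder(Time.minutes(2))
            .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
            .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
            // re-run the TTL check after every 1000 state entries processed during compaction
            .cleanupInRocksdbCompactFilter(1000)
            .build();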
// EXAMPLE FOR ONTIMER METHOD
public class State2 extends KeyedProcessFunction<Tuple, ..., ...> {
    private transient ValueState<Integer> state;

    @Override
    public void open(Configuration parameters) {
        ValueStateDescriptor<Integer> stateDesc = new ValueStateDescriptor<>(
                ...);
        state = getRuntimeContext().getState(stateDesc);
    }

    @Override
    public void processElement(LogObject value, Context ctx, Collector<LogObject> out) throws Exception {
        Integer stateVal = state.value();
        if (stateVal == null) {
            stateVal = 0;
            ctx.timerService().registerProcessingTimeTimer(value.getTimestamp() + 5 * 60 * 1000); // 5 mins
        }
        stateVal++;
        // do something and update state
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<LogObject> out) {
        state.clear();
    }
}
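If the sheer number of pending timers turns out to be the issue, one documented mitigation (a sketch, not from the original post) is timer coalescing: round the target timestamp down, e.g. to full seconds. Flink keeps at most one timer per key and timestamp, so repeated registrations at the same rounded timestamp collapse into a single entry.

    long target = value.getTimestamp() + 5 * 60 * 1000;
    long coalesced = (target / 1000) * 1000; // round down to the full second
    ctx.timerService().registerProcessingTimeTimer(coalesced);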

Related

Busy time is too high for simple process function

Finally, after a month of research, I found the main reason.
The main reason was IP2Location. I am using the IP2Location Java library to look up IP address locations in BIN files. At peak times it caused the problem. I can at least avoid the problem by passing the IP2Proxy.IOModes.IP2PROXY_MEMORY_MAPPED parameter before reading the BIN files.
I also found that a few state objects do not conform to Flink's POJO rules, which causes high load.
I am using Flink v1.13. There are 4 task managers (16 CPUs each) with 3800 tasks (the default application parallelism is 28).
In my application, one operator always has a high busy time (around 80%-90%).
If I restart the Flink application, the busy time decreases, but after 5-10 hours of running it increases again.
In Grafana, I can see that the busy time for ProcessStream increases. Here is the Prometheus query: avg((avg_over_time(flink_taskmanager_job_task_busyTimeMsPerSecond[1m]))) by (task_name)
There is no backpressure on the ProcessStream task. To measure backpressure time, I am using: flink_taskmanager_job_task_backPressuredTimeMsPerSecond
But I couldn't find any reason for it.
Here is the code:
private void processOne(DataStream<KafkaObject> kafkaLog) {
    kafkaLog
            .filter(new FilterRequest())
            .name(FilterRequest.class.getSimpleName())
            .map(new MapToUserIdAndTimeStampMs())
            .name(MapToUserIdAndTimeStampMs.class.getSimpleName())
            .keyBy(UserObject::getUserId) // getUserId() returns an int
            .process(new ProcessStream())
            .name(ProcessStream.class.getSimpleName())
            .addSink(...);
}
// ...
// ...
public class ProcessStream extends KeyedProcessFunction<Integer, UserObject, Output> {

    private static final long STATE_TIMER = 5 * 60 * 1000; // 5 min in milliseconds
    private static final int AVERAGE_REQUEST = 74;
    private static final int STANDARD_DEVIATION = 32;
    private static final int MINIMUM_REQUEST = 50;
    private static final int THRESHOLD = 70;

    private transient ValueState<Tuple2<Integer, Integer>> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        ValueStateDescriptor<Tuple2<Integer, Integer>> stateDescriptor = new ValueStateDescriptor<Tuple2<Integer, Integer>>(
                ProcessStream.class.getSimpleName(),
                TypeInformation.of(new TypeHint<Tuple2<Integer, Integer>>() {}));
        state = getRuntimeContext().getState(stateDescriptor);
    }

    @Override
    public void processElement(UserObject value, KeyedProcessFunction<Integer, UserObject, Output>.Context ctx, Collector<Output> out) throws Exception {
        Tuple2<Integer, Integer> stateValue = state.value();
        if (Objects.isNull(stateValue)) {
            stateValue = Tuple2.of(1, 0);
            ctx.timerService().registerProcessingTimeTimer(value.getTimestampMs() + STATE_TIMER);
        }
        int totalRequest = stateValue.f0;
        int currentScore = stateValue.f1;
        if (totalRequest >= MINIMUM_REQUEST && currentScore >= THRESHOLD) {
            out.collect({convert_to_output});
            state.clear();
        } else {
            stateValue.f0 = totalRequest + 1;
            stateValue.f1 = calculateNextScore(stateValue.f0);
            state.update(stateValue);
        }
    }

    private int calculateNextScore(int totalRequest) {
        return (totalRequest - AVERAGE_REQUEST) / STANDARD_DEVIATION;
    }

    @Override
    public void onTimer(long timestamp, KeyedProcessFunction<Integer, UserObject, Output>.OnTimerContext ctx, Collector<Output> out) throws Exception {
        state.clear();
    }
}
Since you're using a timestamp value from your incoming record (value.getTimestampMs() + STATE_TIMER), you want to be running with event time and setting watermarks based on that incoming record's timestamp. Otherwise you have no idea when the timer actually fires, as the record's timestamp might be completely different from the current processing time.
This means you also want to use .registerEventTimeTimer().
Without these changes you might be filling up the TM heap with uncleared state, which can lead to high CPU load.
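A minimal sketch of those two changes, assuming the UserObject#getTimestampMs() accessor and the pipeline from the question (.name(...) calls omitted):

    WatermarkStrategy<UserObject> watermarks = WatermarkStrategy
            .<UserObject>forBoundedOutOfOrderness(Duration.ofSeconds(5))
            .withTimestampAssigner((record, previous) -> record.getTimestampMs());

    kafkaLog
            .filter(new FilterRequest())
            .map(new MapToUserIdAndTimeStampMs())
            .assignTimestampsAndWatermarks(watermarks) // event time now follows the record's own timestamp
            .keyBy(UserObject::getUserId)
            .process(new ProcessStream())
            .addSink(...);

    // and inside ProcessStream#processElement:
    ctx.timerService().registerEventTimeTimer(value.getTimestampMs() + STATE_TIMER);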

Process Function Event Time Behaviour

I am referring to the Process Function example mentioned in https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/dev/datastream/operators/process_function/
/**
 * The data type stored in the state
 */
public class CountWithTimestamp {
    public String key;
    public long count;
    public long lastModified;
}

/**
 * The implementation of the ProcessFunction that maintains the count and timeouts
 */
public class CountWithTimeoutFunction
        extends KeyedProcessFunction<Tuple, Tuple2<String, String>, Tuple2<String, Long>> {

    /** The state that is maintained by this process function */
    private ValueState<CountWithTimestamp> state;

    @Override
    public void open(Configuration parameters) throws Exception {
        state = getRuntimeContext().getState(new ValueStateDescriptor<>("myState", CountWithTimestamp.class));
    }

    @Override
    public void processElement(
            Tuple2<String, String> value,
            Context ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {

        // retrieve the current count
        CountWithTimestamp current = state.value();
        if (current == null) {
            current = new CountWithTimestamp();
            current.key = value.f0;
        }

        // update the state's count
        current.count++;

        // set the state's timestamp to the record's assigned event time timestamp
        current.lastModified = ctx.timestamp();

        // write the state back
        state.update(current);

        // schedule the next timer 60 seconds from the current event time
        ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
    }

    @Override
    public void onTimer(
            long timestamp,
            OnTimerContext ctx,
            Collector<Tuple2<String, Long>> out) throws Exception {

        // get the state for the key that scheduled the timer
        CountWithTimestamp result = state.value();

        // check if this is an outdated timer or the latest timer
        if (timestamp == result.lastModified + 60000) {
            // emit the state on timeout
            out.collect(new Tuple2<String, Long>(result.key, result.count));
        }
    }
}
In this scenario, my datastream is produced by a KafkaSource with no idleness behaviour configured:
DataStream<Tuple2<Integer, Integer>> inputStream = env.fromSource(inputSource, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(1)), "Input Kafka Source");
Now consider a scenario where only one key is being emitted by the source, let's say key1.
At time T1, when the first event comes, processElement is called and the CountWithTimestamp object is set accordingly, i.e. count = 1 and lastModified = T1.
Now there are no more events for, let's say, 70 seconds (until T2). Then another event comes in for the same key key1.
Now here are my questions:
During my debugging, when the second event comes, processElement always gets called first and then onTimer. This is because the watermark is generated only after the event has been processed. Is my understanding correct?
Since processElement is called first, lastModified is updated to T2 (earlier it was T1). This means that even when the timer now fires, it won't emit anything, because lastModified was updated. And it will never emit if the above scenario keeps repeating.
Thanks.
I believe you've got that right.
Yes, watermarks follow the events that justify their creation.
Yes, that example is flawed. It makes (unstated) assumptions about there being events for other keys.
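For what it's worth, the outdated-timer bookkeeping can be avoided entirely (a sketch, not from the docs example) by deleting the previous timer whenever a newer event arrives, so each key has at most one pending timer and onTimer can emit unconditionally:

    // inside processElement, replacing the docs example's update logic
    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    } else {
        // remove the timer registered for the previous event
        // (deleting an already-fired timer is a harmless no-op)
        ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000);
    }
    current.count++;
    current.lastModified = ctx.timestamp();
    state.update(current);
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);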

Flink KeyedProcessFunction integrating anonymous methods

I am attempting to write a KeyedProcessFunction; the code looks like this:
DataStream<Tuple2<Long, Integer>> busyMachinesPerWindow = busyMachines
    // group by timestamp (window end)
    .keyBy(event -> event.getField(1))
    .process(new KeyedProcessFunction<Tuple1<Long>, Tuple3<Long, Long, Long>, Tuple2<Long, Integer>>() {

        private ValueState<Integer> state;

        @Override
        public void open(Configuration config) throws IOException {
            // initialize the state descriptors here
            state = getRuntimeContext().getState(new ValueStateDescriptor<>("machine-counts", Integer.class));
            if (state.value() == null) {
                state.update(0);
            }
        }

        @Override
        public void processElement(Tuple3<Long, Long, Long> inWindow, Context ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
            if (state.value() != null) {
                state.update(state.value() + 1);
            } else {
                state.update(1);
            }
            ctx.timerService().registerEventTimeTimer(inWindow.f1);
        }

        @Override
        public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
            int counter = state.value();
            state.clear();
            // we can now output the window and the machine count
            out.collect(new Tuple2<>(((Tuple1<Long>) ctx.getCurrentKey()).f0, counter));
        }
    });
However, this pops up an error saying it "cannot derive anonymous method". I don't see what the problem is with this code. Is there some type ambiguity that I am not handling right?
One problem with this code is that you are calling state.value() and state.update(0) in the open method. This is not allowed. These methods can only be used in processElement and in onTimer, because only then is there a specific event being processed whose key can be used to access/update the appropriate entry in the state backend.
An instance of a KeyedProcessFunction is multiplexed across all of the keys assigned to a given task slot. The open method is called just once, at a time when there is no specific key in the runtime context, so the state cannot be accessed or updated at this time.
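In other words, open() should only obtain the state handle, and the zero-initialization folds into processElement, where the current key is in context. A sketch against the code above:

    @Override
    public void open(Configuration config) {
        // only obtain the state handle here; no state.value() / state.update() yet
        state = getRuntimeContext().getState(new ValueStateDescriptor<>("machine-counts", Integer.class));
    }

    @Override
    public void processElement(Tuple3<Long, Long, Long> inWindow, Context ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
        Integer current = state.value(); // null on the first element for this key
        state.update(current == null ? 1 : current + 1);
        ctx.timerService().registerEventTimeTimer(inWindow.f1);
    }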

How to check if a MapState is empty in flink 1.8

I have an application where I read all the data from a DB the first time and add it to MapState. Here is my RichCoFlatMapFunction:
private transient MapState<String, Record> mapState;

@Override
public void open(Configuration parameters) throws Exception {
    mapState = getRuntimeContext().getMapState(new MapStateDescriptor<String, Record>("recordState",
            TypeInformation.of(new TypeHint<String>(){}), TypeInformation.of(new TypeHint<Record>() {})));
}

@Override
public void flatMap1(Record record, Collector<OutputRecord> collector) throws Exception {
    readForFirstTime();
    mapState.put(record.getId(), record);
}

@Override
public void flatMap2(Item item, Collector<OutputRecord> collector) throws Exception {
    readForFirstTime();
    Record record = mapState.get(item.getId());
    System.out.println("Item arrived at time:" + item.getTimestamp() +
            ". Record at the exact same time:" + record.toString());
}

private void readForFirstTime() {
    // I need a mechanism here to detect if recordState is empty
    // then only listAllFromDB
    for (Record record : listAllFromDB) {
        mapState.put(record.getId(), record);
    }
}
So when I start my application from a snapshot, I assume the MapState will contain data and I do not want to read from the DB again. How can I check whether the MapState is empty or contains data?
If I understand correctly, you want to load the database data only once; usually you would do this in the open() method. Or you can use another MapState for the database data and use MapState::isEmpty().
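If MapState::isEmpty() is not available in your Flink version, the keys() iterator gives the same information. A sketch of the lazy-load guard, reusing listAllFromDB from the question:

    private void readForFirstTime() throws Exception {
        // a single hasNext() is enough to tell whether any entry exists for this key
        boolean stateIsEmpty = !mapState.keys().iterator().hasNext();
        if (stateIsEmpty) {
            for (Record record : listAllFromDB) {
                mapState.put(record.getId(), record);
            }
        }
    }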

onTimer method, why timer state is null?

I have a simple application like this (inside a keyed process function).
As you can see in the code section below, I always first get the timerObject from state, and if it does not exist I create a new one and update the state. Thus, the state should never be empty/null.
Basically this state just keeps the time the object was last seen. For example:
If an object was seen at time 10:15, the registered timer will be for 10:30.
However, if the object is seen again at time 10:25, the registered timer will be updated to 10:40.
If the process function runs onTimer at 10:40, that means there was no object within the 15-minute interval, so I just clear my state.
The problem is that the logger sometimes prints null for the state object. This should not be the case, right?
public class ProcessRule extends KeyedProcessFunction<Tuple, LogEntity, Result> {

    private static final Logger LOGGER = LoggerFactory.getLogger(ProcessRule.class);
    private transient ValueState<TimerObject> timerState;

    @Override
    public void open(Configuration parameters) throws Exception {
        ValueStateDescriptor<TimerObject> timerValueStateDescriptor = new ValueStateDescriptor<TimerObject>(
                "timerStateForProcessRule",
                TypeInformation.of(TimerObject.class)
        );
        timerState = getRuntimeContext().getState(timerValueStateDescriptor);
    }

    @Override
    public void processElement(LogEntity value, Context ctx, Collector<Result> out) throws Exception {
        registerTimer(value, ctx);
        if (conditionTrue) {
            // convert to Result and add to the collector
        }
    }

    private void registerTimer(LogEntity element, Context ctx) throws Exception {
        TimerObject stateTimer = timerState.value();
        if (stateTimer == null) {
            stateTimer = new TimerObject();
            long timeInterval = 15 * 60 * 1000;
            stateTimer.setTimeInterval(timeInterval);
        }
        stateTimer.setCurrentTimeInMilliseconds(element.getTimestampMs());
        timerState.update(stateTimer);
        ctx.timerService().registerProcessingTimeTimer(stateTimer.getNextTimer());
        // getNextTimer => currentTime + timeInterval
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Result> out) throws Exception {
        TimerObject stateTimer = timerState.value();
        LOGGER.info("Timer fired at timestamp: {} for: {}", timestamp, stateTimer);
        timerState.clear();
    }
}
The issue here most probably comes from the fact that you are registering multiple different timers but don't delete them when registering new ones. So when the first timer fires, timerState is cleared, but the second timer may fire right after it, since it might have been registered to fire, say, 3 seconds after the first one, and at that point timerState is already null.
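A sketch of that fix, assuming getNextTimer() from the question returns the timestamp the timer was registered for: delete the previously registered timer before registering the new one, so only the most recent timer can ever fire.

    private void registerTimer(LogEntity element, Context ctx) throws Exception {
        TimerObject stateTimer = timerState.value();
        if (stateTimer == null) {
            stateTimer = new TimerObject();
            stateTimer.setTimeInterval(15 * 60 * 1000);
        } else {
            // drop the still-pending timer from the previous element
            // (deleting a timer that has already fired is a no-op)
            ctx.timerService().deleteProcessingTimeTimer(stateTimer.getNextTimer());
        }
        stateTimer.setCurrentTimeInMilliseconds(element.getTimestampMs());
        timerState.update(stateTimer);
        ctx.timerService().registerProcessingTimeTimer(stateTimer.getNextTimer());
    }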
