MapState does not store the previous session with EventTimeSessionWindows in Flink java - apache-flink

I need to compare the previous session to averages from different sessions for the same user. I'm using MapState to keep the previous session, but somehow the MapState never contains any previous keys, so every session is treated as new. Here's my code:
SessionIdentificationProcessFunction (this is a function that gathers all the events that belong to the same session):
static SingleOutputStreamOperator<SessionEvent> sessionUser(KeyedStream<Event, String> stream) {
    return stream.window(EventTimeSessionWindows.withGap(Time.minutes(PropertyFileReader.getGAP_SECTION())))
            .allowedLateness(Time.minutes(PropertyFileReader.getLATENCY_ALLOWED()))
            .process(new SessionIdentificationProcessFunction<Event, SessionEvent, String, TimeWindow>() {

                @Override
                public void open(Configuration parameters) {
                    /* state configured to live just one day to avoid garbage accumulation */
                    StateTtlConfig ttlConfig = StateTtlConfig
                            .newBuilder(org.apache.flink.api.common.time.Time.days(1))
                            .cleanupFullSnapshot()
                            .build();
                    MapStateDescriptor<String, SessionEvent> map_descriptor =
                            new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);
                    map_descriptor.enableTimeToLive(ttlConfig);
                    previous_user_sessions_state = getRuntimeContext().getMapState(map_descriptor);
                }

                @Override
                public SessionEvent generateSessionRecord(String s, Context context, Iterable<Event> elements) {
                    Comparator<Event> sortFunc = (o1, o2) -> ((o1.timestamp.before(o2.timestamp)) ? 0 : 1);
                    Event start = StreamSupport.stream(elements.spliterator(), false).max(sortFunc).orElse(new Event());
                    Event end = StreamSupport.stream(elements.spliterator(), false).max(sortFunc).orElse(new Event());
                    SessionEvent session_user = (end.timestamp.equals(Timestamp.from(Instant.EPOCH)))
                            ? new SessionEvent(start) : new SessionEvent(end);
                    session_user.sessionEvents = StreamSupport.stream(elements.spliterator(), false).count();
                    session_user.sessionDuration = sd;
                    try {
                        if (previous_user_sessions_state.contains(s)) {
                            SessionEvent previous = previous_user_sessions_state.get(s);
                            /* Update the session with the values of the previous session (which never exists),
                               and delete the previous session in the map to create a new entry with the updated values */
                            previous_user_sessions_state.remove(s);
                        } else {
                            /* always gets here and creates a new session */
                        }
                        previous_user_sessions_state.put(s, session_user);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                    return session_user;
                }
            })
            .name("User Sessions");
}

Without seeing how SessionIdentificationProcessFunction is implemented, I'm not sure exactly what's going wrong, but Flink's session windows are rather special, so it's not terribly surprising that this isn't working. Part of the problem is that any given session window has a very short lifetime before it is merged with another session window. (As each new event arrives it is initially assigned to its own session window, after which the set of all current session windows is processed and any possible merges are performed (based on the session gap).)
What I can recommend is rather than using getRuntimeContext().getMapState(), use context.globalState().getMapState() instead (where context is the ProcessWindowFunction.Context passed to the process() method of a ProcessWindowFunction). This globalState is a KeyedStateStore meant for precisely this purpose -- keeping keyed state that is global/shared among all window instances for that key.
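As a rough sketch of that approach (Event and SessionEvent stand in for the question's types, and the session-building logic is elided):
import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class SessionToGlobalState
        extends ProcessWindowFunction<Event, SessionEvent, String, TimeWindow> {

    private MapStateDescriptor<String, SessionEvent> descriptor;

    @Override
    public void open(Configuration parameters) {
        descriptor = new MapStateDescriptor<>("prevMapUserSession", String.class, SessionEvent.class);
    }

    @Override
    public void process(String key, Context context, Iterable<Event> elements,
                        Collector<SessionEvent> out) throws Exception {
        // globalState() is keyed state shared across all windows for this key,
        // so it survives the merging of the short-lived session windows
        MapState<String, SessionEvent> previousSessions = context.globalState().getMapState(descriptor);

        SessionEvent current = new SessionEvent(elements.iterator().next()); // build the session record from the window's events
        SessionEvent previous = previousSessions.get(key);                   // null the first time this key is seen
        if (previous != null) {
            // compare/enrich current with previous here
        }
        previousSessions.put(key, current);
        out.collect(current);
    }
}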

Related

Room Database Create Instance

I want to create an instance of a Room database in a Composable, but
val db = Room.databaseBuilder(applicationContext, UserDatabase::class.java, "users.db").build()
is not working there because applicationContext isn't available.
How do I create an instance of a context in a Composable?
Have you tried getting the context with val context = LocalContext.current and then adding this to get your applicationContext?
Like this: context.applicationContext, or simply using val db = Room.databaseBuilder(context, UserDatabase::class.java, "users.db").build()
Room (and the underlying SQLiteOpenHelper) only needs the context to open the database (or, more correctly, to instantiate the underlying SQLiteOpenHelper).
Room/Android's SQLiteOpenHelper uses the context to ascertain the application's standard (recommended) location (data/data/<the_package_name>/databases). For example, in the following demo (as seen via Device Explorer), the database, while still open, consists of 3 files (the -wal and -shm files are the Write Ahead Logging files that will at some point be committed/written to the actual database; SQLite handles that).
So, roughly speaking, Room only needs the context so that it can ascertain /data/data/a.a.so75008030kotlinroomgetinstancewithoutcontext/databases/testit.db (in the case of the demo).
So if you cannot use the applicationContext method, you can circumvent the need to provide the context by using a singleton approach, AND only after the singleton has been instantiated.
Perhaps consider this demo:-
First some pretty basic DB stuff (a table (@Entity annotated class), DAO functions and a @Database annotated abstract class WITH singleton approach), BUT with some additional functions for accessing the instance without the context.
@Entity
data class TestIt(
    @PrimaryKey
    val testIt_id: Long? = null,
    val testIt_name: String
)

@Dao
interface DAOs {
    @Insert(onConflict = OnConflictStrategy.IGNORE)
    fun insert(testIt: TestIt): Long
    @Query("SELECT * FROM testit")
    fun getAllTestItRows(): List<TestIt>
}

@Database(entities = [TestIt::class], exportSchema = false, version = 1)
abstract class TestItDatabase: RoomDatabase() {
    abstract fun getDAOs(): DAOs

    companion object {
        private var instance: TestItDatabase? = null

        /* Extra/not typical for without a context (if wanted) */
        fun isInstanceWithoutContextAvailable(): Boolean {
            return instance != null
        }

        /******************************************************/
        /* Extra/not typical for without a context            */
        /******************************************************/
        fun getInstanceWithoutContext(): TestItDatabase? {
            if (instance != null) {
                return instance as TestItDatabase
            }
            return null
        }

        /* Typically the only function */
        fun getInstance(context: Context): TestItDatabase {
            if (instance == null) {
                instance = Room.databaseBuilder(context, TestItDatabase::class.java, "testit.db")
                    .allowMainThreadQueries() /* for convenience/brevity of demo */
                    .build()
            }
            return instance as TestItDatabase
        }
    }
}
And to demonstrate (within an activity for brevity) :-
class MainActivity : AppCompatActivity() {
    lateinit var roomInstance: TestItDatabase
    lateinit var dao: DAOs

    override fun onCreate(savedInstanceState: Bundle?) {
        super.onCreate(savedInstanceState)
        setContentView(R.layout.activity_main)
        roomInstance = TestItDatabase.getInstance(this) /* MUST be used before withoutContext functions but could be elsewhere; shown here for brevity */
        dao = roomInstance.getDAOs()
        //dao.insert(TestIt(testIt_name = "New001")) /* Removed to test actually doing the database open with the without context */
        logDataWithoutContext()
        addRowWithoutContext()
        addRowWithApplicationContext()
        logDataWithoutContext()
    }

    private fun logDataWithoutContext() {
        Log.d("${TAG}_LDWC", "Room DB Instantiated = ${TestItDatabase.isInstanceWithoutContextAvailable()}")
        for (t in TestItDatabase.getInstanceWithoutContext()!!.getDAOs().getAllTestItRows()) {
            Log.d("${TAG}_LDWC_DATA", "TestIt Name is ${t.testIt_name} ID is ${t.testIt_id}")
        }
    }

    private fun addRowWithoutContext() {
        Log.d("${TAG}_LDWC", "Room DB Instantiated = ${TestItDatabase.isInstanceWithoutContextAvailable()}")
        if (TestItDatabase.getInstanceWithoutContext()!!.getDAOs()
                .insert(TestIt(System.currentTimeMillis(), "NEW AS PER ID (the time to millis) WITHOUT CONTEXT")) > 0) {
            Log.d("${TAG}_ARWC_OK", "Row successfully inserted.")
        } else {
            Log.d("${TAG}_ARWC_OUCH", "Row was not successfully inserted (duplicate ID)")
        }
    }

    private fun addRowWithApplicationContext() {
        TestItDatabase.getInstance(applicationContext).getDAOs().insert(TestIt(System.currentTimeMillis() / 1000, "NEW AS PER ID (the time to seconds) WITH CONTEXT"))
    }
}
The result output to the log showing that the database access, either way, worked:-
2023-01-05 12:45:39.020 D/DBINFO_LDWC: Room DB Instantiated = true
2023-01-05 12:45:39.074 D/DBINFO_LDWC: Room DB Instantiated = true
2023-01-05 12:45:39.077 D/DBINFO_ARWC_OK: Row successfully inserted.
2023-01-05 12:45:39.096 D/DBINFO_LDWC: Room DB Instantiated = true
2023-01-05 12:45:39.098 D/DBINFO_LDWC_DATA: TestIt Name is NEW AS PER ID (the time to seconds) WITH CONTEXT ID is 1672883139
2023-01-05 12:45:39.098 D/DBINFO_LDWC_DATA: TestIt Name is NEW AS PER ID (the time to millis) WITHOUT CONTEXT ID is 1672883139075
Note that the shorter ID was added last but appears first because it comes earlier in the index that the SQLite Query Optimiser would have used (i.e. the Primary Key).
The two rows have basically the same date/time to the second, but the first insert used milliseconds whilst the insert via addRowWithApplicationContext drops the milliseconds (it divides by 1000).

Huge checkpoint size using ValueState leading to event processing lag

I have an application in Flink which does deduplication of multiple streams.
It keys by one string field and dedupes by using value state in a RichFilterFunction.
public class DedupeWithState extends RichFilterFunction<Tuple2<String, Message>> {
    private ValueState<Boolean> seen;
    private final ValueStateDescriptor<Boolean> desc;

    public DedupeWithState(long cacheExpirationTimeMs) {
        StateTtlConfig ttlConfig = StateTtlConfig
                .newBuilder(Time.milliseconds(cacheExpirationTimeMs))
                .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
                .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
                .build();
        desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        desc.enableTimeToLive(ttlConfig);
    }

    @Override
    public void open(Configuration conf) {
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public boolean filter(Tuple2<String, Message> stringMessageTuple2) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            return true;
        }
        return false;
    }
}
The application consumes 3 streams from Kafka, and each stream has its own dedupe function with a TTL of 4 hours.
DataStream<Tuple2<String, Message>> event1 = event1Input
        .keyBy(x -> x.f0)
        .filter(new DedupeWithState(14400000));

DataStream<Tuple2<String, Message>> event2 = event2Input
        .keyBy(x -> x.f0)
        .filter(new DedupeWithState(14400000));

DataStream<Tuple2<String, Message>> event3 = event3Input
        .keyBy(x -> x.f0)
        .filter(new DedupeWithState(14400000));
Backend properties are:
state.backend: rocksdb
state.backend.incremental: true
state.checkpoints.dir: <azure blob store>
(Screenshot: checkpoint configuration as shown in the Web UI.)
We are using Flink 1.13.6.
The QPS of each stream is event1 - 7k, event2 - 6k, event3 - 200
Key size is ~110 bytes
Checkpoint interval is 5 mins and incremental checkpoint is enabled.
As per the above configs (given that incremental checkpointing is enabled), each stream should have roughly the following checkpoint size:
event1 -> ((7000 * 60 * 5) * 110bytes) = ~220MB
The issue is that the checkpoint size is very large. It starts at around 400 MB (as expected) but grows to 2-3 GB per checkpoint (see the checkpoint history screenshot). This results in huge backpressure in the dedupe function and overall lag in the system (see the per-operator checkpoint screenshot).
Maybe the state is not being cleaned since it is done lazily (on read). From the initial release post (a bit old but may still stand):
When a state object is accessed in a read operation, Flink will check its timestamp and clear the state if it is expired (depending on the configured state visibility, the expired state is returned or not). Due to this lazy removal, expired state that is never accessed again will forever occupy storage space unless it is garbage collected.
I would try with a MapState per stream (without keying by) instead of a ValueState per key as you have now so the same state is continuously accessed. Or you may also be able to set up a timer in DedupeWithState that accesses the state and forces the cleanup (you may need to use a ProcessFunction to be able to set up timers) or that simply clears it.
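Another thing worth trying (not something mentioned above, and only an assumption on my part given that the job uses RocksDB with incremental checkpoints) is enabling background cleanup of expired state during RocksDB compaction, so cleanup doesn't rely solely on reads. A sketch of the TTL config from DedupeWithState with that strategy added:
StateTtlConfig ttlConfig = StateTtlConfig
        .newBuilder(Time.milliseconds(cacheExpirationTimeMs))
        .setUpdateType(StateTtlConfig.UpdateType.OnCreateAndWrite)
        .setStateVisibility(StateTtlConfig.StateVisibility.NeverReturnExpired)
        // also run the TTL filter during RocksDB compaction, re-checking the current
        // timestamp after every 1000 processed state entries
        .cleanupInRocksdbCompactFilter(1000)
        .build();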
Try something like this -
/**
 * @author sucheth.shivakumar
 */
public class Check extends KeyedProcessFunction<String, Tuple2<String, Message>, Tuple2<String, Message>> {

    private ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) throws Exception {
        ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(Tuple2<String, Message> value, Context ctx,
                               Collector<Tuple2<String, Message>> out) throws Exception {
        if (seen.value() == null) {
            seen.update(true);
            // emits the record
            out.collect(value);
            // register a timer to clear the state 4 hours from now (processing time)
            ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 14400000);
        }
    }

    @Override
    // this fires after 4 hrs have passed and clears the state
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Message>> out)
            throws Exception {
        // triggers after the ttl has passed
        if (seen.value() != null) {
            seen.clear();
        }
    }
}
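Wiring it in would look roughly like this (a sketch; it replaces the filter from the question with the keyed process function above):
DataStream<Tuple2<String, Message>> event1 = event1Input
        .keyBy(x -> x.f0)
        .process(new Check());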

Flink counter with timestamp

I was reading the Flink example CountWithTimestamp, and below is a code snippet from the example:
@Override
public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // retrieve the current count
    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    }

    // update the state's count
    current.count++;

    // set the state's timestamp to the record's assigned event time timestamp
    current.lastModified = ctx.timestamp();

    // write the state back
    state.update(current);

    // schedule the next timer 60 seconds from the current event time
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}

@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    // get the state for the key that scheduled the timer
    CountWithTimestamp result = state.value();

    // check if this is an outdated timer or the latest timer
    if (timestamp == result.lastModified + 60000) {
        // emit the state on timeout
        out.collect(new Tuple2<String, Long>(result.key, result.count));
    }
}
My question: if I remove the if statement timestamp == result.lastModified + 60000 in onTimer (leaving the collect statement untouched), and instead add if (ctx.timestamp() < current.lastModified + 60000) { ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000); } at the beginning of processElement, would the semantics of the program be the same? And is there any preference for one version over the other if the semantics are the same?
You are correct to think that the implementation that deletes the timer has the same semantics. And in fact I recently changed the example used in our training materials to do just that, as I prefer this approach. The reason I find it preferable is that all of the complex business logic is then in one place (in processElement), and whenever onTimer is called, you know exactly what to do, no questions asked. Plus, it's more performant, as there are fewer timers to checkpoint and eventually trigger.
This example was written for the docs back before timers could be deleted, and hasn't been updated.
You can find the reworked example I mentioned in these slides -- https://training.ververica.com/decks/process-function/ -- once you get past the registration page.
FWIW, I also recently reworked the reference solution to the corresponding training exercise along the same lines: https://github.com/apache/flink-training/tree/master/long-ride-alerts.
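A rough sketch of what that restructured version might look like, based on the snippet above (not the exact training example):
@Override
public void processElement(Tuple2<String, String> value, Context ctx, Collector<Tuple2<String, Long>> out)
        throws Exception {
    CountWithTimestamp current = state.value();
    if (current == null) {
        current = new CountWithTimestamp();
        current.key = value.f0;
    } else {
        // a timer is still pending for the previous event; cancel it before
        // registering a new one, so at most one timer per key is live
        ctx.timerService().deleteEventTimeTimer(current.lastModified + 60000);
    }

    current.count++;
    current.lastModified = ctx.timestamp();
    state.update(current);
    ctx.timerService().registerEventTimeTimer(current.lastModified + 60000);
}

// onTimer can then emit unconditionally: any timer that fires is the latest one
@Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<String, Long>> out) throws Exception {
    CountWithTimestamp result = state.value();
    out.collect(new Tuple2<>(result.key, result.count));
}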

Flink session window with getting result on end

I have Kafka messages with something like the following pattern:
{ user: 'someUser', value: 'SomeValue' , timestamp:000000000}
and a Flink stream calculation that does some counting on those items.
Now I want to declare a session that collects the same user + value within a range of X seconds into a single record with the latest timestamp, which is then forwarded to the next stream just once.
So I wrote something like this:
data.assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Data>() {
        .....
    })
    .keyBy(new KeySelector<Data, String>() {
        .......
    })
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .aggregate(new AggregateFunction<Data, Data, Data>() {
        @Override
        public Data createAccumulator() {
            return null;
        }

        @Override
        public Data add(Data value, Data accumulator) {
            if (accumulator == null) {
                accumulator = value;
            }
            return accumulator;
        }

        @Override
        public Data getResult(Data accumulator) {
            return accumulator;
        }

        @Override
        public Data merge(Data a, Data b) {
            return a;
        }
    });
But the problem is that the getResult function is called on each element, not just at the end of the window.
My problem is how to avoid forwarding the aggregation result to the next stream until the end of the window. As far as I know, the processed stream result also moves forward when there are no more elements, even though the window hasn't ended yet.
Any advice?
Thanks
Flink provides two distinct approaches for evaluating windows. In this case you want to use the other one.
One approach evaluates each window's contents incrementally. This is what you get with reduce and aggregate. As elements are assigned to the window, the ReduceFunction or AggregateFunction is called and that element immediately makes its contribution to the final result.
The alternative is to use process with a ProcessWindowFunction. With this approach, the window isn't evaluated until the window is complete, at which point the ProcessWindowFunction is called once with an Iterable containing all of the elements that were assigned to the window. This has the disadvantage of needing to store all of the elements until the window is triggered, and if the ProcessWindowFunction has to do a lot of work to compute its result that can temporarily disrupt the pipeline, but some calculations need to be done this way -- like counting distinct elements.
See the documentation for more info.
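For example, a minimal sketch of that approach for this case (assuming Data has a numeric timestamp field, and the watermarking/keying from the question):
data.assignTimestampsAndWatermarks(...)   // as in the question
    .keyBy(...)                           // user + value, as in the question
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .process(new ProcessWindowFunction<Data, Data, String, TimeWindow>() {
        @Override
        public void process(String key, Context context, Iterable<Data> elements, Collector<Data> out) {
            Data latest = null;
            for (Data d : elements) {
                // keep the element with the largest timestamp (field name assumed)
                if (latest == null || d.timestamp > latest.timestamp) {
                    latest = d;
                }
            }
            // called exactly once, when the session window fires
            out.collect(latest);
        }
    });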

Too many callouts: 11 in Database.Batchable

I have a code segment in my Database.Batchable class:
if(pageToken != null)
    deleteEvents(accessToken, pageToken, c.CalendarId__c);
}
and the deleteEvents function code is:
public static void deleteEvents(String accessToken, String pageToken, String CalendarId){
    while(pageToken != null)
    {
        String re = GCalendarUtil.doApiCall(null,'GET','https://www.googleapis.com/calendar/v3/calendars/'+CalendarId+'/events?pageToken='+pageToken,accessToken);
        System.debug('next page response is'+ re);
        re = re.replaceAll('"end":','"end1":');
        re = re.replaceAll('"dateTime":','"dateTime1":');
        JSON2Apex1 aa = JSON2Apex1.parse(re);
        List<JSON2Apex1.Items> ii = aa.items;
        System.debug('size of ii'+ii.size());
        pageToken = aa.nextPageToken;
        List<String> event_id = new List<String>();
        if(ii != null){
            for(JSON2Apex1.Items i: ii){
                event_id.add(i.id);
            }
        }
        for(String ml: event_id)
            System.debug('Hello Worlds'+ml);
        for(String s: event_id)
        {
            GCalendarUtil.doApiCall(null,'DELETE','https://www.googleapis.com/calendar/v3/calendars/'+CalendarId+'/events/'+s,accessToken);
        }
    }
}
Because there are nearly 180 events, I am getting this error. For deleting them I have the option of creating a custom object with a text field (the id of the event) and then creating one more Database.Batchable class for deleting them, passing them in batches of 9. Is there any other method so that I am able to do more than 100 callouts within a Database.Batchable implementation?
10 is the max. And it doesn't seem like Google's calendar API supports mass delete.
You could try chunking your batch job (the size of list of records passed to each execute() call). For example something like this:
global Database.QueryLocator start(Database.BatchableContext bc){
    return Database.getQueryLocator([SELECT Id FROM Event]); // anything that loops through Events and not say Accounts
}

global void execute(Database.BatchableContext BC, List<Event> scope){
    // your deletes here
}

Database.executeBatch(new MyBatchClass(), 10); // this will make sure to work on 10 events (or less) at a time
Another option would be to read up on daisy-chaining of batches: in execute() you'd be deleting up to 10, and in the finish() method you'd check whether there's still more work to do and fire the batch again.
P.S. Don't use hardcoded "10". Use Limits.getCallouts() and Limits.getLimitCallouts() so it will automatically update if Salesforce ever increases the limit.
In a trigger you could, for example, set up 10 @future methods, each with a fresh limit of 10 callouts... Still, a batch sounds like a better idea.
