I am trying to de-duplicate events in my Flink pipeline using a Guava cache.
My requirement is to de-duplicate over a 1-minute window, but at any given point I want to keep no more than 10,000 elements in the cache.
A small background on my experiment with Flink windowing:
Tumbling Windows: I was able to implement this using tumbling windows + a custom trigger. But the problem is, if an element occurs at the 59th second and again at the 61st second, the two occurrences fall into different windows and the second one is not recognized as a duplicate.
Sliding Windows: I also tried sliding windows with a 10-second overlap + a custom trigger. But an element that arrives at the 55th second ends up in 5 different windows and is written to the sink 5 times.
Please let me know if I should not be seeing the above behavior with windowing.
Back to Guava:
I have an Event which looks like this and an EventsWrapper for these events which looks like this. I will be getting a stream of EventsWrappers, and I need to remove duplicate Events across different EventsWrappers.
For example, if I have 2 EventsWrappers like below:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
 EventsWrapper{id='ew2', org='org2', events=[Event{id='e1', name='event1'}, Event{id='e3', name='event3'}]}]
I should emit as output the following:
[EventsWrapper{id='ew1', org='org1', events=[Event{id='e1', name='event1'}, Event{id='e2', name='event2'}]},
 EventsWrapper{id='ew2', org='org2', events=[Event{id='e3', name='event3'}]}]
i.e. making sure that the e1 event is emitted only once, assuming these two events arrive within the time and size limits of the cache.
I created a RichFlatMapFunction where I initialize a Guava cache and a ValueState like this, and set the Guava cache in the ValueState like this. My overall pipeline looks like this.
But each time I try to update the Guava cache inside the ValueState:
eventsState.value().put(eventId, true);
I get the following error:
java.lang.NullPointerException
at com.google.common.cache.LocalCache.hash(LocalCache.java:1696)
at com.google.common.cache.LocalCache.put(LocalCache.java:4180)
at com.google.common.cache.LocalCache$LocalManualCache.put(LocalCache.java:4888)
at events.piepline.DeduplicatingFlatmap.lambda$flatMap$0(DeduplicatingFlatmap.java:59)
at java.base/java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:176)
On further digging, I found that the error occurs because the keyEquivalence inside the Guava cache is null.
I checked by putting directly into the Guava cache (not through state, but directly on the cache) and that works fine.
I felt this could be because ValueState is not able to serialize the Guava cache, so I added a serializer like this and registered it like this:
env.registerTypeWithKryoSerializer((Class<Cache<String,Boolean>>)(Class<?>)Cache.class, CacheSerializer.class);
But this didn't help either.
I have the following questions:
Any idea what I might be doing wrong with the Guava cache in the above case?
Is what I am seeing with my tumbling and sliding window implementations expected, or am I doing something wrong?
What will happen if I don't put the Guava cache in ValueState, and instead just keep it as a plain field of the DeduplicatingFlatmap class and operate on it directly? My understanding is that the Guava cache then won't be part of the checkpoint, so when the pipeline fails and restarts, the cache will have lost all its values and will be empty on restart. Is this understanding correct?
Thanks a lot in advance for the help.
Taking your questions in order:
For the Guava cache problem, see below.
Yes, these windows are behaving as expected.
Yes, your understanding is correct: a cache that is not held in Flink state won't be checkpointed and will be empty after a restart.
Even if you do get it working, using a Guava cache as ValueState will perform very poorly, because RocksDB is going to deserialize the entire cache on every access, and re-serialize it on every update.
Moreover, it looks like you are trying to share a single cache instance across all of the orgs that happen to be multiplexed across a single flatmap instance. That's not going to work, because the RocksDB state backend will make a copy of the cache for each org (a side effect of the serialization involved).
Your requirements aren't entirely clear; a deduplication query might help, but I'm thinking MapState in combination with timers in a KeyedProcessFunction is more likely to be the building block you need. Here's an example that might help you get started (though you'll want to handle the timers differently).
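A minimal sketch of that MapState-plus-timers pattern, keyed by whatever scope you want to de-duplicate within, could look like the following. The EventsWrapper/Event accessors and constructor are assumed, and the timer handling is deliberately naive:

import org.apache.flink.api.common.state.MapState;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch only: the EventsWrapper/Event getters and the EventsWrapper constructor are assumed.
public class DeduplicatingFunction
        extends KeyedProcessFunction<String, EventsWrapper, EventsWrapper> {

    private static final long TTL_MS = 60_000L; // 1-minute de-duplication horizon

    // eventId -> processing time when it was last seen
    private transient MapState<String, Long> seenEvents;

    @Override
    public void open(Configuration parameters) {
        seenEvents = getRuntimeContext().getMapState(
                new MapStateDescriptor<>("seen-events", String.class, Long.class));
    }

    @Override
    public void processElement(EventsWrapper wrapper, Context ctx, Collector<EventsWrapper> out)
            throws Exception {
        long now = ctx.timerService().currentProcessingTime();
        List<Event> fresh = new ArrayList<>();
        for (Event event : wrapper.getEvents()) {
            if (!seenEvents.contains(event.getId())) {
                fresh.add(event);
            }
            seenEvents.put(event.getId(), now);
        }
        if (!fresh.isEmpty()) {
            out.collect(new EventsWrapper(wrapper.getId(), wrapper.getOrg(), fresh));
        }
        // naive: registers a sweep timer per element; you'll want to handle this differently
        ctx.timerService().registerProcessingTimeTimer(now + TTL_MS);
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<EventsWrapper> out)
            throws Exception {
        // drop ids that haven't been seen within the last minute
        Iterator<Map.Entry<String, Long>> it = seenEvents.iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= timestamp - TTL_MS) {
                it.remove();
            }
        }
    }
}

You would key the incoming stream before applying this function, e.g. stream.keyBy(EventsWrapper::getOrg).process(new DeduplicatingFunction()). Note that the state is kept per key, so there is no single global 10,000-element bound here.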
Related
We have an application that consumes events from a Kafka source. The logic for processing each element needs to take into account the events that were previously received (having the same partition key), without using time for windowing. The first implementation used a GlobalWindow, with an AggregateFunction keeping the current state information and a trigger that always fires in the onElement call. I am guessing that the alternative of using a KeyedProcessFunction that holds the state in a ValueState object would be more adequate, since we are not really taking timing into account, nor using any custom triggering. Is this assumption correct, and are there any downsides to either one of these approaches?
I prefer using a KeyedProcessFunction in cases like this. It puts all of the related logic into one object -- rather than having to coordinate what's going on in a GlobalWindow, an AggregateFunction, and a Trigger (and perhaps also an Evictor). I find this results in implementations that are more maintainable and testable, plus you have more straightforward control over state management.
I don't see any advantages to a solution based on windows.
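For comparison, the GlobalWindow + AggregateFunction + always-firing Trigger arrangement collapses into something like the sketch below, where Event, Accumulator, and Result are placeholders for your own types:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Sketch only: Event, Accumulator, and Result stand in for your own types.
public class RunningAggregate extends KeyedProcessFunction<String, Event, Result> {

    private transient ValueState<Accumulator> accState;

    @Override
    public void open(Configuration parameters) {
        accState = getRuntimeContext().getState(
                new ValueStateDescriptor<>("acc", Accumulator.class));
    }

    @Override
    public void processElement(Event event, Context ctx, Collector<Result> out) throws Exception {
        Accumulator acc = accState.value();
        if (acc == null) {
            acc = new Accumulator();
        }
        acc.add(event);              // your per-element aggregation logic
        accState.update(acc);
        out.collect(acc.toResult()); // emit on every element, like a trigger firing in onElement
    }
}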
We are aggregating some data over 1 minute, which we then flush to a file. The data itself is like a map where the key is an object and the value is also an object.
Since we need to flush the data together, we are not doing any keyBy and are therefore using windowAll.
The problem we are facing is that we get better throughput if we use the window function with a ProcessAllWindowFunction and do the aggregation in the process call than when we use aggregate() with the window function. We are also seeing timeouts in state checkpointing when we use aggregate().
I tried to go through the code base, and the only hypothesis I could come up with is that it is probably easier to checkpoint the ListState that process() uses than the AggregatingState that aggregate() uses.
Is the hypothesis correct? Are we doing something wrong? If not, is there a way to improve the performance of aggregate()?
Based on what you've said, I'm going to jump to some conclusions.
I assume you are using the RocksDB state backend, and are aggregating each incoming event into some sort of collection. In that case, the RocksDB state backend has to deserialize that collection, add the new event to it, and then re-serialize it -- for every event. This is very expensive.
When you use a ProcessAllWindowFunction, each incoming event is appended to a ListState object, which has a very efficient implementation -- the serialized bytes for the new event are simply appended (the list doesn't have to be deserialized and re-serialized).
Checkpoints are timing out because the throughput is so poor.
Switching to the FsStateBackend would help. Or use a ProcessAllWindowFunction. Or implement your own windowing with a KeyedProcessFunction, and then use ListState or MapState for the aggregation.
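As a rough sketch of the ProcessAllWindowFunction route (Event, KeyObj, ValObj, and the merge logic are placeholders), the window simply buffers raw events in ListState and the map is built once per window, when it fires:

import java.util.HashMap;
import java.util.Map;

import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Sketch only: Event, KeyObj, ValObj, and the merge logic are placeholders.
public class BuildMapOnFlush
        extends ProcessAllWindowFunction<Event, Map<KeyObj, ValObj>, TimeWindow> {

    @Override
    public void process(Context ctx, Iterable<Event> elements, Collector<Map<KeyObj, ValObj>> out) {
        Map<KeyObj, ValObj> result = new HashMap<>();
        for (Event e : elements) {
            // combine events that share the same key; the elements were cheaply appended
            // to ListState as they arrived, so this cost is paid once per window
            result.merge(e.getKey(), e.getValue(), ValObj::combine);
        }
        out.collect(result);
    }
}

You would apply it with something like stream.windowAll(TumblingProcessingTimeWindows.of(Time.minutes(1))).process(new BuildMapOnFlush()).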
I would like to do a window aggregation with early trigger logic (you can think of the aggregation as being triggered either by the window closing or by a specific event), and I read the doc: https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/stream/operators/windows.html#incremental-window-aggregation-with-aggregatefunction
The doc mentions that "using ProcessWindowFunction for simple aggregates such as count is quite inefficient", so the suggestion is to pair it with incremental window aggregation.
My question is: for the AverageAggregate in the doc, the state is not saved anywhere, so if the application crashes, the AverageAggregate will lose all the intermediate values, right?
So if that is the case, is there a way to do a window aggregation that still supports incremental aggregation and has a state backend to recover from a crash?
The AggregateFunction is indeed only describing the mechanism for combining the input events into some result; that specific class does not store any data.
The state is persisted for us by Flink behind the scenes, though, when we write something like this:
input
    .keyBy(<key selector>)
    .window(<window assigner>)
    .aggregate(new AverageAggregate(), new MyProcessWindowFunction());
The .keyBy(<key selector>).window(<window assigner>) part tells Flink to hold a piece of state for us for each key and time bucket, and to call our code in AverageAggregate() and MyProcessWindowFunction() when relevant.
In case of a crash or restart, no data is lost (assuming the state backend is configured properly): as with other parts of Flink state, the state here will either be retrieved from the state backend or recomputed from first principles from upstream data.
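For reference, the AverageAggregate from that documentation page looks roughly like this; the Tuple2 accumulator (running sum and count) is exactly the piece of state that Flink persists on our behalf:

import org.apache.flink.api.common.functions.AggregateFunction;
import org.apache.flink.api.java.tuple.Tuple2;

// The accumulator holds (sum, count); Flink checkpoints it as part of the window state.
public class AverageAggregate
        implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {

    @Override
    public Tuple2<Long, Long> createAccumulator() {
        return new Tuple2<>(0L, 0L);
    }

    @Override
    public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
        return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
    }

    @Override
    public Double getResult(Tuple2<Long, Long> accumulator) {
        return ((double) accumulator.f0) / accumulator.f1;
    }

    @Override
    public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
        return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
    }
}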
I have an always-on application, listening to a Kafka stream and processing events. Events are part of a session, and I need to do calculations based on a session's data. I am running into a problem trying to correctly run my calculations due to the length of my sessions. 90% of my sessions are done after 5 minutes; 99% are done after 1 hour. Sessions may last more than a day, and because this is a real-time system, there is no determined end. Sessions are unique and should never collide.
I am looking for a way to process a window multiple times, either with an initial wait period and then processing any later events after that, or a pure process-per-event structure. I will need to keep all previous events around (ListState), as well as previously processed values (ValueState).
I previously thought allowedLateness would allow me to do this, but it seems the lateness is only considered relative to when the event should have been processed; it does not extend an actual window. GlobalWindows may also work, but I am unsure if there is a way to process a window multiple times. I believe I can use an evictor with GlobalWindows to purge the windows after a period of inactivity (although admittedly, I have not researched this yet, because I was unsure how to trigger a GlobalWindow multiple times).
Any suggestions on how to achieve what I am looking to do would be greatly appreciated, and I would be happy to clarify any points as needed.
If SessionWindows won't do the job, then you can use GlobalWindows with a custom Trigger and Evictor. The Trigger interface has onElement and timer-based callbacks that can fire whenever and as often as you like. If you go down this route, then yes, you'll also need to implement an Evictor to dispose of elements when they are no longer needed.
The documentation and the source code are helpful when trying to understand how this all fits together.
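As a minimal sketch of the Trigger half (the Evictor is left out), a trigger for GlobalWindows that emits the window contents on every incoming element could look like this:

import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Sketch only: fires the window function on every element; purging stale elements
// is left to an Evictor, as described above.
public class FireOnEveryElement extends Trigger<Object, GlobalWindow> {

    @Override
    public TriggerResult onElement(Object element, long timestamp, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) {
        // nothing to clean up in this sketch
    }
}

You would attach it with something like stream.keyBy(<key selector>).window(GlobalWindows.create()).trigger(new FireOnEveryElement()).evictor(<your evictor>).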
I am developing a Sitecore project that has several data import jobs running on a daily basis. Every time a job is executed, it may update a large number of Sitecore items (thousands), and I've noticed that all these edits trigger Solr index updates.
My concern is that I'm not really sure whether this is better than updating everything at the end of the job, so I would love to try both options. Could anyone tell me how I can use code to temporarily disable Lucene/Solr indexing and enable it later when I finish editing all items?
This is a common requirement, and you're right to have such concerns. In general it's considered good practice to disable indexing during big import jobs, then rebuild afterwards.
Assuming you're using Sitecore 7 or above, this is pretty much what you need:
IndexCustodian.PauseIndexing();
IndexCustodian.ResumeIndexing();
Here's a comprehensive article discussing this:
http://blog.krusen.dk/disable-indexing-temporarily-in-sitecore-7/
In addition to Martin's answer, you can pass silent=true when you finish editing the item, something like:
item.Editing.BeginEdit();
//Change fields values
item.Editing.EndEdit(true,true);
The second parameter of the EndEdit() method forces a silent update of the item, which means no events/indexing will be triggered on item save.
I feel this is safer than pausing indexing at the whole application level during the import process; you just skip indexing of the items you are updating.
EDIT:
In case you need to rebuild the index for the updated items after the import process is done, you can use the following code. It will index the content tree starting from RootItemInTree and below:
var index = Sitecore.ContentSearch.ContentSearchManager.GetIndex("Your_Index_Name");
index.Refresh(new SitecoreIndexableItem(RootItemInTree));
To disable indexing during large import/update tasks you should wrap your logic inside a BulkUpdateContext block. You can also use other wrappers like the EventDisabler to stop events from being fired if that is appropriate in your context. Alternatively you could wrap your code in an EditContext and set it to silent. So your code could end up something like this:
using (new BulkUpdateContext())
using (new EditContext(targetItem, false, true))
{
// insert update logic here...
}
Here is an older question that discusses this topic: Optimisation tips when migrating data into Sitecore CMS