I have an always-on application that listens to a Kafka stream and processes events. Events are part of a session, and I need to do calculations based on a session's data. I am running into a problem trying to run my calculations correctly because of the length of my sessions: 90% of my sessions are done after 5 minutes and 99% are done after 1 hour, but a session may last more than a day, and because this is a real-time system there is no determined end. Sessions are unique and should never collide.
I am looking for a way to process a window multiple times, either with an initial wait period and processing of any later events after that, or a pure process-per-event structure. I will need to keep all previous events around (ListState), as well as previously processed values (ValueState).
I previously thought allowedLateness would let me do this, but it seems lateness only applies relative to when the event should have been processed; it does not extend the actual window. GlobalWindows may also work, but I am unsure whether there is a way to process such a window multiple times. I believe I could use an evictor with GlobalWindows to purge a window after a period of inactivity (although admittedly I have not researched this yet, because I was unsure how to trigger a GlobalWindow multiple times).
Any suggestions on how to achieve what I am looking to do would be greatly appreciated, and I would be happy to clarify any points as needed.
If SessionWindows won't do the job, then you can use GlobalWindows with a custom Trigger and Evictor. The Trigger interface has onElement and timer-based callbacks that can fire whenever and as often as you like. If you go down this route, then yes, you'll also need to implement an Evictor to dispose of elements when they are no longer needed.
The documentation and the source code are helpful when trying to understand how this all fits together.
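As an illustration (not a drop-in solution), a trigger along these lines would fire the window function on every element and purge the window once no new element has arrived for a configurable processing-time gap; the class name and the gap parameter are my own choices:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Sketch: re-evaluate the window on every element, purge it after a period of inactivity.
public class FireEachElementPurgeOnIdleTrigger extends Trigger<Object, GlobalWindow> {

    private final long inactivityGapMs;
    private final ValueStateDescriptor<Long> timerDesc =
            new ValueStateDescriptor<>("idle-timer", Long.class);

    public FireEachElementPurgeOnIdleTrigger(long inactivityGapMs) {
        this.inactivityGapMs = inactivityGapMs;
    }

    @Override
    public TriggerResult onElement(Object element, long timestamp,
                                   GlobalWindow window, TriggerContext ctx) throws Exception {
        ValueState<Long> timer = ctx.getPartitionedState(timerDesc);
        if (timer.value() != null) {
            ctx.deleteProcessingTimeTimer(timer.value()); // push the idle deadline forward
        }
        long deadline = ctx.getCurrentProcessingTime() + inactivityGapMs;
        ctx.registerProcessingTimeTimer(deadline);
        timer.update(deadline);
        return TriggerResult.FIRE; // FIRE (not PURGE) keeps the contents around for later firings
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE_AND_PURGE; // session went idle: emit one final result, drop the window
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
        ValueState<Long> timer = ctx.getPartitionedState(timerDesc);
        if (timer.value() != null) {
            ctx.deleteProcessingTimeTimer(timer.value());
        }
        timer.clear();
    }
}
```

You would attach it with something like stream.keyBy(...).window(GlobalWindows.create()).trigger(new FireEachElementPurgeOnIdleTrigger(gapMs)), optionally combined with an evictor if you want to drop individual elements earlier than the purge.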
Related
I use Flink to process DynamoDB stream data.
Watermark strategy: periodic; extract an approximate timestamp from the stream events and use it via withTimestampAssigner.
Idleness: 10 s (this may not be useful at all, as we only use a parallelism of 1).
The data workflow looks like this:
inputStream.assignTimestampsAndWatermarks(...).keyBy(...).window(TumblingEventTimeWindows.of(Time.minutes(1))).sideOutputLateData(...).reduce(...).map(...)
Then I call getSideOutput() and process the late events with essentially the same workflow as above, with small changes such as not assigning timestamps and watermarks and not producing a late output.
My logs show that everything works perfectly when the DDB stream data has correct timestamps: the corresponding window closes without issue and I can see the output after the window is closed.
However, after I introduced late events, the late-record processing logic is never triggered. I am sure that the window corresponding to the late record's timestamp has already closed. I put a log statement after the call to getSideOutput() and it is never hit; I also verified with a debugger that the getSideOutput() code path is never reached.
Can someone help to check this issue? Thank you.
I tried using a different watermark strategy for the late-record logic; that doesn't work either. I want to understand why the late records are not collected into the late stream.
Without seeing more details of your implementation it is difficult to give an accurate diagnosis, but based on your description, I wouldn't expect this to work:
Then I call getSideOutput() and process the late events with essentially the same workflow as above, with small changes such as not assigning timestamps and watermarks and not producing a late output.
If you are trying to apply event time windowing to the stream of late events, that's not going to work unless you adjust the allowed lateness for those windows enough to accommodate them.
As a starting point, have you tried printing the stream of late events?
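If it helps, this is roughly the shape I'd expect the late-data wiring to take; Event, MyReduce and watermarkStrategy stand in for your own types, and the important detail is that the side output is pulled from the operator returned by the windowed reduce, not from a later map:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.OutputTag;

// Declared as an anonymous subclass so Flink can capture the element type.
final OutputTag<Event> lateTag = new OutputTag<Event>("late-events") {};

SingleOutputStreamOperator<Event> windowed = input           // input: DataStream<Event> from the DDB stream source
        .assignTimestampsAndWatermarks(watermarkStrategy)
        .keyBy(Event::getKey)
        .window(TumblingEventTimeWindows.of(Time.minutes(1)))
        .sideOutputLateData(lateTag)
        .reduce(new MyReduce());

DataStream<Event> late = windowed.getSideOutput(lateTag);
late.print("LATE");                                          // quick sanity check that late data is being captured
```

If the print shows nothing even for records whose windows have clearly closed, the problem is likely upstream (watermarks not advancing past the window end plus allowed lateness) rather than in the side-output plumbing itself.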
Currently we are having an issue with the CPU limit. We have a lot of processes that are most likely not optimized; I have already combined some processes for the same object, but it is not enough. I am trying to understand the logs right now. As you can see in the screenshots, there is one process that is being called multiple times (I assume once for each created record). Even if I create, for example, 60 records in one operation/DML statement, Process Builder still gets called 60 times? (This is what I think is happening.) Is that the problem we are having right now? If so, is there a better way to do it? Right now we need the updates from the Process Builder to run, but I expected it to get bulkified or something like that. I was also thinking there might be some looping between processes. If there is more information you need, please let me know. Thank you.
Well, yes, the Process Builder will be invoked 60 times, one record at a time. But that shouldn't be your problem. The final update / creation of child records / email send (or whatever your action is) will be bulkified; it won't save one record at a time. If the process calls some Apex actions, they're supposed to support being passed a collection of records, not just a single record.
You may be looking in the wrong place. CPU time suggests code problems, not config (flow, workflow, Process Builder... although if you're updating fields on "this" record, it's possible you'd benefit from before-save flows). Try comparing the timestamps of METHOD_BEGIN and METHOD_END for triggers and code methods (including invocable action / process plugin interfaces).
Maybe there's code that doesn't need to run because key fields didn't change and there's nothing to recalculate or roll up. Hard to say without seeing the debug log.
Maybe the operation doesn't have to be immediate. Think about whether you can offload some of the work to "scheduled actions", "time-based workflows" or, in Apex terms, @future, Batchable, Queueable. But they'd have to be relatively safe to run in the background: if there's an error it won't be displayed to the user, so you'd need to handle errors yourself (send an email, create a record, make a Chatter post or a bell notification).
You could try uploading the log to https://apextimeline.herokuapp.com/ and trying to make sense of its Gantt-chart-like output. Or capture the log the "pro" way, with https://help.salesforce.com/s/articleView?id=sf.code_dev_console_solving_problems_using_system_log.htm&type=5 or https://marketplace.visualstudio.com/items?itemName=financialforce.lana (you'll likely need a developer's help to make sense of it).
I am currently working on a streaming program which aggregates data from a number of messages (8). The aggregation requires all 8 messages, so I am using a count window. All 8 messages share the same unique key. However, there is no guarantee that all 8 messages will arrive. So my question is two-fold:
First, what happens to a Flink count window that never closes? I am assuming the windows simply accumulate over time, consuming more and more RAM.
Secondly, can I close a count window if it does not receive all of its messages within a given time? I am looking for a solution that is as real-time as possible; I already tried using a time window, but the time-of-flight of the messages varies between a few milliseconds and 40 seconds.
So essentially: is there a way to define a window that triggers at 8 messages and evicts all messages from the window after a given time (in this case, after 60 seconds)?
The answer to your question about never-closing windows is that the part of the state reserved for them will never be freed.
The behaviour you describe could be implemented with a custom trigger and evictor on a GlobalWindow. The trigger could wait for either the expected time or the expected number of elements before emitting the window, while the evictor would evict all messages if there are fewer than 8. For reference implementations, have a look at CountTrigger (fires on count) and EventTimeTrigger (fires on time). For the evictor, have a look at CountEvictor.
For cases like this where you need to combine stateful stream processing with timers, ProcessFunction can be a good choice. See https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/stream/process_function.html.
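A minimal sketch of that ProcessFunction approach, assuming Message and Aggregate are your own types and taking the 8-message count and 60-second timeout from the question:

```java
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.util.ArrayList;
import java.util.List;

// Buffers up to 8 messages per key; emits when the 8th arrives or when a
// 60-second processing-time timer fires, then clears all state for the key.
public class EightOrTimeoutFunction
        extends KeyedProcessFunction<String, Message, Aggregate> {

    private transient ListState<Message> buffer;
    private transient ValueState<Long> timeoutTimer;

    @Override
    public void open(Configuration parameters) {
        buffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("buffer", Message.class));
        timeoutTimer = getRuntimeContext().getState(
                new ValueStateDescriptor<>("timeout", Long.class));
    }

    @Override
    public void processElement(Message msg, Context ctx, Collector<Aggregate> out) throws Exception {
        if (timeoutTimer.value() == null) {
            // First message for this key: start the 60 s deadline.
            long deadline = ctx.timerService().currentProcessingTime() + 60_000L;
            ctx.timerService().registerProcessingTimeTimer(deadline);
            timeoutTimer.update(deadline);
        }
        buffer.add(msg);

        List<Message> all = new ArrayList<>();
        buffer.get().forEach(all::add);
        if (all.size() == 8) {
            out.collect(aggregate(all));
            cleanUp(ctx);
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Aggregate> out) throws Exception {
        // Deadline reached with fewer than 8 messages: emit a partial result (or just drop them).
        List<Message> all = new ArrayList<>();
        buffer.get().forEach(all::add);
        if (!all.isEmpty()) {
            out.collect(aggregate(all));
        }
        cleanUp(ctx);
    }

    private void cleanUp(Context ctx) throws Exception {
        Long deadline = timeoutTimer.value();
        if (deadline != null) {
            ctx.timerService().deleteProcessingTimeTimer(deadline);
        }
        buffer.clear();
        timeoutTimer.clear();
    }

    private Aggregate aggregate(List<Message> messages) {
        return new Aggregate(messages); // placeholder: combine the (up to) 8 messages
    }
}
```

In onTimer you can decide whether a partial aggregate is still worth emitting or whether the buffered messages should simply be discarded.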
In Amazon Web Services, queues allow you to post messages with a visibility delay of up to 15 minutes. What if I don't want messages visible for 6 months?
I'm trying to come up with an elegant solution to the poll/push problem. I could write code to poll SQS (or a database) every few seconds, check for messages that are ready to become visible, and then move them to a "visible queue", or something like that. I wish there were a simpler, more reliable way to have messages become visible in queues far into the future without having to worry about my polling application working perfectly all the time.
I'm not married to AWS, SQS or any of that, but I'd prefer to find a cloud-friendly solution that is stable, reliable and will trigger an event far into the future without me having to worry about checking on its status every day.
Any thoughts or alternate trees for me to explore barking up are welcome.
Thanks!
It sounds like you might be misunderstanding the visibility delay. Its purpose is to make sure that the polling application doesn't pull the same item off the queue more than once.
In other words, when an item is pulled off the queue it becomes invisible for a predetermined period of time (the default is 30 seconds, and it can be set as high as 12 hours) in case the polling system has a cluster of machines reading from the queue all at once.
Here's the relevant documentation:
http://docs.amazonwebservices.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/IntroductionArticle.html#AboutVT
...and the sentence in particular that relates to my comment is:
"Immediately after the component receives the message, the message is still in the queue. However, you don't want other components in the system receiving and processing the message again. Therefore, Amazon SQS blocks them with a visibility timeout, which is a period of time during which Amazon SQS prevents other consuming components from receiving and processing that message."
You should be able to use SQS for your purpose since you can leave an item in the queue for as long as you want.
7 years later, and Amazon still doesn't support the feature you need!
The two ways you can sort of get it to work are:
have messages carry a delivery-target datetime in their message_attributes, and have the workers that consume the queue's messages simply delete and re-create any message that is consumed before its target, with delay = max(0, min(secs_until_target_datetime, 900)); that would allow you to effectively schedule a message for any arbitrary future time (there is a sketch of this after the list);
or,
(slightly less frequent and costly:) similarly, if a message isn't due to be handled yet, put it back by changing its visibility timeout to timeout = max(0, min(secs_until_target_datetime, 43200))
The disadvantage of using visibility timeout is that any read will re-trigger it.
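As a sketch of the first option using the AWS SDK for Java v2 (the deliverAt attribute and the helper method are my own convention, not an SQS feature; you also need to request the attribute via messageAttributeNames when receiving):

```java
import software.amazon.awssdk.services.sqs.SqsClient;
import software.amazon.awssdk.services.sqs.model.DeleteMessageRequest;
import software.amazon.awssdk.services.sqs.model.Message;
import software.amazon.awssdk.services.sqs.model.MessageAttributeValue;
import software.amazon.awssdk.services.sqs.model.SendMessageRequest;

import java.time.Instant;
import java.util.Map;

// Re-queues a consumed message until its "deliverAt" attribute (epoch seconds) is reached.
public class DelayedRedelivery {

    public static void handle(SqsClient sqs, String queueUrl, Message msg) {
        long deliverAt = Long.parseLong(
                msg.messageAttributes().get("deliverAt").stringValue()); // assumes the attribute is present
        long remaining = deliverAt - Instant.now().getEpochSecond();

        if (remaining > 0) {
            int delay = (int) Math.min(remaining, 900); // SQS caps DelaySeconds at 15 minutes
            sqs.sendMessage(SendMessageRequest.builder()
                    .queueUrl(queueUrl)
                    .messageBody(msg.body())
                    .messageAttributes(Map.of("deliverAt", MessageAttributeValue.builder()
                            .dataType("Number")
                            .stringValue(String.valueOf(deliverAt))
                            .build()))
                    .delaySeconds(delay)
                    .build());
        } else {
            process(msg); // the message is due: do the real work
        }

        // Either way, remove the copy we just received.
        sqs.deleteMessage(DeleteMessageRequest.builder()
                .queueUrl(queueUrl)
                .receiptHandle(msg.receiptHandle())
                .build());
    }

    private static void process(Message msg) {
        // actual business logic goes here
    }
}
```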
There has been a direct AWS solution for this since 2016-12-01: AWS Step Functions.
Each execution can last (or idle) for up to one year, the state is persisted between transitions, and it doesn't cost you any money while it waits.
I am designing an application which has the potential to hang while waiting for data from servers (either a database or the internet). The problem is that I don't know how best to cope with the multitude of different places where things may take time.
I am happy to display a "loading" dialog to the user while access is happening, but ideally I don't want it to flick up and disappear for short-running operations.
Microsoft Word appears to handle this quite nicely: if you click a button and the operation takes a long time, after a few seconds you get a "Working..." dialog. The operation is still synchronous and you can't interrupt it. However, if the same operation completes quickly, you obviously don't get the dialog.
I am happy(ish) to devise some generic background-worker thread handler, and 99% of my data processing is already done in static atomic methods, but I would like to follow best practice on this one if I can.
If anyone has patterns, code or suggestions, I welcome them all.
Cheers
I would definitely think asynchronously, using a pattern with two events. The first "event" is that you actually got your data from wherever/whenever you had to wait for it. The second event is a delay timer. If you get your data before this timer pops, all is well and good. If not, THEN you pop up your "I'm busy" dialog and allow the user to cancel the request. Usually cancel just means "ignore" when you finally get the response.
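The original question is about .NET, but the race between those two events is language-agnostic; here is a rough sketch of the idea in Java (loadData, display, showBusyDialog and hideBusyDialog are placeholders, and in a real UI the callbacks would have to be marshalled back onto the UI thread):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

public class BusyIndicator {

    private static final ScheduledExecutorService timer =
            Executors.newSingleThreadScheduledExecutor();

    public static void run() {
        // Event 1: fetch the data asynchronously.
        CompletableFuture<String> data = CompletableFuture.supplyAsync(BusyIndicator::loadData);

        // Event 2: a delay timer; if it fires before the data arrives, show the dialog.
        ScheduledFuture<?> showDialog =
                timer.schedule(BusyIndicator::showBusyDialog, 500, TimeUnit.MILLISECONDS);

        data.whenComplete((result, error) -> {
            // Data won the race: cancel the pending dialog (no flicker for fast calls)...
            showDialog.cancel(false);
            // ...or, if it already popped up, take it down again.
            hideBusyDialog();
            if (error == null) {
                display(result);
            }
        });
    }

    private static String loadData()        { /* slow database / network call */ return "data"; }
    private static void display(String s)   { /* update the UI */ }
    private static void showBusyDialog()    { /* show "working..." with a cancel option */ }
    private static void hideBusyDialog()    { /* close the dialog if it is showing */ }
}
```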
Microsoft Word appears to handle this quite nicely: if you click a button and the operation takes a long time, after a few seconds you get a "Working..." dialog. The operation is still synchronous and you can't interrupt it. However, if the same operation completes quickly, you obviously don't get the dialog.
If this is the behavior you want...
You could handle this fairly easily by wrapping a class around BackgroundWorker. Just time the start of the DoWork event and the time to the first progress report. If a certain amount of time passes, you can show your dialog; otherwise, block (since it's a short operation).
That being said, any time you're doing work that can be processed asynchronously, I'd recommend doing it that way. It's much nicer to never block your UI for a noticeable interval, even if it's short. This becomes much simpler in .NET 4 (or 3.5 with the Rx framework) by using the Task Parallel Library.
Ideally, you should run any IO or non-UI processing either in a background thread or asynchronously, to avoid locking up the UI.