We have a requirement to trigger a pipeline based on conditions at multiple intervals.
For example: the pipeline can be triggered at 1 PM if it meets condition A, and at 2 PM if it meets condition B. These conditions are part of tasks under the deploy stage (exec). Is this possible? Can we set up multiple timers?
How can I set a trigger in Flink to perform some operation when a particular time has passed?
E.g.: the sum of the stream at 1 PM every day.
A KeyedProcessFunction can use timers to trigger actions at specific times (on a per-key basis). These can be either processing time timers, which use system time, or they can be event time timers, which are triggered by Watermarks.
Here are examples of each, from the tutorials in the docs:
processing time timer
event time timer
Also see the more detailed docs about process functions and timers.
Note that if you don't want to apply timers in a key-partitioned manner, but instead need to operate on the entire datastream (i.e., not in parallel), you can use keyBy(constant) to get yourself into a keyed context without actually partitioning the stream.
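As a rough illustration of the per-key timer approach (not the tutorial code linked above), here is a minimal sketch. It assumes a stream of Long values keyed by a constant (so there is a single global sum, per the note above), and it interprets "1 PM" in the system default time zone:

```java
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

import java.time.Instant;
import java.time.LocalTime;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Usage (hypothetical): stream.keyBy(v -> 0).process(new DailySumAtOnePm())
public class DailySumAtOnePm extends KeyedProcessFunction<Integer, Long, Long> {

    private transient ValueState<Long> sum;

    @Override
    public void open(Configuration parameters) {
        sum = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Long.class));
    }

    @Override
    public void processElement(Long value, Context ctx, Collector<Long> out) throws Exception {
        Long current = sum.value();
        sum.update(current == null ? value : current + value);

        // Registering the same timestamp repeatedly is fine: Flink deduplicates
        // timers per key and timestamp.
        ctx.timerService().registerProcessingTimeTimer(
                nextOnePm(ctx.timerService().currentProcessingTime()));
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<Long> out) throws Exception {
        // The timer fires at 1 PM processing time: emit the accumulated sum and reset it.
        Long current = sum.value();
        out.collect(current == null ? 0L : current);
        sum.clear();
    }

    // Next occurrence of 13:00 strictly after nowMillis, in the system default zone.
    private static long nextOnePm(long nowMillis) {
        ZonedDateTime candidate = ZonedDateTime
                .ofInstant(Instant.ofEpochMilli(nowMillis), ZoneId.systemDefault())
                .with(LocalTime.of(13, 0));
        if (!candidate.toInstant().isAfter(Instant.ofEpochMilli(nowMillis))) {
            candidate = candidate.plusDays(1);
        }
        return candidate.toInstant().toEpochMilli();
    }
}
```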
In Apache Flink, setAutoWatermarkInterval(interval) causes watermarks to be emitted to downstream operators so that they can advance their event time.
If the watermark has not changed during the specified interval (no events arrived), will the runtime skip emitting watermarks? On the other hand, if a new event arrives before the next interval, will a new watermark be emitted immediately, or will it be queued until the next setAutoWatermarkInterval interval is reached?
I am curious what the best value for autoWatermarkInterval is (especially for high-rate sources): the smaller this value, the smaller the lag between processing time and event time, but at the cost of more bandwidth used to send the watermarks. Is that accurate?
On the other hand, if I use env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime), the Flink runtime will automatically assign timestamps and watermarks (the timestamps correspond to the time the event entered the Flink dataflow pipeline, i.e. the source operator). Nevertheless, even with ingestion time we can still define a processing time timer (in the processElement function) as shown below:
long timer = context.timestamp() + timeout;
context.timerService().registerProcessingTimeTimer(timer);
where context.timestamp() is the ingestion time set by Flink.
Thank you.
The autoWatermarkInterval only affects watermark generators that pay attention to it. They also have an opportunity to generate a watermark in combination with event processing.
For those watermark generators that use the autoWatermarkInterval (which is definitely the normal case), they are collecting evidence for what the next watermark should be as a side effect of assigning timestamps for each event. When a timer fires (based on the autoWatermarkInterval), the watermark generator is then asked by the Flink runtime to produce the next watermark. The watermark wasn't waiting somewhere, nor was it queued, but rather it is created on demand, based on information that had been stored by the timestamp assigner -- which is typically the maximum timestamp seen so far in the stream.
Yes, more frequent watermarks mean more overhead to communicate and process them, but also lower latency. You have to decide how to handle this throughput/latency tradeoff based on your application's requirements.
You can always use processing time timers, regardless of the TimeCharacteristic. (By the way, at a low level, the only thing watermarks do is to trigger event time timers, be they in process functions, windows, etc.)
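To make the mechanics above concrete, here is a minimal sketch assuming the Flink 1.11+ WatermarkStrategy API; MyEvent is a placeholder type. The generator only records the maximum timestamp in onEvent(); the runtime calls onPeriodicEmit() every autoWatermarkInterval milliseconds, and the watermark is created on demand from that recorded value:

```java
import org.apache.flink.api.common.eventtime.Watermark;
import org.apache.flink.api.common.eventtime.WatermarkGenerator;
import org.apache.flink.api.common.eventtime.WatermarkOutput;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PeriodicWatermarkExample {

    // Placeholder event type with an embedded event-time timestamp.
    public static class MyEvent {
        public long timestamp;
        public MyEvent() {}
        public MyEvent(long timestamp) { this.timestamp = timestamp; }
    }

    // A periodic generator: evidence is collected per event, but the watermark itself
    // is only produced when the runtime asks for it.
    public static class MaxTimestampWatermarks implements WatermarkGenerator<MyEvent> {
        private long maxTimestamp = Long.MIN_VALUE + 1;

        @Override
        public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
            maxTimestamp = Math.max(maxTimestamp, eventTimestamp);  // no emission here
        }

        @Override
        public void onPeriodicEmit(WatermarkOutput output) {
            output.emitWatermark(new Watermark(maxTimestamp - 1));  // created on demand
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Smaller interval: fresher watermarks (lower latency) but more watermark traffic.
        env.getConfig().setAutoWatermarkInterval(200);  // milliseconds

        env.fromElements(new MyEvent(1_000L), new MyEvent(2_000L))
           .assignTimestampsAndWatermarks(
                   WatermarkStrategy
                           .<MyEvent>forGenerator(ctx -> new MaxTimestampWatermarks())
                           .withTimestampAssigner((event, previous) -> event.timestamp))
           .print();

        env.execute("periodic watermark example");
    }
}
```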
We have a Flink application that has a map operator at the start. The output stream of this operator is routed to multiple window functions using filters. The window functions all have a parallelism of 1. We form a union of the outputs of the window functions, pass it to another map function, and then send it to a sink.
We need both map functions to take the parallelism of the environment. This happens as expected, and the parallelism of the window functions does turn out to be 1.
We have set 1 slot per task manager.
The issue is that all the window function tasks end up on the first task manager when we set the parallelism of the environment to greater than 1. All the events go to this task manager alone, causing a bottleneck. Is there a way to distribute the window function tasks across multiple task managers when parallelism > 1? Would doing a rebalance() help?
If each task manager has only one slot, and all of the window function tasks are in the same task manager, then apparently all of the window function tasks are in the same slot.
That being the case, you could use slot sharing groups to force different windows into different slots, and thus onto different task managers.
With Flink 1.9.2/1.10.0 or later, you can set the cluster.evenly-spread-out-slots config boolean to true.
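Here is a minimal sketch of the slot sharing group approach (the source, filter conditions, and window logic are placeholders). Because each window operator gets its own slot sharing group, it cannot share a slot with the other windows, so with one slot per task manager the window operators are forced onto different task managers:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;

public class SlotSharingGroupExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> source = env.fromElements(1L, 2L, 3L, 4L);

        DataStream<Long> windowA = source
                .filter(v -> v % 2 == 0)
                .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // parallelism 1
                .reduce(Long::sum)
                .slotSharingGroup("window-a");   // isolate this window operator

        DataStream<Long> windowB = source
                .filter(v -> v % 2 == 1)
                .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(10)))  // parallelism 1
                .reduce(Long::sum)
                .slotSharingGroup("window-b");   // different group -> different slot

        windowA.union(windowB).print();

        // Alternative (Flink 1.9.2/1.10.0+): set cluster.evenly-spread-out-slots: true in
        // flink-conf.yaml and let the scheduler spread the tasks across task managers.
        env.execute("slot sharing group example");
    }
}
```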
Side note - instead of using a filter on multiple streams to create a router, use a ProcessFunction with multiple side outputs, one per target window operator. This is more efficient, as you're not replicating the data N times and then filtering down to a subset.
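Here is a minimal sketch of that side-output router (the event type and routing condition are placeholders). Each record is emitted exactly once, to the side output its target window operator consumes, instead of being replicated to N filters:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class SideOutputRouter {

    // One tag per target window operator; anonymous subclasses keep the generic type.
    static final OutputTag<Long> WINDOW_A = new OutputTag<Long>("window-a") {};
    static final OutputTag<Long> WINDOW_B = new OutputTag<Long>("window-b") {};

    public static void route(DataStream<Long> input) {
        SingleOutputStreamOperator<Long> routed = input.process(new ProcessFunction<Long, Long>() {
            @Override
            public void processElement(Long value, Context ctx, Collector<Long> out) {
                if (value % 2 == 0) {
                    ctx.output(WINDOW_A, value);   // goes only to window operator A
                } else {
                    ctx.output(WINDOW_B, value);   // goes only to window operator B
                }
            }
        });

        DataStream<Long> forWindowA = routed.getSideOutput(WINDOW_A);
        DataStream<Long> forWindowB = routed.getSideOutput(WINDOW_B);
        // ... apply the window functions to forWindowA / forWindowB as before
    }
}
```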
I have a push task queue, and each of my jobs consists of multiple similar TaskQueue tasks. Each of these tasks takes less than a second to finish and can add new tasks to the queue (these must also be completed before the job is considered finished). Task results are written to a DataStore.
The goal is to understand when a job has finished, i.e. all of its tasks are completed.
Writes are really frequent and I can't store the results inside one entity group. Is there a good workaround for this?
In a similar context I used a scheme based on memcache, which doesn't have the significant write-rate limitation that datastore entity groups have:
each job gets a unique memcache key associated with it, which it passes to each of the execution tasks it subsequently enqueues
every execution task updates the memcache value corresponding to the job key with the current timestamp, and also enqueues a completion-check task, delayed by an idle-timeout value large enough that the job can be declared complete once it elapses
every completion-check task compares the memcache value corresponding to the job key against the current timestamp:
if the delta is less than the idle timeout, the job is not complete (some other task was executed after this completion-check task was enqueued, so another completion-check task is in the queue)
otherwise the job is completed
Note: the idle timeout should be larger than the maximum time a task might spend in the queue.
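Here is a minimal sketch of that idle-timeout scheme. The Cache and TaskQueue interfaces are placeholders standing in for memcache and the push task queue; the point is the timestamp bookkeeping and the completion check:

```java
public class JobCompletionTracker {

    interface Cache {                       // placeholder for memcache
        void put(String key, long value);
        Long get(String key);
    }

    interface TaskQueue {                   // placeholder for the push task queue
        void enqueue(Runnable task, long delayMillis);
    }

    private final Cache cache;
    private final TaskQueue queue;
    private final long idleTimeoutMillis;   // must exceed the max time a task waits in the queue

    JobCompletionTracker(Cache cache, TaskQueue queue, long idleTimeoutMillis) {
        this.cache = cache;
        this.queue = queue;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    // Called at the end of every execution task belonging to the job.
    void recordActivity(String jobKey) {
        cache.put(jobKey, System.currentTimeMillis());                    // refresh "last seen"
        queue.enqueue(() -> checkCompletion(jobKey), idleTimeoutMillis);  // delayed completion check
    }

    // Runs as the delayed completion-check task.
    void checkCompletion(String jobKey) {
        Long lastActivity = cache.get(jobKey);
        long idle = System.currentTimeMillis() - (lastActivity == null ? 0L : lastActivity);
        if (idle < idleTimeoutMillis) {
            // Some other task ran after this check was enqueued, so another check is
            // already in the queue; let that later check decide.
            return;
        }
        // No task has touched the job for a full idle timeout: declare it complete.
        onJobCompleted(jobKey);
    }

    void onJobCompleted(String jobKey) {
        // e.g. write a completion marker to the datastore, notify, etc.
    }
}
```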
I have set up a time-dependent workflow rule with the condition below:
6 days after a particular date (the 1st follow-up date, in my case), the workflow rule should update a picklist field (current status, in my case).
My question is: at what time on the 6th day will it execute?
Do we have control over this time?
Regards,
Ankit
We only control the day of execution because:
1. Salesforce evaluates time-based workflow on the organization's time zone, not the users'. Users in different time zones may see differences in behavior.
2. Time-dependent actions aren't executed independently. They're grouped into a single batch that starts executing within one hour after the first action enters the batch.
3. Time triggers don't support minutes or seconds.
4. Salesforce limits the number of time triggers an organization can execute per hour. If an organization exceeds the limits for its Edition, Salesforce defers the execution of the additional time triggers to the next hour.