Watermark generation policy when the watermark is defined in DDL - apache-flink

In the documentation at https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/table/streaming/time_attributes.html,
the DDL for declaring an event-time attribute and its watermark is:
CREATE TABLE user_actions (
user_name STRING,
data STRING,
user_action_time TIMESTAMP(3),
-- declare user_action_time as event time attribute and use 5 seconds delayed watermark strategy
WATERMARK FOR user_action_time AS user_action_time - INTERVAL '5' SECOND
) WITH (
...
);
My question is about the policy for generating new watermarks.
With the DataStream API, Flink provides the following two policies for watermark generation.
Which one applies in DDL?
periodic, as with AssignerWithPeriodicWatermarks: a new watermark is generated at regular intervals
punctuated, as with AssignerWithPunctuatedWatermarks: a new watermark may be generated whenever a new event arrives

The watermark is assigned periodically. You can set the interval with the configuration option pipeline.auto-watermark-interval.
Also note that the watermark API in the DataStream API has changed, and the two classes you mention are now deprecated [1].
[1] https://ci.apache.org/projects/flink/flink-docs-release-1.12/dev/event_timestamps_watermarks.html#introduction-to-watermark-strategies
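
To make the difference between the two policies concrete, here is a minimal plain-Java model. It does not use any Flink classes: the class WatermarkPolicyModel, its method names, and the sample numbers are all hypothetical, and it assumes the watermark is simply the largest timestamp seen so far minus the 5-second delay from the DDL above.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of the two emission policies; this is NOT Flink code. */
class WatermarkPolicyModel {

    /** Punctuated: every incoming event may emit a new watermark immediately. */
    static List<Long> punctuated(long[] eventTimes, long delayMillis) {
        List<Long> emitted = new ArrayList<>();
        long current = Long.MIN_VALUE;
        for (long eventTime : eventTimes) {
            long candidate = eventTime - delayMillis;
            if (candidate > current) {          // watermarks never go backwards
                current = candidate;
                emitted.add(candidate);
            }
        }
        return emitted;
    }

    /**
     * Periodic: events only update the maximum timestamp; a timer that fires
     * every intervalMillis emits a watermark if it has advanced.
     * arrivalTimes[i] is the ascending, non-negative wall-clock arrival time
     * of eventTimes[i]; intervalMillis must be positive.
     */
    static List<Long> periodic(long[] arrivalTimes, long[] eventTimes,
                               long delayMillis, long intervalMillis) {
        List<Long> emitted = new ArrayList<>();
        if (arrivalTimes.length == 0) {
            return emitted;
        }
        long current = Long.MIN_VALUE;
        long maxEventTime = Long.MIN_VALUE;
        int next = 0;
        long lastArrival = arrivalTimes[arrivalTimes.length - 1];
        // Run timer ticks until the first tick at or after the last arrival.
        for (long tick = intervalMillis; tick < lastArrival + intervalMillis; tick += intervalMillis) {
            while (next < arrivalTimes.length && arrivalTimes[next] <= tick) {
                maxEventTime = Math.max(maxEventTime, eventTimes[next]);
                next++;
            }
            if (maxEventTime != Long.MIN_VALUE && maxEventTime - delayMillis > current) {
                current = maxEventTime - delayMillis;
                emitted.add(current);
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] arrivals = {10L, 50L, 120L, 190L, 260L};
        long[] events = {10_000L, 12_000L, 11_000L, 15_000L, 16_000L};
        System.out.println("punctuated: " + punctuated(events, 5_000L));
        System.out.println("periodic:   " + periodic(arrivals, events, 5_000L, 200L));
    }
}
```

In this model the periodic policy emits fewer watermarks than the punctuated one for the same input, because several events can arrive between two timer ticks. That is the trade-off the interval setting controls.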

Related

Flink SQL Windows not Reporting Final Results

I'm using Flink SQL to compute event-time-based windowed analytics. Everything works fine until my data source becomes idle each evening, after which the results for the last minute aren't produced until the next day when data begins to flow again.
CREATE TABLE input (
id STRING,
data BIGINT,
rowtime TIMESTAMP(3) METADATA FROM 'timestamp',
WATERMARK FOR rowtime AS rowtime - INTERVAL '1' SECOND
) WITH (
'connector' = 'kafka',
'topic' = 'input',
'properties.bootstrap.servers' = 'localhost:9092',
'scan.startup.mode' = 'latest-offset',
'format' = 'json'
);
SELECT ...
FROM
(SELECT * FROM
TABLE(TUMBLE(TABLE input, DESCRIPTOR(rowtime), INTERVAL '1' MINUTES)))
GROUP BY ..., window_start, window_end
I've tried setting table.exec.source.idle-timeout, but it didn't help. What can I do?
table.exec.source.idle-timeout (and the corresponding withIdleness construct used with the DataStream API for WatermarkStrategy) detects idle input partitions and prevents them from holding back the progress of the overall watermark. However, for the overall watermark to advance, there must still be some input, somewhere.
Some options:
(1) Live with the problem, which means waiting until the watermark can advance normally, based on observing larger timestamps in the input stream. As you've indicated, in your use case this can require waiting several hours.
(2) Arrange for the input stream(s) to contain keep-alive messages. This way the watermark generator will have evidence (based on the timestamps in the keep-alive messages) that it can advance the watermark. You'll have to modify your queries to ignore these otherwise extraneous events.
(3) Upon reaching the point where the job has fully ingested all of the daily input, but hasn't yet produced the final set of results, stop the job and specify --drain. This will send a watermark with value MAX_WATERMARK through the pipeline, which will close all pending windows. You can then restart the job.
(4) Implement a custom watermark strategy that uses a processing-time timer to detect idleness and artificially advance the watermark based on the passage of wall clock time. This will require converting your table input to a DataStream, adding the watermarks there, and then converting back to a table for the windowing. See https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/data_stream_api/ for examples of these conversions.
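
The idea behind option (4) can be sketched in plain Java, independent of Flink. Everything here is an assumption made for illustration: the class IdleAwareWatermarkModel, its method names, and the rule that once the source has been silent for longer than the idle timeout, the watermark advances at the same rate as the wall clock. A real implementation would live inside a custom watermark generator driven by a processing-time timer.

```java
/** Plain-Java model of an idleness-aware watermark; this is NOT Flink code. */
class IdleAwareWatermarkModel {
    private final long delayMillis;        // bounded-delay part of the strategy
    private final long idleTimeoutMillis;  // silence tolerated before advancing artificially
    private long maxEventTime = Long.MIN_VALUE;
    private long lastArrivalWallClock = Long.MIN_VALUE;
    private long watermark = Long.MIN_VALUE;

    IdleAwareWatermarkModel(long delayMillis, long idleTimeoutMillis) {
        this.delayMillis = delayMillis;
        this.idleTimeoutMillis = idleTimeoutMillis;
    }

    /** Called for every event, with the wall-clock time at which it arrived. */
    void onEvent(long eventTime, long wallClockNow) {
        maxEventTime = Math.max(maxEventTime, eventTime);
        lastArrivalWallClock = wallClockNow;
    }

    /** Called by a processing-time timer; returns the current watermark. */
    long onTimer(long wallClockNow) {
        if (maxEventTime == Long.MIN_VALUE) {
            return watermark;              // nothing seen yet, nothing to advance
        }
        long candidate = maxEventTime - delayMillis;
        long idleFor = wallClockNow - lastArrivalWallClock;
        if (idleFor > idleTimeoutMillis) {
            // Assume event time kept pace with the wall clock while idle.
            candidate += idleFor - idleTimeoutMillis;
        }
        watermark = Math.max(watermark, candidate);  // never move backwards
        return watermark;
    }
}
```

With a 1-second delay and a 5-second idle timeout, a last event at event time 59,500 leaves the watermark at 58,500, short of the 59,999 needed to close the window [0, 60,000). In this model the watermark reaches 59,999 after 6,499 ms of silence. The cost of this design is that events arriving after the artificial advance may be treated as late.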

Flink CEP SQL: events not triggering

I use a CEP pattern in Flink SQL, and it works as expected when connected to a Kafka broker. But when I connect to a cluster-based cloud Kafka setup, the Flink CEP pattern is not triggering. Here is my SQL:
create table agent_action_detail
(
agent_id String,
room_id String,
create_time Bigint,
call_type String,
application_id String,
connect_time Bigint,
row_time TIMESTAMP_LTZ(3), WATERMARK for row_time as row_time - INTERVAL '1' MINUTE)
with ('connector'='kafka', 'topic'='agent-action-detail', ...)
Then I send messages in JSON format, such as:
{"agent_id":"agent_221","room_id":"room1","create_time":1635206828877,"call_type":"inbound","application_id":"app1","connect_time":1635206501735,"row_time":"2021-10-25 16:07:09.019Z"}
In the Flink web UI, the watermark appears to work fine.
I run my CEP SQL:
select * from agent_action_detail
match_recognize(
partition by agent_id
order by row_time
measures
last(BF.create_time) as create_time,
first(AF.connect_time) as connect_time
one row per match AFTER MATCH SKIP PAST LAST ROW
pattern (BF+ AF) define BF as BF.connect_time > 0 ,AF as AF.connect_time > 0
)
In every Kafka message, connect_time is > 0, but Flink is not triggering a match.
Can somebody help with this issue? Thanks in advance!
Here is another CEP SQL query, which also does not work:
select * from agent_action_detail
match_recognize(
partition by agent_id
order by row_time
measures AF.connect_time as connect_time
one row per match
pattern (BF AF) WITHIN INTERVAL '1' second
define BF as (last(BF.connect_time, 1) < 1), AF as AF.connect_time >= 100
)
The agent_action_detail table is populated by another Flink SQL statement:
insert into agent_action_detail select data.agent_id, data.room_id, data.create_time, data.call_type, data.application_id, data.connect_time, now() from source_table where type = 'xxx'
There are several things that can cause pattern matching to produce no results:
the input doesn't actually contain the pattern
watermarking is being done incorrectly
the pattern is pathological in some way
This particular pattern loops with no exit condition. This sort of pattern doesn't allow the internal state of the pattern matching engine to ever be cleared, which will lead to problems.
If you were using Flink CEP directly, I would tell you to try adding until(condition) or within(time) to constrain the number of possible matches.
With MATCH_RECOGNIZE, see if you can add a distinct terminating element to the pattern.
Update: since you are still getting no results after modifying the pattern, you should determine if watermarking is the source of your problem. CEP relies on sorting the input stream by time, which depends on watermarking -- but only if you are using event time.
The easiest way to test this would be to switch to using processing time:
create table agent_action_detail
(
agent_id String,
...
row_time AS PROCTIME()
)
with (...)
If that works, then either the timestamps or the watermarks are the problem. For example, if all of the events are late, you'll get no results. In your case, I'm wondering whether the row_time column has any data in it.
If that doesn't reveal the problem, please share a minimal reproducible example, including the data needed to observe the problem.

Flink Watermark

In Flink, I found two ways to set up watermarks.
The first is
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setAutoWatermarkInterval(5000)
the second is
env.addSource(
new FlinkKafkaConsumer[...](...)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[...](Duration.ofSeconds(10)).withTimestampAssigner(...)
)
I would like to know which will take effect eventually.
There's no conflict at all between those two -- they are dealing with separate concerns. Everything specified will take effect.
The first one,
env.getConfig.setAutoWatermarkInterval(5000)
is specifying how often you want watermarks to be generated (one watermark every 5000 msec). If this wasn't specified, the default of 200 msec would be used instead.
The second,
env.addSource(
new FlinkKafkaConsumer[...](...)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[...](Duration.ofSeconds(10)).withTimestampAssigner(...)
)
is specifying the details of how those watermarks are to be computed. I.e., they should be generated by the FlinkKafkaConsumer using a BoundedOutOfOrderness strategy, with a bounded delay of 10 seconds. The WatermarkStrategy also needs a timestamp assigner.
There's no default WatermarkStrategy, so something like this second code snippet is required if you want to work with event time.
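
The split between the two concerns can be illustrated with a small plain-Java model; it does not depend on Flink. The class BoundedOutOfOrdernessModel and its methods are hypothetical names. The arithmetic (largest timestamp seen, minus the bound, minus one millisecond) follows what Flink's built-in bounded-out-of-orderness generator does to the best of my knowledge, so treat it as a sketch rather than the library's code.

```java
/** Plain-Java model of a bounded-out-of-orderness generator; NOT the Flink class. */
class BoundedOutOfOrdernessModel {
    private final long boundMillis;
    private long maxTimestamp;

    BoundedOutOfOrdernessModel(long boundMillis) {
        this.boundMillis = boundMillis;
        // Start low enough that the first computed watermark does not underflow.
        this.maxTimestamp = Long.MIN_VALUE + boundMillis + 1;
    }

    /** "How watermarks are computed": each event only updates the maximum. */
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    /** "How often": the runtime would call this once per auto-watermark interval. */
    long onPeriodicEmit() {
        return maxTimestamp - boundMillis - 1;
    }

    /** An event is late when its timestamp is not beyond the given watermark. */
    static boolean isLate(long eventTimestamp, long watermark) {
        return eventTimestamp <= watermark;
    }

    public static void main(String[] args) {
        BoundedOutOfOrdernessModel generator = new BoundedOutOfOrdernessModel(10_000L);
        generator.onEvent(30_000L);
        generator.onEvent(25_000L);   // out of order, but within the 10 s bound
        System.out.println("watermark = " + generator.onPeriodicEmit());
    }
}
```

Changing the interval changes only how frequently onPeriodicEmit would be called. Changing the bound changes the value it returns.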

How to get DataStream key after keyBy() in Flink Java API

I'm reading from a Kafka cluster in a Flink streaming app. After getting the source stream, I want to aggregate events by a composite key and an event-time tumbling window and then write the result to a table.
The problem is that after applying my AggregateFunction, which just counts the number of clicks per clientId, I can't find a way to get the key of each output record, since the API returns the accumulated result but not the corresponding key.
DataStream<Event> stream = environment.addSource(mySource);
stream.keyBy(new KeySelector<Event, Integer>() {
@Override
public Integer getKey(Event event) { return event.getClientId(); }
})
.window(TumblingEventTimeWindows.of(Time.minutes(1)))
.aggregate(new MyAggregateFunction());
How do I get the key that I specified before? I did not put the key of the input events into the accumulator, as I felt it wouldn't be nice.
Rather than
.aggregate(new MyAggregateFunction())
you can use
.aggregate(new MyAggregateFunction(), new MyProcessWindowFunction())
and in this case the process method of your ProcessWindowFunction will be passed the key, along with the pre-aggregated result of your AggregateFunction and a Context object with other potentially relevant info. See the section in the docs on ProcessWindowFunction with Incremental Aggregation for more details.
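
The division of labour described above can be modelled in plain Java without Flink. The class KeyedCountModel and its methods are hypothetical stand-ins: add plays the role of the AggregateFunction, which never sees the key, and process plays the role of the ProcessWindowFunction, which receives the key together with the pre-aggregated count. The 1-minute window matches the question; the rest is an assumption for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of incremental aggregation plus a keyed finishing step. */
class KeyedCountModel {
    static final long WINDOW_MILLIS = 60_000L;

    /** Output record that carries the key next to the aggregated value. */
    static final class KeyedCount {
        final int clientId;
        final long windowStart;
        final long count;

        KeyedCount(int clientId, long windowStart, long count) {
            this.clientId = clientId;
            this.windowStart = windowStart;
            this.count = count;
        }

        @Override
        public String toString() {
            return "client=" + clientId + " window=" + windowStart + " count=" + count;
        }
    }

    /** Stand-in for the aggregate step: sees only the accumulator, never the key. */
    static long add(long accumulator) {
        return accumulator + 1;
    }

    /** Stand-in for the window function: gets key, window, and pre-aggregated result. */
    static KeyedCount process(int key, long windowStart, long preAggregated) {
        return new KeyedCount(key, windowStart, preAggregated);
    }

    static List<KeyedCount> run(int[] clientIds, long[] eventTimes) {
        // key -> window start -> accumulator (one running count per key and window)
        Map<Integer, Map<Long, Long>> state = new TreeMap<>();
        for (int i = 0; i < clientIds.length; i++) {
            long windowStart = eventTimes[i] - Math.floorMod(eventTimes[i], WINDOW_MILLIS);
            Map<Long, Long> windows = state.computeIfAbsent(clientIds[i], k -> new TreeMap<>());
            windows.put(windowStart, add(windows.getOrDefault(windowStart, 0L)));
        }
        List<KeyedCount> out = new ArrayList<>();
        for (Map.Entry<Integer, Map<Long, Long>> byKey : state.entrySet()) {
            for (Map.Entry<Long, Long> byWindow : byKey.getValue().entrySet()) {
                out.add(process(byKey.getKey(), byWindow.getKey(), byWindow.getValue()));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        int[] clients = {1, 2, 1, 1};
        long[] times = {1_000L, 2_000L, 59_000L, 61_000L};
        for (KeyedCount result : run(clients, times)) {
            System.out.println(result);
        }
    }
}
```

Only one running count is kept per key and window, and the key is attached at the very end, so the accumulator itself stays free of key information.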

Custom Watermarks with Apache Flink

I am investigating the types of watermarks that can be inserted into the data stream. 
While this may go outside of the purpose of watermarks, I'll ask it anyway.
Can you create a watermark that holds a timestamp and k/v pair(s) (this=that, that=this)? 
Hence the watermark will hold {12DEC180500GMT,this=that, that=this}.
Or
{Timestamp, kvp1, kvp2, kvpN}
Is something like this possible? I have reviewed the user and API docs but may have overlooked something
No, the Watermark class in Flink
(found in
flink/flink-streaming/java/src/main/java/org/apache/flink/streaming/api/watermark/Watermark.java)
has only one instance variable (MAX_WATERMARK is a static constant), which is
/** The timestamp of the watermark in milliseconds. */
private final long timestamp;
So watermarks cannot carry any information besides a timestamp, which must be a long value.
