Flink Watermark

In Flink, I found two ways to set up watermarks.
The first is
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.getConfig.setAutoWatermarkInterval(5000)
The second is
env.addSource(
new FlinkKafkaConsumer[...](...)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[...](Duration.ofSeconds(10)).withTimestampAssigner(...)
)
I would like to know which one will eventually take effect.

There's no conflict at all between those two; they deal with separate concerns. Everything specified will take effect.
The first one,
env.getConfig.setAutoWatermarkInterval(5000)
is specifying how often you want watermarks to be generated (one watermark every 5000 msec). If this wasn't specified, the default of 200 msec would be used instead.
The second,
env.addSource(
new FlinkKafkaConsumer[...](...)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[...](Duration.ofSeconds(10)).withTimestampAssigner(...)
)
is specifying the details of how those watermarks are to be computed. I.e., they should be generated by the FlinkKafkaConsumer using a BoundedOutOfOrderness strategy, with a bounded delay of 10 seconds. The WatermarkStrategy also needs a timestamp assigner.
There's no default WatermarkStrategy, so something like this second code snippet is required if you want to work with event time.
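Putting the two together, a minimal sketch might look like this (in Java rather than the Scala of the question; the topic name, bootstrap servers, and the assumption that the event's timestamp is an epoch-millis value in the first field of a CSV payload are all illustrative):

import java.time.Duration;
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class WatermarkSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Concern 1: how often watermarks are emitted (default: every 200 ms).
        env.getConfig().setAutoWatermarkInterval(5000);

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // illustrative

        // Concern 2: how each watermark is computed: bounded out-of-orderness of
        // 10 seconds, with the event timestamp extracted from the record itself
        // (here assumed to be epoch millis in the first CSV field).
        DataStream<String> events = env
            .addSource(new FlinkKafkaConsumer<>("events", new SimpleStringSchema(), props))
            .assignTimestampsAndWatermarks(
                WatermarkStrategy
                    .<String>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                    .withTimestampAssigner(
                        (record, ts) -> Long.parseLong(record.split(",")[0])));

        events.print();
        env.execute("watermark-setup");
    }
}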

Related

How to buffer a batch of data in Flink

I want to buffer a datastream in Flink. My initial idea is to cache 100 pieces of data in a list or tuple and then use insert into values (???) to insert the data into ClickHouse in bulk. Do you have better ways to do this?
The first solution that you posted works, but it is flaky. It can lead to starvation due to its simplistic logic. For instance, say you use a counter of 100 to create a batch: it is possible that your stream never receives 100 events, or that it takes hours to receive the 100th event. Then your basic, working solution can leave events stuck in the batch, because it is a count window. In other words, your batches can close after 30 seconds under high throughput, or after an hour when your throughput is very low.
DataStream<User> stream = ...;
DataStream<Tuple2<User, Long>> stream1 = stream
    .countWindowAll(100)
    .process(new MyProcessWindowFunction());
In general, it depends on your use case. However, I would use a time window to make sure that my job always flushes the batch, even if there are few or no events in the window.
DataStream<Tuple2<User, Long>> stream1 = stream
    .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(30)))
    .process(new MyProcessWindowFunction());
Thanks for all the answers. I used a window function to solve this problem.
SingleOutputStreamOperator<ArrayList<User>> stream2 =
    stream1.countWindowAll(batchSize).process(new MyProcessWindowFunction());
Then I override the process function so that a batch of batchSize elements is buffered in an ArrayList; a sketch of such a function is shown below.
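A minimal sketch of what such a process function might look like (assuming the User type from the snippets above; the poster's actual implementation may differ):

import java.util.ArrayList;

import org.apache.flink.streaming.api.functions.windowing.ProcessAllWindowFunction;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;
import org.apache.flink.util.Collector;

// Collects the contents of each count window into an ArrayList that a
// downstream sink can write to the database in a single batch.
public class MyProcessWindowFunction
        extends ProcessAllWindowFunction<User, ArrayList<User>, GlobalWindow> {

    @Override
    public void process(Context context, Iterable<User> elements,
                        Collector<ArrayList<User>> out) {
        ArrayList<User> batch = new ArrayList<>();
        for (User user : elements) {
            batch.add(user);
        }
        out.collect(batch);
    }
}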
If you want to import data into the database in batches, you can use a window (countWindow or timeWindow) to aggregate the data.

Flink CEP cannot get correct results on a unioned table

I use Flink SQL and CEP to recognize some really simple patterns. However, I found a weird thing (likely a bug). I have two example tables password_change and transfer as below.
transfer
transid,accountnumber,sortcode,value,channel,eventtime,eventtype
1,123,1,100,ONL,2020-01-01T01:00:01Z,transfer
3,123,1,100,ONL,2020-01-01T01:00:02Z,transfer
4,123,1,200,ONL,2020-01-01T01:00:03Z,transfer
5,456,1,200,ONL,2020-01-01T01:00:04Z,transfer
password_change
accountnumber,channel,eventtime,eventtype
123,ONL,2020-01-01T01:00:05Z,password_change
456,ONL,2020-01-01T01:00:06Z,password_change
123,ONL,2020-01-01T01:00:08Z,password_change
123,ONL,2020-01-01T01:00:09Z,password_change
Here are my SQL queries.
First create a temporary view event as
(SELECT accountnumber, rowtime, eventtype FROM password_change WHERE channel = 'ONL')
UNION ALL
(SELECT accountnumber, rowtime, eventtype FROM transfer WHERE channel = 'ONL')
The rowtime column is the event time, extracted directly from the original eventtime column, with a periodic bounded watermark of 1 second.
Then output the query result of
SELECT * FROM `event`
MATCH_RECOGNIZE (
    PARTITION BY accountnumber
    ORDER BY rowtime
    MEASURES
        transfer.eventtype AS event_type,
        transfer.rowtime AS transfer_time
    ONE ROW PER MATCH
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (transfer password_change) WITHIN INTERVAL '5' SECOND
    DEFINE
        password_change AS eventtype = 'password_change',
        transfer AS eventtype = 'transfer'
)
It should output
123,transfer,2020-01-01T01:00:03Z
456,transfer,2020-01-01T01:00:04Z
But I got nothing when running it on Flink 1.11.1 (and no output on 1.10.1 either).
What's more, if I change the pattern to only password_change, it still outputs nothing, but if I change the pattern to transfer, it outputs several rows (though not all of the transfer rows). If I swap the event times of the two tables, i.e. let the password changes happen first, then the pattern password_change outputs several rows while transfer does not.
On the other hand, if I extract those columns from the two tables, merge them into one table manually, and then emit them into Flink, the result is correct.
I have searched and tried a lot to get this right, including changing the SQL statement, the watermark, the buffer timeout, and so on, but nothing helped. I hope someone here can help. Thanks.
Update Oct. 10th 2020:
I use Kafka as the table source. tEnv is the StreamTableEnvironment.
Kafka kafka = new Kafka()
    .version("universal")
    .property("bootstrap.servers", "localhost:9092");
tEnv.connect(
    kafka.topic("transfer")
).withFormat(
    new Json()
        .failOnMissingField(true)
).withSchema(
    new Schema()
        .field("rowtime", DataTypes.TIMESTAMP(3))
        .rowtime(new Rowtime()
            .timestampsFromField("eventtime")
            .watermarksPeriodicBounded(1000)
        )
        .field("channel", DataTypes.STRING())
        .field("eventtype", DataTypes.STRING())
        .field("transid", DataTypes.STRING())
        .field("accountnumber", DataTypes.STRING())
        .field("value", DataTypes.DECIMAL(38, 18))
).createTemporaryTable("transfer");

tEnv.connect(
    kafka.topic("pchange")
).withFormat(
    new Json()
        .failOnMissingField(true)
).withSchema(
    new Schema()
        .field("rowtime", DataTypes.TIMESTAMP(3))
        .rowtime(new Rowtime()
            .timestampsFromField("eventtime")
            .watermarksPeriodicBounded(1000)
        )
        .field("channel", DataTypes.STRING())
        .field("accountnumber", DataTypes.STRING())
        .field("eventtype", DataTypes.STRING())
).createTemporaryTable("password_change");
Thanks to @Dawid Wysakowicz's answer. To confirm it, I added 4,123,1,200,ONL,2020-01-01T01:00:10Z,transfer to the end of the transfer table, and then the output became correct, which means it really is a problem with the watermarks.
So now the question is how to fix it. Since a user will not change his/her password frequently, the time gap between these two tables is unavoidable. I just need the UNION ALL table to behave the same way as the table I merged manually.
Update Nov. 4th 2020:
A WatermarkStrategy that accounts for idle sources may help; see the sketch below.
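For reference, on the DataStream API an idleness-aware strategy can be declared like this (a minimal sketch; the one-minute timeout is an illustrative choice):

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;

public class IdleSourceStrategy {
    // "T" stands in for whatever row type the source produces.
    public static <T> WatermarkStrategy<T> strategy() {
        return WatermarkStrategy
            .<T>forBoundedOutOfOrderness(Duration.ofSeconds(1))
            // A source partition that emits nothing for 1 minute is marked idle
            // and no longer holds back the combined watermark.
            .withIdleness(Duration.ofMinutes(1));
    }
}

On the SQL/Table side, recent Flink versions expose the same idea through the table.exec.source.idle-timeout option.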
Most likely the problem is somewhere around watermark generation in conjunction with the UNION ALL operator. Could you share how you create the two tables, including how you define the time attributes and which connectors you use? That would let me confirm my suspicions.
I think the problem is that one of the sources stops emitting watermarks. If the transfer table (or whichever table has the lower timestamps) is not finished but receives no further records, it emits no more watermarks. After emitting the fourth row it will emit watermark = 01:00:03 (01:00:04 minus the 1-second bound). The watermark of a union of inputs is the smallest of the watermarks of its inputs. Therefore the transfer table pauses/holds the watermark at 01:00:03, which is why you see no progress for the original query, while you do see some records emitted for the table with the smaller timestamps.
If you manually join the two tables, you have just a single input with a single source of watermarks, so the watermark progresses further and you see some results.
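To make the arithmetic concrete, here is a tiny standalone illustration of that min-of-inputs rule applied to the timestamps above:

import java.time.Instant;

public class UnionWatermarkDemo {
    public static void main(String[] args) {
        // The stalled transfer source last emitted watermark 01:00:03
        // (01:00:04 minus the 1-second bound); password_change advanced
        // to 01:00:08 (01:00:09 minus the bound).
        long transferWm = Instant.parse("2020-01-01T01:00:03Z").toEpochMilli();
        long passwordChangeWm = Instant.parse("2020-01-01T01:00:08Z").toEpochMilli();

        // A multi-input operator's event-time clock is the minimum of its
        // inputs, so the union is stuck at 01:00:03 and MATCH_RECOGNIZE
        // sees no event-time progress beyond that point.
        long unionWm = Math.min(transferWm, passwordChangeWm);
        System.out.println(Instant.ofEpochMilli(unionWm)); // 2020-01-01T01:00:03Z
    }
}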

Flink CEP - timeout and Dynamic Patterns

Is there a way to achieve the following with Flink CEP?
Continue matching a pattern when not all of its events have arrived within the time window. For example, A -> B -> C are supposed to match within 20 seconds, and only event A has arrived; right now I see the timeout happen at 20 seconds, and if B arrives after the timeout the pattern no longer matches. How do we make sure it continues to match after the timeout has occurred?
Multiple timeouts: is it possible to alert multiple times on a given pattern? I have a use case where I need to alert at times t1, t2, and t3. Is there a way to achieve that?
Multiple patterns: I have seen some articles where an NFACompiler.NFAFactory is maintained in a map and, based on the data, the right one is looked up and used. Is there an example of how to do this with the current version of Flink?

Custom Watermarks with Apache Flink

I am investigating the types of watermarks that can be inserted into the data stream. 
While this may go outside of the purpose of watermarks, I'll ask it anyway.
Can you create a watermark that holds a timestamp and k/v pair(s) (this=that, that=this)? 
Hence the watermark will hold {12DEC180500GMT,this=that, that=this}.
Or
{Timestamp, kvp1, kvp2, kvpN}
Is something like this possible? I have reviewed the user and API docs but may have overlooked something.
No, the Watermark class in Flink
(found in
flink-streaming-java/src/main/java/org/apache/flink/streaming/api/watermark/Watermark.java)
has only one instance variable besides MAX_WATERMARK, which is
/** The timestamp of the watermark in milliseconds. */
private final long timestamp;
So watermarks cannot carry any information besides a timestamp, which must be a long value.
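For illustration, the class boils down to roughly the following (a simplified excerpt; the real class also defines MAX_WATERMARK and the usual equals/hashCode/toString):

// Simplified sketch of org.apache.flink.streaming.api.watermark.Watermark.
public class Watermark {

    /** The timestamp of the watermark in milliseconds. */
    private final long timestamp;

    public Watermark(long timestamp) {
        this.timestamp = timestamp;
    }

    public long getTimestamp() {
        return timestamp;
    }
}

Any metadata therefore has to travel in the events themselves rather than in the watermarks.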

Two types of triggers on one window?

I need your advice, really.
In my task I need to aggregate events using two types of aggregation.
The first type is onCount, the second type is onTime.
If an event is for onCount aggregation, it has the fields number (the sequence number of the event) and totalCount (how many events we should accumulate before aggregating).
If an event is for onTime aggregation, it has the field time: the date after which we should take all accumulated events and start aggregating.
I can group events by type, open a window, and set a trigger:
stream
  .keyBy(e => (e.clientSystemId, e.onMode))
  .window(GlobalWindows.create())
  .trigger(new WindowAggregationTrigger())
But in the trigger I need to keep state: the total count or the time.
Ideally I would need two different triggers: the first for counting and the second for time-based aggregation.
My question is: how can I solve this problem cleanly when I need two triggers with different logic, one for counting and one for time?
I am not asking you to solve the problem for me, only for advice.
We are developing on Apache Flink 1.4.
It is not possible to apply two different triggers in the same window operator, but you can implement a single trigger that distinguishes the onCount and onTime cases, as sketched below.
However, I would recommend splitting the stream into two streams (using split() or side outputs), applying window operators with different triggers on the split streams, and later union()ing the streams back together (if that is necessary).
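For example, a single trigger distinguishing the two modes might look roughly like this (a sketch only; the Event shape is inferred from the question, and a production version would also delete pending timers in clear()):

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.streaming.api.windowing.triggers.Trigger;
import org.apache.flink.streaming.api.windowing.triggers.TriggerResult;
import org.apache.flink.streaming.api.windowing.windows.GlobalWindow;

// Assumed event shape, inferred from the question.
class Event {
    public String clientSystemId;
    public String onMode;   // "onCount" or "onTime"
    public long totalCount; // for onCount events
    public long time;       // for onTime events (epoch millis)
}

public class WindowAggregationTrigger extends Trigger<Event, GlobalWindow> {

    // Per-key (clientSystemId, onMode) counter for the onCount case.
    private final ValueStateDescriptor<Long> countDesc =
        new ValueStateDescriptor<>("count", Long.class);

    @Override
    public TriggerResult onElement(Event event, long timestamp,
                                   GlobalWindow window, TriggerContext ctx) throws Exception {
        if ("onCount".equals(event.onMode)) {
            ValueState<Long> count = ctx.getPartitionedState(countDesc);
            long newCount = (count.value() == null ? 0L : count.value()) + 1;
            if (newCount >= event.totalCount) {
                count.clear();
                return TriggerResult.FIRE_AND_PURGE; // batch is complete
            }
            count.update(newCount);
        } else {
            // onTime: fire once the event-time clock passes the event's target time.
            ctx.registerEventTimeTimer(event.time);
        }
        return TriggerResult.CONTINUE;
    }

    @Override
    public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.FIRE_AND_PURGE;
    }

    @Override
    public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) {
        return TriggerResult.CONTINUE;
    }

    @Override
    public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
        ctx.getPartitionedState(countDesc).clear();
    }
}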
