Apache Flink EventTime processing not working - apache-flink

I am trying to perform a stream-stream join in a Flink v1.11 app on KDA. The join works with ProcessingTime, but with EventTime I don't see any output records from Flink.
Here is my code with EventTime processing, which is not working:
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    DataStream<Trade> input1 = createSourceFromInputStreamName1(env)
            .assignTimestampsAndWatermarks(
                    WatermarkStrategy.<Trade>forMonotonousTimestamps()
                            .withTimestampAssigner(((event, l) -> event.getEventTime()))
            );
    DataStream<Company> input2 = createSourceFromInputStreamName2(env)
            .assignTimestampsAndWatermarks(
                    WatermarkStrategy.<Company>forMonotonousTimestamps()
                            .withTimestampAssigner(((event, l) -> event.getEventTime()))
            );
    DataStream<String> joinedStream = input1.join(input2)
            .where(new TradeKeySelector())
            .equalTo(new CompanyKeySelector())
            .window(TumblingEventTimeWindows.of(Time.seconds(30)))
            .apply(new JoinFunction<Trade, Company, String>() {
                @Override
                public String join(Trade t, Company c) {
                    return t.getEventTime() + ", " + t.getTicker() + ", " + c.getName() + ", " + t.getPrice();
                }
            });
    joinedStream.addSink(createS3SinkFromStaticConfig());
    env.execute("Flink S3 Streaming Sink Job");
}
I got a similar join working with ProcessingTime:
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
    env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime);
    DataStream<Trade> input1 = createSourceFromInputStreamName1(env);
    DataStream<Company> input2 = createSourceFromInputStreamName2(env);
    DataStream<String> joinedStream = input1.join(input2)
            .where(new TradeKeySelector())
            .equalTo(new CompanyKeySelector())
            .window(TumblingProcessingTimeWindows.of(Time.milliseconds(10000)))
            .apply(new JoinFunction<Trade, Company, String>() {
                @Override
                public String join(Trade t, Company c) {
                    return t.getEventTime() + ", " + t.getTicker() + ", " + c.getName() + ", " + t.getPrice();
                }
            });
    joinedStream.addSink(createS3SinkFromStaticConfig());
    env.execute("Flink S3 Streaming Sink Job");
}
Sample records from two streams which I am trying to join:
{'eventTime': 1611773705, 'ticker': 'TBV', 'price': 71.5}
{'eventTime': 1611773705, 'ticker': 'TBV', 'name': 'The Bavaria'}

I don't see anything obviously wrong, but any of the following could cause this job to not produce any output:
A problem with watermarking. For example, if one of the streams becomes idle, then the watermarks will cease to advance. Or if there are no events after a window, then the watermark will not advance far enough to close that window. Or if the timestamps aren't actually in ascending order (with the forMonotonousTimestamps strategy, the events should be in order by timestamp), the pipeline could be silently dropping all of the out-of-order events.
The StreamingFileSink only finalizes its output during checkpointing, and does not finalize whatever files are pending if and when the job is stopped.
A windowed join behaves like an inner join, and requires at least one event from each input stream in order to produce any results for a given window interval. From the example you shared, it looks like this is not the issue.
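One thing worth checking against the first point: the sample records carry eventTime values such as 1611773705, which looks like epoch seconds, while Flink interprets record timestamps as epoch milliseconds. The sketch below is plain Java, not Flink code. The class and method names are made up for illustration, and the alignment formula is an assumption that matches tumbling windows with zero offset and non-negative timestamps. It shows how much event time has to pass before the 30-second window containing that record can close, under each interpretation of the value.

```java
// Plain-Java illustration with no Flink dependency; all names are hypothetical.
final class WindowArithmetic {
    // Mirrors TumblingEventTimeWindows.of(Time.seconds(30)) from the question.
    static final long WINDOW_SIZE_MS = 30_000L;

    // Start of the tumbling window containing the timestamp (zero offset, ts >= 0).
    static long windowStart(long timestampMs) {
        return timestampMs - (timestampMs % WINDOW_SIZE_MS);
    }

    // Exclusive end of that window.
    static long windowEnd(long timestampMs) {
        return windowStart(timestampMs) + WINDOW_SIZE_MS;
    }

    // Event time that must still pass after this record before the window can close.
    static long remainingUntilClose(long timestampMs) {
        return windowEnd(timestampMs) - timestampMs;
    }

    public static void main(String[] args) {
        long raw = 1_611_773_705L; // eventTime from the sample records

        // Passed through unchanged, the value is treated as milliseconds:
        // window [1611750000, 1611780000), 6295 units remaining.
        System.out.println("unchanged: [" + windowStart(raw) + ", " + windowEnd(raw)
                + "), remaining " + remainingUntilClose(raw));

        // Converted to milliseconds first:
        // window [1611773700000, 1611773730000), 25000 ms remaining.
        long ms = raw * 1000L;
        System.out.println("converted: [" + windowStart(ms) + ", " + windowEnd(ms)
                + "), remaining " + remainingUntilClose(ms));
    }
}
```

If the values really are epoch seconds, the unchanged case needs events whose eventTime is 6295 seconds (about 1 hour 45 minutes) later before the window can fire. Multiplying by 1000 in the timestamp assigner would be the corresponding change, but that is an assumption about the data that the question does not confirm.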
Update:
Given that what you (appear to) want to do is to join each Trade with the latest Company record available at the time of the Trade, a lookup join or a temporal table join seem like they might be good approaches.
Here are a couple of examples:
https://github.com/ververica/flink-sql-cookbook/blob/master/joins/04/04_lookup_joins.md
https://github.com/ververica/flink-sql-cookbook/blob/master/joins/03/03_kafka_join.md
Some documentation:
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/joins.html#event-time-temporal-join
https://ci.apache.org/projects/flink/flink-docs-stable/dev/table/streaming/versioned_tables.html
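To make the idea behind "join each Trade with the latest Company record available at the time of the Trade" concrete, here is a minimal plain-Java sketch of that lookup. It is not Flink code and not how Flink implements temporal joins; the class name, method names, and the second company version are hypothetical and exist only for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of "latest version as of a point in time"; names are hypothetical.
final class LatestCompanyLookup {
    // ticker -> (event time of the company record -> company name)
    private final Map<String, TreeMap<Long, String>> versions = new HashMap<>();

    // Records a new version of the company data for a ticker.
    void addCompany(String ticker, long eventTime, String name) {
        versions.computeIfAbsent(ticker, k -> new TreeMap<>()).put(eventTime, name);
    }

    // Returns the company name that was current at tradeTime, or null if none is known yet.
    String nameAt(String ticker, long tradeTime) {
        TreeMap<Long, String> history = versions.get(ticker);
        if (history == null) {
            return null;
        }
        Map.Entry<Long, String> entry = history.floorEntry(tradeTime);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        LatestCompanyLookup lookup = new LatestCompanyLookup();
        lookup.addCompany("TBV", 1611773705L, "The Bavaria");
        lookup.addCompany("TBV", 1611773800L, "The Bavaria (renamed)"); // hypothetical later version

        // A trade at 1611773750 sees the first version, a trade at 1611773900 the second.
        System.out.println(lookup.nameAt("TBV", 1611773750L));
        System.out.println(lookup.nameAt("TBV", 1611773900L));
    }
}
```

Unlike a windowed join, this pairing does not depend on both records falling into the same window interval; each trade is matched against whatever company version precedes it in event time.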

Related

Cumulative Acknowledgement is not happening in the flink-connector-pulsar

We are using the following libraries:
Flink - 1.15.0
Pulsar - 2.8.2
flink-connector-pulsar - 1.15.0
TestJob.java
public class TestJob {
public static void main(String[] args) {
String authParams = String.format("token:%s", PULSAR_CLIENT_AUTH_TOKEN);
String topicPattern = "persistent://a/b/test";
List topics = new ArrayList();
topics.add(topicPattern);
Properties properties = new Properties();
properties.setProperty(PulsarOptions.PULSAR_AUTH_PLUGIN_CLASS_NAME.key(),
AuthenticationToken.class.getName());
properties.setProperty(PulsarOptions.PULSAR_AUTH_PARAMS.key(), authParams);
properties.setProperty(PulsarOptions.PULSAR_TLS_TRUST_CERTS_FILE_PATH.key(),PULSAR_CERT_PATH);
properties.setProperty(PulsarOptions.PULSAR_SERVICE_URL.key(), PULSAR_HOST);
properties.setProperty(PulsarOptions.PULSAR_CONNECT_TIMEOUT.key(),"600000");
properties.setProperty(PulsarOptions.PULSAR_READ_TIMEOUT.key(),"600000");
properties.setProperty(PulsarSourceOptions.PULSAR_ENABLE_AUTO_ACKNOWLEDGE_MESSAGE.key(),Boolean.TRUE.toString());
properties.setProperty(PulsarOptions.PULSAR_REQUEST_TIMEOUT.key(),"600000");
PulsarSource<String> src = PulsarSource.builder()
.setServiceUrl(PULSAR_HOST)
.setAdminUrl(PULSAR_ADMIN_HOST)
.setProperties(properties)
.setConfig(PulsarSourceOptions.PULSAR_PARTITION_DISCOVERY_INTERVAL_MS,10000000L)
.setStartCursor(StartCursor.earliest())
.setDeserializationSchema(PulsarDeserializationSchema.flinkSchema(new SimpleStringSchema()))
.setSubscriptionName("test-subscription-local")
.setSubscriptionType(SubscriptionType.Failover)
.setConsumerName(String.format("test-consumer-local"))
.setTopics(topics).build();
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setAutoWatermarkInterval(0L);
env.addDefaultKryoSerializer(DateTime.class, JodaDateTimeSerializer.class);
String sourceName = String.format("pulsar-source-local");
DataStream<String> stream = env.fromSource(src,
WatermarkStrategy.noWatermarks(),sourceName)
.setParallelism(1)
.uid(sourceName)
.name(sourceName);
stream
.process(new TestProcessFunction()).setParallelism(1)
.uid(String.format("test-job-pf"))
.name(String.format("test-job-pf"))
.addSink(new TestSink()).setParallelism(1)
.uid(String.format("sink-job"))
.name(String.format("sink-job"));
}}
Messages = M-1 ..... M-10
Expected behavior:
Once messages have been acknowledged, they should not appear again.
Actual behavior:
When the job is restarted after it has processed all the messages, the same messages keep coming back.
We saw that the cumulativeAcknowledgement() function is invoked all the time, with or without checkpointing enabled.

Flink watermark not advancing at all? Stuck at -9223372036854775808

I'm encountering a similar issue to "Flink EventTime Processing Watermark is always coming as -9223372036854725808". However, the suggested solutions (set parallelism and disable checkpointing) do not have any effect. In this example, I'm simply streaming 1000 events 1 second apart and comparing each event timestamp to ctx.timerService().currentWatermark():
>>> v=(61538659200000,0), watermark=-9223372036854775808
>>> v=(61538659201000,1), watermark=-9223372036854775808
>>> v=(61538660198000,998), watermark=-9223372036854775808
>>> v=(61538660199000,999), watermark=-9223372036854775808
public void watermarks()
        throws Exception
{
    final var env = StreamExecutionEnvironment.createLocalEnvironment();
    env.setRuntimeMode(RuntimeExecutionMode.STREAMING);
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
    env.setMaxParallelism(1);
    final long startMs = new Date(2020, 1, 1).getTime();
    final var events = new ArrayList<Tuple2<Long, Integer>>();
    for (var ii = 0; ii < 1000; ++ii) {
        events.add(new Tuple2<Long, Integer>(startMs + ii * 1000, ii));
    }
    env.fromCollection(events)
            .assignTimestampsAndWatermarks(
                    WatermarkStrategy.<Tuple2<Long, Integer>>forMonotonousTimestamps()
                            .withTimestampAssigner((event, ts) -> event.f0))
            .setParallelism(1)
            .keyBy(row -> row.f1 % 2)
            .process(new ProcessFunction<Tuple2<Long, Integer>, String>()
            {
                @Override
                public void processElement(
                        final Tuple2<Long, Integer> value,
                        final Context ctx,
                        final Collector<String> out)
                        throws Exception
                {
                    out.collect("v=" + value + ", watermark=" + ctx.timerService().currentWatermark());
                }
            })
            .setParallelism(1)
            .print()
            .setParallelism(1);
    final var result = env.execute();
    System.out.println(result);
}
forMonotonousTimestamps is a periodic watermark generator that only generates watermarks when triggered by a timer. By default this timer fires every 200 msec (this is the autoWatermarkInterval). Your job doesn't run long enough for this timer to fire.
Bounded sources do generate a watermark with its timestamp set to MAX_WATERMARK when they reach the end of their input -- just before shutting down the job. You're not seeing this watermark in the output from your job because there are no events that follow it.
If you want to generate watermarks with every event, you can implement a custom watermark strategy that emits a watermark in the onEvent method of the WatermarkGenerator. This is usually a bad idea in production, as you'll waste CPU cycles and network bandwidth on these extra watermarks, but it is sometimes helpful for testing.
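The difference between periodic and per-event emission can be sketched with a small simulation. This is plain Java rather than Flink code: the class and method names are hypothetical, wall-clock arrival times are passed in as plain numbers, and the "maximum timestamp minus one" rule is an assumption meant to mirror what a monotonous-timestamps generator emits. In Flink itself, the per-event variant would correspond to calling the output from WatermarkGenerator.onEvent instead of onPeriodicEmit.

```java
import java.util.Arrays;

// Plain-Java simulation of watermark emission timing; all names are hypothetical.
final class WatermarkEmissionDemo {
    // Returns the watermark a downstream operator would observe while processing each event.
    // eventTs:   event-time timestamps, ascending
    // arrivalMs: wall-clock arrival time of each event, in ms since job start
    // periodMs:  periodic emission interval (200 ms is the default mentioned above)
    // perEvent:  true to emit after every event instead of on the periodic timer
    static long[] observedWatermarks(long[] eventTs, long[] arrivalMs, long periodMs, boolean perEvent) {
        long[] seen = new long[eventTs.length];
        long maxTs = Long.MIN_VALUE;     // highest timestamp the generator has seen
        long watermark = Long.MIN_VALUE; // what the downstream operator currently knows
        long nextTimer = periodMs;       // wall-clock time of the next periodic emission
        for (int i = 0; i < eventTs.length; i++) {
            // Fire every periodic timer that became due before this event arrived.
            while (!perEvent && nextTimer <= arrivalMs[i]) {
                if (maxTs != Long.MIN_VALUE) {
                    watermark = Math.max(watermark, maxTs - 1);
                }
                nextTimer += periodMs;
            }
            seen[i] = watermark; // the element is processed before any newer watermark
            maxTs = Math.max(maxTs, eventTs[i]);
            if (perEvent) {
                watermark = Math.max(watermark, maxTs - 1);
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        long[] eventTs = {0L, 1000L, 2000L, 3000L}; // 1 second apart in event time
        long[] arrivalMs = {1L, 2L, 3L, 4L};        // but all arrive within a few ms

        // Periodic: the 200 ms timer never fires, so every element sees Long.MIN_VALUE.
        System.out.println(Arrays.toString(observedWatermarks(eventTs, arrivalMs, 200L, false)));
        // Per event: each element sees the watermark produced by the previous one.
        System.out.println(Arrays.toString(observedWatermarks(eventTs, arrivalMs, 200L, true)));
    }
}
```

With the periodic variant, every element observes -9223372036854775808 because the whole input is consumed before the first timer fires, which matches the output in the question.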
According to source code comments:
/**
* Creates a new enriched {@link WatermarkStrategy} that also does idleness detection in the
* created {@link WatermarkGenerator}.
*
* <p>Add an idle timeout to the watermark strategy. If no records flow in a partition of a
* stream for that amount of time, then that partition is considered "idle" and will not hold
* back the progress of watermarks in downstream operators.
*
* <p>Idleness can be important if some partitions have little data and might not have events
* during some periods. Without idleness, these streams can stall the overall event time
* progress of the application.
*/
default WatermarkStrategy<T> withIdleness(Duration idleTimeout) ...
So you can try using WatermarkStrategy.forMonotonousTimestamps().withIdleness(...).
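The effect described in that comment can be illustrated with a simplified model of how an operator with several inputs derives its own watermark. This is a plain-Java sketch, not Flink's implementation; the class and method names are hypothetical, and returning Long.MIN_VALUE when every input is idle is a simplification (the sketch keeps no history of earlier watermarks).

```java
// Simplified model of combining watermarks from several inputs; names are hypothetical.
final class CombinedWatermark {
    // The combined watermark is the minimum over all inputs that are not marked idle.
    // Returns Long.MIN_VALUE if every input is idle (a simplification of this sketch).
    static long combine(long[] inputWatermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < inputWatermarks.length; i++) {
            if (idle[i]) {
                continue; // idle inputs do not hold back progress
            }
            anyActive = true;
            min = Math.min(min, inputWatermarks[i]);
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // Input 0 has advanced to 5000; input 1 has never produced a watermark.
        long[] watermarks = {5000L, Long.MIN_VALUE};

        // Without idleness detection the silent input pins the result at Long.MIN_VALUE.
        System.out.println(combine(watermarks, new boolean[] {false, false}));
        // Once input 1 is marked idle, the result follows input 0 and becomes 5000.
        System.out.println(combine(watermarks, new boolean[] {false, true}));
    }
}
```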

Flink task Manager hangs

Here is the programme
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
env.setRuntimeMode(RuntimeExecutionMode.BATCH);
ParameterTool parameters = ParameterTool.fromArgs(args);
String ftpUri ;
env.readTextFile(ftpUri,"UTF-8")
.map(mapFunction)
.keyBy(tuple2 -> tuple2.f0)
.window(TumblingProcessingTimeWindows.of(Time.seconds(2)))
.reduce((tuple2, t1) -> {
Collection newCol = new ArrayList<OpisRecord>();
Collections.addAll(newCol,tuple2.f1.toArray());
Collections.addAll(newCol,t1.f1.toArray());
return new Tuple2(tuple2.f0,newCol);
})
.addSink(new SinktoDistributedCache());
env.execute();
This works fine for record sizes of 10k to 40k, but hangs for anything above 40k.
I have tried increasing the number of task managers and the parallelism, with no improvement.
Any clues?

Flink SQL: How can use a Long type column to Rowtime

Flink 1.9.1
I read a CSV file and want to use a Long column as the time column for TUMBLE.
I use a UDF to convert the Long to a Timestamp, but it doesn't work.
Error message: Window can only be defined over a time attribute column.
While debugging, I found that the check below requires a TimeIndicatorRelDataType, not a Timestamp. I don't know how to convert to it, or why it is needed.
def isTimeIndicatorType(relDataType: RelDataType): Boolean = relDataType match {
case ti: TimeIndicatorRelDataType => true
case _ => false
}
CODE
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
env.setParallelism(1);
// read csv
URL fileUrl = HotItemsSql.class.getClassLoader().getResource("UserBehavior-less.csv");
CsvTableSource csvTableSource = CsvTableSource.builder().path(fileUrl.getPath())
.field("userId", BasicTypeInfo.LONG_TYPE_INFO)
.field("itemId", BasicTypeInfo.LONG_TYPE_INFO)
.field("categoryId", BasicTypeInfo.LONG_TYPE_INFO)
.field("behavior", BasicTypeInfo.LONG_TYPE_INFO)
.field("optime", BasicTypeInfo.LONG_TYPE_INFO)
.build();
// trans to stream
DataStream<Row> csvDataStream=csvTableSource.getDataStream(env).assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Row>() {
@Override
public long extractAscendingTimestamp(Row element) {
return Timestamp.valueOf(element.getField(5).toString()).getTime();
}
}).broadcast();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.registerDataStream("T_UserBehavior",csvDataStream,"userId,itemId,categoryId,behavior,optime");
tableEnv.registerFunction("Long2DateTime",new DateTransFunction());
Table result = tableEnv.sqlQuery("select userId," +
"TUMBLE_START(Long2DateTime(optime), INTERVAL '10' SECOND) as window_start," +
"TUMBLE_END(Long2DateTime(optime), INTERVAL '10' SECOND) as window_end " +
"from T_UserBehavior " +
"group by TUMBLE(Long2DateTime(optime),INTERVAL '10' SECOND),userId");
tableEnv.toRetractStream(result, Row.class).print();
UDF
import java.sql.Timestamp;
public class DateTransFunction extends ScalarFunction {
public Timestamp eval(Long longTime) {
try {
Timestamp t = new Timestamp(longTime);
return t;
} catch (Exception e) {
return null;
}
}
}
error stack
Exception in thread "main" org.apache.flink.table.api.ValidationException: Window can only be defined over a time attribute column.
at org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.getOperandAsTimeIndicator$1(DataStreamLogicalWindowAggregateRule.scala:85)
at org.apache.flink.table.plan.rules.datastream.DataStreamLogicalWindowAggregateRule.translateWindowExpression(DataStreamLogicalWindowAggregateRule.scala:90)
at org.apache.flink.table.plan.rules.common.LogicalWindowAggregateRule.onMatch(LogicalWindowAggregateRule.scala:68)
at org.apache.calcite.plan.AbstractRelOptPlanner.fireRule(AbstractRelOptPlanner.java:319)
at org.apache.calcite.plan.hep.HepPlanner.applyRule(HepPlanner.java:560)
at org.apache.calcite.plan.hep.HepPlanner.applyRules(HepPlanner.java:419)
at org.apache.calcite.plan.hep.HepPlanner.executeInstruction(HepPlanner.java:256)
at org.apache.calcite.plan.hep.HepInstruction$RuleInstance.execute(HepInstruction.java:127)
at org.apache.calcite.plan.hep.HepPlanner.executeProgram(HepPlanner.java:215)
at org.apache.calcite.plan.hep.HepPlanner.findBestExp(HepPlanner.java:202)
at org.apache.flink.table.plan.Optimizer.runHepPlanner(Optimizer.scala:228)
at org.apache.flink.table.plan.Optimizer.runHepPlannerSequentially(Optimizer.scala:194)
at org.apache.flink.table.plan.Optimizer.optimizeNormalizeLogicalPlan(Optimizer.scala:150)
at org.apache.flink.table.plan.StreamOptimizer.optimize(StreamOptimizer.scala:65)
at org.apache.flink.table.planner.StreamPlanner.translateToType(StreamPlanner.scala:410)
at org.apache.flink.table.planner.StreamPlanner.org$apache$flink$table$planner$StreamPlanner$$translate(StreamPlanner.scala:182)
Since you already managed to assign a timestamp in DataStream API, you should be able to call:
tableEnv.registerDataStream(
"T_UserBehavior",
csvDataStream,
"userId, itemId, categoryId, behavior, rt.rowtime");
The .rowtime suffix instructs the API to create a column holding the timestamp stored in every stream record coming from the DataStream API.
The community is currently working on making programs like yours easier to write. In Flink 1.10 you should be able to define your CSV table with a rowtime attribute directly in a SQL DDL.

How to create batch or slide windows using Flink CEP?

I'm just starting with Flink CEP and I come from the Esper CEP engine. As you may (or may not) know, in Esper you can use its syntax (EPL) to create a batch or sliding window easily, grouping the events in those windows and allowing you to apply functions (avg, max, min, ...) to these events.
For example, with the following pattern you can create a batch window of 5 seconds and calculate the average value of the price attribute over all the Stock events received in that window.
select avg(price) from Stock#time_batch(5 sec)
I would like to know how to implement this in Flink CEP. I'm aware that the goal or approach in Flink CEP is probably different, so implementing this may not be as simple as in Esper CEP.
I have taken a look at the docs on time windows, but I'm not able to implement these windows together with Flink CEP. So, given the following code:
DataStream<Stock> stream = ...; // Consume events from Kafka
// Filtering events with negative price
Pattern<Stock, ?> pattern = Pattern.<Stock>begin("start")
.where(
new SimpleCondition<Stock>() {
public boolean filter(Stock event) {
return event.getPrice() >= 0;
}
}
);
PatternStream<Stock> patternStream = CEP.pattern(stream, pattern);
/**
CREATE A BATCH WINDOW OF 5 SECONDS IN WHICH
I COMPUTE OVER THE AVERAGE PRICES AND, IF IT IS
GREATER THAN A THRESHOLD, AN ALERT IS DETECTED
return avg(allEventsInWindow.getPrice()) > 1;
*/
DataStream<Alert> result = patternStream.select(
new PatternSelectFunction<Stock, Alert>() {
@Override
public Alert select(Map<String, List<Stock>> pattern) throws Exception {
return new Alert(pattern.toString());
}
}
);
How can I create a window that starts with the first event received and in which I calculate the average over the events that follow within 5 seconds? For example:
t = 0 seconds
Stock(price = 1); (...starting batch window...)
Stock(price = 1);
Stock(price = 1);
Stock(price = 2);
Stock(price = 2);
Stock(price = 2);
t = 5 seconds (...end of batch window...)
Avg = 1.5 => Alert detected!
The average after 5 seconds would be 1.5, and will trigger the alert. How can I code this?
Thanks!
With Flink's CEP library this behavior is not expressible. I would rather recommend using Flink's DataStream or Table API to calculate the averages. Based on that you could again use CEP to generate other events.
final DataStream<Stock> input = env
        .fromElements(
                new Stock(1L, 1.0),
                new Stock(2L, 2.0),
                new Stock(3L, 1.0),
                new Stock(4L, 2.0))
        // Use the timestamp carried by each Stock event as its event time.
        .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor<Stock>(Time.seconds(0L)) {
            @Override
            public long extractTimestamp(Stock element) {
                return element.getTimestamp();
            }
        });
// Average per tumbling window; the accumulator holds (count, sum).
final DataStream<Double> windowAggregation = input
        .timeWindowAll(Time.milliseconds(2))
        .aggregate(new AggregateFunction<Stock, Tuple2<Integer, Double>, Double>() {
            @Override
            public Tuple2<Integer, Double> createAccumulator() {
                return Tuple2.of(0, 0.0);
            }
            @Override
            public Tuple2<Integer, Double> add(Stock value, Tuple2<Integer, Double> accumulator) {
                return Tuple2.of(accumulator.f0 + 1, accumulator.f1 + value.getValue());
            }
            @Override
            public Double getResult(Tuple2<Integer, Double> accumulator) {
                return accumulator.f1 / accumulator.f0;
            }
            @Override
            public Tuple2<Integer, Double> merge(Tuple2<Integer, Double> a, Tuple2<Integer, Double> b) {
                return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
            }
        });
// Keep only the window averages that exceed the threshold.
final DataStream<Double> result = windowAggregation.filter((FilterFunction<Double>) value -> value > THRESHOLD);
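For reference, the computation the windowed aggregation performs can be written out in plain Java using the example from the question (prices 1, 1, 1, 2, 2, 2 within 5 seconds, threshold 1). This is a sketch without Flink; the class and method names and the exact timestamps are hypothetical. Note one difference from Esper's batch window as described in the question: the windows here are aligned to multiples of the window size rather than starting at the first event.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch of a tumbling-window average with a threshold alert; names are hypothetical.
final class BatchWindowAverage {
    // Groups events into tumbling windows and returns window start -> average price.
    static Map<Long, Double> averages(long[] timestampsMs, double[] prices, long windowSizeMs) {
        Map<Long, double[]> accumulators = new TreeMap<>(); // window start -> {count, sum}
        for (int i = 0; i < timestampsMs.length; i++) {
            long start = timestampsMs[i] - (timestampsMs[i] % windowSizeMs);
            double[] acc = accumulators.computeIfAbsent(start, k -> new double[2]);
            acc[0] += 1;
            acc[1] += prices[i];
        }
        Map<Long, Double> result = new TreeMap<>();
        for (Map.Entry<Long, double[]> e : accumulators.entrySet()) {
            result.put(e.getKey(), e.getValue()[1] / e.getValue()[0]);
        }
        return result;
    }

    // Returns the start times of all windows whose average exceeds the threshold.
    static List<Long> alerts(Map<Long, Double> averages, double threshold) {
        List<Long> windows = new ArrayList<>();
        for (Map.Entry<Long, Double> e : averages.entrySet()) {
            if (e.getValue() > threshold) {
                windows.add(e.getKey());
            }
        }
        return windows;
    }

    public static void main(String[] args) {
        long[] timestampsMs = {0L, 1000L, 2000L, 3000L, 4000L, 4999L}; // all within the first 5 s
        double[] prices = {1, 1, 1, 2, 2, 2};
        Map<Long, Double> avg = averages(timestampsMs, prices, 5000L);
        System.out.println(avg);              // one window starting at 0 with average 1.5
        System.out.println(alerts(avg, 1.0)); // that window exceeds the threshold of 1
    }
}
```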
