Run aggregation on kinesis stream data using Flink SQL query - apache-flink

I am trying to run aggregations such as count and sum with grouping. My source and sink are Kinesis streams. I am getting the error below:
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: Table sink '*anonymous_datastream_sink$1*' doesn't support consuming update changes which is produced by node GroupAggregate(select=[COUNT(*) AS EXPR$0])
Pasting the code sample:
DataStream<String> stream = env.addSource(new FlinkKinesisConsumer<>(inputStream, new SimpleStringSchema(), inputProperties));
tableEnv.createTemporaryView("kinesis", stream);
Table result = tableEnv.sqlQuery("SELECT count(*) FROM kinesis");
DataStream<String> rowDataStream = tableEnv.toDataStream(result).map(Row::toString);
KinesisStreamsSink<String> kdsSink = KinesisStreamsSink.<String>builder()
        .setKinesisClientProperties(sinkProperties)
        .setSerializationSchema(new SimpleStringSchema())
        .setPartitionKeyGenerator(element -> String.valueOf(element.hashCode()))
        .setStreamName("dummy_topic")
        .build();
rowDataStream.sinkTo(kdsSink);
I need to perform aggregations over time windows. How do I proceed with this?
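One way to avoid the "doesn't support consuming update changes" error is to aggregate over a window, since a windowed GROUP BY produces an append-only result that an insert-only sink like Kinesis can consume. Below is a minimal, self-contained sketch (not the exact code of this job): it assumes a processing-time attribute added via the Schema builder, uses a one-minute tumbling window, and prints instead of writing to the KinesisStreamsSink from the question.

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Schema;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class WindowedCountSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        // Stand-in for the DataStream<String> read from Kinesis in the question.
        DataStream<String> stream = env.fromElements("a", "b", "c");

        // Register the stream with an additional processing-time attribute so it can be windowed.
        tableEnv.createTemporaryView(
                "kinesis",
                stream,
                Schema.newBuilder()
                        .columnByExpression("proc_time", "PROCTIME()")
                        .build());

        // A tumbling-window count emits one row per window (append-only),
        // so it can be written to an insert-only sink without the error above.
        Table result = tableEnv.sqlQuery(
                "SELECT TUMBLE_START(proc_time, INTERVAL '1' MINUTE) AS wstart, COUNT(*) AS cnt "
                        + "FROM kinesis "
                        + "GROUP BY TUMBLE(proc_time, INTERVAL '1' MINUTE)");

        DataStream<String> out = tableEnv.toDataStream(result).map(Row::toString);
        out.print(); // replace with the KinesisStreamsSink from the question
        env.execute("windowed-count-sketch");
    }
}

For event-time windows, the view would instead need a watermarked event-time column (declared via the Schema builder's watermark() method) in place of the PROCTIME() attribute.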

Related

Two flink jobs running in one application result in first to complete and second to fail with NPE

I have two Flink jobs in one application:
1) The first is a Flink batch job that sends events to Kafka, which is then written to S3 by another process.
2) The second is a Flink batch job that validates the generated data (reads from S3).
These two jobs work fine separately. When combined, only the first job completes and sends events to Kafka; the second fails when I traverse the result of the SQL query.
...
//First job
val env = org.apache.flink.streaming.api.scala.StreamExecutionEnvironment.getExecutionEnvironment
...
//Creates Datastream from generated events and gets the store
streamingDataset.write(store)
env.execute()
...
// Second job
val flinkEnv = org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.getExecutionEnvironment
val batchStream: DataStream[RowData] =
FlinkSource.forRowData()
.env(flinkEnv)
.tableLoader(tableLoader)
.streaming(false)
.build()
val tableEnv = StreamTableEnvironment.create(flinkEnv)
val inputTable = tableEnv.fromDataStream(batchStream)
tableEnv.createTemporaryView("InputTable", inputTable)
val resultTable: TableResult = tableEnv
.sqlQuery("SELECT * FROM InputTable")
.fetch(3)
.execute()
val results: CloseableIterator[Row] = resultTable.collect()
while (results.hasNext) {
  print("Result test " + results.next())
}
...
org.apache.flink.streaming.api.operators.collect.CollectResultFetcher [] - An exception occurred when fetching query results
java.lang.NullPointerException: Unknown operator ID. This is a bug.
at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:76)
at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.sendRequest(CollectResultFetcher.java:166)
at org.apache.flink.streaming.api.operators.collect.CollectResultFetcher.next(CollectResultFetcher.java:129)
at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.nextResultFromFetcher(CollectResultIterator.java:106)
at org.apache.flink.streaming.api.operators.collect.CollectResultIterator.hasNext(CollectResultIterator.java:80)
at org.apache.flink.table.planner.connectors.CollectDynamicSink$CloseableRowIteratorWrapper.hasNext(CollectDynamicSink.java:222) ~[?:?]
I want to have two jobs in one application so that the generated data stays in memory (so I don't have to take care of saving it somewhere else). Is it possible to combine these two jobs, or do I have to run them separately? Or is there a better way to restructure my code to make it work?

Using KeyBy vs reinterpretAsKeyedStream() when reading from Kafka

I have a simple Flink stream processing application (Flink version 1.13). The Flink app reads from Kafka, does stateful processing of the record, then writes the result back to Kafka.
After reading from the Kafka topic, I chose to use reinterpretAsKeyedStream() and not keyBy() to avoid a shuffle, since the records are already partitioned in Kafka. The key used to partition in Kafka is a String field of the record (using the default Kafka partitioner). The Kafka topic has 24 partitions.
The mapping class is defined as follows. It keeps track of the state of the record.
public class EnvelopeMapper extends
KeyedProcessFunction<String, Envelope, Envelope> {
...
}
The processing of the record is as follows:
DataStream<Envelope> messageStream = env.addSource(kafkaSource);

DataStreamUtils.reinterpretAsKeyedStream(messageStream, Envelope::getId)
    .process(new EnvelopeMapper(parameters))
    .addSink(kafkaSink);
With parallelism of 1, the code runs fine. With parallelism greater than 1 (e.g. 4), I am running into the following error:
2022-06-12 21:06:30,720 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Source: Custom Source -> Map -> Flat Map -> KeyedProcess -> Map -> Sink: Unnamed (4/4) (7ca12ec043a45e1436f45d4b20976bd7) switched from RUNNING to FAILED on 100.101.231.222:44685-bd10d5 # 100.101.231.222 (dataPort=37839).
java.lang.IllegalArgumentException: KeyGroupRange{startKeyGroup=96, endKeyGroup=127} does not contain key group 85
Based on the stack trace, it seems the exception happens when the EnvelopeMapper class validates that the record was sent to the right replica of the mapper object.
When reinterpretAsKeyedStream() is used, how are the records distributed among the different replicas of the EnvelopeMapper?
Thank you in advance,
Ahmed.
Update
After feedback from @David Anderson, I replaced reinterpretAsKeyedStream() with keyBy(). The processing of the record is now as follows:
DataStream<Envelope> messageStream =
env.addSource(kafkaSource) // Line x
.map(statelessMapper1)
.flatMap(statelessMapper2);
messageStream.keyBy(Envelope::getId)
.process(new EnvelopeMapper(parameters))
.addSink(kafkaSink);
Is there any difference in performance if keyBy() is done right after reading from Kafka (marked with "Line x") vs. right before the stateful mapper (EnvelopeMapper)?
With
reinterpretAsKeyedStream(
DataStream<T> stream,
KeySelector<T, K> keySelector,
TypeInformation<K> typeInfo)
you are asserting that the records are already distributed exactly as they would be if you had instead used keyBy(keySelector). This will not normally be the case with records coming straight out of Kafka. Even if they are partitioned by key in Kafka, the Kafka partitions won't be correctly associated with Flink's key groups.
reinterpretAsKeyedStream is only straightforwardly useful in cases such as handling the output of a window or process function, where you know that the output records are key-partitioned in a particular way. Using it successfully with Kafka can be very difficult: you must either be very careful in how the data is written to Kafka in the first place, or do something tricky with the keySelector so that the keyGroups it computes line up with how the keys are mapped to Kafka partitions.
One case where this isn't difficult is if the data is written to Kafka by a Flink job running with the same configuration as the downstream job that is reading the data and using reinterpretAsKeyedStream.
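To make the mismatch concrete: Flink assigns each key to a key group (and thus to a parallel subtask) by hashing the key, which in general has nothing to do with how Kafka's partitioner placed the record. A small sketch using Flink's internal KeyGroupRangeAssignment utility (from flink-runtime; the key value here is just a placeholder) shows which subtask Flink expects a given key to arrive at:

import org.apache.flink.runtime.state.KeyGroupRangeAssignment;

public class KeyGroupCheck {
    public static void main(String[] args) {
        int maxParallelism = 128;  // Flink's default max parallelism
        int parallelism = 4;       // the job parallelism from the question

        String key = "some-envelope-id";  // placeholder key value

        // The key group Flink computes for this key (0..maxParallelism-1).
        int keyGroup = KeyGroupRangeAssignment.assignToKeyGroup(key, maxParallelism);

        // The subtask index that owns this key group under the given parallelism.
        int subtask = KeyGroupRangeAssignment.assignKeyToParallelOperator(key, maxParallelism, parallelism);

        System.out.printf("key=%s keyGroup=%d subtask=%d%n", key, keyGroup, subtask);

        // Kafka's default partitioner hashes the key bytes differently (murmur2),
        // so a record read from Kafka partition p will usually not arrive at the
        // subtask that owns its Flink key group, hence the
        // "KeyGroupRange ... does not contain key group ..." error.
    }
}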

Handling output data from flink datastream

Below is the pseudocode of my stream processing.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// timeWindowAll() returns an AllWindowedStream, not a DataStream
AllWindowedStream stream = env.addSource(...)
    .map(/* map to java object */)
    .filter(/* filter for specific type of events */)
    .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(2)) {})
    .timeWindowAll(Time.seconds(10));

// collect all records of each window
DataStream windowedStream = stream.apply(new AllWindowFunction(...));
DataStream processedStream = windowedStream.keyBy(...).reduce(...);

String outputPath = "";
final StreamingFileSink sink = StreamingFileSink.forRowFormat(...).build();
processedStream.addSink(sink);
The above code flow creates multiple files, and each file seems to contain records from different windows. For example, the records in each file have timestamps that range over 30-40 seconds, whereas the window size is only 10 seconds.
My expected output pattern is to write each window's data to a separate file.
Any references or input on this would be of great help.
Take a look at the BucketAssigner interface. It should be flexible enough to meet your needs. You just need to make sure that your stream events contain enough information to determine the path you want them written to.
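For example, here is a minimal sketch of a custom BucketAssigner (the class name and the 10-second window size are assumptions taken from the question) that derives the bucket from the element's event timestamp, so that records of the same window land in the same directory, provided the window timestamp survives the downstream keyBy/reduce:

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

public class WindowBucketAssigner<IN> implements BucketAssigner<IN, String> {

    // Matches the 10-second window used in the question.
    private final long windowSizeMs = 10_000L;

    @Override
    public String getBucketId(IN element, Context context) {
        // context.timestamp() is the element's event timestamp; window operators
        // set it to the window's max timestamp. Fall back to processing time
        // if no timestamp is attached to the element.
        Long ts = context.timestamp();
        long effective = (ts != null) ? ts : context.currentProcessingTime();
        // Align the timestamp to the start of its 10-second window.
        long windowStart = effective - (effective % windowSizeMs);
        return "window-" + windowStart;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

It would be wired in with something like StreamingFileSink.forRowFormat(...).withBucketAssigner(new WindowBucketAssigner<>()).build().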

Flink: understanding the dataflow of my program

I've developed a Flink program that reads tweets from Twitter and pushes them to Kafka. It then reads the tweets back from Kafka and processes them.
The "Tweets processing" transformation extracts hashtags and users from the text of the tweet, emits them in the default output, and emits every pair of them in a side output.
The attached image is taken from the Flink Web UI. I don't understand why the Kafka source and the Tweets processing operator are merged into a single task, and above all I want the Tweets sink to receive all the raw tweets from the Kafka source, not the output of the Tweets processing operator.
Is the program correct?
Dataflow
This is the relevant part of the code:
FlinkKafkaConsumer010<String> myConsumer = new FlinkKafkaConsumer010<String>(Constants.KAFKA_TWEETS_TOPIC, new SimpleStringSchema(), properties);
myConsumer.setStartFromLatest();
DataStream<String> tweetsStream = env
.addSource(myConsumer)
.name("Kafka tweets consumer");
SingleOutputStreamOperator<List<String>> tweetsAggregator = tweetsStream
.timeWindowAll(Time.seconds(7))
.aggregate(new StringAggregatorFunction())
.name("Tweets aggregation");
DataStreamSink tweetsSink = tweetsAggregator.addSink(new TweetsSink())
.name("Tweets sink")
.setParallelism(1);
SingleOutputStreamOperator<String> termsStream = tweetsStream
// extracting terms from tweets
.process(new TweetParse())
.name("Tweets processing");
DataStream<Tuple2<String, Integer>> counts = termsStream
.map(new ToTuple())
// Counting terms
.keyBy(0)
.timeWindow(Time.seconds(13))
.sum(1)
.name("Terms processing");
DataStream<Tuple3<String, String, Integer>> edgesStream = termsStream.getSideOutput(TweetParse.outputTag)
.map(new ToTuple3())
// Counting terms pairs
.keyBy(0, 1)
.timeWindow(Time.seconds(19))
.sum(2)
.name("Edges processing");
You are creating two different dataflows from tweetsStream: the first is tweetsAggregator and the second is termsStream. Then you are creating two more dataflows from termsStream: counts and edgesStream. A sink operator has no output, so it cannot feed data to another operator and must be the last operator in a chain. You have to start with a data source operator, addSource(myConsumer), chain as many transformations as you want (timeWindowAll, aggregate, map, keyBy, ...), and then call a sink operator. You can call more than one sink if you want, but remember that sinks don't produce a data stream for other operators; they are consumers.
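Note that the source and "Tweets processing" show up as one task only because of operator chaining, which fuses operators into a single task for efficiency without changing which stream each operator consumes; the Tweets sink is attached to tweetsAggregator, which reads the raw tweetsStream, not the output of TweetParse. A small sketch (reusing the names from the question) that breaks the chain purely so the Web UI shows the operators separately:

SingleOutputStreamOperator<String> termsStream = tweetsStream
        // extracting terms from tweets
        .process(new TweetParse())
        .disableChaining() // render "Tweets processing" as its own task in the UI
        .name("Tweets processing");

// Or, for debugging only, disable chaining for the whole job:
// env.disableOperatorChaining();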

Flink+Kafka 0.10: How to create a Table with the Kafka message timestamp as field?

I would like to extract the timestamp of the messages that are produced by FlinkKafkaConsumer010 as values in the data stream.
I am aware of the AssignerWithPeriodicWatermarks class, but this seems to only extract the timestamp for the purposes of time aggregates via the DataStream API.
I would like to make that Kafka message timestamp available in a Table so later on, I can use SQL on it.
EDIT: Tried this:
val consumer = new FlinkKafkaConsumer010("test", new SimpleStringSchema, properties)
consumer.setStartFromEarliest()
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tenv = TableEnvironment.getTableEnvironment(env)
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
class KafkaAssigner[T] extends AssignerWithPeriodicWatermarks[T] {
var maxTs = 0L
override def extractTimestamp(element: T, previousElementTimestamp: Long): Long = {
maxTs = Math.max(maxTs, previousElementTimestamp)
previousElementTimestamp
}
override def getCurrentWatermark: Watermark = new Watermark(maxTs - 1L)
}
val stream = env
.addSource(consumer)
.assignTimestampsAndWatermarks(new KafkaAssigner[String])
.flatMap(_.split("\\W+"))
val tbl = tenv.fromDataStream(stream, 'w, 'ts.rowtime)
It compiles, but throws:
Exception in thread "main" org.apache.flink.table.api.TableException: Field reference expression requested.
at org.apache.flink.table.api.TableEnvironment$$anonfun$1.apply(TableEnvironment.scala:630)
at org.apache.flink.table.api.TableEnvironment$$anonfun$1.apply(TableEnvironment.scala:624)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:186)
at org.apache.flink.table.api.TableEnvironment.getFieldInfo(TableEnvironment.scala:624)
at org.apache.flink.table.api.StreamTableEnvironment.registerDataStreamInternal(StreamTableEnvironment.scala:398)
at org.apache.flink.table.api.scala.StreamTableEnvironment.fromDataStream(StreamTableEnvironment.scala:85)
at the very last line of the above code.
EDIT2: Thanks to @fabian-hueske for pointing me to a workaround. Full code at https://github.com/andrey-savov/flink-kafka
Flink's Kafka 0.10 consumer automatically sets the timestamp of a Kafka message as the event-time timestamp of produced records if the time characteristic EventTime is configured (see docs).
After you have ingested the Kafka topic into a DataStream with timestamps (still not visible) and watermarks assigned, you can convert it into a Table with the StreamTableEnvironment.fromDataStream(stream, fieldExpr*) method. The fieldExpr* parameter is a list of expressions that describe the schema of the generated table. You can add a field that holds the record timestamp of the stream with an expression mytime.rowtime, where mytime is the name of the new field and rowtime indicates that the value is extracted from the record timestamp. Please check the docs for details.
NOTE: As @bfair pointed out, the conversion of a DataStream of an atomic type (such as DataStream[String]) fails with an exception in Flink 1.3.2 and earlier versions. The bug has been reported as FLINK-7939 and will be fixed in a future version.
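A common workaround for the atomic-type limitation is to wrap the records in a composite type (for example a Tuple1) before converting, so the rowtime attribute can be appended as an extra field. A rough Java sketch against the Table API of that era (field names are assumptions, and tableEnv/words stand in for the table environment and the timestamped word stream from the question):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.table.api.Table;

// Wrap each String in a Tuple1 so the stream is no longer an atomic type.
DataStream<Tuple1<String>> wrapped = words.map(new MapFunction<String, Tuple1<String>>() {
    @Override
    public Tuple1<String> map(String w) {
        return new Tuple1<>(w);
    }
});

// "w" names the tuple field; "ts.rowtime" appends the record timestamp
// (the Kafka message timestamp) as an event-time attribute.
Table tbl = tableEnv.fromDataStream(wrapped, "w, ts.rowtime");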
