Handling output data from a Flink DataStream

Below is the pseudocode of my stream processing job.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

DataStream stream = env.addSource(...)
        .map(...)      // map to a Java object
        .filter(...)   // filter for a specific type of event
        .assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor(Time.seconds(2)) { ... })
        .timeWindowAll(Time.seconds(10));   // collect all records per 10-second window

DataStream windowedStream = stream.apply(new AllWindowFunction(...));
DataStream processedStream = windowedStream.keyBy(...).reduce(...);

String outputPath = "";
final StreamingFileSink sink = StreamingFileSink.forRowFormat(...).build();
processedStream.addSink(sink);
The above flow creates multiple files, and each file appears to contain records from several different windows. For example, the records in each file have timestamps spanning roughly 30-40 seconds, whereas the window size is only 10 seconds.
What I expect instead is that each window's data is written to a separate file.
Any references or input on this would be of great help.

Take a look at the BucketAssigner interface. It should be flexible enough to meet your needs. You just need to make sure that your stream events contain enough information to determine the path you want them written to.
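For example, here is a minimal sketch of a BucketAssigner that derives the bucket (and therefore the output directory) from the event timestamp, so that all records belonging to the same 10-second window land in the same path. MyEvent and its getTimestamp() accessor are placeholders for your record type, not from the question:

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;

public class WindowBucketAssigner implements BucketAssigner<MyEvent, String> {

    private static final long WINDOW_SIZE_MS = 10_000L; // must match your window size

    @Override
    public String getBucketId(MyEvent element, Context context) {
        // truncate the event time to the start of its 10-second window,
        // so every window gets its own bucket (its own directory of files)
        long windowStart = element.getTimestamp() - (element.getTimestamp() % WINDOW_SIZE_MS);
        return "window-" + windowStart;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}

You would then plug it into the sink with something like StreamingFileSink.forRowFormat(new Path(outputPath), encoder).withBucketAssigner(new WindowBucketAssigner()).build().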

Related

Apache Flink & Iceberg: Not able to process hundreds of RowData types

I have a Flink application that reads arbitrary Avro data, maps it to RowData and uses several FlinkSink instances to write the data into Iceberg tables. By arbitrary data I mean that I have 100 types of Avro messages, all of them with a common property "tableName" but containing different columns. I would like to write each of these message types into a separate Iceberg table.
To do this I'm using side outputs: once my data is mapped to RowData, I use a ProcessFunction to write each message to a specific OutputTag.
Later on, with the datastream already processed, I loop over the different output tags, get the records using getSideOutput and create a specific IcebergSink for each of them. Something like:
final List<OutputTag<RowData>> tags = ... // list of all possible output tags

final SingleOutputStreamOperator<RowData> rowdata = stream
        .map(new ToRowDataMap())                  // map custom Avro POJO into RowData
        .uid("map-row-data")
        .name("Map to RowData")
        .process(new ProcessRecordFunction(tags)) // process elements one by one, sending each to its specific OutputTag
        .uid("id-process-record")
        .name("Process Input records");
CatalogLoader catalogLoader = ...
String upsertField = ...
tags.stream().forEach(tag -> {
    DataStream<RowData> outputStream = rowdata.getSideOutput(tag);
    TableIdentifier identifier = TableIdentifier.of("myDBName", tag.getId());
    FlinkSink.Builder builder = FlinkSink
            .forRowData(outputStream)
            .table(catalog.loadTable(identifier))
            .tableLoader(TableLoader.fromCatalog(catalogLoader, identifier))
            .set("upsert-enabled", "true")
            .uidPrefix("committer-sink-" + tag.getId())
            .equalityFieldColumns(Collections.singletonList(upsertField));
    builder.append();
});
It works very well when I'm dealing with a few tables. But when the number of tables scales up, Flink cannot acquire enough task resources, since each sink requires two different operators (because of the internals of https://iceberg.apache.org/javadoc/0.10.0/org/apache/iceberg/flink/sink/FlinkSink.html).
Is there a more efficient way of doing this, or any way of optimizing it?
Thanks in advance ! :)
Given your question, I assume that about half of your operators are IcebergStreamWriter, which are fully utilised, and the other half are IcebergFilesCommitter, which are rarely used.
You can optimise the resource usage of the servers by:
Increasing the number of slots on the TaskManagers (taskmanager.numberOfTaskSlots) [1] - so the CPU not utilised by the idle IcebergFilesCommitter operators can be used by the other operators on the TaskManager
Increasing the resources provided to the TaskManagers (taskmanager.memory.process.size) [2] - this helps by distributing the JVM memory overhead between the operators running on that TaskManager (do not forget to increase the slots in parallel with this change to start using the extra resources :) )
The possible downside is that adding more slots to the TaskManagers could cause operators to compete for CPU, and the memory is still reserved for the "idle" tasks. [3]
Maybe this Flink architecture could be useful too [4]
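For reference, a sketch of how the two settings above would appear in flink-conf.yaml; the values here are purely illustrative and should be tuned to your workload:

taskmanager.numberOfTaskSlots: 8
taskmanager.memory.process.size: 8192m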
I hope this helps,
Peter

What TimestampsAndWatermarksTransformation class does in assignTimestampsAndWatermarks()

In the following code
public SingleOutputStreamOperator<T> assignTimestampsAndWatermarks(
        WatermarkStrategy<T> watermarkStrategy) {

    final WatermarkStrategy<T> cleanedStrategy = clean(watermarkStrategy);
    // match parallelism to input, to have a 1:1 source -> timestamps/watermarks relationship
    // and chain
    final int inputParallelism = getTransformation().getParallelism();
    final TimestampsAndWatermarksTransformation<T> transformation =
            new TimestampsAndWatermarksTransformation<>(
                    "Timestamps/Watermarks",
                    inputParallelism,
                    getTransformation(),
                    cleanedStrategy);
    getExecutionEnvironment().addOperator(transformation);
    return new SingleOutputStreamOperator<>(getExecutionEnvironment(), transformation);
}
assignTimestampsAndWatermarks() receives the main stream and assigns timestamps and watermarks based on the strategy specified in the parameters; at the end, it returns a SingleOutputStreamOperator, which is the updated stream with timestamps and watermarks generated.
My question is: what does TimestampsAndWatermarksTransformation do here (internally), and what is the effect of the line getExecutionEnvironment().addOperator(transformation); as well?
When you call assignTimestampsAndWatermarks on a stream, this code adds an operator to the job graph to do the timestamp extraction and watermark generation. This is wiring things up so that the specified watermarking will actually get done.
Internally there are two types of Transformation: (1) physical transformations, such as map or assignTimestampsAndWatermarks, which alter the stream records, and (2) logical transformations, such as union, that only affect the topology.
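For example, when you write something like the following (a sketch; MyEvent and its getEventTime() accessor are assumed placeholders), the strategy you pass in is wrapped in exactly that TimestampsAndWatermarksTransformation, and the operator registered by addOperator is what actually extracts the timestamps and emits the watermarks at runtime:

DataStream<MyEvent> withTimestamps = stream.assignTimestampsAndWatermarks(
        WatermarkStrategy.<MyEvent>forBoundedOutOfOrderness(Duration.ofSeconds(2))
                .withTimestampAssigner((event, recordTimestamp) -> event.getEventTime()));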

how to buffer a batch of data in flink

I want to buffer a datastream in Flink. My initial idea is to cache 100 records in a list or tuple and then use insert into values (???) to insert the data into ClickHouse in bulk. Do you have better ways to do this?
The first solution that you posted works, but it is flaky. It can lead to starvation due to its simplistic logic. For instance, say you use a counter of 100 to create a batch. It is possible that your stream never receives 100 events, or that it takes hours to receive the 100th event. Then your basic, working solution can leave events stuck in the batch, because it is a count window. In other words, your batches can span 30 seconds under high throughput, or 1 hour when your throughput is very low.
DataStream<User> stream = ...;
DataStream<Tuple2<User, Long>> stream1 = stream
.countWindowAll(100)
.process(new MyProcessWindowFunction());
In general, it depends on your use case. However, I would use a time window to make sure that my job always flushes the batch, even when there are few or no events in the window.
DataStream<Tuple2<User, Long>> stream1 = stream
        .windowAll(TumblingProcessingTimeWindows.of(Time.seconds(30)))
        .process(new MyProcessWindowFunction());
Thanks for all the answers. I use a window function to solve this problem.
SingleOutputStreamOperator<ArrayList<User>> stream2 =
stream1.countWindowAll(batchSize).process(new MyProcessWindowFunction());
Then I override the process function, in which a batch of data (up to the batch size) is buffered in an ArrayList.
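A minimal sketch of what such a process function could look like (not the poster's exact code; User is the assumed element type from the earlier snippets):

public static class MyProcessWindowFunction
        extends ProcessAllWindowFunction<User, ArrayList<User>, GlobalWindow> {

    @Override
    public void process(Context context, Iterable<User> elements, Collector<ArrayList<User>> out) {
        // buffer the whole count window into one list and emit it as a single batch
        ArrayList<User> batch = new ArrayList<>();
        for (User user : elements) {
            batch.add(user);
        }
        out.collect(batch);
    }
}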
If you want to import data into the database in batches, you can use a window (countWindow or timeWindow) to aggregate the data.

Flink: understanding the dataflow of my program

I've developed a Flink program that reads tweets from Twitter and pushes them to Kafka. It then reads the tweets back from Kafka and processes them.
The "Tweets processing" transformation extracts hashtags and users from the text of the tweet and emits them to the default output, and every pair of them to a side output.
The attached image is taken from the Flink Web UI. I don't understand why the Kafka source and the Tweets processing operator are merged into a single task, and more importantly I want the Tweets sink to receive all the raw tweets from the Kafka source, not the output of the Tweets processing operator.
Is the program correct?
[Dataflow diagram]
This is the relevant part of the code:
FlinkKafkaConsumer010<String> myConsumer = new FlinkKafkaConsumer010<String>(Constants.KAFKA_TWEETS_TOPIC, new SimpleStringSchema(), properties);
myConsumer.setStartFromLatest();
DataStream<String> tweetsStream = env
.addSource(myConsumer)
.name("Kafka tweets consumer");
SingleOutputStreamOperator<List<String>> tweetsAggregator = tweetsStream
.timeWindowAll(Time.seconds(7))
.aggregate(new StringAggregatorFunction())
.name("Tweets aggregation");
DataStreamSink tweetsSink = tweetsAggregator.addSink(new TweetsSink())
.name("Tweets sink")
.setParallelism(1);
SingleOutputStreamOperator<String> termsStream = tweetsStream
// extracting terms from tweets
.process(new TweetParse())
.name("Tweets processing");
DataStream<Tuple2<String, Integer>> counts = termsStream
.map(new ToTuple())
// Counting terms
.keyBy(0)
.timeWindow(Time.seconds(13))
.sum(1)
.name("Terms processing");
DataStream<Tuple3<String, String, Integer>> edgesStream = termsStream.getSideOutput(TweetParse.outputTag)
.map(new ToTuple3())
// Counting terms pairs
.keyBy(0, 1)
.timeWindow(Time.seconds(19))
.sum(2)
.name("Edges processing");
You are creating two different dataflows from tweetsStream: the first is tweetsAggregator and the second is termsStream. Then you create two more dataflows from termsStream: counts and edgesStream. Sink operators have no output, so they cannot feed data to another operator and must be the last operators in a chain. You have to start with a data source operator (addSource(myConsumer)), chain as many transformations as you want (timeWindowAll, aggregate, map, keyBy, ...), and then call a sink operator. You can call more than one sink if you want, but remember that sinks don't generate data streams for other operators; they are consumers.
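As an illustration of fanning one source out to more than one consumer, here is a small sketch; RawTweetsSink is a hypothetical SinkFunction<String> (the TweetsSink from the question consumes aggregated List<String> batches, so it cannot be reused directly on the raw stream):

// hypothetical sink attached directly to the raw Kafka stream, so it receives
// the unprocessed tweets in parallel with the existing aggregation branch
tweetsStream
        .addSink(new RawTweetsSink())
        .name("Raw tweets sink");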

merging datastreams of two different types in Flink or any other system

I want to use Flink for a remote patient monitoring scenario which involves various sensors such as gyroscope, accelerometer, ECG stream, HR stream, RR stream, etc. In this scenario the streams cannot be expected to have the same data type or input rate, but I still want to detect arrhythmia or other medical conditions, which requires doing CEP across these multiple sensors.
What I know is that, if I want to perform complex event processing over these sensors, there are 2 options before the CEP step:
Join the different streams
Merge the different streams
Earlier I was performing a join based on the timestamps of the sensors, but it does not join all the events, since different streams can have different rates and different timestamps in microseconds, so it is rare for the timestamps to be exactly equal.
So I would like to go with option #2, i.e. performing a merge before doing CEP. From the Flink documentation I found that I can merge two streams, but they should have the same data type. I tried to do the same, but I was unsuccessful and got the following error:
Exception in thread "main" java.lang.IllegalArgumentException: Cannot union streams of different types: GenericType<org.carleton.cep.monitoring.latest.Events.RRIntervalStreamEvent> and GenericType<org.carleton.cep.monitoring.latest.Events.qrsIntervalStreamEvent>
at org.apache.flink.streaming.api.datastream.DataStream.union(DataStream.java:217)
Now let's see how I tried to perform the merge. Basically I had two stream event classes; their attributes are as follows:
RRIntervalStreamEvent Stream
public Integer Sensor_id;
public Long time;
public Integer RRInterval;
qrsIntervalStreamEvent Stream
public Integer Sensor_id;
public Long time;
public Integer qrsInterval;
Both of these streams have generator classes which send events of these types at the specified rate. Below is the code with which I tried to merge them:
// getting qrs interval stream
DataStream<qrsIntervalStreamEvent> qrs_stream_raw = environment
        .addSource(new Qrs_interval_Gen(input_rate_qrs_S, Total_Number_Of_Events_in_qrs))
        .name("qrs stream");
// getting RR interval stream
DataStream<RRIntervalStreamEvent> rr_stream_raw = environment
        .addSource(new RR_interval_Gen(input_rate_rr_S, Total_Number_Of_Events_in_RR))
        .name("RR stream");
// merging both streams
DataStream<Tuple3<Integer, Long, Integer>> mergedStream;
mergedStream = rr_stream_raw.union(new DataStream[]{qrs_stream_raw});
I had to use new DataStream[], as just using qrs_stream_raw resulted in the error shown below.
Can someone please give me an idea about:
how should I merge these two streams?
how should I merge more than two streams?
is there some engine which can merge more than two streams having different structures? If yes, which engine should I use?
As pointed out by Alex, we can use the same data type for both streams and join them in Flink; another option is to use Siddhi or the Flink-Siddhi extension. But I want to do everything in Flink only.
So here are a couple of changes I made in my program to make it work.
Step #1: made both of my generator classes return a common type
public class RR_interval_Gen extends RichParallelSourceFunction<Tuple3<Integer,Long, Integer>>
Step #2: made both stream generators produce Tuple types and then merged the 2 streams.
// getting qrs interval stream
DataStream<Tuple3<Integer, Long, Integer>> qrs_stream_raw = environment
        .addSource(new Qrs_interval_Gen(input_rate_qrs_S, Total_Number_Of_Events_in_qrs))
        .name("qrs stream");
// getting RR interval stream
DataStream<Tuple3<Integer, Long, Integer>> rr_stream_raw = environment
        .addSource(new RR_interval_Gen(input_rate_rr_S, Total_Number_Of_Events_in_RR))
        .name("RR stream");
// merging both streams
DataStream<Tuple3<Integer, Long, Integer>> mergedStream = rr_stream_raw.union(qrs_stream_raw);
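Regarding merging more than two streams: DataStream.union accepts several streams of the same type, so any number of them can be merged in one call. A sketch, assuming a hypothetical third stream hr_stream_raw of the same Tuple3 type:

// union is variadic, so more than two same-typed streams can be merged at once
DataStream<Tuple3<Integer, Long, Integer>> mergedAll =
        rr_stream_raw.union(qrs_stream_raw, hr_stream_raw);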
