Apache Flink flatMap with millions of outputs

Whenever I receive a message, I want to read from a database, possibly returning millions of rows, which I then want to pass on down the stream. Is this considered good practice in Flink?
public static class StatsReader implements FlatMapFunction<Msg, Json> {
    Transactor txor = ...;

    @Override
    public void flatMap(Msg msg, Collector<Json> out) {
        // Possibly a lazy and async stream
        java.util.stream.Stream<Json> results = txor.exec(Stats.read(msg));
        results.forEach(out::collect);
    }
}
Edit:
Background: I would like to dynamically run a report. The DB basically acts as a huge window, and the report is based on that window plus live data. The report is highly customizable, therefore it is hard to preprocess results or define pipelines a priori.
I use vanilla Java today, and the pipeline is roughly like this:
ReportDefinition -> ( elasticsearch query + realtime stream ) -> ( ReportProcessingPipeline ) -> ( Websocket push )

In principle this should be possible. However, I'd recommend using an AsyncFunction instead of a FlatMapFunction.
Please note that such a setup might require tuning the checkpointing parameters, such as the checkpoint interval: checkpoint barriers cannot overtake a running flatMap invocation, so a single call that emits millions of records can delay checkpoints considerably.
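A minimal sketch of the AsyncFunction variant, reusing the asker's Msg, Json, Transactor and Stats types; the execAsync method, timeout, and capacity values are illustrative assumptions, not part of the original code:

import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public static class AsyncStatsReader extends RichAsyncFunction<Msg, Json> {
    transient Transactor txor;

    @Override
    public void open(Configuration parameters) {
        txor = ...; // initialize as in the original snippet
    }

    @Override
    public void asyncInvoke(Msg msg, ResultFuture<Json> resultFuture) {
        // Run the query off the operator thread and complete the future once
        // the rows are available (execAsync is an assumed async variant).
        txor.execAsync(Stats.read(msg))
            .thenAccept(rows -> resultFuture.complete(rows.collect(Collectors.toList())));
    }
}

// Wiring it up (timeout and capacity are illustrative):
DataStream<Json> stats = AsyncDataStream.unorderedWait(
        messages, new AsyncStatsReader(), 60, TimeUnit.SECONDS, 100);

Note that ResultFuture.complete takes the whole result collection at once, so for truly unbounded result sets the query may still need to be paged.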

Related

Flink 1.12.x DataSet --> Flink 1.14.x DataStream

I am trying to migrate from the Flink 1.12.x DataSet API to the Flink 1.14.x DataStream API. mapPartition is not available in the Flink DataStream API.
Our code using the Flink 1.12.x DataSet API:
dataset
    .<few operations>
    .mapPartition(new SomeMapPartitionFn())
    .<few more operations>

public static class SomeMapPartitionFn extends RichMapPartitionFunction<InputModel, OutputModel> {
    @Override
    public void mapPartition(Iterable<InputModel> records, Collector<OutputModel> out) throws Exception {
        for (InputModel record : records) {
            /*
             do some operation
            */
            if (/* some condition based on processing *MULTIPLE* records */) {
                out.collect(...); // Conditional collect ---> (1)
            }
        }
        // At the end of the data, collect
        out.collect(...); // Collect processed data ---> (2)
    }
}
(1) - Collector.collect invoked based on some condition after processing a few records
(2) - Collector.collect invoked at the end of the data
Initially we thought of using flatMap instead of mapPartition, but the collector is not available in the close function.
https://issues.apache.org/jira/browse/FLINK-14709 - Only available in case of chained drivers
How can this be implemented in Flink 1.14.x DataStream? Please advise.
Note: Our application works with only a finite set of data (batch mode).
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement a similar function, you need to define a window over the stream. Windows discretize streams, which is somewhat similar to mini-batches, but windows offer far more flexibility.
Solution provided by Zhipeng:
One solution could be using a streamOperator that implements the BoundedOneInput interface.
An example can be found here [1].
[1] https://github.com/apache/flink-ml/blob/56b441d85c3356c0ffedeef9c27969aee5b3ecfc/flink-ml-core/src/main/java/org/apache/flink/ml/common/datastream/DataStreamUtils.java#L75
Flink user mailing list thread: https://lists.apache.org/thread/ktck2y96d0q1odnjjkfks0dmrwh7kb3z
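A minimal sketch of that idea, assuming batch execution mode and reusing the asker's InputModel/OutputModel types; endInput() is invoked once the bounded input is exhausted, which is where the final collect (2) from the question can go:

import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.BoundedOneInput;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

public class MapPartitionLikeOperator
        extends AbstractStreamOperator<OutputModel>
        implements OneInputStreamOperator<InputModel, OutputModel>, BoundedOneInput {

    @Override
    public void processElement(StreamRecord<InputModel> element) {
        // per-record processing; conditional collect (1) would go here:
        // output.collect(new StreamRecord<>(...));
    }

    @Override
    public void endInput() {
        // called once after the last record; final collect (2) goes here:
        // output.collect(new StreamRecord<>(...));
    }
}

// Attached via DataStream#transform (operator name and type info are illustrative):
// stream.transform("mapPartitionLike", TypeInformation.of(OutputModel.class),
//         new MapPartitionLikeOperator());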

Is there a way to asynchronously modify state in Flink KeyedProcessFunction?

I have two sources, Kafka and HBase. Kafka holds only the last 24 hours of the data stream; HBase holds the data aggregated from the beginning. My goal is to merge the two during stream processing whenever a stream input (Kafka) for some session occurs. I tried a couple of methods, but none was satisfactory because of performance.
After some searching, I came up with the idea below: caching via the state of a keyed process function.
1. Key the input to the keyed process function by session information.
2. Check the keyed process function's state.
3. If the state is not initialized: query HBase and initialize the state, then go to 5.
4. Else (the state is initialized): go to 5.
5. Do the business logic using the state.
While coding this idea, I ran into a performance problem: querying HBase synchronously is slow, so I tried an async version, but it is complicated.
I have faced two issues. One is thread safety between processElement and the HBase async worker thread; the other is that the Context of the process function expires when processElement returns (not when the HBase async worker finishes).
val sourceStream = env.addSource(kafkaConsumer.setStartFromGroupOffsets())
sourceStream.keyBy(new KeySelector[InputMessage, KeyInfo]() {
    override def getKey(v: InputMessage): KeyInfo = v.toKeyInfo()
  })
  .process(new KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]() {
    var state: MapState[String, (String, Long)] = _
    var table: AsyncTable[AdvancedScanResultConsumer] = _ // declaration added (HBase 2.x async client)

    override def open(parameters: Configuration): Unit = {
      val conn = ConnectionFactory.createAsyncConnection(hbaseConfInstance).join
      table = conn.getTable(TableName.valueOf("tablename"))
      state = getRuntimeContext.getMapState(stateDescripter)
    }

    def request(action: Consumer[CacheResult]): Unit = {
      if (!state.isEmpty) {
        action.accept(new CacheResult(state))
      } else { // state is empty, so load from HBase
        table.get(new Get(key)).thenAccept { (hbaseResult: Result) =>
          // this is called by a worker thread
          hbaseResult.toState(state) // convert the HBase result into state
          action.accept(new CacheResult(state))
        }
      }
    }

    override def processElement(value: InputMessage,
                                ctx: KeyedProcessFunction[KeyInfo, InputMessage, OUTPUTTYPE]#Context,
                                out: Collector[OUTPUTTYPE]): Unit = {
      val businessAction = new Consumer[CacheResult]() {
        override def accept(t: CacheResult): Unit = {
          // .. do business logic here.
          out.collect( /* final result */ )
        }
      }
      request(businessAction)
    }
  }).addSink()
Is there any way to make a KeyedProcessFunction work with async calls to a third party?
Or any other idea for combining Kafka and HBase in Flink?
I think your general assumptions are wrong. I faced a similar issue, regarding a quite different problem, and haven't resolved it yet. Keeping state in the program contradicts async functions, and Flink by design prevents using state in async code (which is a good thing). If you want to make your function async, you must get rid of the state.
To achieve your goal, you probably need to redesign your solution. I don't know all the details of your problem, but you could think of splitting your process into more pipelines. E.g. you can create a pipeline that consumes data from HBase and writes it into a Kafka topic; another pipeline can then consume the data sent by the pipeline gathering data from HBase. In such an approach you don't have to care about the state, because each pipeline is doing its own thing: just consuming data and passing it further.
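A rough Java sketch of that two-pipeline idea; Event, Aggregate, Output, sessionKey, and businessLogic are hypothetical names, and the backfill stream is assumed to be fed by the separate HBase-to-Kafka pipeline described above:

import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

// `live` is the Kafka event stream, `backfill` the aggregates replayed from Kafka.
DataStream<Output> merged = live
    .connect(backfill)
    .keyBy(Event::sessionKey, Aggregate::sessionKey)
    .process(new KeyedCoProcessFunction<String, Event, Aggregate, Output>() {
        private transient ValueState<Aggregate> aggState;

        @Override
        public void open(Configuration parameters) {
            aggState = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("agg", Aggregate.class));
        }

        @Override
        public void processElement1(Event event, Context ctx, Collector<Output> out)
                throws Exception {
            // Live event: merge with whatever aggregate has arrived so far.
            out.collect(businessLogic(event, aggState.value())); // hypothetical helper
        }

        @Override
        public void processElement2(Aggregate agg, Context ctx, Collector<Output> out)
                throws Exception {
            // Backfill record: keep it in keyed state; no remote HBase call needed.
            aggState.update(agg);
        }
    });

This keeps all lookups in local keyed state, so no blocking or async call is needed on the hot path.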

Reading file that is being appended in Flink

We have a legacy application that writes results as records to some local files. We want to process these records in real time, so we plan to use Flink as the engine. I know that I can read text files using StreamingExecutionEnvironment#readFile. It seems we need something like PROCESS_CONTINUOUSLY there, but that flag causes the whole file to be reprocessed on each change, which is not what we want here.
Of course, I could write a custom source that keeps the number of records read per file in its state. But I suppose there might be some problem with such an approach around checkpointing; my reasoning is that if this were easy to implement reliably, it would already have been implemented in Flink.
Any tips or suggestions on how to approach this?
You can do this rather easily with a custom source, so long as you are content to be reading from a single file (per source instance). You will need to use operator state and implement checkpointing. The state handling and checkpointing will look something like this:
public class CheckpointedFileSource implements SourceFunction<Event>, ListCheckpointed<Long> {
    private long eventCnt = 0;
    private volatile boolean cancelled = false;

    @Override
    public void run(SourceContext<Event> sourceContext) throws Exception {
        final Object lock = sourceContext.getCheckpointLock();
        // skip over the eventCnt previously emitted events
        ...
        while (!cancelled) {
            // read the next event from the file, then:
            synchronized (lock) {
                eventCnt++;
                sourceContext.collectWithTimestamp(event, timestamp);
            }
        }
    }

    @Override
    public void cancel() {
        cancelled = true;
    }

    @Override
    public List<Long> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
        return Collections.singletonList(eventCnt);
    }

    @Override
    public void restoreState(List<Long> state) throws Exception {
        for (Long s : state) {
            this.eventCnt = s;
        }
    }
}
For a complete example see the checkpointed taxi ride data source used in the Flink training exercises. You’ll have to adapt it a bit, since it’s designed to read a static file, rather than one that is being appended to.
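For the tailing part, a minimal sketch of a read loop that follows a growing file, assuming line-delimited records; `path`, the parse step, and the poll interval are illustrative:

import java.io.BufferedReader;
import java.io.FileReader;

// Inside run(): follow the file, sleeping briefly whenever no new line is available yet.
try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
    while (!cancelled) {
        String line = reader.readLine();
        if (line == null) {
            Thread.sleep(100); // no new data yet; poll again
            continue;
        }
        Event event = parse(line); // hypothetical parser
        synchronized (lock) {
            eventCnt++;
            sourceContext.collectWithTimestamp(event, event.getTimestamp()); // assumed accessor
        }
    }
}

One caveat: readLine can return a partially written last line, so a robust version should only emit lines that were terminated by a newline.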

How to achieve exactly-once semantics in the Apache Kafka connector

I am using Flink version 1.8.0. My application reads data from Kafka -> transform -> publish to Kafka. To avoid any duplicates during restart, I want to use a Kafka producer with exactly-once semantics, described here:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-011-and-newer
My Kafka version is 1.1.
return new FlinkKafkaProducer<String>(topic, new KeyedSerializationSchema<String>() {
    public byte[] serializeKey(String element) {
        return element.getBytes();
    }

    public byte[] serializeValue(String element) {
        return element.getBytes();
    }

    public String getTargetTopic(String element) {
        return topic;
    }
}, prop, opt, FlinkKafkaProducer.Semantic.EXACTLY_ONCE, 1);
Checkpoint code:
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointTimeout(15 * 1000);
checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.enableCheckpointing(5000);
If I add exactly-once semantics to the Kafka producer, my Flink consumer is not reading any new data.
Can anyone please share sample code/an application with exactly-once semantics?
Please find the complete code here:
https://github.com/sris2/sample_flink_exactly_once
Thanks
Can anyone please share sample code/an application with exactly-once semantics?
An exactly-once example is hidden in an end-to-end test in Flink. Since it uses some convenience functions, it may be hard to follow without checking out the whole repo.
If I add exactly-once semantics to the Kafka producer, my Flink consumer
is not reading any new data.
[...]
Please find complete code here :
https://github.com/sris2/sample_flink_exactly_once
I checked out your code and found the issue (I had to fix the whole setup/code to actually get it running): the sink cannot configure its transactions correctly. As described in the Flink Kafka connector documentation, you need to either raise transaction.max.timeout.ms in your Kafka broker to 1 hour or lower your application's transaction.timeout.ms to 15 minutes:
prop.setProperty("transaction.timeout.ms", "900000");
The respective excerpt is:
Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow to set transaction timeouts for the producers larger than its value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode.
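Putting it together, a sketch of how the fix slots into the asker's producer setup; `schema` stands for the KeyedSerializationSchema from the question, and the broker address is illustrative:

import java.util.Properties;

Properties prop = new Properties();
prop.setProperty("bootstrap.servers", "localhost:9092"); // illustrative
// Stay at or below the broker's default transaction.max.timeout.ms of 15 minutes:
prop.setProperty("transaction.timeout.ms", "900000");

return new FlinkKafkaProducer<String>(
        topic, schema, prop, opt, FlinkKafkaProducer.Semantic.EXACTLY_ONCE, 1);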

Apache Flink DataStream API doesn't have a mapPartition transformation

Spark DStream has a mapPartition API, while the Flink DataStream API doesn't. Is there anyone who could help explain the reason? What I want to do is implement an API similar to Spark's reduceByKey on Flink.
Flink's stream processing model is quite different from Spark Streaming's, which is centered around mini-batches. In Spark Streaming, each mini-batch is executed like a regular batch program on a finite set of data, whereas Flink DataStream programs continuously process records.
In Flink's DataSet API, a MapPartitionFunction has two parameters. An iterator for the input and a collector for the result of the function. A MapPartitionFunction in a Flink DataStream program would never return from the first function call, because the iterator would iterate over an endless stream of records. However, Flink's internal stream processing model requires that user functions return in order to checkpoint function state. Therefore, the DataStream API does not offer a mapPartition transformation.
In order to implement functionality similar to Spark Streaming's reduceByKey, you need to define a keyed window over the stream. Windows discretize streams, which is somewhat similar to mini-batches, but windows offer far more flexibility. Since a window is of finite size, you can call reduce on the window.
This could look like:
yourStream.keyBy("myKey") // organize stream by key "myKey"
.timeWindow(Time.seconds(5)) // build 5 sec tumbling windows
.reduce(new YourReduceFunction); // apply a reduce function on each window
The DataStream documentation shows how to define various window types and explains all available functions.
Note: The DataStream API has been reworked recently. The example assumes the latest version (0.10-SNAPSHOT), which will be released as 0.10.0 in the next days.
Assuming your input stream is single-partition data (say String):
val new_number_of_partitions = 4

// The line below re-partitions your data; you could also broadcast it to all partitions
val step1stream = yourStream.rescale.setParallelism(new_number_of_partitions)

// flexibility for mapping
val step2stream = step1stream.map(new RichMapFunction[String, (String, Int)] {
  // var local_val_to_different_part: Type = null
  var myTaskId: Int = -1

  // open() is executed once per mapper instance (one mapper per partition)
  override def open(config: Configuration): Unit = {
    myTaskId = getRuntimeContext.getIndexOfThisSubtask
    // do whatever initialization you want to do. read from data sources..
  }

  override def map(value: String): (String, Int) = {
    (value, myTaskId)
  }
})

val step3stream = step2stream.keyBy(0).countWindow(new_number_of_partitions).sum(1).print
// Instead of sum(1), you can use .reduce((x, y) => (x._1, x._2 + y._2))
// .countWindow first waits for a certain number of records for a particular key
// and then applies the function
Flink streaming is pure streaming (not the batched one). Take a look at the Iterate API.
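For reference, a minimal sketch of the Iterate API mentioned above, along the lines of the example in the Flink documentation; the decrement/filter logic is just an illustrative feedback rule:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.IterativeStream;

// Values above zero are fed back into the loop; the rest leave it.
IterativeStream<Long> iteration = someLongStream.iterate();
DataStream<Long> minusOne = iteration.map(v -> v - 1);
DataStream<Long> stillGreaterThanZero = minusOne.filter(v -> v > 0);
iteration.closeWith(stillGreaterThanZero);                           // feedback edge
DataStream<Long> lessThanOrEqualZero = minusOne.filter(v -> v <= 0); // exits the loop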
