Apache Flink:Operator State Checkpoint - apache-flink

I want to use the Operator State API in No-Keyed Stream to save the sate of count in the example below. what should I do?
public static class MapFunction implements MapFunction<String, String>,CheckpointedFunction{
int count = 0;
#Override
public String map(String value) throws Exception {
// TODO Auto-generated method stub
String message;
message = value;
count++;
return message;
}
#Override
public void snapshotState(FunctionSnapshotContext context) throws Exception {
// TODO Auto-generated method stub
}
#Override
public void initializeState(FunctionInitializationContext context) throws Exception {
// TODO Auto-generated method stub
}
}
Thank you for your answer.

As Dawid noted, the docs are a good starting point. The easiest approach would be for you to implement the ListCheckpointed interface. When snapshotState() is called, you'd return a singleton list of your count (as an Integer). When restoreState() is called, you'd iterate over the list of Integer values, and sum them to set your count variable.

Related

ProcessFunction in Flink to output a result every 5 seconds

My input stream is Tuple2<String, String>, I want to group by the first field and sum the integer in the second field. This is my ProcessFunction:
public static class MyKeyedProcessFunction
extends KeyedProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, Integer>> {
private ValueState<Integer> state;
#Override
public void open(Configuration parameters) throws Exception {
state = getRuntimeContext().getState(new ValueStateDescriptor<>("sum", Integer.class));
}
#Override
public void processElement(
Tuple2<String, Integer> value,
Context ctx,
Collector<Tuple2<String, Integer>> out) throws Exception {
Integer sum = state.value();
if (sum == null) {
sum = 0;
}
sum += value.f1;
state.update(sum);
ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 5000);
}
#Override
public void onTimer(
long timestamp,
OnTimerContext ctx,
Collector<Tuple2<String, Integer>> out) throws Exception {
out.collect(Tuple2.of(ctx.getCurrentKey(), state.value()));
state.clear();
}
}
Now the onTimer is called for every element. I specified the input as:
aaa,50
aaa,40
aaa,10
I see the output like:
(aaa,100)
(aaa, null)
(aaa, null)
How can I get the output as (aaa,100)?
You registered a timer for every incoming event. Maybe you can also try to create a new ValueState of type boolean that indicates if an initial timer has been registered.
As soon as the first event comes in, you register a timer in the processElement method like:
if(timerRegistered.value()==false){
ctx.timerService().registerProcessingTimeTimer(ctx.timerService().currentProcessingTime() + 5000);
timerRegistered.update(true);
}
Then you will just go on and register new timers for the 5 sec interval in the onTimer method instead of the processElement.
I didn't test the code but it should give an idea.
Kind Regards
Dominik

Flink KeyedProcessFunction integrating anonymous methods

I am attempting to write a keyedProcessFunction, the code looks like this below:
DataStream<Tuple2<Long, Integer>> busyMachinesPerWindow = busyMachines
// group by timestamp (window end)
.keyBy(event -> event.getField(1))
.process(new KeyedProcessFunction<Tuple1<Long>, Tuple3<Long, Long, Long>, Tuple2<Long, Integer>>() {
private ValueState<Integer> state;
#Override
public void open(Configuration config) throws IOException {
// initialize the state descriptors here
state = getRuntimeContext().getState(new ValueStateDescriptor<>("machine-counts", Integer.class));
if (state.value() == null) {
state.update(0);
}
}
#Override
public void processElement(Tuple3<Long, Long, Long> inWindow, Context ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
if (state.value() != null) {
state.update(state.value() + 1);
} else {
state.update(1);
}
ctx.timerService().registerEventTimeTimer(inWindow.f1);
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<Tuple2<Long, Integer>> out) throws Exception {
int counter = state.value();
state.clear();
// we can now output the window and the machine count
out.collect(new Tuple2<>(((Tuple1<Long>) ctx.getCurrentKey()).f0, counter));
}
});
However this pops up an error saying cannot derive anonymous method. I don't see what the problem is with this code. Is there some type ambiguity that I am not doing right?
One problem with this code is that you are calling state.value() and state.update(0) in the open method. This is not allowed. These methods can only be used in processElement and in onTimer, because only then is there a specific event being processed whose key can be used to access/update the appropriate entry in the state backend.
An instance of a KeyedProcessFunction is multiplexed across all of the keys assigned to a given task slot. The open method is called just once, at a time when there is no specific key in the runtime context, so the state cannot be accessed or updated at this time.

Scheduled Task with Apache Flink

I have a flink job with parallelism 5 (for now !!). And one of the richFlatMap stream opens one file in the open(Configuration parameters) method. In the flatMapoperation there is no any open action, it just read the file to search something. (There is a utility class which has a method like utilityClass.searchText('abc')). Here is the boilerplate code:
public class MyFlatMap extends RichFlatMapFunction<...> {
private MyUtilityFile myFile;
#Override
public void open(Configuration parameters) throws Exception {
myFile.Open("fileLocation");
}
#Override
public void flatMap(...) throws Exception {
String text = myFile.searchText('abc');
if (text != null) // take an action
else // another action
}
}
This file is being updated by the python script every day at specific time. Therefore I should also open the newly created file (by python script) in the flatMap stream.
I just though that this can be done by ScheduledExecutorService with only one thread pool.
I can not open this file every flatMap calls because it is big.
Here is the boilerplate code I am trying to write:
public class MyFlatMap extends RichFlatMapFunction<...> implements Runnable {
private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
private MyUtilityFile myFile;
#Override
public void run() {
myFile.Open("fileLocation");
}
#Override
public void open(Configuration parameters) throws Exception {
scheduler.scheduleAtFixedRate(this, 1, 1, TimeUnit.HOURS);
myFile.Open("fileLocation");
}
#Override
public void flatMap(...) throws Exception {
String text = myFile.searchText('abc');
if (text != null) // take an action
else // another action
}
}
Is this boilerplate okey for Flink environment? If not, how can i open the file with a scheduled manner? (There is no option such as "after updating file send event with kafka and read event by flink")
Perhaps you can directly implement the ProcessingTimeCallback interface, which supports timer operations
public class MyFlatMap extends RichFlatMapFunction<...> implements ProcessingTimeCallback {
private MyUtilityFile myFile;
#Override
public void open(Configuration parameters) throws Exception {
scheduler.scheduleAtFixedRate(this, 1, 1, TimeUnit.HOURS);
final long now = getProcessingTimeService().getCurrentProcessingTime();
getProcessingTimeService().registerTimer(now + 3600000, this);
myFile.Open("fileLocation");
}
#Override
public void flatMap(...) throws Exception {
String text = myFile.searchText('abc');
if (text != null) // take an action
else // another action
}
#Override
public void onProcessingTime(long timestamp) throws Exception {
myFile.Open("fileLocation");
final long now = getProcessingTimeService().getCurrentProcessingTime();
getProcessingTimeService().registerTimer(now + 3600000, this);
}
}

Flink streaming example that generates its own data

Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However I would like to ask for a more ‘streaming’ example where we generate an input value every second. This would ideally be random, but even just the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to show this with generating data externally and writing to Kafka, or listening to a public source, however I am trying to solve it with minimal dependence (like starting with GenerateFlowFile in Nifi).
Here's an example. This was constructed as an example of how to make your sources and sinks pluggable. The idea being that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
* Example showing how to make sources and sinks pluggable in your application code so
* you can inject special test sources and test sinks in your tests.
*/
public class TestableStreamingJob {
private SourceFunction<Long> source;
private SinkFunction<Long> sink;
public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
this.source = source;
this.sink = sink;
}
public void execute() throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Long> LongStream =
env.addSource(source)
.returns(TypeInformation.of(Long.class));
LongStream
.map(new IncrementMapFunction())
.addSink(sink);
env.execute();
}
public static void main(String[] args) throws Exception {
TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
job.execute();
}
// While it's tempting for something this simple, avoid using anonymous classes or lambdas
// for any business logic you might want to unit test.
public class IncrementMapFunction implements MapFunction<Long, Long> {
#Override
public Long map(Long record) throws Exception {
return record + 1 ;
}
}
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {
private volatile boolean cancelled = false;
private Random random;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
random = new Random();
}
#Override
public void run(SourceContext<Long> ctx) throws Exception {
while (!cancelled) {
Long nextLong = random.nextLong();
synchronized (ctx.getCheckpointLock()) {
ctx.collect(nextLong);
}
}
}
#Override
public void cancel() {
cancelled = true;
}
}

Apache Flink - use values from a data stream to dynamically create a streaming data source

I'm trying to build a sample application using Apache Flink that does the following:
Reads a stream of stock symbols (e.g. 'CSCO', 'FB') from a Kafka queue.
For each symbol performs a real-time lookup of current prices and streams the values for downstream processing.
* Update to original post *
I moved the map function into a separate class and do not get the run-time error message "The implementation of the MapFunction is not serializable any more. The object probably contains or references non serializable fields".
The issue I'm facing now is that the Kafka topic "stockprices" I'm trying to write the prices to is not receiving them. I'm trying to trouble-shoot and will post any updates.
public class RetrieveStockPrices {
#SuppressWarnings("serial")
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "stocks");
DataStream<String> streamOfStockSymbols = streamExecEnv.addSource(new FlinkKafkaConsumer08<String>("stocksymbol", new SimpleStringSchema(), properties));
DataStream<String> stockPrice =
streamOfStockSymbols
//get unique keys
.keyBy(new KeySelector<String, String>() {
#Override
public String getKey(String trend) throws Exception {
return trend;
}
})
//collect events over a window
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
//return the last event from the window...all elements are the same "Symbol"
.apply(new WindowFunction<String, String, String, TimeWindow>() {
#Override
public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) throws Exception {
out.collect(input.iterator().next().toString());
}
})
.map(new StockSymbolToPriceMapFunction());
streamExecEnv.execute("Retrieve Stock Prices");
}
}
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
#Override
public String map(String stockSymbol) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
System.out.println("StockSymbolToPriceMapFunction: stockSymbol: " + stockSymbol);
DataStream<String> stockPrices = streamExecEnv.addSource(new LookupStockPrice(stockSymbol));
stockPrices.keyBy(new CustomKeySelector()).addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));
return "100000";
}
private static class CustomKeySelector implements KeySelector<String, String> {
#Override
public String getKey(String arg0) throws Exception {
return arg0.trim();
}
}
}
public class LookupStockPrice extends RichSourceFunction<String> {
public String stockSymbol = null;
public boolean isRunning = true;
public LookupStockPrice(String inSymbol) {
stockSymbol = inSymbol;
}
#Override
public void open(Configuration parameters) throws Exception {
isRunning = true;
}
#Override
public void cancel() {
isRunning = false;
}
#Override
public void run(SourceFunction.SourceContext<String> ctx)
throws Exception {
String stockPrice = "0";
while (isRunning) {
//TODO: query Google Finance API
stockPrice = Integer.toString((new Random()).nextInt(100)+1);
ctx.collect(stockPrice);
Thread.sleep(10000);
}
}
}
StreamExecutionEnvironment are not indented to be used inside of operators of a streaming application. Not intended means, this is not tested and encouraged. It might work and do something, but will most likely not behave well and probably kill your application.
The StockSymbolToPriceMapFunction in your program specifies for each incoming record a completely new and independent new streaming application. However, since you do not call streamExecEnv.execute() the programs are not started and the map method returns without doing anything.
If you would call streamExecEnv.execute(), the function would start a new local Flink cluster in the workers JVM and start the application on this local Flink cluster. The local Flink instance will take a lot of the heap space and after a few clusters have been started, the worker will probably die due to an OutOfMemoryError which is not what you want to happen.

Resources