How to cache the local variable at process level in Flink streaming? - apache-flink

Inside Flink task instance I need to access remote web service to get some data when the event coming ,however I don't want to access remote web service every time when event coming, so I need to cache the data in local memory and can be accessed by all task of the process , how to do it ? storing the data in the static private variable at the class level ?
Such as the following example ,if set the local variable localCache at class Splitter, it cached at operator level instead of process level .
public class WindowWordCount {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Tuple2<String, Integer>> dataStream = env
.socketTextStream("localhost", 9999)
.flatMap(new Splitter())
.keyBy(0)
.timeWindow(Time.seconds(5))
.sum(1);
dataStream.print();
env.execute("Window WordCount");
}
public static class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
***private object localCache ;***
#Override
public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String word: sentence.split(" ")) {
out.collect(new Tuple2<String, Integer>(word, 1));
}
}
}
}

Exactly like you said. You'd use a static variable in a RichFlatMapFunction and initialize it in open. open will be called on each TaskManager before feeding in any record. Note that there is an instance of Splitter being created for each different slot, so in most cases there are several Splitter instances on one TaskManager. Thus, you need to guard against double creation.
public static class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
private transient Object localCache;
#Override
public void open(Configuration parameters) throws Exception {
if (localCache == null)
localCache = ... ;
}
#Override
public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
for (String word: sentence.split(" ")) {
out.collect(new Tuple2<String, Integer>(word, 1));
}
}
}

A scalable approach might use a Source operator to actually perform the call to the web service and then write the result to a stream. You can then access that stream as a broadcast stream to your operator resulting in the one object (web call result) emitted to the broadcast stream being sent to each instance of the receiving operator. This will share the result of that single web call across all machines and JVM's in your cluster. You can also persist broadcast state and share it with new instances of your operator as the cluster scales up.

Related

why data is not processed in RichFlatMapFunction

In order to improve the performance of data process, we store events to a map and do not process them untill event count reaches 100.
in the meantime, start a timer in open method, so data is processed every 60 seconds
this works when flink version is 1.11.3,
after upgrading flink version to 1.13.0
I found sometimes events were consumed from Kafka continuously, but were not processed in RichFlatMapFunction, it means data was missing.
after restarting service, it works well, but several hours later the same thing happened again.
any known issue for this flink version? any suggestions are appreciated.
public class MyJob {
public static void main(String[] args) throws Exception {
...
DataStream<String> rawEventSource = env.addSource(flinkKafkaConsumer);
...
}
public class MyMapFunction extends RichFlatMapFunction<String, String> implements Serializable {
#Override
public void open(Configuration parameters) {
...
long periodTimeout = 60;
pool.scheduleAtFixedRate(() -> {
// processing data
}, periodTimeout, periodTimeout, TimeUnit.SECONDS);
}
#Override
public void flatMap(String message, Collector<String> out) {
// store event to map
// count event,
// when count = 100, start data processing
}
}
You should avoid doing things with user threads and timers in Flink functions. The supported mechanism for this is to use a KeyedProcessFunction with processing time timers.

Use Cases of Flink CheckpointedFunction

While going through the Flink official documentation, I came across CheckpointedFunction.
Wondering why and when would you use this function. I am currently working on a stateful Flink job that heavily relies on ProcessFunction to save state in RocksDB. Just wondering if CheckpointedFunction is better than the ProcessFunction.
CheckpointedFunction is for cases where you need to work with state that should be managed by Flink and included in checkpoints, but where you aren't working with a KeyedStream and so you cannot use keyed state like you would in a KeyedProcessFunction.
The most common use cases of CheckpointedFunction are in sources and sinks.
In addition to the answer of #David I have another use case in which I don't use CheckpointedFunction with the source or sink. I do use it in a ProcessFunction where I want to count (programmatically) how many times my job has restarted. I use MyProcessFunction and CheckpointedFunction and I update ListState<Long> restarts when the job restarts. I use this state on the integration tests to ensure that the job was restarted upon a failure. I based my example on the Flink checkpoint example for Sinks.
public class MyProcessFunction<V> extends ProcessFunction<V, V> implements CheckpointedFunction {
...
private transient ListState<Long> restarts;
#Override
public void snapshotState(FunctionSnapshotContext context) throws Exception { ... }
#Override
public void initializeState(FunctionInitializationContext context) throws Exception {
restarts = context.getOperatorStateStore().getListState(new ListStateDescriptor<Long>("restarts", Long.class));
if (context.isRestored()) {
List<Long> restoreList = Lists.newArrayList(restarts.get());
if (restoreList == null || restoreList.isEmpty()) {
restarts.add(1L);
System.out.println("restarts: 1");
} else {
Long max = Collections.max(restoreList);
System.out.println("restarts: " + max);
restarts.add(max + 1);
}
} else {
System.out.println("restarts: never restored");
}
}
#Override
public void open(Configuration parameters) throws Exception { ... }
#Override
public void processElement(V value, Context ctx, Collector<V> out) throws Exception { ... }
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<V> out) throws Exception { ... }
}

Filtering unique events in apache flink

I am defining certain variables in one java class and i am accessing it with a different class so as to filter the stream for unique elements. Please refer code to understand the issue better.
The problem i am facing is this Filter function doesn't work well and fails to filter unique events. I doubt the variable is shared among different threads and it is the cause!? Please suggest another method if this is not the correct way to do it. Thanks in advance.
**ClassWithVariables.java**
public static HashMap<String, ArrayList<String>> uniqueMap = new HashMap<>();
**FilterClass.java**
public boolean filter(String val) throws Exception {
if(ClassWithVariables.uniqueMap.containsKey(key)) {
Arraylist<String> al = uniqueMap.get(key);
if(al.contains(val) {
return false;
} else {
//Update the hashmap list(uniqueMap)
return true;
}
} else {
//Add to hashmap list(uniqueMap)
return true;
}
}
The correct way to de-duplicate a stream involves partitioning the stream by the key, so that all elements containing the same key will be processed by the same worker, and using flink's managed, keyed state mechanism so that the state is fault-tolerant and re-scalable. Here's a sample implementation:
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.addSource(new EventSource())
.keyBy(e -> e.key)
.flatMap(new Deduplicate())
.print();
env.execute();
}
public static class Deduplicate extends RichFlatMapFunction<Event, Event> {
ValueState<Boolean> seen;
#Override
public void open(Configuration conf) {
ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("seen", Types.BOOLEAN);
seen = getRuntimeContext().getState(desc);
}
#Override
public void flatMap(Event event, Collector<Event> out) throws Exception {
if (seen.value() == null) {
out.collect(event);
seen.update(true);
}
}
}
This could also be implemented as a RichFilterFunction, btw. But note that if you have an unbounded key space, the state being used will grow indefinitely until you run out of heap, or space on the disk, depending on which of Flink's state backends you choose. If this is an issue, you might want to set up a state retention policy via State Time-to-Live.
Note also that sharing state between different parts of a Flink pipeline isn't possible. You need to turn things inside-out compared to what might seem normal, and bring the event stream to the state, rather than fetching it.

Running apache flink locally on multicore processor

I am running flink from within eclipse where necessary jars have been fetched by Maven. My machine has a processor with eight cores and the streaming application I have to write reads lines from its input and calculates some statistics.
When I run the program on my machine, I expected flink to use all the cores of the CPU as well-threaded code. However, when I watch the cores, I see that only one core is being used. I tried many things and left in the following code my last try, i.e. setting the parallelism of the environment. I also tried to set it for the stream alone and so on.
public class SemSeMi {
public static void main(String[] args) throws Exception {
System.out.println("Starting Main!");
System.out.println(org.apache.flink.core.fs.local.LocalFileSystem
.getLocalFileSystem().getWorkingDirectory());
StreamExecutionEnvironment env = StreamExecutionEnvironment
.getExecutionEnvironment();
env.setParallelism(8);
env.socketTextStream("localhost", 9999).flatMap(new SplitterX());
env.execute("Something");
}
public static class SplitterX implements
FlatMapFunction<String, Tuple2<String, Integer>> {
#Override
public void flatMap(String sentence,
Collector<Tuple2<String, Integer>> out) throws Exception {
// Do Nothing!
}
}
}
I fed the programm with data using netcat:
nc -lk 9999 < fileName
The question is how to make the program scale locally and use all available cores?
You don't have to specify the degree of parallelism explicitly. Jobs which are run with the default setting will set the parallelism automatically to the number of available cores.
In your case, the source will be run with parallelism of 1 since reading from a socket cannot be distributed. However, for the flatMap operation the system will instantiate 8 instances. If you turn on logging, then you will also see it. Now the input data is distributed to the flatMap tasks in a round-robin fashion. Each of the flatMap tasks is executed by an individual thread.
I would suspect that the reason why you only see load on a single core is because the SplitterX does not do any work. Try the following code which counts the number of characters in each String and then prints the result to the console:
public static void main(String[] args) throws Exception {
System.out.println("Starting Main!");
System.out.println(org.apache.flink.core.fs.local.LocalFileSystem
.getLocalFileSystem().getWorkingDirectory());
StreamExecutionEnvironment env = StreamExecutionEnvironment
.getExecutionEnvironment();
env.socketTextStream("localhost", 9999).flatMap(new SplitterX()).print();
env.execute("Something");
}
public static class SplitterX implements
FlatMapFunction<String, Tuple2<String, Integer>> {
#Override
public void flatMap(String sentence,
Collector<Tuple2<String, Integer>> out) throws Exception {
out.collect(Tuple2.of(sentence, sentence.length()));
}
}
The numbers at the start of each line tell you which task printed the result.

Event time window on kafka source streaming

There is a topic in Kafka server. In the program, we read this topic as a stream and assign event timestamp. Then do window operation on this stream. But the program doesn't work. After debug, it seems that processWatermark method of WindowOperator is not executed. Here is my code.
DataStream<Tuple2<String, Long>> advertisement = env
.addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
.map(new MapFunction<String, Tuple2<String, Long>>() {
private static final long serialVersionUID = -6564495005753073342L;
#Override
public Tuple2<String, Long> map(String value) throws Exception {
String[] splits = value.split(" ");
return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
}
}).assignTimestamps(timestampExtractor);
advertisement
.keyBy(keySelector)
.window(TumblingTimeWindows.of(Time.of(10, TimeUnit.SECONDS)))
.apply(new WindowFunction<Tuple2<String,Long>, Integer, String, TimeWindow>() {
private static final long serialVersionUID = 5151607280638477891L;
#Override
public void apply(String s, TimeWindow window, Iterable<Tuple2<String, Long>> values, Collector<Integer> out) throws Exception {
out.collect(Iterables.size(values));
}
}).print();
Why this happened? if I add "keyBy(keySelector)" before "assignTimestamps(timestampExtractor)" then the program works. Anyone could help to explain the reason?
You are affected by a known bug in Flink: FLINK-3121:Watermark forwarding does not work for sources not producing any data.
The problem is that there are more FlinkKafkaConsumer's running (most likely the number of CPU cores (say 4)) then you have partitions (1). Only one of the Kafka consumers is emitting watermarks, the other consumers are idling.
The window operator is not aware of that, waiting for watermarks to arrive from all consumers. That's why the windows never trigger.

Resources