Running Apache Flink locally on a multicore processor


I am running Flink from within Eclipse, with the necessary JARs fetched by Maven. My machine has an eight-core processor, and the streaming application I have to write reads lines from its input and calculates some statistics.
When I run the program on my machine, I expect Flink to use all CPU cores, as well-threaded code would. However, when I watch the cores, I see that only one is being used. I tried many things; the code below contains my last attempt, which sets the parallelism of the environment. I also tried setting it for the stream alone, and so on.
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.util.Collector;

public class SemSeMi {
    public static void main(String[] args) throws Exception {
        System.out.println("Starting Main!");
        System.out.println(org.apache.flink.core.fs.local.LocalFileSystem
                .getLocalFileSystem().getWorkingDirectory());
        StreamExecutionEnvironment env = StreamExecutionEnvironment
                .getExecutionEnvironment();
        env.setParallelism(8);
        env.socketTextStream("localhost", 9999).flatMap(new SplitterX());
        env.execute("Something");
    }

    public static class SplitterX implements
            FlatMapFunction<String, Tuple2<String, Integer>> {
        @Override
        public void flatMap(String sentence,
                Collector<Tuple2<String, Integer>> out) throws Exception {
            // Do Nothing!
        }
    }
}
I fed the program with data using netcat:
nc -lk 9999 < fileName
The question is how to make the program scale locally and use all available cores?

You don't have to specify the degree of parallelism explicitly. A job run locally with the default settings automatically sets its parallelism to the number of available cores.
In your case, the source runs with a parallelism of 1, since reading from a socket cannot be distributed. For the flatMap operation, however, the system instantiates 8 instances; if you turn on logging, you will see this. The input data is distributed to the flatMap tasks in a round-robin fashion, and each flatMap task is executed by its own thread.
I suspect you only see load on a single core because SplitterX does not do any work. Try the following code, which emits each String together with its number of characters and prints the result to the console:
public static void main(String[] args) throws Exception {
    System.out.println("Starting Main!");
    System.out.println(org.apache.flink.core.fs.local.LocalFileSystem
            .getLocalFileSystem().getWorkingDirectory());
    StreamExecutionEnvironment env = StreamExecutionEnvironment
            .getExecutionEnvironment();
    env.socketTextStream("localhost", 9999).flatMap(new SplitterX()).print();
    env.execute("Something");
}

public static class SplitterX implements
        FlatMapFunction<String, Tuple2<String, Integer>> {
    @Override
    public void flatMap(String sentence,
            Collector<Tuple2<String, Integer>> out) throws Exception {
        out.collect(Tuple2.of(sentence, sentence.length()));
    }
}
The numbers at the start of each line tell you which task printed the result.
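To make the round-robin distribution mentioned above more concrete, here is a minimal plain-Java sketch. It does not use Flink at all; the class name RoundRobinSketch and the method distribute are made up for illustration, and it assumes the simplest possible scheme in which record i goes to instance i modulo the parallelism.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of round-robin distribution; it does not use Flink.
public class RoundRobinSketch {

    /** Assigns record i to instance (i % parallelism). */
    static List<List<String>> distribute(List<String> records, int parallelism) {
        List<List<String>> perInstance = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perInstance.add(new ArrayList<>());
        }
        for (int i = 0; i < records.size(); i++) {
            perInstance.get(i % parallelism).add(records.get(i));
        }
        return perInstance;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < 20; i++) {
            lines.add("line-" + i);
        }
        // One source feeding 8 flatMap instances, as in the question.
        List<List<String>> shares = distribute(lines, 8);
        for (int i = 0; i < shares.size(); i++) {
            System.out.println("instance " + i + " received " + shares.get(i).size() + " lines");
        }
    }
}
```

Every instance receives roughly the same number of lines, so a flatMap that does real work per line should spread its load over all instances.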

Related

Why is data not processed in RichFlatMapFunction?

To improve processing performance, we store events in a map and do not process them until the event count reaches 100.
In addition, we start a timer in the open method so that data is processed every 60 seconds.
This worked with Flink 1.11.3.
After upgrading to Flink 1.13.0, I found that events were sometimes consumed from Kafka continuously but not processed in the RichFlatMapFunction, which means data went missing.
After restarting the service it works well, but several hours later the same thing happens again.
Is there a known issue with this Flink version? Any suggestions are appreciated.
public class MyJob {
    public static void main(String[] args) throws Exception {
        ...
        DataStream<String> rawEventSource = env.addSource(flinkKafkaConsumer);
        ...
    }
}

public class MyMapFunction extends RichFlatMapFunction<String, String> implements Serializable {
    @Override
    public void open(Configuration parameters) {
        ...
        long periodTimeout = 60;
        // user-managed timer thread, started outside of Flink's control
        pool.scheduleAtFixedRate(() -> {
            // processing data
        }, periodTimeout, periodTimeout, TimeUnit.SECONDS);
    }

    @Override
    public void flatMap(String message, Collector<String> out) {
        // store event to map
        // count event,
        // when count = 100, start data processing
    }
}

You should avoid doing things with user threads and timers in Flink functions. The supported mechanism for this is to use a KeyedProcessFunction with processing time timers.
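To illustrate the shape of that logic without a user thread, here is a plain-Java model of "flush after N events or after a timeout". It is not Flink API: the class CountOrTimeoutBuffer and its methods onElement and onTimer are hypothetical stand-ins for what processElement and onTimer would do. In a real KeyedProcessFunction the buffer would presumably live in keyed state and the timer would be registered through the timer service; the sketch assumes both callbacks are invoked from a single thread, so it needs no locking.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of "flush after N events or after a timeout"; this is not Flink API.
public class CountOrTimeoutBuffer {
    private final int maxCount;
    private final long timeoutMillis;
    private final List<String> buffer = new ArrayList<>();
    private long timerAt = -1L; // -1 means no timer is armed

    public CountOrTimeoutBuffer(int maxCount, long timeoutMillis) {
        this.maxCount = maxCount;
        this.timeoutMillis = timeoutMillis;
    }

    /** Models processElement: returns the flushed batch, or an empty list. */
    public List<String> onElement(String event, long nowMillis) {
        if (buffer.isEmpty()) {
            timerAt = nowMillis + timeoutMillis; // first event of a batch arms the timer
        }
        buffer.add(event);
        return buffer.size() >= maxCount ? flush() : new ArrayList<>();
    }

    /** Models onTimer: flushes only if an armed timer is due. */
    public List<String> onTimer(long nowMillis) {
        boolean due = timerAt >= 0 && nowMillis >= timerAt;
        return due ? flush() : new ArrayList<>();
    }

    private List<String> flush() {
        List<String> batch = new ArrayList<>(buffer);
        buffer.clear();
        timerAt = -1L; // disarm until the next batch starts
        return batch;
    }
}
```

With new CountOrTimeoutBuffer(100, 60_000L) this mirrors the thresholds in the question: a batch is released by the 100th event, or by the timer 60 seconds after the first buffered event, whichever comes first.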

Apache Flink - Counter value displayed but meter values not displayed

We are using Flink 1.8.0, running on EMR with YARN, and would like to measure throughput.
Because our operators are chained, we have added meters and counters in our code, which is essentially an async operator that makes API calls, with Kinesis as both source and sink. In the Application Master, i.e. Flink's web UI, we can see the values of the counters but not of the meters.
public class AsyncClass extends RichAsyncFunction<String, String> {
    private transient Counter counter;
    private transient Meter meter;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        this.counter = getRuntimeContext()
                .getMetricGroup()
                .counter("myCounter");
        this.meter = getRuntimeContext()
                .getMetricGroup()
                .meter("myMeter", new DropwizardMeterWrapper(new com.codahale.metrics.Meter()));
    }

    @Override
    public void close() throws Exception {
        super.close();
        ExecutorUtils.gracefulShutdown(20000, TimeUnit.MILLISECONDS, executorService);
    }

    @Override
    public void asyncInvoke(String key, final ResultFuture<String> resultFuture) throws Exception {
        resultFuture.complete(key);
        this.meter.markEvent();
        this.counter.inc();
    }
}
To measure the complete throughput of the application, we obviously need the throughput of all the task managers together. Using meters, we are able to get the metrics for individual task managers. Is there any way to measure it at the operator level?

It turns out the meter value is displayed as a whole number, while the rate itself is measured as a decimal. With a constant load of 1 event per second, the rate was actually measured as 0.9xxx, so the display showed only 0 events per second.
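The effect is easy to reproduce with plain arithmetic. The sketch below is not Flink or Dropwizard code; MeterRateSketch, rate and displayed are made-up names, and it assumes the display simply truncates the fractional part, which matches what was observed here.

```java
// Plain-Java arithmetic sketch; not Flink or Dropwizard API.
public class MeterRateSketch {

    /** Events per second as a fractional rate. */
    static double rate(long events, double seconds) {
        return events / seconds;
    }

    /** What a whole-number display shows if it truncates (an assumption). */
    static long displayed(double rate) {
        return (long) rate;
    }

    public static void main(String[] args) {
        double measured = rate(59, 60.0); // slightly below 1 event per second
        System.out.println("measured rate:  " + measured);
        System.out.println("displayed rate: " + displayed(measured));
    }
}
```

Driving the job with a clearly higher load, for example tens of events per second, makes the meter visibly non-zero.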

How to cache the local variable at process level in Flink streaming?

Inside a Flink task instance I need to access a remote web service to fetch some data when an event arrives. However, I don't want to call the web service for every event, so I need to cache the data in local memory where it can be accessed by all tasks of the process. How can I do this? By storing the data in a private static variable at the class level?
For instance, in the following example, if the variable localCache is declared as an instance field of the class Splitter, it is cached at the operator level instead of the process level.
public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> dataStream = env
                .socketTextStream("localhost", 9999)
                .flatMap(new Splitter())
                .keyBy(0)
                .timeWindow(Time.seconds(5))
                .sum(1);
        dataStream.print();
        env.execute("Window WordCount");
    }

    public static class Splitter implements FlatMapFunction<String, Tuple2<String, Integer>> {
        private Object localCache; // the field in question

        @Override
        public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
            for (String word: sentence.split(" ")) {
                out.collect(new Tuple2<String, Integer>(word, 1));
            }
        }
    }
}

Exactly like you said. You'd use a static variable in a RichFlatMapFunction and initialize it in open. open is called on each parallel instance before any record is fed in. Note that an instance of Splitter is created for each slot, so in most cases there are several Splitter instances on one TaskManager. Thus, you need to guard against double creation.
public static class Splitter extends RichFlatMapFunction<String, Tuple2<String, Integer>> {
    // static: shared by all Splitter instances in the same JVM (same class loader)
    private static Object localCache;

    @Override
    public void open(Configuration parameters) throws Exception {
        // several instances may run open() concurrently, so guard the initialization
        synchronized (Splitter.class) {
            if (localCache == null) {
                localCache = ... ;
            }
        }
    }

    @Override
    public void flatMap(String sentence, Collector<Tuple2<String, Integer>> out) throws Exception {
        for (String word: sentence.split(" ")) {
            out.collect(new Tuple2<String, Integer>(word, 1));
        }
    }
}
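The guard against double creation can be tried out in isolation. The following is a plain-Java sketch, not Flink code: SharedCacheSketch, loadFromService and the loads counter are hypothetical names, the web service call is replaced by a stub, and the four threads merely stand in for several function instances opening in the same JVM.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Plain-Java model of a per-JVM cache shared by several function instances; not Flink API.
public class SharedCacheSketch {
    static final AtomicInteger loads = new AtomicInteger(); // counts expensive initializations
    private static volatile Object localCache;               // one value per JVM / class loader

    /** Stub standing in for the remote web service call (hypothetical). */
    private static Object loadFromService() {
        loads.incrementAndGet();
        return new Object();
    }

    /** Models open(): may be called by many instances, initializes at most once. */
    static Object open() {
        Object cache = localCache;
        if (cache == null) {
            synchronized (SharedCacheSketch.class) {
                cache = localCache; // re-check under the lock
                if (cache == null) {
                    cache = loadFromService();
                    localCache = cache;
                }
            }
        }
        return cache;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] slots = new Thread[4]; // four "slots" opening concurrently
        for (int i = 0; i < slots.length; i++) {
            slots[i] = new Thread(SharedCacheSketch::open);
            slots[i].start();
        }
        for (Thread slot : slots) {
            slot.join();
        }
        System.out.println("initializations: " + loads.get());
    }
}
```

Keep in mind that a static field is shared per class loader, so whether two jobs in the same TaskManager see the same cache depends on how the classes are loaded.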

A more scalable approach might use a source operator to perform the call to the web service and write the result to a stream. You can then consume that stream as a broadcast stream in your operator, so that the single object (the web call result) emitted to the broadcast stream is sent to every instance of the receiving operator. This shares the result of that single web call across all machines and JVMs in your cluster. You can also persist broadcast state and share it with new instances of your operator as the cluster scales up.

Flink streaming job is not scaling as expected

We are testing how well Flink scales, but scaling does not seem to work, whether we add more slots or more TaskManagers. We expected linear, or at least close-to-linear, scaling, but some results even show degradation. Any comments are appreciated.
Test details:
- VMware vSphere
- A simple pass-through test:
  - auto-generated source, 3 million records of 1 KB each, parallelism = 1
  - the source feeds a map operator that returns the same record and sends a counter to StatsD; parallelism = 2, 4, or 6
- 3 TMs with 6 slots in total (2 per TM); each JM/TM has 32 vCPUs and 100 GB of memory
Result:
2 slots: 26 seconds, 3mil/26=115k TPS
4 slots: 23 seconds, 3mil/23=130k TPS
6 slots: 22 seconds, 3mil/22=136k TPS
As shown, there is almost no scaling. Any clue? Thanks.
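To put a number on "almost no scaling", the figures above can be turned into throughput and speedup values. This is only arithmetic on the measurements listed in the results; ScalingSketch, throughput and speedup are made-up names.

```java
// Arithmetic on the measurements listed above; no Flink involved.
public class ScalingSketch {

    /** Records per second. */
    static double throughput(long records, double seconds) {
        return records / seconds;
    }

    /** Speedup of a run relative to a baseline run. */
    static double speedup(double baselineSeconds, double seconds) {
        return baselineSeconds / seconds;
    }

    public static void main(String[] args) {
        long records = 3_000_000L;
        int[] slots = {2, 4, 6};
        double[] seconds = {26.0, 23.0, 22.0};
        for (int i = 0; i < slots.length; i++) {
            System.out.printf("%d slots: %.0f records/s, speedup %.2fx (ideal %.2fx)%n",
                    slots[i],
                    throughput(records, seconds[i]),
                    speedup(seconds[0], seconds[i]),
                    slots[i] / (double) slots[0]);
        }
    }
}
```

Going from 2 to 6 slots would ideally give a 3x speedup, whereas the measured value is about 1.18x, which suggests that something other than the map operator limits the pipeline.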

You really should be using a RichParallelSourceFunction. If you care about making the records from different instances of the source distinct, you can get ahold of each instance's index from the RuntimeContext, which is available via the getRuntimeContext() method in the RichFunction interface.
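One way to use the instance index for distinct records is to let each parallel source instance emit only every n-th id, offset by its own index. The sketch below shows that partitioning in plain Java; ParallelSourceSketch and idsFor are hypothetical names, and subtaskIndex and parallelism are passed in as plain ints here rather than read from the RuntimeContext.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch of splitting an id range across parallel source instances; not Flink API.
public class ParallelSourceSketch {

    /** Ids for one instance: subtaskIndex, subtaskIndex + parallelism, ... below total. */
    static List<Long> idsFor(int subtaskIndex, int parallelism, long total) {
        List<Long> ids = new ArrayList<>();
        for (long id = subtaskIndex; id < total; id += parallelism) {
            ids.add(id);
        }
        return ids;
    }

    public static void main(String[] args) {
        int parallelism = 3;
        for (int subtask = 0; subtask < parallelism; subtask++) {
            System.out.println("subtask " + subtask + " emits " + idsFor(subtask, parallelism, 10));
        }
    }
}
```

Together the instances cover every id exactly once, so the generated load stays the same while the source itself no longer runs with a parallelism of 1.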
Also, Flink has a built-in statsd metrics reporter that you should be using instead of rolling your own. Moreover, numRecordsIn, numRecordsOut, numRecordsInPerSecond, and numRecordsOutPerSecond are already being computed for you, so no need to create this instrumentation yourself. You can also access these metrics via Flink's web interface, or the REST API.
As for why you might be experiencing poor scalability with the Kafka consumer, there are many things that could cause this. If you are using event time processing, then idle partitions could be holding things up (see https://issues.apache.org/jira/browse/FLINK-5479). If the stream is keyed, then data skew could be an issue. If you are connecting to an external database or service, then it could easily be a bottleneck. If checkpointing is misconfigured it could cause this. Or insufficient network capacity.
I would start to debug this by looking at some key metrics in the Flink web UI. Is the load well balanced across the sub-tasks, or is it skewed? You could turn on latency tracking and see if one of the Kafka partitions is misbehaving (by inspecting the latency at the sink(s), which will be reported on a per-partition basis). And you could look for back pressure.

Please refer to the sample code:
public class passthru extends RichMapFunction<String, String> {
    public void open(Configuration configuration) throws Exception {
        ... ...
        stats = new NonBlockingStatsDClient();
    }

    public String map(String value) throws Exception {
        ... ...
        stats.increment();
        return value;
    }
}

public class datagen extends RichSourceFunction<String> {
    ... ...
    public void run(SourceContext<String> ctx) throws Exception {
        int i = 0;
        while (run){
            String idx = String.format("%09d", i);
            ctx.collect("{\"<a 1kb json content with idx in certain json field>\"}");
            i++;
            if(i == loop)
                run = false;
        }
    }
    ... ...
}

public class Job {
    public static void main(String[] args) throws Exception {
        ... ...
        DataStream<String> stream = env.addSource(new datagen(loop)).rebalance();
        DataStream<String> convert = stream.map(new passthru(statsdUrl));
        env.execute("Flink");
    }
}
The ReducingState code:
dataStream.flatMap(xxx).keyBy(new KeySelector<xxx, AggregationKey>() {
public AggregationKey getKey(rec r) throws Exception {
... ...
}
}).process(new Aggr());
public class Aggr extends ProcessFunction<rec, rec> {
private ReducingState<rec> store;
public void open(Configuration parameters) throws Exception {
store= getRuntimeContext().getReducingState(new ReducingStateDescriptor<>(
"reduction store", new ReduceFunction<rec>() {
... ...
}
public void processElement(rec r, Context ctx, Collector<rec> out)
throws Exception {
... ...
store.add(r);

Event time window on kafka source streaming

There is a topic on a Kafka server. In the program, we read this topic as a stream, assign event timestamps, and then apply a window operation to the stream. But the program doesn't work. After debugging, it seems that the processWatermark method of WindowOperator is never executed. Here is my code.
DataStream<Tuple2<String, Long>> advertisement = env
        .addSource(new FlinkKafkaConsumer082<String>("advertisement", new SimpleStringSchema(), properties))
        .map(new MapFunction<String, Tuple2<String, Long>>() {
            private static final long serialVersionUID = -6564495005753073342L;

            @Override
            public Tuple2<String, Long> map(String value) throws Exception {
                String[] splits = value.split(" ");
                return new Tuple2<String, Long>(splits[0], Long.parseLong(splits[1]));
            }
        }).assignTimestamps(timestampExtractor);
advertisement
        .keyBy(keySelector)
        .window(TumblingTimeWindows.of(Time.of(10, TimeUnit.SECONDS)))
        .apply(new WindowFunction<Tuple2<String,Long>, Integer, String, TimeWindow>() {
            private static final long serialVersionUID = 5151607280638477891L;

            @Override
            public void apply(String s, TimeWindow window, Iterable<Tuple2<String, Long>> values, Collector<Integer> out) throws Exception {
                out.collect(Iterables.size(values));
            }
        }).print();
Why does this happen? If I add "keyBy(keySelector)" before "assignTimestamps(timestampExtractor)", the program works. Could anyone explain the reason?

You are affected by a known bug in Flink, FLINK-3121: watermark forwarding does not work for sources not producing any data.
The problem is that there are more FlinkKafkaConsumer instances running (most likely as many as there are CPU cores, say 4) than you have partitions (1). Only one of the Kafka consumers emits watermarks; the others are idle.
The window operator is not aware of that and waits for watermarks to arrive from all consumers. That's why the windows never trigger.
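The waiting behaviour follows from how an operator with several inputs tracks event time: it can only advance to the smallest watermark it has seen across all of them. Here is a minimal plain-Java model of that rule, not Flink code; WatermarkSketch, combinedWatermark and windowCanFire are made-up names, Long.MIN_VALUE stands for "no watermark received yet", and the firing condition (watermark at or past the last timestamp of the window) is an assumption about the event-time trigger.

```java
// Plain-Java model of watermark alignment across parallel inputs; not Flink API.
public class WatermarkSketch {

    /** The operator's event-time clock: the minimum of the latest watermark per input. */
    static long combinedWatermark(long[] latestPerInput) {
        long min = Long.MAX_VALUE;
        for (long watermark : latestPerInput) {
            min = Math.min(min, watermark);
        }
        return min;
    }

    /** A window [start, end) may fire once the combined watermark reaches end - 1 (assumed rule). */
    static boolean windowCanFire(long windowEnd, long[] latestPerInput) {
        return combinedWatermark(latestPerInput) >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long idle = Long.MIN_VALUE; // a consumer that never emitted a watermark
        long[] onePartitionFourConsumers = {25_000L, idle, idle, idle};
        long[] allConsumersActive = {25_000L, 24_000L, 26_000L, 25_500L};
        System.out.println("with idle consumers: " + windowCanFire(10_000L, onePartitionFourConsumers));
        System.out.println("all active:          " + windowCanFire(10_000L, allConsumersActive));
    }
}
```

Under this model, running the source with a parallelism no larger than the number of partitions avoids idle inputs altogether.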