I'm just learning Apache Flink and here is the Word Count sample:
https://ci.apache.org/projects/flink/flink-docs-stable/getting-started/tutorials/local_setup.html
It works, but there is something I can't understand clearly.
Flink has three parts: JobManager, TaskManager and JobClient. As I understand it, the Java code of the class SocketWindowWordCount should be part of the JobClient; this class should send what it wants to do to the JobClient, and the JobClient can then send the tasks to the JobManager.
Am I right?
If I'm right, I don't know which part of the code in the file SocketWindowWordCount.java is responsible for sending what it wants to do to the JobClient.
Is listening on the port also part of the task that will be sent to the JobManager and then to the TaskManager?
// get input data by connecting to the socket
DataStream<String> text = env.socketTextStream("localhost", port, "\n");

// parse the data, group it, window it, and aggregate the counts
DataStream<WordWithCount> windowCounts = text
        .flatMap(new FlatMapFunction<String, WordWithCount>() {
            @Override
            public void flatMap(String value, Collector<WordWithCount> out) {
                for (String word : value.split("\\s")) {
                    out.collect(new WordWithCount(word, 1L));
                }
            }
        })
        .keyBy("word")
        .timeWindow(Time.seconds(5), Time.seconds(1))
        .reduce(new ReduceFunction<WordWithCount>() {
            @Override
            public WordWithCount reduce(WordWithCount a, WordWithCount b) {
                return new WordWithCount(a.word, a.count + b.count);
            }
        });

// print the results with a single thread, rather than in parallel
windowCounts.print().setParallelism(1);
Is all of the code above part of the task?
In short, I roughly understand the architecture of Flink, but I want to know more details about how the JobClient works.
Your program itself is the JobClient from an architectural point of view. In particular, your program has dependencies on the JobClient, which are used when you execute the DataStream program.
All of your code is the task definition that gets serialized and sent to the JobManager, which distributes it to the TaskManager.
You left out the "most" important part of the program:
env.execute("Socket Window WordCount");
That is what actually triggers the JobClient to package the DataStream program and send it to the configured JobManager.
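To make that concrete, here is a minimal sketch (not the original sample; the port and job name are made up): everything before execute() only builds the dataflow graph on the client, and execute() is the JobClient step that submits it.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ClientSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // These calls only build the dataflow graph on the client; nothing is
        // listening on the port yet and no data is processed here.
        DataStream<String> text = env.socketTextStream("localhost", 9000, "\n");
        text.print();

        // The JobClient step: the graph is packaged and submitted to the configured
        // JobManager, which deploys the tasks (including the socket source that
        // actually opens the connection) on the TaskManagers.
        env.execute("Minimal client sketch");
    }
}
(This also covers the question above: opening the socket happens inside the source task on a TaskManager, not on the client.)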
Related
I have an Apache Flink use case that works as follows:
I have data events coming in through the first stream. Part of each event is a foreign key for which I expect data from the second stream. E.g.: I am getting data for major cities in the first stream, each with a city code, and I need the average temperature over time for this city code, streamed through the second stream. It is not possible to have temperatures streamed for all possible cities; we have to request each city for which we need the data.
So we need some way to "notify" the second stream's source that we need data for a city "pushed" when we encounter it the first time in the first stream.
This would have been easy if this notification could be done from the first stream. The problem is that the second stream comes to us through a websocket, part of which is a control channel through which we have to make the request - so the request HAS to be made from the second stream. The flow is roughly:
1. Check an event in the first stream; read city code x.
2. Have we seen this city code? If not, notify the second stream that we need data for city code x.
3. The second stream sends a message to its source requesting data for x.
4. Data starts flowing in for city x, which is used for the join downstream.
If notification from the first stream were possible, this would be easy: I could have done it in step 2, so data starts flowing in the second stream. But that is not possible, as the request needs to be sent on the same websocket connection that feeds the second stream.
I have explored using CoProcessFunction or RichCoMapFunction for this, but it is not clear how this can be done. I have seen some examples of the Broadcast State pattern, but even that does not seem to fit the use case.
Can someone help me with some pointers on possible solutions?
So I made it work using the suggestion of the side output stream. Thanks @whatisinthename and @kkrugler for the suggestions.
I'm still trying to figure out the details, but here's a summary:
1. From the notification stream (stream 1), create a side output stream (stream 1-1).
2. Use a class (TempRequester) that extends KeyedProcessFunction to process side output stream 1-1 and create stream 2 from it. The KeyedProcessFunction holds the websocket connection.
3. In the open method of the KeyedProcessFunction, create the connection to the websocket (handshaking etc.). Keep a ListState to hold the list of city codes seen so far.
4. In the processElement function of TempRequester, check the city code coming in from side output stream 1-1. If it is present in the ListState, do nothing. Otherwise, send a message through the websocket control channel requesting data for that city and add the code to the ListState. Also register a processing-time timer (one time only) to fire after 500 milliseconds or so. The websocket server writes the temperature data very frequently, and that data is saved in a queue.
5. In the onTimer method, check the queue, read the data and push it out (out.collect...), then register the timer again. So essentially, once the first city code arrives, we create a timer that runs every 500 milliseconds and dumps the received records into the second stream.
6. Now the first and second streams can be joined downstream (I used the Table API).
Not sure if this is the most elegant solution, but it worked. Thanks for the suggestions.
Here's the approximate main code:
DataStream<Event> notificationStream = env
        .addSource(this.notificationSource)
        .returns(TypeInformation.of(Event.class));

// assignTimestampsAndWatermarks returns a new stream, so keep the result
notificationStream = notificationStream
        .assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());

final OutputTag<String> outputTag = new OutputTag<String>("cities-seen") {};

SingleOutputStreamOperator<Event> mainDataStream = notificationStream.process(new ProcessFunction<Event, Event>() {
    @Override
    public void processElement(Event value, Context ctx, Collector<Event> out) throws Exception {
        // emit data to regular output
        out.collect(value);
        // emit the city code to the side output
        ctx.output(outputTag, value.cityCode);
    }
});

DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);

DataStream<TemperatureData> temperatureStream = sideOutputStream
        .keyBy(value -> value)
        .process(new TempRequester());

temperatureStream = temperatureStream
        .assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());

// set up the Java Table API and rest of SQL based joins ...
And the approximate code for TempRequester (the KeyedProcessFunction):
public static class TempRequester extends KeyedProcessFunction<String, String, TemperatureData> {

    private ListState<String> allCities;
    private volatile boolean running = true;

    // This is the queue for city codes that still need to be requested
    private BlockingQueue<String> messagesToSend = new ArrayBlockingQueue<>(100);

    // This is the queue for receiving temperature data
    private ConcurrentLinkedQueue<TemperatureData> messages = new ConcurrentLinkedQueue<TemperatureData>();

    private static final int TIMEOUT = 500;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        allCities = getRuntimeContext().getListState(new ListStateDescriptor<>("List of cities seen", String.class));
        // ... rest of websocket client setup code ...
    }

    @Override
    public void close() throws Exception {
        running = false;
        super.close();
    }

    private boolean initialized = false;

    @Override
    public void processElement(String cityCode, Context ctx, Collector<TemperatureData> collector) throws Exception {
        boolean cityCodeFound = StreamSupport.stream(allCities.get().spliterator(), false)
                .anyMatch(s -> cityCode.equals(s));
        if (!cityCodeFound) {
            allCities.add(cityCode);
            messagesToSend.put(cityCode); // request data for this city over the control channel
            if (!initialized) {
                ctx.timerService().registerProcessingTimeTimer(ctx.timestamp() + TIMEOUT);
                initialized = true;
            }
        }
    }

    @Override
    public void onTimer(long timestamp, OnTimerContext ctx, Collector<TemperatureData> out) throws Exception {
        TemperatureData p;
        while ((p = messages.poll()) != null) {
            out.collect(p);
        }
        ctx.timerService().registerProcessingTimeTimer(ctx.timestamp() + TIMEOUT);
    }
}
Apache Flink makes it easy to adapt operators. I am interested in the internal processing and want to log things that happen inside an operator. For this, a logger object is handed to the operator.
public class LogSink extends RichSinkFunction<TaxiRide> {

    private static final Logger log = LoggerFactory.getLogger("myLogger");

    public LogSink() {
        String msg = "Log Sink initialized";
        log.info(msg);
    }

    @Override
    public void invoke(TaxiRide ride, Context context) throws Exception {
        log.info("Name: " + ride.getName());
    }
}
In my main method on the main server (master), I initialize the operator. So the message "Log Sink initialized" appears in my custom log file, as desired.
But the log messages (e.g. "Name: TaxiRide324") that are logged within invoke() - which is called on a worker, i.e. in another JVM - are written to Flink's taskexecutor.log.
I assume this is because of the distributed processing. The TaskManager and JobManager run in different JVMs, so the logger initialized on the master is not the one used by the TaskManager during execution. (Interestingly, there is no NullPointerException...)
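To illustrate my assumption, this is roughly where I believe each part of such a sink runs (just a sketch; the open() method is added here only for illustration and is not in my actual code):
public class LogSink extends RichSinkFunction<TaxiRide> {

    // static field: initialized separately in every JVM that loads the class,
    // using that JVM's logging configuration
    private static final Logger log = LoggerFactory.getLogger("myLogger");

    public LogSink() {
        // runs on the client/master JVM while the job graph is built,
        // hence the "Log Sink initialized" entry in my custom log file
        log.info("Log Sink initialized");
    }

    @Override
    public void open(Configuration parameters) {
        // runs once per parallel instance on a TaskManager JVM
    }

    @Override
    public void invoke(TaxiRide ride, Context context) throws Exception {
        // also runs on the TaskManager JVM, so it ends up in taskexecutor.log
        log.info("Name: " + ride.getName());
    }
}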
So my question is: how can I share objects between the initialization and the execution of an inner class on a distributed Flink cluster?
We are running a Beam application on a Flink cluster with side inputs of size 50 MB.
The side input is refreshed (pulled from an external data source) based on notifications sent to a notification topic in Kafka.
As the application progresses, full GCs happen frequently because of the side input; each GC takes ~30 seconds, which prevents the TaskManager from sending heartbeats to the master.
After several consecutive missed heartbeats, the master assumes the worker is dead and starts reassigning the jobs, which results in the application restarting.
When we tried removing the side input, the application worked fine.
Questions:
Is there any limitation on the size of a side input in Apache Beam?
I have created the side input map using asSingleton(). Is it going to create a separate copy for each task? I have set the parallelism to 15; is it going to create 15 copies in one JVM (assuming all tasks are assigned to the same worker)?
What is the alternative to side inputs?
This is a sample pipeline:
public class BeamApplication {

    public static final CloseableHttpClient httpClient = HttpClients.createDefault();

    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.create();
        options.as(FlinkPipelineOptions.class).setRunner(FlinkRunner.class);
        Pipeline pipeline = Pipeline.create(options);

        PCollection<Map<String, Double>> sideInput = pipeline
                .apply(KafkaIO.<String, String>read().withBootstrapServers("localhost:9092")
                        .withKeyDeserializer(StringDeserializer.class).withValueDeserializer(StringDeserializer.class)
                        .withTopic("testing"))
                .apply(ParDo.of(new DoFn<KafkaRecord<String, String>, Map<String, Double>>() {
                    @ProcessElement
                    public void processElement(ProcessContext processContext) {
                        KafkaRecord<String, String> record = processContext.element();
                        String message = record.getKV().getValue().split("##")[0];
                        String change = record.getKV().getValue().split("##")[1];
                        if (message.equals("START_REST")) {
                            Map<String, Double> map = new HashMap<>();
                            Map<String, Double> changeMap = new HashMap<>();
                            HttpGet request = new HttpGet("http://localhost:8080/config-service/currency");
                            try (CloseableHttpResponse response = httpClient.execute(request)) {
                                HttpEntity entity = response.getEntity();
                                String responseString = EntityUtils.toString(entity, "UTF-8");
                                ObjectMapper objectMapper = new ObjectMapper();
                                CurrencyDTO jsonObject = objectMapper.readValue(responseString, CurrencyDTO.class);
                                map.putAll(jsonObject.getQuotes());
                                System.out.println(change);
                                Random rand = new Random();
                                Double db = rand.nextDouble();
                                System.out.println(db);
                                changeMap.put(change, db);
                                entity.getContent();
                            } catch (Exception e) {
                                e.printStackTrace();
                            }
                            processContext.output(changeMap);
                        }
                    }
                }));

        PCollection<Map<String, Double>> currency = sideInput
                .apply(Window.<Map<String, Double>>into(new GlobalWindows())
                        .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                        .withAllowedLateness(Duration.ZERO).discardingFiredPanes());

        PCollectionView<Map<String, Double>> sideInputView = currency.apply(View.asSingleton());

        PCollection<KafkaRecord<Long, String>> kafkaEvents = pipeline
                .apply(KafkaIO.<Long, String>read().withBootstrapServers("localhost:9092")
                        .withKeyDeserializer(LongDeserializer.class).withValueDeserializer(StringDeserializer.class)
                        .withTopic("event_testing"));

        PCollection<String> output = kafkaEvents
                .apply("Extract lines", ParDo.of(new DoFn<KafkaRecord<Long, String>, String>() {
                    @ProcessElement
                    public void processElement(ProcessContext processContext) {
                        String element = processContext.element().getKV().getValue();
                        Map<String, Double> map = processContext.sideInput(sideInputView);
                        System.out.println("This is it : " + map.entrySet());
                    }
                }).withSideInputs(sideInputView));

        pipeline.run().waitUntilFinish();
    }
}
What state-backend are you using?
If I'm not mistaken, side inputs are implemented as state in Flink. If you're using MemoryStateBackend as the state backend, you might indeed put pressure on your memory.
Also, the processing of events will block until the side input is ready, buffering events in the meantime. If preparing the side input takes a long time or the rate of incoming events is high, you might run into memory pressure.
Can you try an alternative state backend? Preferably RocksDBStateBackend; it holds in-flight data in a RocksDB database instead of in memory.
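If you want to try that: as far as I know, switching the state backend is a cluster-level configuration change, roughly like this in flink-conf.yaml (the checkpoint directory is just an example path):
state.backend: rocksdb
state.checkpoints.dir: file:///tmp/flink-checkpoints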
It's difficult to guess what the issue is. I would recommend monitoring memory-related metrics - see a good post on that here.
You could also run profiling on the TaskManagers and analyse the dumps - see here.
Does memory also increase if you only publish the first message to the "testing" topic?
To isolate the problem, maybe use a simpler side input: remove the HTTP call and make the data static. Maybe a periodically triggered one instead of Kafka:
GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L))
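For example, something like this (just a sketch; the static map content is made up, and it reuses the same window/trigger setup as your pipeline):
PCollectionView<Map<String, Double>> sideInputView = pipeline
        .apply(GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L)))
        .apply(ParDo.of(new DoFn<Long, Map<String, Double>>() {
            @ProcessElement
            public void processElement(ProcessContext c) {
                Map<String, Double> map = new HashMap<>();
                map.put("USD", 1.0); // static test data instead of the HTTP call
                c.output(map);
            }
        }))
        .apply(Window.<Map<String, Double>>into(new GlobalWindows())
                .triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
                .withAllowedLateness(Duration.ZERO).discardingFiredPanes())
        .apply(View.asSingleton());
If the memory behaviour is fine with this version, the problem is more likely in how the side input data is produced than in the side input mechanism itself.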
I'm trying to evaluate whether Apache Flink would be usable for a distributed event-driven system (only-once). The use case is that a user is signed up for a subscription and wants to change to a different subscription.
There are two separate processes that run asynchronously when the users clicks the submit button. One process cancels the existing subscription whilst another signs up for the new subscription. Once these two events have been triggered, the email notification is sent.
I've managed to create two streams in Apache Flink using the RabbitMQ connector. When I try joining these streams together using a sliding window, the events are duplicated for each slide of the window. I've tried setting a ValueStateDescriptor on the joined streams, but this doesn't seem to expire after the window has passed.
Additionally, I need to detect events that have not been paired in the streams and send them to a different RabbitMQ sink, to cope with situations where an event has not been fired because the process did not complete successfully.
Do you have any tips/ideas on how I could achieve the above functionality?
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

final RMQConnectionConfig rmqConnectionConfig = new RMQConnectionConfig.Builder()
        .setHost("localhost")
        .setPort(5672)
        .setVirtualHost("/")
        .setUserName("admin")
        .setPassword("password")
        .build();

final DataStream<String> cancellation = environment
        .addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-cancellation", true, new SimpleStringSchema()))
        .setParallelism(1);

final DataStream<String> subscription = environment
        .addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-subscription", true, new SimpleStringSchema()))
        .setParallelism(1);

cancellation
        .join(subscription)
        .where(value -> value).equalTo(value -> value)
        .window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(15)))
        .apply((left, right) -> left)
        .keyBy(value -> value)
        .process(new ProcessFunction<String, String>() {

            private ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class);
            private ValueState<Boolean> state;

            @Override
            public void open(Configuration parameters) {
                state = this.getRuntimeContext().getState(descriptor);
            }

            @Override
            public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
                if (BooleanUtils.isNotTrue(state.value())) {
                    state.update(true);
                    out.collect(value);
                    ctx.timerService().registerEventTimeTimer(ctx.timestamp() + TimeUnit.MILLISECONDS.convert(10, TimeUnit.MINUTES));
                }
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
                state.clear();
            }
        })
        .print()
        .setParallelism(1);

environment.execute();
If you have duplicated values on the output of your window, you can add a reduce function in another window after your already defined sliding window, and that should be enough in most cases. There is probably a better solution than this, but we would need an example of your code to work on improvements.
On the other hand, if you need to detect non-paired events, I think you need to work with the CoGroup operator instead of using joins.
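Something along these lines (just a sketch; the tumbling window and the output strings are only examples, not from your code): coGroup is called even when one side of the group is empty, which is what lets you spot the unpaired events.
cancellation
        .coGroup(subscription)
        .where(value -> value).equalTo(value -> value)
        .window(TumblingEventTimeWindows.of(Time.minutes(5)))
        .apply(new CoGroupFunction<String, String, String>() {
            @Override
            public void coGroup(Iterable<String> cancellations, Iterable<String> subscriptions, Collector<String> out) {
                boolean hasCancellation = cancellations.iterator().hasNext();
                boolean hasSubscription = subscriptions.iterator().hasNext();
                if (hasCancellation && hasSubscription) {
                    // both events arrived within the window: emit the paired event once
                    out.collect("PAIRED: " + cancellations.iterator().next());
                } else if (hasCancellation) {
                    // no matching subscription: candidate for the "unpaired" RabbitMQ sink
                    out.collect("UNPAIRED-CANCELLATION: " + cancellations.iterator().next());
                } else {
                    out.collect("UNPAIRED-SUBSCRIPTION: " + subscriptions.iterator().next());
                }
            }
        });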
I have a pulling scenario:
HTTP -> Kafka -> Flink -> some output
If I'm not wrong, I can only use the Kafka consumer on a stream?
Therefore I need to "block" the stream in order to sum/count the data I'm receiving from the HTTP call.
The easiest way to "block" is to add a window.
What is the best approach for this pulling scenario?
UPDATE
I want to prevent the collector from emitting a sum for each incoming value:
SingleOutputStreamOperator<Tuple2<String, Integer>> t = in
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                ObjectMapper mapper = new ObjectMapper();
                JsonNode node = mapper.readTree(s);
                node.elements().forEachRemaining(v -> {
                    collector.collect(new Tuple2<>(v.textValue(), 1));
                });
            }
        })
        .keyBy(0)
        .sum(1);
If I understand correctly, I think what you may want to use is a session window. It will keep collecting messages into the window and will only process the window's contents once no event has been received for a certain amount of time. See the documentation on session windows here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html
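As a rough sketch applied to the snippet above (the 10-second processing-time gap is just an example), replacing the rolling sum so that a single total per key is emitted only once the input goes quiet:
SingleOutputStreamOperator<Tuple2<String, Integer>> totals = in
        .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
            @Override
            public void flatMap(String s, Collector<Tuple2<String, Integer>> collector) throws Exception {
                ObjectMapper mapper = new ObjectMapper();
                JsonNode node = mapper.readTree(s);
                node.elements().forEachRemaining(v -> collector.collect(new Tuple2<>(v.textValue(), 1)));
            }
        })
        .keyBy(0)
        // the session closes (and one sum per key is emitted) after 10 s without new elements
        .window(ProcessingTimeSessionWindows.withGap(Time.seconds(10)))
        .sum(1);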