Flink Kafka consumer StreamExecutionEnvironment only?

Flink Kafka consumer StreamExecutionEnvironment only? - apache-flink

I have pulling scenario,
HTTP -> Kafka -> Flink -> some output
If im not wrong i can use kafka consumer on stream only ?
Therefor i need to "block" the stream in order to sum/count the data im receiving from the HTTP call .
The easiest way to "block" is to add window/.
What is the best approach for this pulling scenario .
UPDATE
I want to prevent from the collector to sum each value
SingleOutputStreamOperator<Tuple2<String, Integer>> t =
in.flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
#Override
public void flatMap(String s, Collector<Tuple2<String, Integer>>
collector) throws Exception {
ObjectMapper mapper = new ObjectMapper();
JsonNode node = mapper.readTree(s);
node.elements().forEachRemaining(v -> {
collector.collect(new Tuple2<>(v.textValue(), 1));
});
}
}).keyBy(0).sum(1);

If I understand correctly I think what you may want to use is a session window. This will continue to collect messages into the window and will only process the contents of the window when an event hasn't been received after a certain amount of time. See the documentation on session windows here: https://ci.apache.org/projects/flink/flink-docs-release-1.3/dev/windows.html

Related

Flink - Need way to notify one stream from another

I have an Apache flink usecase that works as follows:
I have data events coming in through first stream. Part of each event is a foreign key for which I expect data from the second stream. E.g.: I am getting data for major cities in the first stream which has a city-code and I need the average temperature over time for this city code streamed through the second stream. It is not possible to have temperatures streamed for all possible cities, we have to request the city for which we need the data.
So we need some way to "notify" the second stream source that we need data for this city "pushed" when we encounter it the first time in the first stream.
This would have been easy if this notification could be done from the first stream. The problem is that the second stream is coming to us through a websocket part of which is a control channel through which we have to make the request - so the request HAS to be made from the second stream.
Check event in the first stream. Read city code x.
Have we seen this city code? If not, notify the second stream, we need data for city code x.
Second stream sends message to source for data for x.
Data starts flowing in for city x, which is used to join downstream.
If notification from the first stream was possible, this would be easy - I could have done it from step 2, so data starts flowing in the second stream. But that is not possible as the request needs to be send on the same websocket connection that feeds the second stream.
I have explored using CoProcessFunction or RichCoMapFunction for this - but it is not clear how this can be done. I have seen some examples of Broadcast State Pattern - but even that does not seem to fit the usecase.
Can someone help me with some pointers on possible solutions?

So I made it work using the suggestion of the side output stream. Thanks #whatisinthename and #kkrugler for the suggestions.
Still trying to figure out details, but here's a summary
From the notification stream (stream 1), create a side output stream (stream 1-1).
Use an extended class (TempRequester) of KeyedProcessFunction, to process the side output stream 1-1 and create Stream 2 from it. The KeyedProcessFunction has the websocket connection.
In the open method of the KeyedProcessFunction create the connection to websocket (handshaking etc.). Have a ListState state to keep the list of city codes.
In the processElement function of TempRequester, check the city code coming in from side output stream 1-1. If present in ListState, do nothing. Else, send a message through websocket control channel and request city data and add the code to ListState. Create a process timer (this is one time) to fire after 500 milliseconds or so. The websocket server writes the temp data very frequently and that is saved in a queue.
In the onTimer method, check the queue, read the data and push out (out.collect...). Create a timer again. So essentially, once the first city code gets in, we create a timer that runs every 500 milliseconds and dumps the records received out into the second stream.
Now the first and second streams can be joined downstream (I used the table API).
Not sure if this is the most elegant solution, but it worked. Thanks for the suggestions.
Here's the approximate main code:
DataStream<Event> notificationStream =
env.addSource(this.notificationSource)
.returns(TypeInformation.of(Event.class));
notificationStream.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
final OutputTag<String> outputTag = new OutputTag<String>("cities-seen"){};
SingleOutputStreamOperator<Event> mainDataStream = notificationStream.process(new ProcessFunction<Event, Event>() {
#Override
public void processElement(
Event value,
Context ctx,
Collector<Event> out) throws Exception {
// emit data to regular output
out.collect(value);
// emit data to side output
ctx.output(outputTag, event.cityCode);
}
});
DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);
DataStream<TemperatureData> temperatureStream = sideOutputStream
.keyBy(value -> value)
.process(new TempRequester());
temperatureStream.assignTimestampsAndWatermarks(WatermarkStrategy.forMonotonousTimestamps());
// set up the Java Table API and rest of SQL based joins ...
And the approximate code for TempRequester (ProcessFunction):
public static class TempRequester extends KeyedProcessFunction<String, String, TemperatureData> {
private ListState<String> allCities;
private volatile boolean running = true;
//This is the queue for requesting city codes
private BlockingQueue<String> messagesToSend = new ArrayBlockingQueue<>(100);
//This is the queue for receiving temperature data
private ConcurrentLinkedQueue<TemperatureData> messages = new ConcurrentLinkedQueue<TemperatureData>();
private static final int TIMEOUT = 500;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
allCities = getRuntimeContext().getListState(new ListStateDescriptor<>("List of cities seen", String.class));
... rest of websocket client setup code ...
}
#Override
public void close() throws Exception {
running = false;
super.close();
}
private boolean initialized = false;
#Override
public void processElement(String cityCode, Context ctx, Collector<TemperatureData> collector) throws Exception {
boolean citycodeFound = StreamSupport.stream(allCities.get().spliterator(), false)
.anyMatch(s -> cityCode.equals(s));
if (!citycodeFound) {
allCities.add(cityCode);
messagesToSend.put(.. add city code ..);
if (!initialized) {
ctx.timerService().registerProcessingTimeTimer(ctx.timestamp()+ TIMEOUT);
initialized = true;
}
}
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<TemperatureData> out) throws Exception {
TemperatureData p;
while ((p = messages.poll()) != null) {
out.collect(p);
}
ctx.timerService().registerProcessingTimeTimer(ctx.timestamp() + TIMEOUT);
}
}

Flink Checkpointing mode ExactlyOnce is not working as expected

I am newbie to flink apologize if my understanding is wrong i am building a dataflow application and the flow contains multiple data streams which check if the required fields are present in the incoming DataStream or not. My application validate the incoming data and if the data is validated successfully it should append the data to file in the given if it is already existing. I am trying to simulate if any exception happens in one DataStream other data streams should not get impacted for that i am explicitly throwing an exception in one of the flow. In the below example for simplicity i am using windows text file to append data
Note: My flow don't have states since i don't have any thing to store in state
public class ExceptionTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// env.setParallelism(1);
//env.setStateBackend(new RocksDBStateBackend("file:///C://flinkCheckpoint", true));
// to set minimum progress time to happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within 5000 ms, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(5000);
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // DELETE_ON_CANCELLATION
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay
));
DataStream<String> input1 = env.fromElements("hello");
DataStream<String> input2 = env.fromElements("hello");
DataStream<String> output1 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
//out.collect(value.concat(" world"));
throw new Exception("=====================NO VALUE TO CHECK=================");
}
});
DataStream<String> output2 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
out.collect(value.concat(" world"));
}
});
output2.addSink(new SinkFunction<String>() {
#Override
public void invoke(String value) throws Exception {
try {
File myObj = new File("C://flinkOutput//filename.txt");
if (myObj.createNewFile()) {
System.out.println("File created: " + myObj.getName());
BufferedWriter out = new BufferedWriter(
new FileWriter("C://flinkOutput//filename.txt", true));
out.write(value);
out.close();
System.out.println("Successfully wrote to the file.");
} else {
System.out.println("File already exists.");
BufferedWriter out = new BufferedWriter(
new FileWriter("C://flinkOutput//filename.txt", true));
out.write(value);
out.close();
System.out.println("Successfully wrote to the file.");
}
} catch (IOException e) {
System.out.println("An error occurred.");
e.printStackTrace();
}
}
});
env.execute();
}
I have few doubts as below
When i am throwing exception in output1 stream the second flow output2 is running even after encountering the exception and writing data to the file in my local but when i check the file the output as below
hello world
hello world
hello world
hello world
As per my understanding from flink documentation if i use the checkpointing mode as EXACTLY_ONCE it should not write the data to file not more than one time as the process is already completed and written data to file. But its not happening in my case and i am not getting if i am doing anything wrong
Please help me to clear my doubts on checkpointing and how can i achieve the EXACTLY_ONCE mechanism i read about TWO_PHASE_COMMIT in flink but i didn't get any example on how to implement it.
As suggested by #Mikalai Lushchytski i implemented StreamingSinkFunction below
With StreamingSinkFunction
public class ExceptionTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000);
// env.setParallelism(1);
//env.setStateBackend(new RocksDBStateBackend("file:///C://flinkCheckpoint", true));
// to set minimum progress time to happen between checkpoints
env.getCheckpointConfig().setMinPauseBetweenCheckpoints(500);
// checkpoints have to complete within 5000 ms, or are discarded
env.getCheckpointConfig().setCheckpointTimeout(5000);
// set mode to exactly-once (this is the default)
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
// enable externalized checkpoints which are retained after job cancellation
env.getCheckpointConfig().enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION); // DELETE_ON_CANCELLATION
env.setRestartStrategy(RestartStrategies.fixedDelayRestart(
3, // number of restart attempts
Time.of(10, TimeUnit.SECONDS) // delay
));
DataStream<String> input1 = env.fromElements("hello");
DataStream<String> input2 = env.fromElements("hello");
DataStream<String> output1 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
//out.collect(value.concat(" world"));
throw new Exception("=====================NO VALUE TO CHECK=================");
}
});
DataStream<String> output2 = input.flatMap(new FlatMapFunction<String, String>() {
#Override
public void flatMap(String value, Collector<String> out) throws Exception {
out.collect(value.concat(" world"));
}
});
String outputPath = "C://flinkCheckpoint";
final StreamingFileSink<String> sink = StreamingFileSink
.forRowFormat(new Path(outputPath), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(TimeUnit.MINUTES.toMillis(15))
.withInactivityInterval(TimeUnit.MINUTES.toMillis(5))
.withMaxPartSize(1)
.build())
.build();
output2.addSink(sink);
});
env.execute();
}
But when i check the Checkpoint folder i can see it created four part files with in progress as below
Is there anything i am doing because of that its creating multipart files?

In order to guarantee end-to-end exactly-once record delivery (in addition to exactly-once state semantics), the data sink needs to take part in the checkpointing mechanism (as well as the data source).
If you are going to write the data to a file, then you can use a StreamingFileSink, which emits its input elements to FileSystem files within buckets. This is integrated with the checkpointing mechanism to provide exactly once semantics out-of-the box.
If you are going to implement your own sink, then the sink function must implement the CheckpointedFunction interface and properly implement snapshotState(FunctionSnapshotContext context) method called when a snapshot for a checkpoint is requested and flushing the current application state. In addition I would recommend implementing the CheckpointListener interface to be notified once a distributed checkpoint has been completed.
Flink already provides an abstract TwoPhaseCommitSinkFunction, which is a recommended base class for all of the SinkFunction that intend to implement exactly-once semantic. It does that by implementing two phase commit algorithm on top of the CheckpointedFunction and
CheckpointListener. As an example, you can have a look at FlinkKafkaProducer.java source code.

How do we understand the Word-Count sample of Apache Flink

I'm just learning Apache Flink and here is the Word Count sample:
https://ci.apache.org/projects/flink/flink-docs-stable/getting-started/tutorials/local_setup.html
I works but I have something that can't understand clearly.
Flink have three parts: JobManager, TaskManager and JobClient. As my understanding, the java code of the class SocketWindowWordCount should be a part of JobClient, this class should send what it asks to do to the JobClient then the JobClient can send the tasks to the JobManager.
Am I right?
If I'm right, I don't know which part of code in the file SocketWindowWordCount.java is responsible to send what it asks to do to the JobClient.
Is listening on the port also a part of the task which will be sent to the JobManager then to TaskManager?
// get input data by connecting to the socket
DataStream<String> text = env.socketTextStream("localhost", port, "\n");
// parse the data, group it, window it, and aggregate the counts
DataStream<WordWithCount> windowCounts = text
.flatMap(new FlatMapFunction<String, WordWithCount>() {
#Override
public void flatMap(String value, Collector<WordWithCount> out) {
for (String word : value.split("\\s")) {
out.collect(new WordWithCount(word, 1L));
}
}
})
.keyBy("word")
.timeWindow(Time.seconds(5), Time.seconds(1))
.reduce(new ReduceFunction<WordWithCount>() {
#Override
public WordWithCount reduce(WordWithCount a, WordWithCount b) {
return new WordWithCount(a.word, a.count + b.count);
}
});
// print the results with a single thread, rather than in parallel
windowCounts.print().setParallelism(1);
Is all of the codes above a part of the task?
In a word, I kind of understand the architecture of Flink but I want to know more details about how the JobClient works.

Your program itself is the JobClient from the architectural point of view. In particular, you have dependencies on the JobClient that are used when you execute the DataStream.
All of your code is the task definition that gets serialized and sent to the JobManager, which distributes it to the TaskManager.
You left out the "most" important part of the program
env.execute("Socket Window WordCount");
That is actually triggering the JobClient to package the DataStream program and send it to the configured JobManager.

Side input of size around 50Mb causing long GC pause

We are running Beam application on Flink cluster with side inputs of size 50Mb.
Side input refresh ( Pull from external data source ) based on the notification sent to the notification topic in Kafka.
As the application progress due to side input Full GC happening often and each GC taking ~30 sec which pauses task manager to send heart beat to the Master.
After consecutive heartbeat miss , master assuming worker is dead and start reassigning the jobs , results restarting of application.
We tried removing Side input , application works fine.
Questions :
Is there any limitation on size of side input in Apache Beam side input ?
I have created side input map using asSingleton() , is going to create seprate copy for each task ? I have given 15 parallelism. is it going to create 15 copy in a JVM ( assuming all tasks assigned to same worker )?
What is alternative for side inputs?
This is sample pipeline :
public class BeamApplication {
public static final CloseableHttpClient httpClient = HttpClients.createDefault();
public static void main(String[] args) {
PipelineOptions options = PipelineOptionsFactory.create();
options.as(FlinkPipelineOptions.class).setRunner(FlinkRunner.class);
Pipeline pipeline = Pipeline.create(options);
PCollection<Map<String, Double>> sideInput = pipeline
.apply(KafkaIO.<String, String>read().withBootstrapServers("localhost:9092")
.withKeyDeserializer(StringDeserializer.class).withValueDeserializer(StringDeserializer.class)
.withTopic("testing"))
.apply(ParDo.of(new DoFn<KafkaRecord<String, String>, Map<String, Double>>() {
#ProcessElement
public void processElement(ProcessContext processContext) {
KafkaRecord<String, String> record = processContext.element();
String message = record.getKV().getValue().split("##")[0];
String change = record.getKV().getValue().split("##")[1];
if (message.equals("START_REST")) {
Map<String, Double> map = new HashMap<>();
Map<String,Double> changeMap = new HashMap<>();
HttpGet request = new HttpGet("http://localhost:8080/config-service/currency");
try (CloseableHttpResponse response = httpClient.execute(request)) {
HttpEntity entity = response.getEntity();
String responseString = EntityUtils.toString(entity, "UTF-8");
ObjectMapper objectMapper = new ObjectMapper();
CurrencyDTO jsonObject = objectMapper.readValue(responseString, CurrencyDTO.class);
map.putAll(jsonObject.getQuotes());
System.out.println(change);
Random rand = new Random();
Double db = rand.nextDouble();
System.out.println(db);
changeMap.put(change,db);
entity.getContent();
} catch (Exception e) {
e.printStackTrace();
}
processContext.output(changeMap);
}
}
}));
PCollection<Map<String, Double>> currency = sideInput
.apply(Window.<Map<String, Double>>into(new GlobalWindows())
.triggering(Repeatedly.forever(AfterPane.elementCountAtLeast(1)))
.withAllowedLateness(Duration.ZERO).discardingFiredPanes());
PCollectionView<Map<String, Double>> sideInputView = currency.apply(View.asSingleton());
PCollection<KafkaRecord<Long, String>> kafkaEvents = pipeline
.apply(KafkaIO.<Long, String>read().withBootstrapServers("localhost:9092")
.withKeyDeserializer(LongDeserializer.class).withValueDeserializer(StringDeserializer.class)
.withTopic("event_testing"));
PCollection<String> output = kafkaEvents
.apply("Extract lines", ParDo.of(new DoFn<KafkaRecord<Long, String>, String>() {
#ProcessElement
public void processElement(ProcessContext processContext) {
String element = processContext.element().getKV().getValue();
Map<String, Double> map = processContext.sideInput(sideInputView);
System.out.println("This is it : " + map.entrySet());
}
}).withSideInputs(sideInputView));
pipeline.run().waitUntilFinish();
}
}

What state-backend are you using?
If i'm not mistaken, side inputs are implemented as state in Flink. If you're using MemoryStateBackend as state-backend, you might indeed reach pressure on you memory consumption.
Also, the processing of events will block until that side input is ready, buffering events. If preparing the side input take long time or the rate of incoming events is high, you might reach memory pressure.
Can try an alternative state-backend? Preferably RocksDBStateBackend, it holds in-flight data in a RocksDB database instead of in-memory.
It's difficult to guess what's the issue. I would recommend monitoring memory related metrics - see a good post on that here.
You could also run profiling on the Task Managers and analyse the dumps - see here
Is the memory increasing also if you only publish the first message to "testing" topic?
Maybe to isolate the problem I would use a simpler side-input. Remove the HTTP call and make the data static. Maybe a periodic triggered one instead of Kafka:
GenerateSequence.from(0).withRate(1, Duration.standardSeconds(5L))

Joining stream events only once over a sliding window

I'm trying to evaluate if apache flink would be usable for a distributed event driven system (only-once). The use case is that a user is signed up for a subscription and wants to change for a different subscription.
There are two separate processes that run asynchronously when the users clicks the submit button. One process cancels the existing subscription whilst another signs up for the new subscription. Once these two events have been triggered, the email notification is sent.
I've managed to create two streams in apache flink using the RabbitMQ connector. When I try joining these streams together using a sliding window, the events are duplicated for each slide in the window. I've tried setting a ValueStateDescriptor on the joined streams but this doesn't seem to expire after the window has passed.
Additionally I need to detect the events that have not been paired in the streams and send this event to a different RabbitMQ sink to cope with situations whereby the event has not be fired due to the process not completing successfully.
Do you have any tips/ideas on how I could achieve the above functionality?
final StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
final RMQConnectionConfig rmqConnectionConfig = new RMQConnectionConfig.Builder()
.setHost("localhost")
.setPort(5672)
.setVirtualHost("/")
.setUserName("admin")
.setPassword("password")
.build();
final DataStream<String> cancellation = environment
.addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-cancellation", true, new SimpleStringSchema()))
.setParallelism(1);
final DataStream<String> subscription = environment
.addSource(new RMQSource<>(rmqConnectionConfig, "scratchpad-subscription", true, new SimpleStringSchema()))
.setParallelism(1);
cancellation
.join(subscription)
.where(value -> value).equalTo(value -> value)
.window(SlidingEventTimeWindows.of(Time.minutes(5), Time.seconds(15)))
.apply((left, right) -> left)
.keyBy(value -> value)
.process(new ProcessFunction<String, String>() {
private ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>("seen", Boolean.class);
private ValueState<Boolean> state;
#Override
public void open(Configuration parameters) {
state = this.getRuntimeContext().getState(descriptor);
}
#Override
public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
if (BooleanUtils.isNotTrue(state.value())) {
state.update(true);
out.collect(value);
ctx.timerService().registerEventTimeTimer(ctx.timestamp() + TimeUnit.MILLISECONDS.convert(10, TimeUnit.MINUTES));
}
}
#Override
public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) {
state.clear();
}
})
.print()
.setParallelism(1);
environment.execute();

If you have duplicated values on the output of your window, you can add a reduce function into another window after your already defined slide window and that should be engough in most cases. But i think that should be a better solution than this, but we need an example fo your code to work on improvements.
On the other side, if you need to detect non paired events, i think that you need to work with the CoGroup operator, instead of use joins.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight