Skip message in Kafka serialization schema if any problems occur - apache-flink

I have a simple Apache Flink job that ends with a Kafka sink. I'm using a KafkaRecordSerializationSchema<CustomType> to handle the message from the previous (RichFlatMap) operator:
public final class CustomTypeSerializationSchema implements KafkaRecordSerializationSchema<CustomType> {

    private static final long serialVersionUID = 5743933755381724692L;

    private final String topic;

    public CustomTypeSerializationSchema(final String topic) {
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(final CustomType input, final KafkaSinkContext context,
            final Long timestamp) {
        final var result = new CustomMessage(input);
        try {
            return new ProducerRecord<>(topic,
                    JacksonJsonMapper.writeValueAsString(result).getBytes(StandardCharsets.UTF_8));
        } catch (final Exception e) {
            logger.warn("Unable to serialize message [{}]. This was the reason:", result, e);
        }
        return new ProducerRecord<>(topic, new byte[0]);
    }
}
The problem I'm trying to avoid is sending an "empty" ProducerRecord, like the one returned by default if anything goes wrong inside the try-catch. Basically, I'm looking for behavior similar to KafkaRecordDeserializationSchema, where whatever is put in the collector is what subsequent operators receive, and the rest is discarded.
Is there a way to achieve this with another *SerializationSchema type?
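One possible workaround, sketched here as an assumption rather than a confirmed answer: serialize one operator earlier (for example in a flatMap) and only forward records that serialize successfully, so the sink's schema never has to produce an "empty" record. CustomTypeToJsonFlatMap below is a hypothetical name; CustomType, CustomMessage, and JacksonJsonMapper are the question's own types.
public final class CustomTypeToJsonFlatMap extends RichFlatMapFunction<CustomType, byte[]> {

    private static final long serialVersionUID = 1L;
    private static final Logger logger = LoggerFactory.getLogger(CustomTypeToJsonFlatMap.class);

    @Override
    public void flatMap(final CustomType input, final Collector<byte[]> out) {
        final var result = new CustomMessage(input);
        try {
            // Only records that serialize cleanly are emitted downstream; failures are skipped.
            out.collect(JacksonJsonMapper.writeValueAsString(result).getBytes(StandardCharsets.UTF_8));
        } catch (final Exception e) {
            logger.warn("Unable to serialize message [{}], skipping it.", result, e);
        }
    }
}
With that in place, the sink's KafkaRecordSerializationSchema only has to wrap an already-valid byte[] payload into a ProducerRecord.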

Related

Apache Flink side-output not outputting expected results when order of processors swapped in the original stream

I have a small Flink app:
public class App {
    public static final OutputTag<String> numberOutputTag = new OutputTag<String>("side-output") {
    };

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStreamSource<String> text = env.fromElements(
                "abc,123"
        );

        // Router will split input on commas and redirect number strings to the side output
        SingleOutputStreamOperator<String> ingestStream = text
                .process(new RouterProcessor())
                .process(new UppercaseProcessor());

        DataStream<String> numberStream = ingestStream.getSideOutput(numberOutputTag)
                // Prepends a "$" to the values.
                .map(new MoneyMapper());

        numberStream.print();
        ingestStream.print();

        env.execute();
    }
}

class RouterProcessor extends ProcessFunction<String, String> {
    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        String[] tokens = value.split(",");
        for (String token : tokens) {
            if (token.matches("[0-9]+")) {
                ctx.output(App.numberOutputTag, token);
            } else {
                out.collect(token);
            }
        }
    }
}

class MoneyMapper implements MapFunction<String, String> {
    @Override
    public String map(String t) throws Exception {
        return "$" + t;
    }
}

class UppercaseProcessor extends ProcessFunction<String, String> {
    @Override
    public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
        out.collect(value.toUpperCase());
    }
}
I'd expect it to output something similar to:
18> ABC
18> $123
However, it only outputs:
10> ABC
If I swap the order of the processors to:
.process(new UppercaseProcessor())
.process(new RouterProcessor())
everything works as expected.
I've read the documentation but I don't see anything that would explain why this is as it is. I'm curious if I'm missing something or doing something wrong.
I've included a GitHub gist here for easier viewing with all the supporting files: https://gist.github.com/baelec/95f41d875dda0a2806a0fb9b9313b90e
Here is a repo if you'd prefer to download the sample project: https://github.com/baelec/flink_sample_broken_0
EDIT: I see that StackOverflow asks us to avoid comments like "Thanks!" but I don't have enough rep to visibly upvote the responses so thanks David and Jaya for your help. I had made some incorrect assumptions regarding side outputs. I appreciate the clarification.
The problem is that you are taking the side output from the UppercaseProcessor, which doesn't use a side output.
It's easier to see what's wrong if you look at the job graph.
If you rearrange the code to be like this:
SingleOutputStreamOperator<String> ingestStream = text
        .process(new RouterProcessor());

DataStream<String> numberStream = ingestStream.getSideOutput(numberOutputTag)
        .map(new MoneyMapper());
numberStream.print();

ingestStream
        .process(new UppercaseProcessor())
        .print();
then it works as you expected, and the job graph changes accordingly.
The numberOutputTag side output is emitted inside RouterProcessor, so you need to extract it from the SingleOutputStreamOperator returned by the RouterProcessor process call. In your code, however, the side output is extracted after the UppercaseProcessor.
Change it to something like this:
SingleOutputStreamOperator<String> tempStream = text.process(new RouterProcessor());
SingleOutputStreamOperator<String> ingestStream = tempStream.process(new UppercaseProcessor());
DataStream<String> numberStream = tempStream.getSideOutput(numberOutputTag).map(new MoneyMapper());
numberStream.print();
ingestStream.print();
Note the use of the tempStream variable in the example above.

How to control flink sending output to sideoutput in keyedbroadcastprocessfunction

I am trying to validate a data stream against a set of rules to detect patterns in Flink. The rules arrive on a broadcast stream, and in the processElement function I use a for loop to iterate over the rule map from the broadcast state and look for a matching pattern. Sample code is below.
The MapStateDescriptor and side output tag are defined as follows:
public static final MapStateDescriptor<String, String> ruleSetDescriptor =
        new MapStateDescriptor<String, String>("RuleSet", BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO);

public final static OutputTag<Tuple2<String, String>> unMatchedSideOutput =
        new OutputTag<Tuple2<String, String>>("unmatched-side-output") {
};
The process function and broadcast function are as follows:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx,
        Collector<Tuple2<String, String>> out) throws Exception {
    for (Map.Entry<String, String> ruleSet :
            ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
        String ruleName = ruleSet.getKey();
        // If the rule in ruleset is matched then send output to main stream and break the program
        if (this.rule) {
            out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
            break;
        }
    }
    // Writing output to sideout if no rule is matched
    ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}

@Override
public void processBroadcastElement(Tuple2<String, String> ruleSetConditions, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
    ctx.getBroadcastState(broadcast.ruleSetDescriptor).put(ruleSetConditions.f0,
            ruleSetConditions.f1);
}
I am able to detect the pattern, but I am also getting side output. Because I iterate over the rules one by one, if the matching rule happens to be last, the element is still sent to the side output, since the earlier rules don't match. I want to emit to the side output only once, and only if none of the rules are satisfied. I am new to Flink; please help me achieve this.
It looks to me like you want to do something more like this:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx, Collector<Tuple2<String, String>> out) throws Exception {
    boolean matched = false;
    for (Map.Entry<String, String> ruleSet :
            ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
        String ruleName = ruleSet.getKey();
        if (this.rule) {
            matched = true;
            out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
            break;
        }
    }
    // Writing output to sideout if no rule was matched
    if (!matched) {
        ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
    }
}

How to update the Broadcast state in KeyedBroadcastProcessFunction in flink?

I am new to Flink. I am doing pattern matching with Apache Flink, where the list of patterns is kept in broadcast state and the processElement function iterates through those patterns to find a match. The patterns are read from a database as a one-time activity. Below is my code.
The MapStateDescriptor and side output tag are defined as follows:
public static final MapStateDescriptor<String, String> ruleDescriptor =
        new MapStateDescriptor<String, String>("RuleSet", BasicTypeInfo.STRING_TYPE_INFO,
                BasicTypeInfo.STRING_TYPE_INFO);

public final static OutputTag<Tuple2<String, String>> unMatchedSideOutput =
        new OutputTag<Tuple2<String, String>>("unmatched-side-output") {
};
The process function and broadcast function are as follows:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx, Collector<Tuple2<String, String>> out) throws Exception {
    for (Map.Entry<String, String> ruleSet : ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
        String ruleName = ruleSet.getKey();
        // If the rule in ruleset is matched then send output to main stream and break the program
        if (this.rule) {
            out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
            break;
        }
    }
    // Writing output to sideout if no rule is matched
    ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}

@Override
public void processBroadcastElement(Tuple2<String, String> ruleSetConditions, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
    ctx.getBroadcastState(broadcast.ruleDescriptor).put(ruleSetConditions.f0,
            ruleSetConditions.f1);
}
The main function is as follows:
public static void main(String[] args) throws Exception {
    // Initiate a datastream environment
    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Reads incoming data for upstream
    DataStream<String> incomingSignal =
            env.readTextFile(....);

    // Reads the patterns available in configuration file
    DataStream<String> rawPatternStream =
            env.readTextFile(....);

    // Generate a key,value pair of set of patterns where key is pattern name and value is pattern condition
    DataStream<Tuple2<String, String>> ruleStream =
            rawPatternStream.flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
                @Override
                public void flatMap(String ruleCondition, Collector<Tuple2<String, String>> out) throws Exception {
                    String[] rules = ruleCondition.split(",");
                    out.collect(new Tuple2<>(rules[0], rules[1]));
                }
            });

    // Broadcast the patterns to all the flink operators which will be stored in flink operator memory
    BroadcastStream<Tuple2<String, String>> ruleBroadcast = ruleStream.broadcast(ruleDescriptor);

    /* Creating keystream based on sourceName as key */
    DataStream<Tuple2<String, String>> matchSignal =
            incomingSignal.map(new MapFunction<String, Tuple2<String, String>>() {
                @Override
                public Tuple2<String, String> map(String incomingSignal) throws Exception {
                    String sourceName = incomingSignal.split(",")[0];
                    return new Tuple2<>(sourceName, incomingSignal);
                }
            }).keyBy(0).connect(ruleBroadcast).process(new KeyedBroadCastProcessFunction());

    matchSignal.print("RuleDetected=>");
}
I have a couple of questions:
1) Currently I am reading the rules from a database. How can I update the broadcast state while the Flink job is running in the cluster, and if I get a new set of rules from a Kafka topic, how can I update the broadcast state in the processBroadcastElement method of the KeyedBroadcastProcessFunction?
2) When the broadcast state is updated, do we need to restart the Flink job?
Please help me with the above questions.
The only way to either set or update broadcast state is in the processBroadcastElement method of a BroadcastProcessFunction or KeyedBroadcastProcessFunction. All you need to do is to adapt your application to stream in the rules from a streaming source, rather than reading them once from a file.
Broadcast state is a hash map. If your broadcast stream includes a new key/value pair that uses the same key as an earlier broadcast event, then the new value will replace the old one. Otherwise you'll end up with an entirely new entry.
If you use readFile with FileProcessingMode.PROCESS_CONTINUOUSLY, then every time you modify the file its entire contents will be reingested. You could use that mechanism to update your set of rules.
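As a hedged sketch of that mechanism (the file path and scan interval below are placeholders, not from the answer), the continuous-reading variant could look roughly like this:
// Re-read the rule file whenever it changes; every modification re-ingests the whole file.
TextInputFormat ruleFormat = new TextInputFormat(new Path("file:///path/to/rules.csv"));
DataStream<String> rawPatternStream = env.readFile(
        ruleFormat,
        "file:///path/to/rules.csv",
        FileProcessingMode.PROCESS_CONTINUOUSLY,
        60_000L);  // check the file for changes every 60 seconds
The resulting stream can feed the same flatMap and broadcast as before, so new or changed rules replace the existing entries in the broadcast state.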

Flink streaming example that generates its own data

Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However, I would like to ask for a more ‘streaming’ example where we generate an input value every second. This would ideally be random, but even just the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to show this by generating data externally and writing it to Kafka, or by listening to a public source, but I am trying to solve it with minimal dependencies (similar to starting with GenerateFlowFile in NiFi).
Here's an example. This was constructed as an example of how to make your sources and sinks pluggable. The idea is that in development you might use a random source and print the results, for tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
/*
 * Example showing how to make sources and sinks pluggable in your application code so
 * you can inject special test sources and test sinks in your tests.
 */
public class TestableStreamingJob {
    private SourceFunction<Long> source;
    private SinkFunction<Long> sink;

    public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
        this.source = source;
        this.sink = sink;
    }

    public void execute() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Long> longStream =
                env.addSource(source)
                        .returns(TypeInformation.of(Long.class));

        longStream
                .map(new IncrementMapFunction())
                .addSink(sink);

        env.execute();
    }

    public static void main(String[] args) throws Exception {
        TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
        job.execute();
    }

    // While it's tempting for something this simple, avoid using anonymous classes or lambdas
    // for any business logic you might want to unit test.
    // Declared static so it can be serialized without dragging in the enclosing job instance.
    public static class IncrementMapFunction implements MapFunction<Long, Long> {
        @Override
        public Long map(Long record) throws Exception {
            return record + 1;
        }
    }
}
Here's the RandomLongSource:
public class RandomLongSource extends RichParallelSourceFunction<Long> {
    private volatile boolean cancelled = false;
    private Random random;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        random = new Random();
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        while (!cancelled) {
            Long nextLong = random.nextLong();
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(nextLong);
            }
        }
    }

    @Override
    public void cancel() {
        cancelled = true;
    }
}
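To show why the pluggable design pays off, here is a hedged sketch (not part of the original answer; FixedLongSource, CollectSink, and the test class name are made up for illustration) of a JUnit-style test that swaps in a bounded source and a list-collecting sink:
public class TestableStreamingJobTest {

    // Bounded source that emits a fixed set of values and then finishes.
    private static class FixedLongSource implements SourceFunction<Long> {
        @Override
        public void run(SourceContext<Long> ctx) {
            for (long value : new long[] {1L, 2L, 3L}) {
                ctx.collect(value);
            }
        }

        @Override
        public void cancel() {
        }
    }

    // Sink that collects everything it receives into a shared list.
    private static class CollectSink implements SinkFunction<Long> {
        static final List<Long> VALUES = Collections.synchronizedList(new ArrayList<>());

        @Override
        public void invoke(Long value, Context context) {
            VALUES.add(value);
        }
    }

    @Test
    public void incrementsEveryElement() throws Exception {
        CollectSink.VALUES.clear();
        new TestableStreamingJob(new FixedLongSource(), new CollectSink()).execute();
        // Sort before asserting, since subtask interleaving may change the arrival order.
        assertEquals(Arrays.asList(2L, 3L, 4L),
                CollectSink.VALUES.stream().sorted().collect(Collectors.toList()));
    }
}
The same job class runs unchanged; only the injected source and sink differ between the test and main().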

Apache Flink - use values from a data stream to dynamically create a streaming data source

I'm trying to build a sample application using Apache Flink that does the following:
Reads a stream of stock symbols (e.g. 'CSCO', 'FB') from a Kafka queue.
For each symbol performs a real-time lookup of current prices and streams the values for downstream processing.
* Update to original post *
I moved the map function into a separate class and no longer get the run-time error message "The implementation of the MapFunction is not serializable. The object probably contains or references non-serializable fields".
The issue I'm facing now is that the Kafka topic "stockprices" I'm trying to write the prices to is not receiving them. I'm trying to troubleshoot and will post any updates.
public class RetrieveStockPrices {
    @SuppressWarnings("serial")
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "stocks");

        DataStream<String> streamOfStockSymbols = streamExecEnv.addSource(new FlinkKafkaConsumer08<String>("stocksymbol", new SimpleStringSchema(), properties));

        DataStream<String> stockPrice =
                streamOfStockSymbols
                        // get unique keys
                        .keyBy(new KeySelector<String, String>() {
                            @Override
                            public String getKey(String trend) throws Exception {
                                return trend;
                            }
                        })
                        // collect events over a window
                        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
                        // return the last event from the window...all elements are the same "Symbol"
                        .apply(new WindowFunction<String, String, String, TimeWindow>() {
                            @Override
                            public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) throws Exception {
                                out.collect(input.iterator().next().toString());
                            }
                        })
                        .map(new StockSymbolToPriceMapFunction());

        streamExecEnv.execute("Retrieve Stock Prices");
    }
}
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
    @Override
    public String map(String stockSymbol) throws Exception {
        final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
        System.out.println("StockSymbolToPriceMapFunction: stockSymbol: " + stockSymbol);

        DataStream<String> stockPrices = streamExecEnv.addSource(new LookupStockPrice(stockSymbol));
        stockPrices.keyBy(new CustomKeySelector()).addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));

        return "100000";
    }

    private static class CustomKeySelector implements KeySelector<String, String> {
        @Override
        public String getKey(String arg0) throws Exception {
            return arg0.trim();
        }
    }
}
public class LookupStockPrice extends RichSourceFunction<String> {
    public String stockSymbol = null;
    public boolean isRunning = true;

    public LookupStockPrice(String inSymbol) {
        stockSymbol = inSymbol;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        isRunning = true;
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    @Override
    public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
        String stockPrice = "0";
        while (isRunning) {
            //TODO: query Google Finance API
            stockPrice = Integer.toString((new Random()).nextInt(100) + 1);
            ctx.collect(stockPrice);
            Thread.sleep(10000);
        }
    }
}
StreamExecutionEnvironment is not intended to be used inside the operators of a streaming application. "Not intended" means this is neither tested nor encouraged. It might work and do something, but it will most likely not behave well and will probably kill your application.
The StockSymbolToPriceMapFunction in your program specifies, for each incoming record, a completely new and independent streaming application. However, since you do not call streamExecEnv.execute(), those programs are never started and the map method returns without doing anything.
If you did call streamExecEnv.execute(), the function would start a new local Flink cluster in the worker's JVM and run the application on that local cluster. Each local Flink instance takes a lot of heap space, and after a few clusters have been started the worker will probably die of an OutOfMemoryError, which is not what you want to happen.
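A hedged sketch of an alternative (not spelled out in the answer above): do the price lookup directly inside the map function and let the surrounding job attach a single Kafka sink, instead of building a nested streaming application per record. The random price below stands in for a real API call.
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
    private transient Random random;

    @Override
    public void open(Configuration parameters) {
        random = new Random();
    }

    @Override
    public String map(String stockSymbol) throws Exception {
        // TODO: replace with a real price lookup (e.g. an HTTP request to a quote service)
        String price = Integer.toString(random.nextInt(100) + 1);
        return stockSymbol + "," + price;
    }
}
The resulting "symbol,price" stream can then be written once to the "stockprices" topic by a FlinkKafkaProducer08 sink added in main().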
