I am new to Flink. I am doing pattern matching using Apache Flink, where the list of patterns is kept in broadcast state and I iterate through the patterns in the processElement function to find a match. I read these patterns from a database, and it is a one-time activity. Below is my code.
MapState Descriptor and Side output stream as below
public static final MapStateDescriptor<String, String> ruleDescriptor=
new MapStateDescriptor<String, String>("RuleSet", BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO);
public final static OutputTag<Tuple2<String, String>> unMatchedSideOutput =
new OutputTag<Tuple2<String, String>>(
"unmatched-side-output") {
};
Process Function and Broadcast Function as below:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx,Collector<Tuple2<String,String>> out) throws Exception {
for (Map.Entry<String, String> ruleSet: ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
String ruleName = ruleSet.getKey();
//If the rule in ruleset is matched then send output to main stream and break the program
if (this.rule) {
out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
break;
}
}
// Writing output to sideout if no rule is matched
ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}
@Override
public void processBroadcastElement(Tuple2<String, String> ruleSetConditions, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
ctx.getBroadcastState(broadcast.ruleDescriptor).put(ruleSetConditions.f0, ruleSetConditions.f1);
}
Main Function as below
public static void main(String[] args) throws Exception {
//Initiate a datastream environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//Reads incoming data for upstream
DataStream<String> incomingSignal =
env.readTextFile(....);
//Reads the patterns available in configuration file
DataStream<String> rawPatternStream =
env.readTextFile(....);
//Generate a key,value pair of set of patterns where key is pattern name and value is pattern condition
DataStream<Tuple2<String, String>> ruleStream =
rawPatternStream.flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
@Override
public void flatMap(String ruleCondition, Collector<Tuple2<String, String>> out) throws Exception {
String rules[] = ruleCondition.split(",");
out.collect(new Tuple2<>(rules[0], rules[1]));
}
});
//Broadcast the patterns to all the flink operators which will be stored in flink operator memory
BroadcastStream<Tuple2<String, String>>ruleBroadcast = ruleStream.broadcast(ruleDescriptor);
/*Creating keystream based on sourceName as key */
DataStream<Tuple2<String, String>> matchSignal =
incomingSignal.map(new MapFunction<String, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> map(String incomingSignal) throws Exception {
String sourceName = incomingSignal.split(",")[0];
return new Tuple2<>(sourceName, incomingSignal);
}
}).keyBy(0).connect(ruleBroadcast).process(new KeyedBroadCastProcessFunction());
matchSignal.print("RuleDetected=>");
}
I have a couple of questions
1) Currently I am reading rules from a database. How can I update the broadcast state while the Flink job is running in a cluster, and if I get a new set of rules from a Kafka topic, how can I update the broadcast state in the processBroadcastElement method of the KeyedBroadcastProcessFunction?
2) When the broadcast state is updated, do we need to restart the Flink job?
Please help me with the above questions.
The only way to either set or update broadcast state is in the processBroadcastElement method of a BroadcastProcessFunction or KeyedBroadcastProcessFunction. All you need to do is to adapt your application to stream in the rules from a streaming source, rather than reading them once from a file.
Broadcast state is a hash map. If your broadcast stream includes a new key/value pair that uses the same key as an earlier broadcast event, then the new value will replace the old one. Otherwise you'll end up with an entirely new entry.
If you use readFile with FileProcessingMode.PROCESS_CONTINUOUSLY, then every time you modify the file its entire contents will be reingested. You could use that mechanism to update your set of rules.
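As a rough sketch (the file path and the poll interval below are placeholders, not values from your code), the rule source could look like this; every time the file is modified its lines are re-broadcast, and processBroadcastElement simply overwrites the entries in the map, so no job restart is needed:
// Needs org.apache.flink.api.java.io.TextInputFormat and
// org.apache.flink.streaming.api.functions.source.FileProcessingMode.
String ruleFile = "/path/to/rules.csv"; // placeholder path
DataStream<String> rawPatternStream = env.readFile(
new TextInputFormat(new Path(ruleFile)),
ruleFile,
FileProcessingMode.PROCESS_CONTINUOUSLY,
60000L); // re-check the file for changes every 60 seconds
// The rest of the pipeline is unchanged: split each line into a (name, condition) pair and broadcast.
BroadcastStream<Tuple2<String, String>> ruleBroadcast = rawPatternStream
.flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
@Override
public void flatMap(String line, Collector<Tuple2<String, String>> out) {
String[] rule = line.split(",");
out.collect(new Tuple2<>(rule[0], rule[1]));
}
})
.broadcast(ruleDescriptor);
If the rules arrive on a Kafka topic instead, replace the readFile source with a FlinkKafkaConsumer for that topic; the broadcast and processBroadcastElement parts stay exactly the same.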
Related
I want to enrich my 1st stream with the help of the 2nd stream: the flowing records keep joining with the 2nd stream, like a lookup table that I want to keep in memory forever.
Is there any code example or Flink API I could use that fits this use case?
You can find an example of a connected stream with a shared state in the Ververica training page: https://training.ververica.com (Stateful Stream Processing, Slide 13)
public static class ControlFunction extends KeyedCoProcessFunction<String, String, String, String> {
private ValueState<Boolean> blocked;
@Override
public void open(Configuration config) {
blocked = getRuntimeContext().getState(new ValueStateDescriptor<>("blocked", Boolean.class));
}
@Override
public void processElement1(String controlValue, Context context, Collector<String> out) throws Exception {
blocked.update(Boolean.TRUE);
}
@Override
public void processElement2(String dataValue, Context context, Collector<String> out) throws Exception {
if (blocked.value() == null) {
out.collect(dataValue);
}
}
}
public class StreamingJob {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> control = env.fromElements("DROP", "IGNORE").keyBy(x -> x);
DataStream<String> data = env
.fromElements("Flink", "DROP", "Forward", "IGNORE")
.keyBy(x -> x);
control
.connect(data)
.process(new ControlFunction())
.print();
env.execute();
}
}
In your case, you would need to keep the contents of the 2nd stream in the KeyedCoProcessFunction state and have the 1st stream read from that state to join it with its elements. You'll need to think about how to key your streams and what kind of state to use, but that is the main idea.
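For instance, here is a rough sketch of such an enrichment function (the state name and the use of plain Strings are just for illustration): processElement2 stores the latest lookup record for its key, and processElement1 joins incoming events against that state.
public static class EnrichmentFunction extends KeyedCoProcessFunction<String, String, String, Tuple2<String, String>> {
// Latest reference record seen for this key; kept forever, since no TTL is configured.
private ValueState<String> referenceData;
@Override
public void open(Configuration config) {
referenceData = getRuntimeContext().getState(new ValueStateDescriptor<>("reference-data", String.class));
}
// 1st stream: the flowing records to enrich.
@Override
public void processElement1(String event, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
// Emits (event, null) if no reference data has arrived yet for this key.
out.collect(new Tuple2<>(event, referenceData.value()));
}
// 2nd stream: the lookup table; just remember the latest value per key.
@Override
public void processElement2(String reference, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
referenceData.update(reference);
}
}
Wiring it up mirrors the example above: firstStream.keyBy(...).connect(secondStream.keyBy(...)).process(new EnrichmentFunction()).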
I am trying to validate a data stream against a set of rules to detect patterns in Flink, by checking the data stream against a broadcast stream that holds the rules. I use a for loop to collect all the patterns into a map and iterate through it in the processElement function to find a match. Sample code is below.
MapState Descriptor and Side output stream as below
public static final MapStateDescriptor<String, String> ruleSetDescriptor =
new MapStateDescriptor<String, String>("RuleSet", BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO);
public final static OutputTag<Tuple2<String, String>> unMatchedSideOutput =
new OutputTag<Tuple2<String, String>>(
"unmatched-side-output") {
};
Process Function and Broadcast Function as below:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx, Collector<Tuple2<String, String>> out) throws Exception {
for (Map.Entry<String, String> ruleSet:
ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
String ruleName = ruleSet.getKey();
//If the rule in ruleset is matched then send output to main stream and break the program
if (this.rule) {
out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
break;
}
}
// Writing output to sideout if no rule is matched
ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}
@Override
public void processBroadcastElement(Tuple2<String, String> ruleSetConditions, Context ctx, Collector<Tuple2<String,String>> out) throws Exception {
ctx.getBroadcastState(broadcast.ruleSetDescriptor).put(ruleSetConditions.f0,
ruleSetConditions.f1);
}
I am able to detect the pattern, but I am also getting side output, because I iterate over the rules one by one: if the matching rule is the last one in the set, the earlier rules don't match and the record has already been sent to the side output. I want to emit to the side output only once, when none of the rules are satisfied. I am new to Flink; please help me understand how I can achieve this.
It looks to me like you want to do something more like this:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx, Collector<Tuple2<String, String>> out) throws Exception {
boolean matched = false;
for (Map.Entry<String, String> ruleSet:
ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
String ruleName = ruleSet.getKey();
if (this.rule) {
matched = true;
out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
break;
}
}
// Writing output to sideout if no rule was matched
if (!matched) {
ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}
}
I have a collection that represents a data stream, and I am testing StreamingFileSink to write the stream to S3. The program runs successfully, but there is no data at the given S3 path.
public class S3Sink {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.enableCheckpointing(100);
List<String> input = new ArrayList<>();
input.add("test");
DataStream<String> inputStream = see.fromCollection(input);
RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
StreamingFileSink s3Sink = StreamingFileSink.
forRowFormat(new Path("<S3 Path>"),
new SimpleStringEncoder<>("UTF-8"))
.withRollingPolicy(rollingPolicy)
.build();
inputStream.addSink(s3Sink);
see.execute();
}
}
Checkpointing is enabled as well. Any thoughts on why the sink is not working as expected?
UPDATE:
Based on David's answer, I created a custom source that continuously generates random strings, and I expect checkpointing to trigger after the configured interval and write the data to S3.
public class S3SinkCustom {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
see.enableCheckpointing(1000);
DataStream<String> inputStream = see.addSource(new CustomSource());
RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
StreamingFileSink s3Sink = StreamingFileSink.
forRowFormat(new Path("s3://mybucket/data/"),
new SimpleStringEncoder<>("UTF-8"))
.build();
//inputStream.print();
inputStream.addSink(s3Sink);
see.execute();
}
static class CustomSource extends RichSourceFunction<String> {
private volatile boolean running = false;
final String[] strings = {"ABC", "XYZ", "DEF"};
@Override
public void open(Configuration parameters){
running = true;
}
@Override
public void run(SourceContext sourceContext) throws Exception {
while (running) {
Random random = new Random();
int index = random.nextInt(strings.length);
sourceContext.collect(strings[index]);
Thread.sleep(1000);
}
}
@Override
public void cancel() {
running = false;
}
}
}
Still, there is no data in S3, and the Flink process does not even validate whether the given S3 bucket is valid, but it runs without any issues.
Update:
Below are the custom rolling policy details:
public class CustomRollingPolicy implements RollingPolicy<Object, String> {
@Override
public boolean shouldRollOnCheckpoint(PartFileInfo partFileInfo) throws IOException {
return partFileInfo.getSize() > 1;
}
@Override
public boolean shouldRollOnEvent(PartFileInfo partFileInfo, Object o) throws IOException {
return true;
}
@Override
public boolean shouldRollOnProcessingTime(PartFileInfo partFileInfo, long l) throws IOException {
return true;
}
}
I believe the issue is that the job you've written isn't going to run long enough to actually checkpoint, so the output isn't going to be finalized.
Another potential issue is that the StreamingFileSink only works with the Hadoop-based S3 filesystem (and not the one from Presto).
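For example, here is a minimal sketch (the bucket path is a placeholder) with checkpointing enabled and a policy that rolls on every checkpoint, so part files are finalized as soon as a checkpoint completes:
// Needs org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.OnCheckpointRollingPolicy.
see.enableCheckpointing(1000);
StreamingFileSink<String> s3Sink = StreamingFileSink
.forRowFormat(new Path("s3://<bucket>/<path>"), new SimpleStringEncoder<String>("UTF-8"))
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.build();
With a source that emits a single element and then finishes, the job can end before the first checkpoint ever fires, so nothing is committed.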
The above issue was resolved after setting up flink-conf.yaml with the required s3a properties such as fs.s3a.access.key and fs.s3a.secret.key.
We need to let Flink know about the config location as well.
FileSystem.initialize(GlobalConfiguration.loadConfiguration(""));
With these changes, I was able to run the S3 sink locally, and messages persisted to S3 without any issues.
I am trying to calculate the highest count among the hashtags found in a given tumbling window.
For this I do a kind of "word count" for hashtags and sum them up, which works fine. After that, I try to find the hashtag with the highest count in the given window. I use a RichFlatMapFunction for this, with ValueState to save the current maximum count of a single hashtag, but this doesn't work.
I have debugged my code and found that the value of the ValueState "maxVal" is null in every flatMap step, so the update() and value() methods don't work in my scenario.
Did I misunderstand the RichFlatMapFunction or its usage?
Here is my code; everything except the last flatMap function works as expected:
public class TwitterHashtagCount {
public static void main(String args[]) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment().setParallelism(1);
DataStream<String> tweetsRaw = env.addSource(new TwitterSource(TwitterConnection.getTwitterConnectionProperties()));
DataStream<String> tweetsGerman = tweetsRaw.filter(new EnglishLangFilter());
DataStream<Tuple2<String, Integer>> tweetHashtagCount = tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.flatMap(new RichFlatMapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
private transient ValueState<Integer> maxVal;
@Override
public void open(Configuration parameters) throws Exception {
ValueStateDescriptor<Integer> descriptor =
new ValueStateDescriptor<>(
// state name
"max-val",
// type information of state
TypeInformation.of(Integer.class));
maxVal = getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap(Tuple2<String, Integer> value, Collector<Tuple2<String, Integer>> out) throws Exception {
Integer maxCount = maxVal.value();
if(maxCount == null) {
maxCount = 0;
maxVal.update(0);
}
if(value.f1 > maxCount) {
maxVal.update(maxCount);
out.collect(new Tuple2<String, Integer>(value.f0, value.f1));
}
}
});
tweetHashtagCount.print();
env.execute("Twitter Streaming WordCount");
}
}
I'm wondering why the code you've shared runs at all. The result of sum(1) is a non-keyed stream, and the managed state interface you are using expects a keyed stream and keeps a separate instance of the state for each key. I'm surprised you're not getting an error saying "Keyed state can only be used on a 'keyed stream', i.e., after a 'keyBy()' operation."
Since you've previously windowed the stream, if you do key it again (with the same key) before the RichFlatMapFunction, each key will occur once and the maxVal will always be null.
Something like this might do what you intend, if your goal is to find the max across all hashtags in each time window:
tweetsGerman
.flatMap(new TwitterHashtagFlatMap())
.keyBy(0)
.timeWindow(Time.seconds(15))
.sum(1)
.timeWindowAll(Time.seconds(15))
.max(1)
I'm trying to build a sample application using Apache Flink that does the following:
Reads a stream of stock symbols (e.g. 'CSCO', 'FB') from a Kafka queue.
For each symbol performs a real-time lookup of current prices and streams the values for downstream processing.
* Update to original post *
I moved the map function into a separate class and no longer get the runtime error message "The implementation of the MapFunction is not serializable. The object probably contains or references non serializable fields".
The issue I'm facing now is that the Kafka topic "stockprices" I'm trying to write the prices to is not receiving them. I'm trying to troubleshoot and will post any updates.
public class RetrieveStockPrices {
@SuppressWarnings("serial")
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "localhost:9092");
properties.setProperty("zookeeper.connect", "localhost:2181");
properties.setProperty("group.id", "stocks");
DataStream<String> streamOfStockSymbols = streamExecEnv.addSource(new FlinkKafkaConsumer08<String>("stocksymbol", new SimpleStringSchema(), properties));
DataStream<String> stockPrice =
streamOfStockSymbols
//get unique keys
.keyBy(new KeySelector<String, String>() {
@Override
public String getKey(String trend) throws Exception {
return trend;
}
})
//collect events over a window
.window(TumblingEventTimeWindows.of(Time.seconds(60)))
//return the last event from the window...all elements are the same "Symbol"
.apply(new WindowFunction<String, String, String, TimeWindow>() {
@Override
public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) throws Exception {
out.collect(input.iterator().next().toString());
}
})
.map(new StockSymbolToPriceMapFunction());
streamExecEnv.execute("Retrieve Stock Prices");
}
}
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
@Override
public String map(String stockSymbol) throws Exception {
final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
System.out.println("StockSymbolToPriceMapFunction: stockSymbol: " + stockSymbol);
DataStream<String> stockPrices = streamExecEnv.addSource(new LookupStockPrice(stockSymbol));
stockPrices.keyBy(new CustomKeySelector()).addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));
return "100000";
}
private static class CustomKeySelector implements KeySelector<String, String> {
@Override
public String getKey(String arg0) throws Exception {
return arg0.trim();
}
}
}
public class LookupStockPrice extends RichSourceFunction<String> {
public String stockSymbol = null;
public boolean isRunning = true;
public LookupStockPrice(String inSymbol) {
stockSymbol = inSymbol;
}
@Override
public void open(Configuration parameters) throws Exception {
isRunning = true;
}
@Override
public void cancel() {
isRunning = false;
}
@Override
public void run(SourceFunction.SourceContext<String> ctx)
throws Exception {
String stockPrice = "0";
while (isRunning) {
//TODO: query Google Finance API
stockPrice = Integer.toString((new Random()).nextInt(100)+1);
ctx.collect(stockPrice);
Thread.sleep(10000);
}
}
}
A StreamExecutionEnvironment is not intended to be used inside the operators of a streaming application. Not intended means it is neither tested nor encouraged. It might work and do something, but it will most likely not behave well and will probably kill your application.
The StockSymbolToPriceMapFunction in your program specifies, for each incoming record, a completely new and independent streaming application. However, since you do not call streamExecEnv.execute(), those programs are never started and the map method returns without doing anything.
If you did call streamExecEnv.execute(), the function would start a new local Flink cluster in the worker's JVM and run the application on that local cluster. Each local Flink instance takes a lot of heap space, and after a few clusters have been started the worker will probably die with an OutOfMemoryError, which is not what you want to happen.
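Instead, the per-record lookup should happen inside the main job, and the Kafka producer should be attached once as a regular sink of that job. Here is a rough sketch, reusing the classes from your post (with the lookup moved into the map function and the random price standing in for the real service call):
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
private transient Random random;
@Override
public void open(Configuration parameters) {
random = new Random();
}
@Override
public String map(String stockSymbol) throws Exception {
// TODO: query the real price service here; the random value mirrors LookupStockPrice.
String stockPrice = Integer.toString(random.nextInt(100) + 1);
return stockSymbol + "," + stockPrice;
}
}
In main(), the result of the window is then written with a single sink instead of a new environment per record:
.map(new StockSymbolToPriceMapFunction())
.addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));
If the external lookup is slow, Flink's Async I/O (AsyncDataStream with an AsyncFunction) would be the more scalable option.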