Earlier I asked about a simple hello world example for Flink. This gave me some good examples!
However, I would like to ask for a more ‘streaming’ example where an input value is generated every second. Ideally the value would be random, but even the same value each time would be fine.
The objective is to get a stream that ‘moves’ with no/minimal external touch.
Hence my question:
How to show Flink actually streaming data without external dependencies?
I found how to do this by generating data externally and writing it to Kafka, or by listening to a public source. However, I am trying to solve it with minimal dependencies (like starting with GenerateFlowFile in NiFi).
Here's an example. It was constructed to show how to make your sources and sinks pluggable. The idea is that in development you might use a random source and print the results, in tests you might use a hardwired list of input events and collect the results in a list, and in production you'd use the real sources and sinks.
Here's the job:
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.PrintSinkFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

/*
 * Example showing how to make sources and sinks pluggable in your application code so
 * you can inject special test sources and test sinks in your tests.
 */
public class TestableStreamingJob {
    private final SourceFunction<Long> source;
    private final SinkFunction<Long> sink;

    public TestableStreamingJob(SourceFunction<Long> source, SinkFunction<Long> sink) {
        this.source = source;
        this.sink = sink;
    }

    public void execute() throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Long> longStream =
                env.addSource(source)
                        .returns(TypeInformation.of(Long.class));
        longStream
                .map(new IncrementMapFunction())
                .addSink(sink);
        env.execute();
    }

    public static void main(String[] args) throws Exception {
        TestableStreamingJob job = new TestableStreamingJob(new RandomLongSource(), new PrintSinkFunction<>());
        job.execute();
    }

    // While it's tempting for something this simple, avoid using anonymous classes or lambdas
    // for any business logic you might want to unit test.
    // Declared static so it does not hold a reference to the enclosing job instance.
    public static class IncrementMapFunction implements MapFunction<Long, Long> {
        @Override
        public Long map(Long record) throws Exception {
            return record + 1;
        }
    }
}
Here's the RandomLongSource:
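import java.util.Random;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class RandomLongSource extends RichParallelSourceFunction<Long> {
    private volatile boolean cancelled = false;
    private Random random;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        random = new Random();
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        // Emits values as fast as possible. For one value per second,
        // add Thread.sleep(1000) in this loop, outside the synchronized block.
        while (!cancelled) {
            Long nextLong = random.nextLong();
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(nextLong);
            }
        }
    }

    @Override
    public void cancel() {
        cancelled = true;
    }
}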
Related
I want to enrich my 1st stream using the 2nd stream: incoming records from the 1st stream should keep being joined against the 2nd stream, which acts as a lookup that I want to keep in memory forever, like a table.
Is there a code example or a Flink API that fits this use case?
You can find an example of a connected stream with a shared state in the Ververica training page: https://training.ververica.com (Stateful Stream Processing, Slide 13)
public static class ControlFunction extends KeyedCoProcessFunction<String, String, String, String> {
    private ValueState<Boolean> blocked;

    @Override
    public void open(Configuration config) {
        blocked = getRuntimeContext().getState(new ValueStateDescriptor<>("blocked", Boolean.class));
    }

    @Override
    public void processElement1(String controlValue, Context context, Collector<String> out) throws Exception {
        blocked.update(Boolean.TRUE);
    }

    @Override
    public void processElement2(String dataValue, Context context, Collector<String> out) throws Exception {
        if (blocked.value() == null) {
            out.collect(dataValue);
        }
    }
}

public class StreamingJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> control = env.fromElements("DROP", "IGNORE").keyBy(x -> x);
        DataStream<String> data = env
                .fromElements("Flink", "DROP", "Forward", "IGNORE")
                .keyBy(x -> x);
        control
                .connect(data)
                .process(new ControlFunction())
                .print();
        env.execute();
    }
}
In your case, you would need to keep the contents of the 2nd stream in the KeyedCoProcessFunction state and have the 1st stream read from that state to join it with its elements. You'll need to think about how to key your streams and what kind of state to use, but that is the main idea.
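The idea can be sketched without Flink at all. The following plain-Java class is only a model of what the KeyedCoProcessFunction would do: the HashMap stands in for Flink keyed state, and the names EnrichmentSketch, onReference, and onEvent are hypothetical. It assumes that events arriving before their reference data are emitted with a placeholder, which is a design choice you may want to change (for example by buffering such events in state).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the enrichment pattern; not Flink API.
// In a real job, `lookup` would be keyed state (e.g. ValueState or MapState)
// inside a KeyedCoProcessFunction, and both streams would be keyed by the same field.
class EnrichmentSketch {
    // Stands in for keyed state: latest reference value per key.
    private final Map<String, String> lookup = new HashMap<>();
    // Stands in for the Collector: enriched records emitted downstream.
    private final List<String> output = new ArrayList<>();

    // Corresponds to the processElement method for the 2nd (lookup) stream.
    void onReference(String key, String value) {
        lookup.put(key, value);
    }

    // Corresponds to the processElement method for the 1st (data) stream.
    void onEvent(String key, String payload) {
        String reference = lookup.get(key);
        // Assumption: events with no reference data yet are emitted with "unknown".
        output.add(key + ":" + payload + ":" + (reference == null ? "unknown" : reference));
    }

    List<String> output() {
        return output;
    }

    public static void main(String[] args) {
        EnrichmentSketch sketch = new EnrichmentSketch();
        sketch.onEvent("u1", "click");       // arrives before its reference data
        sketch.onReference("u1", "Alice");
        sketch.onEvent("u1", "purchase");    // enriched with "Alice"
        sketch.output().forEach(System.out::println);
    }
}
```

Because the lookup is updated on every reference record, later events for the same key always see the most recent value, which is the "table kept in memory" behavior described in the question.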
I have a simple Apache Flink job that ends with a Kafka sink. I'm using a KafkaRecordSerializationSchema<CustomType> to handle the message from the previous (RichFlatMap) operator:
public final class CustomTypeSerializationSchema implements KafkaRecordSerializationSchema<CustomType> {
    private static final long serialVersionUID = 5743933755381724692L;
    private final String topic;

    public CustomTypeSerializationSchema(final String topic) {
        this.topic = topic;
    }

    @Override
    public ProducerRecord<byte[], byte[]> serialize(final CustomType input, final KafkaSinkContext context,
            final Long timestamp) {
        final var result = new CustomMessage(input);
        try {
            return new ProducerRecord<>(topic,
                    JacksonJsonMapper.writeValueAsString(result).getBytes(StandardCharsets.UTF_8));
        } catch (final Exception e) {
            logger.warn("Unable to serialize message [{}]. This was the reason:", result, e);
        }
        return new ProducerRecord<>(topic, new byte[0]);
    }
}
What I'm trying to avoid is sending an "empty" ProducerRecord, like the fallback one that is returned when something fails inside the try-catch. Basically, I'm looking for behavior similar to KafkaRecordDeserializationSchema, where whatever is put in the collector is what subsequent operators receive, and the rest is discarded.
Is there a way to achieve this with another *SerializationSchema type?
I have a Flink job with parallelism 5 (for now!). One of the RichFlatMap operators opens a file in its open(Configuration parameters) method. The flatMap operation itself does not open anything; it just reads the file to search for something. (There is a utility class with a method like utilityClass.searchText("abc").) Here is the boilerplate code:
public class MyFlatMap extends RichFlatMapFunction<...> {
    private MyUtilityFile myFile;

    @Override
    public void open(Configuration parameters) throws Exception {
        myFile.Open("fileLocation");
    }

    @Override
    public void flatMap(...) throws Exception {
        String text = myFile.searchText("abc");
        if (text != null) {
            // take an action
        } else {
            // another action
        }
    }
}
This file is updated by a Python script every day at a specific time, so the flatMap operator also needs to open the newly created file.
I thought this could be done with a ScheduledExecutorService backed by a single-thread pool.
I cannot open the file on every flatMap call because it is big.
Here is the boilerplate code I am trying to write:
public class MyFlatMap extends RichFlatMapFunction<...> implements Runnable {
    private final ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(1);
    private MyUtilityFile myFile;

    @Override
    public void run() {
        myFile.Open("fileLocation");
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        scheduler.scheduleAtFixedRate(this, 1, 1, TimeUnit.HOURS);
        myFile.Open("fileLocation");
    }

    @Override
    public void flatMap(...) throws Exception {
        String text = myFile.searchText("abc");
        if (text != null) {
            // take an action
        } else {
            // another action
        }
    }
}
Is this boilerplate okay in a Flink environment? If not, how can I open the file on a schedule? (Options such as "send an event through Kafka after updating the file and read the event in Flink" are not available.)
Perhaps you can directly implement the ProcessingTimeCallback interface, which supports timer operations:
public class MyFlatMap extends RichFlatMapFunction<...> implements ProcessingTimeCallback {
    private MyUtilityFile myFile;

    @Override
    public void open(Configuration parameters) throws Exception {
        // Assumes getProcessingTimeService() is reachable from this class; Flink exposes it
        // on stream operators (e.g. AbstractStreamOperator) rather than on rich functions.
        final long now = getProcessingTimeService().getCurrentProcessingTime();
        getProcessingTimeService().registerTimer(now + 3600000, this);
        myFile.Open("fileLocation");
    }

    @Override
    public void flatMap(...) throws Exception {
        String text = myFile.searchText("abc");
        if (text != null) {
            // take an action
        } else {
            // another action
        }
    }

    @Override
    public void onProcessingTime(long timestamp) throws Exception {
        // Reopen the file, then register the next timer one hour from now.
        myFile.Open("fileLocation");
        final long now = getProcessingTimeService().getCurrentProcessingTime();
        getProcessingTimeService().registerTimer(now + 3600000, this);
    }
}
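If you stay with a ScheduledExecutorService, the reload runs on a different thread than flatMap, so it is safer to load the new contents completely and then swap them in atomically instead of reopening the handle in place. Below is a minimal plain-Java sketch of that swap. It assumes the file can be loaded into an immutable in-memory set; ReloadingLookup and its loader are hypothetical names, and no Flink API is involved.

```java
import java.util.Set;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

// Plain-Java sketch; in a Flink job start() would be called from open()
// and close() from the rich function's close().
class ReloadingLookup implements AutoCloseable {
    private final Supplier<Set<String>> loader; // hypothetical: reads the file into a set
    private final AtomicReference<Set<String>> current = new AtomicReference<>();
    private ScheduledExecutorService scheduler;

    ReloadingLookup(Supplier<Set<String>> loader) {
        this.loader = loader;
        reload(); // initial load
    }

    // Load a fresh snapshot, then publish it in one atomic step.
    void reload() {
        current.set(Set.copyOf(loader.get()));
    }

    void start(long period, TimeUnit unit) {
        scheduler = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "lookup-reloader");
            t.setDaemon(true);
            return t;
        });
        scheduler.scheduleAtFixedRate(() -> {
            try {
                reload();
            } catch (RuntimeException e) {
                // Keep serving the old snapshot; an uncaught exception
                // would cancel all further scheduled runs.
            }
        }, period, period, unit);
    }

    // Called from the processing thread; always sees a complete snapshot.
    boolean contains(String text) {
        return current.get().contains(text);
    }

    @Override
    public void close() {
        if (scheduler != null) {
            scheduler.shutdownNow();
        }
    }
}
```

With parallelism 5, each parallel instance would hold its own ReloadingLookup, so the file is loaded five times per refresh. Shutting the scheduler down in close() matters because otherwise the reload thread can outlive a cancelled or restarted job.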
I have a collection that represents a data stream, and I am testing StreamingFileSink to write the stream to S3. The program runs successfully, but there is no data in the given S3 path.
public class S3Sink {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        see.enableCheckpointing(100);
        List<String> input = new ArrayList<>();
        input.add("test");
        DataStream<String> inputStream = see.fromCollection(input);
        RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
        StreamingFileSink s3Sink = StreamingFileSink
                .forRowFormat(new Path("<S3 Path>"),
                        new SimpleStringEncoder<>("UTF-8"))
                .withRollingPolicy(rollingPolicy)
                .build();
        inputStream.addSink(s3Sink);
        see.execute();
    }
}
Checkpointing is enabled as well. Any thoughts on why the sink is not working as expected?
UPDATE:
Based on David's answer, I created a custom source that generates random strings continuously, and I am expecting checkpointing to trigger after the configured interval and write the data to S3.
public class S3SinkCustom {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        see.enableCheckpointing(1000);
        DataStream<String> inputStream = see.addSource(new CustomSource());
        RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();
        StreamingFileSink s3Sink = StreamingFileSink
                .forRowFormat(new Path("s3://mybucket/data/"),
                        new SimpleStringEncoder<>("UTF-8"))
                .build();
        //inputStream.print();
        inputStream.addSink(s3Sink);
        see.execute();
    }

    static class CustomSource extends RichSourceFunction<String> {
        private volatile boolean running = false;
        final String[] strings = {"ABC", "XYZ", "DEF"};

        @Override
        public void open(Configuration parameters) {
            running = true;
        }

        @Override
        public void run(SourceContext<String> sourceContext) throws Exception {
            Random random = new Random();
            // Emit one randomly chosen string per second until cancelled.
            while (running) {
                int index = random.nextInt(strings.length);
                sourceContext.collect(strings[index]);
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }
}
Still, there is no data in S3, and the Flink process does not even validate whether the given S3 bucket exists, yet the process runs without any issues.
Update:
Below are the details of the custom rolling policy:
public class CustomRollingPolicy implements RollingPolicy<Object, String> {
    @Override
    public boolean shouldRollOnCheckpoint(PartFileInfo<String> partFileInfo) throws IOException {
        return partFileInfo.getSize() > 1;
    }

    @Override
    public boolean shouldRollOnEvent(PartFileInfo<String> partFileInfo, Object o) throws IOException {
        return true;
    }

    @Override
    public boolean shouldRollOnProcessingTime(PartFileInfo<String> partFileInfo, long l) throws IOException {
        return true;
    }
}
I believe the issue is that the job you've written isn't going to run long enough to actually checkpoint, so the output isn't going to be finalized.
Another potential issue is that the StreamingFileSink only works with the Hadoop-based S3 filesystem (and not the one from Presto).
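To see why a short-lived job leaves nothing behind, it helps to model the part-file lifecycle that the Flink documentation describes for file sinks: a part file is in-progress while it is being written, becomes pending once the rolling policy closes it, and is finished (visible as final output) only after a checkpoint completes. The following is a plain-Java toy model of that state machine, not Flink code. PartFileModel and its methods are hypothetical names, and the model deliberately ignores buckets and failure recovery.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the in-progress -> pending -> finished lifecycle; not Flink API.
class PartFileModel {
    private final StringBuilder inProgress = new StringBuilder();
    private final List<String> pending = new ArrayList<>();
    private final List<String> finished = new ArrayList<>();

    // A record is appended to the current in-progress part.
    void write(String record) {
        inProgress.append(record).append('\n');
    }

    // The rolling policy closes the in-progress part; it is now pending.
    void roll() {
        if (inProgress.length() > 0) {
            pending.add(inProgress.toString());
            inProgress.setLength(0);
        }
    }

    // Only a completed checkpoint turns pending parts into finished files.
    void checkpointCompleted() {
        finished.addAll(pending);
        pending.clear();
    }

    List<String> finishedFiles() {
        return finished;
    }

    public static void main(String[] args) {
        PartFileModel sink = new PartFileModel();
        sink.write("test");
        sink.roll();
        // If the job ended here, without a checkpoint, nothing would be finalized.
        System.out.println("finished before checkpoint: " + sink.finishedFiles().size());
        sink.checkpointCompleted();
        System.out.println("finished after checkpoint: " + sink.finishedFiles().size());
    }
}
```

Running main prints a count of 0 before the checkpoint and 1 after it, which mirrors the situation in the question: a job over a one-element collection can finish before any checkpoint completes.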
The above issue was resolved after setting up flink-conf.yaml with the required s3a properties, such as fs.s3a.access.key and fs.s3a.secret.key.
We also need to let Flink know the config location:
FileSystem.initialize(GlobalConfiguration.loadConfiguration(""));
With these changes, I was able to run S3 sink from local and messages persisted to S3 without any issues.
I'm trying to build a sample application using Apache Flink that does the following:
Reads a stream of stock symbols (e.g. 'CSCO', 'FB') from a Kafka queue.
For each symbol performs a real-time lookup of current prices and streams the values for downstream processing.
* Update to original post *
I moved the map function into a separate class and no longer get the run-time error message "The implementation of the MapFunction is not serializable any more. The object probably contains or references non serializable fields".
The issue I'm facing now is that the Kafka topic "stockprices" I'm trying to write the prices to is not receiving them. I'm trying to trouble-shoot and will post any updates.
public class RetrieveStockPrices {
    @SuppressWarnings("serial")
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "stocks");
        DataStream<String> streamOfStockSymbols = streamExecEnv.addSource(new FlinkKafkaConsumer08<String>("stocksymbol", new SimpleStringSchema(), properties));
        DataStream<String> stockPrice =
                streamOfStockSymbols
                        //get unique keys
                        .keyBy(new KeySelector<String, String>() {
                            @Override
                            public String getKey(String trend) throws Exception {
                                return trend;
                            }
                        })
                        //collect events over a window
                        .window(TumblingEventTimeWindows.of(Time.seconds(60)))
                        //return the last event from the window...all elements are the same "Symbol"
                        .apply(new WindowFunction<String, String, String, TimeWindow>() {
                            @Override
                            public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) throws Exception {
                                out.collect(input.iterator().next().toString());
                            }
                        })
                        .map(new StockSymbolToPriceMapFunction());
        streamExecEnv.execute("Retrieve Stock Prices");
    }
}
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
    @Override
    public String map(String stockSymbol) throws Exception {
        final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
        System.out.println("StockSymbolToPriceMapFunction: stockSymbol: " + stockSymbol);
        DataStream<String> stockPrices = streamExecEnv.addSource(new LookupStockPrice(stockSymbol));
        stockPrices.keyBy(new CustomKeySelector()).addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));
        return "100000";
    }

    private static class CustomKeySelector implements KeySelector<String, String> {
        @Override
        public String getKey(String arg0) throws Exception {
            return arg0.trim();
        }
    }
}
public class LookupStockPrice extends RichSourceFunction<String> {
    public String stockSymbol = null;
    // volatile so that cancel(), called from another thread, is seen by run()
    public volatile boolean isRunning = true;

    public LookupStockPrice(String inSymbol) {
        stockSymbol = inSymbol;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        isRunning = true;
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    @Override
    public void run(SourceFunction.SourceContext<String> ctx)
            throws Exception {
        String stockPrice = "0";
        while (isRunning) {
            //TODO: query Google Finance API
            stockPrice = Integer.toString((new Random()).nextInt(100) + 1);
            ctx.collect(stockPrice);
            Thread.sleep(10000);
        }
    }
}
StreamExecutionEnvironments are not intended to be used inside the operators of a streaming application. "Not intended" means that this is neither tested nor encouraged. It might work and do something, but it will most likely not behave well and will probably kill your application.
The StockSymbolToPriceMapFunction in your program defines a completely new and independent streaming application for each incoming record. However, since you do not call streamExecEnv.execute(), these programs are never started and the map method returns without doing anything.
If you did call streamExecEnv.execute(), the function would start a new local Flink cluster in the worker's JVM and run the application on it. Each local Flink instance takes a lot of heap space, so after a few clusters have been started the worker would probably die with an OutOfMemoryError, which is not what you want.
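A more conventional structure is to do the lookup directly inside the map function, so that each symbol is turned into a price record by an ordinary method call and the result flows to a sink attached in the main program. The sketch below shows only that per-record logic in plain Java. PriceLookupSketch and lookupPrice are hypothetical names, and the random price is a stand-in for a real price service, as in the question's TODO.

```java
import java.util.Random;
import java.util.function.ToIntFunction;

// Plain-Java sketch of the per-record logic; in a Flink job this would be the body
// of a MapFunction<String, String>, with no StreamExecutionEnvironment involved.
class PriceLookupSketch {
    private final ToIntFunction<String> lookupPrice; // hypothetical price service

    PriceLookupSketch(ToIntFunction<String> lookupPrice) {
        this.lookupPrice = lookupPrice;
    }

    // One input symbol produces exactly one "SYMBOL,price" record.
    String map(String stockSymbol) {
        String symbol = stockSymbol.trim();
        return symbol + "," + lookupPrice.applyAsInt(symbol);
    }

    public static void main(String[] args) {
        Random random = new Random();
        // Stand-in for a real lookup: a random price between 1 and 100.
        PriceLookupSketch mapper = new PriceLookupSketch(s -> random.nextInt(100) + 1);
        System.out.println(mapper.map("CSCO"));
        System.out.println(mapper.map(" FB "));
    }
}
```

In the job itself, the stream returned by .map(...) would then be handed to addSink with the Kafka producer in main, and the single StreamExecutionEnvironment created there would be executed once.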