Multiple Flink Window Streams to same Kinesis Sink - apache-flink

I'm trying to sink two Window Streams to the same Kinesis Sink. When
I do this, no results are making it to the sink (code below). If I
remove one of the windows from the Job, results do get published.
Adding another stream to the sink seems to void both.
How can I have results from both Window Streams go to the same sink?
public static void main(String[] args) throws Exception {
    final StreamExecutionEnvironment env =
            StreamExecutionEnvironment.getExecutionEnvironment();
    ObjectMapper jsonParser = new ObjectMapper();
    DataStream<String> inputStream = createKinesisSource(env);
    FlinkKinesisProducer<String> kinesisSink = createKinesisSink();

    WindowedStream oneMinStream = inputStream
            .map(value -> jsonParser.readValue(value, JsonNode.class))
            .keyBy(node -> node.get("accountId"))
            .window(TumblingProcessingTimeWindows.of(Time.minutes(1)));

    oneMinStream
            .aggregate(new LoginAggregator("k1m"))
            .addSink(kinesisSink);

    WindowedStream twoMinStream = inputStream
            .map(value -> jsonParser.readValue(value, JsonNode.class))
            .keyBy(node -> node.get("accountId"))
            .window(TumblingProcessingTimeWindows.of(Time.minutes(2)));

    twoMinStream
            .aggregate(new LoginAggregator("k2m"))
            .addSink(kinesisSink);

    try {
        env.execute("Flink Kinesis Streaming Sink Job");
    } catch (Exception e) {
        LOG.error("failed");
        LOG.error(e.getLocalizedMessage());
        LOG.error(e.getStackTrace().toString());
        throw e;
    }
}

private static DataStream<String> createKinesisSource(StreamExecutionEnvironment env) {
    Properties inputProperties = new Properties();
    inputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region);
    inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION, "LATEST");
    return env.addSource(new FlinkKinesisConsumer<>(inputStreamName,
            new SimpleStringSchema(), inputProperties));
}

private static FlinkKinesisProducer<String> createKinesisSink() {
    Properties outputProperties = new Properties();
    outputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region);
    outputProperties.setProperty("AggregationEnabled", "false");
    FlinkKinesisProducer<String> sink =
            new FlinkKinesisProducer<>(new SimpleStringSchema(), outputProperties);
    sink.setDefaultStream(outputStreamName);
    sink.setDefaultPartition(UUID.randomUUID().toString());
    return sink;
}

You want to .union() the two aggregated streams (the DataStreams produced by the .aggregate() calls), and then add your sink once to that unioned stream.
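For example, a minimal sketch of that change, keeping the variable names from the question (the element type of the aggregated streams is assumed to be String here, matching whatever LoginAggregator actually emits):

DataStream<String> oneMinResults = oneMinStream.aggregate(new LoginAggregator("k1m"));
DataStream<String> twoMinResults = twoMinStream.aggregate(new LoginAggregator("k2m"));

// Union the two aggregated streams and attach the Kinesis sink exactly once.
oneMinResults
        .union(twoMinResults)
        .addSink(kinesisSink);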

Related

Apache Flink Produce Kafka from Csv File

I want to produce messages to Kafka from a CSV file, but the Kafka output is just the following:
org.apache.flink.streaming.api.datastream.DataStreamSource@28aaa5a7
How can I do it?
My code:
public static class SimpleStringGenerator implements SourceFunction<String> {
    private static final long serialVersionUID = 2174904787118597072L;
    boolean running = true;
    long i = 0;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<String> text = env.readTextFile("/home/train/Desktop/yaz/aa/1");
        ctx.collect(String.valueOf(text));
        Thread.sleep(10);
    }
}
text is a DataStream object that represents an unbounded stream of elements (in your code, each line of the text file will be a separate element), so it is not the actual file contents.
If what you want is to produce these elements to Kafka, you need to initialize a Kafka sink and connect your DataStream object to it.
From the Flink docs:
DataStream<String> stream = ...;

KafkaSink<String> sink = KafkaSink.<String>builder()
        .setBootstrapServers(brokers)
        .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                .setTopic("topic-name")
                .setValueSerializationSchema(new SimpleStringSchema())
                .build()
        )
        .setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
        .build();

stream.sinkTo(sink);
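Applied to the code above, a minimal sketch (the file path is taken from the question; brokers and the topic name still need to be filled in for your setup) would be to drop the custom SourceFunction entirely and connect the stream returned by readTextFile straight to the Kafka sink:

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Each line of the file becomes one element of the stream.
DataStream<String> text = env.readTextFile("/home/train/Desktop/yaz/aa/1");

// 'sink' is the KafkaSink built as shown above.
text.sinkTo(sink);

env.execute("CSV to Kafka");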

Using Custom Encoder for extracting the required information from Event in Stream File Sink

//Code Snippet for sinking data into s3 using stream file sink with custom encoder
//Event consists of two objects but I need to sink only one object from the Event.
StreamingFileSink<Event> sink = StreamingFileSink
        .forRowFormat(new Path(s3SinkPath), new CustomEncoder<Event>("UTF-8"))
        .withBucketAssigner(new CustomBucketAssigner())
        .withRollingPolicy(DefaultRollingPolicy.builder().build())
        .build();

// Encoder overridden function
@Override
public void encode(Event element, OutputStream stream) throws IOException {
    if (charset == null) {
        charset = Charset.forName(charsetName);
    }
    stream.write(element.toString().getBytes(charset));
    stream.write('\n');
}
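Since the Event holds two objects but only one of them should be written, the encode override just needs to serialize that member instead of element.toString(). A minimal sketch, assuming a hypothetical getPayload() accessor for the object you actually want to sink:

import org.apache.flink.api.common.serialization.Encoder;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

public class CustomEncoder<T extends Event> implements Encoder<T> {
    private final String charsetName;
    private transient Charset charset;

    public CustomEncoder(String charsetName) {
        this.charsetName = charsetName;
    }

    @Override
    public void encode(T element, OutputStream stream) throws IOException {
        if (charset == null) {
            charset = Charset.forName(charsetName);
        }
        // Write only the object of interest, not the whole Event.
        // getPayload() is a hypothetical accessor; replace it with the real getter.
        stream.write(element.getPayload().toString().getBytes(charset));
        stream.write('\n');
    }
}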

Flink StreamingFileSink not writing data to AWS S3

I have a collection that represents a data stream, and I am testing StreamingFileSink to write the stream to S3. The program runs successfully, but there is no data in the given S3 path.
public class S3Sink {
    public static void main(String args[]) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        see.enableCheckpointing(100);

        List<String> input = new ArrayList<>();
        input.add("test");

        DataStream<String> inputStream = see.fromCollection(input);

        RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();

        StreamingFileSink s3Sink = StreamingFileSink.
                forRowFormat(new Path("<S3 Path>"),
                        new SimpleStringEncoder<>("UTF-8"))
                .withRollingPolicy(rollingPolicy)
                .build();
        inputStream.addSink(s3Sink);
        see.execute();
    }
}
Checkpointing is enabled as well. Any thoughts on why the sink is not working as expected?
UPDATE:
Based on David's answer, I created a custom source which generates random strings continuously, and I expect checkpointing to trigger after the configured interval and write the data to S3.
public class S3SinkCustom {
    public static void main(String args[]) throws Exception {
        StreamExecutionEnvironment see = StreamExecutionEnvironment.getExecutionEnvironment();
        see.enableCheckpointing(1000);

        DataStream<String> inputStream = see.addSource(new CustomSource());

        RollingPolicy<Object, String> rollingPolicy = new CustomRollingPolicy();

        StreamingFileSink s3Sink = StreamingFileSink.
                forRowFormat(new Path("s3://mybucket/data/"),
                        new SimpleStringEncoder<>("UTF-8"))
                .build();
        //inputStream.print();
        inputStream.addSink(s3Sink);
        see.execute();
    }

    static class CustomSource extends RichSourceFunction<String> {
        private volatile boolean running = false;
        final String[] strings = {"ABC", "XYZ", "DEF"};

        @Override
        public void open(Configuration parameters) {
            running = true;
        }

        @Override
        public void run(SourceContext sourceContext) throws Exception {
            while (running) {
                Random random = new Random();
                int index = random.nextInt(strings.length);
                sourceContext.collect(strings[index]);
                Thread.sleep(1000);
            }
        }

        @Override
        public void cancel() {
            running = false;
        }
    }
}
Still, there is no data in S3, and the Flink process is not even validating whether the given S3 bucket is valid, but the process runs without any issues.
Update:
Below are the details of the custom rolling policy:
public class CustomRollingPolicy implements RollingPolicy<Object, String> {
    @Override
    public boolean shouldRollOnCheckpoint(PartFileInfo partFileInfo) throws IOException {
        return partFileInfo.getSize() > 1;
    }

    @Override
    public boolean shouldRollOnEvent(PartFileInfo partFileInfo, Object o) throws IOException {
        return true;
    }

    @Override
    public boolean shouldRollOnProcessingTime(PartFileInfo partFileInfo, long l) throws IOException {
        return true;
    }
}
I believe the issue is that the job you've written isn't going to run long enough to actually checkpoint, so the output isn't going to be finalized.
Another potential issue is that the StreamingFileSink only works with the Hadoop-based S3 filesystem (and not the one from Presto).
The above issue was resolved after setting up flink-conf.yaml with the required s3a properties, such as fs.s3a.access.key and fs.s3a.secret.key.
We need to let Flink know about the config location as well.
FileSystem.initialize(GlobalConfiguration.loadConfiguration(""));
With these changes, I was able to run the S3 sink locally, and messages persisted to S3 without any issues.
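For reference, a minimal sketch of that bootstrap step (the config directory path below is a placeholder; the fs.s3a.* keys themselves live in the flink-conf.yaml found there):

import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.GlobalConfiguration;
import org.apache.flink.core.fs.FileSystem;

public class S3SinkBootstrap {
    public static void main(String[] args) throws Exception {
        // Load flink-conf.yaml (containing fs.s3a.access.key / fs.s3a.secret.key)
        // from the config directory and register it with Flink's FileSystem layer.
        Configuration conf = GlobalConfiguration.loadConfiguration("/path/to/flink/conf");
        FileSystem.initialize(conf);

        // ...then build the StreamExecutionEnvironment and the StreamingFileSink as above.
    }
}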

Apache Flink - use values from a data stream to dynamically create a streaming data source

I'm trying to build a sample application using Apache Flink that does the following:
1. Reads a stream of stock symbols (e.g. 'CSCO', 'FB') from a Kafka queue.
2. For each symbol, performs a real-time lookup of the current price and streams the values for downstream processing.
* Update to original post *
I moved the map function into a separate class and no longer get the run-time error message "The implementation of the MapFunction is not serializable any more. The object probably contains or references non serializable fields".
The issue I'm facing now is that the Kafka topic "stockprices" I'm trying to write the prices to is not receiving them. I'm trying to troubleshoot and will post any updates.
public class RetrieveStockPrices {
    @SuppressWarnings("serial")
    public static void main(String[] args) throws Exception {
        final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("zookeeper.connect", "localhost:2181");
        properties.setProperty("group.id", "stocks");

        DataStream<String> streamOfStockSymbols = streamExecEnv.addSource(
                new FlinkKafkaConsumer08<String>("stocksymbol", new SimpleStringSchema(), properties));

        DataStream<String> stockPrice =
            streamOfStockSymbols
                //get unique keys
                .keyBy(new KeySelector<String, String>() {
                    @Override
                    public String getKey(String trend) throws Exception {
                        return trend;
                    }
                })
                //collect events over a window
                .window(TumblingEventTimeWindows.of(Time.seconds(60)))
                //return the last event from the window...all elements are the same "Symbol"
                .apply(new WindowFunction<String, String, String, TimeWindow>() {
                    @Override
                    public void apply(String key, TimeWindow window, Iterable<String> input, Collector<String> out) throws Exception {
                        out.collect(input.iterator().next().toString());
                    }
                })
                .map(new StockSymbolToPriceMapFunction());

        streamExecEnv.execute("Retrieve Stock Prices");
    }
}
public class StockSymbolToPriceMapFunction extends RichMapFunction<String, String> {
    @Override
    public String map(String stockSymbol) throws Exception {
        final StreamExecutionEnvironment streamExecEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamExecEnv.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
        System.out.println("StockSymbolToPriceMapFunction: stockSymbol: " + stockSymbol);

        DataStream<String> stockPrices = streamExecEnv.addSource(new LookupStockPrice(stockSymbol));
        stockPrices.keyBy(new CustomKeySelector())
                .addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));

        return "100000";
    }

    private static class CustomKeySelector implements KeySelector<String, String> {
        @Override
        public String getKey(String arg0) throws Exception {
            return arg0.trim();
        }
    }
}
public class LookupStockPrice extends RichSourceFunction<String> {
    public String stockSymbol = null;
    public boolean isRunning = true;

    public LookupStockPrice(String inSymbol) {
        stockSymbol = inSymbol;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        isRunning = true;
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    @Override
    public void run(SourceFunction.SourceContext<String> ctx) throws Exception {
        String stockPrice = "0";
        while (isRunning) {
            //TODO: query Google Finance API
            stockPrice = Integer.toString((new Random()).nextInt(100) + 1);
            ctx.collect(stockPrice);
            Thread.sleep(10000);
        }
    }
}
A StreamExecutionEnvironment is not intended to be used inside the operators of a streaming application. "Not intended" means it is neither tested nor encouraged. It might work and do something, but it will most likely not behave well and will probably kill your application.
The StockSymbolToPriceMapFunction in your program defines, for each incoming record, a completely new and independent streaming application. However, since you do not call streamExecEnv.execute(), these programs are never started, and the map method returns without doing anything.
If you did call streamExecEnv.execute(), the function would start a new local Flink cluster in the worker's JVM and run the application on that local cluster. Each local Flink instance takes a lot of heap space, and after a few clusters have been started, the worker will probably die from an OutOfMemoryError, which is not what you want to happen.
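A hedged sketch of one way to restructure this (not spelled out in the answer above): do the lookup inside the map function itself and let the main job own the single Kafka sink, so no nested environment is ever created. fetchPrice() is a hypothetical stand-in for the real price lookup:

DataStream<String> stockPrices = streamOfStockSymbols
        .map(new RichMapFunction<String, String>() {
            @Override
            public String map(String stockSymbol) throws Exception {
                // Hypothetical synchronous lookup; replace with the real price API call.
                return stockSymbol + "," + fetchPrice(stockSymbol);
            }
        });

// The sink is attached once, in the main pipeline, instead of once per record.
stockPrices.addSink(new FlinkKafkaProducer08<String>("localhost:9092", "stockprices", new SimpleStringSchema()));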

How do I iterate over each message in a Flink DataStream?

I have a message stream from Kafka like the following
DataStream<String> messageStream = env
        .addSource(new FlinkKafkaConsumer09<>(topic, new MsgPackDeserializer(), props));
How can I iterate over each message in the stream and do something with it? I see an iterate() method on DataStream but it does not return an Iterator<String>.
I think you are looking for a MapFunction.
DataStream<String> messageStream = env.addSource(
        new FlinkKafkaConsumer09<>(topic, new MsgPackDeserializer(), props));

DataStream<Y> mappedMessages = messageStream
        .map(new MapFunction<String, Y>() {
            public Y map(String message) {
                // do something with each message and return Y
            }
        });
If you don't want to emit exactly one record for each incoming message, have a look at the FlatMapFunction.
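For instance, a minimal sketch of a FlatMapFunction that can emit zero, one, or several records per incoming message (the comma-splitting logic is just an illustration):

DataStream<String> flatMappedMessages = messageStream
        .flatMap(new FlatMapFunction<String, String>() {
            @Override
            public void flatMap(String message, Collector<String> out) {
                // Emit one record per comma-separated token, skipping empty ones.
                for (String token : message.split(",")) {
                    if (!token.isEmpty()) {
                        out.collect(token);
                    }
                }
            }
        });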
