How do I iterate over each message in a Flink DataStream? - apache-flink

I have a message stream from Kafka like the following
DataStream<String> messageStream = env
.addSource(new FlinkKafkaConsumer09<>(topic, new MsgPackDeserializer(), props));
How can I iterate over each message in the stream and do something with it? I see an iterate() method on DataStream but it does not return an Iterator<String>.

I think you are looking for a MapFunction.
DataStream<String> messageStream = env.addSource(
new FlinkKafkaConsumer09<>(topic, new MsgPackDeserializer(), props));
DataStream<Y> mappedMessages = messageStream
.map(new MapFunction<String, Y>() {
public Y map(String message) {
// do something with each message and return Y
}
});
If you don't want to emit exactly one record for each incoming message, have a look at the FlatMapFunction.
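A flatMap looks almost the same but receives a Collector, so it can emit zero, one, or many records per message. A minimal sketch (the String output type and the comma splitting are just arbitrary examples):
DataStream<String> flatMappedMessages = messageStream
    .flatMap(new FlatMapFunction<String, String>() {
        @Override
        public void flatMap(String message, Collector<String> out) {
            // emit zero or more records per incoming message,
            // e.g. one record per comma-separated token
            for (String token : message.split(",")) {
                out.collect(token);
            }
        }
    });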

Related

Multiple Flink Window Streams to same Kinesis Sink

I'm trying to sink two Window Streams to the same Kinesis Sink. When
I do this, no results are making it to the sink (code below). If I
remove one of the windows from the Job, results do get published.
Adding the second stream to the sink seems to break output from both.
How can I have results from both Window Streams go to the same sink?
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment();
ObjectMapper jsonParser = new ObjectMapper();
DataStream<String> inputStream = createKinesisSource(env);
FlinkKinesisProducer<String> kinesisSink = createKinesisSink();
WindowedStream oneMinStream = inputStream
.map(value -> jsonParser.readValue(value, JsonNode.class))
.keyBy(node -> node.get("accountId"))
.window(TumblingProcessingTimeWindows.of(Time.minutes(1)));
oneMinStream
.aggregate(new LoginAggregator("k1m"))
.addSink(kinesisSink);
WindowedStream twoMinStream = inputStream
.map(value -> jsonParser.readValue(value, JsonNode.class))
.keyBy(node -> node.get("accountId"))
.window(TumblingProcessingTimeWindows.of(Time.minutes(2)));
twoMinStream
.aggregate(new LoginAggregator("k2m"))
.addSink(kinesisSink);
try {
env.execute("Flink Kinesis Streaming Sink Job");
} catch (Exception e) {
LOG.error("failed");
LOG.error(e.getLocalizedMessage());
LOG.error(e.getStackTrace().toString());
throw e;
}
}
private static DataStream<String>
createKinesisSource(StreamExecutionEnvironment env) {
Properties inputProperties = new Properties();
inputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region);
inputProperties.setProperty(ConsumerConfigConstants.STREAM_INITIAL_POSITION,
"LATEST");
return env.addSource(new FlinkKinesisConsumer<>(inputStreamName,
new SimpleStringSchema(), inputProperties));
}
private static FlinkKinesisProducer<String> createKinesisSink() {
Properties outputProperties = new Properties();
outputProperties.setProperty(ConsumerConfigConstants.AWS_REGION, region);
outputProperties.setProperty("AggregationEnabled", "false");
FlinkKinesisProducer<String> sink = new FlinkKinesisProducer<>(new
SimpleStringSchema(), outputProperties);
sink.setDefaultStream(outputStreamName);
sink.setDefaultPartition(UUID.randomUUID().toString());
return sink;
}
You want to .union() the oneMinStream and twoMinStream together, and then add your sink to that unioned stream.
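For example, something along these lines (a sketch, assuming LoginAggregator emits String records, which the String-typed Kinesis sink implies):
DataStream<String> oneMinResults = oneMinStream.aggregate(new LoginAggregator("k1m"));
DataStream<String> twoMinResults = twoMinStream.aggregate(new LoginAggregator("k2m"));

// union the two result streams and attach the sink exactly once
oneMinResults.union(twoMinResults).addSink(kinesisSink);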

Apache Flink Produce Kafka from Csv File

I want to produce to Kafka from a CSV file, but the Kafka output is the following:
org.apache.flink.streaming.api.datastream.DataStreamSource#28aaa5a7
How can I do it?
My code:
public static class SimpleStringGenerator implements SourceFunction<String> {
private static final long serialVersionUID = 2174904787118597072L;
boolean running = true;
long i = 0;
@Override
public void run(SourceContext<String> ctx) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<String> text = env.readTextFile("/home/train/Desktop/yaz/aa/1");
ctx.collect(String.valueOf(text));
Thread.sleep(10);
}
text is a DataStream object which represents an unbounded stream of elements (in your code each line of the text file becomes a separate element), so it is not the actual file contents.
If what you want is to produce these elements to Kafka, you need to initialize a Kafka sink and connect your DataStream object to it.
From Flink docs:
DataStream<String> stream = ...;
KafkaSink<String> sink = KafkaSink.<String>builder()
.setBootstrapServers(brokers)
.setRecordSerializer(KafkaRecordSerializationSchema.builder()
.setTopic("topic-name")
.setValueSerializationSchema(new SimpleStringSchema())
.build()
)
.setDeliverGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
.build();
stream.sinkTo(sink);
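Applied to your case, that means dropping the custom SourceFunction and sinking the file stream directly. A minimal sketch, where the broker address and topic name are placeholders you would replace:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// each line of the CSV file becomes one String element in the stream
DataStream<String> text = env.readTextFile("/home/train/Desktop/yaz/aa/1");

KafkaSink<String> sink = KafkaSink.<String>builder()
    .setBootstrapServers("localhost:9092")            // placeholder broker list
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
        .setTopic("csv-topic")                        // placeholder topic name
        .setValueSerializationSchema(new SimpleStringSchema())
        .build())
    .build();

text.sinkTo(sink);
env.execute("csv-to-kafka");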

How to control flink sending output to sideoutput in keyedbroadcastprocessfunction

I am trying to validate a data stream against a set of rules to detect patterns in Flink. I validate the data stream against a broadcast stream that carries the set of rules, using a for loop to collect all the patterns in a map and iterating through them in the processElement function to find a match. Sample code is below.
The MapStateDescriptor and side output stream are as follows:
public static final MapStateDescriptor<String, String> ruleSetDescriptor =
new MapStateDescriptor<String, String>("RuleSet", BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO);
public final static OutputTag<Tuple2<String, String>> unMatchedSideOutput =
new OutputTag<Tuple2<String, String>>(
"unmatched-side-output") {
};
The process function and broadcast function are as follows:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx,
Collector<Tuple2<String,
String>> out) throws Exception {
for (Map.Entry<String, String> ruleSet:
ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
String ruleName = ruleSet.getKey();
//If the rule in ruleset is matched then send output to main stream and break the program
if (this.rule) {
out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
break;
}
}
// Writing output to sideout if no rule is matched
ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}
@Override
public void processBroadcastElement(Tuple2<String, String> ruleSetConditions, Context ctx, Collector<Tuple2<String,String>> out) throws Exception {
ctx.getBroadcastState(broadcast.ruleSetDescriptor).put(ruleSetConditions.f0,
ruleSetConditions.f1);
}
I am able to detect the pattern, but I am getting side output as well. Because I iterate over the rules one by one, if the matching rule happens to be last, the earlier rules won't match and the program writes to the side output anyway. I want to write to the side output only once, when none of the rules are satisfied. I am new to Flink; please help me understand how I can achieve this.
It looks to me like you want to do something more like this:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx, Collector<Tuple2<String, String>> out) throws Exception {
boolean matched = false;
for (Map.Entry<String, String> ruleSet:
ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
String ruleName = ruleSet.getKey();
if (this.rule) {
matched = true;
out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
break;
}
}
// Writing output to sideout if no rule was matched
if (!matched) {
ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}
}
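For completeness, the side output is then read from the stream returned by process(...). A rough sketch, where keyedStream, ruleBroadcast, and MyPatternFunction stand in for your own streams and function:
// process(...) returns a SingleOutputStreamOperator, which exposes getSideOutput(...)
SingleOutputStreamOperator<Tuple2<String, String>> matched =
    keyedStream.connect(ruleBroadcast).process(new MyPatternFunction());

// records emitted via ctx.output(unMatchedSideOutput, ...) end up here
DataStream<Tuple2<String, String>> unmatched = matched.getSideOutput(unMatchedSideOutput);
unmatched.print("NoRuleDetected=>");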

How to update the Broadcast state in KeyedBroadcastProcessFunction in flink?

I am new to Flink. I am doing pattern matching using Apache Flink, where the list of patterns is kept in broadcast state and I iterate through the patterns in the processElement function to find a match. I read these patterns from a database, and it is a one-time activity. Below is my code.
The MapStateDescriptor and side output stream are as follows:
public static final MapStateDescriptor<String, String> ruleDescriptor=
new MapStateDescriptor<String, String>("RuleSet", BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.STRING_TYPE_INFO);
public final static OutputTag<Tuple2<String, String>> unMatchedSideOutput =
new OutputTag<Tuple2<String, String>>(
"unmatched-side-output") {
};
The process function and broadcast function are as follows:
@Override
public void processElement(Tuple2<String, String> inputValue, ReadOnlyContext ctx,Collector<Tuple2<String,String>> out) throws Exception {
for (Map.Entry<String, String> ruleSet: ctx.getBroadcastState(broadcast.patternRuleDescriptor).immutableEntries()) {
String ruleName = ruleSet.getKey();
//If the rule in ruleset is matched then send output to main stream and break the program
if (this.rule) {
out.collect(new Tuple2<>(inputValue.f0, inputValue.f1));
break;
}
}
// Writing output to sideout if no rule is matched
ctx.output(Output.unMatchedSideOutput, new Tuple2<>("No Rule Detected", inputValue.f1));
}
@Override
public void processBroadcastElement(Tuple2<String, String> ruleSetConditions, Context ctx, Collector<Tuple2<String, String>> out) throws Exception {
ctx.getBroadcastState(broadcast.ruleDescriptor).put(ruleSetConditions.f0,
ruleSetConditions.f1);
}
The main function is as follows:
public static void main(String[] args) throws Exception {
//Initiate a datastream environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//Reads incoming data for upstream
DataStream<String> incomingSignal =
env.readTextFile(....);
//Reads the patterns available in configuration file
DataStream<String> rawPatternStream =
env.readTextFile(....);
//Generate a key,value pair of set of patterns where key is pattern name and value is pattern condition
DataStream<Tuple2<String, String>> ruleStream =
rawPatternStream.flatMap(new FlatMapFunction<String, Tuple2<String, String>>() {
@Override
public void flatMap(String ruleCondition, Collector<Tuple2<String, String>> out) throws Exception {
String[] rules = ruleCondition.split(",");
out.collect(new Tuple2<>(rules[0], rules[1]));
}
});
//Broadcast the patterns to all the flink operators which will be stored in flink operator memory
BroadcastStream<Tuple2<String, String>> ruleBroadcast = ruleStream.broadcast(ruleDescriptor);
/*Creating keystream based on sourceName as key */
DataStream<Tuple2<String, String>> matchSignal =
incomingSignal.map(new MapFunction<String, Tuple2<String, String>>() {
@Override
public Tuple2<String, String> map(String incomingSignal) throws Exception {
String sourceName = incomingSignal.split(",")[0];
return new Tuple2<>(sourceName, incomingSignal);
}
}).keyBy(0).connect(ruleBroadcast).process(new KeyedBroadCastProcessFunction());
matchSignal.print("RuleDetected=>");
}
I have a couple of questions
1) Currently I am reading the rules from a database. How can I update the broadcast state while the Flink job is running in a cluster? If I get a new set of rules from a Kafka topic, how can I update the broadcast state in the processBroadcastElement method of the KeyedBroadcastProcessFunction?
2) When the broadcast state is updated, do we need to restart the Flink job?
Please help me with the above questions.
The only way to either set or update broadcast state is in the processBroadcastElement method of a BroadcastProcessFunction or KeyedBroadcastProcessFunction. All you need to do is to adapt your application to stream in the rules from a streaming source, rather than reading them once from a file.
Broadcast state is a hash map. If your broadcast stream includes a new key/value pair that uses the same key as an earlier broadcast event, then the new value will replace the old one. Otherwise you'll end up with an entirely new entry.
If you use readFile with FileProcessingMode.PROCESS_CONTINUOUSLY, then every time you modify the file its entire contents will be reingested. You could use that mechanism to update your set of rules.
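A minimal sketch of that file-based mechanism, assuming the rules keep the same key,value CSV layout as in your job (the path and poll interval are placeholders):
// re-read the rules file whenever it changes; each change re-emits the whole file
TextInputFormat format = new TextInputFormat(new Path("/path/to/rules.csv"));
DataStream<String> rawRuleStream = env.readFile(
    format,
    "/path/to/rules.csv",
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    60_000L);                                        // check for changes every 60s

// same parsing and broadcast as before; processBroadcastElement's put() overwrites
// entries that reuse an existing key, so updated rules replace the old ones
BroadcastStream<Tuple2<String, String>> ruleBroadcast = rawRuleStream
    .map(line -> {
        String[] parts = line.split(",");
        return new Tuple2<>(parts[0], parts[1]);
    })
    .returns(Types.TUPLE(Types.STRING, Types.STRING))
    .broadcast(ruleDescriptor);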

Get stream from java.sql.Blob in Hibernate

I'm trying to use Hibernate @Entity with java.sql.Blob to store some binary data. Storing doesn't throw any exceptions (however, I'm not sure if it really stores the bytes), but reading does. Here is my test:
@Test
public void shouldStoreBlob() {
InputStream readFile = getClass().getResourceAsStream("myfile");
Blob blob = dao.createBlob(readFile, readFile.available());
Ent ent = new Ent();
ent.setBlob(blob);
em.persist(ent);
long id = ent.getId();
Ent fromDb = em.find(Ent.class, id);
//Exception is thrown from getBinaryStream()
byte[] fromDbBytes = IOUtils.toByteArray(fromDb.getBlob().getBinaryStream());
}
So it throws an exception:
java.sql.SQLException: could not reset reader
at org.hibernate.engine.jdbc.BlobProxy.getStream(BlobProxy.java:86)
at org.hibernate.engine.jdbc.BlobProxy.invoke(BlobProxy.java:108)
at $Proxy81.getBinaryStream(Unknown Source)
...
Why? Shouldn't it read the bytes from the DB here? And what can I do to make it work?
Try to refresh entity:
em.refresh(fromDb);
Stream will be reopened. I suspect that find(...) is closing the blob stream.
It is not at all clear how you are using JPA here, but you certainly do not need to deal with the Blob data type directly if you are using JPA.
You just need to declare a @Lob field in the entity in question, somewhat like this:
@Lob
@Basic(fetch = LAZY)
@Column(name = "image")
private byte[] image;
Then, when you retrieve your entity, the bytes will be read back again in the field and you will be able to put them in a stream and do whatever you want with them.
Of course you will need getter and setter methods in your entity to do the byte conversion. In the example above it would be somewhat like:
private Image getImage() {
Image result = null;
if (this.image != null && this.image.length > 0) {
result = new ImageIcon(this.image).getImage();
}
return result;
}
And the setter somewhat like this:
private void setImage(Image source) {
BufferedImage buffered = new BufferedImage(source.getWidth(null), source.getHeight(null), BufferedImage.TYPE_INT_RGB);
Graphics2D g = buffered.createGraphics();
g.drawImage(source, 0, 0, null);
g.dispose();
ByteArrayOutputStream stream = new ByteArrayOutputStream();
try {
ImageIO.write(buffered, "JPEG", stream);
this.image = stream.toByteArray();
}
catch (IOException e) {
assert (false); // should never happen
}
}
}
You need to set a breakpoint in the method org.hibernate.engine.jdbc.BlobProxy#getStream, on the line stream.reset(), and examine the reason for the IOException:
private InputStream getStream() throws SQLException {
try {
if (needsReset) {
stream.reset(); // <---- Set breakpoint here
}
}
catch ( IOException ioe) {
throw new SQLException("could not reset reader");
}
needsReset = true;
return stream;
}
In my case the IOException was caused by using org.apache.commons.io.input.AutoCloseInputStream as the source for the Blob:
InputStream content = new AutoCloseInputStream(stream);
...
Ent ent = new Ent();
...
Blob blob = Hibernate.getLobCreator(getSession()).createBlob(content, file.getFileSize())
ent.setBlob(blob);
em.persist(ent);
While flushing a Session, Hibernate closes the InputStream content (or rather, in my case, org.postgresql.jdbc2.AbstractJdbc2Statement#setBlob closes the InputStream). And when an AutoCloseInputStream is closed, it raises an IOException in its reset() method.
Update
In your case you use a FileInputStream; this stream also throws an exception from its reset() method.
There is a problem in the test case: you create the blob and read it from the database inside one transaction. When you persist Ent, the Postgres JDBC driver closes the InputStream while flushing the session. When you load Ent (em.find(Ent.class, id)), you get the same BlobProxy object, which stores the already-closed InputStream.
Try this:
TransactionTemplate tt;
@Test
public void shouldStoreBlob() {
final long id = tt.execute(new TransactionCallback<Long>()
{
@Override
public Long doInTransaction(TransactionStatus status)
{
try
{
InputStream readFile = getClass().getResourceAsStream("myfile");
Blob blob = dao.createBlob(readFile, readFile.available());
Ent ent = new Ent();
ent.setBlob(blob);
em.persist(ent);
return ent.getId();
}
catch (Exception e)
{
return 0L;
}
}
});
byte[] fromStorage = tt.execute(new TransactionCallback<byte[]>()
{
@Override
public byte[] doInTransaction(TransactionStatus status)
{
Ent fromDb = em.find(Ent.class, id);
try
{
return IOUtils.toByteArray(fromDb.getBlob().getBinaryStream());
}
catch (IOException e)
{
return new byte[] {};
}
}
});
}
My current and only solution is closing the write session and opening a new Hibernate session to get back the streamed data. It works, but I do not know what the difference is. I called inputStream.close(), but that was not enough.
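In code, that workaround looks roughly like this (a sketch, assuming you manage a plain SessionFactory yourself and reuse the Ent entity from the test; the writeThenRead method name is just for illustration):
// write and read in separate sessions so the Blob is materialized from the database
void writeThenRead(SessionFactory sessionFactory, Ent ent) throws Exception {
    Session writeSession = sessionFactory.openSession();
    writeSession.beginTransaction();
    writeSession.save(ent);
    writeSession.getTransaction().commit();
    writeSession.close();                       // flushing/closing invalidates the original stream

    Session readSession = sessionFactory.openSession();
    Ent fromDb = (Ent) readSession.get(Ent.class, ent.getId());
    byte[] bytes = IOUtils.toByteArray(fromDb.getBlob().getBinaryStream());
    readSession.close();
}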
Another way:
I tried to call the blob's free() method after the session.save(attachment) call too, but it throws another exception:
Exception in thread "main" java.lang.AbstractMethodError: org.hibernate.lob.SerializableBlob.free()V
at my.hibernatetest.HibernateTestBLOB.storeStreamInDatabase(HibernateTestBLOB.java:142)
at my.hibernatetest.HibernateTestBLOB.main(HibernateTestBLOB.java:60)
I am using PostgreSQL 8.4 + postgresql-8.4-702.jdbc4.jar, Hibernate 3.3.1.GA
Is the method IOUtils.toByteArray closing the input stream?
