How can I add a message key to KafkaSink in Apache Flink 1.14 - apache-flink

As stated in the title I need to set a custom message key in KafkaSink. I cannot find any indication on how to achieve this in the Apache Flink 1.14 docs.
At the moment I'm setting up the KafkaSink correctly and the data payload is written to the topic, but the key is null.
Any suggestions? Thanks in advance

You should implement a KafkaRecordSerializationSchema that sets the key on the ProducerRecord returned by its serialize method.
You'll create the sink more-or-less like this:
KafkaSink<UsageRecord> sink =
KafkaSink.<UsageRecord>builder()
.setBootstrapServers(brokers)
.setKafkaProducerConfig(kafkaProps)
.setRecordSerializer(new MyRecordSerializationSchema(topic))
.setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setTransactionalIdPrefix("my-record-producer")
.build();
and the serializer will be something like this:
public class MyRecordSerializationSchema implements
KafkaRecordSerializationSchema<UsageRecord> {
private static final long serialVersionUID = 1L;
private String topic;
private static final ObjectMapper objectMapper =
JsonMapper.builder()
.build()
.registerModule(new JavaTimeModule())
.configure(SerializationFeature.WRITE_DATES_AS_TIMESTAMPS, false);
public MyRecordSerializationSchema() {}
public MyRecordSerializationSchema(String topic) {
this.topic = topic;
}
@Override
public ProducerRecord<byte[], byte[]> serialize(
UsageRecord element, KafkaSinkContext context, Long timestamp) {
try {
return new ProducerRecord<>(
topic,
null, // choosing not to specify the partition
element.ts.toEpochMilli(),
element.getKey(),
objectMapper.writeValueAsBytes(element));
} catch (JsonProcessingException e) {
throw new IllegalArgumentException(
"Could not serialize record: " + element, e);
}
}
}
Note that this example is also setting the timestamp.
FWIW, this example comes from https://github.com/alpinegizmo/flink-mobile-data-usage/blob/main/src/main/java/com/ververica/flink/example/datausage/records/UsageRecordSerializationSchema.java.

This example is for Scala programmers. Here we define a key by generating a UUID for each event.
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema
import org.apache.kafka.clients.producer.ProducerRecord
import java.lang
class MyRecordSerializationSchema(kafkaTopicName: String) extends KafkaRecordSerializationSchema[String] {
override def serialize(element: String, context: KafkaRecordSerializationSchema.KafkaSinkContext, timestamp: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
return new ProducerRecord(
kafkaTopicName,
java.util.UUID.randomUUID.toString.getBytes,
element.getBytes
)
}
}
In the main class, you will have to pass an instance of this class when defining the Kafka sink, like this:
val sinkKafka: KafkaSink[String] = KafkaSink.builder[String]()
.setBootstrapServers(bootstrapServerUrl) // Bootstrap server url
.setRecordSerializer(new MyRecordSerializationSchema(kafkaTopicName))
.build()

Related

Modify the message body, or otherwise modify some data in a Camel route

What is the best way to make a small modification to some data in a Camel route?
I'm pulling in a BSON document from Mongo. I need to use a timestamp from it in an http call, but I need to convert it from milliseconds to seconds.
I tried setting a header.
.setHeader("test").jsonpath("$.startTime")
Which lets me add the timestamp to the URL with a Simple expression.
.toD("https://test.com/api/markets?resolution=60&start_time=${headers.test}")
But I can't find a way to modify the value of the header.
I also tried using a processor
.process(new Processor() {
public void process(Exchange exchange) throws Exception {
DocumentContext message = JsonPath.parse(exchange.getMessage().getBody());
String time = message.read("$.startTime").toString();
time = "111100000";
// do something with the payload and/or exchange here
//exchange.getIn().setBody("Changed body");
}
})
But here the exchange isn't passed back out. I based this on how I used an enrich EIP, with an aggregation strategy that returned an Exchange with the changes I made. This Processor doesn't seem to work that way.
You can modify the body, a header, or a property using a lambda, a processor, or a bean. With a processor you need to use the Message.setHeader method to modify the value of the header, at least for value types and strings. Bean methods receive the body by default, so if you want to pass in a header you'll need to specify it using the simple language.
import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.RoutesBuilder;
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.mock.MockEndpoint;
import org.apache.camel.test.junit4.CamelTestSupport;
import org.junit.Test;
public class SetHeaderTest extends CamelTestSupport {
@Test
public void testGreeting() throws Exception {
MockEndpoint resultMockEndpoint = getMockEndpoint("mock:result");
resultMockEndpoint.expectedMessageCount(3);
template.sendBodyAndHeader("direct:modifyGreetingLambda",
null, "greeting", "Hello");
template.sendBodyAndHeader("direct:modifyGreetingProcessor",
null, "greeting", "Hello");
template.sendBodyAndHeader("direct:modifyGreetingBean",
null, "greeting", "Hello");
resultMockEndpoint.assertIsSatisfied();
}
@Override
protected RoutesBuilder createRouteBuilder() throws Exception {
return new RouteBuilder(){
@Override
public void configure() throws Exception {
from("direct:modifyGreetingLambda")
.routeId("modifyGreetingLambda")
.setHeader("greeting").exchange(exchange -> {
String modifiedGreeting = (String)exchange.getMessage().getHeader("greeting");
modifiedGreeting += " world!";
return modifiedGreeting;
})
.log("${headers.greeting}")
.to("mock:result");
from("direct:modifyGreetingProcessor")
.routeId("modifyGreetingProcessor")
.process(new Processor(){
@Override
public void process(Exchange exchange) throws Exception {
String modifiedGreeting = (String)exchange.getMessage().getHeader("greeting");
modifiedGreeting += " world!";
exchange.getMessage().setHeader("greeting", modifiedGreeting);
}
})
.log("${headers.greeting}")
.to("mock:result");
from("direct:modifyGreetingBean")
.routeId("modifyGreetingBean")
.setHeader("greeting").method(new ModifyGreetingBean(),
"modifyGreeting('${headers.greeting}')")
.log("${headers.greeting}")
.to("mock:result");
}
};
}
public class ModifyGreetingBean {
public String modifyGreeting(String greeting) {
return greeting + " world!";
}
}
}
Aside from these you can also use expression languages like simple or groovy.
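For example, the same greeting header used in the routes above could be modified with the simple language directly in the route. A minimal sketch (the route name is made up, everything else mirrors the test above):
from("direct:modifyGreetingSimple")
    .routeId("modifyGreetingSimple")
    .setHeader("greeting").simple("${header.greeting} world!")
    .log("${headers.greeting}")
    .to("mock:result");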
In the route you can set the header with the milliseconds value .setHeader("test").jsonpath("$.startTime").
Then in a processor you can retrieve this value:
String milliSecondsValue = (String) exchange.getIn().getHeader("test");
Then you transform the milliSecondsValue to the value you want and you set it back on the exchange:
exchange.getIn().setHeader("test", secondsValue);
After that, call .toD("https://test.com/api/markets?resolution=60&start_time=${header.test}") and it will use the seconds value.
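Putting those pieces together, a minimal sketch of the route could look like this (the "direct:start" endpoint is a placeholder, and the conversion assumes the JSONPath result can be parsed as a long number of milliseconds):
from("direct:start")
    .setHeader("test").jsonpath("$.startTime")
    .process(exchange -> {
        // convert the milliseconds value to seconds and overwrite the header
        long millis = Long.parseLong(exchange.getIn().getHeader("test").toString());
        exchange.getIn().setHeader("test", millis / 1000);
    })
    .toD("https://test.com/api/markets?resolution=60&start_time=${header.test}");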

BroadcastProcessFunction Processing Delay

I'm fairly new to Flink and would be grateful for any advice with this issue.
I wrote a job that receives some input events and compares them with some rules before forwarding them on to kafka topics based on whatever rules match. I implemented this using a flatMap and found it worked well, with one downside: I was loading the rules just once, during application startup, by calling an API from my main() method, and passing the result of this API call into the flatMap function. This worked, but it means that if there are any changes to the rules I have to restart the application, so I wanted to improve it.
I found this page in the documentation which seems to be an appropriate solution to the problem. I wrote a custom source to poll my Rules API every few minutes, and then used a BroadcastProcessFunction, with the Rules added to the broadcast state using processBroadcastElement and the events processed by processElement.
The solution is working, but with one problem. My first approach using a flatMap would process the events almost instantly. Now that I've changed to a BroadcastProcessFunction, each event takes 60 seconds to process, and it seems to be more or less exactly 60 seconds every time with almost no variation. I made no changes to the rule matching logic itself.
I've had a look through the documentation and I can't seem to find a reason for this, so I'd appreciate it if anyone more experienced in Flink could offer a suggestion as to what might cause this delay.
The job:
public static void main(String[] args) throws Exception {
// set up the streaming execution environment
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setStreamTimeCharacteristic(TimeCharacteristic.IngestionTime);
// read the input from Kafka
DataStream<KafkaEvent> documentStream = env.addSource(
createKafkaSource(getSourceTopic(), getSourceProperties())).name("Kafka[" + getSourceTopic() + "]");
// Configure the Rules data stream
DataStream<RulesEvent> ruleStream = env.addSource(
new RulesApiHttpSource(
getApiRulesSubdomain(),
getApiBearerToken(),
DataType.DataTypeName.LOGS,
getRulesApiCacheDuration()) // Currently set to 120000
);
MapStateDescriptor<String, RulesEvent> ruleStateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<RulesEvent>() {
}));
// broadcast the rules and create the broadcast state
BroadcastStream<RulesEvent> ruleBroadcastStream = ruleStream
.broadcast(ruleStateDescriptor);
// extract the resources and attributes
documentStream
.connect(ruleBroadcastStream)
.process(new FanOutLogsRuleMapper()).name("FanOut Stream")
.addSink(createKafkaSink(getDestinationProperties()))
.name("FanOut Sink");
// run the job
env.execute(FanOutJob.class.getName());
}
The custom HTTP source which gets the rules
public class RulesApiHttpSource extends RichSourceFunction<RulesEvent> {
private static final Logger LOGGER = LoggerFactory.getLogger(RulesApiHttpSource.class);
private final long pollIntervalMillis;
private final String endpoint;
private final String bearerToken;
private final DataType.DataTypeName dataType;
private final RulesApiCaller caller;
private volatile boolean running = true;
public RulesApiHttpSource(String endpoint, String bearerToken, DataType.DataTypeName dataType, long pollIntervalMillis) {
this.pollIntervalMillis = pollIntervalMillis;
this.endpoint = endpoint;
this.bearerToken = bearerToken;
this.dataType = dataType;
this.caller = new RulesApiCaller(this.endpoint, this.bearerToken);
}
@Override
public void open(Configuration configuration) throws Exception {
// do nothing
}
@Override
public void close() throws IOException {
// do nothing
}
@Override
public void run(SourceContext<RulesEvent> ctx) throws IOException {
while (running) {
if (pollIntervalMillis > 0) {
try {
RulesEvent event = new RulesEvent();
event.setRules(getCurrentRulesList());
event.setDataType(this.dataType);
event.setRetrievedAt(Instant.now());
ctx.collect(event);
Thread.sleep(pollIntervalMillis);
} catch (InterruptedException e) {
running = false;
}
} else if (pollIntervalMillis <= 0) {
cancel();
}
}
}
public List<Rule> getCurrentRulesList() throws IOException {
// call API and get rules
}
@Override
public void cancel() {
running = false;
}
}
The BroadcastProcessFunction
public abstract class FanOutRuleMapper extends BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent> {
protected final String RULES_EVENT_NAME = "rulesEvent";
protected final MapStateDescriptor<String, RulesEvent> ruleStateDescriptor = new MapStateDescriptor<>(
"RulesBroadcastState",
BasicTypeInfo.STRING_TYPE_INFO,
TypeInformation.of(new TypeHint<RulesEvent>() {
}));
@Override
public void processBroadcastElement(RulesEvent rulesEvent, BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent>.Context ctx, Collector<KafkaEvent> out) throws Exception {
ctx.getBroadcastState(ruleStateDescriptor).put(RULES_EVENT_NAME, rulesEvent);
LOGGER.debug("Added to broadcast state {}", rulesEvent.toString());
}
// omitted rules matching logic
}
public class FanOutLogsRuleMapper extends FanOutRuleMapper {
public FanOutLogsRuleMapper() {
super();
}
@Override
public void processElement(KafkaEvent in, BroadcastProcessFunction<KafkaEvent, RulesEvent, KafkaEvent>.ReadOnlyContext ctx, Collector<KafkaEvent> out) throws Exception {
RulesEvent rulesEvent = ctx.getBroadcastState(ruleStateDescriptor).get(RULES_EVENT_NAME);
ExportLogsServiceRequest otlpLog = extractOtlpMessageFromJsonPayload(in);
for (Rule rule : rulesEvent.getRules()) {
boolean match = false;
// omitted rules matching logic
if (match) {
for (RuleDestination ruleDestination : rule.getRulesDestinations()) {
out.collect(fillInTheEvent(in, rule, ruleDestination, otlpLog));
}
}
}
}
}
Maybe you can share the complete code of the FanOutLogsRuleMapper class; in the snippet shown, the match variable is always false.

Flink How to Write DataSet As Parquet files in S3?

How can I write a DataSet as Parquet files in an S3 bucket using Flink? Is there any direct function like Spark's DF.write.parquet("write in parquet")?
Please help me understand how to write a Flink DataSet in Parquet format.
I am stuck when trying to convert my DataSet to Tuple2<Void, GenericRecord>:
DataSet<Tuple2<Void,GenericRecord>> df = allEvents.flatMap(new FlatMapFunction<Tuple2<LongWritable, Text>, Tuple2<Void, GenericRecord>>() {
@Override
public void flatMap(Tuple2<LongWritable, Text> longWritableTextTuple2, Collector<Tuple2<Void, GenericRecord>> collector) throws Exception {
JsonAvroConverter converter = new JsonAvroConverter();
Schema schema = new Schema.Parser().parse(new File("test.avsc"));
try {
GenericRecord record = converter.convertToGenericDataRecord(longWritableTextTuple2.f1.toString().getBytes(), schema);
collector.collect( new Tuple2<Void,GenericRecord>(null,record));
}
catch (Exception e) {
System.out.println("error in converting to avro")
}
}
});
Job job = Job.getInstance();
HadoopOutputFormat parquetFormat = new HadoopOutputFormat<Void, GenericRecord>(new AvroParquetOutputFormat(), job);
FileOutputFormat.setOutputPath(job, new Path(outputPath));
df.output(parquetFormat);
env.execute();
Please help me with what I am doing wrong. I am getting an exception and this code is not working.
It's a little more complicated than with Spark. The only way I was able to read and write Parquet data in Flink is through the Hadoop & MapReduce compatibility layer. You need hadoop-mapreduce-client-core and flink-hadoop-compatibility in your dependencies.
Then you need to create a proper HadoopOutputFormat, something like this:
val job = Job.getInstance()
val hadoopOutFormat = new hadoop.mapreduce.HadoopOutputFormat[Void, SomeType](new AvroParquetOutputFormat(), job)
FileOutputFormat.setOutputPath(job, [somePath])
And then you can do:
dataStream.writeUsingOutputFormat(hadoopOutFormat)
You didn't say which exception you are getting, but here is a complete example of how to achieve this.
The main points are:
Use org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat
From dependency org.apache.flink:flink-hadoop-compatibility_2.11:1.11.0
HadoopOutputFormat is an adapter that allows you to use output formats developed for Hadoop
You need a DataSet<Tuple2<Void,IndexedRecord>>, because Hadoop's OutputFormat<K,V> works with key-value pairs; we are not interested in the key, so we use Void for the key type, and the value needs to be an Avro IndexedRecord or GenericRecord.
Use org.apache.parquet.avro.AvroParquetOutputFormat<IndexedRecord>
From dependency org.apache.parquet:parquet-avro:1.11.1
This Hadoop OutputFormat produces Parquet
This inherits from org.apache.parquet.hadoop.FileOutputFormat<Void, IndexedRecord>
Create your own subclass of IndexedRecord
You can't use new GenericData.Record(schema) because such a record is not serializable (java.io.NotSerializableException: org.apache.avro.Schema$Field is not serializable) and Flink requires it to be serializable.
You still need to provide a getSchema() method, but you can either return null or return a Schema that you hold in a static member (so that it doesn't need to be serialized, and you avoid the NotSerializableException above)
The source code
import org.apache.avro.Schema;
import org.apache.avro.generic.IndexedRecord;
import org.apache.commons.lang3.NotImplementedException;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.hadoop.mapreduce.HadoopOutputFormat;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.parquet.avro.AvroParquetOutputFormat;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;
import java.io.IOException;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.stream.Collectors;
import java.util.stream.IntStream;
import java.util.stream.Stream;
public class MyParquetTest implements Serializable {
public static void main(String[] args) throws Exception {
new MyParquetTest().start();
}
private void start() throws Exception {
final ExecutionEnvironment env = ExecutionEnvironment.createLocalEnvironment();
Configuration parameters = new Configuration();
Stream<String> stringStream = IntStream.range(1, 100).mapToObj(n -> String.format("Entry %d", n));
DataSet<String> text = env.fromCollection(stringStream.collect(Collectors.toCollection(ArrayList::new)));
Job job = Job.getInstance();
HadoopOutputFormat<Void, IndexedRecord> hadoopOutputFormat = new HadoopOutputFormat<>(new AvroParquetOutputFormat<IndexedRecord>(), job);
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, CompressionCodecName.SNAPPY.getHadoopCompressionCodecClass());
FileOutputFormat.setOutputPath(job, new org.apache.hadoop.fs.Path("./my-parquet"));
final Schema schema = new Schema.Parser().parse(MyRecord.class.getClassLoader().getResourceAsStream("schema.avsc"));
AvroParquetOutputFormat.setSchema(job, schema);
DataSet<Tuple2<Void, IndexedRecord>> text2 = text.map(new MapFunction<String, Tuple2<Void, IndexedRecord>>() {
@Override
public Tuple2<Void, IndexedRecord> map(String value) throws Exception {
return Tuple2.of(null, new MyRecord(value));
// IndexedRecord record = new GenericData.Record(schema); // won't work because Schema$Field is not serializable
// record.put(0, value);
// return Tuple2.of(null, record);
}
});
text2.output(hadoopOutputFormat);
env.execute("Flink Batch Java API Skeleton");
}
public static class MyRecord implements IndexedRecord {
private static Schema schema;
static {
try {
schema = new Schema.Parser().parse(MyRecord.class.getClassLoader().getResourceAsStream("schema.avsc"));
} catch (IOException e) {
e.printStackTrace();
}
}
private final String value;
public MyRecord(String value) {
this.value = value;
}
@Override
public void put(int i, Object v) {
throw new NotImplementedException("You can't update this IndexedRecord");
}
@Override
public Object get(int i) {
return this.value;
}
@Override
public Schema getSchema() {
return schema; // or just return null and remove the schema member
}
}
}
The schema.avsc is simply
{
"name": "aa",
"type": "record",
"fields": [
{"name": "value", "type": "string"}
]
}
and the dependencies:
implementation "org.apache.flink:flink-java:${flinkVersion}"
implementation "org.apache.flink:flink-avro:${flinkVersion}"
implementation "org.apache.flink:flink-streaming-java_${scalaBinaryVersion}:${flinkVersion}"
implementation "org.apache.flink:flink-hadoop-compatibility_${scalaBinaryVersion}:${flinkVersion}"
implementation "org.apache.parquet:parquet-avro:1.11.1"
implementation "org.apache.hadoop:hadoop-client:2.8.3"
You'll create a Flink OutputFormat via new HadoopOutputFormat(parquetOutputFormat, job), and then pass that to DataSet.output(xxx).
The job comes from...
import org.apache.hadoop.mapreduce.Job;
...
Job job = Job.getInstance();
The parquetOutputFormat is created via:
import org.apache.parquet.hadoop.ParquetOutputFormat;
...
ParquetOutputFormat<MyOutputType> parquetOutputFormat = new ParquetOutputFormat<>();
See https://javadoc.io/doc/org.apache.parquet/parquet-hadoop/1.10.1/org/apache/parquet/hadoop/ParquetOutputFormat.html
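Putting those fragments together, the wiring is more or less the following sketch (MyOutputType, the dataSet variable, and the output path are placeholders; as in the answer above, a plain ParquetOutputFormat still needs a WriteSupport configured, for example by using AvroParquetOutputFormat with a schema):
Job job = Job.getInstance();
ParquetOutputFormat<MyOutputType> parquetOutputFormat = new ParquetOutputFormat<>();
FileOutputFormat.setOutputPath(job, new org.apache.hadoop.fs.Path("s3://my-bucket/output"));

// wrap the Hadoop output format so Flink's DataSet API can use it
HadoopOutputFormat<Void, MyOutputType> flinkOutputFormat =
        new HadoopOutputFormat<>(parquetOutputFormat, job);

// dataSet must be a DataSet<Tuple2<Void, MyOutputType>>
dataSet.output(flinkOutputFormat);
env.execute("write parquet to s3");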

Does Flink SQL support Java Map types?

I'm trying to access a key from a map using Flink's SQL API. It fails with the error Exception in thread "main" org.apache.flink.table.api.TableException: Type is not supported: ANY
Please advise how I can fix it.
Here is my event class
public class EventHolder {
private Map<String,String> event;
public Map<String, String> getEvent() {
return event;
}
public void setEvent(Map<String, String> event) {
this.event = event;
}
}
Here is the main class which submits the flink job
public class MapTableSource {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<EventHolder> mapEventStream = env.fromCollection(getMaps());
// register a table and use SQL
StreamTableEnvironment tableEnv = TableEnvironment.getTableEnvironment(env);
tableEnv.registerDataStream("mapEvent", mapEventStream);
//tableEnv.registerFunction("orderSizeType", new OrderSizeType());
Table alerts = tableEnv.sql(
"select event['key'] from mapEvent ");
DataStream<String> alertStream = tableEnv.toAppendStream(alerts, String.class);
alertStream.filter(new FilterFunction<String>() {
private static final long serialVersionUID = -2438621539037257735L;
@Override
public boolean filter(String value) throws Exception {
System.out.println("Key value is:"+value);
return value!=null;
}
});
env.execute("map-tablsource-job");
}
private static List<EventHolder> getMaps(){
List<EventHolder> list = new ArrayList<>();
for(int i=0;i<5;i++){
EventHolder holder = new EventHolder();
Map<String,String> map = new HashMap<>();
map.put("key", "value");
holder.setEvent(map);
list.add(holder);
}
return list;
}
}
When I run it I'm getting the exception
Exception in thread "main" org.apache.flink.table.api.TableException: Type is not supported: ANY
at org.apache.flink.table.api.TableException$.apply(exceptions.scala:53)
at org.apache.flink.table.calcite.FlinkTypeFactory$.toTypeInfo(FlinkTypeFactory.scala:341)
at org.apache.flink.table.plan.logical.LogicalRelNode$$anonfun$12.apply(operators.scala:530)
at org.apache.flink.table.plan.logical.LogicalRelNode$$anonfun$12.apply(operators.scala:529)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
at scala.collection.AbstractTraversable.map(Traversable.scala:104)
at org.apache.flink.table.plan.logical.LogicalRelNode.<init>(operators.scala:529)
at org.apache.flink.table.api.TableEnvironment.sql(TableEnvironment.scala:503)
at com.c.p.flink.MapTableSource.main(MapTableSource.java:25)
I'm using flink 1.3.1
I think the problem lies in fromCollection. Flink is not able to extract the needed type information because of Java limitations (i.e., type erasure). Therefore your map is treated as a black box with the SQL ANY type. You can verify the types of your table by using tableEnv.scan("mapEvent").printSchema(). You can specify the type information in fromCollection with Types.MAP(Types.STRING, Types.STRING).
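A rough sketch of both suggestions (the getRawMaps() helper is hypothetical, and depending on the Flink version you may need new MapTypeInfo<>(...) instead of Types.MAP):
// verify what the Table API actually inferred for the registered stream
tableEnv.scan("mapEvent").printSchema();

// provide the type information explicitly instead of relying on extraction,
// e.g. when streaming the maps themselves rather than the EventHolder POJO
List<Map<String, String>> rawMaps = getRawMaps(); // hypothetical helper
DataStream<Map<String, String>> mapStream =
        env.fromCollection(rawMaps, Types.MAP(Types.STRING, Types.STRING));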
I solved a similar issue with the following:
//Should probably make MapVal more generic, but works for this example
public class MapVal extends ScalarFunction {
public String eval(Map<String, String> obj, String key) {
return obj.get(key);
}
}
public class Car {
private String make;
private String model;
private int year;
private Map<String, String> attributes;
//getters/setters...
}
//After registering Stream and TableEnv etc
tableEnv.registerFunction("mapval", new MapVal());
Table cars = tableEnv
.scan("Cars")
.select("make, model, year, attributes.mapval('name')");

How to filter Apache flink stream on the basis of other?

I have two streams: one is of Int and the other is of JSON. In the JSON schema there is one key which is some Int. So I need to filter the JSON stream via a key comparison with the other integer stream. Is this possible in Flink?
Yes, you can do this kind of stream processing with Flink. The basic building blocks you need from Flink are connected streams, and stateful functions -- here's an example using a RichCoFlatMap:
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
import org.apache.flink.util.Collector;
public class Connect {
public static void main(String[] args) throws Exception {
final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStream<Event> control = env.fromElements(
new Event(17),
new Event(42))
.keyBy("key");
DataStream<Event> data = env.fromElements(
new Event(2),
new Event(42),
new Event(6),
new Event(17),
new Event(8),
new Event(42)
)
.keyBy("key");
DataStream<Event> result = control
.connect(data)
.flatMap(new MyConnectedStreams());
result.print();
env.execute();
}
static final class MyConnectedStreams
extends RichCoFlatMapFunction<Event, Event, Event> {
private ValueState<Boolean> seen = null;
@Override
public void open(Configuration config) {
ValueStateDescriptor<Boolean> descriptor = new ValueStateDescriptor<>(
// state name
"have-seen-key",
// type information of state
TypeInformation.of(new TypeHint<Boolean>() {
}));
seen = getRuntimeContext().getState(descriptor);
}
@Override
public void flatMap1(Event control, Collector<Event> out) throws Exception {
seen.update(Boolean.TRUE);
}
@Override
public void flatMap2(Event data, Collector<Event> out) throws Exception {
if (seen.value() == Boolean.TRUE) {
out.collect(data);
}
}
}
public static final class Event {
public Event() {
}
public Event(int key) {
this.key = key;
}
public int key;
public String toString() {
return String.valueOf(key);
}
}
}
In this example, only those keys that have been seen on the control stream are passed through the data stream -- all other events are filtered out. I've taken advantage of Flink's managed keyed state and connected streams.
To keep this simple I've ignored your requirement that the data stream has JSON, but you can find examples of how to work with JSON and Flink elsewhere.
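If the data stream did carry JSON strings, one option (a sketch only, not part of the example above; the field name "key", the jsonStream variable, and the use of Jackson are assumptions) is to key it with a KeySelector that parses out the integer key before connecting the streams:
KeyedStream<String, Integer> keyedJson = jsonStream.keyBy(new KeySelector<String, Integer>() {
    @Override
    public Integer getKey(String json) throws Exception {
        // parse the JSON and use its integer field as the key
        return new ObjectMapper().readTree(json).get("key").asInt();
    }
});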
Note that your results will be non-deterministic, since you have no control over the timing of the two streams relative to one another. You could manage this by adding event-time timestamps to the streams, and then using a RichCoProcessFunction instead.
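For instance, event-time timestamps could be assigned to the (un-keyed) data stream before keying and connecting; a sketch only, with a hypothetical timestamp field, a hypothetical rawData stream, and an arbitrary out-of-orderness bound:
DataStream<Event> timestampedData = rawData.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(Event event) {
                return event.timestamp; // hypothetical event-time field
            }
        });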
