
How to join a stream and dataset?
I have a stream and I have static data in a file. I want to enrich the stream using the data in the file.
Example: in the stream I get airport codes, and in the file I have the airport names and their codes.
Now I want to join the stream data with the file to form a new stream that carries the airport names. Please provide steps on how to achieve this.

There are lots of ways to approach stream enrichment with Flink, depending on the exact requirements. https://www.youtube.com/watch?v=cJS18iKLUIY is a good talk by Konstantin Knauf that covers many different approaches, and the tradeoffs between them.
In the simple case where the enrichment data is immutable and reasonably small, I would just use a RichFlatMap and load the whole file in the open() method. That would look something like this:
public class EnrichmentWithPreloading extends RichFlatMapFunction<Event, EnrichedEvent> {

    private Map<Long, SensorReferenceData> referenceData;

    @Override
    public void open(final Configuration parameters) throws Exception {
        super.open(parameters);
        // load the whole (small, immutable) reference data set once per task instance
        referenceData = loadReferenceData();
    }

    @Override
    public void flatMap(
            final Event event,
            final Collector<EnrichedEvent> collector) throws Exception {
        SensorReferenceData sensorReferenceData =
                referenceData.get(event.getSensorId());
        collector.collect(new EnrichedEvent(event, sensorReferenceData));
    }
}
You'll find more code examples for other approaches in https://github.com/knaufk/enrichments-with-flink.
UPDATE:
If what you'd rather do is preload some larger, partitioned reference data to join with a stream, there are a few ways to approach this, some of which are covered in the video and repo I shared above. For those specific requirements, I suggest using a custom partitioner; there's an example in that same GitHub repo. The idea is that the enrichment data is sharded, and each streaming event is steered toward the instance that holds the relevant reference data.
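As a rough illustration of that idea, a sketch along these lines could work. EnrichmentWithPreloading is the class from above; the modulo partitioner, the getSensorId() key, and the assumption that each parallel instance loads only its own shard of the reference data in open() are mine, not taken from the repo:

import org.apache.flink.api.common.functions.Partitioner;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;

public class PartitionedEnrichmentJob {

    public static DataStream<EnrichedEvent> enrich(DataStream<Event> events) {
        return events
            // route each event to the subtask that holds its shard of the reference data
            .partitionCustom(
                new Partitioner<Long>() {
                    @Override
                    public int partition(Long key, int numPartitions) {
                        return (int) Math.floorMod(key, (long) numPartitions);
                    }
                },
                new KeySelector<Event, Long>() {
                    @Override
                    public Long getKey(Event event) {
                        return event.getSensorId();
                    }
                })
            // each EnrichmentWithPreloading instance would then load only its own shard in open()
            .flatMap(new EnrichmentWithPreloading());
    }
}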
In my opinion, this is simpler than trying to get the Table API to do this particular enrichment as a join.

Related

Unit testing Flink Topology without using MiniClusterWithClientResource

I have a Flink topology that consists of multiple Map and FlatMap transformations. The source/sink are from/to Kafka. The Kafka records are of type Envelope (defined by someone else), and are not marked as "serializable". I want to unit test this topology.
I defined a simple SourceFunction that returns a list of Envelope as the source:
public class MySource extends RichParallelSourceFunction<Envelope> {

    private List<Envelope> input;

    public MySource(List<Envelope> input) {
        this.input = input;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
    }

    @Override
    public void run(SourceContext<Envelope> ctx) throws Exception {
        for (Envelope listElement : input) {
            ctx.collect(listElement);
        }
    }

    @Override
    public void cancel() {}
}
I am using MiniClusterWithClientResource to unit test the topology. I ran into two problems:
I need to make MySource serializable, as Flink wants/needs to serialize the source. As a workaround, I made input transient. That allowed the code to compile.
Then I ran into the runtime error:
org.apache.flink.api.common.functions.InvalidTypesException: The return type of function 'Custom Source' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
I am trying to understand why I am getting this error, which I was not getting before, when the topology was consuming from a Kafka cluster using a KafkaConsumer. I found a workaround by providing the type info using the following:
.returns(TypeInformation.of(Envelope.class))
However, during runtime, after deserialization, input is set to null (obviously, as there is no deserialization method defined).
Questions:
Can someone please help me understand why I am getting the InvalidTypesException?
Why is MySource being serialized/deserialized? Is there a way I can avoid this while using MiniClusterWithClientResource?
I could hack some writeObject() and readObject() methods into MySource. But I would prefer to avoid that route. Is it possible to use some framework/class to test the topology without providing a Source (and Sink) that is Serializable? It would be great if I could use something like KeyedOneInputStreamOperatorTestHarness, to which I could pass the topology, and avoid the whole serialization/deserialization step at the beginning.
Any ideas / pointers would be greatly appreciated.
Thank you,
Ahmed.
"why I am getting the InvalidTypesException exception?"
Not sure, usually I'd need to see the workflow definition to understand where the type information is getting dropped.
"Why if MySource being deserialized/serialized?"
Because Flink distributes operators to multiple tasks on multiple machines by serializing them, then sending over the network, and then deserializing.
"Is there a way I can void this while using MiniClusterWithClientResource?"
Yes. Since the MiniCluster runs in a single JVM, you can use a static ConcurrentLinkedQueue to hold all of the Envelope records, and your MySource just reads from this queue.
Nit: Your MySource should set a transient boolean running flag to true in the open() method, false in the cancel() method, and check it in the run() method's loop.
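A minimal sketch of what that could look like, combining the static queue with the running flag (the class and field names are mine; Envelope is the record type from the question):

import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;

public class QueueBackedSource extends RichParallelSourceFunction<Envelope> {

    // shared within the single MiniCluster JVM; the test fills it before calling execute()
    public static final ConcurrentLinkedQueue<Envelope> INPUT = new ConcurrentLinkedQueue<>();

    private transient volatile boolean running;

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        running = true;
    }

    @Override
    public void run(SourceContext<Envelope> ctx) {
        Envelope next;
        // drain the queue; only the (stateless) function itself gets serialized, never the records
        while (running && (next = INPUT.poll()) != null) {
            ctx.collect(next);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

If type extraction still complains, the .returns(TypeInformation.of(Envelope.class)) hint from the question can still be applied on top of this source.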

Reading file that is being appended in Flink

We have a legacy application that is writing results as records to some local files. We want to process these records in real time, so we are planning to use Flink as the engine. I know that I can read text files using StreamExecutionEnvironment#readFile. It seems that we need something similar to PROCESS_CONTINUOUSLY there, but that flag causes the whole file to be reprocessed on each change, which is not what we want here.
Of course, I could write my own custom source that saves the number of records per file in its state. But I suppose there might be some problem with such an approach, with checkpointing or something; my reasoning is that if it were easy to implement reliably, it would already have been implemented in Flink.
Any tips/suggestions on how to approach this?
You can do this rather easily with a custom source, so long as you are content to be reading from a single file (per source instance). You will need to use operator state and implement checkpointing. The state handling and checkpointing will look something like this:
public class CheckpointedFileSource implements SourceFunction<Event>, ListCheckpointed<Long> {

    private long eventCnt = 0;
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<Event> sourceContext) throws Exception {
        final Object lock = sourceContext.getCheckpointLock();

        // skip over previously emitted events (the first eventCnt records of the file)
        ...

        while (running) {
            // read the next event (and its timestamp) from the file
            ...
            synchronized (lock) {
                eventCnt++;
                sourceContext.collectWithTimestamp(event, timestamp);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public List<Long> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
        return Collections.singletonList(eventCnt);
    }

    @Override
    public void restoreState(List<Long> state) throws Exception {
        for (Long s : state) {
            this.eventCnt = s;
        }
    }
}
For a complete example see the checkpointed taxi ride data source used in the Flink training exercises. You’ll have to adapt it a bit, since it’s designed to read a static file, rather than one that is being appended to.

Non Serializable object in Apache Flink

I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, as it reads several files present in HDFS before initialisation.
If I initialise the object in the open() method I get a timeout exception, and if I do it in the constructor of a sink/flatmap, I get a serialisation exception.
Currently I am using a static block to initialise the object in some other class, calling Preconditions.checkNotNull(MGenerator.mGenerator) in the main method, and then it works when used in a flatmap or sink.
Is there a way to create a non-serializable dependency object, which might take more than 10 seconds to initialise, for use in Flink's flatmap or sink?
public class DependencyWrap {
    static MGenerator mGenerator;

    static {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}

public class MyStreaming {
    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(MGenerator.mGenerator);

        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...
        input.flatMap(new RichFlatMapFunction<Map<String, Object>, List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Map<String, Object> value, Collector<List<String>> out) throws Exception {
                out.collect(MFVGenerator.mfvGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open() method is 100% the right way to do it. Is Flink giving you the timeout exception, or is it coming from the object itself?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject()/writeObject() methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once when jobs are distributed to task managers.
Using open() is usually the right place to load external lookup sources. The timeout is a bit odd; maybe there is a configuration setting for it.
However, if it's huge, using a static loader (either a static class as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be initialized lazily, when it's used for the first time, to avoid the timeout in open().
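A lazily initialized holder along these lines is one way to sketch it (the class name and double-checked locking are my own; the Config/Gson setup is copied from the question):

import com.google.gson.Gson;

public final class MGeneratorHolder {

    private static volatile MGenerator instance;

    private MGeneratorHolder() {}

    public static MGenerator get() {
        if (instance == null) {
            synchronized (MGeneratorHolder.class) {
                if (instance == null) {
                    // expensive: reads several files from HDFS, but only once per task manager JVM
                    Config config = new Gson().fromJson("{}", Config.class);
                    instance = new MGenerator(config);
                }
            }
        }
        return instance;
    }
}

Inside flatMap (or open()), the operator would then call MGeneratorHolder.get() instead of holding its own reference, so nothing non-serializable ends up in the function's closure.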
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand if there is interest.
I don't see an advantage in the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.

Emitting the "Side Outputs" and "process output" in single sink with different path

How to emit the "Side Outputs" and "process output" using single sink. Here, in this case, both output needs emit to single sink and based on the tag folder path differs
Eg
OutputTag<String> outputTag = new OutputTag<String>("side-output") {};

SingleOutputStreamOperator<String> mainDataStream = source.process(new ProcessFunction<String, String>() {
    @Override
    public void processElement(String value, Context ctx, Collector<String> out) {
        try {
            builder.parse(new InputSource(new StringReader(value)));
            out.collect(value);
        } catch (SAXException | IOException e) {
            ctx.output(outputTag, value);
        }
    }
});
DataStream<String> sideOutputStream = mainDataStream.getSideOutput(outputTag);
Is there any better solution? I'm just worried about performance.
If you want to use a single sink, you can add an attribute into your output format and use the attribute to identify the data source in the single sink.
You can also construct two sinks with different parameters to receive data from different sources. In my opinion, without considering the database you use, this kind of multi-threaded approach has better performance.
Flink's BucketingSink can use a Bucketer to determine which sub-directory inside of the base directory will be used. So you can use this to set the sub-directory based on an attribute in your record being written.
As far as using a single sink, since both the main output and the side output of your function are String objects (same type), you can mainDataStream.union(sideOutputStream) the two streams together before outputting the result.
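Putting those two suggestions together, a sketch could look like the following. The "valid"/"invalid" tags, the Tuple2 wrapper, and the helper method are assumptions for illustration; BucketingSink and Bucketer are the Flink classes mentioned above:

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.Clock;
import org.apache.flink.streaming.connectors.fs.bucketing.Bucketer;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;

public class TaggedBucketingSink {

    public static void attach(DataStream<String> mainDataStream,
                              DataStream<String> sideOutputStream,
                              String basePath) {
        // tag each record with its origin, then union the two streams of the same type
        DataStream<Tuple2<String, String>> tagged = tag(mainDataStream, "valid")
                .union(tag(sideOutputStream, "invalid"));

        BucketingSink<Tuple2<String, String>> sink = new BucketingSink<>(basePath);
        sink.setBucketer(new Bucketer<Tuple2<String, String>>() {
            @Override
            public Path getBucketPath(Clock clock, Path base, Tuple2<String, String> element) {
                return new Path(base, element.f0); // sub-directory chosen from the tag
            }
        });
        tagged.addSink(sink);
    }

    private static DataStream<Tuple2<String, String>> tag(DataStream<String> stream, String tag) {
        return stream.map(new MapFunction<String, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> map(String value) {
                return Tuple2.of(tag, value);
            }
        });
    }
}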

Apache Flink flatMap with millions of outputs

Whenever I receive a message, I want to do a read from a database, possibly returning millions of rows, which I then want to pass on down the stream. Is this considered good practice in Flink?
public static class StatsReader implements FlatMapFunction<Msg, Json> {

    Transactor txor = ...;

    @Override
    public void flatMap(Msg msg, Collector<Json> out) {
        // possibly a lazy and async stream of results
        java.util.stream.Stream<Json> results = txor.exec(Stats.read(msg));
        results.forEach(stat -> out.collect(stat));
    }
}
Edit:
Background: I would like to dynamically run a report. The DB basically acts as a huge window. The report is based on that window + live data. The report is highly customizable, therefore it's hard to preprocess results or define pipelines a priori.
I use vanilla Java today, and the pipeline is roughly like this:
ReportDefinition -> ( elasticsearch query + realtime stream ) -> ( ReportProcessingPipeline ) -> ( Websocket push )
In principle this should be possible. However, I'd recommend using an AsyncFunction instead of a FlatMapFunction.
Please note that such a setup might require tuning of the checkpointing parameters, such as the checkpoint interval.
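A rough sketch of that recommendation, reusing Msg, Json, Transactor and Stats.read() from the question (the default common-pool executor, the 30-second timeout and the capacity of 10 are placeholders, not recommendations):

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncStatsReader extends RichAsyncFunction<Msg, Json> {

    private transient Transactor txor;

    @Override
    public void open(Configuration parameters) {
        txor = ...; // construction of the client is elided, as in the question
    }

    @Override
    public void asyncInvoke(Msg msg, ResultFuture<Json> resultFuture) {
        // run the (potentially huge) read off the task thread and hand the rows to Flink
        CompletableFuture
                .supplyAsync(() -> txor.exec(Stats.read(msg)).collect(Collectors.toList()))
                .thenAccept(resultFuture::complete)
                .exceptionally(t -> {
                    resultFuture.completeExceptionally(t);
                    return null;
                });
    }
}

// Wiring it in, with the placeholder timeout and in-flight capacity:
// DataStream<Json> results = AsyncDataStream.unorderedWait(
//         messages, new AsyncStatsReader(), 30, TimeUnit.SECONDS, 10);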
