How can I initialize keyed state with initial values in Apache Flink?

I want to initialize MapState in a KeyedStream in Apache Flink with some initial values, as shown in the code snippet at the bottom of the post. Unfortunately, Flink does not let you do this in the open method, as explained here: Flink keyed stream key is null. As the post I referenced is from 2020, I am hoping something may have changed since then.
The initial values that I wish to put in the MapState would be the same for all keys.
What I have tried
Overriding the initializeState method of org.apache.flink.streaming.api.checkpoint.CheckpointedFunction and adding the myState.put(...) calls there, but this throws the same exception as doing it in the open method
This post mentions using OperatorState, but I don't think it works for my use case: Flink keyed stream key is null
I realize I can do something like if (myState.isEmpty()) { addInitialStateToMyState() } inside processElement, but I am hoping to avoid this
new KeyedProcessFunction<String, Row, String>() {

    MapState<String, String> myState;

    @Override
    public void processElement(final Row event, final KeyedProcessFunction<String, Row, String>.Context context, final Collector<String> collector) throws Exception {
        ...
        myState.put("this", "works");
        ...
    }

    @Override
    public void open(Configuration configuration) throws Exception {
        MapStateDescriptor<String, String> myStateDescriptor = new MapStateDescriptor<>(
                "my-state",
                String.class,
                String.class
        );
        myState = getRuntimeContext().getMapState(myStateDescriptor);
        // can't initialize here; can only do this inside `processElement`
        myState.put("this", "fails");
    }
}
Thanks for any help/insights!

Sadly, I don't think there is a prettier way to do this. All access to keyed state must be done in a key-aware fashion, which means it needs to happen in processElement. Operator state is generally more complex, so if the only reason for picking it would be code cleanliness, I'd avoid it.
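For completeness, a minimal sketch of that lazy-initialization pattern inside processElement, reusing the state from the question (the seeded key/value pair is just a placeholder):
@Override
public void processElement(final Row event, final KeyedProcessFunction<String, Row, String>.Context context, final Collector<String> collector) throws Exception {
    // Seed the per-key defaults the first time this key is seen; the same
    // defaults end up being written for every key, just lazily.
    if (myState.isEmpty()) {
        myState.put("initial-key", "initial-value"); // placeholder defaults
    }
    myState.put("this", "works");
    // ... rest of the processing ...
}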

Related

Flink Shared State and race conditions

Hey there, I am having a hard time understanding how shared state (ValueState, ListState, ...) works in Flink. If multiple instances of a task are running in parallel, how does Flink prevent race conditions?
In this example from the docs, if the operator is parallelized, how does Flink guarantee that there are no race conditions between the read and the update of the keyHasBeenSeen value?
public static class Deduplicator extends RichFlatMapFunction<Event, Event> {
    ValueState<Boolean> keyHasBeenSeen;

    @Override
    public void open(Configuration conf) {
        ValueStateDescriptor<Boolean> desc = new ValueStateDescriptor<>("keyHasBeenSeen", Types.BOOLEAN);
        keyHasBeenSeen = getRuntimeContext().getState(desc);
    }

    @Override
    public void flatMap(Event event, Collector<Event> out) throws Exception {
        if (keyHasBeenSeen.value() == null) {
            out.collect(event);
            keyHasBeenSeen.update(true);
        }
    }
}
There isn't any shared state in Flink. Having shared state would add complexity and impair scalability.
The value and update methods are scoped to the key of the current event. For any given key, all events for that key are processed by the same instance of the operator/function, and every task (a task is a chain of operator/function instances) is single-threaded.
By keeping things simple like this, there's nothing to worry about.
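As a small usage sketch (the key extractor is assumed, not from the docs), keyBy is what routes every event with a given key to the same Deduplicator instance, so its ValueState is partitioned per key and accessed by a single thread:
DataStream<Event> deduplicated = events
        .keyBy(event -> event.key)       // hypothetical key field
        .flatMap(new Deduplicator());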

Unit testing Flink Topology without using MiniClusterWithClientResource

I have a Flink topology that consists of multiple Map and FlatMap transformations. The source/sink read from/write to Kafka. The Kafka records are of type Envelope (defined by someone else), and are not marked as Serializable. I want to unit test this topology.
I defined a simple SourceFunction that emits a list of Envelope records as the source:
public class MySource extends RichParallelSourceFunction<Envelope> {
    private List<Envelope> input;

    public MySource(List<Envelope> input) {
        this.input = input;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
    }

    @Override
    public void run(SourceContext<Envelope> ctx) throws Exception {
        for (Envelope listElement : input) {
            ctx.collect(listElement);
        }
    }

    @Override
    public void cancel() {}
}
I am using MiniClusterWithClientResource to unit test the topology. I ran into two problems:
I need to make MySource serializable, as Flink wants/needs to serialize the source. As a workaround, I made input transient, which allowed the code to compile.
Then I ran into the runtime error:
org.apache.flink.api.common.functions.InvalidTypesException: The return type of function 'Custom Source' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
I am trying to understand why I am getting this error, which I was not getting before, when the topology was consuming from a Kafka cluster using a KafkaConsumer. I found a workaround by providing the type info using the following:
.returns(TypeInformation.of(Envelope.class))
However, at runtime, after deserialization, input is null (obviously, as there is no custom deserialization method defined).
Questions:
Can someone please help me understand why I am getting the InvalidTypesException exception?
Why is MySource being serialized/deserialized? Is there a way I can avoid this while using MiniClusterWithClientResource?
I could hack writeObject() and readObject() methods into MySource, but I prefer to avoid that route. Is it possible to use some framework/class to test the topology without providing a Source (and Sink) that is Serializable? It would be great if I could use something like KeyedOneInputStreamOperatorTestHarness to which I could pass the topology, and avoid the whole serialization/deserialization step at the beginning.
Any ideas / pointers would be greatly appreciated.
Thank you,
Ahmed.
"why I am getting the InvalidTypesException exception?"
Not sure, usually I'd need to see the workflow definition to understand where the type information is getting dropped.
"Why if MySource being deserialized/serialized?"
Because Flink distributes operators to multiple tasks on multiple machines by serializing them, then sending over the network, and then deserializing.
"Is there a way I can void this while using MiniClusterWithClientResource?"
Yes. Since the MiniCluster runs in a single JVM, you can use a static ConcurrentLinkedQueue to hold all of the Envelope records, and your MySource just reads from this queue.
Nit: Your MySource should set a transient boolean running flag to true in the open() method, false in the cancel() method, and check it in the run() method's loop.
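Putting both suggestions together, a rough sketch of such a test source (method and field names are illustrative, not a drop-in implementation; Envelope is the type from the question):
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

public class MySource extends RichParallelSourceFunction<Envelope> {

    // Static, so the records live in the test JVM and are shared by all subtasks;
    // nothing of substance has to survive serialization of the source itself.
    private static final ConcurrentLinkedQueue<Envelope> QUEUE = new ConcurrentLinkedQueue<>();

    private transient volatile boolean running;

    public static void addRecords(List<Envelope> records) {
        QUEUE.addAll(records);
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        super.open(parameters);
        running = true;
    }

    @Override
    public void run(SourceContext<Envelope> ctx) {
        while (running) {
            Envelope next = QUEUE.poll();
            if (next == null) {
                break; // queue drained; end this bounded test source
            }
            ctx.collect(next);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}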

How to get the Ingestion time of an event when the time characteristic is event-time?

I want to measure the time elapsed between an event arriving in the system and it being fully processed, and I think getting the ingestion time will help, but how do I get it?
You probably want to use latency tracking. Alternatively, you can add the processing time directly after the source in a chained process function (via context.timerService().currentProcessingTime()).
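For the latency-tracking route, a minimal sketch is a one-liner on the environment (the interval here is just an example value):
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.getConfig().setLatencyTrackingInterval(1000L); // emit latency markers every second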
Based on the reply from David, to get the ingestion time we can chain a process function directly after the source.
The code below shows how to get the ingestion time; since I also needed a metric for the difference between ingestion time and event time, I registered a histogram for that.
DataStream<EventDataMapping> text = env
        .fromSource(source, WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(5)), "Kafka Source")
        .process(new ProcessFunction<EventDataMapping, EventDataMapping>() {

            private transient DescriptiveStatisticsHistogram eventVsIngestionTimeLag;
            private static final int EVENT_TIME_LAG_WINDOW_SIZE = 10_000;

            @Override
            public void open(Configuration parameters) throws Exception {
                super.open(parameters);
                eventVsIngestionTimeLag = getRuntimeContext().getMetricGroup().histogram("eventVsIngestionTimeLag",
                        new DescriptiveStatisticsHistogram(EVENT_TIME_LAG_WINDOW_SIZE));
            }

            @Override
            public void processElement(EventDataMapping eventDataMapping, Context context, Collector<EventDataMapping> collector) throws Exception {
                LOG.info("process element event time " + context.timestamp() + " current ingest time " + context.timerService().currentProcessingTime());
                eventVsIngestionTimeLag.update(context.timerService().currentProcessingTime() - context.timestamp());
            }
        }).returns(EventDataMapping.class);

Reading file that is being appended in Flink

We have a legacy application that is writing results as records to some local files. We want to process these records in real time, thus we are planning to use Flink as the engine. I know that I can read text files using StreamExecutionEnvironment#readFile. It seems that we need something similar to PROCESS_CONTINUOUSLY there, but this flag causes the whole file to be reprocessed on each change, which is not what we want here.
Of course, I can write a custom source that saves the number of records read per file in its state. But I suppose there might be some problem with such an approach around checkpointing or something; my reasoning is that if it were easy to implement reliably, it would already have been implemented in Flink.
Any tips / suggestions how to approach this?
You can do this rather easily with a custom source, so long as you are content to be reading from a single file (per source instance). You will need to use operator state and implement checkpointing. The state handling and checkpointing will look something like this:
public class CheckpointedFileSource implements SourceFunction<Event>, ListCheckpointed<Long> {
    private long eventCnt = 0;

    public void run(SourceContext<Event> sourceContext) throws Exception {
        final Object lock = sourceContext.getCheckpointLock();

        // skip over previously emitted events
        ...

        while (not cancelled) {
            read event from file;
            synchronized (lock) {
                eventCnt++;
                sourceContext.collectWithTimestamp(event, timestamp);
            }
        }
    }

    @Override
    public List<Long> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
        return Collections.singletonList(eventCnt);
    }

    @Override
    public void restoreState(List<Long> state) throws Exception {
        for (Long s : state) {
            this.eventCnt = s;
        }
    }
}
For a complete example see the checkpointed taxi ride data source used in the Flink training exercises. You’ll have to adapt it a bit, since it’s designed to read a static file, rather than one that is being appended to.
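As a purely hypothetical sketch of what the elided "read event from file" part could become for a file that is being appended to, assuming the writer always appends complete lines and parse() is a made-up helper (running stands in for the "not cancelled" check above):
try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
    long lineNo = 0;
    while (running) {
        String line = reader.readLine();
        if (line == null) {
            Thread.sleep(100);          // reached current end of file; wait for more appends
            continue;
        }
        lineNo++;
        if (lineNo <= eventCnt) {
            continue;                   // skip events already emitted before a restore
        }
        Event event = parse(line);      // hypothetical parser
        synchronized (sourceContext.getCheckpointLock()) {
            eventCnt++;
            sourceContext.collectWithTimestamp(event, event.timestamp);
        }
    }
}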

Non Serializable object in Apache Flink

I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 seconds to create, as it reads several files from HDFS during initialisation.
If I initialise the object in the open method I get a timeout exception, and if I do it in the constructor of a sink/flatmap, I get a serialisation exception.
Currently I am using a static block to initialise the object in another class, calling Preconditions.checkNotNull(MGenerator.mGenerator) in the main method, and then it works when the object is used in a flatmap or sink.
Is there a way to create a non-serializable dependency's object, which might take more than 10 seconds to initialise, in Flink's flatmap or sink?
public class DependencyWrap {
    static MGenerator mGenerator;

    static {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        mGenerator = new MGenerator(config);
    }
}
public class MyStreaming {
    public static void main(String[] args) throws Exception {
        Preconditions.checkNotNull(MGenerator.mGenerator);

        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(parallelism);
        ...

        input.flatMap(new RichFlatMapFunction<Map<String, Object>, List<String>>() {
            @Override
            public void open(Configuration parameters) {
            }

            @Override
            public void flatMap(Map<String, Object> value, Collector<List<String>> out) throws Exception {
                out.collect(MFVGenerator.mfvGenerator.generateMyResult(value.f0, value.f1));
            }
        });
    }
}
Also, please correct me if I am wrong about the question.
Doing it in the open method is 100% the right way to do it. Is Flink giving you the timeout exception, or is it coming from the object itself?
As a last-ditch method, you could wrap your object in a class that contains both the object and its JSON string or Config (is Config serializable?), with the object marked transient, and then override the readObject/writeObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once when the job is distributed to the task managers.
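A hypothetical sketch of that wrapper (names are illustrative; only the small config string is serialized, and the heavy object is rebuilt when the wrapper is deserialized on a task manager):
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.Serializable;

import com.google.gson.Gson;

public class MGeneratorWrapper implements Serializable {

    private final String configStr;              // small and serializable
    private transient MGenerator mGenerator;     // heavy, rebuilt after deserialization

    public MGeneratorWrapper(String configStr) {
        this.configStr = configStr;
        this.mGenerator = build(configStr);
    }

    public MGenerator get() {
        return mGenerator;
    }

    private static MGenerator build(String configStr) {
        Config config = new Gson().fromJson(configStr, Config.class);
        return new MGenerator(config);
    }

    // Rebuild the transient field on deserialization; the default writeObject is
    // fine because only configStr needs to travel over the wire.
    private void readObject(ObjectInputStream in) throws IOException, ClassNotFoundException {
        in.defaultReadObject();
        this.mGenerator = build(configStr);
    }
}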
Using open is usually the right place to load external lookup sources. The timeout is a bit odd, maybe there is a configuration around it.
However, if it's huge, using a static loader (either a static class as you did, or a singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be lazily initialized when it is used for the first time, to avoid the timeout in open.
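A minimal sketch of such a lazily-initialized static loader (the holder class name is made up; MGenerator, Config, and Gson are taken from the question):
import com.google.gson.Gson;

public final class MGeneratorHolder {

    // Initialization-on-demand holder idiom: the JVM runs Lazy's static initializer
    // only once, on first access, so all parallel subtasks on the same task manager
    // share one instance and the expensive load happens outside open()'s timeout.
    private static final class Lazy {
        static final MGenerator INSTANCE = load();
    }

    private static MGenerator load() {
        final String configStr = "{}";
        final Config config = new Gson().fromJson(configStr, Config.class);
        return new MGenerator(config);
    }

    public static MGenerator get() {
        return Lazy.INSTANCE;
    }

    private MGeneratorHolder() {}
}
Inside the flatmap you would then call MGeneratorHolder.get() when processing the first element, which triggers the expensive load lazily and only once per JVM.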
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand if there is interest.
I don't see a benefit of using the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.
