Unit testing Flink Topology without using MiniClusterWithClientResource - apache-flink

I have a Fink topology that consists of multiple Map and FlatMap transformations. The source/sink are from/to Kafka. The Kakfa records are of type Envelope (defined by someone else), and are not marked as "serializable". I want to Unit test this topology.
I defined a simple SourceFunction that returns a list of Envelope as the source:
public class MySource extends RichParallelSourceFunction<Envelope> {
private List<Envelope> input;
public MySource(List<Envelope> input) {
this.input = input;
}
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
}
#Override
public void run(SourceContext<Envelope> ctx) throws Exception {
for (Envelope listElement : inputOfSubtask) {
ctx.collect(listElement);
}
}
#Override
public void cancel() {}
}
I am using MiniClusterWithClientResource to Unit test the topology. I ran onto two problems:
I need to make MySource serializable, as Flink wants/needs to serialize the source. As a workaround, I make input transient. The allowed the code to compile.
Then I ran into the runtime error:
org.apache.flink.api.common.functions.InvalidTypesException: The return type of function 'Custom Source' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
I am trying to understand why I am getting this error, which I was not getting before when the topology is consuming from a kafka cluster using a KafkaConsumer. I found a workaround by providing the Type info using the following:
.returns(TypeInformation.of(Envelope.class))
However, during runtime, after deserialization, input is set to null (obviously, as there is no deserialization method defined.).
Questions:
Can someone please help me understand why I am getting the InvalidTypesException exception?
Why if MySource being deserialized/serialized? Is there a way I can void this while usingMiniClusterWithClientResource?
I could hack some writeObject() and readObject() method in MySource. But I prefer to avoid that route. Is it possible to use some framework / class to test the Topology without providing a Source (and Sink) that is Serializable? It would be great if I could use something like KeyedOneInputStreamOperatorTestHarness that I could pass as topology, and avoid the whole deserialization / serialization step in the beginning.
Any ideas / pointers would be greatly appreciated.
Thank you,
Ahmed.

"why I am getting the InvalidTypesException exception?"
Not sure, usually I'd need to see the workflow definition to understand where the type information is getting dropped.
"Why if MySource being deserialized/serialized?"
Because Flink distributes operators to multiple tasks on multiple machines by serializing them, then sending over the network, and then deserializing.
"Is there a way I can void this while using MiniClusterWithClientResource?"
Yes. Since the MiniCluster runs in a single JVM, you can use a static ConcurrentLinkedQueue to hold all of the Envelope records, and your MySource just reads from this queue.
Nit: Your MySource should set a transient boolean running flag to true in the open() method, false in the cancel() method, and check it in the run() method's loop.

Related

Flink integration test(s) with Testcontainers

I have a simple Apache Flink job that looks very much like this:
public final class Application {
public static void main(final String... args) throws Exception {
final var env = StreamExecutionEnvironment.getExecutionEnvironment();
final var executionConfig = env.getConfig();
final var params = ParameterTool.fromArgs(args);
executionConfig.setGlobalJobParameters(params);
executionConfig.setParallelism(params.getInt("application.parallelism"));
final var source = KafkaSource.<CustomKafkaMessage>builder()
.setBootstrapServers(params.get("application.kafka.bootstrap-servers"))
.setGroupId(config.get("application.kafka.consumer.group-id"))
// .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
.setStartingOffsets(OffsetsInitializer.earliest())
.setTopics(config.getString("application.kafka.listener.topics"))
.setValueOnlyDeserializer(new MessageDeserializationSchema())
.build();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "custom.kafka-source")
.uid("custom.kafka-source")
.rebalance()
.flatMap(new CustomFlatMapFunction())
.uid("custom.flatmap-function")
.filter(new CustomFilterFunction())
.uid("custom.filter-function")
.addSink(new CustomDiscardSink()) // Will be a Kafka sink in the future
.uid("custom.discard-sink");
env.execute(config.get("application.job-name"));
}
}
Problem is that I would like to provide an integration test for the entire application — sort of like an end-to-end (set of) test(s) for the entire job. I'm using Testcontainers, but I'm not really sure how to move forward with this. For instance, this is how the test looks like (for now):
#Testcontainers
final class ApplicationTest {
private static final DockerImageName DOCKER_IMAGE = DockerImageName.parse("confluentinc/cp-kafka:7.0.1");
#Container
private static final KafkaContainer KAFKA_CONTAINER = new KafkaContainer(DOCKER_IMAGE);
#ClassRule // How come this work in JUnit Jupiter? :/
public static MiniClusterResource cluster;
#BeforeAll
static void init() {
KAFKA_CONTAINER.start();
// ...probably need to wait and create the topic(s) as well
final var config = new MiniClusterResourceConfiguration.Builder().setNumberSlotsPerTaskManager(2)
.setNumberTaskManagers(1)
.build();
cluster = new MiniClusterResource(config);
}
#Test
void main() throws Exception {
// new Application(); // ...what's next?
}
}
I'm not sure how to implement what's required to trigger the job as-is from that point on. Basically, I would like to execute what was defined before, without (almost) any modifications — I've seen plenty of examples that practically build the entire job again, so that's not an option.
Can somebody provide any pointers here?
MessageDeserializationSchema is unbounded, so isEndOfStream returns false. Not sure if that's an impediment.
In order to make the pipeline more testable, I suggest you create a method on your Application class that takes a source and a sink as parameters, and creates and executes the pipeline, using those connectors.
In your tests you can call that method with special sources and sinks that you use for testing. In particular, you will want to use a KafkaSource that uses .setBounded(...) in the tests so that it cleanly handles just the range of data intended for the test(s).
The solutions and tests for the Apache Flink training exercises are organized along these lines; for example, see RideCleansingSolution.java and RideCleansingIntegrationTest.java. These examples don't use kafka or test containers, but hopefully they'll still be helpful.
I would suggest you instrument your application as an opaque-box test by interacting with it through its public API. This can be done either as an out-process test (e.g. by running your application in a container as well, using Testcontainers) are as an in-process test (by creating your Application and calling its main() method).
Now in your comments you explained, that you want to check for the side-effects of interacting with your application (Kafka messages being published). To check this, connect to the KafkaContainer with your own KafkaConsumer from within the test and use a library such as Awaitiliy to wait until the messages have been received.

Reading file that is being appended in Flink

We have a legacy application that is writing results as records to some local files. We want to process these records in real-time thus we are planning to use Flink as an engine. I know that I can read text files using StreamingExecutionEnvironment#readFile. It seems that we need something similar to PROCESS_CONTINUOUSLY there but this flag causes a whole file to be reprocessed on each change, what is not what we want here.
Of course, I can write my custom source that saves number of records per file in its state. But I suppose there might be some problem with such approach with checkpointing or something - my reasoning is that if that would be easy to implement reliably, it would have been already implemented in Flink.
Any tips / suggestions how to approach this?
You can do this rather easily with a custom source, so long as you are content to be reading from a single file (per source instance). You will need to use operator state and implement checkpointing. The state handling and checkpointing will look something like this:
public class CheckpointedFileSource implements SourceFunction<Event>, ListCheckpointed<Long> {
private long eventCnt = 0;
public void run(SourceContext<Event> sourceContext) throws Exception {
final Object lock = sourceContext.getCheckpointLock();
// skip over previously emitted events
...
while (not cancelled) {
read event from file;
synchronized (lock) {
eventCnt++;
sourceContext.collectWithTimestamp(event, timestamp);
}
}
}
#Override
public List<Long> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
return Collections.singletonList(eventCnt);
}
#Override
public void restoreState(List<Long> state) throws Exception {
for (Long s : state)
this.eventCnt = s;
}
}
For a complete example see the checkpointed taxi ride data source used in the Flink training exercises. You’ll have to adapt it a bit, since it’s designed to read a static file, rather than one that is being appended to.

Non Serializable object in Apache Flink

I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 secs to create as it is reads several files present in hdfs before initialisation.
If I initialise the object in open method I get a timeout Exception and if in the constructor of a sink/flatmap, I get serialisation exception.
Currently I am using static block to initialise the object in some other class, using Preconditions.checkNotNull(MGenerator.mGenerator) in main file and then it's working if used in a flatmap of sink.
Is there a way to create a non serializable dependency's object which might take more than 10 secs to be initialised in Flink's flatmap or sink?
public class DependencyWrap {
static MGenerator mGenerator;
static {
final String configStr = "{}";
final Config config = new Gson().fromJson(config, Config.class);
mGenerator = new MGenerator(config);
}
}
public class MyStreaming {
public static void main(String[] args) throws Exception {
Preconditions.checkNotNull(MGenerator.mGenerator);
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(parallelism);
...
input.flatMap(new RichFlatMapFunction<Map<String,Object>,List<String>>() {
#Override
public void open(Configuration parameters) {
}
#Override
public void flatMap(Map<String,Object> value, Collector<List<String>> out) throws Exception {
out.collect(MFVGenerator.mfvGenerator.generateMyResult(value.f0, value.f1));
}
});
}
}
Also, Please correct me if I am wrong about the question.
Doing it in the Open method is 100% the right way to do it. Is Flink giving you a timeout exception, or the object?
As a last ditch method, you could wrap your object in a class that contains both the object and it's JSON string or Config (is Config serializable?) with the object marked transient and then override the ReadObject/WriteObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once when jobs are distributed to taskmanagers.
Using open is usually the right place to load external lookup sources. The timeout is a bit odd, maybe there is a configuration around it.
However, if it's huge using a static loader (either static class as you did or singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be lazily initialized when it's used for the first time to avoid the timeout in open.
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand if there is interest.
I don't see a benefit of using the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.

getting numOfRecordsIn using counters in Flink

I want to show numRecordsIn for an operator in Flink and for doing this I have been following ppt by data artisans at here. code for the counter is given below
public static class mapper extends RichMapFunction<String,String>{
public Counter counter;
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
this.counter = getRuntimeContext()
.getMetricGroup()
.counter("numRecordsIn");
}
#Override
public String map(String s) throws Exception {
counter.inc();
System.out.println("counter val " + counter.toString());
return null;
}
}
The problem is that how do I specify which operator I want to show number_of_Records_In?
Metric counter are exposed via Flink's metric system. In order to take a look at them, you have to configure a metric reporter. A description how to register a metric reporter can be found here.
Flink includes a number of built-in metrics, including numRecordsIn. So if that's what you want to measure, there's no need to write any code to implement that particular measurement. Similarly for numRecordsInPerSecond, and a host of others.
The code you asked about causes the numRecordsIn counter to be incremented for the operator in which the metric is being used.
A good way to better understand the metrics system is to bring up a simple streaming job and look at the metrics in Flink's web ui. I also found it really helpful to query the monitoring REST api while a job was running.

Is it a good practice to create lots of jobs in a flink cluster?

I am planning to create a rule engine with flink streaming.
There are some requirements on the execution :
All events to be executed against the rule set must be read from kafka.
All rules must be executed in a limited lapse of time.
The problem is that a rule can be added at runtime, so I can't simply create lots of jobs to handle all incoming messages because I risk to exceed the maximum time allowing to execute the rules.
I have the assurance that a single rule can be executed under the time limit.
So I wonder if it's a good practice to create one job by rule and add more jobs when new rules are coming ? (this may be hundred of rules).
I have the intuition that this is not the way to handle the problem but nothing really rational to explain why.
A second approach would be to maintain a queue (in zookeeper for example) to keep a trace about which rule has been executed against which event. So the work of each job only consist to :
pick a rule in the queue
execute it against the event
do it again until all rules have been executed against the event
If you want to dynamically adapt your program logic you can use the co-flatmap operator. The co-flatmap operator has two inputs, one would be the normal event input and the other one would be the rule input. Internally you will store the rules and apply them to incoming events from the other input.
The following could look like:
DataStream<Input> input = ...
DataStream<Rule> rules = ...
input
.connect(rules)
.keyBy(keySelector1, keySelector2)
.flatMap(new MyCoFlatMap());
public static class MyCoFlatMap implements CoFlatMapFunction<Input, Rule, Output> {
#Override
public void flatMap1(Input input, Collector<Output> collector) throws Exception {
// process input
}
#Override
public void flatMap2(Rule rule, Collector<Output> collector) throws Exception {
// store rules
}
}

Resources