Flink integration test(s) with Testcontainers - apache-flink

I have a simple Apache Flink job that looks very much like this:
public final class Application {
public static void main(final String... args) throws Exception {
final var env = StreamExecutionEnvironment.getExecutionEnvironment();
final var executionConfig = env.getConfig();
final var params = ParameterTool.fromArgs(args);
executionConfig.setGlobalJobParameters(params);
executionConfig.setParallelism(params.getInt("application.parallelism"));
final var source = KafkaSource.<CustomKafkaMessage>builder()
.setBootstrapServers(params.get("application.kafka.bootstrap-servers"))
.setGroupId(config.get("application.kafka.consumer.group-id"))
// .setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
.setStartingOffsets(OffsetsInitializer.earliest())
.setTopics(config.getString("application.kafka.listener.topics"))
.setValueOnlyDeserializer(new MessageDeserializationSchema())
.build();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "custom.kafka-source")
.uid("custom.kafka-source")
.rebalance()
.flatMap(new CustomFlatMapFunction())
.uid("custom.flatmap-function")
.filter(new CustomFilterFunction())
.uid("custom.filter-function")
.addSink(new CustomDiscardSink()) // Will be a Kafka sink in the future
.uid("custom.discard-sink");
env.execute(config.get("application.job-name"));
}
}
Problem is that I would like to provide an integration test for the entire application — sort of like an end-to-end (set of) test(s) for the entire job. I'm using Testcontainers, but I'm not really sure how to move forward with this. For instance, this is how the test looks like (for now):
#Testcontainers
final class ApplicationTest {
private static final DockerImageName DOCKER_IMAGE = DockerImageName.parse("confluentinc/cp-kafka:7.0.1");
#Container
private static final KafkaContainer KAFKA_CONTAINER = new KafkaContainer(DOCKER_IMAGE);
#ClassRule // How come this work in JUnit Jupiter? :/
public static MiniClusterResource cluster;
#BeforeAll
static void init() {
KAFKA_CONTAINER.start();
// ...probably need to wait and create the topic(s) as well
final var config = new MiniClusterResourceConfiguration.Builder().setNumberSlotsPerTaskManager(2)
.setNumberTaskManagers(1)
.build();
cluster = new MiniClusterResource(config);
}
#Test
void main() throws Exception {
// new Application(); // ...what's next?
}
}
I'm not sure how to implement what's required to trigger the job as-is from that point on. Basically, I would like to execute what was defined before, without (almost) any modifications — I've seen plenty of examples that practically build the entire job again, so that's not an option.
Can somebody provide any pointers here?
MessageDeserializationSchema is unbounded, so isEndOfStream returns false. Not sure if that's an impediment.

In order to make the pipeline more testable, I suggest you create a method on your Application class that takes a source and a sink as parameters, and creates and executes the pipeline, using those connectors.
In your tests you can call that method with special sources and sinks that you use for testing. In particular, you will want to use a KafkaSource that uses .setBounded(...) in the tests so that it cleanly handles just the range of data intended for the test(s).
The solutions and tests for the Apache Flink training exercises are organized along these lines; for example, see RideCleansingSolution.java and RideCleansingIntegrationTest.java. These examples don't use kafka or test containers, but hopefully they'll still be helpful.

I would suggest you instrument your application as an opaque-box test by interacting with it through its public API. This can be done either as an out-process test (e.g. by running your application in a container as well, using Testcontainers) are as an in-process test (by creating your Application and calling its main() method).
Now in your comments you explained, that you want to check for the side-effects of interacting with your application (Kafka messages being published). To check this, connect to the KafkaContainer with your own KafkaConsumer from within the test and use a library such as Awaitiliy to wait until the messages have been received.

Related

Unit testing Flink Topology without using MiniClusterWithClientResource

I have a Fink topology that consists of multiple Map and FlatMap transformations. The source/sink are from/to Kafka. The Kakfa records are of type Envelope (defined by someone else), and are not marked as "serializable". I want to Unit test this topology.
I defined a simple SourceFunction that returns a list of Envelope as the source:
public class MySource extends RichParallelSourceFunction<Envelope> {
private List<Envelope> input;
public MySource(List<Envelope> input) {
this.input = input;
}
#Override
public void open(Configuration parameters) throws Exception {
super.open(parameters);
}
#Override
public void run(SourceContext<Envelope> ctx) throws Exception {
for (Envelope listElement : inputOfSubtask) {
ctx.collect(listElement);
}
}
#Override
public void cancel() {}
}
I am using MiniClusterWithClientResource to Unit test the topology. I ran onto two problems:
I need to make MySource serializable, as Flink wants/needs to serialize the source. As a workaround, I make input transient. The allowed the code to compile.
Then I ran into the runtime error:
org.apache.flink.api.common.functions.InvalidTypesException: The return type of function 'Custom Source' could not be determined automatically, due to type erasure. You can give type information hints by using the returns(...) method on the result of the transformation call, or by letting your function implement the 'ResultTypeQueryable' interface.
I am trying to understand why I am getting this error, which I was not getting before when the topology is consuming from a kafka cluster using a KafkaConsumer. I found a workaround by providing the Type info using the following:
.returns(TypeInformation.of(Envelope.class))
However, during runtime, after deserialization, input is set to null (obviously, as there is no deserialization method defined.).
Questions:
Can someone please help me understand why I am getting the InvalidTypesException exception?
Why if MySource being deserialized/serialized? Is there a way I can void this while usingMiniClusterWithClientResource?
I could hack some writeObject() and readObject() method in MySource. But I prefer to avoid that route. Is it possible to use some framework / class to test the Topology without providing a Source (and Sink) that is Serializable? It would be great if I could use something like KeyedOneInputStreamOperatorTestHarness that I could pass as topology, and avoid the whole deserialization / serialization step in the beginning.
Any ideas / pointers would be greatly appreciated.
Thank you,
Ahmed.
"why I am getting the InvalidTypesException exception?"
Not sure, usually I'd need to see the workflow definition to understand where the type information is getting dropped.
"Why if MySource being deserialized/serialized?"
Because Flink distributes operators to multiple tasks on multiple machines by serializing them, then sending over the network, and then deserializing.
"Is there a way I can void this while using MiniClusterWithClientResource?"
Yes. Since the MiniCluster runs in a single JVM, you can use a static ConcurrentLinkedQueue to hold all of the Envelope records, and your MySource just reads from this queue.
Nit: Your MySource should set a transient boolean running flag to true in the open() method, false in the cancel() method, and check it in the run() method's loop.

Reading file that is being appended in Flink

We have a legacy application that is writing results as records to some local files. We want to process these records in real-time thus we are planning to use Flink as an engine. I know that I can read text files using StreamingExecutionEnvironment#readFile. It seems that we need something similar to PROCESS_CONTINUOUSLY there but this flag causes a whole file to be reprocessed on each change, what is not what we want here.
Of course, I can write my custom source that saves number of records per file in its state. But I suppose there might be some problem with such approach with checkpointing or something - my reasoning is that if that would be easy to implement reliably, it would have been already implemented in Flink.
Any tips / suggestions how to approach this?
You can do this rather easily with a custom source, so long as you are content to be reading from a single file (per source instance). You will need to use operator state and implement checkpointing. The state handling and checkpointing will look something like this:
public class CheckpointedFileSource implements SourceFunction<Event>, ListCheckpointed<Long> {
private long eventCnt = 0;
public void run(SourceContext<Event> sourceContext) throws Exception {
final Object lock = sourceContext.getCheckpointLock();
// skip over previously emitted events
...
while (not cancelled) {
read event from file;
synchronized (lock) {
eventCnt++;
sourceContext.collectWithTimestamp(event, timestamp);
}
}
}
#Override
public List<Long> snapshotState(long checkpointId, long checkpointTimestamp) throws Exception {
return Collections.singletonList(eventCnt);
}
#Override
public void restoreState(List<Long> state) throws Exception {
for (Long s : state)
this.eventCnt = s;
}
}
For a complete example see the checkpointed taxi ride data source used in the Flink training exercises. You’ll have to adapt it a bit, since it’s designed to read a static file, rather than one that is being appended to.

Non Serializable object in Apache Flink

I am using Apache Flink to perform analytics on streaming data.
I am using a dependency whose object takes more than 10 secs to create as it is reads several files present in hdfs before initialisation.
If I initialise the object in open method I get a timeout Exception and if in the constructor of a sink/flatmap, I get serialisation exception.
Currently I am using static block to initialise the object in some other class, using Preconditions.checkNotNull(MGenerator.mGenerator) in main file and then it's working if used in a flatmap of sink.
Is there a way to create a non serializable dependency's object which might take more than 10 secs to be initialised in Flink's flatmap or sink?
public class DependencyWrap {
static MGenerator mGenerator;
static {
final String configStr = "{}";
final Config config = new Gson().fromJson(config, Config.class);
mGenerator = new MGenerator(config);
}
}
public class MyStreaming {
public static void main(String[] args) throws Exception {
Preconditions.checkNotNull(MGenerator.mGenerator);
final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(parallelism);
...
input.flatMap(new RichFlatMapFunction<Map<String,Object>,List<String>>() {
#Override
public void open(Configuration parameters) {
}
#Override
public void flatMap(Map<String,Object> value, Collector<List<String>> out) throws Exception {
out.collect(MFVGenerator.mfvGenerator.generateMyResult(value.f0, value.f1));
}
});
}
}
Also, Please correct me if I am wrong about the question.
Doing it in the Open method is 100% the right way to do it. Is Flink giving you a timeout exception, or the object?
As a last ditch method, you could wrap your object in a class that contains both the object and it's JSON string or Config (is Config serializable?) with the object marked transient and then override the ReadObject/WriteObject methods to call the constructor. If the mGenerator object itself is stateless (and you'll have other problems if it's not), the serialization code should get called only once when jobs are distributed to taskmanagers.
Using open is usually the right place to load external lookup sources. The timeout is a bit odd, maybe there is a configuration around it.
However, if it's huge using a static loader (either static class as you did or singleton) has the benefit that you only need to load it once for all parallel instances of the task on the same task manager. Hence, you save memory and CPU time. This is especially true for you, as you use the same data structure in two separate tasks. Further, the static loader can be lazily initialized when it's used for the first time to avoid the timeout in open.
The clear downside of this approach is that the testability of your code suffers. There are some ways around that, which I could expand if there is interest.
I don't see a benefit of using the proxy serializer pattern. It's unnecessarily complex (custom serialization in Java) and offers little benefit.

Apache camel how to test a rollback scenario

I have a generic message router that has its routes created at run time by various route builders based on some configuration.
The configuration is stored as XML and when loaded in memory it gets converted into a RouteConfig domain object exposing via its getters how the route should be build. Such a RouteConfig will have a getFromUri(), getDestinations(), getDeadLetterUri(), isTransacted(), etc methods defined.
An example such a route builder would look like below:
public class NonTransactedRouteBuilder extends AbstractRouteBuilder {
#Override
protected void buildRoute(String endPoint, RouteConfig routeConfig) {
RouteBean routeBean = getRouteBean(routeConfig.getBean());
final String[] destinations = routeConfig.getDestinations();
from(endPoint).routeId(createRouteId())
.autoStartup(false)
.threads(routeConfig.getThreads())
.filter(body().isNotNull())
.process((Processor) routeBean)
.filter(body().isNotNull())
.choice()
.when(header("dead.letter").isNotNull())
.to(getDeadLetterUri())
.otherwise()
.loadBalance().random()
.to(destinations)
.endChoice();
}
}
The above is just an example to give you an idea about what kind of routes we build. The only one relevant thing is that the route is not transacted. Now to unit test it works as expected we extend the TestNG flavor of CamelTestSupport and pass a mocked RouteConfig instance that will result in a route being configured to move messages between a direct:test and two mock:test1 and mock:test2 end points. The route bean has a bit of logic in it and depending of the message content help us with all testing scenarios:
#Test
public void shouldDiscardNullMessages() throws Exception {
...
}
#Test
public void shouldDiscardScamMessages() throws Exception {
...
}
#Test
public void shouldRouteMessageToDeadLetterQueue() throws Exception {
...
}
#Test(timeOut = 1000)
public void shouldRouteMessagesInALoadBalancedWay()
...
}
Everything works fine and we are very happy we can make code changes and test them immediately. However most of our routes builders will build transacted routes. From our integration and end to end tests we know the transacted functionality works fine but would be so much better to be able to test this at your code changing point with an unit test.
So my question is:
Having the same route as above with just an autoStartup(false).transacted() change on it would it be a way to getting the message back in the sender direct:test end point so we can cover this part of the functionality as well. We would be happy with any work around suggestion just to prove this aspect works.
Thank you in advance for your inputs.
UPDATE 1:
One of the things I tried was to configure my test camel context with a TransactionErrorHandler that had a mock jta transaction manager injected. Something like bleow:
#Test
public void shouldBeAbleToRollback() throws Exception {
TransactionErrorHandlerBuilder errorHandlerBuilder = new TransactionErrorHandlerBuilder();
errorHandlerBuilder.setTransactionManager(jtaTransactionManagerMock);
context().setErrorHandlerBuilder(errorHandlerBuilder);
template.sendBody(FROM_1, "rollback message");
...
}
Then I hoped that I will be able to capture a jtaTransactionManagerMock.rollback() but this did not happen. Wat would be the reason not to work.
UPDATE 2:
Unable to achieve the above I stepped back and started integrating ActiveMQ as a transactional resource in my unit tests and everything worked fine. In reality our routes include also file and database end points but for the purpose of unit testing a generic queue builder using just a JMS resource is enough. I did not realized how easy would be when you deal all day long with Webshere MQ manager where you need to have MQ manager created and configured and you have to have the queues already there and all the heavy infrastructure that you have to build for this. ActiveMQ just did a very good job.
However I will still be interested in whether there would be an way to mock this transnational behavior.

Android Unit Tests Requiring Context

I am writing my first Android database backend and I'm struggling to unit test the creation of my database.
Currently the problem I am encountering is obtaining a valid Context object to pass to my implementation of SQLiteOpenHelper. Is there a way to get a Context object in a class extending TestCase? The solution I have thought of is to instantiate an Activity in the setup method of my TestCase and then assigning the Context of that Activity to a field variable which my test methods can access...but it seems like there should be an easier way.
You can use InstrumentationRegistry methods to get a Context:
InstrumentationRegistry.getTargetContext() - provides the application Context of the target application.
InstrumentationRegistry.getContext() - provides the Context of this Instrumentation’s package.
For AndroidX use InstrumentationRegistry.getInstrumentation().getTargetContext() or InstrumentationRegistry.getInstrumentation().getContext().
New API for AndroidX:
ApplicationProvider.getApplicationContext()
You might try switching to AndroidTestCase. From looking at the docs, it seems like it should be able to provide you with a valid Context to pass to SQLiteOpenHelper.
Edit:
Keep in mind that you probably have to have your tests setup in an "Android Test Project" in Eclipse, since the tests will try to execute on the emulator (or real device).
Your test is not a Unit test!!!
When you need
Context
Read or Write on storage
Access Network
Or change any config to test your function
You are not writing a unit test.
You need to write your test in androidTest package
Using the AndroidTestCase:getContext() method only gives a stub Context in my experience. For my tests, I'm using an empty activity in my main app and getting the Context via that. Am also extending the test suite class with the ActivityInstrumentationTestCase2 class. Seems to work for me.
public class DatabaseTest extends ActivityInstrumentationTestCase2<EmptyActivity>
EmptyActivity activity;
Context mContext = null;
...
#Before
public void setUp() {
activity = getActivity();
mContext = activity;
}
... //tests to follow
}
What does everyone else do?
You can derive from MockContext and return for example a MockResources on getResources(), a valid ContentResolver on getContentResolver(), etc. That allows, with some pain, some unit tests.
The alternative is to run for example Robolectric which simulates a whole Android OS. Those would be for system tests: It's a lot slower to run.
You should use ApplicationTestCase or ServiceTestCase.
Extending AndroidTestCase and calling AndroidTestCase:getContext() has worked fine for me to get Context for and use it with an SQLiteDatabase.
The only niggle is that the database it creates and/or uses will be the same as the one used by the production application so you will probably want to use a different filename for both
eg.
public static final String NOTES_DB = "notestore.db";
public static final String DEBUG_NOTES_DB = "DEBUG_notestore.db";
First Create Test Class under (androidTest).
Now use following code:
public class YourDBTest extends InstrumentationTestCase {
private DBContracts.DatabaseHelper db;
private RenamingDelegatingContext context;
#Override
public void setUp() throws Exception {
super.setUp();
context = new RenamingDelegatingContext(getInstrumentation().getTargetContext(), "test_");
db = new DBContracts.DatabaseHelper(context);
}
#Override
public void tearDown() throws Exception {
db.close();
super.tearDown();
}
#Test
public void test1() throws Exception {
// here is your context
context = context;
}}
Initialize context like this in your Test File
private val context = mock(Context::class.java)

Resources