SpringXD counter not working with Kafka source - analytics

I am using SpringXD 1.3 and Apache Kafka 0.9.0.0.
I have a functioning Kafka producer that I was able to configure as a Kafka source in Spring XD (I use a Groovy script to transform the message before logging it).
stream create --name metrics1 --definition "kafka --topic=metrics | transform --script=MetricsInterpreter.groovy | log" --deploy
I can see my Kafka messages getting printed in Spring XD logs. So this stream is working as intended.
However, the counter I create doesn't show up in the list of counters.
stream create --name metrics1tap1 --definition "tap:stream:metrics1 > counter --name=hitcount" --deploy
Although I get a success message (Created and deployed new stream 'metrics1tap1'), this counter does not show up when I list counters using the "counter list" command.
I tried the TwitterSearch counter example from the documentation and that worked fine.
Question: Is there a configuration/setup step that I am missing? Why would my own counter not work in this case?
(FYI both Kafka and SpringXD are running in dev/single-node mode)

Just to confirm: counter list will only display a specific counter once at least one value has been recorded in it.
Are you sure at least one message has actually been received by the counter?
Also, when you run stream list, do you see the stream metrics1tap1 there?

Related

No Output Received When Flink Streaming Execution Environment Passed With Custom Configuration

I'm running Apache Flink version 1.12.7 and configured the Streaming Execution Environment with the number of task slots for the task manager set to 3 (just experimenting), but I am unable to see the output of a file read by the environment. Instead, as seen in the logs, the Execution Graph is stuck in the SCHEDULED state and never switches to RUNNING.
Note that if no configuration is passed via the properties file, everything works fine and the output is printed, since the Execution Graph switches to RUNNING and the environment is able to read the file.
The code is as follows:
ParameterTool parameters = ParameterTool.fromPropertiesFile("src/main/resources/application.properties");
Configuration config = Configuration.fromMap(parameters.toMap());
TaskExecutorResourceUtils.adjustForLocalExecution(config);

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(config);
System.out.println("Config Params : " + config.toMap());

DataStream<String> inputStream = env.readTextFile(FILEPATH);

DataStream<String> filteredData = inputStream.filter((String value) -> {
    String[] tokens = value.split(",");
    return Double.parseDouble(tokens[3]) >= 75.0;
});

filteredData.print(); // no output seen if the configuration object is set; otherwise everything works as expected

env.execute("Filter Country Details");
I need help understanding this behaviour and what changes should be made so that the output gets printed while still passing some custom configuration. Thank you.
Okay, so I found the answer to the above puzzle by referring to the links mentioned below.
Solution: I set the parallelism (env.setParallelism) in the above code just after configuring the streaming execution environment, and the file was read with the output generated as expected.
After that, I experimented with a few things:
parallelism equal to the number of task slots = everything worked
parallelism greater than the number of task slots = intermittent results
parallelism less than the number of task slots = intermittent results
As per this link on the Flink Architecture:
A Flink cluster needs exactly as many task slots as the highest parallelism used in the job
So it's best to keep the number of task slots per task manager equal to the parallelism configured for the job (a minimal sketch of the fix follows below).
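As an illustration only, here is a minimal sketch of that fix, assuming the same properties file and the three task slots described above (the input path and the parallelism value of 3 are illustrative assumptions, and TaskExecutorResourceUtils.adjustForLocalExecution is omitted):

import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FilterCountryDetails {
    public static void main(String[] args) throws Exception {
        ParameterTool parameters =
                ParameterTool.fromPropertiesFile("src/main/resources/application.properties");
        Configuration config = Configuration.fromMap(parameters.toMap());

        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment(config);

        // Match the job parallelism to the number of task slots configured above so the
        // ExecutionGraph can acquire every slot it requests and switch from SCHEDULED to RUNNING.
        env.setParallelism(3);

        env.readTextFile("src/main/resources/input.csv") // illustrative path
           .print();

        env.execute("Filter Country Details");
    }
}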

Apache Flink : Batch Mode failing for DataStream APIs with exception `IllegalStateException: Checkpointing is not allowed with sorted inputs.`

This is a continuation of: Flink : Handling Keyed Streams with data older than application watermark
Based on the suggestion there, I have been trying to add batch support to the same Flink application, which uses the DataStream APIs.
The logic is something like this:
streamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);

streamExecutionEnvironment.readTextFile("fileName")
    .process(/* process function which transforms the input */)
    .assignTimestampsAndWatermarks(WatermarkStrategy
        .<DetectionEvent>forBoundedOutOfOrderness(orderness)
        .withTimestampAssigner(
            (SerializableTimestampAssigner<DetectionEvent>) (event, l) -> event.getEventTime()))
    .keyBy(keyFunction)
    .window(TumblingEventTimeWindows.of(Time.days(x)))
    .process(processWindowFunction);
Based on the public docs, my understanding was that I simply needed to change the source to a bounded one. However, the above pipeline keeps failing at the event trigger after the windowing step with the exception below:
java.lang.IllegalStateException: Checkpointing is not allowed with sorted inputs.
at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193)
at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.init(OneInputStreamTask.java:99)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:552)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571)
at java.base/java.lang.Thread.run(Thread.java:829)
The input file contains the historical events for multiple keys. The data for a given key is sorted, but the overall data is not. I have also added an event at the end of each key with timestamp = MAX_WATERMARK to indicate the end of the keyed stream. I tried it for a single key as well, but the processing failed with the same exception.
Note: I have not enabled checkpointing.
I have also tried explicitly disabling checkpointing to no avail.
env.getCheckpointConfig().disableCheckpointing();
EDIT - 1
Adding more details:
I also tried switching to a FileSource to read the file, but I am still getting the same exception.
environment.fromSource(
        FileSource.forRecordStreamFormat(new TextLineFormat(), path).build(),
        WatermarkStrategy.noWatermarks(),
        "Text File")
The first process step and the key splitting work. However, it fails after that. I tried removing the windowing and adding a simple process step instead, but it continues to fail.
There is no explicit sink; the last process function simply updates a database.
Is there something I'm missing?
That exception can only be thrown if checkpointing is enabled. Perhaps you have a checkpointing interval configured in flink-conf.yaml?
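One way to check this from the job itself is to log the effective CheckpointConfig. This is a rough diagnostic sketch, not part of the original question; whether a value set in flink-conf.yaml (e.g. execution.checkpointing.interval) shows up here depends on how the job is launched:

import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointConfigProbe {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        CheckpointConfig checkpointConfig = env.getCheckpointConfig();
        // If this prints true even though the code never calls enableCheckpointing(),
        // the interval is most likely being picked up from the cluster configuration.
        System.out.println("Checkpointing enabled: " + checkpointConfig.isCheckpointingEnabled());
        System.out.println("Checkpoint interval:   " + checkpointConfig.getCheckpointInterval());
    }
}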

how to achieve exactly once semantics in apache kafka connector

I am using Flink version 1.8.0. My application reads data from Kafka -> transforms it -> publishes back to Kafka. To avoid any duplicates during a restart, I want to use a Kafka producer with exactly-once semantics, which I read about here:
https://ci.apache.org/projects/flink/flink-docs-release-1.8/dev/connectors/kafka.html#kafka-011-and-newer
My Kafka version is 1.1.
return new FlinkKafkaProducer<String>(topic, new KeyedSerializationSchema<String>() {

    @Override
    public byte[] serializeKey(String element) {
        return element.getBytes();
    }

    @Override
    public byte[] serializeValue(String element) {
        return element.getBytes();
    }

    @Override
    public String getTargetTopic(String element) {
        return topic;
    }
}, prop, opt, FlinkKafkaProducer.Semantic.EXACTLY_ONCE, 1);
Checkpoint code:
CheckpointConfig checkpointConfig = env.getCheckpointConfig();
checkpointConfig.setCheckpointTimeout(15 * 1000);
checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
env.enableCheckpointing(5000);
If I add exactly-once semantics to the Kafka producer, my Flink consumer does not read any new data.
Can anyone please share sample code/an application with exactly-once semantics?
Please find the complete code here:
https://github.com/sris2/sample_flink_exactly_once
Thanks
Can anyone please share sample code/an application with exactly-once semantics?
An exactly-once example is hidden in an end-to-end test in Flink. Since it uses some convenience functions, it may be hard to follow without checking out the whole repo.
If I add exactly-once semantics to the Kafka producer, my Flink consumer does not read any new data.
[...]
Please find the complete code here:
https://github.com/sris2/sample_flink_exactly_once
I checked out your code and found the issue (I had to fix the whole setup/code to actually get it running). The sink cannot configure its transactions correctly. As described in the Flink Kafka connector documentation, you need to adjust transaction.timeout.ms: either raise transaction.max.timeout.ms on your Kafka broker to 1 hour, or lower the producer's timeout in your application to at most 15 minutes:
prop.setProperty("transaction.timeout.ms", "900000");
The respective excerpt is:
Kafka brokers by default have transaction.max.timeout.ms set to 15 minutes. This property will not allow to set transaction timeouts for the producers larger than it’s value. FlinkKafkaProducer011 by default sets the transaction.timeout.ms property in producer config to 1 hour, thus transaction.max.timeout.ms should be increased before using the Semantic.EXACTLY_ONCE mode.
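Putting the answer together, a minimal sketch of a producer factory with the lowered transaction timeout might look like the following. The class name, the bootstrapServers parameter, and the pool size of 5 are illustrative assumptions; the six-argument constructor is the same one used in the question's code:

import java.util.Optional;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class ExactlyOnceSinkFactory {

    public static FlinkKafkaProducer<String> create(String topic, String bootstrapServers) {
        Properties prop = new Properties();
        prop.setProperty("bootstrap.servers", bootstrapServers);
        // Must be <= the broker's transaction.max.timeout.ms (15 minutes by default).
        prop.setProperty("transaction.timeout.ms", "900000");

        return new FlinkKafkaProducer<String>(
                topic,
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                prop,
                Optional.empty(),                         // no custom Flink partitioner
                FlinkKafkaProducer.Semantic.EXACTLY_ONCE,
                5);                                       // pool size for transactional producers
    }
}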

Apache Flink - kafka producer to sink messages to kafka topics but on different partitions

Right now my Flink code processes a file and sinks the data to a Kafka topic with 1 partition.
Now I have a topic with 2 partitions, and I want the Flink code to sink data to both partitions using the DefaultPartitioner.
Could you help me with that?
Here is a snippet of my current code:
DataStream<String> speStream = inputStream.map(new MapFunction<Row, String>(){....});
Properties props = Producer.getProducerConfig(propertiesFilePath);
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        props,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
Solved this by changing the Flink producer to
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new SimpleStringSchema(),
        props));
Earlier I was using
speStream.addSink(new FlinkKafkaProducer011(kafkaTopicName,
        new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
        props,
        FlinkKafkaProducer011.Semantic.EXACTLY_ONCE));
In Flink version 1.11 (which I'm using with Java), SimpleStringSchema needs the wrapper (i.e. KeyedSerializationSchemaWrapper) that @Ankit used in the question but that was removed from the suggested solution; without the wrapper I was getting the constructor-related error below.
FlinkKafkaProducer<String> producer = new FlinkKafkaProducer<String>(
topic_name, new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
properties, FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
Error:
The constructor FlinkKafkaProducer<String>(String, SimpleStringSchema, Properties, FlinkKafkaProducer.Semantic) is undefined
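For the original goal of spreading records over both partitions with Kafka's DefaultPartitioner, my reading of the Flink Kafka connector documentation (treat this as an assumption, not a verified fix) is that you can also keep the wrapper and pass Optional.empty() as the custom partitioner, so that Flink does not install its FlinkFixedPartitioner and the Kafka producer chooses the partition itself. A sketch with a hypothetical helper method, reusing the same FlinkKafkaProducer011, speStream, and props as above:

import java.util.Optional;
import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer011;
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchemaWrapper;

public class KafkaSinkWithDefaultPartitioner {

    // Attach a Kafka sink that leaves partition selection to the Kafka producer
    // (its DefaultPartitioner) instead of Flink's FlinkFixedPartitioner.
    public static void attachSink(DataStream<String> speStream, String kafkaTopicName, Properties props) {
        speStream.addSink(new FlinkKafkaProducer011<String>(
                kafkaTopicName,
                new KeyedSerializationSchemaWrapper<>(new SimpleStringSchema()),
                props,
                // Optional.empty() = no custom Flink partitioner; Kafka decides the partition.
                Optional.empty()));
    }
}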

How to feed an Apache Flink DataStream

I am new to Apache Flink. I want to create a DataStream and feed it with values from another system.
I found examples of how to add a SourceFunction; in that function I have to wait for values from the source, publish them to Flink by calling ctx.collect, and then wait again, i.e. polling.
But I have a data source that calls a function when values arrive (async). So what I want to do is: when this async call happens, put the value onto a Flink DataStream. Pseudo-code:
mysystem.connect_to_values( (value) => { myflinkdatastream.put(value.toString) })
Can this be done? Otherwise I would have to do the connect and callback inside the SourceFunction and run a loop with sleep afterwards, but I don't want to do it that way...
I have already seen "Asynchronous I/O for External Data Access" in Flink, but that still needs a source stream, which is fed by a SourceFunction (poll/loop).
If you do not want to add a SourceFunction to your streaming job, I suggest using Kafka or another message queue: send the data to it from your async source and connect the Flink streaming job to the message queue.
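To illustrate the Flink side of that setup, here is a minimal sketch using the Kafka consumer connector; the topic name, bootstrap servers, and consumer group are placeholder assumptions. The external system's async callback then only has to produce each value to that topic.

import java.util.Properties;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class AsyncSourceViaKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
        props.setProperty("group.id", "my-flink-consumer");       // placeholder

        // The external system's async callback publishes each value to this topic;
        // Flink then consumes the topic as a regular DataStream.
        DataStream<String> values = env.addSource(
                new FlinkKafkaConsumer<String>("values-topic", new SimpleStringSchema(), props));

        values.print();
        env.execute("Feed DataStream from an async source via Kafka");
    }
}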

Resources