I have a Flink application that reads from a single Kafka topic.
I am trying to stop the FlinkKafkaConsumer from pulling messages.
My final goal is to build a way to deploy a new version of my Flink application from time to time without any downtime - in other words, how to deploy a new job without downtime.
I have tried using "kafkaConsumer.close()", but that does not work. I want to stop the consumer from pulling new messages without killing the entire job, while at the same time I upload a new job with the updated code that reads from the same topic.
How do I do that?
Would it be possible to send a special 'switch' message on all partitions of the Kafka topic? Then you could override isEndOfStream(T nextElement) in your Kafka DeserializationSchema and have your new job instance start working after the last switch message.
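A minimal sketch of such a schema, assuming the topic carries plain UTF-8 strings and using a hypothetical "SWITCH" payload as the marker (the legacy FlinkKafkaConsumer stops reading a partition once isEndOfStream returns true):
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;

import java.nio.charset.StandardCharsets;

// Hypothetical schema: treats a record with the payload "SWITCH" as the
// end-of-stream marker for the partition it was read from.
public class SwitchAwareSchema implements DeserializationSchema<String> {

    @Override
    public String deserialize(byte[] message) {
        return new String(message, StandardCharsets.UTF_8);
    }

    @Override
    public boolean isEndOfStream(String nextElement) {
        // Once the 'switch' marker is seen, the consumer stops pulling
        // further records from that partition.
        return "SWITCH".equals(nextElement);
    }

    @Override
    public TypeInformation<String> getProducedType() {
        return Types.STRING;
    }
}
The old job would run with this schema and drain itself after the switch messages, while the new job is started against the same topic.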
I am a newcomer to both Beam and Flink, so I'm not sure whether this question is about Beam or about Flink. We are setting up to run a Beam application using the Flink runner.
I have a fairly stateless streaming application without any aggregations/state. I basically read from Pub/Sub Lite, do some simple transformation of the data, generate a ProducerRecord from it, and submit it to two separate Kafka topics. All my experiments have been successful so far, and I even got it to work locally using Minikube, the Flink K8s operator, etc.
Unfortunately, I am stuck at a stage where I am unable to figure out the right docs/topics to read to understand the issue. If there is any error while saving to Kafka, or if Kafka is unavailable, it seems the Pub/Sub Lite message is acked before being successfully saved to Kafka. If I restart my app after a failure, the original Pub/Sub Lite message is not reprocessed or resent to Kafka. I am losing data in that case, as the message has already been acked in the previous step (I can also see there is no backlog in the Google Cloud console).
Ideally, my goal is that the message is only acked after we have saved it to both Kafka topics, or, if it is acked before that, that the state is saved locally so that after a restart Beam/Flink will retry just sending it to Kafka.
I initially thought the way to do this was to use some form of checkpoints/savepoints, but it looks like those are more for stateful streaming applications. Am I misunderstanding the concept?
My current code is simply:
msgs.apply("Map pubsubmessage to producerrecord", MapElements.via(new FormatPubSubMessage(options.getTopic())))
.setCoder(ProducerRecordCoder.of(VoidCoder.of(), ByteArrayCoder.of()))
.apply("Write to primary kafka topic", KafkaIO.<Void, byte[]>writeRecords()
.withBootstrapServers(options.getBootstrapServers())
.withTopic(options.getTopic())
.withKeySerializer(VoidSerializer.class)
.withValueSerializer(ByteArraySerializer.class)
);
Any pointers to docs/concepts on how one would go about achieving it?
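For what it's worth: on the Flink runner, acknowledgement of messages from unbounded sources is generally tied to checkpoint completion, so checkpointing is relevant even for a stateless pipeline. A minimal sketch of enabling it, assuming Beam's FlinkPipelineOptions and an arbitrary 60-second interval:
import org.apache.beam.runners.flink.FlinkPipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CheckpointedOptions {
    // Builds pipeline options with checkpointing enabled; source offsets are
    // only committed/acked once the checkpoint that covers them completes.
    public static FlinkPipelineOptions create(String[] args) {
        FlinkPipelineOptions options = PipelineOptionsFactory
                .fromArgs(args)
                .withValidation()
                .as(FlinkPipelineOptions.class);
        options.setCheckpointingInterval(60_000L); // hypothetical 60s interval
        return options;
    }
}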
I configured standalone Debezium and tested the streaming. After that, I created a pipeline as follows:
pipeline.apply("Read from DebeziumIO",
DebeziumIO.<String>read()
.withConnectorConfiguration(
DebeziumIO.ConnectorConfiguration.create()
.withUsername("user")
.withPassword("password")
.withHostName("hostname")
.withPort("1433")
.withConnectorClass(SqlServerConnector.class)
.withConnectionProperty("database.server.name", "customer")
.withConnectionProperty("database.dbname", "test001")
.withConnectionProperty("database.include.list", "test002")
.withConnectionProperty("include.schema.changes", "true")
.withConnectionProperty("database.history.kafka.bootstrap.servers", "kafka:9092")
.withConnectionProperty("database.history.kafka.topic", "schema-changes.inventory")
.withConnectionProperty("connect.keep.alive", "false")
.withConnectionProperty("connect.keep.alive.interval.ms", "200")
).withFormatFunction(new SourceRecordJson.SourceRecordJsonMapper()).withCoder(StringUtf8Coder.of())
)
When I start the pipeline using the DirectRunner, the data stream is not captured by the pipeline. In my pipeline code, I just added code to dump the data to the console for the time being.
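For illustration, such a console-dump step could be as simple as a DoFn like this (a hypothetical sketch, assuming the read emits JSON strings via the format function above):
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical debugging step, applied after the DebeziumIO read via
// ParDo.of(new PrintToConsoleFn()): prints each change event and passes it on.
public class PrintToConsoleFn extends DoFn<String, String> {
    @ProcessElement
    public void processElement(@Element String record, OutputReceiver<String> out) {
        System.out.println(record); // visible in the DirectRunner console
        out.output(record);
    }
}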
Also, from the logs I observe that Debezium is being started and stopped frequently. Is that by design?
Also, when a change is made in the DB (INSERT/DELETE/UPDATE), I don't see it reflected in the logs.
So my questions are:
Is the configuration I provided sufficient?
Why is the pipeline not being triggered when there is a change?
What additional steps do I need to perform to get it working?
Can restarting Debezium multiple times cause a performance impact, since it creates a JDBC connection each time?
I'm experiencing some odd behaviour when writing ORC files to S3 using Flink's Streaming File Sink.
StreamingFileSink<ArmadaRow> orderBookSink = StreamingFileSink
.forBulkFormat(new Path(PARAMETER_TOOL_CONFIG.get("order.book.sink")),
new OrcBulkWriterFactory<>(new OrderBookRowVectorizer(_SCHEMA), writerProperties, new Configuration()))
.withBucketAssigner(new OrderBookBucketingAssigner())
.withRollingPolicy(OnCheckpointRollingPolicy.build())
.build();
I noticed when running queries during ingest of the data that my row counts were being decremented as the job progressed. I've had a look at S3 and I can see multiple versions of the same part file. The example below shows that part file 15-7 has two versions. The first file is 20.7 MB and the last file that's committed is smaller at 5.1 MB. In most cases the current file is larger, but there are a few examples in the screenshot below where this is not the case.
I noticed from the logs on the TaskManager that Flink committed both of these files at pretty much the same time. I'm not sure if this is a known issue with Flink's streaming file sink or potentially some misconfiguration on my part. I'm using Flink 1.11, by the way.
2022-02-28T20:44:03.224+0000 INFO APP=${sys:AppID} COMP=${sys:CompID} APPNAME=${sys:AppName} S3Committer:64 - Committing staging/marketdata/t_stg_globex_order_book_test2/cyc_dt=2021-11-15/inst_exch_mrkt_id=XNYM/inst_ast_sub_clas=Energy/part-15-7 with MPU ID
2022-02-28T20:44:03.526+0000 INFO APP=${sys:AppID} COMP=${sys:CompID} APPNAME=${sys:AppName} S3Committer:64 - Committing staging/marketdata/t_stg_globex_order_book_test2/cyc_dt=2021-11-15/inst_exch_mrkt_id=XNYM/inst_ast_sub_clas=Energy/part-15-7 with MPU ID
Edit
I've also tested this with version 1.13 and the same problem occurs.
Upgrading to Flink 1.13 and using the new FileSink API resolved this error:
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/connectors/datastream/file_sink/#bulk-encoded-formats
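For reference, a sketch of the same sink moved to the FileSink builder (mirroring the StreamingFileSink snippet above; FileSink lives in org.apache.flink.connector.file.sink, and the stream is attached with sinkTo(orderBookSink) instead of addSink):
FileSink<ArmadaRow> orderBookSink = FileSink
    .forBulkFormat(new Path(PARAMETER_TOOL_CONFIG.get("order.book.sink")),
        new OrcBulkWriterFactory<>(new OrderBookRowVectorizer(_SCHEMA), writerProperties, new Configuration()))
    .withBucketAssigner(new OrderBookBucketingAssigner())
    .withRollingPolicy(OnCheckpointRollingPolicy.build())
    .build();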
We have two simple Flink jobs using the Kafka connector:
the first reads a file and sends its content to a Kafka topic
the second reads all previously uploaded records from Kafka and counts them
When we create the KafkaSource for consuming records, we set the boundedness to the latest offset like this:
KafkaSource<String> source = KafkaSource
.builder()
.setTopics(...)
.setBootstrapServers(...)
.setStartingOffsets(OffsetsInitializer.earliest())
.setBounded(OffsetsInitializer.latest())
.setDeserializer(...)
.setGroupId(...)
.build();
Unfortunately, it doesn't work as expected and the consumer job just hangs.
My observation is that if the producer app uses AT_LEAST_ONCE semantics, then everything is fine and the consumer works as expected; the problem occurs only when the producer uses EXACTLY_ONCE semantics.
Additionally, when I run the producer job locally (from the IDE), everything works fine as well (in both EXACTLY_ONCE and AT_LEAST_ONCE modes); the problem is visible only when the producer job runs on the Ververica Platform.
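One thing that might be worth experimenting with (an assumption on my side, not a confirmed fix): an EXACTLY_ONCE producer writes transactionally and leaves transaction markers in the topic, so the consumer's isolation level and the resolved stopping offsets can behave differently than with AT_LEAST_ONCE. A sketch of pinning the isolation level explicitly on the builder, with placeholder values standing in for the elided settings:
KafkaSource<String> source = KafkaSource
    .<String>builder()
    .setTopics("input-topic")                // placeholder
    .setBootstrapServers("broker:9092")      // placeholder
    .setGroupId("bounded-count-job")         // placeholder
    .setStartingOffsets(OffsetsInitializer.earliest())
    .setBounded(OffsetsInitializer.latest())
    .setDeserializer(KafkaRecordDeserializationSchema.valueOnly(StringDeserializer.class))
    // only read records from committed transactions
    .setProperty("isolation.level", "read_committed")
    .build();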
I have a Java application that launches a Flink job to process Kafka streams.
The application blocks at the job submission at flinkEnv.execute("flink job name"), since the job runs forever processing the streams coming in from Kafka.
In this case, how can I get the job id returned from the execution? I can see the job id printed in the console; I just wonder how to get the job id in this case without flinkEnv.execute having returned yet.
How can I cancel a Flink job, given the job name, from a remote server in Java?
As far as I know, there is currently no nice programmatic way to control Flink. But since Flink is written in Java, everything you can do with the console can also be done with the internal class org.apache.flink.client.CliFrontend, which is invoked by the console scripts.
An alternative would be using the REST API of the Flink JobManager.
You can use the REST API to query the Flink job's status.
Check the link below: https://ci.apache.org/projects/flink/flink-docs-master/monitoring/rest_api.html
Maybe you can try requesting http://host:port/jobs/overview to get information about all jobs, which contains each job's name and id, such as:
{"jobs":[{"jid":"d6e7b76f728d6d3715bd1b95883f8465","name":"Flink Streaming Job","state":"RUNNING","start-time":1628502261163,"end-time":-1,"duration":494208,"last-modification":1628502353963,"tasks":{"total":6,"created":0,"scheduled":0,"deploying":0,"running":6,"finished":0,"canceling":0,"canceled":0,"failed":0,"reconciling":0,"initializing":0}}]}
I really hope this will help you.
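A minimal sketch combining the two answers above in plain Java, assuming the JobManager's REST API is reachable at a placeholder http://host:port, that the job is cancelled via PATCH /jobs/<jobid>, and with a naive string search standing in for proper JSON parsing:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class FlinkJobCanceller {

    public static void main(String[] args) throws Exception {
        String base = "http://host:port";          // placeholder address
        String targetName = "Flink Streaming Job"; // name of the job to cancel

        HttpClient client = HttpClient.newHttpClient();

        // 1. Fetch the overview of all jobs (names, ids, states).
        HttpResponse<String> overview = client.send(
                HttpRequest.newBuilder(URI.create(base + "/jobs/overview")).GET().build(),
                HttpResponse.BodyHandlers.ofString());

        // 2. Find the jid that belongs to the matching job name.
        String body = overview.body();
        int nameIdx = body.indexOf("\"name\":\"" + targetName + "\"");
        if (nameIdx < 0) {
            System.out.println("No job named " + targetName);
            return;
        }
        int jidIdx = body.lastIndexOf("\"jid\":\"", nameIdx) + "\"jid\":\"".length();
        String jid = body.substring(jidIdx, body.indexOf('"', jidIdx));

        // 3. Cancel the job via the REST API.
        client.send(
                HttpRequest.newBuilder(URI.create(base + "/jobs/" + jid))
                        .method("PATCH", HttpRequest.BodyPublishers.noBody())
                        .build(),
                HttpResponse.BodyHandlers.ofString());
    }
}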