Kafka: can I put a path after the zookeeper port number? - connection-string

The Kafka server is configured with a path following the port number (from server.properties):
zookeeper.connect=xxxxx007:2181/kafka
The Java producer code:
Properties props = new Properties();
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("metadata.broker.list", "xxxxx007:9092");
The producer populates the topic when the broker list omits /kafka.
The producer throws a NumberFormatException when the broker list contains /kafka:
Properties props = new Properties();
props.put("serializer.class", "kafka.serializer.StringEncoder");
props.put("metadata.broker.list", "xxxxx007:9092/kafka");
java.lang.NumberFormatException: For input string: "9092/kafka"
The Java consumer hangs (returns no data) if the ZooKeeper connection string contains /kafka:
Properties props = new Properties();
props.put("zookeeper.connect", "xxxxx007:2181/kafka");
The Java consumer throws an exception if the ZooKeeper connection string omits /kafka:
Properties props = new Properties();
props.put("zookeeper.connect", "xxxxx007:2181");
Exception in thread "main" kafka.common.ConsumerRebalanceFailedException: group1_BFTSLBHW0000RGU-1397591737558-f75b6658 can't rebalance after 4 retries
at kafka.consumer.ZookeeperConsumerConnector$ZKRebalancerListener.syncedRebalance(ZookeeperConsumerConnector.scala:428)
at kafka.consumer.ZookeeperConsumerConnector.kafka$consumer$ZookeeperConsumerConnector$$reinitializeConsumer(ZookeeperConsumerConnector.scala:718)
at kafka.consumer.ZookeeperConsumerConnector.consume(ZookeeperConsumerConnector.scala:209)
at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:80)
at kafka.javaapi.consumer.ZookeeperConsumerConnector.createMessageStreams(ZookeeperConsumerConnector.scala:92)
at kafka.examples.TitaniumConsumer.main(TitaniumConsumer.java:73)

The intention behind specifying this ZooKeeper path (a chroot) is that, for a particular cluster, all of Kafka's data appears under that path. Note that you must create the path yourself before starting the broker, and consumers must use the same connection string. The chroot belongs only in the ZooKeeper connection string: metadata.broker.list takes plain host:port pairs, which is why appending /kafka there fails to parse as a port number.
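Putting the two observations together, a minimal sketch of the corrected configuration (reusing the host name from the question): the chroot appears only in zookeeper.connect, never in the broker list.
// Producer: broker list entries are plain host:port; a path here cannot be parsed.
Properties producerProps = new Properties();
producerProps.put("serializer.class", "kafka.serializer.StringEncoder");
producerProps.put("metadata.broker.list", "xxxxx007:9092");
// Consumer: the /kafka chroot goes on the ZooKeeper connection string,
// matching zookeeper.connect in server.properties.
Properties consumerProps = new Properties();
consumerProps.put("zookeeper.connect", "xxxxx007:2181/kafka");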

Related

Flink checkpoint not replaying the kafka events which were in process during the savepoint/checkpoint

I want to test end-to-end exactly-once processing in Flink. My job is:
Kafka-source -> mapper1 -> mapper2 -> kafka-sink
I put a Thread.sleep(100000) in mapper1 and then ran the job. I took a savepoint while stopping the job, then removed the Thread.sleep(100000) from mapper1, expecting the event to be replayed since it had not reached the sink. But that didn't happen, and the job is waiting for new events.
My Kafka source:
KafkaSource.<String>builder()
.setBootstrapServers(consumerConfig.getBrokers())
.setTopics(consumerConfig.getTopic())
.setGroupId(consumerConfig.getGroupId())
.setStartingOffsets(OffsetsInitializer.latest())
.setValueOnlyDeserializer(new SimpleStringSchema())
.setProperty("commit.offsets.on.checkpoint", "true")
.build();
My kafka sink:
KafkaSink.<String>builder()
.setBootstrapServers(producerConfig.getBootstrapServers())
.setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setRecordSerializer(KafkaRecordSerializationSchema.builder()
.setTopic(producerConfig.getTopic())
.setValueSerializationSchema(new SimpleStringSchema()).build())
.build();
My environmentSetup for flink job:
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.enableCheckpointing(2000);
environment.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
environment.getCheckpointConfig().setMinPauseBetweenCheckpoints(100);
environment.getCheckpointConfig().setCheckpointTimeout(60000);
environment.getCheckpointConfig().setTolerableCheckpointFailureNumber(2);
environment.getCheckpointConfig().setExternalizedCheckpointCleanup(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
environment.getCheckpointConfig().setCheckpointTimeout(1000);
environment.getCheckpointConfig().setMaxConcurrentCheckpoints(1);
environment.getCheckpointConfig().enableUnalignedCheckpoints();
environment.getCheckpointConfig().setCheckpointStorage("file:///tmp/flink-checkpoints");
Configuration configuration = new Configuration();
configuration.set(ExecutionCheckpointingOptions.ENABLE_CHECKPOINTS_AFTER_TASKS_FINISH, true);
environment.configure(configuration);
What am I doing wrong here?
I want any event that is in flight when the job is cancelled/stopped to be reprocessed on restart.
EDIT 1:
I observed that Kafka was showing offset lag for my Flink kafka-source consumer group. I assume this means my checkpointing is behaving correctly, is that right?
I also observed that when I restarted my job from the checkpoint, it didn't start consuming from the remaining offsets, even though I have the consumer offsets set to EARLIEST. I had to send more events to trigger consumption on the kafka-source side, and then it consumed all the events.
For exactly-once, you must provide a TransactionalIdPrefix unique across all applications running against the same Kafka cluster (this is a change compared to the legacy FlinkKafkaProducer):
KafkaSink<T> sink =
KafkaSink.<T>builder()
.setBootstrapServers(...)
.setKafkaProducerConfig(...)
.setRecordSerializer(...)
.setDeliverGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
.setTransactionalIdPrefix("unique-id-for-your-app")
.build();
When resuming from a checkpoint, Flink always uses the offsets stored in the checkpoint rather than those configured in the code or stored in the broker.
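Note also that setStartingOffsets only takes effect on a fresh start with no state to restore. If the goal is to resume from committed group offsets and fall back to the earliest offset when none exist, a sketch of the source could look like this (builder values carried over from the question; OffsetsInitializer.committedOffsets with an OffsetResetStrategy fallback is part of the Flink Kafka connector API):
KafkaSource.<String>builder()
.setBootstrapServers(consumerConfig.getBrokers())
.setTopics(consumerConfig.getTopic())
.setGroupId(consumerConfig.getGroupId())
// Applies only when starting fresh; a restore from a checkpoint or
// savepoint always overrides this initializer.
.setStartingOffsets(OffsetsInitializer.committedOffsets(OffsetResetStrategy.EARLIEST))
.setValueOnlyDeserializer(new SimpleStringSchema())
.build();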

Apache Flink with Kinesis Analytics : java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0

Background:
I have been trying to set up BATCH + STREAMING in the same Flink application, deployed on the Kinesis Analytics runtime. The STREAMING part works fine, but I'm having trouble adding support for BATCH.
Flink : Handling Keyed Streams with data older than application watermark
Apache Flink : Batch Mode failing for Datastream API's with exception `IllegalStateException: Checkpointing is not allowed with sorted inputs.`
The logic is something like this:
streamExecutionEnvironment.setRuntimeMode(RuntimeExecutionMode.BATCH);
streamExecutionEnvironment.fromSource(FileSource.forRecordStreamFormat(new TextLineFormat(), path).build(),
WatermarkStrategy.noWatermarks(),
"Text File")
.process(/* process function which transforms input */)
.assignTimestampsAndWatermarks(WatermarkStrategy
.<DetectionEvent>forBoundedOutOfOrderness(orderness)
.withTimestampAssigner(
(SerializableTimestampAssigner<DetectionEvent>) (event, l) -> event.getEventTime()))
.keyBy(keyFunction)
.window(TumblingEventTimeWindows.of(Time.days(x)))
.process(processWindowFunction);
On doing this I'm getting the below exception:
java.lang.Exception: Exception while creating StreamOperatorStateContext.
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:254)
at org.apache.flink.streaming.api.operators.AbstractStreamOperator.initializeState(AbstractStreamOperator.java:272)
at org.apache.flink.streaming.runtime.tasks.OperatorChain.initializeStateAndOpenOperators(OperatorChain.java:441)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restoreGates(StreamTask.java:582)
at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.call(StreamTaskActionExecutor.java:55)
at org.apache.flink.streaming.runtime.tasks.StreamTask.executeRestore(StreamTask.java:562)
at org.apache.flink.streaming.runtime.tasks.StreamTask.runWithCleanUpOnFail(StreamTask.java:647)
at org.apache.flink.streaming.runtime.tasks.StreamTask.restore(StreamTask.java:537)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:764)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:571)
at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: org.apache.flink.util.FlinkException: Could not restore keyed state backend for WindowOperator_90bea66de1c231edf33913ecd54406c1_(1/1) from any of the 1 provided restore options.
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:160)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.keyedStatedBackend(StreamTaskStateInitializerImpl.java:345)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.streamOperatorStateContext(StreamTaskStateInitializerImpl.java:163)
... 10 more
Caused by: java.io.IOException: Failed to acquire shared cache resource for RocksDB
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.allocateSharedCachesIfConfigured(RocksDBOperationUtils.java:306)
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:426)
at org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend.createKeyedStateBackend(EmbeddedRocksDBStateBackend.java:90)
at org.apache.flink.streaming.api.operators.StreamTaskStateInitializerImpl.lambda$keyedStatedBackend$1(StreamTaskStateInitializerImpl.java:328)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.attemptCreateAndRestore(BackendRestorerProcedure.java:168)
at org.apache.flink.streaming.api.operators.BackendRestorerProcedure.createAndRestore(BackendRestorerProcedure.java:135)
... 12 more
Caused by: java.lang.IllegalArgumentException: The fraction of memory to allocate should not be 0. Please make sure that all types of managed memory consumers contained in the job are configured with a non-negative weight via `taskmanager.memory.managed.consumer-weights`.
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:160)
at org.apache.flink.runtime.memory.MemoryManager.validateFraction(MemoryManager.java:672)
at org.apache.flink.runtime.memory.MemoryManager.computeMemorySize(MemoryManager.java:653)
at org.apache.flink.runtime.memory.MemoryManager.getSharedMemoryResourceForManagedMemory(MemoryManager.java:521)
at org.apache.flink.contrib.streaming.state.RocksDBOperationUtils.allocateSharedCachesIfConfigured(RocksDBOperationUtils.java:302)
... 17 more
It seems Kinesis Analytics does not let clients supply a flink-conf.yaml file in which to set taskmanager.memory.managed.consumer-weights. Is there any way around this?
It's not clear to me what the underlying cause of this exception is, nor how to make batch processing work on KDA.
You can try this (but I'm not sure KDA will allow it):
Configuration conf = new Configuration();
conf.setString("taskmanager.memory.managed.consumer-weights", "put-the-value-here");
StreamExecutionEnvironment env =
StreamExecutionEnvironment.getExecutionEnvironment(conf);
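For reference, the value is a comma-separated list of CONSUMER:WEIGHT pairs. The consumer names below match recent Flink releases (older releases used DATAPROC instead of OPERATOR/STATE_BACKEND), and the weights shown mirror what I believe are the documented defaults; treat them as illustrative:
conf.setString("taskmanager.memory.managed.consumer-weights", "OPERATOR:70,STATE_BACKEND:70,PYTHON:30");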

Unable to stream with Flink Streaming

I am new to the Flink streaming framework and I am trying to understand the components and the flow. I am trying to run the basic word-count example using the DataStream API in my IDE. The code runs with no issues when I feed data from a collection, as in:
DataStream<String> text = env.fromElements(
"To be, or not to be,--that is the question:--",
"Whether 'tis nobler in the mind to suffer",
"The slings and arrows of outrageous fortune",z
"Or to take arms against a sea of troubles,"
);
But it fails every time I try to read data either from the socket or from a file, as below:
DataStream<String> text = env.socketTextStream("localhost", 9999);
or
DataStream<String> text = env.readTextFile("sample_file.txt");
For both the socket and the text file, I get the following error:
Exception in thread "main" java.lang.reflect.InaccessibleObjectException: Unable to make field private final byte[] java.lang.String.value accessible: module java.base does not "opens java.lang" to unnamed module
Flink currently (as of Flink 1.14) only supports Java 8 and Java 11. Install a suitable JDK and try again.
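The exception itself comes from the JDK 16+ module system denying reflective access. If switching JDKs is not immediately possible, a commonly cited but unsupported workaround is to open the package to the unnamed module via a JVM option in the IDE run configuration (Flink may need several more --add-opens entries for other packages; the supported fix remains Java 8 or 11):
--add-opens java.base/java.lang=ALL-UNNAMED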

Flink to Nifi the Magic Header was not present

I am trying to use this example to connect Nifi to Flink:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
SiteToSiteClientConfig clientConfig = new SiteToSiteClient.Builder()
.url("http://localhost:8090/nifi")
.portName("Data for Flink")
.requestBatchCount(5)
.buildConfig();
SourceFunction<NiFiDataPacket> nifiSource = new NiFiSource(clientConfig);
DataStream<NiFiDataPacket> streamSource = env.addSource(nifiSource).setParallelism(2);
DataStream<String> dataStream = streamSource.map(new MapFunction<NiFiDataPacket, String>() {
@Override
public String map(NiFiDataPacket value) throws Exception {
return new String(value.getContent(), Charset.defaultCharset());
}
});
dataStream.print();
env.execute();
I am running Nifi as a standalone server with default properties, except these properties:
nifi.remote.input.host=localhost
nifi.remote.input.secure=false
nifi.remote.input.socket.port=8090
nifi.remote.input.http.enabled=true
The call fails each time, with the following log in NiFi:
[Site-to-Site Worker Thread-24] o.a.nifi.remote.SocketRemoteSiteListener
Unable to communicate with remote instance null due to
org.apache.nifi.remote.exception.HandshakeException: Handshake
with nifi://localhost:61680 failed because the Magic Header
was not present; closing connection
Nifi version: 1.7.1, Flink version: 1.7.1
After using the nifi-toolkit, I removed the custom value of nifi.remote.input.socket.port, added transportProtocol(SiteToSiteTransportProtocol.HTTP) to my SiteToSiteClientConfig, and used http://localhost:8080/nifi as the URL.
The reason I changed the port in the first place is that, without specifying the HTTP transport protocol, the client uses RAW by default.
And when using the RAW protocol from the Flink side, the client cannot create a transaction and prints the following warning:
Unable to refresh Remote Group's peers due to Remote instance of NiFi
is not configured to allow RAW Socket site-to-site communications
That's why I thought it was a port issue.
So now, with the default NiFi config, this works as expected:
SiteToSiteClientConfig clientConfig = new SiteToSiteClient.Builder()
.url("http://localhost:8080/nifi")
.portName("portNameAsInNifi")
.transportProtocol(SiteToSiteTransportProtocol.HTTP)
.requestBatchCount(1)
.buildConfig();
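For reference, the relevant stock nifi.properties entries then look like this (NiFi 1.x defaults, to the best of my knowledge: HTTP site-to-site enabled, RAW socket port left blank):
nifi.remote.input.host=
nifi.remote.input.secure=false
nifi.remote.input.socket.port=
nifi.remote.input.http.enabled=true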

read string datastream in Flink from socket without using netcat server

I have a scenario in which a stream generator client generates multiple streams, merges them, and sends them to a socket, and I want the Flink program to listen to it as the server. Since a server has to be up first so that it can listen for client requests, I tried to do this with the code given below:
public static void main(String[] args) throws Exception {
// set up the StreamExecutionEnvironment
StreamExecutionEnvironment environment = StreamExecutionEnvironment.getExecutionEnvironment();
environment.setParallelism(1);
DataStream<String> stream1 = environment.socketTextStream("localhost", 9000);
stream1.print();
//start the execution
environment.execute(" Started the execution ");
}// main
The code for the stream generator, acting as the client, is given below:
DataStream<Event> stream1 = environment
.addSource(new EventGenerator(2,60,1,1,100, 200 ))
.name("stream 1")
.setParallelism(parallelism_for_stream_rr);
DataStream<Event> stream2 = environment
.addSource(new EventGenerator(3,60,1,2,10, 20 ))
.name("stream 2")
.setParallelism(parallelism_for_stream_rr);
DataStream<Event> stream3 = environment
.addSource(new EventGenerator(5,60,1,3,30, 40 ))
.name("stream 3")
.setParallelism(parallelism_for_stream_rr);
DataStream<Event> merged = stream1.union(stream2,stream3);
merged.print();
// sending data to Mobile Cep via socket
merged.map(new MapFunction<Event, String>() {
@Override
public String map(Event event) throws Exception {
String tuple = event.toString();
return tuple + "\n";
}
}).writeToSocket("localhost", 9000, new SimpleStringSchema() );
Issue #1: The client code works only when I start a netcat server, but then the netcat server doesn't forward the data streams. If the netcat server is not up, the client code says it can't make a connection.
Issue #2: The Flink program doesn't execute if the netcat server is not up:
Caused by: java.net.ConnectException: Connection refused (Connection refused)
at java.net.PlainSocketImpl.socketConnect(Native Method)
at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
I know that one possible solution is to generate the streams within the Flink program, but I want to receive the streams via a socket.
Thanks in advance ~
Neither Flink's socket source nor its sink starts a TCP server and waits for incoming connections. They are both clients that connect to an already started TCP server. That's also why you have to start netcat before launching the jobs. If you want to write to and read from a socket, you have to write a TCP server that can buffer the incoming data and forward it once a client connects.
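A minimal sketch of such a relay is below. It is illustrative only: the class name and ports are made up (the generator job would call writeToSocket("localhost", 9000, ...) and the reading job env.socketTextStream("localhost", 9001)), it handles one connection per side, and real error handling is omitted.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class SocketRelay {
    public static void main(String[] args) throws Exception {
        BlockingQueue<String> buffer = new LinkedBlockingQueue<>();
        // Accept the producing job (writeToSocket connects here) and buffer its lines.
        new Thread(() -> {
            try (ServerSocket in = new ServerSocket(9000);
                 Socket s = in.accept();
                 BufferedReader reader = new BufferedReader(new InputStreamReader(s.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    buffer.put(line);
                }
            } catch (Exception e) {
                e.printStackTrace();
            }
        }).start();
        // Accept the consuming job (socketTextStream connects here) and forward buffered lines.
        try (ServerSocket out = new ServerSocket(9001);
             Socket s = out.accept();
             PrintWriter writer = new PrintWriter(s.getOutputStream(), true)) {
            while (true) {
                writer.println(buffer.take());
            }
        }
    }
}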
