Serialize protobuf data to Avro - Apache Flink

Is it possible to serialize protobuf data to Avro and write it to files using an Apache Flink sink?

There is no out-of-the-box solution for protobuf yet; it's high on the priority list.
In the meantime you can use protobuf-over-Kryo or parse/serialize manually, for example along the lines of the sketch below.
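For the protobuf-over-Kryo route, a minimal sketch (assuming the chill-protobuf dependency is on the classpath and MyProtoMessage is a hypothetical generated protobuf class) is to register Twitter's ProtobufSerializer for the type:

import com.twitter.chill.protobuf.ProtobufSerializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Serialize this protobuf type with Kryo + chill-protobuf instead of
// letting Kryo fall back to slow generic reflection-based serialization.
env.getConfig().registerTypeWithKryoSerializer(MyProtoMessage.class, ProtobufSerializer.class);

Writing the records out as Avro files would still be a manual step, e.g. mapping each protobuf message to an Avro record before handing it to a file sink.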

Related

What is the recommended way to create a Custom Sink for AWS Sagemaker Feature Store in Apache Flink?

I want to create a Custom Apache Flink Sink to AWS Sagemaker Feature store, but there is no documentation for how to create custom sinks on Flink's website. There are also multiple base classes that I can potentially extend (e.g. AsyncSinkBase, RichSinkFunction), so I'm not sure which to use.
I am looking for guidelines regarding how to implement a custom sink (both in general and for my specific use-case). For my specific use-case: Sagemaker Feature Store has a synchronous client with a putRecord call to send records to AWS Sagemaker FS, so I am ideally looking for a way to create a custom sink that would work well with this client. Note: I require at-least-once processing guarantees, as Sagemaker FS is DynamoDB (a key-value store) under the hood.
Java Client: https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/sagemakerfeaturestoreruntime/AmazonSageMakerFeatureStoreRuntime.html
Example of the putRecord call using the Python client: https://github.com/aws-samples/amazon-sagemaker-feature-store-streaming-aggregation/blob/main/src/lambda/StreamingIngestAggFeatures/lambda_function.py#L31
What I've Found So Far
Some older articles which say to use org.apache.flink.streaming.api.functions.sink.RichSinkFunction and SinkFunction
Some connectors using classes in org.apache.flink.connector.base.sink.writer (e.g. AsyncSinkWriter, AsyncSinkBase)
This section of the Flink docs says to use SourceReaderBase from org.apache.flink.connector.base.source.reader when creating custom sources; SourceReaderBase seems to be the source-side equivalent of the sink classes in the bullet above
Any help/guidance/insights are much appreciated, thanks.
How about extending RichAsyncFunction?
You can find a similar example here - https://nightlies.apache.org/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api - and a hedged sketch of the idea below.
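A minimal sketch of that approach, assuming the synchronous AWS SDK v1 client from the javadoc linked in the question; FeatureRecord, toFeatureValues, and the feature group name are hypothetical placeholders for your own types and configuration:

import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import com.amazonaws.services.sagemakerfeaturestoreruntime.AmazonSageMakerFeatureStoreRuntime;
import com.amazonaws.services.sagemakerfeaturestoreruntime.AmazonSageMakerFeatureStoreRuntimeClientBuilder;
import com.amazonaws.services.sagemakerfeaturestoreruntime.model.PutRecordRequest;

public class FeatureStoreAsyncWriter extends RichAsyncFunction<FeatureRecord, Void> {

    private transient AmazonSageMakerFeatureStoreRuntime client;
    private transient ExecutorService executor;

    @Override
    public void open(Configuration parameters) {
        // The Feature Store runtime client is synchronous, so putRecord calls are
        // handed off to a small thread pool instead of blocking the operator thread.
        client = AmazonSageMakerFeatureStoreRuntimeClientBuilder.defaultClient();
        executor = Executors.newFixedThreadPool(4);
    }

    @Override
    public void asyncInvoke(FeatureRecord record, ResultFuture<Void> resultFuture) {
        CompletableFuture
            .runAsync(() -> client.putRecord(new PutRecordRequest()
                    .withFeatureGroupName("my-feature-group")   // hypothetical group name
                    .withRecord(toFeatureValues(record))),      // hypothetical mapping to a FeatureValue list
                executor)
            .whenComplete((ignored, error) -> {
                if (error != null) {
                    resultFuture.completeExceptionally(error);  // fails the job -> replay from checkpoint
                } else {
                    resultFuture.complete(Collections.emptyList());
                }
            });
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}

You would wire it in with AsyncDataStream.unorderedWait(stream, new FeatureStoreAsyncWriter(), 30_000, TimeUnit.MILLISECONDS, 100) followed by a discarding sink. Completing the future exceptionally fails the job and records are replayed from the last checkpoint, which is how you get at-least-once behaviour here.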

Flink 1.13.2 does not update metrics in near real time when connected to Kafka sources/sinks

I'm creating a process to handle millions of records with Apache Flink to support logistics data pipelines. I'm moving from Kinesis sources/sinks to Kafka sources/sinks.
However, in the Flink dashboard, the job metrics are not being updated in near real time. Do you know what could be wrong with the job/version?
Btw, when the job is closed, it does show all the metrics... but not in near real time...
[Screenshot: job metrics not updating in the Flink dashboard]
Fixed after cleaning up conflicting dependencies on the kafka-clients lib.
In my case, some Avro and CloudEvents libs were also pulling in a higher kafka-clients version. Excluding kafka-clients from those libs so that Flink's kafka-clients version wins solved the issue; a rough Maven sketch follows.
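For example, the exclusion looks roughly like this in the pom.xml (a hedged sketch; the groupId/artifactId shown is a made-up example, use whichever lib actually pulls in kafka-clients in your build):

<dependency>
    <groupId>io.cloudevents</groupId>             <!-- hypothetical offending dependency -->
    <artifactId>cloudevents-kafka</artifactId>
    <version>...</version>
    <exclusions>
        <!-- let the Flink Kafka connector decide the kafka-clients version -->
        <exclusion>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
        </exclusion>
    </exclusions>
</dependency>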

How to add Kafka as a bounded source with the Apache Flink 1.12 DataStream API in BATCH mode

I want to use a Kafka source as a bounded data source with Apache Flink 1.12. I tried it using the FlinkKafkaConsumer connector, but it fails with the following error:
Caused by: java.lang.IllegalStateException: Detected an UNBOUNDED source with the 'execution.runtime-mode' set to 'BATCH'. This combination is not allowed, please set the 'execution.runtime-mode' to STREAMING or AUTOMATIC
at org.apache.flink.util.Preconditions.checkState(Preconditions.java:198) ~[flink-core-1.12.0.jar:1.12.0]
Based on the latest Flink documentation we can use Kafka as a bounded source, but there is no example of how to do it, nor any mention of whether it is the recommended approach.
Can someone share some working example code for this use case?
Here's an example:
KafkaSource<String> source = KafkaSource
.<String>builder()
.setBootstrapServers(...)
.setGroupId(...)
.setTopics(...)
.setDeserializer(...)
.setStartingOffsets(OffsetsInitializer.earliest())
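// setBounded(...) on the next line is what makes the source finite, so it can
// run with 'execution.runtime-mode' set to BATCH.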
.setBounded(OffsetsInitializer.latest())
.build();
env.fromSource(source, WatermarkStrategy.forMonotonousTimestamps(), "Kafka Source");
See the javadocs for more info.

How to integrate non-Confluent connectors with Apache Kafka Connect

There is a requirement where we get a stream of data from Kafka, and our objective is to push this data to Solr.
We did some reading and found there are a lot of Kafka Connect solutions available in the market, but the problem is we do not know which is the best solution or how to achieve it.
The options are:
Use a Solr connector to connect with Kafka.
Use Apache Storm as it directly provides support for integrating with Solr.
There is not much documentation or in-depth information available for the above options.
Will anyone be kind enough to let me know
how we can use a Solr connector and integrate it with the Kafka stream without using Confluent?
Solr-Kafka Connector: https://github.com/MSurendra/kafka-connect-solr
Also, with regard to Apache Storm,
will it be possible for Apache Storm to accept the Kafka stream and push it to Solr, given that we would need some sanitization of the data before pushing it to Solr?
I am avoiding Storm here because the question is mostly about Kafka Connect.
CAVEAT - The Solr connector in the question uses Kafka 0.9.0.1 dependencies, so it is very unlikely to work with the newest Kafka APIs.
This connector is untested by me. Follow at your own risk.
The following is an excerpt from Confluent's documentation on using community connectors, with some emphasis and adaptations. In other words, it is written for Kafka Connect plugins that are not included in Confluent Platform.
1) Clone the GitHub repo for the connector
$ git clone https://github.com/MSurendra/kafka-connect-solr
2) Build the jar with maven
Change into the newly cloned repo and check out the version you want. (This Solr connector has no releases like the Confluent ones.)
You will typically want to check out a released version.
$ cd kafka-connect-solr; mvn package
From here, see Installing Plugins
3) Locate the connector’s uber JAR or plugin directory
We copy the resulting Maven output in the target directory into one of the directories on the Kafka Connect worker’s plugin path (the plugin.path property).
For example, if the plugin path includes the /usr/local/share/kafka/plugins directory, we can use one of the following techniques to make the connector available as a plugin.
As mentioned in the Confluent docs, the export CLASSPATH=<some path>/kafka-connect-solr-1.0.jar option would work, though plugin.path will be the way moving forward (Kafka 1.0+)
You should know which option to use based on the result of mvn package
Option 1) A single, uber JAR file
With this Solr Connector, we get a single file named kafka-connect-solr-1.0.jar.
We copy that file into the /usr/local/share/kafka/plugins directory:
$ cp target/kafka-connect-solr-1.0.jar /usr/local/share/kafka/plugins/
Option 2) A directory of dependencies
(This does not apply to the Solr Connector)
If the connector's JARs are collected into a subdirectory of the build's target directories, we can copy all of these JARs into a plugin directory within /usr/local/share/kafka/plugins, for example:
$ mkdir -p /usr/local/share/kafka/plugins/kafka-connect-solr
$ cp target/kafka-connect-solr-1.0.0/share/java/kafka-connect-solr/* /usr/local/share/kafka/plugins/kafka-connect-solr/
Note
Be sure to install the plugin on all of the machines where you're running Kafka Connect distributed worker processes. It is important that every connector you use is available on all workers, since Kafka Connect will distribute the connector tasks to any of the workers.
4) Running Kafka Connect
If you have properly set plugin.path or did export CLASSPATH, then you can use connect-standalone or connect-distributed with the appropriate config file for that Connect project.
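For example, a standalone run could look roughly like this (a hedged sketch: the connector class name and the solr.url key are assumptions, so check the kafka-connect-solr README for the exact names; name, connector.class, tasks.max and topics are standard Kafka Connect properties):

$ cat solr-sink.properties
name=solr-sink
connector.class=<connector class from the kafka-connect-solr repo>
tasks.max=1
topics=my-topic
solr.url=http://localhost:8983/solr/my-collection

$ connect-standalone connect-standalone.properties solr-sink.properties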
Regarding,
we would need some sanitization of data before pushing it to Solr
You would need to do that with a separate process like Kafka Streams, Storm, or another process prior to Kafka Connect, writing your transformed output to a secondary topic; or write your own Kafka Connect Transform. Kafka Connect has very limited transformations out of the box. A Kafka Streams sketch of the first option follows.
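As an illustration of the Kafka Streams option (a minimal sketch; sanitize() is a hypothetical placeholder for your cleanup logic and the topic names are made up; the Connect sink would then read clean-topic):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "solr-sanitizer");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

StreamsBuilder builder = new StreamsBuilder();
// Read the raw topic, clean each value, and write to the topic Kafka Connect reads from.
builder.<String, String>stream("raw-topic")
       .mapValues(value -> sanitize(value))   // hypothetical cleanup function
       .to("clean-topic");

new KafkaStreams(builder.build(), props).start();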
Also worth mentioning - JSON seems to be the only supported Kafka message format for this Solr connector

In which type of project could I add the Java program for Apache Solr with Eclipse?

I am going to use Apache Solr for the first time. I just want to write some basic code for Apache Solr in Java with Eclipse. Can you please suggest how to get familiar with Apache Solr?
Basically, Solr is a full-text search engine. You can use it as a data store for data of an unstructured nature. You can use SolrJ to communicate with Solr from Java; a minimal example is sketched below.
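A minimal SolrJ sketch to get started (assuming a local Solr instance with a hypothetical core called mycollection and the solr-solrj dependency on the classpath; in Eclipse a plain Java or Maven project is enough):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrInputDocument;

public class SolrJExample {
    public static void main(String[] args) throws Exception {
        // Point the client at a core/collection; "mycollection" is a made-up name.
        HttpSolrClient client =
            new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build();

        // Index one document.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Hello Solr");
        client.add(doc);
        client.commit();

        // Query it back.
        QueryResponse response = client.query(new SolrQuery("title:Hello"));
        response.getResults().forEach(System.out::println);

        client.close();
    }
}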
