To be honest, I am at a very basic level with Apache Flink. I am looking for the Apache Flink sink connector that will send my messages to a Kafka topic.
Looking forward to quick help.
The Apache Flink training has an exercise on the topic of writing to and reading from Kafka. Included are reference solutions which you can use as a guide. The link I've given you is a deep link to the relevant exercise -- you'll probably want to browse around and explore more of the material there as well.
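If it helps to see it in code, here is a minimal sketch of writing a stream to Kafka with the unified KafkaSink that ships with recent Flink releases (1.14+); the broker address, the topic name "output-topic" and the use of SimpleStringSchema are placeholder assumptions, not values from your setup:

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.base.DeliveryGuarantee;
import org.apache.flink.connector.kafka.sink.KafkaRecordSerializationSchema;
import org.apache.flink.connector.kafka.sink.KafkaSink;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class WriteToKafka {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder input; in practice this is whatever stream you want to publish.
        DataStream<String> messages = env.fromElements("hello", "world");

        KafkaSink<String> sink = KafkaSink.<String>builder()
                .setBootstrapServers("localhost:9092")          // placeholder broker
                .setRecordSerializer(KafkaRecordSerializationSchema.builder()
                        .setTopic("output-topic")               // placeholder topic
                        .setValueSerializationSchema(new SimpleStringSchema())
                        .build())
                .setDeliveryGuarantee(DeliveryGuarantee.AT_LEAST_ONCE)
                .build();

        messages.sinkTo(sink);
        env.execute("write-to-kafka");
    }
}

Older Flink releases used FlinkKafkaProducer instead of KafkaSink.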
I am writing a simple connector (source/sink) for Flink 1.14.4 which mostly wraps the official Kafka Connector and automatically sets up custom serializers/deserializers. I'm a bit confused about the current state of the new source/sink interfaces introduced in FLIP-27 and FLIP-143. Is it really possible at the moment to write unified connectors (that is, connectors that work across different APIs, such as DataStream/Table)? Looking at the code of the current Kafka Connector, I see it comes with both legacy and new flavours, but AFAIK the connector for the Table API still relies on the legacy API only. Also, reading the official documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sourcessinks/
It seems that the new interfaces still cannot be used for the Table API. To make matters worse, I find it very confusing that only sources are mentioned in the DataStream section, which already describes the new approach:
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/datastream/sources/
but nothing is said regarding sinks. Overall, I think this leaves the user not knowing how to approach the creation of custom connectors today. In particular, I would expect an equivalent section for the DataStream API, i.e., one covering the creation of user-defined sources & sinks, like the one given above for the Table API.
The unified Source and Sink APIs (FLIP-27 and FLIP-143) were introduced to provide one interface for connectors so they can be used for both bounded (batch) and unbounded (stream) data.
Both interfaces allow for building a source/sink that you can use in either DataStream or Table/SQL API. That's currently already the case for FileSystem, Kafka and Pulsar (as of Flink 1.15 which will be released shortly).
You're absolutely right that the current documentation doesn't make this clear. At the moment, the Flink community is working on externalizing the connectors (moving each of them from the Flink repository to their own individual repository) and overhauling the documentation and guides on how to write a connector.
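In the meantime, as a rough illustration of the DataStream side of your question, a thin wrapper around the FLIP-27 KafkaSource that plugs in a custom deserializer could look like the sketch below (the helper name forTopic and the group id are my own placeholders); exposing the same connector to Table/SQL additionally requires the DynamicTableSourceFactory machinery from the sourcessinks page you linked:

import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;

public class WrappedKafkaSource {

    // Hypothetical helper: wraps the official FLIP-27 KafkaSource and injects
    // a custom value DeserializationSchema supplied by the caller.
    public static <T> KafkaSource<T> forTopic(
            String brokers, String topic, DeserializationSchema<T> deserializer) {
        return KafkaSource.<T>builder()
                .setBootstrapServers(brokers)
                .setTopics(topic)
                .setGroupId("wrapped-source")                  // placeholder group id
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(deserializer)
                .build();
    }
}

On the DataStream side this is then consumed with env.fromSource(WrappedKafkaSource.forTopic(...), WatermarkStrategy.noWatermarks(), "kafka").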
I have a scenario with Spring Batch where I need to read data from an MS SQL Server database and write it to a Cassandra database.
I am new to batch processing and haven't found many resources on Google to understand this better.
Could you please share inputs on the same?
Thanks in advance.
Your question is very light on detail and a little too open-ended, so I wanted to warn you that there's a chance the community will vote to close it for those reasons.
Based on what you've provided, it sounds like you've got a streaming use case where an app "service" would be the source of the data, publishing it to a messaging/event platform so that other systems/services can subscribe to those events.
You can use Kafka or Pulsar as the platform and Cassandra is one of the sinks. If you're interested in trying it out, Astra Streaming is a streaming-as-a-service backed by Pulsar with Astra DB (Cassandra-as-a-service) as the sink.
Astra Streaming and DB have free tiers which don't require a credit card so you can quickly do POCs without having to worry about downloading/installing/configuring clusters.
As a side note, Astra DB comes with ready-to-use Stargate.io -- a data platform that allows you to connect to Cassandra using REST, GraphQL and JSON/Doc APIs so you can easily build applications on top of Cassandra using APIs. Cheers!
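If it helps to make that pattern concrete, here is a minimal sketch of the publishing side using Kafka as the platform; the broker address, the topic "orders" and the JSON payload are made-up placeholders, and on the consuming side a Cassandra sink connector (or the Astra DB sink mentioned above) would subscribe to the same topic and write the events into a table:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EventPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // placeholder broker
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // The app "service" publishes each change as an event; downstream
        // consumers (e.g. a Cassandra sink) subscribe independently.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String event = "{\"orderId\": 42, \"status\": \"CREATED\"}";
            producer.send(new ProducerRecord<>("orders", "42", event));
        }
    }
}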
I have been using Apache Camel for quite a long time and have found it to be a fantastic solution for all kinds of system-integration-related business needs. But a couple of years back I came across Apache NiFi. After some googling I found that though NiFi can work as an ETL tool, it is actually meant for stream processing.
In my opinion, "which is better" is a bad question to ask, as the answer depends on many things. But it would be nice if somebody could describe the basic comparison between the two, and also answer the obvious question: when to use what.
It will help me decide which is the better option for my current requirement, or whether I should use both of them together.
The biggest and most obvious distinction is that NiFi is a no-code approach - 99% of NiFi users will never see a line of code. It is a web based GUI with a drag and drop interface to build pipelines.
NiFi can perform ETL, and can be used in batch use cases, but it is geared towards data streams. It is not just about moving data from A to B; it can do complex (and performant) transformations, enrichments and normalisations. It comes out of the box with support for many specific sources and endpoints (e.g. Kafka, Elastic, HDFS, S3, Postgres, Mongo, etc.) as well as generic sources and endpoints (e.g. TCP, HTTP, IMAP, etc.).
NiFi is not just about messages - it can work natively with a wide array of different formats, but can also be used for binary data and large files (e.g. moving multi-GB video files).
NiFi is deployed as a standalone application - it's not a framework or api or library or something that you integrate in to something else. It is a fully self-contained, realised application that is fully featured out of the box with no additional development. Though it can be extended with custom development if required.
NiFi is natively clustered - it expects (though it isn't required) to be deployed on multiple hosts that work together as a cluster for performance, availability and redundancy.
So, the two tools are used quite differently - hopefully that helps highlight some of the key differences.
It's true that there is some functional overlap between NiFi and Camel, but they were designed very differently:
Apache NiFi is a data processing and integration platform that is mostly used centrally. It has a low-code approach and prefers configuration.
Apache Camel is an integration framework which is mostly used in distributed solutions. Solutions are coded in Java. Example solutions are adapters, flows, APIs, connectors, cloud functions and so on.
They can be used very well together. Especially when using a message broker like Apache ActiveMQ or Apache Kafka.
An example: a Java application is enhanced with Camel so that it can send messages to Kafka. In NiFi, the first step is consuming those messages from Kafka. Then, in the NiFi flow, the message is changed in various steps. In the middle, the message is put on another Kafka topic. A Camel function (Camel K) in the cloud performs various operations on the message; when it's finished, it puts the message on a Kafka topic. The message then goes through a NiFi flow which at the end calls an API created with Camel.
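As a hedged sketch of the first step in that example (the Java application enhanced with Camel so it can send messages to Kafka), a route could look roughly like this; the endpoint name direct:orders, the topic orders and the broker address are invented for the illustration:

import org.apache.camel.builder.RouteBuilder;

public class OrdersToKafkaRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("direct:orders")                              // messages produced by the application
            .to("kafka:orders?brokers=localhost:9092");    // published for the NiFi flow to consume
    }
}

The rest of the chain (the NiFi flow, the Camel K function, the final Camel API) is then wired around the same topics.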
In a blog post I wrote in detail about the various ways to combine Camel and NiFi:
https://raymondmeester.medium.com/using-camel-and-nifi-in-one-solution-c7668fafe451
I wonder why there is an AvroKeyValueSinkWriter for Flink, but no simple AvroSinkWriter with a regular schema (non-key-value).
I use this to generate near-streaming Avro files, which I batch once an hour into Parquet files.
I use Flink's BucketingSink.
The key-value schema is giving me a hard time when generating Parquet.
Did I miss something? Thanks!
You will not find much help with anything Flink.
The documentation relies on Javadoc, and the examples are almost one-liners, like word count and other nonsense.
I have yet to see what a "pro" Flink coder can do, or learn the right way to do some of the simplest tasks. Reading from Kafka, parsing an Avro or JSON record, then putting specific data on a file system or HDFS would be great. You won't find any such examples.
You would think that by now searching the net would turn up some solid, complex examples.
Most of these projects require you to read through all the source code and try to figure out an approach.
In the end it is just easier to use Spring Boot and jam code into a service than to buy into Flink, and to some degree Spark.
Best of luck to you.
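For what it's worth, a rough sketch of that Kafka → parse → file system pipeline with more recent Flink APIs (1.14+) is below; the broker, topic, output path and the pass-through parsing step are placeholders:

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class KafkaToFiles {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("localhost:9092")        // placeholder broker
                .setTopics("events")                          // placeholder topic
                .setGroupId("kafka-to-files")
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        FileSink<String> sink = FileSink
                .forRowFormat(new Path("hdfs:///data/events"),    // placeholder output path
                        new SimpleStringEncoder<String>("UTF-8"))
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka")
           // Replace this pass-through with real JSON/Avro parsing, e.g. Jackson
           // or an Avro DeserializationSchema, keeping only the fields you need.
           .map(record -> record)
           .sinkTo(sink);

        env.execute("kafka-to-files");
    }
}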
I am fairly new to Apache Flink. I have a specific requirement where I have to use an Elasticsearch index as a source. I tried to figure out whether Flink has an Elasticsearch source, but it doesn't seem to. I can see that Elasticsearch is supported as a sink, but there is no direct support for it as a source. Can anyone guide me on how to solve this problem? I am using Elasticsearch 5.5.0 and Flink 1.2.
I found one Flink Elasticsearch source connector implementation on GitHub - https://github.com/mnubo/flink-elasticsearch-source-connector. But it seems that it has not been active for almost a year now and has limited support in terms of aggregations and ES versions.
Thought of sharing this just in case it meets someone's requirement.
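If no maintained connector fits, one fallback is to hand-roll a source around the Elasticsearch REST client. Below is a rough sketch using the low-level RestClient (which exists for the 5.x line) inside a RichSourceFunction; the host, port and index name "my-index" are placeholders, and scrolling/pagination, JSON parsing and error handling are deliberately left out:

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction.SourceContext;
import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class ElasticsearchSource extends RichSourceFunction<String> {

    private transient RestClient client;
    private volatile boolean running = true;   // would guard the scroll loop in a real source

    @Override
    public void open(Configuration parameters) {
        client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();
    }

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        // One-shot search; a real implementation would page through the index
        // with the scroll API and keep emitting while 'running' is true.
        Response response = client.performRequest("GET", "/my-index/_search");
        ctx.collect(EntityUtils.toString(response.getEntity()));
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (client != null) {
            client.close();
        }
    }
}

This is obviously much less capable than a proper connector (no parallel splits, no checkpointed position), so the GitHub project above is still worth evaluating first.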