My producer is Apache Kafka, and we want to consume batches of events, process them, and write the processed events into the database. If I use stream/batch processing as-is, every event hits the DB with one query. I don't want each event to cost one query. How can I batch some of the events and write that bulk data into the DB?
Note: We are using the DataStream API.
No, there isn't an official Neo4j sink for Flink. If your goal is to implement exactly-once end-to-end with buffered, batched transactional updates, you might start by reading An Overview of End-to-End Exactly-Once Processing in Apache Flink, and then reach out to the Flink user mailing list for further guidance.
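As a concrete starting point, here is a minimal sketch of a buffering sink for the DataStream API (MyEvent, DbClient, and the batch size are hypothetical placeholders, not an official connector): it collects events in memory and writes them to the database in one bulk call once a threshold is reached, instead of issuing one query per event.

```java
// Minimal sketch of a buffering sink. MyEvent and DbClient are hypothetical
// placeholders for your event type and database client; the point is simply
// to collect events and write them in bulk rather than one query per event.
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

import java.util.ArrayList;
import java.util.List;

public class BufferingDbSink extends RichSinkFunction<MyEvent> {

    private static final int BATCH_SIZE = 500;        // assumed flush threshold
    private final List<MyEvent> buffer = new ArrayList<>();
    private transient DbClient client;

    @Override
    public void open(Configuration parameters) {
        client = DbClient.connect();                   // hypothetical connection setup
    }

    @Override
    public void invoke(MyEvent event, Context context) {
        buffer.add(event);
        if (buffer.size() >= BATCH_SIZE) {
            flush();
        }
    }

    private void flush() {
        client.writeBatch(buffer);                     // one bulk write for the whole buffer
        buffer.clear();
    }

    @Override
    public void close() {
        if (!buffer.isEmpty()) {
            flush();                                   // write any remaining events on shutdown
        }
        if (client != null) {
            client.close();
        }
    }
}
```

Note that anything sitting in the in-memory buffer is lost if the job fails between flushes, so for fault tolerance you would also want to flush or snapshot the buffer on checkpoints (e.g. by implementing CheckpointedFunction).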
Related
I am researching building a Flink pipeline without a data sink, i.e. my pipeline ends when it makes a successful API call to a datastore.
In that case, if we don't use a sink operator, how will checkpointing work?
Checkpointing is based on the concept of a pre-checkpoint epoch (all events that are persisted in state or emitted into sinks) and a post-checkpoint epoch. Is having a sink required for a Flink pipeline?
Yes, sinks are required as part of Flink's execution model:
DataStream programs in Flink are regular programs that implement transformations on data streams (e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may for example write the data to files, or to standard output (for example the command line terminal).
One could argue that the call to your datastore is the actual sink implementation that you could use. You could define your own sink and execute the datastore call there.
I don't know the details of your datastore, but one can assume that you are serializing these events and sending them to the datastore in some way. In that case, you could flow all your elements to the sink operator and store each of these elements in some ListState, which you can continuously offload and send. This way, if your application needs to be upgraded, in-flight records will not be lost and will be recovered and sent once the job has restored.
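A minimal sketch of that pattern, assuming hypothetical MyEvent and DatastoreClient types, might look like this: the sink buffers elements, sends them to the datastore, and snapshots anything not yet sent into ListState so that a restored job can re-send it.

```java
// Sketch only: MyEvent and DatastoreClient are hypothetical placeholders.
// Elements are kept in operator state until they have been sent, so a
// restored job can re-send anything that was still in flight.
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;

public class DatastoreSink implements SinkFunction<MyEvent>, CheckpointedFunction {

    private transient ListState<MyEvent> checkpointedPending;
    private final Deque<MyEvent> pending = new ArrayDeque<>();
    private transient DatastoreClient client;

    @Override
    public void invoke(MyEvent event, Context context) {
        pending.add(event);
        drain();
    }

    private void drain() {
        // Continuously offload buffered elements to the datastore.
        while (!pending.isEmpty()) {
            MyEvent next = pending.peek();
            client.send(next);          // hypothetical API call to the datastore
            pending.poll();             // drop it only once the call succeeded
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // Persist anything not yet sent so it survives restarts and upgrades.
        checkpointedPending.update(new ArrayList<>(pending));
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        checkpointedPending = context.getOperatorStateStore()
                .getListState(new ListStateDescriptor<>("pending-events", MyEvent.class));
        client = DatastoreClient.connect();   // hypothetical connection setup
        if (context.isRestored()) {
            for (MyEvent e : checkpointedPending.get()) {
                pending.add(e);               // recover in-flight records
            }
        }
    }
}
```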
I am currently working on a streaming platform that accepts an unbounded stream from a source into Kafka. I am using Flink as my stream processing engine. I am able to ingest data successfully, window it on event time and do whatever I want to do in Flink. The output of this stream currently goes into a Kafka sink, which is ok for now since this data will not be streamed anywhere. This entire setup is deployed on AWS.
An external client is now interested in the data. The client wants the data in a streaming format instead of pulling it from Kafka. We also do not want to expose our Kafka brokers to the outside world. How can we achieve this? I tried the Pushpin proxy to "push" the data out, but it's a pain to set up and manage.
Any idea how to approach this? I am really open to any ideas.
Thanks
Does Flink provide a way to poll an API periodically and create a DataStream out of it for further processing?
We currently push the messages to Kafka and read them through Kafka. Is there any way to poll the API directly through Flink?
I'm not aware of such a source connector for Flink, but it would be relatively straightforward to implement one. There are examples out there that do just this but with a database query; one of those might serve as a template for getting started.
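For illustration, a minimal sketch of such a polling source built on the (legacy, but simple) SourceFunction interface could look like the following; callApi() and the polling interval are placeholders for your own HTTP client and schedule.

```java
// Minimal sketch of a periodic polling source. callApi() is a placeholder
// for whatever HTTP client call you need; the interval is an assumption.
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

public class PollingApiSource extends RichSourceFunction<String> {

    private static final long POLL_INTERVAL_MS = 60_000L;   // assumed polling period
    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (running) {
            String response = callApi();                     // hypothetical HTTP call
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(response);                       // emit one record per poll
            }
            Thread.sleep(POLL_INTERVAL_MS);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    private String callApi() {
        // Replace with a real HTTP client call (e.g. java.net.http.HttpClient).
        return "...";
    }
}
```

You would attach it with env.addSource(new PollingApiSource()) to obtain a DataStream<String> for further processing. Newer Flink versions also offer the unified Source API, which is the recommended long-term interface but more involved to implement for a quick sketch like this.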
We are developing a stream processing service that flows from Kinesis into Flink.
The service will need to call several external APIs to enrich the data stream, so the Async I/O API will be used in our case.
In case of any exception, we would like to capture the events in a sink and selectively replay the messages to Flink for retrying.
We have control of Kinesis; Kinesis allows replaying messages up to 7 days back with extended data retention, but not selectively, and we wish to keep them for longer.
We are thinking about side-outputting the exception stream and saving it to a separate sink. Is there a mechanism for pulling data from that sink back into Kinesis with minimal manual effort? We have not decided which sink we would use. Is there a better solution or recommendation?
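To illustrate the side-output part of that idea, here is a minimal sketch (EnrichedEvent is a hypothetical wrapper that records whether the enrichment call failed): failed events are routed to a side output and written to a separate sink, from which a small replay job or script could push selected records back into Kinesis.

```java
// Sketch only: EnrichedEvent is a hypothetical wrapper type produced by the
// async enrichment stage, with a failed() flag set when an API call threw.
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.functions.ProcessFunction;
import org.apache.flink.util.Collector;
import org.apache.flink.util.OutputTag;

public class FailureRouting {

    // OutputTag is created as an anonymous subclass so Flink can infer the type.
    static final OutputTag<EnrichedEvent> FAILED =
            new OutputTag<EnrichedEvent>("failed-enrichment") {};

    public static void route(DataStream<EnrichedEvent> enriched) {
        SingleOutputStreamOperator<EnrichedEvent> main = enriched
            .process(new ProcessFunction<EnrichedEvent, EnrichedEvent>() {
                @Override
                public void processElement(EnrichedEvent event, Context ctx,
                                           Collector<EnrichedEvent> out) {
                    if (event.failed()) {
                        ctx.output(FAILED, event);   // exceptions go to the side output
                    } else {
                        out.collect(event);          // healthy events continue downstream
                    }
                }
            });

        DataStream<EnrichedEvent> failures = main.getSideOutput(FAILED);
        // failures.addSink(...)  e.g. files/S3/Kafka, from which a separate
        // replay job or script can push selected records back into Kinesis.
    }
}
```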
I am using Flink streaming to read data from Kafka and process it. Before consuming from Kafka, when the application starts, I need to read a file using the DataSet API, sort it based on some criteria, and create a list from it. Only then should it start consuming from Kafka in a streaming fashion. I have written the logic to read and sort the data from a file using the DataSet API, but when I try to run the program it never executes, and Flink immediately starts consuming from Kafka. Is there any way I could process the data set first and then do the streaming in Flink?
No, it is not possible to mix the DataSet and DataStream APIs. You can, however, start both programs from the same main() method, but you would have to write the sorted result of the DataSet program into a file which is then consumed by the DataStream program.
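A minimal sketch of that two-step structure (the paths and the sorting logic are placeholders) could look like this: the batch job runs to completion and writes its result to a file, and only then does the streaming job start and read that file alongside Kafka.

```java
// Sketch of running a DataSet job and a DataStream job from one main() method.
// Paths are placeholders; the sorting step is left as a comment on purpose.
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class SortThenStream {

    public static void main(String[] args) throws Exception {
        // 1) Batch part: read the file, sort it, and materialize the result.
        ExecutionEnvironment batchEnv = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<String> lines = batchEnv.readTextFile("/path/to/input");
        // ... apply your sorting logic here ...
        lines.writeAsText("/path/to/sorted");
        batchEnv.execute("sort-file");        // blocks until the batch job has finished

        // 2) Streaming part: the sorted file now exists and can be read here,
        //    together with your Kafka source, windows, etc.
        StreamExecutionEnvironment streamEnv = StreamExecutionEnvironment.getExecutionEnvironment();
        streamEnv.readTextFile("/path/to/sorted").print();
        streamEnv.execute("stream-from-kafka");
    }
}
```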
Create another Flink job for your DataSet manipulation and sink its results to the Kafka topic your streaming job is consuming from.