Periodically refreshing static data in Apache Flink? - apache-flink

I have an application that receives most of its input from a stream, but some of its data comes from an RDBMS as well as a series of static files.
The stream will continuously emit events, so the Flink job will never end, but how do you periodically refresh the RDBMS data and the static files to capture any updates to those sources?
I am currently using the JDBCInputFormat to read data from the database.

For each of your two sources that might change (the RDBMS and the files), create a Flink source that uses a broadcast stream to send updates to the Flink operators that are processing the data from Kafka. A broadcast stream sends each record to every task/instance of the receiving operator.
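A minimal sketch of that broadcast-state pattern, assuming String records and a hypothetical keyOf() helper; the periodic re-read of the RDBMS/files is assumed to happen in whatever source produces the referenceUpdates stream:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastEnrichmentSketch {

    // Descriptor for the broadcast state that every parallel task will hold.
    static final MapStateDescriptor<String, String> REF_DESCRIPTOR =
            new MapStateDescriptor<>("referenceData",
                    BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO);

    public static DataStream<String> enrich(DataStream<String> events,
                                            DataStream<String> referenceUpdates) {
        // Broadcast the (periodically refreshed) reference data to all tasks.
        BroadcastStream<String> broadcastRef = referenceUpdates.broadcast(REF_DESCRIPTOR);

        return events
                .connect(broadcastRef)
                .process(new BroadcastProcessFunction<String, String, String>() {

                    @Override
                    public void processElement(String event, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        // Look up the latest reference value for this event's key.
                        String refValue = ctx.getBroadcastState(REF_DESCRIPTOR).get(keyOf(event));
                        out.collect(event + "|" + (refValue == null ? "unknown" : refValue));
                    }

                    @Override
                    public void processBroadcastElement(String refRecord, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Each refresh overwrites the entry for that key in broadcast state.
                        ctx.getBroadcastState(REF_DESCRIPTOR).put(keyOf(refRecord), refRecord);
                    }
                });
    }

    // Hypothetical key extraction; depends entirely on your record format.
    private static String keyOf(String record) {
        return record.split("\\|", 2)[0];
    }
}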

For each of your sources (the files and the RDBMS) you can periodically create a snapshot in HDFS or some other storage (for example, every 6 hours) and compute the difference between two consecutive snapshots. The result is then pushed to Kafka. This approach works when you cannot modify the database or file structure to add extra information (e.g., a last_update column in the RDBMS).
Another solution is to add a column named last_update, use it to filter the rows that changed between two queries, and push only those rows to Kafka.
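As a rough sketch of that second approach, assuming the legacy JDBCInputFormat the question mentions, a hypothetical my_table with a last_update column, and Postgres as the database:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;

public class IncrementalJdbcRead {

    public static JDBCInputFormat changedRowsSince(String lastPollTimestamp) {
        RowTypeInfo rowTypeInfo = new RowTypeInfo(
                BasicTypeInfo.LONG_TYPE_INFO,     // id
                BasicTypeInfo.STRING_TYPE_INFO);  // payload

        return JDBCInputFormat.buildJDBCInputFormat()
                .setDrivername("org.postgresql.Driver")            // assumption: Postgres
                .setDBUrl("jdbc:postgresql://db-host:5432/appdb")  // placeholder URL
                .setUsername("reader")
                .setPassword("secret")
                // Only fetch rows modified since the last snapshot/poll.
                .setQuery("SELECT id, payload FROM my_table WHERE last_update > '"
                        + lastPollTimestamp + "'")
                .setRowTypeInfo(rowTypeInfo)
                .finish();
    }
}

In a real job you would track lastPollTimestamp between runs and parameterize the query instead of concatenating strings.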

Related

Kafka Consumer to ingest data without querying DB for every message

We have a Kafka consumer service to ingest data into our DB. Whenever we receive a message from the topic, we compose an insert statement and insert the message into the DB. We use a DB connection pool to handle the insertion, and so far so good.
Currently, we need to add a filter to select only the relevant messages from Kafka before doing the insert. There are two options in my mind to do this.
Option 1: Create a config table in the DB to define our filtering condition.
Pros
No need to make code changes or redeploy the service
Just insert new filters into the config table; the service will pick them up on the next run
Cons
Need to query the DB every time we receive a new message.
Say we receive 100k new messages daily and need to filter out 50k. In total we only need to run 50k INSERT commands, but we would need to run 100k SELECT queries to check the filter condition for every single Kafka message.
Option 2: Use a hardcoded config file to define those filters.
Pros
Only need to read the filters once when the consumer starts running
Puts no burden on the DB layer
Cons
This is not a scalable approach since we are planning to add a lot of filters; every time, we would need to change the config file and redeploy the consumer.
My question is: is there a better option to achieve the goal, i.e., to find the filters without a hardcoded config file and without increasing the number of DB queries?
Your filters could be in another Kafka topic.
Start your app by reading that topic to the end, and only then start doing database inserts. Store each consumed filter record in some local structure such as a ConcurrentHashMap, SQLite, RocksDB (provided by Kafka Streams), or DuckDB, which has become popular recently; a rough sketch of this start-up phase is shown at the end of this answer.
When you add a new filter, your consumer would need to temporarily pause its database operations.
If you use Kafka Streams, you could look up data from the incoming topic against your filters "table" state store using the Processor API and drop the filtered records from the stream.
This way, you separate your database reads from your writes once you start inserting 50k+ records, and your app wouldn't be blocked trying to read any "external config".
You could also use Zookeeper, as that's one of its use cases
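Here is a rough sketch of the start-up phase described above, assuming a filters topic named "message-filters" with String keys/values and tombstones (null values) for deletions; none of these names come from the question:

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FilterTableLoader {

    private final Map<String, String> filters = new ConcurrentHashMap<>();

    public Map<String, String> loadFilters(Properties baseProps) {
        // baseProps is expected to contain at least bootstrap.servers.
        Properties props = new Properties();
        props.putAll(baseProps);
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions of the filters topic and read from the beginning.
            List<TopicPartition> partitions = consumer.partitionsFor("message-filters")
                    .stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            // Keep polling until we have caught up to the end offsets captured above.
            while (!caughtUp(consumer, endOffsets)) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    // A tombstone (null value) removes a filter; otherwise upsert it.
                    if (record.value() == null) {
                        filters.remove(record.key());
                    } else {
                        filters.put(record.key(), record.value());
                    }
                }
            }
        }
        return filters;
    }

    private static boolean caughtUp(KafkaConsumer<?, ?> consumer,
                                    Map<TopicPartition, Long> endOffsets) {
        return endOffsets.entrySet().stream()
                .allMatch(e -> consumer.position(e.getKey()) >= e.getValue());
    }
}

Once loadFilters() has caught up, the main consumer can check each incoming message against the returned map before composing its INSERT; to pick up later filter changes, the same consumer could be kept open and polled periodically instead of being closed.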

Flink pipeline without a data sink with checkpointing on

I am researching how to build a Flink pipeline without a data sink, i.e., my pipeline ends when it makes a successful API call to a datastore.
In that case, if we don't use a sink operator, how will checkpointing work?
Checkpointing is based on the concept of a pre-checkpoint epoch (all events that are persisted in state or emitted into sinks) and a post-checkpoint epoch. Is having a sink required for a Flink pipeline?
Yes, sinks are required as part of Flink's execution model:
DataStream programs in Flink are regular programs that implement transformations on data streams (e.g., filtering, updating state, defining windows, aggregating). The data streams are initially created from various sources (e.g., message queues, socket streams, files). Results are returned via sinks, which may for example write the data to files, or to standard output (for example the command line terminal).
One could argue that the call to your datastore is the actual sink implementation: you could define your own sink and execute the datastore call there.
I don't know the details of your datastore, but one can assume that you are serializing these events and sending them to the datastore in some way. In that case, you could flow all your elements to the sink operator and buffer each of them in some ListState, which you continuously offload and send. This way, if your application needs to be upgraded, in-flight records will not be lost; they will be recovered and sent once the job has restored. A sketch of such a buffering sink follows below.
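A minimal sketch of that buffering sink, assuming String records and a hypothetical datastore client (the batch size and the commented-out send() call are placeholders):

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class DatastoreSink extends RichSinkFunction<String> implements CheckpointedFunction {

    private transient ListState<String> checkpointedBuffer;
    private final List<String> buffer = new ArrayList<>();
    private final int batchSize = 100;

    @Override
    public void invoke(String value, Context context) throws Exception {
        buffer.add(value);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    private void flush() throws Exception {
        // Hypothetical datastore call; replace with your actual API client.
        // datastoreClient.send(buffer);
        buffer.clear();
    }

    @Override
    public void snapshotState(FunctionSnapshotContext ctx) throws Exception {
        // Persist whatever has not been sent yet as part of the checkpoint.
        checkpointedBuffer.update(buffer);
    }

    @Override
    public void initializeState(FunctionInitializationContext ctx) throws Exception {
        ListStateDescriptor<String> descriptor =
                new ListStateDescriptor<>("unsent-records", Types.STRING);
        checkpointedBuffer = ctx.getOperatorStateStore().getListState(descriptor);

        if (ctx.isRestored()) {
            // On restore, reload the records that were in flight when the job stopped.
            for (String record : checkpointedBuffer.get()) {
                buffer.add(record);
            }
        }
    }
}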

Transform a specific group of records from MongoDB

I've got a periodically triggered batch job which writes data into MongoDB. The job takes about 10 minutes, and after that I would like to read this data and do some transformations with Apache Flink (mapping, filtering, cleaning, ...). There are dependencies between the records, which means I have to process them together. For example, I'd like to transform all records from the latest batch run where the customer id is 45666; the result would be one aggregated record.
Are there any best practices or ways to do that without implementing everything myself (get distinct customer ids from the latest job, select and transform the records for each customer, flag the transformed customers, etc.)?
I'm not able to stream it record by record because I have to transform multiple records together, not one by one.
Currently I'm using Spring Batch, MongoDB, Kafka and thinking about Apache Flink.
Conceivably you could connect the MongoDB change stream to Flink and use that as the basis for the task you describe. The fact that 10-35 GB of data is involved doesn't rule out using Flink streaming, as you can configure Flink to spill to disk if its state can't fit on the heap.
I would want to understand the situation better before concluding that this is a sensible approach, however.
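As a very rough sketch of what the per-customer grouping could look like once the change stream is in Flink, assuming a hypothetical CustomerRecord type and a 15-minute session gap that is longer than the pause between batch runs (both are assumptions, not anything from the question):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.ProcessingTimeSessionWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class PerCustomerAggregation {

    public static DataStream<String> aggregate(DataStream<CustomerRecord> changes) {
        return changes
                .keyBy(new KeySelector<CustomerRecord, String>() {
                    @Override
                    public String getKey(CustomerRecord record) {
                        return record.customerId;
                    }
                })
                // A batch run produces a burst of records per customer; a session gap
                // longer than the pause between runs closes the group.
                .window(ProcessingTimeSessionWindows.withGap(Time.minutes(15)))
                .process(new ProcessWindowFunction<CustomerRecord, String, String, TimeWindow>() {
                    @Override
                    public void process(String customerId, Context ctx,
                                        Iterable<CustomerRecord> records,
                                        Collector<String> out) {
                        int count = 0;
                        for (CustomerRecord ignored : records) {
                            count++;
                        }
                        // Placeholder aggregation: emit one summary record per customer.
                        out.collect(customerId + " -> " + count + " records in this batch");
                    }
                });
    }

    // Hypothetical record type matching the data written by the batch job.
    public static class CustomerRecord {
        public String customerId;
        public String payload;
    }
}

An event-time window keyed on a batch-run id written by the job would be more precise than processing-time sessions, but that requires the id to be present on every record.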

Database for large data files and streaming

I have a "database choice" and arhitecture question.
Use-case:
Clients will upload large .json files (or another format such as .tsv; it is irrelevant) where each line contains data about one of their customers (e.g., name, address, etc.).
We need to stream this data later on to process it, and store the results, which will also be some large file where each line holds data about one customer (approximately the same as the uploaded file).
My requirements:
Streaming should be as fast as possible (e.g., > 1000 rps), and we could have multiple processes running in parallel (for multiple clients).
The database should be scalable and fault tolerant. Because many GB of data could easily be uploaded, it should be easy for me to automatically add new commodity instances (on AWS) if storage runs low.
The database should have some kind of replication because we don't want to lose data.
No index is required since we are just streaming the data.
What would you suggest as a database for this problem? We tried uploading to Amazon S3 and letting it take care of scaling etc., but reads/streaming from it are slow.
Thanks,
Ivan
Initially uploading the files to S3 is fine, but then pick them up and push each line to Kinesis (or MSK, or even Kafka on EC2 if you prefer); from there, you can hook up the stream processing framework of your choice (Flink, Spark Streaming, Samza, Kafka Streams, Kinesis KCL) to do transformations and enrichment, and finally you'll want to pipe the results into a storage stack that allows streaming appends; a sketch of that first S3-to-Kinesis hop follows after this list. A few obvious candidates:
HBase
Druid
Keyspaces for Cassandra
Hudi (or maybe LakeFS?) on top of S3
Which one you choose is kind of up to your needs downstream in terms of query flexibility, latency, integration options/standards, etc.
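Here is a rough sketch of the first hop described above, using the AWS SDK v2; the bucket, key, and stream names are placeholders, and production code would batch with PutRecords and handle throttling/retries:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.services.kinesis.KinesisClient;
import software.amazon.awssdk.services.kinesis.model.PutRecordRequest;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;

public class S3ToKinesis {

    public static void relay(String bucket, String key, String streamName) throws Exception {
        try (S3Client s3 = S3Client.create();
             KinesisClient kinesis = KinesisClient.create();
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     s3.getObject(GetObjectRequest.builder().bucket(bucket).key(key).build()),
                     StandardCharsets.UTF_8))) {

            String line;
            while ((line = reader.readLine()) != null) {
                // One customer record per line; the line itself is the payload.
                kinesis.putRecord(PutRecordRequest.builder()
                        .streamName(streamName)
                        .partitionKey(Integer.toString(line.hashCode()))
                        .data(SdkBytes.fromUtf8String(line))
                        .build());
            }
        }
    }
}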

Is it ok to Access database inside FlatMapFunction in a Flink App?

I am consuming a Kafka topic as a DataStream and using a FlatMapFunction to process the data. The processing consists of enriching the instances that come from the stream with more data that I get from the database by executing a query in order to collect the output, but it doesn't feel like the best approach.
Reading the docs, I know that I can create a DataSet from a database query, but I only saw examples for batch processing.
Can I perform a merge/reduce (or another operation) between a DataStream and a DataSet to accomplish that?
Can I get any performance improvement by using a DataSet instead of accessing the database directly?
There are various approaches one can take for accomplishing this kind of enrichment with Flink's DataStream API.
(1) If you just want to fetch all the data on a one-time basis, you can use a stateful RichFlatMapFunction that does the query in its open() method (a sketch of this approach follows at the end of this answer).
(2) If you want to do a query for every stream element, then you could do that synchronously in a FlatMapFunction, or look at Async I/O for a more performant approach.
(3) For best performance while also getting up-to-date values from the external database, look at streaming in the database change stream and doing a streaming join with a CoProcessFunction. Something like http://debezium.io/ could be useful here.
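A minimal sketch of approach (1), assuming String events, a Postgres JDBC URL, and a hypothetical reference_table / keyOf() pairing (all placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

public class OneTimeLookupEnricher extends RichFlatMapFunction<String, String> {

    private transient Map<String, String> referenceData;

    @Override
    public void open(Configuration parameters) throws Exception {
        // One-time load of the enrichment table when the task starts.
        referenceData = new HashMap<>();
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:postgresql://db-host:5432/appdb", "reader", "secret");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT id, extra_info FROM reference_table")) {
            while (rs.next()) {
                referenceData.put(rs.getString("id"), rs.getString("extra_info"));
            }
        }
    }

    @Override
    public void flatMap(String event, Collector<String> out) {
        // Enrich the incoming event with the value loaded at startup, if any.
        String extra = referenceData.get(keyOf(event));
        out.collect(extra != null ? event + "|" + extra : event);
    }

    // Hypothetical key extraction; depends on how your events are encoded.
    private static String keyOf(String event) {
        return event.split("\\|", 2)[0];
    }
}

For approach (2), the same lookup would move into flatMap() (or into a RichAsyncFunction if you go the Async I/O route), at the cost of one query per element.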
