Kafka Consumer to ingest data without querying DB for every message

We have a Kafka consumer service that ingests data into our DB. Whenever we receive a message from the topic, we compose an INSERT statement to write it to the DB. We use a DB connection pool to handle the insertion, and so far so good.
Now we need to add a filter so that only the relevant messages from Kafka are inserted. There are two options in my mind to do this.
Option 1: Create a config table in the DB to define our filtering condition.
Pros
No need to make code changes or redeploy services
Just insert new filters into the config table; the service will pick them up on its next run
Cons
Need to query the DB every time we receive a new message.
Say we receive 100k new messages daily and need to filter out 50k. In total we only need to run 50k INSERT commands, but we would have to run 100k SELECT queries to check the filter condition for every single Kafka message.
Option 2: Use a hardcoded config file to define those filters.
Pros
Only need to read the filters once, when the consumer starts running
Puts no burden on the DB layer
Cons
This is not scalable, since we are planning to add a lot of filters; every time we would need to change the config file and redeploy the consumer.
My question is: is there a better option to achieve this, i.e. to load the filters without a hardcoded config file and without adding more DB queries?

Your filters could be in another Kafka topic.
Start your app and read that topic to the end, and only then start doing database inserts. Store each consumed filter record in some local structure such as a ConcurrentHashMap, SQLite, RocksDB (provided by Kafka Streams), or DuckDB, which has become popular recently.
When you add a new filter, your consumer would need to temporarily pause its database operations.
If you use Kafka Streams, you could look up data from the incoming topic against your filters "table" state store using the Processor API and drop non-matching records from the stream.
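A minimal sketch of that Kafka Streams idea, assuming the filters topic is keyed by the same attribute you filter the data on; the topic names, serdes, and the output topic feeding your DB writer are all illustrative:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import java.util.Properties;

public class FilteredIngest {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "filtered-ingest");   // illustrative
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        StreamsBuilder builder = new StreamsBuilder();

        // Filters topic: key = the attribute you filter on, value = anything non-null.
        // Backed by a local RocksDB state store and kept current as new filters arrive.
        KTable<String, String> filters =
                builder.table("ingest-filters", Consumed.with(Serdes.String(), Serdes.String()));

        // Data topic, keyed by the same attribute.
        KStream<String, String> messages =
                builder.stream("ingest-messages", Consumed.with(Serdes.String(), Serdes.String()));

        // An inner stream-table join drops every message whose key has no filter entry,
        // so only the "related" messages reach the topic your DB writer consumes.
        messages.join(filters, (message, filter) -> message)
                .to("ingest-accepted");

        new KafkaStreams(builder.build(), props).start();
    }
}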
This way, you separate your database reads and writes once you start inserting 50k+ records, and your app wouldn't be blocked trying to read any "external config".
You could also use ZooKeeper, as this kind of configuration storage is one of its use cases.

Related

idiomatic way to do many dynamic filtered views of a Flink table?

I would like to create a per-user view of data tables stored in Flink, which is constantly updated as changes happen to the source data, so that I can have a constantly updating UI based on a toChangelogStream() of the user's view of the data. To do that, I was thinking that I could create an ad-hoc SQL query like SELECT * FROM foo WHERE userid=X and convert it to a changelog stream, which would have a bunch of inserts at the beginning of the stream to give me the initial state, followed by live updates after that point. I would leave that query running as long as the user is using the UI, and then delete the table when the user's session ends. I think this is effectively how the Flink SQL client must work, so it seems like this is possible.
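As a rough sketch (not a recommendation), the per-user ad-hoc query described above would look something like this; the table foo is assumed to be registered already as a dynamic table, and the user id is illustrative:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class PerUserView {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        long userId = 42L;  // illustrative; 'foo' is assumed to be registered already
        Table userView = tableEnv.sqlQuery("SELECT * FROM foo WHERE userid = " + userId);

        // Emits the current rows as inserts first, then the insert/update/delete changes as the source evolves.
        DataStream<Row> changelog = tableEnv.toChangelogStream(userView);
        changelog.print();

        env.execute("per-user view");  // note: one running job per ad-hoc query
    }
}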
However, I anticipate that there may be some large overheads associated with each ad hoc query if I do it this way. When I write a SQL query, based on the answer in Apache Flink Table 1.4: External SQL execution on Table possible?, it sounds like internally this is going to compile a new JAR file and create new pipeline stages, I assume using more JVM metaspace for each user. I can have tens of thousands of users using the UI at once, so I'm not sure that's really feasible.
What's the idiomatic way to do this? The other ways I'm looking at are:
I could maybe use queryable state since I could group the current rows behind the userid as the key, but as far as I can tell it does not provide a way to get a changelog stream, so I would have to constantly re-query the state on a periodic basis, which is not ideal for my use case (the per-user state can be large sometimes but doesn't change quickly).
Another alternative is to output the table to both a changelog stream sink and an external RDBMS sink, but if I do that, what's the best pattern for how to join those together in the client?

Periodically refreshing static data in Apache Flink?

I have an application that receives much of its input from a stream, but some of its data comes from both an RDBMS and a series of static files.
The stream will continuously emit events, so the Flink job will never end, but how do you periodically refresh the RDBMS data and the static files to capture any updates to those sources?
I am currently using the JDBCInputFormat to read data from the database.
For each of your two sources that might change (RDBMS and files), create a Flink source that uses a broadcast stream to send updates to the Flink operators that are processing the data from Kafka. Broadcast streams send each Object to each task/instance of the receiving operator.
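A minimal sketch of this broadcast approach, assuming string events and "key,value" reference records for brevity; the socket sources stand in for your Kafka source and your periodic RDBMS/file readers:

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

public class BroadcastRefData {
    static final MapStateDescriptor<String, String> REF =
            new MapStateDescriptor<>("reference-data", Types.STRING, Types.STRING);

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<String> events = env.socketTextStream("localhost", 9999);      // stand-in for the Kafka source
        DataStream<String> refUpdates = env.socketTextStream("localhost", 9998);  // stand-in for the periodic RDBMS/file reader

        BroadcastStream<String> broadcastRef = refUpdates.broadcast(REF);

        events.connect(broadcastRef)
              .process(new BroadcastProcessFunction<String, String, String>() {
                  @Override
                  public void processElement(String event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                      // enrich/filter each event against the latest broadcast reference data
                      String key = event.split(",")[0];
                      out.collect(event + " -> " + ctx.getBroadcastState(REF).get(key));
                  }

                  @Override
                  public void processBroadcastElement(String update, Context ctx, Collector<String> out) throws Exception {
                      // every refresh overwrites the entry on every parallel instance
                      String[] kv = update.split(",");
                      ctx.getBroadcastState(REF).put(kv[0], kv[1]);
                  }
              })
              .print();

        env.execute("broadcast reference data");
    }
}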
For each of your sources (files and RDBMS), you can periodically create a snapshot in HDFS or some other storage (for example, every 6 hours) and calculate the difference between two consecutive snapshots. The result is then pushed to Kafka. This solution works when you cannot modify the database or file structure to add extra information (e.g., in the RDBMS, a column named last_update).
Another solution is to add a column named last_update, use it to filter the data that has changed between two queries, and push that data to Kafka.
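A sketch of that second (last_update) approach; the table and column names, the JDBC URL, the output topic, and the poll interval are all assumptions:

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import java.sql.*;
import java.time.Instant;
import java.util.Properties;

public class ReferenceTablePoller {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/refdb", "user", "pass")) {

            Timestamp lastSeen = Timestamp.from(Instant.EPOCH);
            while (true) {
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, payload, last_update FROM reference_data WHERE last_update > ?")) {
                    ps.setTimestamp(1, lastSeen);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // only the changed rows are forwarded to Kafka
                            producer.send(new ProducerRecord<>("reference-updates",
                                    rs.getString("id"), rs.getString("payload")));
                            Timestamp ts = rs.getTimestamp("last_update");
                            if (ts.after(lastSeen)) lastSeen = ts;
                        }
                    }
                }
                Thread.sleep(60_000);  // poll interval; tune to how fresh the data must be
            }
        }
    }
}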

Data consistency across multiple microservices, which duplicate data

I am currently trying to get into microservices architecture, and I came across a data consistency issue. I've read that duplicating data between several microservices is considered a good idea, because it makes each service more independent.
However, I can't figure out what to do in the following case to provide consistency:
I have a Customer service which has a RegisterCustomer method.
When I register a customer, I want to send a message via RabbitMQ so that other services can pick up this information and store it in their own DBs.
My code looks something like this:
...
_dbContext.Add(customer);
CustomerRegistered e = Mapper.Map<CustomerRegistered>(customer);
await _messagePublisher.PublishMessageAsync(e.MessageType, e, "");
//!!app crashes
_dbContext.SaveChanges();
...
So I would like to know: how can I handle the case where the application sends the message but is unable to save the data itself? Of course, I could swap the SaveChanges and PublishMessageAsync calls, but the problem would still be there. Is there something wrong with my data storing approach?
Yes. You are doing dual persistence: persistence in the DB and in a durable queue. If one succeeds and the other fails, you'll always be in trouble. There are a few ways to handle this:
Persist in the DB and then do Change Data Capture (CDC), so that the data from the DB Write-Ahead Log (WAL) is used to create a materialized view in the second service's DB via real-time streaming.
Persist in a durable queue and a cache. Using real-time streaming, persist the data in both services. Read from the cache if the data is available there, otherwise read from the DB. This allows read-after-write. Even if the write to the cache fails in the worst case, the data will be in the DB within seconds through streaming.
NServiceBus supports durable distributed transactions in many scenarios, unlike RabbitMQ. If you can use NServiceBus instead of RabbitMQ, you could look into that feature to ensure that both contexts are saved or rolled back together in case of failure.
I think the solution you're looking for is the outbox pattern: there is an event table in the same database as your business data, which allows the event and the business data to be committed in the same database transaction, and then a background worker loop pushes the events to the MQ.
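A minimal outbox sketch, written in Java/JDBC rather than the asker's C#/EF stack; the table and column names are illustrative and MessagePublisher is a hypothetical stand-in for the RabbitMQ publisher:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class OutboxExample {
    interface MessagePublisher {  // hypothetical stand-in for the RabbitMQ publisher
        void publish(String type, String payload);
    }

    // Write the customer and the event in ONE local transaction: both commit or both roll back.
    static void registerCustomer(Connection conn, String id, String name) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement customer = conn.prepareStatement(
                     "INSERT INTO customers (id, name) VALUES (?, ?)");
             PreparedStatement outbox = conn.prepareStatement(
                     "INSERT INTO outbox (event_type, payload, published) VALUES (?, ?, FALSE)")) {
            customer.setString(1, id);
            customer.setString(2, name);
            customer.executeUpdate();
            outbox.setString(1, "CustomerRegistered");
            outbox.setString(2, "{\"id\":\"" + id + "\",\"name\":\"" + name + "\"}");
            outbox.executeUpdate();
            conn.commit();
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    // Background worker loop: publish unpublished events, then mark them done (at-least-once delivery).
    static void relayOutbox(Connection conn, MessagePublisher publisher) throws Exception {
        try (PreparedStatement select = conn.prepareStatement(
                     "SELECT id, event_type, payload FROM outbox WHERE published = FALSE ORDER BY id");
             PreparedStatement markDone = conn.prepareStatement(
                     "UPDATE outbox SET published = TRUE WHERE id = ?");
             ResultSet rs = select.executeQuery()) {
            while (rs.next()) {
                publisher.publish(rs.getString("event_type"), rs.getString("payload"));
                markDone.setLong(1, rs.getLong("id"));
                markDone.executeUpdate();
            }
        }
    }
}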

Transform a specific group of records from MongoDB

I've got a periodically triggered batch job which writes data into a MongoDB. The job takes about 10 minutes, and after that I would like to pick up this data and do some transformations with Apache Flink (mapping, filtering, cleaning...). There are some dependencies between the records, which means I have to process them together. For example, I'd like to transform all records from the latest batch job where the customer id is 45666. The result would be one aggregated record.
Are there any best practices or ways to do that without implementing everything by myself (get distinct customer ids from the latest job, select and transform the records for each customer, flag the transformed customers, etc.)?
I'm not able to stream it because I have to transform multiple records together and not one by one.
Currently I'm using Spring Batch, MongoDB, Kafka and thinking about Apache Flink.
Conceivably you could connect the MongoDB change stream to Flink and use that as the basis for the task you describe. The fact that 10-35 GB of data is involved doesn't rule out using Flink streaming, as you can configure Flink to spill to disk if its state can't fit on the heap.
I would want to understand the situation better before concluding that this is a sensible approach, however.

Message Queue or Database insert and select

I am designing an application and I have two ideas in mind (below). I have a process that collects about 30 KB of data every 5 minutes, and this data needs to be pushed to clients (web side; 100 users at any given time). The information collected does not need to be stored for future use.
Options:
Get the data and insert it into the database every 5 minutes. The client then calls the DB, retrieves the data, and updates the UI.
Collect the data and put it on a topic or queue. Multiple clients (consumers) can then read from the queue and obtain the data.
I see option 2 as the better solution because it is faster (no DB calls) and avoids redundant storage.
Can anyone suggest which would be the ideal solution and why?
I don't really understand the difference. The data has to be temporarily stored somewhere until the next update, right?
But all users can see it, not just the first person to get there, right? So a queue is not really an appropriate data structure from my interpretation of your system.
Whether the data is written to something persistent like a database or something less persistent like part of the web server or application server may be relevant here.
Also, you have tagged this as real-time, but I don't see how the web clients are getting updates in real time without some kind of push/long-poll or similar.
Seems to me that you need to use a queue and the publisher/subscriber pattern.
This is an article about RabbitMQ and the publish/subscribe pattern.
Get the data and insert it into the database every 5 minutes. The client then calls the DB, retrieves the data, and updates the UI.
You can program your application to be event-oriented. For instance, raise domain events and publish the messages to your subscribers.
When you use a queue, each subscriber dequeues the messages addressed to it, in order (FIFO). In addition, there is a delivery guarantee, unlike with a database, where a record can be deleted before every 'subscriber' has seen the message.
The pitfalls of using the database to accomplish this are:
Creation of indexes makes querying faster, but inserts slower;
You will have to manage the delivery guarantee for every subscriber;
You'll need a TTL (time-to-live) strategy to purge old records (while respecting the delivery guarantee).
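A minimal fanout publish/subscribe sketch with the RabbitMQ Java client, along the lines suggested above; the exchange name and connection details are assumptions:

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;
import java.nio.charset.StandardCharsets;

public class SnapshotPubSub {
    private static final String EXCHANGE = "snapshots";  // hypothetical exchange name

    // Producer: called every 5 minutes with the freshly collected ~30 KB payload.
    public static void publish(byte[] snapshot) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        try (Connection conn = factory.newConnection(); Channel channel = conn.createChannel()) {
            channel.exchangeDeclare(EXCHANGE, "fanout");
            channel.basicPublish(EXCHANGE, "", null, snapshot);  // every bound queue gets a copy
        }
    }

    // Consumer: each subscriber (e.g. each web-server instance) gets its own auto-delete queue.
    public static void subscribe() throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost("localhost");
        Connection conn = factory.newConnection();
        Channel channel = conn.createChannel();
        channel.exchangeDeclare(EXCHANGE, "fanout");
        String queue = channel.queueDeclare().getQueue();  // exclusive, auto-delete queue
        channel.queueBind(queue, EXCHANGE, "");

        DeliverCallback onMessage = (tag, delivery) ->
                System.out.println("update: " + new String(delivery.getBody(), StandardCharsets.UTF_8));
        channel.basicConsume(queue, true, onMessage, tag -> { });
    }
}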
